Abstract: Quantitative research on income distribution depends on income information, which is often missing for a variety of reasons. Analyzing inequality and poverty with samples that contain missing income information can produce biased results, so handling missing income data in social surveys in a principled way is important. This paper summarizes the causes of missing income information, systematically reviews the distributions commonly used in income research, proposes a logic for filling in missing income information according to the data available, and studies the filling methods in detail: weighting adjustment, single imputation, multiple imputation, computer-simulated data generation, and grouped-data refinement. The methods are then compared on actual data. The results show that single imputation performs poorly; multiple imputation, weighting adjustment, and computer-simulated data generation can each correct, to some extent, the missing high incomes in survey data; grouped data understate individual differences within groups and therefore understate inequality; and the grouped-data refinement method proposed in this paper yields more accurate individual-level data. Future research should advance the joint analysis of multi-source data and explore frameworks that combine statistical inference with machine learning.
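The claim that grouped data understate inequality can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes a lognormal income distribution with arbitrary parameters, equal-sized quantile groups, and hypothetical helper names (`gini`, `grouped_gini`). Each person is assigned the mean income of their group, which removes all within-group variation, and the resulting Gini coefficient is compared with the one computed from individual data.

```python
import random


def gini(values):
    """Gini coefficient of positive values, using the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def grouped_gini(values, n_groups):
    """Gini when every person is assigned the mean income of their quantile group."""
    xs = sorted(values)
    n = len(xs)
    flattened = []
    for g in range(n_groups):
        lo, hi = g * n // n_groups, (g + 1) * n // n_groups
        group = xs[lo:hi]
        mean = sum(group) / len(group)
        flattened.extend([mean] * len(group))  # within-group differences vanish
    return gini(flattened)


# Assumption: simulated lognormal incomes; the parameters are illustrative only.
rng = random.Random(0)
incomes = [rng.lognormvariate(10.0, 0.8) for _ in range(10000)]

full = gini(incomes)
by_groups = {k: grouped_gini(incomes, k) for k in (1, 5, 7, 10)}

print("individual data:", round(full, 3))
for k, value in by_groups.items():
    print(f"{k:>2} groups:", round(value, 3))
```

With a single group the Gini coefficient collapses to zero, and it rises toward the individual-data value as the grouping becomes finer, which is the gap a grouped-data refinement method is meant to close.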
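(1) The Fisk distribution is another name for the Pareto Type III distribution; it is named after the economist Philip Fisk and is common in economics.
(1) The GB2 family has flexible parameter settings, can closely fit the heavy tails and multimodal shapes of income data, and accommodates classical distributions such as the Pareto and the lognormal, which makes it the preferred choice for income modeling.
(1) The CFPS survey covers Shanghai, Liaoning, Henan, Gansu, Guangdong, Jiangsu, Zhejiang, Fujian, Jiangxi, Anhui, Shandong, Hebei, Shanxi, Jilin, Heilongjiang, Guangxi, Hubei, Hunan, Sichuan, Guizhou, Yunnan, Tianjin, Beijing, Chongqing, and Shaanxi. The population of the sampled areas accounts for 94.5% of China's total population, excluding Hong Kong, Macao, and Taiwan.
(1) In the computer simulation, 1, 5, 7, and 10 groups are used to represent different levels of detail in income data; 5 and 7 are the numbers of groups usually published by the National Bureau of Statistics.
(2) For reasons of space, the Gini coefficients computed after filling in income in the CFPS2020 data with the different methods are reported in Appendix Table 1; see the attachments listed on the Statistical Research website. The same applies below.
(3) For reasons of space, the results are reported in Appendix Table 2.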
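For reference, the distributions named in the first two notes can be written out in one common parametrization. The symbols a, b, p, q, μ, σ, γ are chosen here for illustration and are not taken from the paper, which may use different notation.

```latex
% GB2 density with shape parameters a, p, q > 0 and scale b > 0;
% B(p, q) is the beta function.
\[
  f_{\mathrm{GB2}}(x; a, b, p, q)
    = \frac{a\, x^{ap-1}}{b^{ap}\, B(p, q)\, \bigl[1 + (x/b)^{a}\bigr]^{p+q}},
  \qquad x > 0 .
\]

% Fisk (log-logistic) distribution function: the GB2 special case p = q = 1.
\[
  F_{\mathrm{Fisk}}(x; a, b) = 1 - \bigl[1 + (x/b)^{a}\bigr]^{-1},
  \qquad x > 0 .
\]

% Pareto Type III survival function with location mu, scale sigma, shape gamma.
\[
  \bar{F}_{\mathrm{P(III)}}(x; \mu, \sigma, \gamma)
    = \Bigl[1 + \Bigl(\frac{x - \mu}{\sigma}\Bigr)^{1/\gamma}\Bigr]^{-1},
  \qquad x > \mu .
\]
```

Setting μ = 0, σ = b, and γ = 1/a in the Pareto Type III survival function gives 1 − F_Fisk, which is the sense in which the two names coincide. Within the GB2 family, p = 1 gives the Singh-Maddala distribution and q = 1 the Dagum distribution; the upper tail decays like a power law with index aq, and the lognormal arises only as a limiting case.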
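Basic information:
DOI: 10.19343/j.cnki.11-1302/c.2025.07.012
CLC number: F124.7
Citation:
[1] Gao Yanyun, Duan Nan. Research on the Logic and Methods of Filling in Missing Income Information in Social Surveys[J]. Statistical Research, 2025, 42(07): 147-160. DOI: 10.19343/j.cnki.11-1302/c.2025.07.012.
Funding:
Humanities and Social Sciences Research Planning Fund Project of the Ministry of Education, "Inference Methods and Applications for Missing Information on High-Income Groups in Social Surveys" (21YJA910003); Key Project of the National Social Science Fund of China, "Measurement and Evaluation of Domestic and International Dual Circulation" (22AZD140)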