Statistical Research (统计研究)

2017, Issue 01, Vol. 34 (No. 304), pp. 5-11

Several Issues in the Development of Statistics in the Big Data Era (大数据时代统计学发展的若干问题)

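Foundation: Supported by the National Social Science Fund of China project "High-Dimensional Variable Selection Methods for Big Data and Their Applications" (Grant No. 13CTJ001) and the National Natural Science Foundation of China General Program project "Group Variable Selection in Generalized Linear Models and Its Application to Credit Scoring" (Grant No. 71471152).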

DOI: 10.19343/j.cnki.11-1302/c.2017.01.001

Online release date: 2017-01-15

Publication date: 2017-01-15

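Abstract (translated from the Chinese):

In recent years, advances in computing and the internet have brought the amount of information available to humanity to an unprecedented level. Information of all kinds is now stored and circulated, and humanity has entered the big data era. Big data is characterized by volume, variety, and velocity, which brings new opportunities as well as new challenges for the development of statistics. This article reviews the history of statistics and examines the characteristics of its development. On that basis, it discusses how statistics should be positioned in the context of big data, then analyzes the relationship between statistics and computing, and finally examines several misconceptions in big data research.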

Keywords (translated from the Chinese): big data; computers; causality; sampling; data quality

Abstract (published English version):

In the past decades, the development of computer science and internet techniques has enabled researchers to collect, store, and analyze data at an unparalleled speed, with which we have entered the era of big data. Big data have unique characteristics (volume, variety, velocity, and veracity), which bring opportunities as well as challenges to statistics and statisticians. In this article, we examine the history of statistical methodological development and analyze the characteristics of statistical development, based on which we propose the positioning of statistics in the big data era and discuss the interconnections and interactions between statistics and computer science/internet technologies. At the end, we clarify a few misunderstandings in big data analysis.

Keywords: Big Data; Computer Science; Causality; Sampling; Data Quality

Basic information:

Chinese Library Classification (CLC) number: TP311.13; C8

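Citation:

"Statistical Methods in Big Data" Research Group ("大数据中的统计方法"课题组), Ma Shuangge (马双鸽). Several Issues in the Development of Statistics in the Big Data Era [J]. Statistical Research (统计研究), 2017, 34(01): 5-11. DOI: 10.19343/j.cnki.11-1302/c.2017.01.001.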
