2016, Issue 08, Vol. 33 (No. 299 overall), pp. 101-105
A Robust Improvement of the Leverage Importance Sampling Method for Big Data
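Funding (Foundation): University of International Business and Economics discipline construction special fund, "A Robust Improvement of the Leverage Importance Sampling Method for Big Data" (XK2016107); Fundamental Research Funds for the Central Universities at the University of International Business and Economics, "Sparse Classification Algorithms for High-Dimensional Compositional Data and Their Applications" (15QD15); research project of the Collaborative Innovation Center for Chinese Enterprises "Going Global" at the University of International Business and Economics, "Big Data and Investment Site Selection" (201504YY006A)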
DOI: 10.19343/j.cnki.11-1302/c.2016.08.013
Release date: 2016-08-15
Publication date: 2016-08-15
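Abstract (translated from the Chinese):

With its enormous sample sizes or ultra-high variable dimensionality, big data makes direct computation infeasible, so how to draw a suitable computational sample efficiently is a question worth considering. Drawing on the idea of Leverage importance sampling, this paper proposes two robust improved sampling algorithms. They not only efficiently draw highly representative computational samples for regression estimation, but also avoid the inaccurate covariance matrix estimates caused by large variance and heterogeneity. Analysis of simulated data shows that, compared with the method of Ma (2015), the proposed methods yield better estimation results.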

Abstract:

Big data, owing to its massive sample size or ultra-high dimensionality, makes classical computation infeasible, so obtaining an effective sample becomes crucial. This paper introduces two robust modified sampling methods based on the idea of Leverage importance sampling. The proposed approaches sample efficiently and significantly improve the estimation of the covariance matrix. Simulation results indicate that the proposed methods perform better than that of Ma (2015).
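To make the underlying idea concrete, the following is a minimal sketch of basic leverage-based subsampling as it is commonly described in the algorithmic leveraging literature (sample rows with probability proportional to their leverage, then fit a least-squares model weighted by the inverse sampling probabilities). It is not the robust modification proposed in this paper, whose details are not given on this page. To stay dependency-free it assumes a simple linear regression y = a + b*x, for which the leverage has a closed form; the function names leverage_scores, weighted_line_fit and leverage_subsample_fit are hypothetical and chosen only for illustration.

```python
import random


def leverage_scores(x):
    """Leverage h_i of each observation in the simple regression y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Closed form of the hat-matrix diagonal for an intercept-plus-slope model.
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


def leverage_subsample_fit(x, y, r, seed=0):
    """Draw r rows with probability proportional to leverage, then refit."""
    h = leverage_scores(x)
    total = sum(h)  # equals 2, the number of fitted coefficients
    probs = [hi / total for hi in h]
    rng = random.Random(seed)
    # Sampling is with replacement, as in standard importance sampling.
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    # Inverse-probability weights correct for the non-uniform sampling.
    ws = [1.0 / probs[i] for i in idx]
    return weighted_line_fit(xs, ys, ws)


# Illustrative run on synthetic data (assumed model: y = 1 + 2x + noise).
noise = random.Random(42)
x_demo = [i / 100.0 for i in range(10000)]
y_demo = [1.0 + 2.0 * xi + noise.gauss(0.0, 1.0) for xi in x_demo]
a_hat, b_hat = leverage_subsample_fit(x_demo, y_demo, r=500, seed=1)
print("subsample estimate of (a, b):", a_hat, b_hat)
```

Only 500 of the 10,000 rows enter the final fit here, which is the computational saving that motivates the approach. Because high-leverage rows are sampled preferentially, outlying observations can dominate the subsample, which is the kind of sensitivity a robust variant would need to address.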

References

[1]Doctorow C.Big data:welcome to the petacentre[J].Nature News,2008,455(7209):16-21.

[2]Jonathan T O,Gerald A M.Special online collection:dealing with data[J].Science,2011,331(6018):639-806.

[3]Ma P,Mahoney M W,Yu B.A statistical perspective on algorithmic leveraging[J].Journal of Machine Learning Research,2015,16:861-911.

[4]Drineas P,Mahoney M W,Muthukrishnan S.Sampling algorithms for L2 regression and applications[C].Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2006:1127-1136.

[5]Drineas P,Mahoney M W,Muthukrishnan S,et al.Faster least squares approximation[J].Numerische Mathematik,2011,117(2):219-249.

[6]Drineas P,Magdon-Ismail M,Mahoney M W,et al.Fast approximation of matrix coherence and statistical leverage[J].The Journal of Machine Learning Research,2012,13(1):3475-3506.

[7]Everitt B S,Skrondal A.The Cambridge dictionary of statistics[M].Cambridge:Cambridge University Press,2002.

[8]Fan J,Han F,Liu H.Challenges of big data analysis[J].National Science Review,2014,1(2):293-314.

[9]Giloni A,Simonoff J S,Sengupta B.Robust weighted LAD regression[J].Computational Statistics & Data Analysis,2006,50(11):3124-3140.

[10]Zhao W,Ma H,He Q.Parallel k-means clustering based on MapReduce[C].Cloud Computing.Springer Berlin Heidelberg,2009:674-679.

[11]Fanaee-T H,Gama J.Event labeling combining ensemble detectors and background knowledge[J].Progress in Artificial Intelligence,2014,2(2):113-127.

[12]Mahoney M W,Drineas P.CUR matrix decompositions for improved data analysis[J].Proceedings of the National Academy of Sciences,2009,106(3):697-702.

(1) Owing to space limitations, the results are not listed here; readers who need them may request them from the authors.

Basic information:

Chinese Library Classification (CLC) number: F224

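Citation:

[1]Qin Lei (秦磊),Xiong Wei (熊巍),Tian Maozai (田茂再).A robust improvement of the Leverage importance sampling method for big data[J].Statistical Research (统计研究),2016,33(08):101-105.DOI:10.19343/j.cnki.11-1302/c.2016.08.013.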
