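2020, No. 08, Vol. 37 (Serial No. 347), pp. 91-103
A Functional Clustering Algorithm Based on Non-negative Matrix Factorization
Foundation: National Social Science Fund of China, Western Region Project "Research on Repair Methods and Applications for Large-Scale Sparse Functional Data" (19XTJ002)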
DOI: 10.19343/j.cnki.11-1302/c.2020.08.007
Release date: 2020-08-24
Publication date: 2020-08-24
Online publication date: 2020-08-24
Abstract:

Functional clustering algorithms involve two basic elements: projection and clustering. In general, the optimal projection does not necessarily preserve class information, which can degrade the subsequent clustering. This paper therefore reviews the components and operating process of functional clustering and, drawing on the clustering properties of non-negative matrix factorization, proposes a functional clustering algorithm based on non-negative matrix factorization. A one-step framework that performs projection and clustering simultaneously is constructed, the solution is updated by alternating iterations, and the computational time complexity of the algorithm is analyzed. Tests on randomly simulated data and on speech recognition data show that the proposed algorithm helps improve clustering performance. An application to hourly nitrogen dioxide (NO2) concentration data for Beijing shows that the algorithm, in distinguishing the types of air quality monitoring stations, fully identifies the spatial pattern of the station layout and has good practical value.
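
To make the idea of clustering through non-negative matrix factorization with alternating updates concrete, the following is a minimal sketch only. It is not the algorithm proposed in the paper: it uses the generic multiplicative updates of Lee and Seung on a non-negative data matrix and omits the basis-expansion (projection) step entirely. The function name `nmf_cluster`, its parameters, and the argmax assignment rule are assumptions made for illustration.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=500, seed=0, eps=1e-9):
    """Illustrative NMF clustering (assumed helper, not the paper's method).

    X is a non-negative matrix of shape (n_features, n_samples).
    It is factored as X ~ W @ H with W >= 0 and H >= 0, and each sample
    (column of X) is assigned to the component with the largest weight in H.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Alternating iteration: update H with W fixed, then W with H fixed.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Normalize the columns of W so the rows of H are on a comparable scale.
    norms = np.linalg.norm(W, axis=0) + eps
    W = W / norms
    H = H * norms[:, None]
    labels = H.argmax(axis=0)
    return W, H, labels
```

Calling `nmf_cluster(X, k=2)` on a matrix whose columns fall into two groups with different non-negative profiles returns one label per column. Because the factorization depends on the random initialization, different seeds may permute the label values.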

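(1) In the functional clustering literature, dimension reduction is mainly carried out with functional principal component analysis.

(2) TIMIT is an acoustic-phonetic continuous speech corpus built jointly by Texas Instruments, the Massachusetts Institute of Technology, and SRI International. Data source: http://statweb.stanford.edu/~tibs/ElemStatLearn.1stEd/datasets/phoneme.data

(1) Other functional data analysis methods follow a similar analytical logic.

(2) This paper mainly considers dense functional data.

(3) For a set of observed curves x(t) with non-zero mean, centering via $z(t) = x(t) - \bar{x}(t)$, where $\bar{x}(t)$ denotes the mean curve, makes the zero-mean assumption hold.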
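
The centering step in note (3) can be sketched as follows, assuming the curves are observed on a common grid and stored row by row in an array. The function name `center_curves` and the array layout are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def center_curves(X):
    """Subtract the pointwise mean curve from every curve.

    X has shape (n_curves, n_grid_points); row i holds x_i(t) on the grid.
    Returns an array of the same shape holding the centered curves z_i(t).
    """
    X = np.asarray(X, dtype=float)
    mean_curve = X.mean(axis=0, keepdims=True)  # estimate of the mean at each t
    return X - mean_curve
```

After this step the average of the centered curves is zero at every grid point.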

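(1) The text of the "SA1" speech data is "She had your dark suit in greasy wash water all year".

(2) The clustering results of the TA, FCOF, and FFKM algorithms are obtained by minimizing Eq. (6), Eq. (7), and Eq. (8), respectively. "Essentially identical parameter settings" means that TA and FCOF only require the basis and the number of basis functions, which are set as in FNMF; FFKM uses the same basis and number of basis functions as FNMF and additionally requires the number of columns of the projection matrix P, which is set to 5.

(1) According to the classification of the Beijing Municipal Environmental Monitoring Center (http://www.bjmemc.com.cn/). For convenience of description, the classes are denoted Category 1, Category 2, Category 3, and Category 4.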

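Basic information:

CLC number: TP301.6

Citation:

[1] Gao Haiyan, Huang Hengjun, Wang Yuchen. A functional clustering algorithm based on non-negative matrix factorization [J]. Statistical Research (统计研究), 2020, 37(08): 91-103. DOI: 10.19343/j.cnki.11-1302/c.2020.08.007.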
