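2025, No. 02, Vol. 42, pp. 122-134
Integrative Analysis of Deep Neural Networks for Multi-Source Heterogeneous Data and Its Application
Foundation: National Natural Science Foundation of China General Program, "High-Dimensional Integrative Analysis Classification Models with Multi-Source Data Fusion and Their Application to Credit Risk" (72271088); Ministry of Education Humanities and Social Sciences Fund Planning Project, "Functional Classification and Prediction Methods for Ultra-High-Frequency Financial Data and Their Applications" (22YJAZH099); Hunan Provincial Postgraduate Scientific Research Innovation Project, "Deep Neural Networks for Multi-Source Data and Their Applications" (CX20230418); National Social Science Fund of China Post-Funded Key Project, "Statistical Methods for Functional Data Mining in Financial Markets and Their Applications" (24FTJA001)
Email: gangjw1997@hnu.edu.cn
DOI: 10.19343/j.cnki.11-1302/c.2025.02.010
Release date: 2025-03-04
Publication date: 2025-03-04
Online release date: 2025-03-04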
Abstract:

With the development of computer technology, industries of all kinds have accumulated and stored large volumes of data. These data often come from different sources and are high-dimensional, and modeling multi-source data with these characteristics is an active topic in statistics. For multi-source heterogeneous data, this study proposes the Integrative Analysis Deep Neural Network (IADNN), which uses an L1-CMCP penalty to identify important features and to handle data heterogeneity. The outer MCP layer identifies features that are significant for the multi-source datasets as a whole, the middle MCP layer identifies heterogeneity of features at the dataset level, and the inner Lasso layer identifies heterogeneity among DNN nodes. This nested design is intended to promote information sharing across datasets. For estimation, the L1-CMCP penalty is handled by local linear approximation, and the model is then fitted with a proximal gradient descent algorithm. Simulation studies show that IADNN performs well in both feature selection and classification prediction. When the multi-source data are partially heterogeneous, IADNN outperforms both modeling each dataset independently and modeling the merged data on evaluation metrics such as the F1 score and FPR. When the data are completely heterogeneous or completely homogeneous, IADNN performs close to the theoretically best model. Finally, IADNN is applied to credit default data from regions with different levels of economic development, where it proves effective in selecting risk indicators and predicting default.
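The nested penalty and its estimation scheme can be illustrated with a small sketch. The abstract does not give the exact L1-CMCP formula, so the form used here is an assumption modeled on composite MCP: an outer MCP over datasets, a middle MCP per dataset, and an inner L1 norm over the hidden-node weights of each feature. The layout `W[m][j][k]` (dataset m, feature j, node k), the parameter names `lam`, `a`, `b`, and all function names are hypothetical. The local linear approximation replaces the penalty by a weighted L1 term whose weights follow from the chain rule, and each proximal gradient update then reduces to weighted soft-thresholding.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty (MCP) at t >= 0, with gamma > 0."""
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def mcp_grad(t, lam, gamma):
    """Derivative of the MCP at t >= 0; it is zero once t >= gamma * lam."""
    return max(lam - t / gamma, 0.0)


def soft_threshold(x, tau):
    """Proximal operator of tau * |x|."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def nested_penalty(W, lam, a, b):
    """Assumed nested form: sum_j MCP_a( sum_m MCP_b( sum_k |W[m][j][k]| ) )."""
    total = 0.0
    for j in range(len(W[0])):
        middle = sum(mcp(sum(abs(w) for w in W[m][j]), lam, b)
                     for m in range(len(W)))
        total += mcp(middle, lam, a)
    return total


def lla_weights(W, lam, a, b):
    """Chain-rule LLA weight for each (dataset m, feature j) block."""
    weights = [[0.0] * len(W[0]) for _ in W]
    for j in range(len(W[0])):
        l1 = [sum(abs(w) for w in W[m][j]) for m in range(len(W))]
        middle = sum(mcp(t, lam, b) for t in l1)
        outer_slope = mcp_grad(middle, lam, a)
        for m in range(len(W)):
            weights[m][j] = outer_slope * mcp_grad(l1[m], lam, b)
    return weights


def prox_step(W, grad, step, lam, a, b):
    """One proximal gradient update: gradient step, then weighted soft-thresholding."""
    weights = lla_weights(W, lam, a, b)
    return [[[soft_threshold(w - step * g, step * weights[m][j])
              for w, g in zip(W[m][j], grad[m][j])]
             for j in range(len(W[m]))]
            for m in range(len(W))]
```

In this sketch, `grad` would be the gradient of the classification loss with respect to the first-layer weights, supplied by whatever network code is in use. Blocks whose L1 norm exceeds `b * lam` receive zero weight and are not shrunk, which is the property that motivates MCP over a plain Lasso penalty.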

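(1) Owing to space limitations, the simulation results are presented in Supplementary Tables 1-4; see the attachments listed on the Statistical Research (《统计研究》) website. The same applies below.

(2) Owing to space limitations, the mean F1 values under different overlap scenarios are shown in Supplementary Figure 1.

(1) Owing to space limitations, the division of the three major regions of the United States is shown in Supplementary Table 5.

(1) Owing to space limitations, the top 30 high-frequency features are shown in Supplementary Table 6.

(2) Owing to space limitations, the numbers of overlapping important features across regions under different methods are shown in Supplementary Table 7.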

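Basic information:

CLC number: F831.2; TP183; TP311.13

Citation:

[1] Wang Xiaoyan, Gang Jianwei, Wang Jiedan, et al. Integrative Analysis of Deep Neural Networks for Multi-Source Heterogeneous Data and Its Application[J]. Statistical Research, 2025, 42(02): 122-134. DOI: 10.19343/j.cnki.11-1302/c.2025.02.010.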
