2025, No. 06, Vol. 42, pp. 149-160
Subgroup Selection for Gaussian Mixture Models with Mediation Effects
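Foundation: General Program of the National Natural Science Foundation of China, "Robust Subsampling for Several Classes of Statistical Models in the Big Data Setting" (12271294); General Program of the National Natural Science Foundation of China, "Model Identification, Effect Estimation and Big Data Statistical Learning in Regression Discontinuity" (12071248); General Program of the Shandong Provincial Natural Science Foundation, "Robust Statistical Inference for Ultrahigh-Dimensional Asymmetric Data" (ZR2024MA089); General Program of the Shandong Provincial Natural Science Foundation, "Model Exploration and Theoretical Research in Causal Mediation Analysis" (ZR2024MA058)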
Email: mqwang@vip.126.com
DOI: 10.19343/j.cnki.11-1302/c.2025.06.011
Abstract:

As causal mediation analysis is applied ever more widely, selecting the number of subgroups in a Gaussian mixture model with mediation effects has become an active research topic. This paper proposes a new penalized likelihood method for estimating such models. The mixing probabilities are penalized with the Lasso and SCAD penalties, respectively, to construct a penalized log-likelihood function, and an improved EM algorithm that iterates in two steps is proposed to maximize it. A Bayesian information criterion (BIC) function is used to select the optimal tuning parameter λ; the number of subgroups and the parameter estimates corresponding to that λ are taken as the final estimates. We establish the asymptotic properties of the estimator of the number of subgroups. Simulation studies show that both the Lasso and SCAD methods can accurately select the number of subgroups while estimating the parameters. We also compare the two methods with the singular Bayesian information criterion (SBIC) method: among the three, SCAD selects the correct number of subgroups most often and gives the best parameter estimates. Finally, the proposed methods are applied to the DNA methylation data set from the epigenetics study of the Normative Aging Study (NAS). The SCAD results are the most reasonable of the three methods and indicate that the mediation effects of cytosine-phosphate-guanine (CpG) sites are heterogeneous.
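As a rough illustration of the two penalties named in the abstract, the sketch below writes out the Lasso penalty and the SCAD penalty of Fan and Li [15] as functions of a mixing probability, together with a generic BIC score for comparing fits across values of the tuning parameter. The function names, the default a = 3.7, and the BIC sign convention are assumptions made for illustration; the paper's exact penalized log-likelihood and BIC function are not given on this page and may differ.

```python
import numpy as np


def lasso_penalty(pi, lam):
    """Lasso (L1) penalty lam * |pi|, applied elementwise."""
    return lam * np.abs(np.asarray(pi, dtype=float))


def scad_penalty(pi, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001), applied elementwise.

    Linear near zero, quadratic in the middle, constant beyond a * lam,
    so large mixing probabilities are not shrunk further.
    a = 3.7 is the conventional default, assumed here.
    """
    t = np.abs(np.asarray(pi, dtype=float))
    linear = lam * t
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    const = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))


def bic(loglik, n_free_params, n_obs):
    """Generic BIC (smaller is better); an assumed convention, not the paper's."""
    return -2.0 * loglik + n_free_params * np.log(n_obs)


# Hypothetical mixing probabilities for four candidate subgroups.
pi_hat = np.array([0.02, 0.08, 0.30, 0.60])
print(lasso_penalty(pi_hat, lam=0.1))
print(scad_penalty(pi_hat, lam=0.1))
```

In this form the Lasso penalty keeps growing with the mixing probability, whereas SCAD levels off, so only small mixing probabilities are pushed toward zero.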

References

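[1] Lu Jin, Zhao Yanan, Gao Yanyun. Heterogeneity in the Impact of Provincial Population Structure on Environmental Pollution in China: Based on a Finite Mixture Model[J]. Statistical Research, 2022, 39(11):88–101. (in Chinese)

[2] Wen Zhonglin, Liu Hongyun, Hou Jieqin. Analyses of Moderating and Mediating Effects[M]. Beijing: Educational Science Publishing House, 2012. (in Chinese)

[3] Zhao Ming, Wang Xiaojun. A Study of Mortality Risk Heterogeneity and Mixture Models for China's Population[J]. Statistical Research, 2023, 40(3):139–150. (in Chinese)

[4] Zhu Yingqiu, Huang Danyang, Zhang Bo. A Distribution Factor Clustering Method Based on Gaussian Mixture Models[J]. Statistical Research, 2024, 41(6):147–160. (in Chinese)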

[5] Ai C, Huang L, Zhang Z. A Simple and Efficient Estimation of Average Treatment Effects in Models with Unmeasured Confounders[J]. Statistica Sinica, 2022, 32(2):1007–1026.

[6] Barfield R, Shen J, Just A C, et al. Testing for the Indirect Effect under the Null for Genome-Wide Mediation Analyses[J]. Genetic Epidemiology, 2017, 41(8):824–833.

[7] Baron R M, Kenny D A. The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations[J]. Journal of Personality and Social Psychology, 1986, 51(6):1173–1182.

[8] Bibikova M, Barnes B, Tsan C, et al. High Density DNA Methylation Array with Single CpG Site Resolution[J]. Genomics, 2011, 98(4):288–295.

[9] Bunea F, Tsybakov A B, Wegkamp M, et al. SPADES and Mixture Models[J]. The Annals of Statistics, 2010, 38(4):2525–2558.

[10] Chen J, Kalbfleisch J D. Penalized Minimum-Distance Estimates in Finite Mixture Models[J]. The Canadian Journal of Statistics, 1996, 24(2):167–175.

[11] Chen J, Khalili A. Order Selection in Finite Mixture Models with a Nonsmooth Penalty[J]. Journal of the American Statistical Association, 2008, 104(485):187–196.

[12] Chen X, Liu Y, Ma S, et al. Causal Inference of General Treatment Effects Using Neural Networks with a Diverging Number of Confounders[J]. Journal of Econometrics, 2024, 238(1):105555.

[13] Dacunha-Castelle D, Gassiat E. Testing in Locally Conic Models and Application to Mixture Models[J]. ESAIM: Probability and Statistics, 1997, 1:285–317.

[14] Du P, Zhang X, Huang C C, et al. Comparison of Beta-Value and M-Value Methods for Quantifying Methylation Levels by Microarray Analysis[J]. BMC Bioinformatics, 2010, 11:587.

[15] Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties[J]. Journal of the American Statistical Association, 2001, 96(456):1348–1360.

[16] Fasanelli F, Baglietto L, Ponzi E, et al. Hypomethylation of Smoking-Related Genes Is Associated with Future Lung Cancer in Four Prospective Cohorts[J]. Nature Communications, 2015, 6:10192.

[17] Gelman A, Imbens G W. Why Ask Why? Forward Causal Inference and Reverse Causal Questions[R]. NBER Working Paper, 2013.

[18] Huang L, Huang W, Linton O, et al. Nonparametric Estimation of Mediation Effects with a General Treatment[J]. Econometric Reviews, 2024, 43(2–4):215–237.

[19] Huang T, Peng H, Zhang K. Model Selection for Gaussian Mixture Models[J]. Statistica Sinica, 2017, 27(1):147–169.

[20] Huang W, Zhang Z. Nonparametric Estimation of the Continuous Treatment Effect with Measurement Error[J]. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2023, 85(2):474–496.

[21] Imai K, Keele L, Yamamoto T. Identification, Inference, and Sensitivity Analysis for Causal Mediation Effects[J]. Statistical Science, 2010, 25(1):51–71.

[22] Imai K, Tingley D, Yamamoto T. Experimental Designs for Identifying Causal Mechanisms[J]. Journal of the Royal Statistical Society Series A: Statistics in Society, 2013, 176(1):5–51.

[23] James L F, Marchette D J, Priebe C E. Consistent Estimation of Mixture Complexity[J]. The Annals of Statistics, 2001, 29(5):1281–1296.

[24] Kim C, Daniels M, Li Y, et al. A Bayesian Semiparametric Latent Variable Approach to Causal Mediation[J]. Statistics in Medicine, 2018, 37(7):1149–1161.

[25] Leroux B. Consistent Estimation of a Mixing Distribution[J]. The Annals of Statistics, 1992, 20(3):1350–1360.

[26] Pearl J. Direct and Indirect Effects[A]. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence[C]. San Francisco: Morgan Kaufmann, 2001:411–420.

[27] Rubin D B. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies[J]. Journal of Educational Psychology, 1974, 66(5):688–701.

[28] Wang W W, Xu J F, Schwartz J, et al. Causal Mediation Analysis with Latent Subgroups[J]. Statistics in Medicine, 2021, 40(25):5628–5641.

[29] Woo M J, Sriram T N. Robust Estimation of Mixture Complexity[J]. Journal of the American Statistical Association, 2006, 101(476):1475–1486.

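(1) There is an interaction between the exposure variable and the mediator.

(1) Owing to space limitations, the causal directed acyclic graph (DAG) of the heterogeneous mediation effect model with two subgroups is shown in Supplementary Figure 1; see the attachments listed on the Statistical Research website. The same applies below.

(1) Owing to space limitations, the proofs of Theorems 1 to 3 are given in Appendix 1.

(1) Owing to space limitations, the proportions of correct subgroup-number selection by the Lasso, SCAD and SBIC methods when K=4 in the three cases are shown in Supplementary Figures 2 to 4.

(1) Owing to space limitations, the proportions of correct subgroup-number selection by the Lasso and SCAD methods in Case I and Case II are shown in Supplementary Tables 1 and 2, respectively.

(2) Owing to space limitations, the mean biases and standard errors of the parameter estimates for Case I, Case II and Case III when K=4 are shown in Supplementary Tables 3 to 5, respectively.

(3) Owing to space limitations, the mean biases and standard errors of the parameter estimates for Case I and Case II when K=6 and K=8 are shown in Supplementary Tables 6 and 7, respectively; those for Case III when K=6 and K=8 are shown in Supplementary Tables 8 and 9, respectively.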

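(4) The NAS is an ongoing prospective cohort study established in eastern Massachusetts in 1963 by the U.S. Department of Veterans Affairs (VA). Participants had no known chronic disease at enrollment. Every 3 to 5 years they undergo a detailed physical examination, biological specimens including blood are collected, and questionnaire data are gathered on diet, smoking status and other lifestyle factors that may affect health. DNA methylation in blood samples collected after an overnight fast was measured with Illumina Infinium HumanMethylation450 BeadChips (Bibikova et al., 2011). Barfield et al. (2017) describe the detailed quality-control procedures.

(5) The data are available at http://www.stat.sinica.edu.tw/ythuang/JT-Comp.zip.

(1) Both beta-values and M-values measure the methylation level of CpG sites. Beta-values are easier to interpret biologically; M-values have better statistical properties and are better suited to downstream statistical analysis of genes.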

Basic information:

CLC number: O212.1

Citation:

[1] Li Shuya, Wang Wenwu, Wang Mingqiu. Subgroup Selection for Gaussian Mixture Models with Mediation Effects[J]. Statistical Research, 2025, 42(06):149-160. DOI:10.19343/j.cnki.11-1302/c.2025.06.011.


Release date:

2025-06-25

Publication date:

2025-06-25
