Yulin HE, Yingting HE, Zhaowu ZHAN, et al., “A Novel Subspace-Based GMM Clustering Ensemble Algorithm for High-dimensional Data,” Chinese Journal of Electronics, vol. x, no. x, pp. 1–18, xxxx, doi: 10.23919/cje.2023.00.153

A Novel Subspace-Based GMM Clustering Ensemble Algorithm for High-dimensional Data

doi: 10.23919/cje.2023.00.153
More Information
  • Author Bios:

    Yulin HE received the Ph.D. degree from Hebei University, China in 2014. From 2011 to 2014, he served as a Research Fellow with the Department of Computing, The Hong Kong Polytechnic University, China. From 2014 to 2017, he worked as a Post-doctoral Fellow in the College of Computer Science and Software Engineering, Shenzhen University, China. He is currently a Research Fellow with the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), China. His main research interests include big data approximate computing technologies, multi-sample statistical analysis theories and methods, and data mining/machine learning algorithms and their applications. He has published over 100 research papers in ACM, CAAI, and IEEE Transactions, in Elsevier and Springer journals, and at conferences such as PAKDD, IJCNN, CEC, and DASFAA. He is an ACM, CAAI, CCF, and IEEE member, and an Editorial Review Board member of several international journals. (Email: yulinhe@gml.ac.cn)

    Yingting HE is currently pursuing her master's degree with the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. Her main research interests include data mining and machine learning algorithms and applications, big data processing and analysis, and big data system computing technology. (Email: 2110276016@email.szu.edu.cn)

    Zhaowu ZHAN received the Ph.D. degree in Electrical Engineering from the Institut National des Sciences Appliquées de Lyon (INSA Lyon) and is currently a vice general manager of the R&D center of China Gridcom Co., Ltd. He is engaged in core technology research and development in the fields of telecommunication and artificial intelligence applied to electric power systems. (Email: zhanzhaowu@sgchip.sgcc.com.cn)

    Philippe FOURNIER-VIGER is a distinguished professor at the College of Computer Science and Software Engineering at Shenzhen University, China. He obtained a national talent title from the National Natural Science Foundation of China. He has published more than 300 research papers related to data mining, big data, intelligent systems, and applications, which have received more than 10,000 citations (H-index 51). He is the editor-in-chief of the Data Science and Pattern Recognition journal and a former associate editor-in-chief of the Applied Intelligence journal (SCI, Q1). He is the founder of the SPMF data mining library, which offers more than 230 algorithms and has been used in more than 1,000 research papers. He is a co-founder of the UDML, PMDB, and MLiSE workshop series held at the ICDM, PKDD, DASFAA, and KDD conferences. His interests are data mining, algorithm design, pattern mining, sequence mining, big data, and applications. (Email: philfv@szu.edu.cn)

    Joshua Zhexue HUANG received the Ph.D. degree from The Royal Institute of Technology, Stockholm, Sweden in 1993. He is currently a Distinguished Professor with the College of Computer Science and Software Engineering, Shenzhen University, China. He is also the Director of the Big Data Institute, China, and the Deputy Director of the National Engineering Laboratory for Big Data System Computing Technology. He has published over 200 research papers in conferences and journals. His main research interests include big data technology and applications. Prof. Huang received the first PAKDD Most Influential Paper Award in 2006. He is known for his contributions to the development of a series of k-means-type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes, and w-k-means, which are widely cited and used, and some of which have been included in commercial software. He has extensive industry expertise in business intelligence and data mining and has been involved in numerous consulting projects in many countries. (Email: zx.huang@szu.edu.cn)

  • Corresponding author: Email: yulinhe@gml.ac.cn
  • Received Date: 2023-04-26
  • Accepted Date: 2024-02-21
  • Available Online: 2024-06-05
  • The Gaussian mixture model (GMM) is a classical probability representation model widely used in unsupervised learning. GMM performs poorly on high-dimensional data (HDD) because it must estimate a large number of parameters from relatively few observations. To address this, the paper proposes a novel subspace-based GMM clustering ensemble (SubGMM-CE) algorithm tailored for HDD. The SubGMM-CE algorithm comprises three key components. First, a series of low-dimensional subspaces is dynamically determined, taking into account the optimal number of GMM components. Second, the GMM-based clustering algorithm is applied to each subspace to obtain a series of heterogeneous GMM models. Third, the GMM base clustering results are merged using the newly designed relabeling strategy based on the average shared affiliation probability, generating the final clustering result for high-dimensional unlabeled data. An exhaustive experimental evaluation validates the feasibility, rationality, effectiveness, and robustness to noise of the SubGMM-CE algorithm. Results show that SubGMM-CE achieves higher stability and more accurate clustering results, outperforming nine state-of-the-art clustering algorithms in normalized mutual information, clustering accuracy, and adjusted Rand index scores. This demonstrates the viability of the SubGMM-CE algorithm in addressing HDD clustering challenges. (An illustrative code sketch of this pipeline is given after the footnotes below.)
  • Footnotes:
    1. https://jundongl.github.io/scikit-feature/datasets.html
    2. https://archive.ics.uci.edu/ml/index.php
    3. https://pan.baidu.com/s/1FNy3hV_CB6OG27JcBRsA9w
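
As a rough illustration of the three-step pipeline the abstract outlines, the following minimal Python sketch builds feature subspaces, fits one GMM per subspace, relabels the base partitions against a reference, and averages the affiliation probabilities into a consensus. This is a sketch under stated assumptions, not the paper's implementation: the function name subgmm_ce_sketch is hypothetical; subspaces are drawn by plain random feature sampling rather than the paper's dynamic determination; every base GMM is assumed to share a common component count k with diagonal covariances; and a Hungarian alignment of soft co-membership overlaps stands in for the paper's average-shared-affiliation-probability relabeling.

    # Minimal sketch of a subspace GMM clustering ensemble (not the paper's code).
    # Assumptions: random feature subspaces, a shared component count k, and a
    # Hungarian alignment as a stand-in for the paper's relabeling strategy.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.mixture import GaussianMixture

    def subgmm_ce_sketch(X, k, n_subspaces=20, subspace_dim=10, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        ref = None                      # reference soft partition (first base model)
        avg = np.zeros((n, k))          # running sum of aligned probabilities
        for i in range(n_subspaces):
            feats = rng.choice(d, size=min(subspace_dim, d), replace=False)
            gmm = GaussianMixture(n_components=k, covariance_type="diag",
                                  random_state=seed + i).fit(X[:, feats])
            probs = gmm.predict_proba(X[:, feats])  # n x k affiliation probabilities
            if ref is None:
                ref = probs
            else:
                # Relabel: permute columns so each base component matches the
                # reference component it overlaps most (maximize total overlap).
                overlap = ref.T @ probs             # k x k soft co-membership
                _, perm = linear_sum_assignment(-overlap)
                probs = probs[:, perm]
            avg += probs
        avg /= n_subspaces              # averaged affiliation probabilities
        return avg.argmax(axis=1)       # consensus hard labels

    # Example usage on synthetic high-dimensional data with three clusters:
    if __name__ == "__main__":
        gen = np.random.default_rng(1)
        X = np.vstack([gen.normal(c, 1.0, size=(100, 50)) for c in (0.0, 3.0, 6.0)])
        labels = subgmm_ce_sketch(X, k=3)
        print(np.bincount(labels))      # rough cluster sizes

In the paper itself, the number of components is selected per subspace (a criterion such as GaussianMixture.bic could play that role in this sketch), and the merge is driven by the average shared affiliation probability rather than a single reference alignment; the sketch only mirrors the overall ensemble shape of subspacing, base clustering, relabeling, and merging.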