Volume 31 Issue 5
Sep. 2022
GUANG Mingjian, YAN Chungang, LIU Guanjun, et al., “A Novel Neighborhood-Weighted Sampling Method for Imbalanced Datasets,” Chinese Journal of Electronics, vol. 31, no. 5, pp. 969-979, 2022, doi: 10.1049/cje.2021.00.121

A Novel Neighborhood-Weighted Sampling Method for Imbalanced Datasets

doi: 10.1049/cje.2021.00.121
Funds: This work was supported by the National Key Research and Development Program of China (2018YFB2100801).
  • Author Bio:

    GUANG Mingjian received the B.S. and M.S. degrees in software engineering from Mongolian University, Inner Mongolia, China, in 2014 and 2018, respectively. He is currently pursuing the Ph.D. degree at the College of Electronic and Information Engineering, Tongji University, Shanghai, China. He has authored or coauthored several papers in conferences and journals, including IJCNN and TNNLS. His research interests include machine learning, graph neural networks (GNNs), and fraud detection. (Email: guangmingjian204@163.com)

    YAN Chungang received the Ph.D. degree from Tongji University, Shanghai, China, in 2006. She is currently a Professor with the Department of Computer Science and Technology, Tongji University. Her current research interests include concurrent models and algorithms, Petri net theory, formal verification of software, and trust theory for software processes. (Email: yanchungang@tongji.edu.cn)

    LIU Guanjun (corresponding author) received the Ph.D. degree in computer software and theory from Tongji University, China, in 2011. From 2011 to 2013, he was a Post-Doctoral Research Fellow with the Singapore University of Technology and Design, Singapore. From 2013 to 2014, he was a Post-Doctoral Research Fellow with the Humboldt University of Berlin, Berlin, Germany, supported by the Alexander von Humboldt Foundation. He is currently a Professor with the Department of Computer Science and Technology, Tongji University. His current research interests include Petri net theory, model checking, Web services, workflows, machine learning, and credit card fraud detection. (Email: liuguanjun@tongji.edu.cn)

    received the Ph.D. degree in computer science from Tongji University in 2007. She is currently an Associate Researcher at the College of Electronics and Information Engineering, Tongji University. Her research interests include text data analysis, deep learning, and AI. (Email: liuguanjun@tongji.edu.cn)

    (corresponding author) received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1995. He is currently the Leader of the Key Laboratory of the Ministry of Education for Embedded System and Service Computing with Tongji University. He is an Honorary Professor with Brunel University London, London, UK. His current research interests include concurrency theory, Petri nets, formal verification of software, cluster and grid technology, intelligent transportation systems, and service-oriented computing. Dr. Jiang is an IET Fellow. He was a recipient of one international prize and seven prizes in the field of science and technology. (Email: cjjiang@tongji.edu.cn)

  • Received Date: 2021-04-07
  • Accepted Date: 2021-10-31
  • Available Online: 2022-03-09
  • Publish Date: 2022-09-05
  • Weighted sampling methods based on k-nearest neighbors have been shown to be effective for the class imbalance problem. However, when calculating a sample’s weight, they usually ignore the positional relationship between the sample and the heterogeneous samples in its neighborhood. This paper proposes a novel neighborhood-weighted sampling method, named NWBBagging, to improve the performance of the Bagging algorithm on imbalanced datasets. When identifying critical samples, it considers the positional relationship between the center sample and the heterogeneous samples in its neighborhood. In addition, a parameter reduction method is proposed and incorporated into the ensemble learning framework, which reduces the number of parameters and increases the diversity of the classifiers. We compare NWBBagging with several state-of-the-art ensemble learning algorithms on 34 imbalanced datasets, and the results show that NWBBagging achieves better performance.
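
The general idea of weighting samples by their k-nearest neighbors can be sketched as follows. This is only an illustration under stated assumptions, not the NWBBagging weight defined in the paper: the helper names `neighborhood_weights` and `weighted_bootstrap`, the closeness term `1 / (1 + distance)`, and the `floor` parameter are all hypothetical choices made for this example. The sketch contrasts a plain count of heterogeneous (other-class) neighbors with a variant that also takes their distance to the center sample into account.

```python
import numpy as np


def neighborhood_weights(X, y, k=5, distance_aware=True):
    """Illustrative per-sample weights from the k nearest neighbors.

    Hypothetical helper, not the weight used by NWBBagging. A sample gets
    more weight when more of its neighbors belong to the other class and,
    if distance_aware is True, when those neighbors lie closer to it.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]            # indices of the k nearest neighbors
    nn_dist = np.take_along_axis(dist, nn, axis=1)  # distances to those neighbors
    hetero = y[nn] != y[:, None]                    # True where a neighbor has the other class
    if distance_aware:
        closeness = 1.0 / (1.0 + nn_dist)           # assumed form: closer neighbors count more
        return (closeness * hetero).sum(axis=1) / closeness.sum(axis=1)
    return hetero.mean(axis=1)                      # plain fraction of heterogeneous neighbors


def weighted_bootstrap(weights, n_samples, rng, floor=0.05):
    """Draw bootstrap indices with probability proportional to weight + floor."""
    p = np.asarray(weights, dtype=float) + floor    # floor keeps zero-weight samples drawable
    p /= p.sum()
    return rng.choice(len(p), size=n_samples, replace=True, p=p)
```

In a Bagging-style ensemble, each base classifier could be trained on the indices returned by one call to `weighted_bootstrap`, so that samples near the class boundary appear more often. How the paper combines its weights with Bagging and with its parameter reduction method is not specified in the abstract, so that part is left out here.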