PANG Shanchen, YAO Jiamin, LIU Ting, ZHAO Hua, CHEN Hongqi. A Text Similarity Measurement Based on Semantic Fingerprint of Characteristic Phrases[J]. Chinese Journal of Electronics, 2020, 29(2): 233-241. doi: 10.1049/cje.2019.12.011

A Text Similarity Measurement Based on Semantic Fingerprint of Characteristic Phrases

doi: 10.1049/cje.2019.12.011
Funds:  This work is supported by the National Natural Science Foundation of China (No.61572523, No.61873281, No.61572522).
  • Received Date: 2019-04-23
  • Revision Received Date: 2019-07-05
  • Publish Date: 2020-03-10
  • Text similarity measurement is the basis for judging how closely two or more texts match. Traditional large-scale similarity detection methods based on digital fingerprints are fast, but they are only suitable for detecting exact matches. We propose a Chinese text similarity measurement method based on the semantics of feature phrases. Natural language processing (NLP) techniques are used to pre-process the text, keywords are extracted with the term frequency-inverse document frequency (TF-IDF) model, and the feature words are then screened out from them. The exact meaning of each word and the semantic similarities between words are obtained using the HowNet semantic dictionary. Concepts are then substituted to obtain the feature phrases, from which a semantic fingerprint is generated and the similarity is calculated. Experimental results indicate that the proposed method outperforms the traditional digital fingerprinting method in similarity detection in terms of accuracy, recall, and F-value.
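
The pipeline outlined in the abstract (TF-IDF weighting, concept substitution, fingerprint generation, similarity calculation) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the fingerprint is built, so a SimHash-style fingerprint is used here as a stand-in. The `CONCEPTS` table is a hypothetical placeholder for a HowNet lookup, all function names are invented for illustration, and the input is assumed to be already segmented into tokens (English tokens are used only for readability).

```python
import hashlib
import math
from collections import Counter

# Hypothetical stand-in for a HowNet lookup: word -> canonical concept.
CONCEPTS = {"car": "vehicle", "automobile": "vehicle",
            "quick": "fast", "rapid": "fast"}


def to_concepts(tokens):
    """Replace each token by its concept, leaving unknown tokens unchanged."""
    return [CONCEPTS.get(t, t) for t in tokens]


def tfidf_weights(doc, corpus):
    """Return {term: TF-IDF weight} for one tokenized document."""
    tf = Counter(doc)
    n = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (an assumption)
        weights[term] = (count / len(doc)) * idf
    return weights


def simhash(weights, bits=64):
    """SimHash-style fingerprint: weighted vote per bit position (bits <= 64)."""
    acc = [0.0] * bits
    for term, w in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:
            fp |= 1 << i
    return fp


def similarity(fp_a, fp_b, bits=64):
    """Similarity in [0, 1] as 1 minus the normalized Hamming distance."""
    return 1.0 - bin(fp_a ^ fp_b).count("1") / bits


docs = [
    "the quick car passed the bridge".split(),
    "the rapid automobile passed the bridge".split(),
    "stock prices fell sharply on monday".split(),
]
corpus = [to_concepts(d) for d in docs]
fps = [simhash(tfidf_weights(d, corpus)) for d in corpus]
print(similarity(fps[0], fps[1]))  # the two paraphrases
print(similarity(fps[0], fps[2]))  # an unrelated sentence
```

In this toy data the first two sentences become identical after concept substitution, so their fingerprints coincide, whereas hashing the raw words would give different fingerprints. That is the limitation of exact-match digital fingerprints that the abstract describes.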