TEXT DATA CORPORA GENERATION ON THE BASIS OF THE DETERMINISTIC METHOD
DOI:
https://doi.org/10.20535/kpisn.2021.3.240780
Keywords:
corpus of text data, corpora generation, natural language processing, data clustering, k-means method
Abstract
Background. Many problems in natural language processing rely on corpora of text data, which makes the preparation of such corpora a pressing task. However, building corpora from natural texts is time-consuming and not always practical. Automated corpus generation based on various methods and algorithms is therefore gaining popularity, since it greatly simplifies the preparation of experimental data.
Objective. The purpose of the paper is to increase the number of defects detected when testing software implementations of natural-language text processing methods, by developing a new method for generating text data corpora that serve as test input.
Methods. A deterministic method of text data corpora generation, named CorDeGen, is proposed. It satisfies the following requirements: determinism, a single input parameter (the desired number of terms in the generated corpus), and a non-trivial structure of the generated corpus. An algorithm implementing the proposed method has been developed, along with a software implementation on the .NET platform (in C#). The speed and efficiency of the method were evaluated using this software.
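The three requirements listed above can be illustrated with a toy generator. The sketch below is not the CorDeGen algorithm, which this abstract does not specify; the function name `generate_corpus`, the term naming scheme, and the placement rules are all hypothetical and chosen only to show what "deterministic, driven by the term count alone, and non-trivially structured" could look like in practice.

```python
import math
from typing import List


def generate_corpus(term_count: int) -> List[List[str]]:
    """Toy deterministic corpus generator (hypothetical, NOT CorDeGen).

    The only input is the number of distinct terms. No randomness is used,
    so the same term_count always yields exactly the same corpus.
    """
    if term_count < 1:
        raise ValueError("term_count must be positive")

    # Assumption: the number of documents grows as the square root of the term count.
    doc_count = max(1, math.isqrt(term_count))
    docs: List[List[str]] = [[] for _ in range(doc_count)]

    for i in range(term_count):
        term = f"t{i:x}"          # synthetic term: 't' plus the index in hex
        home = i % doc_count      # each term has one "home" document

        # Varying frequencies (1 to 3 occurrences) give non-uniform term weights.
        docs[home].extend([term] * (1 + i % 3))

        # Every second term also appears in the next document, so documents
        # overlap partially instead of being trivially disjoint.
        if i % 2 == 0:
            docs[(home + 1) % doc_count].append(term)

    return docs


corpus = generate_corpus(16)
for index, doc in enumerate(corpus):
    print(index, " ".join(doc))
```

Because the output depends only on `term_count`, a test can hard-code the expected result of a downstream step (for example, cluster assignments) for a given corpus size and compare against it on every run.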
Results. The speed evaluation of the developed CorDeGen method showed a power-law dependence of corpus generation time on the number of terms (the input parameter), with an exponent of about 1.5. The feasibility of using the developed method to test the correctness of software implementations is demonstrated by testing an implementation of the k-means clustering method.
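A power-law exponent of this kind is commonly estimated by fitting a straight line to the measurements in log-log coordinates. The sketch below shows that technique under stated assumptions: the helper `fit_power_law` is hypothetical, and the timings are synthetic values constructed to follow an exponent of 1.5 exactly, not measurements from the paper.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + k * log(size).

    Returns (c, k) for the model time = c * size ** k.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count

    # Slope of the regression line in log-log space is the exponent k.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx

    # Intercept gives log(c).
    c = math.exp(mean_y - k * mean_x)
    return c, k


# Synthetic data (assumption): times follow 2e-6 * n ** 1.5 exactly.
sizes = [1000, 2000, 4000, 8000]
times = [2e-6 * n ** 1.5 for n in sizes]

c, k = fit_power_law(sizes, times)
print(f"estimated exponent: {k:.3f}")
```

With an exponent of 1.5, doubling the number of terms multiplies the generation time by 2^1.5, or roughly 2.83.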
Conclusions. Evaluation of the developed deterministic method of text data corpora generation has shown that corpora generated with it can be used effectively, in place of natural corpora, to test software implementations of natural language processing tasks such as clustering.
License
Copyright (c) 2021 Tetiana M. Zabolotnia, Yakiv O. Yusyn
This work is licensed under a Creative Commons Attribution 4.0 International License.