Keywords. Corpus of text data, corpora generation, natural language processing, data clustering, k-means method.


Background. Many problems in natural language processing are solved using corpora of text data, which makes the preparation of such corpora a relevant problem. However, building corpora from natural texts is time-consuming and not always practical. Automated corpus generation based on various methods and algorithms is therefore gaining popularity, since it greatly simplifies the preparation of experimental data.

Objective. The aim of the paper is to increase the number of defects detected when testing software implementations of methods for processing natural-language text data, by developing a new method for generating text data corpora to be used as test input.

Methods. A deterministic method for generating text data corpora, named CorDeGen, is proposed. It satisfies the following requirements: determinism, dependence on a single input parameter (the desired number of terms in the generated corpus), and a non-trivial structure of the generated corpus. An algorithm implementing the proposed method has been developed, together with a software implementation on the .NET platform (in C#). The speed and effectiveness of the method were evaluated using this software.
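The three requirements can be illustrated with a minimal sketch. The abstract does not specify the CorDeGen algorithm, so the generator below is a hypothetical toy (the name `toy_corpus`, the term naming scheme, and the distribution rule are all assumptions made for illustration): it takes only the number of terms, produces the same corpus on every call, and gives terms unequal frequencies with some overlap between documents.

```python
from math import isqrt


def toy_corpus(n_terms):
    """Hypothetical toy generator; NOT the CorDeGen algorithm."""
    if n_terms < 1:
        raise ValueError("n_terms must be positive")
    # The document count is derived from the only input parameter.
    n_docs = isqrt(n_terms)
    docs = [[] for _ in range(n_docs)]
    for i in range(n_terms):
        term = f"t{i}"
        home = i % n_docs
        # Frequency grows with the term index, so term frequencies are not uniform.
        docs[home].extend([term] * (i // n_docs + 1))
        # Every third term also appears once in the next document (overlap).
        if i % 3 == 0:
            docs[(home + 1) % n_docs].append(term)
    return docs


corpus = toy_corpus(16)
vocabulary = {term for doc in corpus for term in doc}
```

Because no random source is involved, two calls with the same argument return identical corpora, which is the property that makes expected test results reproducible.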

Results. The speed evaluation of the CorDeGen method showed a power-law dependence of corpus generation time on the number of terms (the input parameter), with an exponent of about 1.5. The feasibility of using the method to test the correctness of software implementations is demonstrated by the example of testing the k-means clustering method.
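A power-law dependence T(n) ≈ c·n^k becomes a straight line on a log-log scale, log T = log c + k·log n, so the exponent k can be estimated as the least-squares slope of log T against log n. The sketch below shows that calculation; the function name and the timing values are assumptions, and the timings are synthetic (generated from an exact n^1.5 law), not measurements from the paper.

```python
import math


def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic timings produced from t = c * n**1.5 purely for illustration;
# these are NOT the paper's benchmark results.
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
times = [2e-6 * n ** 1.5 for n in sizes]
exponent = fit_power_law_exponent(sizes, times)
```

With an exponent of 1.5, doubling the number of terms multiplies the generation time by roughly 2^1.5 ≈ 2.8.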

Conclusions. Testing of the developed deterministic method of text data corpora generation has shown that it can be used effectively in place of natural corpora when testing software implementations of natural language processing methods, such as clustering.

