Using the lexicon from source code to determine application domain

Capiluppi, A, Ajienka, N ORCID: https://orcid.org/0000-0002-8792-282X, Ali, N, Arzoky, M, Counsell, S, Destefanis, G, Miron, A, Nagaria, B, Neykova, R, Shepperd, M, Swift, S and Tucker, A, 2020. Using the lexicon from source code to determine application domain. In: Proceedings of The International Conference on Evaluation and Assessment in Software Engineering (EASE 2020), Trondheim, Norway, 15-17 April 2020. Association for Computing Machinery (ACM). (Forthcoming)

Official URL: https://doi.org/10.1145/nnnnnnn.nnnnnnn

Abstract

Context: The vast majority of software engineering research is independent of the application domain: the use of techniques and tools is reported without any domain context. This has not always been so: early in the computing era, the research focus was frequently specific to an application domain (for example, scientific and data processing).

Objective: We believe determining the research context is often important. Therefore, we propose a code-based approach to identify the application domain of a software system via its lexicon. We compare its precision with that of the plain textual description attached to the same system.

Method: Using a sample of 50 Java projects, we obtained i) the description of each project (e.g., its ReadMe file), ii) the lexicon extracted from its source code, and iii) a list of its main topics extracted with the Latent Dirichlet Allocation (LDA) information retrieval technique. We assigned a random subset of these data items to different researchers (i.e., ‘experts’), and asked them to assign each item to one or more application domains. We then evaluated the precision and accuracy of the three techniques.
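
As an illustration of step ii), the following is a minimal sketch of how a lexicon might be pulled out of Java source text. It is not the authors' extraction pipeline, which the abstract does not describe: the function names `split_identifier` and `extract_lexicon`, the stop list `JAVA_STOP_TERMS`, and the choice to drop terms shorter than three characters are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete stop list: Java keywords, literals,
# and one very common type name that carry no domain meaning.
JAVA_STOP_TERMS = {
    "public", "private", "protected", "class", "interface", "static",
    "final", "void", "int", "long", "double", "float", "boolean", "char",
    "byte", "short", "return", "new", "else", "for", "while", "switch",
    "case", "break", "continue", "try", "catch", "finally", "throw",
    "throws", "import", "package", "extends", "implements", "this",
    "super", "null", "true", "false", "string",
}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    terms = []
    for part in re.split(r"[_$]+", identifier):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [term.lower() for term in terms]


def extract_lexicon(java_source):
    """Count the terms found in all word-like tokens of a Java source string.

    Assumption: comments and string literals are left in, so their words
    are counted together with the identifiers.
    """
    counts = Counter()
    for token in re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", java_source):
        for term in split_identifier(token):
            if len(term) > 2 and term not in JAVA_STOP_TERMS:
                counts[term] += 1
    return counts


# Made-up Java fragment, used only to exercise the functions above.
SAMPLE = """
public class InvoiceParser {
    private int invoiceCount;
    public Invoice parseInvoice(String rawText) { return null; }
}
"""

lexicon = extract_lexicon(SAMPLE)
print(lexicon.most_common(3))
```

A term-frequency table of this kind could serve as the per-project lexicon, or as input to a topic model such as LDA for step iii).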

Results: Using the agreement levels between experts, we observed that the ‘baseline’ dataset (i.e., the ReadMe files) obtained the highest average agreement, but we also observed that the three techniques had the same mode and median agreement levels. Additionally, in the cases where no agreement was reached for the baseline dataset, the two other techniques provided sufficient additional support.
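
The abstract does not define how agreement levels were computed. As a rough sketch only, one plausible reading is that the agreement level for an item is the fraction of experts who chose the most common domain, summarised per technique by mean, median and mode. The functions `agreement_level` and `summarise` are hypothetical, the single-label simplification is an assumption, and the labels are invented for illustration and are not the study's data.

```python
from collections import Counter
from statistics import mean, median, mode


def agreement_level(labels):
    """Fraction of experts who picked the most common domain for one item.

    Assumption: each expert gives exactly one domain label per item.
    """
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def summarise(levels):
    """Mean, median and mode of a list of per-item agreement levels."""
    return {"mean": mean(levels), "median": median(levels), "mode": mode(levels)}


# Invented labels from three hypothetical experts for three items.
items = [
    ["finance", "finance", "finance"],
    ["games", "games", "games"],
    ["finance", "games", "games"],
]
levels = [agreement_level(labels) for labels in items]
print(summarise(levels))
```

Under this reading, two techniques can share the same median and mode while differing in mean, which is the pattern the results describe.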

Conclusions: We conclude that using the corpora or the topics from source code can be an adequate substitute for the plain description when assigning a software system to an application domain.

Item Type:	Chapter in book
Description:	Proceedings of The International Conference on Evaluation and Assessment in Software Engineering (EASE 2020), Trondheim, Norway, 15-17 April 2020.
Creators:	Capiluppi, A., Ajienka, N., Ali, N., Arzoky, M., Counsell, S., Destefanis, G., Miron, A., Nagaria, B., Neykova, R., Shepperd, M., Swift, S. and Tucker, A.
Publisher:	Association for Computing Machinery (ACM)
Date:	3 February 2020
Identifiers:	DOI: 10.1145/nnnnnnn.nnnnnnn; Other: 1292587
Divisions:	Schools > School of Science and Technology
Record created by:	Linda Sullivan
Date Added:	17 Feb 2020 16:23
Last Modified:	17 Feb 2020 16:24
URI:	https://irep.ntu.ac.uk/id/eprint/39221