The Multiple Goals and Data in Data-Mining for Software Engineering

Data mining for software engineering consists of collecting software engineering data, extracting some knowledge from it and, if possible, use this knowledge to improve the software engineering process, in other words “operationalize” the mined knowledge. For instance, researchers have extracted usage patterns from millions of lines of code of the Linux kernel in order to find bugs [8].

In essence, data mining for software engineering can be decomposed along three axes [12]: the goal, the input data used, and the mining technique used. For example, the goal may be to improve code completion systems [3].

The goal. Software engineering at large consists of many tasks from specification, design, development, monitoring at runtime, etc. Each task is itself decomposed in many smaller scale tasks. For instance, a developer constantly switches between tasks, such as navigating code, reading documentation, writing code, debugging, etc. During the last decade, it has been shown that most software engineering tasks can benefit from data mining approaches, the tasks being whether technical [13] or more people oriented [11]. Furthermore, the community, although favoring concrete contributions, also produces exploratory results, descriptive phenomena observed in software engineering data [4].

The input data. The software engineering process in its entirety manipulates all kinds of data. Of course, one thinks of code, but there are also many written documents (specifications, documentation), design documents (diagrams, formulas), runtime documents (logs), etc. Most of them can be versioned using a Version Control System (e.g., CVS, SVN, Git). Depending on the targeted goal, some artifacts are more or less appropriate, and blended approaches are possible (using different kinds of software engineering artifacts in conjunction). Also, there is usually a fair amount of pre-processing that is specific to the artifacts under consideration: natural language processing for written documents [6], static analyses for code, etc.

The techniques. Nowadays, there is a wealth of data-mining and machine learning techniques. Not only they exist, but mature implementations are available and powerful hardware enables techniques to scale to large datasets. To manipulate software engineering data, there is no one-size-fits-all solution. From supervised to unsupervised approaches, numerical or categorical, batch or online, many techniques have been used. There are initiatives, such as centralized datasets [9] and contests [2] to enable scientific comparisons of appropriateness and performance.

computational-software-engineering

In this line of thought, “data mining for software engineering” is not the only term that is used in the literature. “Machine learning for software engineering” [1] focuses on the algorithmic techniques and especially on the learning part, e.g., learning how to recognize defect-prone modules. “Recommendation systems for software engineering” [10] highlights the decision-helping side, in terms of systems leveraging data-mining to help software systems stakeholders and engineers to make wiser decisions. The “Mining software repositories” research [5] is called in reference to software forges (e.g., Sourceforge) and refers to all kinds of repositories (e.g., code, bugs, versions). A more recent term “big data software engineering” surfes on the “big data” hype but indeed refers to a reality: software engineering data can be extremely large, from millions of specification pages to millions of lines of code. For a larger account on this topic, we refer the reader to Hassan's survey [5].

The current challenges of data-mining for software engineering are diverse, we refer the reader to Hassan and Xie's paper [7] for an overview.

[1] MALETS '11: Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering. D. Alrajeh, T. Menzies, A. Russo, editors, ACM, New York, NY, USA, 2011.

[2] Alberto Bacchelli. Mining Challenge 2013: Stack Overflow. In « The 10th Working Conference on Mining Software Repositories », to appear, 2013.

[3] Marcel Bruch, Martin Monperrus, Mira Mezini. Learning from Examples to Improve Code Completion Systems. In « Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM Symposium on the Foundations of Software Engineering », ACM, 2009, http://www.monperrus.net/martin/Learning-from-Examples-to-Improve-Code-Completion-Systems.pdf

[4] Daniel M. German. An empirical study of fine-grained software modifications. In « Empirical Softw. Engineering », 3, 11, September, 2006, 369–393,

[5] Ahmed E. Hassan. The road ahead for Mining Software Repositories. In « Frontiers of Software Maintenance, 2008. FoSM 2008. », 48 -57, 28 2008-oct. 4, 2008,

[6] Stefan Henss, Martin Monperrus, Mira Mezini. Semi-Automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge. In « Proceedings of the International Conference on Software Engineering », 793 - 803, 2012, http://hal.inria.fr/docs/00/68/19/06/PDF/paper.pdf

[7] Ahmed E Hassan, Tao Xie. Software intelligence: the future of mining software engineering data. In « Proceedings of the FSE/SDP workshop on Future of software engineering research », ACM, 161–166, 2010.

[8] Zhenmin Li, Yuanyuan Zhou. PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code. In « Proc. of the 10th European Software Engineering Conference held jointly with 13th International Symposium on Foundations of Software Engineering », ESEC/FSE-13, ACM, 306–315, 2005.

[9] Tim Menzies, Bora Caglayan, Ekrem Kocaguneli, Joe Krall, Fayola Peters, Burak Turhan. The PROMISE Repository of empirical software engineering data. June, 2012, http://promisedata.googlecode.com

[10] Martin Robillard, Robert Walker, Thomas Zimmermann. Recommendation Systems for Software Engineering. In « IEEE Software », 2009.

[11] D. Schuler, T. Zimmermann. Mining usage expertise from version archives. In « MSR », 2008.

[12] Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu. Data Mining for Software Engineering. In « IEEE Computer », 8, 42, August, 2009, 55-62,

[13] Thomas Zimmermann, A. Zeller, P. Weissgerber, S. Diehl. Mining version histories to guide software changes. In « IEEE Transactions on Software Engineering », 6, 31, June, 2005, 429–445, http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1463228