Research topics for a thesis or an internship

by Martin Monperrus

Are you a KTH student looking for a fun and challenging topic in software engineering? Do you your Master's thesis / Bachelor's thesis under my supervision; or register to "Advanced Individual Course in Computer Science" with me as supervisor; or apply to be a research assistant (aka research amanuens), which is a 20% research job, where you get paid by KTH.

Are you a brilliant international student looking for an internship in a world-class research lab (remote internship included)?

Contact me by email if you want to join my group –Martin


Topics in Program Repair
     Artificial Intelligence for Repair
               Graph Neural Networks for Program Repair
               A comparison of self-healing parsers and machine learning for repairing syntax errors
               Study of the importance of pre-training for machine learning on code
               A large-scale study of patch embedding
               Automatic transfer of formatting with machine learning
               Sequence-to-sequence machine learning for automatic program repair
     Artificial Software Developer on Github (Repairnator)
               Automated Program Repair for C#
               An Artificial Bug Fixing Bot for Python.
               Automated stewardship of open-source projects.
               Dynamic Overfitting Detection by Comparing Execution Traces
               Automatic Repair of SonarQube Static Warnings.
     Generate-and-validate program repair (Astor)
               Learning of variable relationships and ingredient adaptation for automatic program repair
               Explainable repair templates for Astor
               Emergency recovering of buildability on continuous integration.
Topics in Program Hardening
     Chaos Engineering
               Chaos Engineering for Microservices in .NET and C#.
               Chaos Engineering in Service Mesh Proxies.
               Automatic Workarounds of Broken Websites due to Privacy-enhancing Plugins
     Randomization
               Side-channel attack detection in Java.
               Preventing algorithmic DOS attacks with blackbox randomization.
               Automatic Randomization of Programs with Neural Networks
Topics in Commit Analysis (Coming)
               Automatic Identification Bug-Fix Commits
               Automatic Labeling of Commits to Identify Fix Patterns
               Data science of large data from continuous integration
Topics in Software Testing
               Automatic Prioritization of Mutants in Mutation Testing
               Automatic Debloat of Test Suites
               Automatic Repair of Flaky Tests
               Automatic Repair of Rotten Green Tests
               Automatic Renaming of Test Variables to Improve Maintainability
Topics in Code Transformation
               Semantic Tree based Resolution for Git Merge Conflicts with Spoon
               Neural decompilation for JVM bytecode
               Binary Retargeting with Neural Machine Translation
               Declarative Code Transformation with the Semantic Patch Language in Java
               Metamorphic code transformation for Java

Topics in Program Repair

Artificial Intelligence for Repair

Graph Neural Networks for Program Repair

Supervision: Martin Monperrus (KTH)

Recently, it has been shown that one can use graph neural networks to synthesize code [1]. The key insight is to use self-supervision during learning. This opens up many new possibilities for program repair. The goal of this work is to extend the system in the context of Java programs and to apply it on known Java bugs. The work will be made on top of recent advances on graph neural networks.

  1. Generative Code Modeling with Graphs arxiv 1805.08490

  2. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

A comparison of self-healing parsers and machine learning for repairing syntax errors

Supervision: Martin Monperrus (KTH)

Description: Syntax errors and compiler errors happen all the times and are known to be sometimes are to locate and fix. To solve this problem, two families of approaches have been proposed: one based on self-recovering parsers [1], the other one based on machine learning [2]. The student will perform a large scale, systematic evaluation to compare them in a scientific manner on a common dataset.

  1. Reducing Cascading Parsing Errors Through Fast Error Recovery github

  2. Syntax and sensibility: Using language models to detect and correct syntax errors code

Study of the importance of pre-training for machine learning on code

Supervision: Martin Monperrus (KTH) / Hugo Mougard (source{d})

Description: Circa 2014, the field of Computer Vision saw its first publications of efficient pre-trainings and transfer learning (https://en.wikipedia.org/wiki/Transferlearning). In natural language processing, a big result was published in 2018 with performant pre-trainings on generic tasks [2] (such as word or sentence completion) allowing models to achieve state of the art with minor changes to the base model. Today, there is a growing body of research on machine learning on code [3], with a variety of models and embeddings used to perform a variety of goals (code completion, program repair, renaming, etc;). Yet, there is no research on the importance and feasibility of pretraining on those machine learning on code tasks. The student will design, implement and perform an experiment to study the importance of pre-training for machine learning on code. The student will have access to the full code analysis stack of sourced (link) to run the experiments. Weekly progress reports with the sourced engineers will be organized.

  1. Very Deep Convolutional Networks for Large-Scale Image Recognition

  2. Deep contextualized word representations

  3. A survey of machine learning for big code and naturalness

A large-scale study of patch embedding

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS (collaboration with Univ. Wisconsin-Madison)

A lot of automatic bug fixing generation techniques rely studying past patches. Recent work has proposed to embed source changes in a real-valued vectorial space [1]. The student will implement and extend this embedding technique, in order to make it work on Java code and large patches [2]. The student will perform the experiment by running it on a scientific computing grid.

  1. Learning to represent edits (2018)

  2. An Empirical Study on Real Bug Fixes (2015).

Automatic transfer of formatting with machine learning

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: It is common practice to use and enforce a certain coding style in software projects. This can become a nightmare when one copies files from one project to another, where the project use different conventions. For this, there is a need to be able to transfer the style one project to files coming from potentially anywhere with any style. It may be possible to use machine learning to perform the transfer [1,2] The student will devise, implement and evaluate an approach to automatically transfer coding style. The student will perform the experiments a scientific computing grid.

  1. Learning Natural Coding Conventions (https://github.com/mast-group/naturalize)

  2. Towards a Universal Code Formatter through Machine Learning (https://github.com/antlr/codebuff)

Sequence-to-sequence machine learning for automatic program repair

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

A lot of automatic bug fixing generation techniques rely on slightly modifying the existing code. The student will devise and evaluate a new repair algorithm that will learn from past diffs, using sequence-to-sequence learning. The planned methodology is as follows: 1) set up a training and evaluation dataset based on diffs 2) devise, implement and assess a new repair algorithm based on this data. The student will perform the experiment by running it on a scientific computing grid.

  1. An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

  2. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  3. Benchmark of single-line bugs

Artificial Software Developer on Github (Repairnator)

Example of related work: https://medium.com/@martin.monperrus/human-competitive-patches-in-automatic-program-repair-with-repairnator-359042e00f6a.

Automated Program Repair for C#

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: In industry, C# is a very popular programming language for enterprise applications. In order to demonstrate our research to our industrial partners, the goal of this work is to implement a first prototype of program repair for C#. The student will devise, implement and evaluate an automatic repair system for C#.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

An Artificial Bug Fixing Bot for Python.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: On Travis, the #1 language is Python (millions of builds, 4x more than Java). In order to increase the impact of Repairnator, the goal of this work is to implement a first prototype of Repairnator in Python. The student will devise, implement and evaluate an automatic repair system for Python and Travis CI.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

Automated stewardship of open-source projects.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: On Github, certain projects are very popular but the maintainers have not enough bandwidth to keep up the pace of pull requests. Those projects literally die under too many pull requests. For those projects, the maintainers need a robot to help them merge pull requests. The student will devise, implement and evaluate an automated steward for software development projects. The steward will try to take over dead yet popular software projects.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

Dynamic Overfitting Detection by Comparing Execution Traces

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: A big problem in automated program repair is the presence of "overfitting patches" which are patches that pass all tests yet are incorrect [1,2]. The goal of this thesis is to study, design and implement a machine-learning based system for measuring the likelihood of an execution trace to be buggy.

  1. Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System (EMSE 17)

  2. Identifying Patch Correctness in Test-Based Automatic Program Repair (ICSE 18)

Automatic Repair of SonarQube Static Warnings.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: There exists several systems for statically identifying problems (FindBugs, PMD, SpotBugs, SonarQube). The student will work on a system to automatically repair those issues, focusing on SonarQube warnings. The repair algorithms will be code transformations based on antiunification, written using the Spoon transformation library for Java.

  1. Are Static Analysis Violations Really Fixed? A Closer Look at Realistic Usage of SonarQube. Dataset for OSS organizations

  2. Getafix: Learning to Fix Bugs Automatically

Generate-and-validate program repair (Astor)

Repository: https://github.com/SpoonLabs/astor

Learning of variable relationships and ingredient adaptation for automatic program repair

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: Most patches don't introduce new variables, they only reuse existing variables and method calls. The student will set up and perform an experiment to use deep learning for mining variables relationships so that snippets can be fitted in the current repair context. The planned methodology is as follows: 1) extract a dataset of variable relations based on existing code 2) apply deep-learning on this data and analyze the results 3) devise, implement and assess a extension of Nopol/Astor to use the mined informatio in particular to adapt program repair ingredients. The student will perform the experiment by running it on a scientific computing grid.

  1. ASTOR: A Program Repair Library for Java

  2. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities

  3. SmartPaste: Learning to Adapt Source Code

Explainable repair templates for Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In practice, template-based repair is promising, because they are very easy to explain to the developer and they also provide reduced overfitting. The student will design and implement a full-fledged template based system for Astor, based on the latest literature on repair templates. A large scale evaluation will be done on the novel repair benchmarks Bugs.jar and BEARS. we can play with the vocabulary to have pure language based templates and project-specific ones

  1. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  2. Revisiting template-based automated program repair

  3. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

Emergency recovering of buildability on continuous integration.

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In very active software projects, it is important to always keep a passing build on master. When the master branch fails, this is potentially a blocking factor for big teams. Therefore, there is a need for a technique to recover a passing build status on master in emergency. The goal of this thesis is to study, design and implement such a technique and to evaluate it on large-scale open-source projects.

  1. Oops, my tests broke the build: An analysis of travis ci builds with github

  2. Who broke the build?: Automatically identifying changes that induce test failures in continuous integration at google scale

  3. Keeping Master Green at Scale

Topics in Program Hardening

Chaos Engineering

Chaos Engineering for Microservices in .NET and C#.

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. ChaosMachine [2] is such an approach for microservices implemented in Java. While microservices in the .NET world are conceptually similar, there are a number of technical differences. You will study, design and implement the ChaosMachine for microservices in .NET and C#

  1. Chaos Engineering (the book)

  2. A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Chaos Engineering in Service Mesh Proxies.

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. Durieux et al. [2] have shown that one can effectively do code transformation in user-facing proxies for safe-healing problems. It is also possible to code transformation in service mesh proxies, in order to increase observability and potentially perturb the system execution. You will study, design and implement the chaos engineering system for proxies and service mesh.

  1. Chaos Engineering (the book)

  2. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

Automatic Workarounds of Broken Websites due to Privacy-enhancing Plugins

Supervision: Martin Monperrus, KTH Royal Institute of Technology

Description: Users can improve their online privacy by installing privacy-enhancing plugins, such as uBlock. However, those plugins break important functionality of some websites [1]. The student will build on the ideas on BikiniProxy [1] to provide automatic repair of broken websites due privacy enhancing technology. The student will design and implement the system (eg "uBlock-repair") either as a browser plugin or as proxy.

  1. A comparison of web privacy protection techniques

  2. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

  3. Automatic Visual Verification of Layout Failures in Responsively Designed Web Pages.

Randomization

Side-channel attack detection in Java.

Supervision: Martin Monperrus, KTH Royal Institute of Technology

Description: The goal of this thesis is to study side-channel attacks in Java [1]. The student will devise and perform a scientific experiment in this context based on the work of Nilizadeh et al. [1]. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. DifFuzz: Differential Fuzzing for Side-Channel Analysis

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

Preventing algorithmic DOS attacks with blackbox randomization.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: The goal of this thesis is to study counter-measures to algorithmic denial of service attacks. An algorithmic DOS consists of a input specifically designed by the attacker to trigger the worst case execution of a program [1]. Black-box randomization consists of identifying and injecting randomization points in software, without any knowledge of the application domain and implementation choices. The goal of this thesis is to study the usage of black-box randomization for countering algorithmic DOS attacks. The student will devise and perform a scientific experiment in this context. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. Denial of Service via Algorithmic Complexity Attacks

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

  3. Slowfuzz: Automated domain-independent detection of algorithmic complexity vulnerabilities

Automatic Randomization of Programs with Neural Networks

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

The fact that programs are executed mostly deterministically is a problem for security [1]. Automatic randomization is one way of overcoming this problem. In this work, we we will explore the use of neural networks, and their ability to generate likely sequences, to synthesize equivalent programs that execute differently. You will use recent libraries from machine learning [2] to learn over big amount of open source code.

  1. The Multiple Facets of Software Diversity: Recent Developments in Year 2000 and Beyond

  2. https://github.com/openai/gpt-2

Topics in Commit Analysis (Coming)

Automatic Identification Bug-Fix Commits

Supervisors: Matias Martinez, Martin Monperrus

Description: In many situations, it is important to classify the commits according to some labels. For instance, one want to automatically identify security related commits, or bug-fixing commits. The student will work on an automatic commit classifier. The work will be done in the context of the Coming platform.

  1. Mining Software Repositories for Adaptive Change Commits Using Machine Learning Techniques (2019)

  2. PatchNet: A Tool for Deep Patch Classification (2019)

  3. https://github.com/SpoonLabs/coming

Automatic Labeling of Commits to Identify Fix Patterns

Supervisors: Matias Martinez, Martin Monperrus

Description: In order to do data-driven program repair, one need to have ground truth labels about the fix patterns used in a commit. There are very few tools doing so. You will use those tools to characterize datasets, starting with the CodRep dataset, in order to have a taxonomy of fixes in Sequencer. You will extend PPD and other tools written in Java for commit analysis. The work will be done in the context of the Coming platform.

  1. Towards an automated approach for bug fix pattern detection

  2. https://github.com/SpoonLabs/coming

Data science of large data from continuous integration

Supervisors: Martin Monperrus

Description: Travis CI automatically handles thousands of builds every day to, amongst other things, provide valuable feedback to thousands of open-source developers. Durieux et al. [1] have collected an ultra large benchmark of Travis CI jobs: it contains 35 793 144 jobs from 272 917 different GitHub projects. This poses problems to scale requests for CI analysis. The student will work on using scalable data-science techniques to analyze this dataset.

  1. Interviewing the Most Successful Bot on GitHub: Dr Travis CI on 35+ Million of its Jobs

  2. https://zenodo.org/record/2560966

Topics in Software Testing

Automatic Prioritization of Mutants in Mutation Testing

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: One of the problems with mutation testing is that the developers are overwhelmed by the number of mutants to kill with new tests. One way to approach this problem is to view it as a recommendation problem based on the dependency graph. The student will design, implement and evaluate a novel technique for automatically prioritizing mutants to be killed in Java software.

  1. A Comprehensive Study of Pseudo-tested Methods

Automatic Debloat of Test Suites

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Today, test suites are really big. Not all tests are equally effective. Some tests are useless, some tests overlap. Test code has to be maintained and having too much test code can be considered as technical debt. The student will design, implement and evaluate a novel technique for automatically debloating test suites in Java in order to increase test effectiveness.

  1. Software bloat analysis: finding, removing, and preventing performance problems in modern large-scale object-oriented applications

Automatic Repair of Flaky Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Flaky tests are tests that fail in an non-determistic way, and it is is a big problem in industry. Following the automatic repair philosophy, one can automatically repair a some of them by improving sandboxing or virtualizing time. The student will design, implement and evaluation a prototype system for automatic repair of flaky tests.

  1. An empirical analysis of flaky tests (2014)

  2. Automatic Software Repair: a Bibliography (CSUR 17)

  3. iFixFlakies: A Framework for Automatically Fixing Order-Dependent Flaky Tests

Automatic Repair of Rotten Green Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Rotten green tests provide a false sense of security, and it is is a big problem in practice. Following the automatic repair philosophy, one can automatically repair such tests by refactoring the test case. The student will design, implement and evaluation a prototype system for automatic repair of rotten green tests.

  1. Rotten Green Tests (2019)

  2. Intent-Preserving Test Repair

  3. Automatic Software Repair: a Bibliography (CSUR 17)

Automatic Renaming of Test Variables to Improve Maintainability

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: In test code, it is very important to have good variables names, so that the test intention is clear, and so that the test is maintainable. Recent research has shown that we can use machine learning to predict good names in code. The student will work to apply the state-of-the-art technique in variable renaming to test code in C++ or Java. The work will be related to the EU H2020 project STAMP.

  1. code2vec: Learning Distributed Representations of Code (2018), implementation at https://github.com/tech-srl/code2vec.

  2. Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts (2018)

Topics in Code Transformation

Semantic Tree based Resolution for Git Merge Conflicts with Spoon

In modern pull-request based development, merge conflicts are a common problem. Today, both client and server Git software still rely on basic line-based merge resolution. The student will design, implement and eavluate a proper AST-based resolution system for Java based on the Spoon analysis and transformation library.

  1. Enhancing precision of structured merge by proper tree matching

  2. Scalable software merging studies with MergAnser

  3. Predicting Merge Conflicts in Collaborative Software Development

Neural decompilation for JVM bytecode

A decompiler takes compiled code (e.g. x86 code) and produces source code. Decompilation is an essential step for program comprehension, security analyses, etc. However, it is challenging to write an accurate decompiler (that can retrieve the source code that actually corresponds to the compiled code) and the implementation of decompilers currently relies on the careful, manual design of decompilation rules. Some recent works [1,2,3] have proposed to use machine learning in order to train a decompiler. These works successfully applied this concept to decompile from binary to C source code. In this work, you will study the topic of decompiler learning for Java bytecode.

  1. A Neural-based Program Decompiler

  2. Towards Neural Decompilation

  3. Using recurrent neural networks for decompilation

Binary Retargeting with Neural Machine Translation

Binary retargeting consists of porting binary code from one platform to another, eg from x89 to arm. Binary retargeting is important for porting old code and maximizing interoperability. In the sister field of decompilation, recent progress has been made with neural machine translation [1,2,3]. In this work, you will explore the use of neural machine translation for binary retargeting.

  1. A Neural-based Program Decompiler

  2. Towards Neural Decompilation

  3. Using recurrent neural networks for decompilation

Declarative Code Transformation with the Semantic Patch Language in Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis [1]. Declarative transformations are easier to specify, understand and maintain. In that realm, the "Semantic Patch Language (SmPL)" is the state-of-the-art [2]. The student will implement SmPL for Java. The interpretation engine will be made in the Spoon library [1].

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

  2. SmPL: A Domain-Specific Language for Specifying Collateral Evolutions in Linux Device Drivers

Metamorphic code transformation for Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis. It can be used as source code transformation or binary code transformation. For the programmer, it is much easier to write a transformation at the source code level, because she is familiar with the language constructs. However, for applicability, it is better to be able to apply the transformations at the binary level. The student will design a transformation system for Java such that the transformations are applicable on source code or binary code interchangeably. This will be done in the context of Java, which means that the transformation system will be able to work both on Java source code and on JVM bytecode. The student will study different options, incl: 1) compiling a source code transformation in Spoon to a binary code transformation in ASM/Javassist 2) applying transformation after decompilation.

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

Tagged as: