Research topics for a thesis or an internship

by Martin Monperrus

Are you a KTH student looking for a fun and challenging topic in software engineering? Do you your See also Master's thesis / Bachelor's thesis under my supervision; or register to "Advanced Individual Course in Computer Science" with me as supervisor; or apply to be a research assistant (aka research amanuens), which is a 20% research job, where you get paid by KTH.

Are you a brilliant international student looking for an internship in a world-class research lab (remote internship included)?

Contact me by email if you want to join my group –Martin


     Topics in Program Repair
          Artificial Intelligence for Repair
               A comparison of self-healing parsers and machine learning for repairing syntax errors
               Study of the importance of pre-training for machine learning on code
               A Large-scale study of patch embedding (collaboration with Univ. Wisconsin-Madison)
               Automatic transfer of formatting with machine learning
               Sequence-to-sequence machine learning for automatic program repair
          Contributions to the ASTOR program repair framework
               Learning of variable relationships and ingredient adaptation for automatic program repair
               Explainable repair templates for Astor
               Emergency recovering of buildability on continuous integration.
          Artificial Software Developer on Github (Repairnator)
               Automated Program Repair for C#
               An Artificial Bug Fixing Bot for Python.
               Automated stewardship of open-source projects.
               Automated understanding and usage of software development emoticons.
               Comparing execution traces to identify buggy inputs.
               Automatic Repair of SonarQube Static Warnings.
     Topics in Program Hardening & Chaos Engineering
               Side-channel attack detection in Java.
               Automatic Repair of Broken Websites due to Privacy-enhancing Plugins
               Preventing algorithmic DOS attacks with blackbox randomization.
     Topics in Software Testing
               Automatic Debloat of Test Suites
               Automatic Repair of Flaky Tests
               Automatic Repair of Rotten Green Tests
               Automatic Renaming of Test Variables to Improve Maintainability
     Topics in Code Transformation
               Semantic Tree based Resolution for Git Merge Conflicts with Spoon
               Neural decompilation for JVM bytecode
               Declarative Code Transformation with the Semantic Patch Language in Java
               Metamorphic code transformation for Java

Topics in Program Repair

Artificial Intelligence for Repair

A comparison of self-healing parsers and machine learning for repairing syntax errors

Supervisors: Martin Monperrus (KTH)

Syntax errors happen all the times and are known to be sometimes are to locate and fix. To solve this problem, two families of approaches have been proposed: one based on self-recovering parsers [1], the other one based on machine learning [2]. The student will perform a large scale, systematic evaluation to compare them in a scientific manner on a common dataset.

  1. Reducing Cascading Parsing Errors Through Fast Error Recovery github

  2. Syntax and sensibility: Using language models to detect and correct syntax errors code

Study of the importance of pre-training for machine learning on code

Supervisors: Martin Monperrus (KTH) / Hugo Mougard (source{d})

Circa 2014, the field of Computer Vision saw its first publications of efficient pre-trainings and transfer learning (https://en.wikipedia.org/wiki/Transferlearning). In natural language processing, a big result was published in 2018 with performant pre-trainings on generic tasks [2] (such as word or sentence completion) allowing models to achieve state of the art with minor changes to the base model. Today, there is a growing body of research on machine learning on code [3], with a variety of models and embeddings used to perform a variety of goals (code completion, program repair, renaming, etc;). Yet, there is no research on the importance and feasibility of pretraining on those machine learning on code tasks. The student will design, implement and perform an experiment to study the importance of pre-training for machine learning on code. The student will have access to the full code analysis stack of sourced (link) to run the experiments. Weekly progress reports with the sourced engineers will be organized.

  1. Very Deep Convolutional Networks for Large-Scale Image Recognition

  2. Deep contextualized word representations

  3. A survey of machine learning for big code and naturalness

A Large-scale study of patch embedding (collaboration with Univ. Wisconsin-Madison)

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

A lot of automatic bug fixing generation techniques rely studying past patches. Recent work has proposed to embed source changes in a real-valued vectorial space [1]. The student will implement and extend this embedding technique, in order to make it work on Java code and large patches [2]. The student will perform the experiment by running it on a scientific computing grid.

  1. Learning to represent edits (2018)

  2. An Empirical Study on Real Bug Fixes (2015).

Automatic transfer of formatting with machine learning

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: It is common practice to use and enforce a certain coding style in software projects. This can become a nightmare when one copies files from one project to another, where the project use different conventions. For this, there is a need to be able to transfer the style one project to files coming from potentially anywhere with any style. It may be possible to use machine learning to perform the transfer [1,2] The student will devise, implement and evaluate an approach to automatically transfer coding style. The student will perform the experiments a scientific computing grid.

  1. Learning Natural Coding Conventions (https://github.com/mast-group/naturalize)

  2. Towards a Universal Code Formatter through Machine Learning (https://github.com/antlr/codebuff)

Sequence-to-sequence machine learning for automatic program repair

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

A lot of automatic bug fixing generation techniques rely on slightly modifying the existing code. The student will devise and evaluate a new repair algorithm that will learn from past diffs, using sequence-to-sequence learning. The planned methodology is as follows: 1) set up a training and evaluation dataset based on diffs 2) devise, implement and assess a new repair algorithm based on this data. The student will perform the experiment by running it on a scientific computing grid.

  1. An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

  2. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  3. Benchmark of single-line bugs

Contributions to the ASTOR program repair framework

Repository: https://github.com/SpoonLabs/astor

Learning of variable relationships and ingredient adaptation for automatic program repair

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: Most patches don't introduce new variables, they only reuse existing variables and method calls. The student will set up and perform an experiment to use deep learning for mining variables relationships so that snippets can be fitted in the current repair context. The planned methodology is as follows: 1) extract a dataset of variable relations based on existing code 2) apply deep-learning on this data and analyze the results 3) devise, implement and assess a extension of Nopol/Astor to use the mined informatio in particular to adapt program repair ingredients. The student will perform the experiment by running it on a scientific computing grid.

  1. ASTOR: A Program Repair Library for Java

  2. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities

  3. SmartPaste: Learning to Adapt Source Code

Explainable repair templates for Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In practice, template-based repair is promising, because they are very easy to explain to the developer and they also provide reduced overfitting. The student will design and implement a full-fledged template based system for Astor, based on the latest literature on repair templates. A large scale evaluation will be done on the novel repair benchmarks Bugs.jar and BEARS.

  1. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  2. Revisiting template-based automated program repair

  3. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

Emergency recovering of buildability on continuous integration.

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In very active software projects, it is important to always keep a passing build on master. When the master branch fails, this is potentially a blocking factor for big teams. Therefore, there is a need for a technique to recover a passing build status on master in emergency. The goal of this thesis is to study, design and implement such a technique and to evaluate it on large-scale open-source projects.

  1. Oops, my tests broke the build: An analysis of travis ci builds with github

  2. Who broke the build?: Automatically identifying changes that induce test failures in continuous integration at google scale

  3. Keeping Master Green at Scale

Artificial Software Developer on Github (Repairnator)

Example of related work: https://medium.com/@martin.monperrus/human-competitive-patches-in-automatic-program-repair-with-repairnator-359042e00f6a.

Automated Program Repair for C#

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: In industry, C# is a very popular programming language for enterprise applications. In order to demonstrate our research to our industrial partners, the goal of this work is to implement a first prototype of program repair for C#. The student will devise, implement and evaluate an automatic repair system for C#.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

An Artificial Bug Fixing Bot for Python.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: On Travis, the #1 language is Python (millions of builds, 4x more than Java). In order to increase the impact of Repairnator, the goal of this work is to implement a first prototype of Repairnator in Python. The student will devise, implement and evaluate an automatic repair system for Python and Travis CI.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

Automated stewardship of open-source projects.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: On Github, certain projects are very popular but the maintainers have not enough bandwidth to keep up the pace of pull requests. Those projects literally die under too many pull requests. For those projects, the maintainers need a robot to help them merge pull requests. The student will devise, implement and evaluate an automated steward for software development projects. The steward will try to take over dead yet popular software projects.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

Automated understanding and usage of software development emoticons.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Repairnator is a software development bot on Github [1,2]. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds to synthesize a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. A good Github developer is able to express her intentions and feelings with the appropriate usage of emoticons. Repairnator should be able to do the same: to decorate its pull-requests and comments with emoticons. The student will use the whole Github data to train a predictor of emoticons for software development.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

Comparing execution traces to identify buggy inputs.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: When one uses test generation on a buggy program, one does not know whether the generated input or scenario is in the buggy input domain or in the correct input domain [1]. Assuming one has an execution trace of a buggy input, the idea is to estimate whether a new input is buggy or correct, by comparing two executions [2]. The goal of this thesis is to study, design and implement a machine-learning based system for measuring the likelihood of an execution trace to be buggy.

  1. Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System (EMSE 17)

  2. Identifying Patch Correctness in Test-Based Automatic Program Repair (ICSE 18)

Automatic Repair of SonarQube Static Warnings.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: There exists several systems for statically identifying problems (FindBugs, PMD, SpotBugs, SonarQube). The student will work on a system to automatically repair those issues, focusing on SonarQube warnings. The repair algorithms will be code transformations based on antiunification, written using the Spoon transformation library for Java.

  1. Are Static Analysis Violations Really Fixed? A Closer Look at Realistic Usage of SonarQube. Dataset for OSS organizations

  2. Getafix: Learning to Fix Bugs Automatically

Topics in Program Hardening & Chaos Engineering

Side-channel attack detection in Java.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: The goal of this thesis is to study side-channel attacks in Java [1]. The student will devise and perform a scientific experiment in this context based on the work of Nilizadeh et al. [1]. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. DifFuzz: Differential Fuzzing for Side-Channel Analysis

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

Automatic Repair of Broken Websites due to Privacy-enhancing Plugins

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Users can improve their online privacy by installing privacy-enhancing plugins, such as uBlock. However, those plugins break important functionality of some websites [1]. The student will build on the ideas on BikiniProxy [1] to provide automatic repair of broken websites due privacy enhancing technology. The student will design and implement the system (eg "uBlock-repair") either as a browser plugin or as proxy.

  1. A comparison of web privacy protection techniques

  2. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

  3. Automatic Visual Verification of Layout Failures in Responsively Designed Web Pages.

Preventing algorithmic DOS attacks with blackbox randomization.

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: The goal of this thesis is to study counter-measures to algorithmic denial of service attacks. An algorithmic DOS consists of a input specifically designed by the attacker to trigger the worst case execution of a program [1]. Black-box randomization consists of identifying and injecting randomization points in software, without any knowledge of the application domain and implementation choices. The goal of this thesis is to study the usage of black-box randomization for countering algorithmic DOS attacks. The student will devise and perform a scientific experiment in this context. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. Denial of Service via Algorithmic Complexity Attacks

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

  3. Slowfuzz: Automated domain-independent detection of algorithmic complexity vulnerabilities

Topics in Software Testing

Automatic Debloat of Test Suites

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Today, test suites are really big. Not all tests are equally effective. Some tests are useless, some tests overlap. Test code has to be maintained and having too much test code can be considered as technical debt. The student will design, implement and evaluate a novel technique for automatically debloating test suites in Java in order to increase test effectiveness.

  1. Software bloat analysis: finding, removing, and preventing performance problems in modern large-scale object-oriented applications

Automatic Repair of Flaky Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Flaky tests are tests that fail in an non-determistic way, and it is is a big problem in industry. Following the automatic repair philosophy, one can automatically repair a some of them by improving sandboxing or virtualizing time. The student will design, implement and evaluation a prototype system for automatic repair of flaky tests.

  1. An empirical analysis of flaky tests (2014)

  2. Automatic Software Repair: a Bibliography (CSUR 17)

  3. iDFlakies - A Framework for Detecting and Partially Classifying Flaky Tests.

Automatic Repair of Rotten Green Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Rotten green tests provide a false sense of security, and it is is a big problem in practice. Following the automatic repair philosophy, one can automatically repair such tests by refactoring the test case. The student will design, implement and evaluation a prototype system for automatic repair of rotten green tests.

  1. Rotten Green Tests (2019)

  2. Intent-Preserving Test Repair

  3. Automatic Software Repair: a Bibliography (CSUR 17)

Automatic Renaming of Test Variables to Improve Maintainability

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: In test code, it is very important to have good variables names, so that the test intention is clear, and so that the test is maintainable. Recent research has shown that we can use machine learning to predict good names in code. The student will work to apply the state-of-the-art technique in variable renaming to test code in C++ or Java. The work will be related to the EU H2020 project STAMP.

  1. code2vec: Learning Distributed Representations of Code (2018), implementation at https://github.com/tech-srl/code2vec.

  2. Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts (2018)

Topics in Code Transformation

Semantic Tree based Resolution for Git Merge Conflicts with Spoon

In modern pull-request based development, merge conflicts are a common problem. Today, both client and server Git software still rely on basic line-based merge resolution. The student will design, implement and eavluate a proper AST-based resolution system for Java based on the Spoon analysis and transformation library.

  1. Enhancing precision of structured merge by proper tree matching

  2. Scalable software merging studies with MergAnser

Neural decompilation for JVM bytecode

A decompiler takes compiled code (e.g. x86 code) and produces source code. Decompilation is an essential step for program comprehension, security analyses, etc. However, it is challenging to write an accurate decompiler (that can retrieve the source code that actually corresponds to the compiled code) and the implementation of decompilers currently relies on the careful, manual design of deceompilation rules. Some recent works [1,2] have proposed to use machine learning in order to train a decompiler. These works successfully applied this concept to decompile from binary to C source code. In this work, we wish to remix the exciting concept of decompiler learning for Java bytecode.

  1. Towards Neural Decompilation

  2. Using recurrent neural networks for decompilation

Declarative Code Transformation with the Semantic Patch Language in Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis [1]. Declarative transformations are easier to specify, understand and maintain. In that realm, the "Semantic Patch Language (SmPL)" is the state-of-the-art [2]. The student will implement SmPL for Java. The interpretation engine will be made in the Spoon library [1].

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

  2. SmPL: A Domain-Specific Language for Specifying Collateral Evolutions in Linux Device Drivers

Metamorphic code transformation for Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis. It can be used as source code transformation or binary code transformation. For the programmer, it is much easier to write a transformation at the source code level, because she is familiar with the language constructs. However, for applicability, it is better to be able to apply the transformations at the binary level. The student will design a transformation system for Java such that the transformations are applicable on source code or binary code interchangeably. This will be done in the context of Java, which means that the transformation system will be able to work both on Java source code and on JVM bytecode. The student will study different options, incl: 1) compiling a source code transformation in Spoon to a binary code transformation in ASM/Javassist 2) applying transformation after decompilation.

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

Tagged as: