Research topics for a thesis or an internship

by Martin Monperrus

Are you a KTH student looking for a fun and challenging topic in software engineering? Do your Master's thesis / Bachelor's thesis under my supervision; or register for "Advanced Individual Course in Computer Science" with me as supervisor; or apply to be a research assistant (aka research amanuens), which is a 20% research job, where you get paid by KTH.

Are you a brilliant international student looking for an internship in a world-class research lab (remote internship included)?

Contact me by email if you want to join my group –Martin


Category Code Transformation
               Neural decompilation for WebAssembly/WASM
               Binary Retargeting with Neural Machine Translation
               Declarative Code Transformation with the Semantic Patch Language in Java
               Metamorphic code binary/source transformation for Java
               Automatic grafting of easter eggs
Category Program Repair
     Artificial Intelligence for Repair (Sequencer)
               A comparison of self-healing parsers and machine learning for repairing syntax errors
               Automatic transfer of formatting with machine learning
               Automatic Categorization on AI Patches with AST Analysis
               Sequence-to-sequence machine learning for automatic program repair
     Artificial Software Developer on Github (Repairnator)
               A Jenkins Continuous Integration Plugin for Repairnator
               Just-In-Time Repair of SonarQube Static Warnings
               Automated Program Repair for C#
               Automated stewardship of open-source projects
               Automatic Comparison of Execution Traces of Java Tests
     Generate-and-validate program repair (Astor)
               Deep Learning of variable relationships for automatic program repair
               Automatic Program Repair with Property Based Testing
               Explainable repair templates for Astor
Category Program Hardening
     Chaos Engineering
               Chaos Engineering for Microservices in .NET and C#
               Chaos Engineering in Service Mesh Proxies
               Automatic Workarounds of Broken Websites due to Privacy-enhancing Plugins
               Self-healing capabilities for Python/Flask Applications
     Randomization
               Evaluation of Randomization for WebAssembly with Fuzzing
               Side-channel attack detection in Java
               Preventing algorithmic DOS attacks with blackbox randomization
               Automatic Randomization of Programs with Neural Networks
Category Commit Analysis
               Automatic Clustering of Code Changes with Maximum Density Clustering
               Automatic Identification of Bug-Fix Commits
               Automatic Identification And Backporting of Vulnerability Fixes
               Automatic Labeling of Commits to Identify Fix Patterns
Category Software Testing
               Automatic Identification of Pseudo-tested Conditions
               Automatic Prioritization of Mutants in Mutation Testing
               Automatic Repair of Flaky Tests
               Automatic Repair of Rotten Green Tests
               Automatic Renaming of Test Variables to Improve Maintainability

Category Code Transformation

Neural decompilation for WebAssembly/WASM

A decompiler takes compiled binary code and produces textual source code. Decompilation is an essential step for program comprehension, security analyses, etc. However, it is challenging to write an accurate decompiler (that can retrieve the source code that actually corresponds to the compiled code) and the implementation of decompilers currently relies on the careful, manual design of decompilation rules. Some recent works [1,2,3] have proposed to use machine learning in order to train a decompiler. These works successfully applied this concept to decompile from x86 to C source code. In this work, you will study the topic of decompilation learning for WebAssembly/WASM.

  1. A Neural-based Program Decompiler

  2. Towards Neural Decompilation

  3. Adabot: Fault-Tolerant Java Decompiler

Binary Retargeting with Neural Machine Translation

Binary retargeting consists of porting binary code from one platform to another, e.g. from x86 to ARM. Binary retargeting is important for porting old code and maximizing interoperability. In the sister field of decompilation, recent progress has been made with neural machine translation [1,2,3]. In this work, you will explore the use of neural machine translation for binary retargeting.

  1. A Neural-based Program Decompiler

  2. Towards Neural Decompilation

  3. Using recurrent neural networks for decompilation

Declarative Code Transformation with the Semantic Patch Language in Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis [1]. Declarative transformations are easier to specify, understand and maintain. In that realm, the "Semantic Patch Language (SmPL)" is the state-of-the-art [2]. The student will implement SmPL for Java. The interpretation engine will be made in the Spoon library [1].

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

  2. SmPL: A Domain-Specific Language for Specifying Collateral Evolutions in Linux Device Drivers

Metamorphic code binary/source transformation for Java

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Code transformation is a powerful tool in dynamic software analysis. It can be used as source code transformation or binary code transformation. For the programmer, it is much easier to write a transformation at the source code level, because she is familiar with the language constructs. However, for applicability, it is better to be able to apply the transformations at the binary level. The student will design a transformation system for Java such that the transformations are applicable on source code or binary code interchangeably. This will be done in the context of Java, which means that the transformation system will be able to work both on Java source code and on JVM bytecode. The student will study different options, including: 1) compiling a source code transformation in Spoon to a binary code transformation in ASM/Javassist; 2) applying the transformation after decompilation.

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

Automatic grafting of easter eggs

Supervisor: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Easter eggs in software are unexpected and fun features that are deliberately engineered and hidden in code. Easter eggs bring emotion to a software product and strengthen the user's bond with it. For instance, a Tesla car is full of easter eggs [1]. You will study the cultural phenomenon of software easter eggs and set up a code transformation framework to automatically graft easter eggs in arbitrary software.

  1. New Tesla Easter Egg: Monty Python Squashes v10 Bugs!

Category Program Repair

Artificial Intelligence for Repair (Sequencer)

You would contribute to the project https://github.com/KTH/sequencer.

A comparison of self-healing parsers and machine learning for repairing syntax errors

Supervision: Martin Monperrus (KTH)

Description: Syntax errors and compiler errors happen all the time and are known to be hard to locate and fix. To solve this problem, two families of approaches have been proposed: one based on self-recovering parsers [1], the other one based on machine learning [2]. The student will perform a large scale, systematic evaluation to compare them in a scientific manner on a common dataset.

  1. Reducing Cascading Parsing Errors Through Fast Error Recovery (github)

  2. Syntax and sensibility: Using language models to detect and correct syntax errors (code)

Automatic transfer of formatting with machine learning

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: It is common practice to use and enforce a certain coding style in software projects. This can become a nightmare when one copies files from one project to another, where the projects use different conventions. For this, there is a need to be able to transfer the style of one project to files coming from potentially anywhere with any style. It may be possible to use machine learning to perform the transfer [1,2]. The student will devise, implement and evaluate an approach to automatically transfer coding style. The student will perform the experiments on a scientific computing grid.

  1. Learning Natural Coding Conventions (https://github.com/mast-group/naturalize)

  2. Towards a Universal Code Formatter through Machine Learning (https://github.com/antlr/codebuff)

Automatic Categorization on AI Patches with AST Analysis

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

KTH has invented a new system called Sequencer for producing patches with machine learning [1]. Sequencer learns from past diffs using sequence-to-sequence learning. The student will perform a large scale analysis of the Sequencer patches. The experiment will involve the Gumtree AST diff library and will be done on a scientific computing grid.

  1. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  2. Benchmark of single-line bugs

Sequence-to-sequence machine learning for automatic program repair

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Many automatic bug-fix generation techniques rely on slightly modifying the existing code. The student will devise and evaluate a new repair algorithm that will learn from past diffs, using sequence-to-sequence learning. The planned methodology is as follows: 1) set up a training and evaluation dataset based on diffs; 2) devise, implement and assess a new repair algorithm based on this data. The student will perform the experiment by running it on a scientific computing grid.
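As a rough illustration of step 1, the sketch below extracts (buggy line, fixed line) pairs from two versions of a file with Python's standard difflib module. The function name single_line_pairs and the toy code snippets are hypothetical; an actual dataset would be mined from commit histories and would need far more filtering.

```python
import difflib

def single_line_pairs(before: str, after: str):
    """Return (old_line, new_line) pairs for one-line replacements."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only hunks that replace exactly one line by exactly one line.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((old_lines[i1].strip(), new_lines[j1].strip()))
    return pairs

# Hypothetical buggy and fixed versions of the same method.
BUGGY = """int max(int a, int b) {
    if (a < b) return a;
    return b;
}"""
FIXED = """int max(int a, int b) {
    if (a > b) return a;
    return b;
}"""

pairs = single_line_pairs(BUGGY, FIXED)
for old, new in pairs:
    print(f"source: {old}\ntarget: {new}")
```

Each pair can then serve as one source/target example for a sequence-to-sequence model, with the buggy line as input and the fixed line as expected output.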

  1. An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

  2. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  3. Benchmark of single-line bugs

Artificial Software Developer on Github (Repairnator)

You would contribute to the project https://github.com/eclipse/repairnator.

A Jenkins Continuous Integration Plugin for Repairnator

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: In industry, Jenkins is the #1 CI technology for Java projects. You will design and implement a Jenkins plugin. This plugin will be responsible for listening to Jenkins builds and sending the build errors to a Repairnator server. If a patch is found, it is sent to the developer. This work will involve our industrial partners. Warning: This work involves heavy sysadmin work to operate a server in a DevOps manner.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

Just-In-Time Repair of SonarQube Static Warnings

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: SonarQube is widely used in industry to statically detect bugs and code smells. We have a system for automatically repairing SonarQube warnings. You will do research in the area of just-in-time repair of SonarQube warnings: the idea is to repair the warnings that appear in an ongoing pull request and then to make an automated Github suggestion. You will integrate sonarqube-repair into Repairnator and run a live Repairnator instance to repair dozens of SonarQube warnings live in Travis.

  1. Are Static Analysis Violations Really Fixed? A Closer Look at Realistic Usage of SonarQube. Dataset for OSS organizations

  2. Automatically Generating Fix Suggestions in Response to Static Code Analysis Warnings

Automated Program Repair for C#

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: In industry, C# is a very popular programming language for enterprise applications. In order to demonstrate our research to our industrial partners, the goal of this work is to implement a first prototype of program repair for C#. The student will devise, implement and evaluate an automatic repair system for C# integrated into Repairnator.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

  3. https://github.com/Spirals-Team/repairnator/

Automated stewardship of open-source projects

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: On Github, certain projects are very popular but the maintainers do not have enough bandwidth to keep up with the pace of pull requests. Those projects literally die under too many pull requests. For those projects, the maintainers need a robot to help them merge pull requests. The student will devise, implement and evaluate an automated steward for software development projects. The steward would try to take over dead yet popular software projects.

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. How to Design a Program Repair Bot? Insights from the Repairnator Project

Automatic Comparison of Execution Traces of Java Tests

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: A big problem in automated program repair is the presence of "overfitting patches" which are patches that pass all tests yet are incorrect [1,2]. The goal of this thesis is to study, design and implement a machine-learning based system for measuring the likelihood of an execution trace to be buggy. The work involves collecting a large amount of execution traces.

  1. Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System (EMSE 17)

  2. Identifying Patch Correctness in Test-Based Automatic Program Repair (ICSE 18)

Generate-and-validate program repair (Astor)

Repository: https://github.com/SpoonLabs/astor

Deep Learning of variable relationships for automatic program repair

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: Most patches don't introduce new variables, they only reuse existing variables and method calls. The student will set up and perform an experiment to use deep learning for mining variable relationships so that snippets can be fitted in the current repair context. The planned methodology is as follows: 1) extract a dataset of variable relations based on existing code; 2) apply deep learning on this data and analyze the results; 3) devise, implement and assess an extension of Nopol/Astor to use the mined information, in particular to adapt program repair ingredients. The student will perform the experiment by running it on a scientific computing grid.

  1. ASTOR: A Program Repair Library for Java

  2. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities

  3. SmartPaste: Learning to Adapt Source Code

Automatic Program Repair with Property Based Testing

Supervisor: Martin Monperrus (KTH Royal Institute of Technology)

Description: In automatic program repair, some patches are incorrect because of the incompleteness of the test cases used as specification [1]. One way to overcome this is to have better tests. Property-based testing is one solution to increase completeness of tests [2]. The student will design and perform a novel experiment on using property based tests for automatic program repair, in the context of the Astor repair framework [3].
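To make the idea concrete, here is a minimal sketch in plain Python (it does not use Astor or any property-based testing library). All function names are hypothetical. It shows a patch that satisfies an incomplete example-based test but is rejected once a general property is checked on random inputs.

```python
import random

def patched_abs(x: int) -> int:
    """Hypothetical overfitting patch: passes the example test, wrong in general."""
    if x == -3:
        return 3
    return x

def correct_abs(x: int) -> int:
    """Reference implementation of the intended behaviour."""
    return -x if x < 0 else x

def example_test(f) -> bool:
    # Incomplete specification: only two input/output examples.
    return f(-3) == 3 and f(5) == 5

def property_holds(f, trials: int = 200, seed: int = 0) -> bool:
    """Check the property 'f(x) >= 0 and f(x) == f(-x)' on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        if f(x) < 0 or f(x) != f(-x):
            return False
    return True

for name, f in [("patched_abs", patched_abs), ("correct_abs", correct_abs)]:
    print(name, "example test:", example_test(f), "property:", property_holds(f))
```

In a repair loop, a candidate patch would be kept only if it passes both the original tests and the property checks, which filters out patches that merely memorize the failing example.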

  1. Automatic Software Repair: a Bibliography (2017)

  2. In praise of property-based testing

  3. https://github.com/SpoonLabs/astor

Explainable repair templates for Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In practice, template-based repair is promising, because templates are very easy to explain to the developer and they also reduce overfitting. The student will design and implement a full-fledged template-based system for Astor, based on the latest literature on repair templates. A large scale evaluation will be done on the novel repair benchmarks Bugs.jar and BEARS. We can play with the vocabulary to have pure language-based templates and project-specific ones.

  1. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  2. Revisiting template-based automated program repair

  3. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

Category Program Hardening

Chaos Engineering

Chaos Engineering for Microservices in .NET and C#

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. ChaosMachine [2] is such an approach for microservices implemented in Java. While microservices in the .NET world are conceptually similar, there are a number of technical differences. You will study, design and implement the ChaosMachine for microservices in .NET and C#.

  1. Chaos Engineering (the book)

  2. A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Chaos Engineering in Service Mesh Proxies

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. Durieux et al. [2] have shown that one can effectively do code transformation in user-facing proxies for self-healing. It is also possible to do code transformation in service mesh proxies, in order to increase observability and potentially perturb the system execution. You will study, design and implement the chaos engineering system for proxies and service mesh.

  1. Chaos Engineering (the book)

  2. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

Automatic Workarounds of Broken Websites due to Privacy-enhancing Plugins

Supervision: Martin Monperrus, KTH Royal Institute of Technology

Description: Users can improve their online privacy by installing privacy-enhancing plugins, such as uBlock. However, those plugins break important functionality of some websites [1]. The student will build on the ideas of BikiniProxy [2] to provide automatic repair of websites broken by privacy-enhancing technology. The student will design and implement the system (e.g. "uBlock-repair") either as a browser plugin or as a proxy.

  1. A comparison of web privacy protection techniques

  2. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

  3. Automatic Visual Verification of Layout Failures in Responsively Designed Web Pages

Self-healing capabilities for Python/Flask Applications

Supervisor: Martin Monperrus, KTH Royal Institute of Technology, EECS/TCS

Description: Over the last few years, the complexity of web applications has increased extraordinarily, including with microservice-based architectures. The drawback of this complexity is the growing number of runtime errors. You will design and implement a system that provides self-healing capabilities for Python/Flask web applications. The self-healing techniques will be inspired by [1].

  1. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

  2. DjangoChecker: Applying extended taint tracking and server side parsing for detection of context‐sensitive XSS flaws

Randomization

Evaluation of Randomization for WebAssembly with Fuzzing

Supervision: Javier Arteaga, Martin Monperrus, KTH Royal Institute of Technology

It has been shown that we can use fuzzing to evaluate the effectiveness of randomization [1]. In our group, we are working on randomizers for WebAssembly. You will deploy our randomizers on a benchmark of WebAssembly programs, and hammer them with an appropriate fuzzer (e.g. [2]). This is a heavily technical research topic on top of bleeding-edge WASM technology.

  1. Fuzzification: Anti-Fuzzing Techniques

  2. https://github.com/phayes/sidefuzz

Side-channel attack detection in Java

Supervision: Martin Monperrus, KTH Royal Institute of Technology

Description: The goal of this thesis is to study side-channel attacks in Java [1]. The student will devise and perform a scientific experiment in this context based on the work of Nilizadeh et al. [1]. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. DifFuzz: Differential Fuzzing for Side-Channel Analysis

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

Preventing algorithmic DOS attacks with blackbox randomization

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: The goal of this thesis is to study counter-measures to algorithmic denial of service attacks. An algorithmic DOS consists of an input specifically designed by the attacker to trigger the worst case execution of a program [1]. Black-box randomization consists of identifying and injecting randomization points in software, without any knowledge of the application domain and implementation choices. The goal of this thesis is to study the usage of black-box randomization for countering algorithmic DOS attacks. The student will devise and perform a scientific experiment in this context. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.
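A classic textbook illustration of the principle, given here only as a hedged sketch and not as the method to be developed in the thesis, is pivot selection in quicksort. With a deterministic first-element pivot, an already sorted input triggers the quadratic worst case; turning the pivot choice into a randomization point makes the cost independent of any input the attacker can prepare in advance. All names below are hypothetical, and the comparison counter is a simple cost model.

```python
import random

def quicksort(items, choose_pivot, counter):
    """Sort a list of distinct values; counter[0] accumulates a comparison count."""
    if len(items) <= 1:
        return list(items)
    pivot = items[choose_pivot(len(items))]
    rest = [x for x in items if x != pivot]
    counter[0] += len(rest)  # cost model: one pivot comparison per remaining element
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x > pivot]
    return (quicksort(smaller, choose_pivot, counter)
            + [pivot]
            + quicksort(larger, choose_pivot, counter))

def first_element(n: int) -> int:
    """Deterministic pivot choice: predictable, hence attackable."""
    return 0

def make_random_choice(seed: int):
    """Randomized pivot choice; seeded here only to make the demo reproducible."""
    rng = random.Random(seed)
    return lambda n: rng.randrange(n)

def count_comparisons(items, choose_pivot) -> int:
    counter = [0]
    result = quicksort(items, choose_pivot, counter)
    assert result == sorted(items)
    return counter[0]

# Sorted input: the worst case for the first-element pivot.
ATTACK_INPUT = list(range(200))
deterministic_cost = count_comparisons(ATTACK_INPUT, first_element)
randomized_cost = count_comparisons(ATTACK_INPUT, make_random_choice(seed=42))
print("deterministic pivot:", deterministic_cost)
print("randomized pivot:   ", randomized_cost)
```

The black-box setting of the thesis is harder than this example, because the randomization points have to be found without knowing that the program is a sorting routine.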

  1. Denial of Service via Algorithmic Complexity Attacks

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

  3. Slowfuzz: Automated domain-independent detection of algorithmic complexity vulnerabilities

Automatic Randomization of Programs with Neural Networks

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

The fact that programs are executed mostly deterministically is a problem for security [1]. Automatic randomization is one way of overcoming this problem. In this work, we will explore the use of neural networks, and their ability to generate likely sequences, to synthesize equivalent programs that execute differently. You will use recent libraries from machine learning [2] to learn over large amounts of open source code.

  1. The Multiple Facets of Software Diversity: Recent Developments in Year 2000 and Beyond

  2. https://github.com/openai/gpt-2

Category Commit Analysis

For those topics, the work will be done in the Coming project.

Automatic Clustering of Code Changes with Maximum Density Clustering

Supervisors: He Ye, Martin Monperrus

Studying similar code changes is essential for many software engineering areas, such as bug finding, program repair, program synthesis, etc. The state of the art adapts Agglomerative Hierarchical Clustering and DBSCAN, based on abstract syntax trees (AST), to cluster groups of code changes. In this project, you will explore other machine learning techniques for clustering code changes. You will learn about: (1) the representation of code diffs (Coming); (2) similarity and distance metrics (Minkowski distance, Jaccard similarity coefficient, cosine similarity, Pearson similarity, relative entropy); (3) unsupervised clustering algorithms such as MDCA (Maximum Density Clustering Application) and spectral clustering.
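As a small illustration of two of the listed metrics, the sketch below computes the Jaccard coefficient and the cosine similarity between code changes represented as bags of edit actions. The action labels (such as "UPD BinaryOperator") are made up for the example and are not the actual output format of Coming.

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of features."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical edit actions extracted from three code changes.
change_1 = ["UPD BinaryOperator", "INS If", "INS Return"]
change_2 = ["UPD BinaryOperator", "INS If"]
change_3 = ["DEL Invocation"]

print("jaccard(1,2) =", jaccard(set(change_1), set(change_2)))
print("cosine(1,2)  =", cosine(Counter(change_1), Counter(change_2)))
print("jaccard(1,3) =", jaccard(set(change_1), set(change_3)))
```

A clustering algorithm would consume the pairwise similarities (or the corresponding distances, e.g. one minus the similarity) computed over all changes in the dataset.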

  1. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

  2. Automatic clustering of code changes

  3. https://github.com/SpoonLabs/coming

Automatic Identification of Bug-Fix Commits

Supervisors: Matias Martinez, Martin Monperrus

Description: In many situations, it is important to classify the commits according to some labels. For instance, one wants to automatically identify security-related commits, or bug-fixing commits. The student will work on an automatic commit classifier. The work will be done in the context of the Coming platform.

  1. Mining Software Repositories for Adaptive Change Commits Using Machine Learning Techniques (2019)

  2. PatchNet: A Tool for Deep Patch Classification (2019)

  3. https://github.com/SpoonLabs/coming

Automatic Identification And Backporting of Vulnerability Fixes

Supervisors: Matias Martinez, Martin Monperrus

Description: When a fix is done to remove a vulnerability, it is important to backport the fix to other branches. To do this, we need two components: 1) automatic identification of vulnerability fixes; 2) automatic porting and fixing of porting conflicts. The student will work on those components. The work will be done in the context of the Coming platform.

  1. An automatic method for assessing the versions affected by a vulnerability

  2. When a patch goes bad: Exploring the properties of vulnerability-contributing commits

  3. https://github.com/SpoonLabs/coming

Automatic Labeling of Commits to Identify Fix Patterns

Supervisors: Matias Martinez, Martin Monperrus

Description: In order to do data-driven program repair, one needs to have ground truth labels about the fix patterns used in a commit. There are very few tools doing so. You will use those tools to characterize datasets, starting with the CodRep dataset, in order to have a taxonomy of fixes in Sequencer. You will extend PPD and other tools written in Java for commit analysis. The work will be done in the context of the Coming platform.

  1. Towards an automated approach for bug fix pattern detection

  2. https://github.com/SpoonLabs/coming

  3. Exploring and exploiting the correlations between bug-inducing and bug-fixing commits (2019)

Category Software Testing

Automatic Identification of Pseudo-tested Conditions

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: One of the problems with coverage is that it does not detect unspecified code [1]. The same problem exists with conditions: some conditions are well covered according to edge coverage, yet the code can actually take any branch without failing a test. The student will design, implement and evaluate a novel technique for automatically identifying pseudo-tested conditions in Java software.
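The notion can be sketched on a toy example, written in Python for brevity although the thesis targets Java. The idea shown here is one possible detection strategy, stated as an assumption: force the condition to true and then to false, and report it as pseudo-tested if the test suite passes in both cases. The force parameter and all function names are hypothetical instrumentation.

```python
def apply_discount(price: float, is_member: bool, force=None) -> float:
    """Toy function under test; 'force' overrides the condition outcome."""
    cond = is_member if force is None else force
    if cond:
        price = price * 0.9
    return round(price, 2)

def weak_test_suite(f) -> bool:
    # Covers both branches, but only asserts a weak fact about the result.
    return f(100.0, True) <= 100.0 and f(100.0, False) <= 100.0

def strong_test_suite(f) -> bool:
    # Asserts the exact expected value on each branch.
    return f(100.0, True) == 90.0 and f(100.0, False) == 100.0

def is_pseudo_tested(suite) -> bool:
    """True if the suite passes whichever way the condition is forced."""
    outcomes = []
    for forced in (True, False):
        outcomes.append(
            suite(lambda p, m, forced=forced: apply_discount(p, m, force=forced))
        )
    return all(outcomes)

print("weak suite   ->", is_pseudo_tested(weak_test_suite))
print("strong suite ->", is_pseudo_tested(strong_test_suite))
```

With the weak suite both edges of the condition are covered, yet forcing either branch goes unnoticed; the strong suite detects the forced outcome.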

  1. A Comprehensive Study of Pseudo-tested Methods

  2. https://github.com/STAMP-project/pitest-descartes/

Automatic Prioritization of Mutants in Mutation Testing

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: One of the problems with mutation testing is that the developers are overwhelmed by the number of mutants to kill with new tests. One way to approach this problem is to view it as a recommendation problem based on the dependency graph. The student will design, implement and evaluate a novel technique for automatically prioritizing mutants to be killed in Java software.

  1. A Comprehensive Study of Pseudo-tested Methods

Automatic Repair of Flaky Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Flaky tests are tests that fail in a non-deterministic way, and this is a big problem in industry. Following the automatic repair philosophy, one can automatically repair some of them by improving sandboxing or virtualizing time. The student will design, implement and evaluate a prototype system for automatic repair of flaky tests.
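To illustrate what virtualizing time can mean, the following sketch replaces the real clock with an injectable virtual clock, so that a time-dependent test no longer depends on sleeping or on machine load. The classes SessionCache and FakeClock are hypothetical; an automated repair would have to introduce this kind of indirection into existing tests.

```python
import time

class SessionCache:
    """Toy class whose behaviour depends on time; the clock is injectable."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._created = clock()

    def is_expired(self) -> bool:
        return self._clock() - self._created >= self._ttl

class FakeClock:
    """Virtual clock: time only advances when the test says so."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds

def test_expiry_with_virtual_time() -> bool:
    clock = FakeClock()
    cache = SessionCache(ttl_seconds=30.0, clock=clock)
    fresh = not cache.is_expired()  # no virtual time has elapsed yet
    clock.advance(30.0)             # jump 30 virtual seconds, no real waiting
    return fresh and cache.is_expired()

print("test passed:", test_expiry_with_virtual_time())
```

The flaky version of such a test would typically call a real sleep and compare against wall-clock time, which can fail under load; with the virtual clock the outcome is deterministic.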

  1. An empirical analysis of flaky tests (2014)

  2. Automatic Software Repair: a Bibliography (2017)

  3. iFixFlakies: A Framework for Automatically Fixing Order-Dependent Flaky Tests

Automatic Repair of Rotten Green Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Rotten green tests provide a false sense of security, and this is a big problem in practice. Following the automatic repair philosophy, one can automatically repair such tests by refactoring the test case. The student will design, implement and evaluate a prototype system for automatic repair of rotten green tests.
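A rotten green test passes although it contains an assertion that is never executed. The sketch below shows one such test and a possible refactoring, using a small counter to observe whether assertions actually run. The helper names are hypothetical and the example is in Python only for brevity.

```python
class AssertionCounter:
    """Records how many assertions were actually executed."""

    def __init__(self):
        self.count = 0

    def check(self, condition: bool) -> None:
        self.count += 1
        assert condition

def parse_numbers(text: str):
    """Toy function under test: keeps only the numeric tokens."""
    return [int(tok) for tok in text.split(",") if tok.strip().isdigit()]

def rotten_test(counter: AssertionCounter) -> None:
    # parse_numbers("a,b") returns [], so the loop body never runs.
    for n in parse_numbers("a,b"):
        counter.check(n > 0)

def repaired_test(counter: AssertionCounter) -> None:
    numbers = parse_numbers("1,2")
    counter.check(len(numbers) == 2)  # guards against the loop running zero times
    for n in numbers:
        counter.check(n > 0)

def executed_assertions(test) -> int:
    counter = AssertionCounter()
    test(counter)  # both tests are green: no exception is raised
    return counter.count

print("rotten test executed", executed_assertions(rotten_test), "assertions")
print("repaired test executed", executed_assertions(repaired_test), "assertions")
```

Both tests are green, but only the repaired one exercises its assertions, which is the difference an automated repair would aim to establish.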

  1. Rotten Green Tests (2019)

  2. Intent-Preserving Test Repair

  3. Automatic Software Repair: a Bibliography (CSUR 17)

Automatic Renaming of Test Variables to Improve Maintainability

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: In test code, it is very important to have good variable names, so that the test intention is clear, and so that the test is maintainable. Recent research has shown that we can use machine learning to predict good names in code. The student will work to apply the state-of-the-art technique in variable renaming to test code in C++ or Java. The work will be related to the EU H2020 project STAMP.

  1. code2vec: Learning Distributed Representations of Code (2018), implementation at https://github.com/tech-srl/code2vec.

  2. Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts (2018)
