Research topics for a thesis or an internship

by Martin Monperrus

Under my supervision, you can do cool research in software technology, here are our current hot topics.

Are you a KTH student? See Master's thesis / Bachelor's thesis guidelines and contact me by email

Are you a brilliant international student? Contact me by email


Category Code Transformation
     Source Code
               Delta-debugging for debloating Java source code using Spoon
               Advanced constructs in the Semantic Patch Language for Spoon
               Diverse Double Compilation for Bitcoin
     Binary Code
               Automated Software Diversity via Multi-compilation
               Patch Presence Assessment on Binary Code
               Decompilation for WebAssembly/WASM using Datalog
Category Program Repair
     Machine Learning for Program Repair
               Addressing Catastrophing Forgetting in Patch Generation
               Execution Trace Analysis for Patch Generation
               Automatic transfer of formatting with machine learning
     Program Analysis for Automatic Repair
               Automatic repair of unreproducible builds (Repairnator)
               Automatic ranking of pull-requests for merging or rejecting
     Generate-and-validate program repair
               Automatic Program Repair of Bears with Repair Templates in Astor
               Explainable repair templates for Astor
               Automatic Repair of OCRed programs
Category Program Hardening
     Chaos Engineering
               Maximum Likelihood Perturbations for Chaos Engineering
               Chaos Engineering for Microservices with Resilience4J
     Randomization & Cyber-deception
               Machine learning to synthesize faux patches
               Self-healing capabilities for Deployed WebAssembly
               Preventing cache attacks on AES with blackbox randomization
Category Software Testing
               Collection and Analysis of Code Coverage in Production
               Automatic Identification of Pseudo-tested Conditions
               Automatic Debloat of Test Suites
               Automatic Repair of Rotten Green Tests

Category Code Transformation

Source Code

See also topics on the Spoon repository: https://github.com/INRIA/spoon/labels/mastersthesis

Delta-debugging for debloating Java source code using Spoon

Supervisor: César Soto Valero, Martin Monperrus, KTH Royal Institute of Technology

Description: The problem of removing unnecessary code (aka software debloat) is relevant and challenging [1]. Hierarchical Delta Debugging (HDD) is a useful technique for minimizing all failure-inducing inputs for more effective debugging [2]. HDD can be used to improve the debloating process [3]. It allows determining the set of instructions in a program that, after execution, satisfies a set of user-provided outputs. Yet, there is currently no tool that to debloat Java code based on HDD techniques. The student will design, implement, and evaluate an HDD-based approach to debloat the source code of Java programs using the Spoon library.

  1. Living review of software debloat papers.

  2. Misherghi, Ghassan, and Zhendong Su. “HDD: hierarchical delta debugging.” Proceedings of the 28th international conference on Software engineering. 2006.

  3. Heo, Kihong, et al. “Effective program debloating via reinforcement learning.” Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 2018.

Advanced constructs in the Semantic Patch Language for Spoon

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: Code transformation is a powerful tool in dynamic software analysis [1]. Declarative transformations are easier to specify, understand and maintain. In that realm, the "Semantic Patch Language (SmPL)" is the state-of-the-art [3,4]. The student will extend SmPL in the Spoon library [1,2], starting with 1) Loop support 2) Matching on "for" elements specifically. 3) Matching on type arguments e.g <String>. 4) Matching and transformation on "try" elements.

  1. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code

  2. https://github.com/INRIA/spoon/tree/master/spoon-smpl

  3. Semantic Patches for Java Program Transformation

  4. Design and Implementation of Semantic Patch Support for the Spoon Java Transformation Engine

Diverse Double Compilation for Bitcoin

Supervisor: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Description: The problem of deceptive compilers introducing malicious code is relevant and hard [1,2]. One solution for this is to use doule diverse compilation to mitigate the problem [3] Yet it is hard to change the compilation pipeline of real software. You will design, implement and evaluate diverse double compilation for bitcoin-core.

  1. Reflection on Trusting trust 1983

  2. Defending Against Compiler-Based Backdoors

  3. Countering trusting trust through diverse double-compiling

Binary Code

Automated Software Diversity via Multi-compilation

Supervisor: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Description: The problem of deceptive compilers introducing malicious code is relevant and hard [1,2]. One solution for this is to use software diversity [3]. For instance, one can compile a C program with both GCC and CLANG. You will design, implement and evaluate a multi-compiler scheme for C.

  1. Reflection on Trusting trust 1983

  2. Defending Against Compiler-Based Backdoors Java Decompiler Diversity and its Application to Meta-decompilation

Patch Presence Assessment on Binary Code

Supervisors: Martin Monperrus

Description: It is essential to know whether a specific binary contains all required vulnerability fixes. For: we need to components: 1) extracting signatures of vulnerability fixes 2) checking for the presence of those fixes in a binary. The student will work on those components in the context of Java code.

  1. Patch Based Vulnerability Matching for Binary Programs

  2. BScout: Direct Whole Patch Presence Test for Java Executables

Decompilation for WebAssembly/WASM using Datalog

A decompiler takes compiled binary code and produces textual source code. Decompilation is an essential step for program comprehension, security analyses, etc. However, it is challenging to write an accurate decompiler (that can always reconstruct the source code that actually corresponds to the compiled code) and the implementation of decompilers currently relies on the careful, manual design of decompilation rules. Recently, researchers have proposed to use Datalog for decompilation [1]. In this work, you will will design and evaluate a decompiler for WebAssembly/WASM based on Datalog.

  1. Datalog disassembly

  2. Java Decompiler Diversity and its Application to Meta-decompilation

Category Program Repair

Machine Learning for Program Repair

You would commit in project https://github.com/KTH/sequencer.

Addressing Catastrophing Forgetting in Patch Generation

Supervision: Martin Monperrus (KTH)

Description: Recent approaches to automatic repair use machine learning on source code [1,2]. Yet, it has been shown that when done in an online context, ML-based repair systems tend to "forget" how to fix programs [3]. This is problem called catastrophic forgetting, which occurs when newly learned knowledge interferes with capabilities previously learned by the model [4]. You will study in depth the reasons and the mitigations for catastrophic forgetting in automatic patch generation.

  1. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  2. A survey of machine learning for big code and naturalness

  3. R-Hero: A Software Repair Bot based on Continual Learning

  4. Continual lifelong learning with neural networks: A review

Execution Trace Analysis for Patch Generation

Supervision: Zimin Chen, Martin Monperrus (KTH)

Description: To reason about programs one can use source code or/and dynamic traces. Most works in machine learning on programs only use code [1]. However, the usage of traces is a very promising research direction [2,3]. You will invent a new patch generation systems, based on trace embeddings. The work will be done on top of recent works in the field [2,3], for generating patches in Java.

  1. A survey of machine learning for big code and naturalness

  2. Learning Blended, Precise Semantic Program Embeddings (2019)

  3. Learning to Encode and Classify Test Executions (2020)

Automatic transfer of formatting with machine learning

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: It is common practice to use and enforce a certain coding style in software projects. This can become a nightmare when one copies files from one project to another, where the project use different conventions. For this, there is a need to be able to transfer the style one project to files coming from potentially anywhere with any style. It may be possible to use machine learning to perform the transfer [1,2] The student will devise, implement and evaluate an approach to automatically transfer coding style. The student will perform the experiments a scientific computing grid.

  1. Learning Natural Coding Conventions (https://github.com/mast-group/naturalize)

  2. Towards a Universal Code Formatter through Machine Learning (https://github.com/antlr/codebuff)

Program Analysis for Automatic Repair

You would commit in project https://github.com/eclipse/repairnator.

Automatic repair of unreproducible builds (Repairnator)

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: Achieving reproducible builds is essential for achieving a secure software supply chain. In particular, one must be sure that no malicious actor has injected malware during the build process. The Debian linux distribution has deployed this in the large. However, when a project build process is no reproducible or becomes unstable again, it can be really hard to find the root cause and fix it. You will design, implement and evaluate a system that automatically suggests repair actions for unreproducible build processes.

  1. Automated Localization for Unreproducible Builds (2018)

  2. Debian Reproducible Builds

Automatic ranking of pull-requests for merging or rejecting

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: On Github, certain projects have too many pull requests and the maintainers need a way to order the pull requests. The ranking could be according to their maturity, importance or likelihood to be merged or rejected. You will devise, implement and evaluate an approach for ranking pull-requests based the history of merged and rejected pull-requests.

  1. Wait for it: determinants of pull request evaluation latency on GitHub

  2. Improving the pull requests review process using learning-to-rank algorithms

Generate-and-validate program repair

Automatic Program Repair of Bears with Repair Templates in Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology)

Description: In automatic program repair, template-based repair is an effective way to reduce programs with little overfitting [1]. The student will implement template-based repair in Astor [3], based on the consolidated list by Liu et al [1]. The student will design design and perform a large scale experiment of template-based repair on the BEARS benchmark [2], with quantitative and qualitative analysis.

  1. TBar: Revisiting Template-based Automated Program Repair

  2. Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies

  3. https://github.com/SpoonLabs/astor

Explainable repair templates for Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In practice, template-based repair is promising, because they are very easy to explain to the developer and they also provide reduced overfitting. The student will design and implement a full-fledged template based system for Astor, based on the latest literature on repair templates. A large scale evaluation will be done on the novel repair benchmarks Bugs.jar and BEARS. we can play with the vocabulary to have pure language based templates and project-specific ones

  1. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  2. Revisiting template-based automated program repair

  3. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

  4. https://github.com/SpoonLabs/astor

Automatic Repair of OCRed programs

Supervisor: Martin Monperrus (KTH Royal Institute of Technology)

Description: In difficult situations, programs have to be transferred with other means than electronic channels. For instance, in the 90ies, the code of PGP had to be sent outside the USA as a printed book [1]. Yet, performing OCR on programs is a challenge, because a single incorrectly recognized character can make the program non compilable or non executable. For this, automatic repair of OCR errors can be used [2] or one can encode the program differently [3]. The goal of this thesis is to push the state of the art of automatic repair of OCRed programs.

  1. Scanning the source code

  2. Tools for Publishing Source Code via OCR

  3. How to get perfect OCR of digital data with 100% accuracy?

Category Program Hardening

See also https://github.com/KTH/royal-chaos.

Chaos Engineering

Maximum Likelihood Perturbations for Chaos Engineering

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. Our group is doing research in this field, in the context of Java applications [2]. Today, perturbation models are naive: they are either random or based on enumeration. You will study, design and implement a chaos engineering algorithm where each perturbation is chosen according to a utility function in order to maximize the knowledge gained from each perturbation.

  1. Chaos Engineering (the book)

  2. A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Chaos Engineering for Microservices with Resilience4J

Supervision: Long Zhang, Martin Monperrus, KTH Royal Institute of Technology

Description: Chaos Engineering [1] is the discipline of verifying resilience capabilities of software systems in production. Resilience4J [2] is a notable middleware for microservices and DevOps orginally done by Netflix. You will study, design and implement a chaos engineering framework for Resilience4J.

  1. Chaos Engineering (the book)

  2. https://github.com/resilience4j/resilience4j

  3. A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Randomization & Cyber-deception

Machine learning to synthesize faux patches

Supervisor: Benoit Baudry, Martin Monperrus, Pontus Johnson, KTH Royal Institute of Technology

Description: Cyber-deception is the body of knowledge for securing software systems by "denial, deceit, misinformation, camouflage and obfuscation" [1]. Introducing faux patches in software [2] is a way to prevent attackers from easily reverse engineer legitimate vulnerability patches. The student will study a way to automatically synthesize faux patches with machine learning. The work will be based on the latest progress made at KTH on sequence to sequence learning for vulnerabilities [3].

  1. Demystifying Deception Technology:A Survey

  2. Ghost patches: Fake patches for fake vulnerabilities

  3. Using Sequence-to-Sequence Learning for Repairing C Vulnerabilities

Self-healing capabilities for Deployed WebAssembly

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: Over the last few years, the complexity of web applications has extraordinarily increased and WebAssembly has been introduced to overcome performance issues with Javascript. You will design and implement a system that provides self-healing capabilities for web applications with WebAssembly code. The self-healing techniques will be inspired from [1].

  1. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

  2. Assure: automatic software self-healing using rescue points

  3. From hack to elaborate technique—a survey on binary rewriting

Preventing cache attacks on AES with blackbox randomization

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: The goal of this thesis is to study counter-measures to cache attacks on AES. A cache attack consists of inferring a private key by means of cache observation [1]. Black-box randomization consists of identifying and injecting randomization points in software, without any knowledge of the application domain and implementation choices. The goal of this thesis is to study the usage of black-box randomization for countering cache attacks. The student will devise and perform a scientific experiment in this context. She/he will read the literature, implement the required software for supporting the experiment, design the inclusion criteria for subjects and run the experiment on a scientific computing grid.

  1. Efficient Cache Attacks on AES, and Countermeasures

  2. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

Category Software Testing

Collection and Analysis of Code Coverage in Production

Supervisors: Martin Monperrus (KTH)

Description: Code coverage usually relates to test code. Production code coverage is the coverage over real interactions made by users in production. Obtaining and analysing production code coverage enables to identify useless code as well as relevant test data and values. It enables testers and developers to better align the test intentions with what matters for users. The student will compare and analyze techniques for automatically collecting code coverage in production for Java software.

  1. Code Pulse: Real-time code coverage for penetration testing activities

  2. Measuring production code coverage with JaCoCo

  3. Perpetual testing

Automatic Identification of Pseudo-tested Conditions

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: One of the problems with coverage is that it does not detect unspecified code [1]. There is the same problem with conditions, some conditions are well covered according to edge coverage yet the code can actually take any branch without failing. The student will design, implement and evaluate a novel technique for automatically identifying pseudo-tested conditions in Java software.

  1. A Comprehensive Study of Pseudo-tested Methods

  2. https://github.com/STAMP-project/pitest-descartes/

Automatic Debloat of Test Suites

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Today, test suites are really big. Not all tests are equally effective, some tests are useless, some tests overlap. Test code has to be maintained and having too much test code can be considered as technical debt. The student will design, implement and evaluate a novel technique for automatically debloating test suites in Java in order to increase test effectiveness.

  1. An empirical study of junit test-suite reduction

  2. Balancing trade-offs in test-suite reduction (2014)

Automatic Repair of Rotten Green Tests

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: Rotten green tests provide a false sense of security, and it is is a big problem in practice. Following the automatic repair philosophy, one can automatically repair such tests by refactoring the test case. The student will design, implement and evaluation a prototype system for automatic repair of rotten green tests.

  1. RTj: a Java framework for detecting and refactoring rotten green test cases (2019) Github: https://github.com/UPHF/RTj

  2. Intent-Preserving Test Repair

  3. Automatic Software Repair: a Bibliography (CSUR 17)

Tagged as: