Research topics for a thesis or an internship

by Martin Monperrus

Under my supervision, you can do cool research in software technology. Here are our current hot topics.

Are you a KTH student? See Master's thesis / Bachelor's thesis guidelines and contact me by email

Are you a brilliant international student? Contact me by email


Category Program Repair
     Machine Learning for Program Repair
          Neural Repair of Compiler Warnings
          Conversational Program Repair Bots
          Addressing Catastrophic Forgetting in Patch Generation
     Code Analysis for program repair
          Automatic Repair of Breaking Dependency Updates
          Automatic Program Repair of Code4Bench
          Software Integrity Verification for Maven Builds
          Explainable repair templates for Astor
Category Software Reliability
     Chaos Engineering
          Do Applications Overreact to EAGAIN Errors?
          Automatic USDT Observability Probe Injection for Java Applications
     Software Testing
          Collection and Analysis of Code Coverage in Production
          Automatic Identification of Pseudo-tested Conditions
Category Blockchain & Crypto
     Software Supply Chain
          Dependency Diversification for Ethereum Clients Using Production Data
          Test Generation for Ethereum Clients Using Production Data
          Behavioral Analysis for Bitcoin Core
          Securing Software Updates via the Ethereum Blockchain
          N-Version Programming for Blockchain Nodes
     Smart Contracts
          Obfuscation for Solidity and EVM Bytecode
          Automated Program Repair for Smart Contracts
          Future Package Managers for Smart Contracts and Solidity

Category Program Repair

Machine Learning for Program Repair

Neural Repair of Compiler Warnings

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: It is a best practice to activate all warnings in a compiler. However, much work is needed to remediate them all. You will do research on machine learning for repairing compiler warnings. You will devise, implement and evaluate an approach based on sequence-to-sequence learning. The compilers considered are open and could be, for example, Rust, Go or Clang.

  1. Break-It-Fix-It: Unsupervised Learning for Program Repair (2021)

  2. The Unexplored Terrain of Compiler Warnings

  3. Master's thesis: Exploring the Usage of Neural Networks for Repairing Static Analysis Warnings

  4. MACER A Modular Framework for Accelerated Compilation Error Repair

Conversational Program Repair Bots

Supervisor: Martin Monperrus, KTH Royal Institute of Technology

Description: Repairnator is a program repair bot [1,2]. To work effectively with developers, development bots must be able to interact with them. We imagine conversational systems for pull request explanation: developers would be able to ask questions about the pull request, and the bot would answer those questions [3]. Such a system can be data-driven, based on the analysis of the millions of similar conversations that have happened in open-source repositories. The system could use a chatbot library such as Rasa [4].

  1. Human-competitive Patches in Automatic Program Repair with Repairnator

  2. https://github.com/eclipse/repairnator/issues/840

  3. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  4. https://github.com/RasaHQ/rasa/

Addressing Catastrophic Forgetting in Patch Generation

Supervision: Martin Monperrus (KTH)

Description: Recent approaches to automatic repair use machine learning on source code [1,2]. Yet, it has been shown that when done in an online context, ML-based repair systems tend to "forget" how to fix programs [3]. This problem is called catastrophic forgetting, which occurs when newly learned knowledge interferes with capabilities previously learned by the model [4]. You will study in depth the causes of and mitigations for catastrophic forgetting in automatic patch generation.

  1. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

  2. A survey of machine learning for big code and naturalness

  3. R-Hero: A Software Repair Bot based on Continual Learning

  4. Continual lifelong learning with neural networks: A review

Code Analysis for program repair

Automatic Repair of Breaking Dependency Updates

Description: This work aims at automatically proposing patches for breaking updates of software libraries. It is a best practice to keep all software dependencies at their latest version. However, some dependency versions are not compatible with the previous version. In this case, automated dependency management (e.g., with Dependabot or Renovate) still involves heavy manual work to adapt the code to the new version of the library. The student will design, implement and evaluate novel program analysis and program synthesis techniques to automatically repair breaking updates.

  1. Automatically repairing dependency-related build breakage

  2. APIfix: output-oriented program synthesis for combating breaking changes in libraries

Automatic Program Repair of Code4Bench

Description: The student will design and perform a large-scale experiment of program repair on the Code4Bench benchmark [1], with quantitative and qualitative analysis [2].

  1. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques

  2. Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts

  3. A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark

Software Integrity Verification for Maven Builds

Supervisor: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Description: Trustable builds are an essential property for secure software supply chains [1]. There is an ongoing effort in some Linux distributions, in particular Debian, to ensure reproducible builds. There is also some work done for web apps [3]. In the Java world, there is a strong need for work on this topic [2]. You will implement and evaluate a solution for the Maven toolchain.

  1. Reproducible Builds: Increasing the Integrity of Software Supply Chains

  2. https://issues.apache.org/jira/browse/MNG-6026

  3. Resource Integrity

Explainable repair templates for Astor

Supervisor: Martin Monperrus (KTH Royal Institute of Technology), Matias Martinez (University of Valenciennes)

Description: In practice, template-based repair is promising, because templates are very easy to explain to the developer and they also reduce overfitting. The student will design and implement a full-fledged template-based system for Astor, based on the latest literature on repair templates. A large-scale evaluation will be done on the novel repair benchmarks Bugs.jar and BEARS. The vocabulary of the templates can be varied to obtain both pure language-based templates and project-specific ones.

  1. Explainable Software Bot Contributions: Case Study of Automated Bug Fixes

  2. Revisiting template-based automated program repair

  3. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

  4. https://github.com/SpoonLabs/astor
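
To make the idea of an explainable repair template concrete, here is a minimal sketch. It is a Python analogue built on the standard `ast` module, not Astor code: Astor works on Java programs. The names `RelaxComparison`, `explanation` and `apply_template` are hypothetical and introduced only for illustration. The template rewrites every strict `<` into `<=` and carries a one-sentence explanation that could be shown to the developer with the patch.

```python
import ast


class RelaxComparison(ast.NodeTransformer):
    """Hypothetical repair template: replace strict '<' by '<='."""

    # Human-readable explanation attached to every patch from this template.
    explanation = "Replaced '<' with '<=' to include the boundary value."

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


def apply_template(source, template):
    """Apply a template to source code and return the candidate patch as text."""
    tree = template.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or later


buggy = "def in_range(i, n):\n    return i < n\n"
candidate = apply_template(buggy, RelaxComparison())
```

A real template-based system would generate many such candidates and keep only those that make the failing tests pass. This sketch stops at candidate generation.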

Category Software Reliability

Chaos Engineering

See also our recent papers.

Do Applications Overreact to EAGAIN Errors?

Description: EAGAIN is an error code that is predefined in the Linux kernel. Usually, if a system call invocation returns the error code EAGAIN, it means that the resource requested by an application is temporarily unavailable [1]. According to the error code's literal meaning, the application should try 'again' later. This thesis project will explore to what extent software applications overreact to EAGAIN errors, for example by crashing directly [2] instead of retrying. For those overreacting applications, the student will apply self-healing techniques such as failure-oblivious computing [3] to investigate whether the resilience of the application is improved.

  1. https://man7.org/linux/man-pages/man3/errno.3.html

  2. Maximizing Error Injection Realism for Chaos Engineering with System Calls

  3. TripleAgent: Monitoring, Perturbation and Failure-obliviousness for Automated Resilience Improvement in Java Applications
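
As an illustration of the "try again later" behavior described above, here is a minimal sketch of a retry wrapper. The helper names `call_with_retry` and `make_flaky_read` are hypothetical, and the flaky operation only simulates a system call that fails with EAGAIN a few times. It is not taken from any of the referenced tools.

```python
import errno
import time


def call_with_retry(operation, max_attempts=5, delay_seconds=0.01):
    """Call `operation`, retrying only when it fails with EAGAIN."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError as error:
            # Any other error, or an exhausted retry budget, is propagated.
            if error.errno != errno.EAGAIN or attempt == max_attempts:
                raise
            time.sleep(delay_seconds * attempt)  # simple linear backoff


def make_flaky_read(failures):
    """Hypothetical stand-in for a system call that returns EAGAIN a few times."""
    state = {"remaining": failures}

    def read():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
        return b"payload"

    return read
```

An application that overreacts would correspond to calling `read()` directly and letting the first `OSError` terminate the process. The wrapper tolerates a bounded number of transient failures and still surfaces persistent ones.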

Automatic USDT Observability Probe Injection for Java Applications

Description: In software debugging or software performance analysis, it is essential to have sufficient observability of the target software system [1]. One of the most popular techniques to improve observability is USDT (Userland Statically Defined Tracing) [2]. In the world of Java, such techniques are only applied to the JVM rather than to Java applications [3]. However, it is promising to add USDT probes directly in a Java application so that more application-specific metrics can be collected with negligible overhead. The thesis project will explore how to use Spoon [4] to automatically generate USDT probes for key events in a Java application.

  1. Automatic Observability for Dockerized Java Applications

  2. https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html

  3. Profiling JVM Applications in Production

  4. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code (https://github.com/INRIA/spoon/)

Software Testing

See also our recent papers.

Collection and Analysis of Code Coverage in Production

Supervisors: Martin Monperrus (KTH)

Description: Code coverage usually relates to test code. Production code coverage is the coverage over real interactions made by users in production. Obtaining and analysing production code coverage makes it possible to identify useless code as well as relevant test data and values. It enables testers and developers to better align the test intentions with what matters for users. The student will compare and analyze techniques for automatically collecting code coverage in production for Java software.

  1. Code Pulse: Real-time code coverage for penetration testing activities

  2. Measuring production code coverage with JaCoCo

  3. Perpetual testing
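
The following sketch shows the basic mechanism behind line coverage collection: recording which lines execute while real requests are served. It is a Python illustration based on the standard `sys.settrace` hook. The thesis itself targets Java, where an agent such as JaCoCo [2] would play this role. The names `LineCoverageRecorder` and `handle_request` are hypothetical, and the tracing hook used here has a high overhead that a production-grade technique would need to avoid.

```python
import sys
from collections import defaultdict


class LineCoverageRecorder:
    """Records which lines of which files execute while the recorder is active."""

    def __init__(self):
        self.hits = defaultdict(set)  # filename -> set of executed line numbers

    def _trace(self, frame, event, arg):
        if event == "line":
            self.hits[frame.f_code.co_filename].add(frame.f_lineno)
        return self._trace  # keep tracing inside the called frame

    def __enter__(self):
        sys.settrace(self._trace)
        return self

    def __exit__(self, exc_type, exc, tb):
        sys.settrace(None)
        return False


def handle_request(n):
    # Hypothetical production entry point used only for this illustration.
    if n > 0:
        return "positive"
    return "non-positive"
```

Wrapping calls such as `handle_request(1)` in `with LineCoverageRecorder() as recorder:` fills `recorder.hits`. Lines that never appear after a long observation period are candidates for useless code.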

Automatic Identification of Pseudo-tested Conditions

Supervisors: Benoit Baudry, Martin Monperrus (KTH)

Description: One of the problems with coverage is that it does not detect unspecified code [1]. The same problem exists with conditions: some conditions are well covered according to edge coverage, yet the code can actually take any branch without a test failing. The student will design, implement and evaluate a novel technique for automatically identifying pseudo-tested conditions in Java software.

  1. A Comprehensive Study of Pseudo-tested Methods

  2. https://github.com/STAMP-project/pitest-descartes/
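
A toy example may help to see what a pseudo-tested condition is. The sketch below assumes one possible detection strategy: force the condition to `True`, then to `False`, and flag it if the test suite passes in both cases. All names (`make_normalize`, `weak_suite`, `strong_suite`, `is_pseudo_tested`) are hypothetical. A real tool for Java would transform source code or bytecode instead of using a `forced` parameter.

```python
def make_normalize(forced=None):
    """Build the function under test; `forced` overrides its condition."""

    def normalize(values):
        needs_sort = values != sorted(values) if forced is None else forced
        if needs_sort:
            values = sorted(values)
        return values

    return normalize


def weak_suite(normalize):
    # Covers both branches of the condition, but only checks the length
    # of the unsorted input's result.
    return normalize([1, 2, 3]) == [1, 2, 3] and len(normalize([3, 1, 2])) == 3


def strong_suite(normalize):
    # Covers both branches and checks the actual output.
    return normalize([1, 2, 3]) == [1, 2, 3] and normalize([3, 1, 2]) == [1, 2, 3]


def is_pseudo_tested(make_subject, test_suite):
    """Flag the condition if the suite passes with it forced to True and to False."""
    return all(test_suite(make_subject(forced)) for forced in (True, False))
```

With `weak_suite`, both edges of the condition are covered and yet the condition is flagged. With `strong_suite`, forcing the condition to `False` makes a test fail, so the condition is not flagged.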

Category Blockchain & Crypto

See our previous work: Chaos Engineering of Ethereum Blockchain Clients; The Multi-Billion Dollar Software Supply Chain of Ethereum

Software Supply Chain

Dependency Diversification for Ethereum Clients Using Production Data

Reliability is an essential motivation of the blockchain paradigm. As a distributed system, a blockchain does not crash because some nodes fail after a hardware failure, a power outage or a targeted attack. However, distributed ledgers are less resilient to software faults: the same bug can be triggered in many different nodes at the same time, which would crash the whole chain in a systemic manner. Our core idea is to build upon existing research on automated synthesis of software diversity [1,2] to solve this problem. We will address the client diversity problem in Ethereum by synthesizing diverse implementations with automated code transformations, such that no single client is used by more than 33% of the nodes.

  1. The Multiple Facets of Software Diversity: Recent Developments in Year 2000 and Beyond

  2. Automatic Diversity in the Software Supply Chain

Test Generation for Ethereum Clients Using Production Data

Description: Unit testing is one of the essential ways to improve the quality of software. It is also helpful for correctness checking when there are different implementations based on the same software specification. Take Ethereum clients as an example: there are thousands of common tests [1] provided for all the Ethereum client projects. Though these tests already cover various cases, there are corner cases in production that are missing in the test suite [2]. In this thesis project, you will design, implement and evaluate a prototype that collects production data and generates new valuable test cases for Ethereum clients.

  1. https://github.com/ethereum/tests

  2. Production Monitoring to Improve Test Suites

Behavioral Analysis for Bitcoin Core

Description: An important concept in software security is to protect resources with whitelists. It has been implemented at different levels of the software stack (kernel, virtual machines, application frameworks). In Bitcoin Core, whitelists of system calls can be used and enforced via Linux seccomp [1]. From a research perspective, the hard problem is to infer the whitelist of accessible resources via behavioral analysis [2,3]. You will design and perform an experiment to compare different behavior inference techniques for Bitcoin Core.

  1. https://github.com/bitcoin/bitcoin/pull/20487

  2. A Sense of Self for UNIX Processes

  3. Shredder: Breaking Exploits through API Specialization
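
The simplest form of behavioral inference can be sketched as follows: observe system call traces of normal runs, then derive a whitelist from them. The sketch assumes that traces are already available as lists of system call names. The function names are hypothetical. `ngrams` illustrates the finer-grained idea of profiling short sequences of calls instead of individual calls, in the spirit of [2].

```python
def infer_allowlist(traces):
    """Whitelist = every system call observed in at least one training trace."""
    allowed = set()
    for trace in traces:
        allowed.update(trace)
    return allowed


def violations(trace, allowlist):
    """System calls in `trace` that the inferred whitelist would block."""
    return [call for call in trace if call not in allowlist]


def ngrams(trace, k):
    """All windows of k consecutive system calls, as a finer-grained profile."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}


training = [["read", "write", "close"], ["read", "close"]]
allowlist = infer_allowlist(training)
blocked = violations(["read", "execve", "close"], allowlist)
```

The experimental question is how well such inferred profiles generalize: a whitelist that is too narrow breaks legitimate behavior, while one that is too broad protects little.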

Securing Software Updates via the Ethereum Blockchain

Description: Today's software systems deployed in production are so big that it is not realistic to update them at once when a new version is available. To overcome this problem, the best-of-breed platforms for software update transfer only the newest changes to the execution machines. Typically, an update is a binary patch, which is orders of magnitude smaller than the whole application. While this solves the resource problem (bandwidth and disk are saved), it causes a security problem with respect to verifying the integrity and provenance of the update. You will design, implement and evaluate a system for securing software updates via a distributed ledger (e.g., Ethereum).

  1. Blockchain-Based Certificate Transparency and Revocation Transparency

  2. CHAINIAC: Proactive software-update transparency via collectively signed skipchains and verified builds

  3. TUF on the Tangle – Securing software updates using a distributed ledger

  4. The Update Framework
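
The integrity part of the problem can be sketched in a few lines. The publisher records the digest of each binary patch in an append-only registry, and a client applies a patch only if its digest matches the record. In the sketch below, `ToyLedger` is an in-memory stand-in for a distributed ledger and is purely an assumption for illustration. It does not use any Ethereum API, and it does not address provenance, which would require signatures.

```python
import hashlib


class ToyLedger:
    """In-memory, append-only stand-in for a distributed ledger (assumption)."""

    def __init__(self):
        self._records = {}

    def publish(self, version, patch_bytes):
        # Published records are immutable: a version cannot be overwritten.
        if version in self._records:
            raise ValueError("version already published")
        self._records[version] = hashlib.sha256(patch_bytes).hexdigest()

    def digest_for(self, version):
        return self._records.get(version)


def verify_update(ledger, version, patch_bytes):
    """Accept a patch only if its SHA-256 digest matches the published record."""
    expected = ledger.digest_for(version)
    actual = hashlib.sha256(patch_bytes).hexdigest()
    return expected is not None and actual == expected


ledger = ToyLedger()
ledger.publish("1.0.1", b"binary patch contents")
```

A client would call `verify_update(ledger, "1.0.1", downloaded_bytes)` before applying the patch and reject the update if the call returns `False`.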

N-Version Programming for Blockchain Nodes

Description: In successful blockchains, there exists more than one implementation of nodes. For example, in Bitcoin there are Bitcoin Core and Bitcoinj, and in Ethereum there are Geth and Besu. The existence of this diversity of implementations opens up possibilities. In this thesis, you will explore the use of multi-version execution [1] and N-version programming [2] in the context of blockchain nodes. You will devise, implement and evaluate a prototype system in the context of one single blockchain.

  1. Safe software updates via multi-version execution

  2. Enter the hydra: Towards principled bug bounties and exploit-resistant smart contracts
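
The core of N-version programming is a voter that runs several independent implementations on the same input and keeps the majority answer. The sketch below shows this under simplifying assumptions: the implementations are plain Python callables with hashable results, and `n_version_execute` is a hypothetical name. Real blockchain nodes would be separate processes, and a prototype would also have to handle crashes and timeouts, which this sketch ignores.

```python
from collections import Counter


def n_version_execute(implementations, request):
    """Run every version on the same request and return the majority result.

    Also returns the indices of the versions that disagreed with the majority.
    """
    results = [implementation(request) for implementation in implementations]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    dissenting = [i for i, result in enumerate(results) if result != winner]
    return winner, dissenting


# Three hypothetical versions of the same computation; the third one is buggy.
versions = [lambda x: x * 2, lambda x: x + x, lambda x: x * 2 + 1]
result, dissenting = n_version_execute(versions, 3)
```

The list of dissenting versions is useful in its own right: a version that disagrees with the majority points at a likely bug in that implementation.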

Smart Contracts

Obfuscation for Solidity and EVM Bytecode

Description: In a blockchain paradigm, program binaries are always available. This means that an attacker can do any kind of binary analysis to reverse engineer smart contracts. A countermeasure is obfuscation. You will study the limitations of current obfuscation tools. Then, you will design and evaluate a new obfuscator for Solidity and EVM bytecode.

  1. EShield: Protect Smart Contracts against Reverse Engineering (2021)

  2. Erays: reverse engineering ethereum’s opaque smart contracts (Usenix 2018)

Automated Program Repair for Smart Contracts

Description: Smart contracts are software, and hence, cannot be perfect. Smart contracts suffer from bugs, some of which put high financial stakes at risk. There is a new line of research on automated patching of smart contracts. You will devise, perform and analyze a comparative experiment to identify the successes, challenges and limitations of automated program repair for smart contracts.

  1. Elysium: Automagically Healing Vulnerable Smart Contracts Using Context-Aware Patching

  2. EVMPatch: Timely and automated patching of ethereum smart contracts

Future Package Managers for Smart Contracts and Solidity

Programming smart contracts is a paradigm shift in software engineering. The fundamental smart contract concepts of immutability and paid execution change the way we design and ship software. In addition, the financial stakes are so large (billions of dollars for financial smart contracts) that security concerns are amplified by orders of magnitude. As a result, many phases of a typical software engineering pipeline have to be redesigned. In the context of the software supply chain, this means it is not possible to reuse existing package managers. In this PhD thesis, the student will design, implement and evaluate future package managers for smart contracts and Solidity. The package manager will have built-in support for software integrity verification, audits and deployments to immutable blockchain infrastructure.

  1. CHAINIAC: Proactive Software-Update Transparency via Collectively Signed Skip-chains and Verified Builds

  2. Enter the hydra: Towards principled bug bounties and exploit-resistant smart contracts
