Open research topics

by Martin Monperrus

Under my supervision, you can do cool research in software technology. Here are our current hot topics.

Are you a KTH student? See Master's thesis / Bachelor's thesis guidelines and contact me by email

Are you a brilliant international student? Contact me by email


Category Program Repair
     Machine Learning for Program Repair
          Explaining Code LLMs with monosemanticity
          Segregated fine-tuning for code LLMs
          Automated Prompt Engineering for Program Repair
          Learning Program Transformations with Transformers
          Neural Program Repair of CodeQL Warnings
          Using generative AI to adapt software components
          Automatic translation of C to Rust with Language Models
          An Empirical Comparison on Semantics Preserving Transformation Tools
     Code Analysis for program repair
          Self-supervised learning for proving program equivalence in LLVM
          Automated Program Repair for Smart Contracts
          Analyzing the Effectiveness of Embeddings for Patch Correctness Assessment in Program Repair
Category Software Supply Chain (CHAINS)
          Empirical Study of Compilation Reproducibility in Solidity
          Zero-knowledge software bills of materials
          Study of non-reproducible builds in the Java ecosystem
          Diverse-double compilation for Java
          Diverse-double compilation in a CI/CD Pipeline
          Dynamic Integrity Verification & Repair for Java Applications
          Dynamic Introspection of Dependencies in Java Applications
          Automatic Backporting of Java Libraries to Older Bytecode Versions
Category Crypto & Smart Contracts
          Investigation of the Software Supply Chain of Smart Contract Wallets
          Building a Rock Solid Dataset of Smart Contracts for Machine Learning
          Automated Program Repair for Smart Contracts
          On-chain code coverage
          Behavioral Hardening for Blockchain Nodes
          Automatic Exploit Synthesis for Smart Contracts
          Synthetic Vulnerability Generation for Smart Contracts
          Effective Mutation Testing for Solidity Smart Contracts

Category Program Repair

Machine Learning for Program Repair

Explaining Code LLMs with monosemanticity

LLMs have revolutionized machine learning on code. However, they are mostly black boxes that we still do not understand. In this project, you will explore monosemanticity in LLMs trained on code. Monosemanticity is a recent area of mechanistic interpretability which learns monosemantic (i.e. having only one meaning) linear combinations of neuron activations, overcoming the problem of a single neuron representing different semantic features. Your work will aim to understand and activate features related to code, specifically ones that improve code quality.

  1. https://transformer-circuits.pub/2023/monosemantic-features/index.html (original paper)

  2. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html (scaling paper)

  3. https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand (explanation blogpost)

  4. https://github.com/jbloomAus/SAELens (open-source implementation of sparse auto-encoder)

  5. https://transformerlensorg.github.io/TransformerLens/index.html (mech interpretability repo, supports loading of SAE trained by SAELens)
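As a starting point, the forward pass of a sparse autoencoder, the building block used in the monosemanticity papers above, can be sketched in plain Python. This is only an illustrative toy: the function names and the tiny hand-picked weights in the usage note are assumptions, and a real study would rely on SAELens and trained weights instead.

```python
# Toy sparse autoencoder (SAE) forward pass on one activation vector.
# All names here are illustrative; they do not come from SAELens.

def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Return (features, reconstruction, loss) for the activation vector x."""
    # Encoder: features = ReLU(W_enc (x - b_dec) + b_enc)
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    features = relu([a + b for a, b in zip(matvec(w_enc, centered), b_enc)])
    # Decoder: reconstruction = W_dec features + b_dec
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    # Loss: squared reconstruction error plus an L1 sparsity penalty.
    squared_error = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return features, recon, squared_error + l1_coeff * sparsity
```

For example, with a 2-dimensional activation and 3 dictionary features, sae_forward([2.0, 0.0], [[1, 0], [0, 1], [-1, -1]], [0.0, 0.0, 0.0], [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]], [0.0, 0.0], 0.1) activates only the first feature. The L1 term is what pushes most features to zero, which is the property that makes individual features easier to interpret.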

Segregated fine-tuning for code LLMs

The project explores segregated fine-tuning of code LLMs to optimize their performance. The idea is to split an LLM into separate fine-tunable parts ('heads'), each dedicated to a specific function. Some heads will be fine-tuned for understanding and processing programming languages, while others will be fine-tuned for specific tasks such as program repair. This segregation aims to improve the efficiency of LLMs on code tasks in a composable manner.

  1. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair

  2. Personal Copilot: Train Your Own Coding Assistant

Automated Prompt Engineering for Program Repair

Description: Prompt engineering is a crucial aspect of utilising large language models effectively. In this project, you will explore the use of automated prompt engineering methods in the context of program repair. The goal is to develop, implement, and evaluate a system that can generate effective prompts to guide a language model in repairing faulty code.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

  2. Large Language Models as Optimizers

Learning Program Transformations with Transformers

Description: The application of program transformations, such as bug fixing and refactoring, is essential for maintaining and improving software quality. In this project, you will investigate the use of transformer models to learn from a diverse set of program transformations applied across multiple projects. The objective is to develop a system that can automatically generate transformations for given code snippets by training on historical transformation data. This involves collecting a dataset of projects, curating code transformations, designing an appropriate transformer architecture, and evaluating the model's ability to generalize transformations to unseen code.

  1. Attention is all you need

  2. Learning to represent edits

Neural Program Repair of CodeQL Warnings

Description: Static analysis tools are widely used in industry to detect bugs. CodeQL is a state-of-the-art tool in this domain. You will do research on machine learning for repairing CodeQL warnings in Java. You will devise, implement and evaluate an approach based on large language models.

  1. Sorald: Automatic Patch Suggestions for SonarQube Static Analysis Violations

  2. CodeQL query help for Java and Kotlin

Using generative AI to adapt software components

Description: Software substitutability is a property which measures how readily a software component can be replaced by a different but equivalent component. In software supply chains, it is critical for faulty or vulnerable components to be replaced as quickly as possible [1,2]. However, software substitutes might not be immediately available. Generative AI tools like ChatGPT may be used to efficiently produce software substitutes in diverse programming languages and stacks. You will determine the feasibility of using generative AI tools to enhance the substitutability of components in software supply chains.

  1. AdaptivePaste: Code Adaptation through Learning Semantics-aware Variable Usage Representations

  2. Better Together? An Evaluation of AI-Supported Code Translation

  3. Formalization of Component Substitutability

Automatic translation of C to Rust with Language Models

Description: The thesis aims to develop an automatic translation system for converting C code to Rust, using state-of-the-art natural language processing techniques and deep learning models. The primary goal is to facilitate the migration of legacy C codebases to Rust, ensuring safer, more efficient, and more maintainable software systems. The same topic of automatic translation can be applied to JS->TypeScript, Java->Kotlin, or even Python->Typed-Python if you're excited about it.

  1. Concrat: An Automatic C-to-Rust Lock API Translator for Concurrent Programs

  2. Code translation with compiler representations 2023

  3. Unsupervised translation of programming languages 2020

An Empirical Comparison on Semantics Preserving Transformation Tools

Description: In recent years, various tools have been developed to generate equivalent programs using semantics preserving transformations. These tools aim to produce code that is semantically identical but syntactically different from the original code. In this thesis, you will embark on a comparative study of these existing tools, examining their efficiency and effectiveness in generating equivalent programs. This comparative study will shed light on the strengths and weaknesses of each tool, potentially inspiring further advancements in the field of semantics preserving transformations.

  1. On the generalizability of Neural Program Models with respect to semantic-preserving program transformations

  2. Self-Supervised Learning to Prove Equivalence Between Programs via Semantics-Preserving Rewrite Rules

  3. NatGen: generative pre-training by “naturalizing” source code

Code Analysis for program repair

Self-supervised learning for proving program equivalence in LLVM

In recent years, self-supervised learning has emerged as a powerful technique for encoding high-level semantic properties in the absence of explicit supervision signals. The focus of this thesis is to explore the application of self-supervised learning methodologies towards proving program equivalence in LLVM bytecode. LLVM provides a structured format for representing program constructs at the intermediate level. Program equivalence is a fundamental problem in computer science, concerned with proving that two programs exhibit the same behavior under all possible inputs. By utilizing self-supervised learning techniques, we aim to develop a practical approach for efficient and accurate program equivalence verification in a mainstream binary format.

  1. On the generalizability of Neural Program Models with respect to semantic-preserving program transformations

  2. Self-Supervised Learning to Prove Equivalence Between Programs via Semantics-Preserving Rewrite Rules

Automated Program Repair for Smart Contracts

Description: Smart contracts are software, and hence, cannot be perfect. Smart contracts suffer from bugs, some of which put high financial stakes at risk. There is a new line of research on automated patching of smart contracts. You will devise, perform and analyze a comparative experiment to identify the successes, challenges and limitations of automated program repair for smart contracts.

  1. Elysium: Automagically Healing Vulnerable Smart Contracts Using Context-Aware Patching

  2. EVMPatch: Timely and automated patching of ethereum smart contracts

Analyzing the Effectiveness of Embeddings for Patch Correctness Assessment in Program Repair

Description: In program repair, not all generated patches are correct. This thesis aims to investigate patch correctness assessment using embeddings, with state-of-the-art embedding APIs, incl. OpenAI embeddings. The goal is to develop a system that can estimate the likelihood that an automatically generated patch is correct, based on semantic similarity in the embedding space. This research will contribute to enhancing the accuracy of program repair systems.

  1. Automated Classification of Overfitting Patches with Statically Extracted Code Features

  2. Evaluating representation learning of code changes for predicting patch correctness in program repair
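One possible signal, sketched below under strong assumptions, is the cosine similarity between the embedding of the buggy code and the embedding of each patched version. The function names and the idea of ranking by similarity to the buggy code are illustrative choices only; whether and how similarity relates to correctness is exactly what the thesis would have to evaluate. Embeddings are passed in as plain lists of floats, so no particular embedding API is assumed.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_patches(buggy_embedding, patch_embeddings):
    """Return patch ids sorted by decreasing similarity to the buggy code.

    patch_embeddings maps a (hypothetical) patch id to its embedding.
    """
    scored = [
        (cosine_similarity(buggy_embedding, embedding), patch_id)
        for patch_id, embedding in patch_embeddings.items()
    ]
    return [patch_id for _score, patch_id in sorted(scored, reverse=True)]
```

A call such as rank_patches(buggy_vec, {"patch-1": vec1, "patch-2": vec2}) returns the patch ids in ranked order; a threshold or a learned classifier on top of the scores would be a natural next step.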

Category Software Supply Chain (CHAINS)

Work done as part of the CHAINS research project

Empirical Study of Compilation Reproducibility in Solidity

Description: The reproducibility of software builds is a critical aspect of secure software development. This concept has been pushed forward in the context of Solidity, the programming language used for writing smart contracts on the Ethereum blockchain, with the notion of "verified contracts". In this thesis, you will conduct an empirical study on the reproducibility of compilation in Solidity. You will recompile verified Solidity contracts and analyze the consistency of the results. The datasets for this study will be sourced from Etherscan and Sourcify. This research will contribute to the understanding of software integrity in the growing field of technology and could potentially inform best practices for Solidity development.

  1. Reproducible Builds: Increasing the Integrity of Software Supply Chains

  2. Etherscan

  3. Sourcify
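When comparing recompiled bytecode with deployed bytecode, one detail is worth knowing: by default the Solidity compiler appends CBOR-encoded metadata to the bytecode, followed by a two-byte big-endian length of that metadata. Two compilations can therefore differ only in this trailer. The sketch below classifies a pair of bytecodes accordingly; the function names and the three result labels are hypothetical and not taken from Etherscan or Sourcify.

```python
def strip_cbor_metadata(bytecode: bytes) -> bytes:
    """Remove the trailing CBOR metadata and its 2-byte length, if plausible."""
    if len(bytecode) < 2:
        return bytecode
    metadata_length = int.from_bytes(bytecode[-2:], "big")
    if metadata_length + 2 > len(bytecode):
        # The length field does not fit; assume there is no metadata trailer.
        return bytecode
    return bytecode[: len(bytecode) - metadata_length - 2]

def compare_bytecode(onchain: bytes, recompiled: bytes) -> str:
    """Classify a recompilation result (labels are illustrative)."""
    if onchain == recompiled:
        return "identical"
    if strip_cbor_metadata(onchain) == strip_cbor_metadata(recompiled):
        return "identical-modulo-metadata"
    return "different"
```

This is a heuristic: bytecode compiled without metadata, or bytecode that embeds other contracts with their own metadata, would need more careful handling.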

Zero-knowledge software bills of materials

Description: Software bills of materials (SBOMs) are complete lists of software components [1]. They can be helpful in tracing vulnerabilities, checking license compliance, etc. However, revealing an SBOM publicly also means revealing said vulnerabilities to malicious actors. Furthermore, some proprietary software developers advocate for access control for SBOM distribution [2]. Zero-knowledge proofs allow a party to convey that a statement is true without disclosing any additional information [3]. You will design, develop, and evaluate a zero-knowledge SBOM system, which allows developers to disclose limited, but verifiable SBOM information to authorized users.

  1. The Minimum Elements For a Software Bill of Materials https://www.ntia.doc.gov/files/ntia/publications/sbom_minimum_elements_report.pdf

  2. An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead http://arxiv.org/abs/2301.05362

  3. Zero-knowledge proof https://en.wikipedia.org/wiki/Zero-knowledge_proof

  4. Trust in Software Supply Chains: Blockchain-Enabled SBOM and the AIBOM Future 2024

Study of non-reproducible builds in the Java ecosystem

Description: Build reproducibility means that a software build always results in a bit-by-bit identical output, provided the source code and build environment are exactly the same [1]. This property is a good safeguard against the threat of a compromised build process [2], and hence it is important for software supply chain security. In the Java ecosystem, Reproducible Central attempts to reproduce Maven/Gradle/sbt artifacts on Maven Central. It does so by building the artifact from source and then comparing it with the artifact in the Maven registry. If it is bit-by-bit identical, the Maven package is said to be reproducible; otherwise it is non-reproducible. In this thesis, you will create a taxonomy of reasons for non-reproducible builds of Maven packages.

  1. https://reproducible-builds.org/

  2. AROMA: Automatic Reproduction of Maven Artifacts
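The bit-by-bit comparison described above can be sketched with cryptographic hashes. This is a minimal illustration, not how Reproducible Central is implemented; the function names and file paths are assumptions.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so that large artifacts do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as artifact:
        for chunk in iter(lambda: artifact.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_reproducible(reference_path, rebuilt_path):
    """True if the rebuilt artifact is bit-by-bit identical to the reference."""
    return sha256_of_file(reference_path) == sha256_of_file(rebuilt_path)
```

A call such as is_reproducible("reference.jar", "rebuilt.jar") only gives a yes/no answer. Building the taxonomy would additionally require diffing the contents of non-identical artifacts to find out why they differ.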

Diverse-double compilation for Java

Description: Java is a key programming language for enterprise applications. As such, the Java compiler is an ideal target for a trusting trust attack. This thesis aims to investigate the feasibility of diverse-double compilation (DDC) to mitigate this problem. You will design, implement and evaluate DDC for Java.

  1. Reflections on Trusting Trust

  2. Countering trusting trust through diverse double-compiling

  3. Diverse Double-Compiling to Harden Cryptocurrency Software (Master's thesis KTH 2023)

(a related crazy idea is to do diverse-double compilation for a JIT compiler)

Diverse-double compilation in a CI/CD Pipeline

Description: C is a fundamental programming language for system-level software. Given its widespread use, the C compiler is a prime target for trusting trust attacks. This thesis aims to explore the systematic use of diverse-double compilation (DDC) in a modern Continuous Integration/Continuous Deployment (CI/CD) pipeline. You will design, implement and evaluate DDC in a CI/CD environment.

  1. Reflections on Trusting Trust

  2. Countering trusting trust through diverse double-compiling

  3. Diverse Double-Compiling to Harden Cryptocurrency Software (Master's thesis KTH 2023)

Dynamic Integrity Verification & Repair for Java Applications

Description: Attackers constantly try to tamper with the code of software applications in production. Chang and Atallah have proposed a technique that not only detects modifications but also repairs the code after attacks, using a network of small security units called guards. These guards can be programmed to perform tasks such as checksumming the program code, and they work in concert to create mutual protection. In this thesis, you will devise, implement and evaluate such an approach in the context of modern Java software with dependencies. An open question is how to set up guards inside or around dependency code.

  1. Protecting Software Code by Guards

  2. Reflection as a mechanism for software integrity verification

  3. Practical integrity protection with oblivious hashing
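The guard idea can be illustrated with a toy model in which the protected code is a mutable byte buffer. This is a simplified sketch under stated assumptions: the Guard class, its method names, and the use of SHA-256 are illustrative and are not taken from the papers above, and a real Java implementation would have to operate on loaded classes rather than on a byte array.

```python
import hashlib

class Guard:
    """Protects the byte range [start, end) of a mutable code buffer."""

    def __init__(self, memory, start, end):
        self.memory = memory
        self.start = start
        self.end = end
        # Keep a clean copy and its checksum, taken at installation time.
        self.clean_copy = bytes(memory[start:end])
        self.expected = hashlib.sha256(self.clean_copy).hexdigest()

    def is_intact(self):
        current = bytes(self.memory[self.start:self.end])
        return hashlib.sha256(current).hexdigest() == self.expected

    def repair(self):
        # Overwrite the (possibly tampered) region with the clean copy.
        self.memory[self.start:self.end] = self.clean_copy

def run_guards(guards):
    """Check every guard, repair tampered regions, return the repair count."""
    repairs = 0
    for guard in guards:
        if not guard.is_intact():
            guard.repair()
            repairs += 1
    return repairs
```

In this model, mutual protection would correspond to guards whose protected ranges cover the storage of other guards, which the sketch leaves out.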

Dynamic Introspection of Dependencies in Java Applications

Description: We aim to design and develop a prototype for dynamic introspection of dependencies in Java applications. This would enable real-time tracking and decisions based on the dependency execution context. By leveraging Java's instrumentation capabilities, the proposed system will monitor and identify the active dependencies at any given point during program execution. The focus will be on minimizing performance overhead to ensure that the introspection process does not significantly impact the application's responsiveness or efficiency, while integrating seamlessly with any existing Java application. Rigorous evaluation against various benchmarks will be done to assess its accuracy, performance, and usability.

Automatic Backporting of Java Libraries to Older Bytecode Versions

Description: With the rapid evolution of Java, libraries often get updated to new bytecode versions. This causes compatibility issues and breakages for applications that are still running on older versions of Java. To address this, a possible solution is to automatically backport Java libraries to older bytecode versions. This thesis will focus on designing and implementing an automated tool for backporting Java libraries. The tool should be capable of translating new bytecode instructions to their older equivalents, maintaining the functional behavior of the library while ensuring compatibility with older Java versions. An open question is how to handle new language features and APIs that do not have direct equivalents in older versions.

  1. Back to the past–analysing backporting practices in package dependency networks

  2. Recommending code changes for automatic backporting of Linux device drivers

  3. Transforming C++11 Code to C++03 to Support Legacy Compilation Environments
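A first practical step is detecting which bytecode version a library targets. Every .class file starts with the magic number 0xCAFEBABE, followed by a two-byte minor version and a two-byte major version, all big-endian (for example, major version 52 corresponds to Java 8 and 61 to Java 17). The sketch below reads that header; the function names are illustrative, and reading the classes out of a JAR is left out.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE

def class_file_version(data: bytes):
    """Return (major, minor) read from the first 8 bytes of a .class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != CLASS_MAGIC:
        raise ValueError("missing 0xCAFEBABE magic number")
    return major, minor

def needs_backport(data: bytes, target_major: int) -> bool:
    """True if the class targets a newer bytecode version than the target."""
    major, _minor = class_file_version(data)
    return major > target_major
```

For instance, needs_backport(class_bytes, 52) tells whether a class must be rewritten to run on Java 8. The rewriting itself is the hard part and is the subject of the thesis.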

Category Crypto & Smart Contracts

Investigation of the Software Supply Chain of Smart Contract Wallets

Description: Smart Contract Wallets form a critical component of the blockchain ecosystem, storing and managing digital assets. However, they are also a potential target for software supply chain attacks, where vulnerabilities in the contract dependencies can be exploited, leading to significant losses. In this thesis, you will conduct a comprehensive investigation of the software supply chains of major Smart Contract Wallets. The goal is to understand their security landscape, identify potential vulnerabilities, and propose actionable improvements. This research will not only contribute to the understanding of software supply chain security in the context of blockchain technology, but also provide valuable insights for developers, users, and stakeholders in the crypto space.

  1. Security Aspects of Cryptocurrency Wallets-A Systematic Literature Review

  2. Software supply chain attacks on crypto

  3. Smart Contract-based Wallets for Blockchain Systems: A Systematic Review

Building a Rock Solid Dataset of Smart Contracts for Machine Learning

Description: This Master's thesis will explore the development of a high-quality dataset of secure Solidity smart contracts. Such datasets are of extreme importance for any kind of machine learning on Solidity code. The primary objective is to focus specifically on contracts that manage a large amount of funds. To accomplish this, the study will involve data collection, preprocessing, and analysis of smart contracts from the Ethereum blockchain.

  1. Scrawld: A dataset of real world ethereum smart contracts labelled with vulnerabilities

  2. Performance benchmarking of smart contracts to assess miner incentives in Ethereum

Automated Program Repair for Smart Contracts

Description: Smart contracts are software, and hence, cannot be perfect. Smart contracts suffer from bugs, some of which put high financial stakes at risk. There is a new line of research on automated patching of smart contracts. You will devise, perform and analyze a comparative experiment to identify the successes, challenges and limitations of automated program repair for smart contracts.

  1. Elysium: Automagically Healing Vulnerable Smart Contracts Using Context-Aware Patching

  2. EVMPatch: Timely and automated patching of ethereum smart contracts

On-chain code coverage

Programs executed on the Ethereum blockchain are defined through smart contracts. Solidity is the de-facto programming language used to implement smart contracts. Since much is at stake, good test coverage is essential for Solidity programs [1]. Coverage in production gives additional information about field usage [2], and the blockchain is a fully reproducible production workload. You will design and perform experiments to study production coverage in the context of smart contracts specified in Solidity.

  1. solidity-coverage: https://github.com/sc-forks/solidity-coverage

  2. Behavioral execution comparison: Are tests representative of field behavior? 2017.
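To give an idea of the kind of measurement involved, the sketch below computes coverage from execution traces and contrasts test coverage with production coverage. The trace representation (one list of executed source line numbers per transaction) and all function names are assumptions made for illustration; obtaining such traces from replayed transactions is part of the actual work.

```python
def executed_lines(traces):
    """Union of all line numbers executed across a collection of traces."""
    covered = set()
    for trace in traces:
        covered.update(trace)
    return covered

def coverage_ratio(coverable_lines, traces):
    """Fraction of coverable lines executed by at least one trace."""
    if not coverable_lines:
        raise ValueError("no coverable lines")
    return len(coverable_lines & executed_lines(traces)) / len(coverable_lines)

def used_in_production_but_untested(test_traces, production_traces):
    """Lines exercised by production transactions but by no test."""
    return executed_lines(production_traces) - executed_lines(test_traces)
```

Comparing coverage_ratio for test traces and for production traces of the same contract, together with the set returned by used_in_production_but_untested, gives a first view of how representative the tests are of field behavior.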

Behavioral Hardening for Blockchain Nodes

Description: An important concept in software security is to protect resources with whitelists. It has been implemented at different levels of the software stack (kernel, virtual machines, application frameworks). In Bitcoin Core, whitelists of system calls can be used and enforced via Linux seccomp [1]. From a research perspective, the hard problem is to infer the whitelist of accessible resources via behavioral analysis [2,3]. You will design and perform an experiment to compare different behavior inference techniques for Bitcoin Core.

  1. https://github.com/bitcoin/bitcoin/pull/20487

  2. A Sense of Self for UNIX Processes

  3. Shredder: Breaking Exploits through API Specialization
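The simplest possible inference technique, useful as a baseline, takes the union of all system calls observed during benign runs and flags anything outside that set. The sketch below assumes traces are already available as lists of system call names (for instance collected with a tracing tool); the function names are illustrative, and the sequence-based techniques from the references above are more refined than this.

```python
def infer_whitelist(benign_traces):
    """Whitelist = every system call seen in at least one benign trace."""
    whitelist = set()
    for trace in benign_traces:
        whitelist.update(trace)
    return whitelist

def violations(trace, whitelist):
    """System calls in a trace that the whitelist does not allow, in order."""
    return [call for call in trace if call not in whitelist]
```

For example, a whitelist inferred from [["read", "write"], ["read", "close"]] would flag "execve" in a later trace. Such a baseline is prone to false positives when the benign runs do not exercise all legitimate behavior, which is one of the aspects a comparative experiment would measure.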

Automatic Exploit Synthesis for Smart Contracts

Smart contracts typically hold large stakes and consequently, they are under constant attack by malicious actors. As a countermeasure, engineering smart contracts involves auditing and formal verification. Another option is automatic exploit synthesis. In this thesis, you will evaluate the state of the art of exploit synthesis for smart contracts. You will then design, implement and evaluate a better system that improves upon the state of the art.

  1. ExGen: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities

  2. FlashSyn: Flash Loan Attack Synthesis via Counter Example Driven Approximation

  3. Smart Contract and DeFi Security: Insights from Tool Evaluations and Practitioner Surveys

Synthetic Vulnerability Generation for Smart Contracts

We need robust security measures to protect digital assets from vulnerabilities and attacks. Traditional methods of vulnerability detection often rely on vulnerability benchmarks that are too small. In this thesis, you will explore the concept of synthetic vulnerability generation for smart contracts. The goal is to develop a system that leverages deep learning models to automatically generate synthetic vulnerabilities in smart contracts, thereby facilitating the testing and evaluation of security tools and practices.

  1. SGDL: Smart Contract Vulnerability Generation via Deep Learning

  2. Lava: Large-scale automated vulnerability addition.

Effective Mutation Testing for Solidity Smart Contracts

Description: One of the problems with mutation testing is that the developers are overwhelmed by the number of mutants to kill with new tests. One way to approach this problem is to view it as a recommendation problem. The student will design, implement and evaluate a novel technique for automatically prioritizing mutants to be killed in Solidity smart contracts.

  1. Selecting fault revealing mutants (2020)

  2. A Comprehensive Study of Pseudo-tested Methods

  3. https://github.com/Certora/gambit/