Under my supervision, you can do cool research in software technology. Here are our current hot topics.
Are you a KTH student? See Master's thesis / Bachelor's thesis guidelines and contact me by email
Are you a brilliant international student? Contact me by email
Category Program Repair
Machine Learning for Program Repair
Segregated fine-tuning for code LLMs
Automated Prompt Engineering for Program Repair
Neural Program Repair of CodeQL Warnings
Using generative AI to adapt software components
Automatic translation of C to Rust with Language Models
An Empirical Comparison on Semantics Preserving Transformation Tools
Neural Repair of Compiler Warnings
Code Analysis for program repair
Self-supervised learning for proving program equivalence in LLVM
Automated Program Repair for Smart Contracts
Analyzing the Effectiveness of Embeddings for Patch Correctness Assessment in Program Repair
Category Software Supply Chain (CHAINS)
Zero-knowledge software bills of materials
Fine-grained Java bytecode differencing
Diverse-double compilation for Java
Diverse-double compilation in a CI/CD Pipeline
Dynamic Integrity Verification & Repair for Java Applications
Embedding the software supply chain at runtime with Java classloaders
Investigation of the Software Supply Chain of Crypto Wallets
Category Crypto & Smart Contracts
Make it Offensive: Using AI to generate offensive code for smart contracts in Solidity.
Building a Rock Solid Dataset of Smart Contracts for Machine Learning
Automated Program Repair for Smart Contracts
On-chain code coverage
Behavioral Hardening for Blockchain Nodes
Automatic Exploit Synthesis for Smart Contracts
Effective Mutation Testing for Solidity Smart Contracts
Category Program Repair
Machine Learning for Program Repair
Segregated fine-tuning for code LLMs
This project explores segregated fine-tuning of code LLMs to optimize their performance. The idea is to split an LLM into separate parts ('heads') and to fine-tune each one for a specific function. Some heads will be dedicated to understanding and processing programming languages, while others will be fine-tuned for tasks such as program repair. This segregation aims to improve the efficiency of LLMs on code tasks in a composable manner.
Automated Prompt Engineering for Program Repair
Description: Prompt engineering is a crucial aspect of utilising large language models effectively. In this project, you will explore the use of automated prompt engineering methods in the context of program repair. The goal is to develop, implement, and evaluate a system that can generate effective prompts to guide a language model in repairing faulty code.
Neural Program Repair of CodeQL Warnings
Description: Static analysis tools are widely used in industry to detect bugs. CodeQL is a state-of-the-art tool in this domain. You will do research on machine learning for repairing CodeQL warnings in Java. You will devise, implement and evaluate an approach based on large language models.
Using generative AI to adapt software components
Description: Software substitutability is a property which measures how readily a software component can be replaced by a different but equivalent component. In software supply chains, it is critical for faulty or vulnerable components to be replaced as quickly as possible [1,2]. However, software substitutes might not be immediately available. Generative AI tools like ChatGPT may be used to efficiently produce software substitutes in diverse programming languages and stacks. You will determine the feasibility of using generative AI tools to enhance the substitutability of components in software supply chains.
AdaptivePaste: Code Adaptation through Learning Semantics-aware Variable Usage Representations
Better Together? An Evaluation of AI-Supported Code Translation
Automatic translation of C to Rust with Language Models
Description: The thesis aims to develop an automatic translation system for converting C code to Rust code, using state-of-the-art natural language processing techniques and deep learning models. The primary goal is to facilitate the migration of legacy C codebases to Rust, ensuring safer, more efficient, and more maintainable software systems. The same topic of automatic translation can be applied to JS->TypeScript, Java->Kotlin, or even Python->Typed-Python if you're excited about it.
An Empirical Comparison on Semantics Preserving Transformation Tools
Description: In recent years, various tools have been developed to generate equivalent programs using semantics preserving transformations. These tools aim to produce code that is semantically identical but syntactically different from the original code. In this thesis, you will embark on a comparative study of these existing tools, examining their efficiency and effectiveness in generating equivalent programs. This comparative study will shed light on the strengths and weaknesses of each tool, potentially inspiring further advancements in the field of semantics preserving transformations.
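To make the notion concrete, here is a minimal sketch of one semantics-preserving transformation, written with Python's standard `ast` module. It is not taken from any of the tools to be compared; the class name `SwapIfBranches` and the helper `transform` are hypothetical. The transformation negates the condition of an `if`/`else` statement and swaps its two branches, which changes the syntax but not the behavior.

```python
import ast


class SwapIfBranches(ast.NodeTransformer):
    """Rewrite `if c: A else: B` into `if not c: B else: A`."""

    def visit_If(self, node):
        self.generic_visit(node)  # transform nested statements first
        if not node.orelse:       # no else branch: nothing to swap
            return node
        negated = ast.UnaryOp(op=ast.Not(), operand=node.test)
        swapped = ast.If(test=negated, body=node.orelse, orelse=node.body)
        return ast.copy_location(swapped, node)


def transform(source: str) -> str:
    """Return a syntactically different but equivalent program."""
    tree = SwapIfBranches().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or later


ORIGINAL = """
def sign(x):
    if x < 0:
        return -1
    else:
        return 1
"""

VARIANT = transform(ORIGINAL)
```

Running both versions on the same inputs is a cheap sanity check of equivalence, although it is testing rather than proof.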
Neural Repair of Compiler Warnings
Description: It is a best practice to activate all warnings in a compiler. However, much work is needed to remediate them all. You will do research on machine learning for repairing compiler warnings. You will devise, implement and evaluate an approach based on sequence-to-sequence learning. The considered compilers are open source and could be, for example, those of Rust, Go, or Clang.
Break-It-Fix-It: Unsupervised Learning for Program Repair (2021)
Master's thesis: Exploring the Usage of Neural Networks for Repairing Static Analysis Warnings
MACER A Modular Framework for Accelerated Compilation Error Repair
Code Analysis for program repair
Self-supervised learning for proving program equivalence in LLVM
In recent years, self-supervised learning has emerged as a powerful technique for encoding high-level semantic properties in the absence of explicit supervision signals. The focus of this thesis is to explore the application of self-supervised learning methodologies towards proving program equivalence in LLVM bitcode. LLVM provides a structured format for representing program constructs at the intermediate level. Program equivalence is a fundamental problem in computer science, concerned with proving that two programs exhibit the same behavior under all possible inputs. By utilizing self-supervised learning techniques, we aim to develop a practical approach for efficient and accurate program equivalence verification in a mainstream binary format.
Automated Program Repair for Smart Contracts
Description: Smart contracts are software, and hence cannot be perfect. Smart contracts suffer from bugs, some of which put high financial stakes at risk. There is a new line of research on automated patching of smart contracts. You will devise, perform and analyze a comparative experiment to identify the successes, challenges and limitations of automated program repair for smart contracts.
Elysium: Automagically Healing Vulnerable Smart Contracts Using Context-Aware Patching
EVMPatch: Timely and automated patching of ethereum smart contracts
Analyzing the Effectiveness of Embeddings for Patch Correctness Assessment in Program Repair
Description: In program repair, not all generated patches are correct. This thesis aims to investigate the effectiveness of embeddings for assessing patch correctness, using state-of-the-art embedding APIs, including OpenAI embeddings. The goal is to develop a system that can estimate the likelihood that an automatically generated patch is correct, based on semantic similarity in the embedding space. This research will contribute to enhancing the accuracy of program repair systems.
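As a rough illustration of similarity in the embedding space, the sketch below ranks candidate patches by the cosine similarity between their embedding and the embedding of the buggy code. It rests on an assumption that the thesis would have to test: that patches closer to the buggy code are more likely to be correct. The function names and the toy vectors are hypothetical; real embeddings would come from an embedding API and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_patches(buggy_embedding, patch_embeddings):
    """Sort patches from most to least similar to the buggy code.

    Assumption (not a result): higher similarity suggests a more plausible patch.
    """
    scored = [
        (name, cosine_similarity(buggy_embedding, embedding))
        for name, embedding in patch_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-dimensional embeddings, for illustration only.
BUGGY = [1.0, 0.0]
PATCHES = {"patch_a": [0.9, 0.1], "patch_b": [0.0, 1.0]}
RANKING = rank_patches(BUGGY, PATCHES)
```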
Category Software Supply Chain (CHAINS)
Work done as part of the CHAINS research project
Zero-knowledge software bills of materials
Description: Software bills of materials (SBOMs) are complete lists of software components [1]. They can be helpful for tracing vulnerabilities, checking license compliance, etc. However, revealing an SBOM publicly also means revealing said vulnerabilities to malicious actors. Furthermore, some proprietary software developers advocate for access control for SBOM distribution [2]. Zero-knowledge proofs allow a party to convey that a statement is true without disclosing any additional information [3]. You will design, develop, and evaluate a zero-knowledge SBOM system, which allows developers to disclose limited but verifiable SBOM information to authorized users.
The Minimum Elements For a Software Bill of Materials https://www.ntia.doc.gov/files/ntia/publications/sbomminimumelementsreport.pdf
An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead http://arxiv.org/abs/2301.05362
Zero-knowledge proof https://en.wikipedia.org/wiki/Zero-knowledgeproof
Trust in Software Supply Chains: Blockchain-Enabled SBOM and the AIBOM Future 2024
Fine-grained Java bytecode differencing
Description: Code generated dynamically at runtime can differ between executions because of how threads manage class loading in the JVM. This hampers runtime reproducibility, and it becomes hard to know exactly what code is being executed. Thus, it is important to understand the precise differences in the generated bytecode. Today, the solution for understanding the differences is diffoscope [1], which provides a diff using two strategies [2]: 1) a line-based diff between the decompiled source code of the bytecode; 2) a line-based diff between the disassembled bytecode. The first strategy returns a diff between source files, so the difference in the bytecode needs to be inferred. The second strategy produces spurious differences which can hide semantic differences. Moreover, neither strategy directly enables transformation of bytecode: the diff still needs to be parsed and then applied. In this project, the researcher will create a fine-grained bytecode differencing tool based on the bytecode model [3] in the internal Java code. A related work is Gumtree [4,5], which is a fine-grained differencing tool for Java source code.
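The spurious differences of the second strategy can be illustrated with a small sketch. The two listings below are hand-written in the style of a Java disassembly and are not real tool output; the helper names are hypothetical. They differ only in a constant pool index, yet a line-based diff reports the whole instruction as changed. Masking the index is a crude workaround that a model-based diff would make unnecessary.

```python
import difflib
import re

# Hypothetical disassembly listings: same instructions, shifted constant pool index.
OLD = [
    "0: aload_0",
    "1: invokevirtual #2  // Method greet:()V",
    "4: return",
]
NEW = [
    "0: aload_0",
    "1: invokevirtual #3  // Method greet:()V",
    "4: return",
]


def changed_lines(a, b):
    """Lines that a line-based diff reports as removed or added."""
    return [line for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ "))]


def normalize(listing):
    """Mask constant pool indices so that only the instructions are compared."""
    return [re.sub(r"#\d+", "#_", line) for line in listing]


RAW_DIFF = changed_lines(OLD, NEW)
NORMALIZED_DIFF = changed_lines(normalize(OLD), normalize(NEW))
```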
Diverse-double compilation for Java
Description: Java is a key programming language for enterprise applications. As such, the Java compiler is an ideal target for a trusting trust attack. This thesis aims to investigate the feasibility of diverse-double compilation (DDC) to mitigate this problem. You will design, implement and evaluate DDC for Java.
(a related crazy idea is to do diverse-double compilation for a JIT compiler)
Diverse-double compilation in a CI/CD Pipeline
Description: C is a fundamental programming language for system-level software. Given its widespread use, the C compiler is a prime target for trusting trust attacks. This thesis aims to explore the systematic use of diverse-double compilation (DDC) in a modern Continuous Integration/Continuous Deployment (CI/CD) pipeline. You will design, implement and evaluate DDC in a CI/CD environment.
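The core check of diverse-double compilation can be sketched as follows, under the assumption that compilation is deterministic (reproducible builds). The compiler source is compiled with an independent trusted compiler, the result is used to compile the same source again, and the final binary is compared bit for bit with the binary under test. Everything below is a toy model: `run_compiler` is a hypothetical stand-in for actually executing a compiler, and the byte strings are not real binaries.

```python
import hashlib


def run_compiler(compiler_binary: bytes, source: str) -> bytes:
    """Toy stand-in for executing a compiler binary on a source file.

    A toy binary has the form b"emitted-by=<X>|program=<Y>": it behaves as
    specified by Y, and X records which code generator emitted it.
    """
    behaviour = compiler_binary.split(b"|program=", 1)[1]
    return b"emitted-by=" + behaviour + b"|program=" + source.encode()


def ddc_check(trusted_compiler: bytes, compiler_source: str, claimed_binary: bytes) -> bool:
    """Return True if the claimed binary matches the doubly compiled one."""
    stage1 = run_compiler(trusted_compiler, compiler_source)  # built by the trusted compiler
    stage2 = run_compiler(stage1, compiler_source)            # rebuilt by the stage-1 result
    return hashlib.sha256(stage2).digest() == hashlib.sha256(claimed_binary).digest()


# Hypothetical inputs for the toy model.
TRUSTED = b"emitted-by=vendor|program=trusted-compiler"
SOURCE_A = "compiler-A"
HONEST_A = run_compiler(b"emitted-by=anything|program=compiler-A", SOURCE_A)
TAMPERED_A = HONEST_A + b"+backdoor"
```

In a CI/CD setting, a check of this kind would run as a pipeline step and fail the build on a mismatch.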
Dynamic Integrity Verification & Repair for Java Applications
Description: Attackers constantly try to tamper with the code of software applications in production. Chang and Atallah have proposed a technique to not only detect modifications but also repair the code after attacks, using a network of small security units called guards. These guards can be programmed to perform tasks such as checksumming the program code, and they work in concert to create mutual protection. In this thesis, you will devise, implement and evaluate such an approach in the context of modern Java software with dependencies. An open question is how to set up guards inside or around dependency code.
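A minimal sketch of the checksum-and-repair idea is shown below. It is heavily simplified: the guards here are plain Python objects living outside the protected code, whereas in the original technique they are embedded in the program and also protect each other. The class name `Guard`, the region names, and the byte strings are hypothetical.

```python
import hashlib


def digest(code: bytes) -> str:
    return hashlib.sha256(code).hexdigest()


class Guard:
    """Protects one code region with a checksum and a clean copy."""

    def __init__(self, target: str, code: bytes):
        self.target = target
        self.expected = digest(code)
        self.clean_copy = bytes(code)

    def check_and_repair(self, memory: dict) -> bool:
        """Restore the region if it was modified; return True if a repair happened."""
        if digest(memory[self.target]) == self.expected:
            return False
        memory[self.target] = self.clean_copy
        return True


def run_guards(guards, memory):
    """Run every guard once and report which regions were repaired."""
    return [guard.target for guard in guards if guard.check_and_repair(memory)]
```

For example, if an attacker overwrites one region after the guards are installed, the next guard pass restores it and leaves the untouched regions alone.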
Embedding the software supply chain at runtime with Java classloaders
Description: In Java, class loading refers to retrieving the binary form of a class or interface and constructing, from that binary form, a class object to represent the class or interface [1]. Today, different subclasses of the `ClassLoader` may implement different loading policies [2]. For example, a class loader may cache the binary representation of a class, prefetch it based on expected usage, or load a group of related classes together. These activities may not be completely transparent to a running application. In this context, determining the third-party suppliers of classes loaded at runtime allows for controlling and hardening the software supply chain of third-party components used during program execution. Monitoring the origins of the “actually” executed code is a critical task for building more reliable and secure systems. The student will design and implement a novel software tool to build a representation of the software supply chain at runtime.
The Java® Virtual Machine Specification. Chapter 5. 01182103
Sharing the runtime representation of classes across class loaders
Investigation of the Software Supply Chain of Crypto Wallets
Description: Software supply chain attacks compromise target applications through their software dependencies. In the context of crypto, a successful attack results in the loss of funds for all users of a compromised wallet. For example, the Copay wallet and its users were victims of such an attack in 2018. You will perform an in-depth investigation of the major crypto wallets, in order to rank them with respect to software supply chain security and propose actionable improvements.
Security Aspects of Cryptocurrency Wallets-A Systematic Literature Review
Backstabber's knife collection: A review of open source software supply chain attacks
Category Crypto & Smart Contracts
Make it Offensive: Using AI to generate offensive code for smart contracts in Solidity.
Recently, AI code generators have proven effective at generating offensive code in popular languages like Python, but how much can we rely on them for other programming languages? Solidity is a relatively new programming language whose main purpose is writing smart contracts (SC). When developing SC, it is essential that the code is safe and secure, to prevent exploits and the resulting loss of significant funds. In this thesis, you will recreate the experiments from [1] in the context of smart contracts in Solidity. With your contribution, we can determine to what extent AI code generators can be a reliable companion when working on Solidity.
Building a Rock Solid Dataset of Smart Contracts for Machine Learning
Description: This Master's thesis will explore the development of a high-quality dataset of secure Solidity smart contracts. Such datasets are of extreme importance for any kind of machine learning on Solidity code. The primary objective is to focus specifically on those contracts that manage a large amount of funds. To accomplish this, the study will involve data collection, preprocessing, and analysis of smart contracts from the Ethereum blockchain.
Scrawld: A dataset of real world ethereum smart contracts labelled with vulnerabilities
Performance benchmarking of smart contracts to assess miner incentives in Ethereum
Automated Program Repair for Smart Contracts
Description: Smart contracts are software, and hence cannot be perfect. Smart contracts suffer from bugs, some of which put high financial stakes at risk. There is a new line of research on automated patching of smart contracts. You will devise, perform and analyze a comparative experiment to identify the successes, challenges and limitations of automated program repair for smart contracts.
Elysium: Automagically Healing Vulnerable Smart Contracts Using Context-Aware Patching
EVMPatch: Timely and automated patching of ethereum smart contracts
On-chain code coverage
Programs executed on the Ethereum blockchain are defined through smart contracts. Solidity is the de facto programming language used to implement smart contracts. Since much is at stake, good test coverage is essential for Solidity programs [1]. Coverage in production gives additional information about field usage [2], and the blockchain is a fully reproducible production workload. You will design and perform experiments to study production coverage in the context of smart contracts written in Solidity.
solidity-coverage: https://github.com/sc-forks/solidity-coverage
Behavioral execution comparison: Are tests representative of field behavior? 2017.
Behavioral Hardening for Blockchain Nodes
Description: An important concept in software security is to protect resources with whitelists. It has been implemented at different levels of the software stack (kernel, virtual machines, application frameworks). In Bitcoin Core, whitelists of system calls can be used and enforced via Linux seccomp [1]. From a research perspective, the hard problem is to infer the whitelist of accessible resources via behavioral analysis [2,3]. You will design and perform an experiment to compare different behavior inference techniques for Bitcoin Core.
Automatic Exploit Synthesis for Smart Contracts
Smart contracts typically hold large stakes and, consequently, they are under constant attack by malicious actors. As a countermeasure, engineering smart contracts involves auditing and formal verification. Another option is automatic exploit synthesis. In this thesis, you will evaluate the state of the art of exploit synthesis for smart contracts. You will then design, implement and evaluate a better system that improves upon the state of the art.
ExGen: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities
FlashSyn: Flash Loan Attack Synthesis via Counter Example Driven Approximation
Smart Contract and DeFi Security: Insights from Tool Evaluations and Practitioner Surveys
Effective Mutation Testing for Solidity Smart Contracts
Description: One of the problems with mutation testing is that the developers are overwhelmed by the number of mutants to kill with new tests. One way to approach this problem is to view it as a recommendation problem. The student will design, implement and evaluate a novel technique for automatically prioritizing mutants to be killed in Solidity smart contracts.