Open Theses
Important remark on this page
The following list is by no means exhaustive. There is always student work to be done in various research projects, and many of these projects are not listed here. For those, please look up the individual members of the chair to learn about their specific interests, and contact them directly.
Abbreviations:
- PhD = PhD Dissertation
- BA = Bachelorarbeit, Bachelor's Thesis
- MA = Masterarbeit, Master's Thesis
- GR = Guided Research
- CSE = Computational Science and Engineering
Cloud Computing / Edge Computing / IoT / Distributed Systems
Summary:
In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs use GPT-like architectures that consist of multiple decoder layers. In this project, we study the performance and sustainability gains that LLMs can achieve by using Advanced Matrix Extensions (AMX) CPU technology. AMX is a recent CPU accelerator technology that targets performance improvements for AI workloads.
Testing Environment:
Intel Labs in Bangalore
Project Period:
6 to 9 months
Contact:
TUM: Prof. Michael Gerndt
Intel Labs: Mohamed Elsaid, Ph.D (mohamed.elsaid(at)intel.com)
The topic focuses on vertical and horizontal scaling for deep learning inference applications on serverless computing platforms. Currently, scaling in K8s (Kubernetes) and serverless frameworks relies mainly on horizontal scaling. However, deep learning (DL) applications usually have large models whose parameters consume significant amounts of GPU memory. Horizontal scaling for DL applications means that each replica must load its own copy of the model parameters, which exacerbates memory consumption. Meanwhile, most dedicated inference engines and systems in the cloud, such as NVIDIA Triton, KServe, and KubeRay, use vertical scaling, which means allocating more GPU resources to a replica to meet an increasing request load. Both types of scaling have their own advantages. The topic revolves around designing an auto-scaling system for deep learning applications that supports hybrid auto-scaling on serverless computing platforms, enabling SLO-aware and seamless scaling. The techniques involved include the batch system, tensor migration, and related algorithms for the (vertical and horizontal) auto-scaling mechanism in K8s, as well as tensor storage.
Goals:
1. Propose a vertical scaling/batch mechanism based on horizontal scaling in serverless computing/K8s for more efficient and SLO-aware deep learning inference as well as improving GPU utilization.
2. Experiments and Analysis.
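The hybrid idea described above can be illustrated with a minimal sketch of a scale-up decision. It is not the mechanism to be developed in the thesis. The function name `plan_hybrid_scaling`, the parameter names, and the assumption that throughput grows linearly with a replica's GPU share are all hypothetical simplifications; scale-down and SLO tracking are left out.

```python
def plan_hybrid_scaling(current_shares, request_rate, rate_per_gpu, max_share=1.0):
    """Return the GPU share per replica after a scale-up decision.

    Assumption: throughput grows linearly with the GPU share of a replica.
    """
    required = request_rate / rate_per_gpu  # total GPU share needed for the load
    shares = list(current_shares)
    deficit = required - sum(shares)

    # Vertical step: grow existing replicas first, so no extra model copy is loaded.
    for i, share in enumerate(shares):
        if deficit <= 0:
            break
        grow = min(max_share - share, deficit)
        shares[i] = share + grow
        deficit -= grow

    # Horizontal step: add replicas only for the load that is still uncovered.
    while deficit > 1e-9:
        new_share = min(max_share, deficit)
        shares.append(new_share)
        deficit -= new_share
    return shares


print(plan_hybrid_scaling([0.5], request_rate=250, rate_per_gpu=100))
```

In this sketch a new replica (and thus a new copy of the model parameters) is only created once the existing replicas have reached `max_share`, which mirrors the memory argument made in the topic description.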
Requirements
- Familiarity with C++, Python, Linux shell, K8s, and containers.
- Familiarity with TensorFlow or PyTorch.
- Basic knowledge of CUDA and NVIDIA GPUs.
We offer:
- Thesis in the area that is highly demanded by the industry
- Our expertise in data science and systems areas
- Supervision and support during the thesis
- Access to different systems required for the work
- Opportunity to publish a research paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu(at)tum.de).
This project focuses on the finer-grained utilization of FPGA resources in Data Centers. Currently, the FPGA resources have been partitioned at a granular level, and the project will involve further optimization of resource allocation and scheduling at the system level for FPGA task flows, as well as scheduling and system optimization at the FPGA cluster/Cloud level.
Goals:
1. System Optimization for Fine-grained FPGA Usage in K8s.
2. Experiments and Analysis.
Requirements
- Basic knowledge of C/C++, Linux shell, K8s, and containers.
- Good knowledge of FPGAs.
We offer:
- Thesis in the area that is highly demanded by the industry
- Our expertise in data science and systems areas
- Supervision and support during the thesis
- Access to different systems required for the work
- Opportunity to publish a research paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu(at)tum.de).
Large Language Models (LLMs) have revolutionized natural language processing (NLP) tasks, but their deployment remains challenging due to high memory and computational demands. While modern GPUs offer substantial processing power, their large-memory variants are prohibitively expensive. In contrast, CPU memory is more cost-effective but lacks the computational efficiency of GPUs. To optimize deployment costs while maintaining performance, this project explores a hybrid CPU-GPU strategy, where model parameters and KV cache are intelligently offloaded to CPU memory. We investigate efficient offloading techniques, dynamic memory management, and workload balancing strategies to enhance inference efficiency. Our research aims to bridge the gap between performance and cost-effectiveness in large-scale LLM deployment.
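As a rough illustration of the offloading decision, the following sketch places tensors greedily under a GPU memory budget. It is one simple heuristic, not the strategy the project will develop. The function `place_tensors`, the tuple layout, and the access-frequency profile are hypothetical.

```python
def place_tensors(tensors, gpu_budget_gb):
    """Greedy placement: keep the most frequently used bytes on the GPU.

    tensors: list of (name, size_gb, access_frequency) tuples (hypothetical profile).
    """
    # Rank by accesses per GB so that small, hot tensors are placed first.
    ranked = sorted(tensors, key=lambda t: t[2] / t[1], reverse=True)
    placement, used = {}, 0.0
    for name, size_gb, _freq in ranked:
        if used + size_gb <= gpu_budget_gb:
            placement[name] = "gpu"
            used += size_gb
        else:
            placement[name] = "cpu"  # offloaded to host memory
    return placement


profile = [
    ("attention", 2.0, 1.0),
    ("expert_a", 4.0, 0.6),
    ("expert_b", 4.0, 0.1),
    ("kv_cache", 3.0, 0.9),
]
print(place_tensors(profile, gpu_budget_gb=6.0))
```

A static ranking like this ignores transfer costs and changing access patterns, which is where the dynamic memory management mentioned above comes in.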
References:
1. https://dl.acm.org/doi/10.5555/3618408.3619696
2. https://github.com/kvcache-ai/ktransformers
3. https://docs.ray.io/en/latest/index.html
4. PowerInfer, dl.acm.org/doi/pdf/10.1145/3694715.3695964
Requirements
- Basic knowledge of LLMs, the Ray system, LLM inference, and at least one LLM inference system, e.g., vLLM or SGLang.
We offer:
- Thesis in the area that is highly demanded by the industry
- Our expertise in data science and systems areas
- Supervision and support during the thesis
- Access to different systems required for the work
- Opportunity to publish a research paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu(at)tum.de).
Currently, more and more autonomous driving systems use end-to-end model frameworks [1] [2] [3], and the number of parameters in these models keeps increasing. However, the GPU resources available in vehicles are very limited. Efficiently deploying large end-to-end models on these GPUs is highly challenging and requires more rational and finer-grained resource allocation to improve GPU utilization and task throughput. This research will investigate how to design and implement an efficient and general mechanism for automatic GPU resource allocation optimization, based on end-to-end model frameworks and fine-grained GPU resource reuse. The implementation will be based on the Baidu Apollo autonomous driving system [4].
[1] https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving
[2] https://developer.nvidia.com/blog/end-to-end-driving-at-scale-with-hydra-mdp/
https://github.com/NVlabs/Hydra-MDP
[3] https://github.com/OpenDriveLab/UniAD
[4] https://github.com/ApolloAuto/apollo
Goals:
1. Deploy an end-to-end autonomous driving framework on a resource-limited GPU device.
2. Design an automatic fine-grained GPU allocation mechanism/optimization method for end-to-end autonomous driving frameworks.
3. Experiments and Performance Analysis.
Requirements
- Familiarity with C++, Python, and Linux shell.
- Familiarity with TensorFlow or PyTorch.
- Basic Knowledge of CUDA and NVIDIA GPU.
We offer:
- Thesis in the area that is highly demanded by the industry
- Our expertise in data science and systems areas
- Supervision and support during the thesis
- Access to different systems required for the work
- Opportunity to publish a research paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu(at)tum.de).
Currently, fine-grained GPU allocation for serverless computing is primarily focused on containers. However, many cloud-based runtime environments are built on microVMs, mainly due to their superior isolation mechanisms. These microVM-based runtimes, however, do not support GPU or fine-grained GPU resource allocation. Therefore, it is necessary to design a GPU allocation mechanism that supports microVM environments.
Reference:
1. Firecracker MicroVM:
https://firecracker-microvm.github.io/
github.com/firecracker-microvm/firecracker
2. vGPU for MicroVM:
https://www.youtube.com/watch?v=Lz98xv4ZxJo
Goals:
1. Implement a basic mechanism to support GPU/vGPU for KVM.
2. Performance analysis.
Requirements
- Good Knowledge of KVM.
- Good Knowledge of CUDA and NVIDIA GPU.
We offer:
- Thesis in the area that is highly demanded by the industry
- Our expertise in data science and systems areas
- Supervision and support during the thesis
- Access to different systems required for the work
- Opportunity to publish a research paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu(at)tum.de).
Background
With the rapid development of Cloud in recent years, attempts have been made to bridge the widening gap between the escalating demand for complex simulations that must meet tight deadlines and the constraints of a static HPC infrastructure. The approach is a Hybrid Infrastructure that combines the current HPC clusters with the seemingly infinite resources of the Cloud. Cloud is flexible and elastic and can be scaled as required.
The BMW Group, which runs thousands of compute-intensive CAEx (Computer Aided Engineering) simulations every day, needs to leverage the various Cloud offerings along with optimal utilization of its current HPC clusters to meet dynamic market demands. Research is therefore being carried out on scheduling moldable CAE workflows in a Hybrid setup to find optimal solutions for different objectives under various constraints.
Goals
- The aim of this work is to develop and implement scheduling algorithms for CAE workflows on a Hybrid Cloud, using an existing simulator and meta-heuristic approaches such as Ant Colony or Particle Swarm Optimization. These algorithms need to be compared against baseline algorithms, some of which have already been implemented in the non-meta-heuristic space.
- The scheduling algorithms should be based on multi-objective optimization methods and be able to handle multiple objectives under strict constraints.
- The effects of workflow moldability, with regard to the type and number of resources required and the extent to which a workflow is moldable, are to be studied and analyzed to find optimal solutions in the solution space.
- Various Cloud offerings should be studied, and the scheduling algorithms should take their different billing and infrastructure models into account when making decisions about resource provisioning and scheduling.
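To make the trade-off between deadline and cost for a moldable job concrete, here is a minimal sketch of the kind of evaluation a scheduler needs. It is an exhaustive baseline, not a meta-heuristic, and it is not part of the existing simulator. The Amdahl's-law runtime model, the per-core-hour billing model, and all names and numbers are assumptions made for illustration.

```python
def runtime_hours(t1_hours, serial_fraction, cores):
    """Amdahl's law: runtime of a moldable job when run on `cores` cores."""
    return t1_hours * (serial_fraction + (1.0 - serial_fraction) / cores)


def cheapest_feasible(t1_hours, serial_fraction, core_options,
                      price_per_core_hour, deadline_hours):
    """Return (cores, hours, cost) of the cheapest option meeting the deadline."""
    best = None
    for cores in core_options:
        hours = runtime_hours(t1_hours, serial_fraction, cores)
        if hours > deadline_hours:
            continue  # constraint violated, option is infeasible
        cost = cores * hours * price_per_core_hour  # assumed per-core-hour billing
        if best is None or cost < best[2]:
            best = (cores, hours, cost)
    return best  # None if no option meets the deadline


print(cheapest_feasible(100.0, 0.1, [1, 2, 4, 8, 16], 0.05, 30.0))
```

With many interdependent jobs, several objectives, and different billing models, the search space grows too large for enumeration, which motivates the meta-heuristic approaches named in the goals.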
Requirements
- Experience or knowledge in Scheduling algorithms
- Experience or knowledge in the principles of Cloud Computing
- Knowledge or Interest in heuristic and meta-heuristic approaches
- Knowledge of algorithmic analysis
- Good knowledge of Python
We offer:
- Collaboration with BMW and its researchers
- Work with industry partners and giants of Cloud Computing such as AWS
- Solve tangible industry-specific problems
- Opportunity to publish a paper with your name on it
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and initiative
- Attendance of feedback discussions on the progress of your thesis
The work is a collaboration between TUM and BMW
Apply now by submitting your CV and grade report to Srishti Dasgupta (srishti.dasgupta(at)bmw.de)
Background: Social good applications, such as environmental monitoring, require several technologies, including federated learning. Implementing federated learning requires a robust balance between the communication and computation costs involved in the hidden layers. Identifying the optimal values for such learning architectures remains a challenge.
Keywords: Edge, Federated Learning, Optimization, Social Good
Research Questions:
1. How to design a decentralized federated learning framework that applies to social good applications?
2. Which optimization parameters need to be considered for efficiently targeting the issue?
3. Are there any optimization algorithms that could deliver a tradeoff between the communication and computation parameters?
Goals: The major goals of the proposed research are given below:
1. To develop a framework that delivers a decentralized federated learning platform for social good applications.
2. To develop at least one optimization strategy that addresses the existing tradeoffs in hidden neural network layers.
3. To compare the efficiency of the algorithms with respect to the identified optimization parameters.
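For readers new to federated learning, the aggregation step at its core can be sketched as follows. This shows FedAvg-style weighted averaging, one common aggregation rule and not necessarily the one this project will use. In a decentralized setting, the same averaging could be applied among neighboring nodes instead of at a central server. The function name and the flat-list representation of model weights are illustrative assumptions.

```python
def fed_avg(client_weights, client_sizes):
    """Average client model weights, weighted by local dataset size.

    client_weights: one flat list of floats per client (all of equal length).
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients; the second holds three times as much data as the first.
print(fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Every aggregation round costs one exchange of model weights, while more local training between rounds costs computation. This is the communication versus computation trade-off raised in the research questions.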
Expectations: Students are expected to have an interest in developing frameworks with an emphasis on federated learning, to work with commitment and participate in the upcoming discussions and feedback sessions (mostly online), and to meet the deadlines specified in the meetings.
For more information, contact: Prof. Michael Gerndt (gerndt@tum.de) and Shajulin Benedict (shajulin@iiitkottayam.ac.in)
Background:
The Flux scheduler is a graph-based hierarchical scheduler for exascale systems that has been developed by the Lawrence Livermore National Laboratory to schedule HPC workloads with an efficient temporal management scheme across a range of HPC resources. Since it is graph-based, resource scheduling can be broken down into different levels. The Flux scheduler has been extended to run on the Cloud as well and is flexible and elastic in its design.
Workflows represent a set of inter-dependent steps to achieve a particular goal. For example, a machine learning workflow is a series of steps for developing, training, validating, and deploying machine learning models. In particular, an LLM (Large Language Model) workflow includes additional steps that reflect the complexities of training and deploying language models. These workflows are iterative in nature, and each stage may inform changes in previous steps as the model’s limitations and new requirements emerge. Accordingly, the computational requirements vary, and the hierarchical scheduler should adapt for each data split by growing and shrinking the resources (Cloud+HPC) to reduce costs, meet deadlines, and achieve optimal resource utilization.
The work would be a part of the new field of Computer Science, namely Converged Computing, that aims to bridge the gap between Cloud technologies and the HPC world.
Goals:
- Extend the Flux scheduler to allow for the growing and shrinking of Hybrid resources according to the demands of the individual jobs or tasks. The communication between these levels needs to be extended in line with the already established protocols of Flux.
- Implement an infrastructure that takes in an input of MLOps workflows and schedules these and the individual tasks according to the resource requirements in each iterative step by growing and shrinking according to the scheduler.
- Implement a Hybrid Infrastructure for the workflows to run on, thus taking advantage of the flexibility and elasticity of Cloud.
Requirements:
- Experience or knowledge or interest in Kubernetes and related cloud computing concepts.
- Preferred: knowledge of MLOps workflows and other related topics
- Preferred: experience working with HPC systems and related software such as SLURM
- Knowledge of scheduling algorithms
- Basic knowledge of C++, Python, Linux shell
- Basic knowledge of TensorFlow or PyTorch
What we expect from you:
- Devotion and persistence (= full-time thesis)
- Critical thinking and ability to think out-of-the-box and take initiatives to explore ideas and use-cases
- Attendance of feedback discussions on the direction and progress of thesis
What we offer:
- Collaboration between a pioneering research center (Lawrence Livermore National Laboratory), a leader in the automobile industry (BMW) and the Technical University of Munich.
- Thesis in an area that is in the highest demand in the current industry; combining MLOps and infrastructure for MLOps, esp. in the Cloud.
- Supervision and collaboration from a multi-disciplinary team during the thesis.
- Opportunity to publish a paper with your name on it.
References:
1. flux-framework/flux-sched: Fluxion Graph-based Scheduler
2. Copy of HPC Knowledge Meeting: Converged Computing Flux Framework - Shared
The work is a collaboration between TUM, LLNL and BMW.
Apply now by submitting your CV and grade report to Srishti Dasgupta (srishti.dasgupta@bmw.de/srishti.dasgupta@tum.de)
Modeling and Analysis of HPC Systems/Applications
This work will be largely run and evaluated on the HPC systems of LRZ, and will also be co-advised by LRZ colleagues.
Background:
We have previously developed a framework for automatically running various benchmarks, which provides a fine-grained level of detail with respect to the configuration of the benchmark and the resources it runs on. Multiple popular benchmarks are included, stressing different aspects of the system, such as memory, cache, or compute. More benchmarks can easily be added. As a second step, the framework automatically processes the raw data from the runs to provide basic statistical analysis. Finally, the data is either visualized or loaded by the sys-sage library (paper) for further use, such as for scheduling and mapping processes/tasks onto resources (nodes, sockets, cores, ...). The main focus of the framework is to seamlessly collect performance data and to evaluate performance variations on CPUs in (large-scale production) HPC systems.
Task:
In this thesis, we want to explore characteristics of modern supercomputers and provide a deeper analysis of these – bringing understanding to the performance variations the users of these systems face. You will prepare and run large-scale experiments of the existing benchmarking framework on one or multiple supercomputers (SuperMUC-NG, NG Phase2, MPG, …), and analyze the resulting data. We are interested in multiple aspects:
- Analyzing the performance variability on (the whole or some reasonable fraction of) the system, to see how much performance variability users experience.
- Looking more closely at the supercomputer topology — can we see some differences in performance on the island/rack/chassis-level? If so, can we find something like a “representative node of a rack/island” that would indicate what the expected performance on said island/rack/.. would be? If there is such a node, we could just run the benchmarks on this one node and extrapolate the performance of the entire rack/island.
- Showcasing the effects on performance, and indicating that the benchmark data can be used for better scheduling. (The import of the data into sys-sage for later use already works out of the box, although small updates may be necessary.) Very roughly, we would be looking for a use case where we use sys-sage to make explicit resource allocation decisions based on the performance variability data (e.g., placing performance-critical processes on good-quality nodes/cores). The expected result would be a performance improvement over random resource assignment.
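The first two aspects above can be sketched with standard statistics. This is only an illustration of one possible analysis, not the existing framework's code. The function names, the use of the coefficient of variation as the variability metric, and the "closest to the median" definition of a representative node are assumptions.

```python
import statistics


def coefficient_of_variation(samples):
    """Relative variability of repeated runs: standard deviation over mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


def representative_node(runs_by_node):
    """Pick the node whose mean result is closest to the median of node means."""
    means = {node: statistics.mean(runs) for node, runs in runs_by_node.items()}
    target = statistics.median(means.values())
    return min(means, key=lambda node: abs(means[node] - target))


# Synthetic example values for one rack (not measurements).
rack = {"n1": [10, 10, 10], "n2": [12, 12, 12], "n3": [20, 20, 20]}
print(representative_node(rack))
print(coefficient_of_variation([8, 12]))
```

Whether a single node can stand in for a whole rack or island depends on how small the spread across its nodes turns out to be, which is exactly what the experiments would have to show.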
Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek at tum.de) of the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and/or Amir Raoofy (Amir.Raoofy at lrz.de) from LRZ and attach your CV & transcript of records.
Published on 06.03.2025
This work will be largely run and evaluated on the HPC systems of LRZ, and will also be co-advised by LRZ colleagues.
Background:
CAPS TUM has previously developed a framework for automatically running various benchmarks, which provides a fine-grained level of detail with respect to the configuration of the benchmark and the resources it runs on. Multiple popular benchmarks are included, stressing different aspects of the system, such as memory, cache, or compute. More benchmarks can easily be added. As a second step, the framework automatically processes the raw data from the runs to provide basic statistical analysis. Finally, the data is either visualized or loaded by the sys-sage library (paper) for further use, such as for scheduling and mapping processes/tasks onto resources (nodes, sockets, cores, ...). The main focus of the framework is to seamlessly collect performance data and to evaluate performance variations on CPUs in (large-scale production) HPC systems.
Parallel to this, LRZ has developed AutoBench, which is a platform for automated and reproducible benchmarking in HPC Testbeds. It is capable of setting and controlling different SW and HW knobs, having more control over the exact configuration of the test system. It enables benchmarking the tested system with different configurations, so that the effect of these configurations can also be analyzed.
Task:
The first part of this thesis consists of integrating these two approaches, so that we can benefit from the fine granularity and the focus on performance variations of the CAPS tool as well as from the control over the system configuration on the LRZ side.
Once the integration is in place, there are multiple analyses and experiments to have a look at. Due to time constraints, we will likely restrict the work to one or two of the following:
- Exploring variability in different configurations: can we see differences in the performance variability of the benchmarks when we run them on differently configured nodes? (Do all benchmarks show the same changes in variability, or do they differ?)
- Finding optimal configurations for different workloads: do all benchmarks show higher/lower variability with a certain configuration? Is one configuration better for some benchmarks and another configuration better for others? For instance, can we find (different) optimal configurations for compute- vs. memory-heavy benchmarks/jobs?
- Exploring differences in variability across configurations and nodes: when always comparing nodes with identical configurations, do some nodes always perform worse than others regardless of the configuration? Are there configurations that amplify/reduce the differences in performance variability?
Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek at tum.de) of the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and/or Amir Raoofy (Amir.Raoofy at lrz.de) from LRZ and attach your CV & transcript of records.
Published on 06.03.2025
Background:
HPC systems are becoming increasingly heterogeneous as a consequence of the end of Dennard scaling, the slowing of Moore's law, and various emerging applications, including LLMs, HPDAs, and others. At the same time, HPC systems consume a tremendous amount of power (potentially over 20 MW), which requires sophisticated power management schemes at different levels, from a node component to the entire system. Driven by these trends, we are studying sophisticated resource and power management techniques specifically tailored for modern HPC systems, as part of the Regale project (https://regale-project.eu/).
Research Summary:
In this work, we will focus on co-scheduling (co-locating multiple jobs on a node to minimize wasted resources) and/or power management on HPC systems, with a particular focus on heterogeneous computing systems consisting of multiple different processors (CPU, GPU, etc.) or memory technologies (DRAM, NVRAM, etc.). Recent hardware components generally support a variety of resource partitioning and power control features, such as cache/bandwidth partitioning, compute resource partitioning, clock scaling, power/temperature capping, and others, controllable via privileged software. You will first pick some of them and investigate their impact on HPC applications in terms of performance, power, energy, etc. You will then build an analytical or empirical model to predict the impact and develop a control scheme that optimizes the knob settings using your model. You will use hardware available in CAPS Cloud (https://www.ce.cit.tum.de/caps/hw/caps-cloud/) or LRZ Beast machines (https://www.lrz.de/presse/ereignisse/2020-11-06_BEAST/) to conduct your study.
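Before any predictive model is built, the final selection step of such a control scheme can be sketched directly on measured data. The example below picks the power cap with the lowest energy among those that stay within a tolerated slowdown. The function name, the tuple layout, the slowdown bound, and the numbers are hypothetical. In the thesis, a model would supply predictions in place of exhaustive measurements.

```python
def pick_power_cap(measurements, max_slowdown=0.25):
    """Choose the power cap with the lowest energy under a slowdown bound.

    measurements: list of (cap_watts, runtime_s, avg_power_w) tuples.
    max_slowdown: tolerated runtime increase relative to the fastest setting.
    """
    fastest = min(runtime for _cap, runtime, _power in measurements)
    limit = fastest * (1.0 + max_slowdown)
    feasible = [m for m in measurements if m[1] <= limit]
    # Energy in joules is average power (W) times runtime (s).
    return min(feasible, key=lambda m: m[1] * m[2])[0]


# Synthetic example values (not measurements).
samples = [(100, 20.0, 95.0), (150, 12.0, 140.0), (200, 10.0, 190.0)]
print(pick_power_cap(samples))
```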
Requirements:
- Basic knowledge/skills on computer architecture, high performance computing, and statistics
- Basic knowledge/skills on surrounding areas would also help (e.g., machine learning, control theory, etc.).
- In general, we would be very happy to guide anyone who is self-motivated, capable of critical thinking, and curious about computer science.
- We don't want you to be too passive – you are supposed to think and try things yourself to some extent, instead of fully following our instructions step by step.
- If your main goal is passing with any grade (e.g., 2.3), we'd suggest you look into a different topic.
See also our former studies:
- Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz "Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach" In Proceedings of IEEE International Conference on Cluster Computing (CLUSTER), pp.185-196, Nov. (2023)
- Issa Saba, Eishi Arima, Dai Liu, Martin Schulz "Orchestrated Co-Scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning" In Proceedings of 35th International Conference on Architecture of Computing Systems (ARCS), pp.51-67, Sep. (2022)
- Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz "Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps" In Proceedings of International Conference on Parallel Processing Workshops, no. 9, pp.1-10, Aug. (2022)
- Eishi Arima, Toshihiro Hanawa, Carsten Trinitis, Martin Schulz "Footprint-Aware Power Capping for Hybrid Memory Based Systems" In Proceedings of the 35th International Conference on High Performance Computing, ISC High Performance (ISC), pp.347-369, Jun. (2020)
Contact:
Dr. Eishi Arima, eishi.arima@tum.de, https://www.ce.cit.tum.de/caps/mitarbeiter/eishi-arima/
Prof. Dr. Martin Schulz
Description:
Benchmarks are an essential tool for performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system which is to be procured. New generations of HPC systems often serve the current and evolving needs of the applications for which the system is procured. Therefore, with new generations of HPC systems, the proxy applications and benchmarks used to assess the systems’ performance are also selected for the specific needs of the system. Only a few of these have stayed persistent over longer time periods. At the same time, the quality of benchmarks is typically not questioned, as they are seen to only be representatives of specific performance indicators.
This work aims to provide a more systematic approach with the goal of evaluating benchmarks targeting the memory subsystem, looking at capacity, latency, and bandwidth.
Problem statement:
How can benchmarks used to assess memory performance, including cache usage, be systematically compared with one another?
Description:
Benchmarks are an essential tool for performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system which is to be procured. With new generations of HPC systems, the selected proxy applications and benchmarks are often exchanged, and benchmarks for the specific needs of the system are selected. Only a few of these have stayed persistent over longer time periods. At the same time, the quality of benchmarks is typically not questioned, as they are seen to only be representatives of specific performance indicators.
This work aims to provide a more systematic approach with the goal of evaluating benchmarks targeting network performance, namely regarding MPI (Message Passing Interface), in both functional tests and benchmark applications.
Problem statement:
How can benchmarks used to assess network performance, using MPI routines, be systematically compared with one another?
Description:
Benchmarks are an essential tool for performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system which is to be procured. New generations of HPC systems often serve the current and evolving needs of the applications for which the system is procured. Therefore, with new generations of HPC systems, the proxy applications and benchmarks used to assess the systems’ performance are also selected for the specific needs of the system. Only a few of these have stayed persistent over longer time periods. At the same time, the quality of benchmarks is typically not questioned, as they are seen to only be representatives of specific performance indicators.
This work aims to evaluate benchmarks for input and output (I/O) performance in order to provide a systematic approach for evaluating benchmarks that target read and write performance of different characteristics, as seen in application behavior and mimicked by benchmarks.
Problem statement:
How can benchmarks used to assess I/O performance be systematically compared with one another?
Memory Management and Optimizations on Heterogeneous HPC Architectures
Background
sys-sage (https://github.com/caps-tum/sys-sage) is a library for capturing and manipulating the hardware topology of compute systems and their attributes. It collects, stores, and provides different kinds of information regarding an HPC node, heterogeneous chips, such as CPUs or GPUs, or their components, such as caches, cores, or thread blocks. This information is needed by various users in the areas of scheduling, power management, or performance optimization, to name a few examples.
sys-sage, being a software library, is available to a single process running on a single node. However, in real HPC systems, we have multiple nodes running multiple processes each.
Task:
The goal of this work is to design, implement, and test a mechanism to export arbitrary data contained in sys-sage from one process (as an API call) to a selected output format, and to implement a respective import mechanism enabling import of the data. The final goal is to completely recreate the sys-sage internal representation in another process. There is an XML export and import functionality already implemented, which could serve as a basis. Moreover, a prototypical implementation using shared memory (mmap) is also available – this can serve as a basis for a further, robust, implementation.
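The export/import round trip described above can be illustrated with a small, self-contained sketch. It does not use the sys-sage API. The `Component` class, its fields, and the choice of JSON as the output format are hypothetical stand-ins for a tree-shaped topology representation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical topology node (not a sys-sage class)."""
    kind: str
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def export_tree(root):
    """Serialize the component tree rooted at `root` to a JSON string."""
    def to_dict(c):
        return {"kind": c.kind, "name": c.name, "attrs": c.attrs,
                "children": [to_dict(child) for child in c.children]}
    return json.dumps(to_dict(root))


def import_tree(text):
    """Rebuild the component tree from a JSON string made by export_tree."""
    def from_dict(d):
        return Component(d["kind"], d["name"], d["attrs"],
                         [from_dict(child) for child in d["children"]])
    return from_dict(json.loads(text))


node = Component("node", "n0", {"sockets": 1}, [
    Component("cache", "L3", {"size_kib": 32768}, [Component("core", "c0")]),
])
print(import_tree(export_tree(node)) == node)
```

The property checked on the last line, that importing an export reproduces an equal structure, is the kind of test a robust implementation would need for every supported output format.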
Contact:
In case of interest or any questions, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.
Published on 6.3.2025
Background:
GPUScout (paper) is a performance analysis tool developed at TUM that analyzes NVidia CUDA kernels with the aim of identifying common pitfalls related to data movement on a GPU. It combines static SASS code analysis, PC stall sampling, and NCU metrics collection to identify bottlenecks, assess their severity, and provide additional information about the identified code sections.
On top of the original implementation, as presented in the paper, we have also added a GUI, presenting the findings in a user-friendly manner, so that all the useful information is clearly visible and correlated together, giving the user an easy-to-navigate interface.
Moreover, we also have a prototypical implementation of GPUScout for AMD GPUs, so the analysis is now available for NVIDIA as well as AMD GPUs. Nevertheless, there are a few drawbacks as of now:
- The AMD implementation is only a prototype, with some information collection still unavailable;
- The two implementations live in separate code bases; and
- The GUI functionality for AMD is not available.
Task:
The goal of this thesis is to design and develop a unified, vendor-agnostic version of GPUScout and its GUI. This will be done by tackling the drawbacks mentioned above and making sure the different pieces work well together to form one tool.
Apart from this implementation task, Master's Thesis students will also extend the list of bottleneck analyses, covering more performance problems present in GPU codes.
Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.
Published on 6.3.2025
Context:
- hwloc gives a good overview of the CPU memory hierarchy. The sys-sage library uses this information to represent the CPU.
- However, hwloc provides very little information about the memory and compute unit hierarchy/grouping on GPUs. Therefore, we need a different way to gather these data.
- We have developed a first version of a microbenchmark-based approach to obtain the compute and memory hierarchy information on NVIDIA GPUs (https://github.com/caps-tum/mt4g).
- On top of the NVIDIA implementation, core parts for AMD GPUs are implemented as well.
Tasks/Goals:
- The first goal of this work is to extend the AMD GPU part to cover the same depth/level of information as the NVIDIA part.
- Next, we will unify the existing NVIDIA part with the extended AMD part.
- As a part of this work, you will have to make sure the architectural differences between the vendors are covered.
- (For Master Thesis; BA optional) Finally, you can extend the microbenchmarking to include additional attributes, such as bandwidth (you can also plug in existing bandwidth benchmarks), or improve the precision of existing microbenchmarks, such as L1 size.
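As a rough illustration of how a microbenchmark can reveal a cache size, the sketch below looks for the first large jump in access latency as the working set grows. It is not taken from mt4g: the function name, the `jump_factor` threshold, and the latency numbers are all made up for this example.

```python
def detect_capacity(samples, jump_factor=1.5):
    """Return the largest working-set size seen before the first latency jump.

    samples: list of (size_bytes, latency_cycles) pairs sorted by size.
    jump_factor: hypothetical threshold; a step counts as a jump when the
    latency grows by more than this factor from one sample to the next.
    """
    for (size, latency), (_, next_latency) in zip(samples, samples[1:]):
        if next_latency > jump_factor * latency:
            return size
    return None  # no jump found in the measured range


# Made-up measurements: latency stays flat while the data fits in the cache.
samples = [
    (16 * 1024, 30.0),
    (32 * 1024, 31.0),
    (64 * 1024, 30.5),
    (128 * 1024, 31.2),
    (256 * 1024, 190.0),
    (512 * 1024, 195.0),
]
l1_guess = detect_capacity(samples)
print(l1_guess)
```

With power-of-two steps the estimate is only accurate to within a factor of two; sampling more finely around the detected jump is one way the precision of such a benchmark could be improved.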
Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz).
Updated on 6.3.2025
Various MPI-Related Topics
Please note: MPI is a high-performance programming model and communication library designed for HPC applications. It is designed and standardised by the members of the MPI Forum, which includes various research, academic, and industrial institutions. The current chair of the MPI Forum is Prof. Dr. Martin Schulz. The following topics are all available as Master's Thesis and Guided Research. They will be advised and supervised by Prof. Dr. Martin Schulz himself, with the help of researchers from the chair. If you are very familiar with MPI and parallel programming, please don't hesitate to send an email to Prof. Dr. Martin Schulz. These topics are mostly related to current research and active discussions in the MPI Forum, which are subject to standardisation in the coming years. Your contributions to these topics may make you a contributor to the MPI Standard, and your implementation may become part of the Open MPI code base. Many of these topics require collaboration with other MPI research bodies, such as Lawrence Livermore National Laboratory and the Innovative Computing Laboratory. Some of these topics may require you to attend MPI Forum meetings, which take place in the late afternoon (to accommodate participants in time zones worldwide). Generally, these advanced topics may require more effort to understand and may be more time-consuming, but they are also more prestigious.
LAIK is a new programming abstraction developed at LRR-TUM
- Decouple data decomposition and computation, while hiding communication
- Applications work on index spaces
- Mapping of index spaces to nodes can be adaptive at runtime
- Goal: dynamic process management and fault tolerance
- Current status: works on standard MPI, but no dynamic support
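The idea of remapping an index space at runtime can be sketched as follows. This is not LAIK code: the block partitioning scheme and both function names are assumptions chosen for illustration, and ranks are assumed to keep their identity when the last process is removed.

```python
def block_partition(n_indices, n_procs):
    """Split range(n_indices) into n_procs contiguous blocks, as evenly as possible."""
    base, extra = divmod(n_indices, n_procs)
    parts, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


def transfers(old, new):
    """List (src, dst, first, last_exclusive) for index ranges that change owner."""
    moves = []
    for src, o in enumerate(old):
        for dst, n in enumerate(new):
            lo, hi = max(o.start, n.start), min(o.stop, n.stop)
            if lo < hi and src != dst:
                moves.append((src, dst, lo, hi))
    return moves


old = block_partition(12, 4)  # four processes own three indices each
new = block_partition(12, 3)  # process 3 leaves; three processes own four each
moves = transfers(old, new)
print(moves)
```

Because the application only works on index spaces, the list of moves is all that the runtime needs in order to generate the communication on the application's behalf.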
Task 1: Port LAIK to Elastic MPI
- New model developed locally that allows process additions and removal
- Should be very straightforward
Task 2: Port LAIK to ULFM
- Proposed MPI FT Standard for “shrinking” recovery, prototype available
- Requires refactoring of code and evaluation of ULFM
Task 3: Compare performance with direct implementations of same models on MLEM
- Medical image reconstruction code
- Requires porting MLEM to both Elastic MPI and ULFM
Task 4: Comprehensive Evaluation
ULFM (User-Level Fault Mitigation) is the current proposal for MPI Fault Tolerance
- Failures make communicators unusable
- Once detected, communicators can be “shrunk”
- Detection is active and synchronous by capturing error codes
- Shrinking is collective, typically after a global agreement
- Problem: can lead to deadlocks
Alternative idea
- Make shrinking lazy and with that non-collective
- New, smaller communicators are created on the fly
Tasks:
- Formalize non-collective shrinking idea
- Propose API modifications to ULFM
- Implement prototype in Open MPI
- Evaluate performance
- Create proposal that can be discussed in the MPI forum
ULFM works on the classic MPI assumptions
- Complete communicator must be working
- No holes in the rank space are allowed
- Collectives always work on all processes
Alternative: break these assumptions
- A failure creates communicator with a hole
- Point to point operations work as usual
- Collectives work (after acknowledgement) on reduced process set
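The difference between classic shrinking and a communicator with holes can be modelled in a few lines. This is a toy model in plain Python, not MPI or ULFM code: the `HoleyComm` class and all of its methods are hypothetical and only mirror the three bullet points above.

```python
def shrink(ranks_alive):
    """Classic shrinking: survivors are renumbered densely as 0..n-1.

    Returns a mapping from old rank to new rank.
    """
    return {old: new for new, old in enumerate(sorted(ranks_alive))}


class HoleyComm:
    """Toy communicator that keeps the original rank numbering after a failure."""

    def __init__(self, size):
        self.size = size
        self.failed = set()
        self.acknowledged = set()

    def mark_failed(self, rank):
        self.failed.add(rank)

    def acknowledge(self):
        """The application confirms that it has seen all current failures."""
        self.acknowledged = set(self.failed)

    def send(self, src, dst):
        """Point-to-point works as usual between live ranks."""
        if src in self.failed or dst in self.failed:
            raise RuntimeError("peer has failed")
        return (src, dst)

    def allreduce_sum(self, values):
        """Collectives run on the reduced process set, but only after acknowledgement."""
        if self.failed != self.acknowledged:
            raise RuntimeError("unacknowledged failure")
        return sum(v for rank, v in enumerate(values) if rank not in self.failed)


comm = HoleyComm(4)
comm.mark_failed(2)
print(shrink({0, 1, 3}))      # rank 3 becomes rank 2 after shrinking
print(comm.send(0, 3))        # with a hole, rank 3 keeps its number
comm.acknowledge()
print(comm.allreduce_sum([1, 2, 3, 4]))
```

Keeping rank numbers stable means that application data structures indexed by rank do not need to be rebuilt after a failure, which is one possible motivation for breaking the dense-rank assumption.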
Tasks:
- Formalize “hole-y” shrinking
- Propose new API
- Implement prototype in Open MPI
- Evaluate performance
- Create proposal that can be discussed in the MPI Forum
With MPI 3.0, MPI added a second tools interface: MPI_T
- Access to internal variables
- Query, read, write
- Performance and configuration information
- Missing: event information using callbacks
- New proposal in the MPI Forum (driven by RWTH Aachen)
- Add event support to MPI_T
- Proposal is rather complete
Tasks:
- Implement prototype in either Open MPI or MVAPICH
- Identify a series of events that are of interest
- Message queuing, memory allocation, transient faults, …
- Implement events for these through MPI_T
- Develop tool using MPI_T to write events into a common trace format
- Performance evaluation
Possible collaboration with RWTH Aachen
PMIx is a proposed resource management layer for runtimes (for Exascale)
- Enables MPI runtime to communicate with resource managers
- Came out of previous PMI efforts as well as the Open MPI community
- Under active development / prototype available on Open MPI
Tasks:
- Implement PMIx on top of MPICH or MVAPICH
- Integrate PMIx into SLURM
- Evaluate implementation and compare to Open MPI implementation
- Assess and possibly extend interfaces for tools
- Query process sets
MPI was originally intended as runtime support not as end user API
- Several other programming models use it that way
- However, often not first choice due to performance reasons
- Especially task/actor based models require more asynchrony
Question: can more asynchronous models be added to MPI?
- Example: active messages
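For readers unfamiliar with active messages: each message names a handler that the receiver runs on arrival, instead of the receiver posting a matching receive. The sketch below models this in a single Python process. The `Endpoint` class and its methods are hypothetical and do not correspond to any MPI API.

```python
from collections import deque


class Endpoint:
    """Toy receiver: messages carry a handler id that is invoked on arrival."""

    def __init__(self):
        self.handlers = {}
        self.inbox = deque()
        self.state = {}

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def deliver(self, handler_id, payload):
        """Stands in for the network: enqueue a message at the target."""
        self.inbox.append((handler_id, payload))

    def progress(self):
        """Run the handler of every pending message; return how many ran."""
        ran = 0
        while self.inbox:
            handler_id, payload = self.inbox.popleft()
            self.handlers[handler_id](self, payload)
            ran += 1
        return ran


def accumulate(endpoint, payload):
    """Example handler: add a value to a named counter on the target."""
    key, value = payload
    endpoint.state[key] = endpoint.state.get(key, 0) + value


target = Endpoint()
target.register(1, accumulate)
target.deliver(1, ("x", 2))   # the sender never waits for a matching receive
target.deliver(1, ("x", 5))
target.progress()
print(target.state)
```

The sender decides what happens at the target, and the target only needs to make progress; this one-sided control flow is what task-based and actor-based runtimes tend to need.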
Tasks:
- Understand communication modes in an asynchronous model
- Charm++: actor based (UIUC)
- Legion: task based (Stanford, LANL)
- Propose extensions to MPI that capture this model better
- Implement prototype in Open MPI or MVAPICH
- Evaluation and Documentation
Possible collaboration with LLNL and/or BSC
MPI can and should be used for more than Compute
- Could be runtime system for any communication
- Example: traffic to visualization / desktops
Problem:
- Different network requirements and layers
- May require different MPI implementations
- Common protocol is unlikely to be accepted
Idea: can we use a bridge node with two MPIs linked to it
- User should see only two communicators, but same API
Tasks:
- Implement this concept coupling two MPIs
- Open MPI on compute cluster and TCP MPICH to desktop
- Demonstrate using on-line visualization streaming to front-end
- Document and provide evaluation
- Warning: likely requires good understanding of linkers and loaders
Field-Programmable Gate Arrays
Field Programmable Gate Arrays (FPGAs) are considered to be the next generation of accelerators. Their advantages range from improved energy efficiency for machine learning to faster routing decisions in network controllers. If you are interested in one of these topics, please send your CV and transcript of records to the specified email address.
Our chair offers various topics available in this area:
- Open-Source EDA tools: If you are interested in exploring open-source EDA tools, especially High Level Synthesis, you can do an exploration of available tools. (dirk.stober(at)tum.de)
- HLS tools: Evaluation of different HLS tools and extension of capabilities (dirk.stober(at)tum.de)
- Memory on FPGA: Exploration of memory on FPGAs and development of profiling tools for AXI interconnects. Development of memory-bound FPGA benchmarks (dirk.stober(at)tum.de).
- Quantum Computing: Your tasks will be to explore architectures that harness the power of traditional computer architecture to control quantum operations and flows. Now we focus on superconducting qubits & neutral atoms control. (xiaorang.guo(at)tum.de)
- Direct network operations: Here, FPGAs are wired closer to the networking hardware itself, which makes it possible to bypass the network stack that regular CPU-style communication would be exposed to. Your task would be to investigate FPGAs that can interact with the network more directly than CPU-based approaches. ( martin.schreiber(at)tum.de )
- Linear algebra: Your task would be to explore strategies to accelerate existing linear algebra routines on FPGA systems by taking into account applications requirements. ( martin.schreiber(at)tum.de )
- Varying accuracy of computations: The granularity of current floating-point computations is 16, 32, or 64 bit. Your work would be on tailoring the accuracy of computations towards what's really required. ( martin.schreiber(at)tum.de )
- ODE solver: You would work on an automatic toolchain for solving ODEs originating from computational biology. ( martin.schreiber(at)tum.de )
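To give a feeling for the varying-accuracy topic in the list above, the following sketch accumulates the same values in 16-, 32-, and 64-bit IEEE floating point and compares the errors. It emulates the narrower formats on a CPU by rounding through Python's `struct` module after every operation; the function names and the test input are chosen only for illustration and say nothing about how an FPGA design would implement custom precisions.

```python
import struct


def round_to(x, fmt):
    """Round a Python float to the format given by a struct code.

    'e' is IEEE half (16 bit), 'f' is single (32 bit), 'd' is double (64 bit).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt):
    """Sum the values, rounding each input and every partial sum to the format."""
    total = 0.0
    for v in values:
        total = round_to(total + round_to(v, fmt), fmt)
    return total


values = [0.1] * 1000  # the exact sum would be 100
errors = {fmt: abs(accumulate(values, fmt) - 100.0) for fmt in ("e", "f", "d")}
for fmt, err in errors.items():
    print(fmt, err)
```

The gap between the three errors shows why fixed steps of 16, 32, or 64 bits can be either wasteful or insufficient for a given application, which is the motivation for tailoring the precision to what is actually required.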
Quantum Computing
Background
Quantum computing faces a critical challenge: the high error rates of qubits. Quantum error correction (QEC) codes (e.g., surface codes) protect quantum information through redundancy and real-time error mitigation. However, practical implementations require efficient hardware platforms. FPGAs, with their parallel processing capabilities and low latency, are ideal for prototyping QEC schemes. This project aims to implement a QEC code (e.g., surface code) on an FPGA and explore its potential for real-time error correction.
Task
1. Study the principles of QEC codes; investigate decoding algorithms for QEC (e.g., minimum-weight perfect matching (MWPM), QULATIS…)
2. Hardware Implementation:
- Design and implement the decoder on an FPGA.
- Optimize resource utilization (logic units, memory) and latency, leveraging FPGA parallelism.
3. Performance Evaluation:
- Analyze error correction success rates, latency, and resource efficiency via simulations and hardware testing.
- Compare the impact of code distances on the correction performance.
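The effect of code distance on correction performance can be previewed with a much simpler code than the surface code. The sketch below swaps in the classical repetition code with majority-vote decoding, assuming independent bit flips with probability `p`; it is a stand-in for illustration, not one of the decoders named in the task.

```python
from math import comb


def decode(bits):
    """Majority vote over an odd number of received bits."""
    return 1 if sum(bits) > len(bits) // 2 else 0


def logical_error_rate(distance, p):
    """Probability that majority voting fails for a distance-d repetition code.

    Assumes independent bit flips with probability p and an odd distance.
    Decoding fails when more than distance // 2 bits are flipped.
    """
    correctable = distance // 2
    return sum(comb(distance, k) * p**k * (1 - p) ** (distance - k)
               for k in range(correctable + 1, distance + 1))


p = 0.01  # assumed physical error rate, chosen for illustration
rates = {d: logical_error_rate(d, p) for d in (3, 5, 7)}
for d, rate in rates.items():
    print(d, rate)
```

Below the threshold of this toy model (`p` < 0.5), each increase in distance suppresses the logical error rate further; a hardware decoder would additionally have to be evaluated for latency and resource usage as the distance grows.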
Requirement
Experience in programming with VHDL/Verilog. Basic understanding of quantum theory.
Contact:
In case of interest or any questions, please contact Xiaorang Guo (xiaorang.guo@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.
Background
Superconducting qubits are among the most promising candidates for quantum information processors. However, their performance is often constrained by slow and error-prone qubit readout, which is crucial for achieving high-fidelity operations. Existing approaches primarily use feedforward neural networks (FNNs) to discriminate qubit states from readout traces (e.g., arXiv:2102.12481). This project aims to develop more advanced machine learning (ML) models to enhance qubit state classification while ensuring efficient FPGA implementation for hardware acceleration.
Task
- Understanding Multiplexed Qubit Readout: Study the current state of FPGA-based qubit readout techniques and gain familiarity with quantum datasets.
- ML Model Development: Investigate and design ML models for qubit state discrimination, with a particular focus on Mamba-based architectures.
- FPGA Implementation: Optimize and deploy the trained ML model on FPGA for real-time qubit readout.
As this task is complex, we will split this general focus into multiple sub-tasks. They can be adjusted to the student's education status (BSc/MSc) and level of expertise in the area.
Requirement
- Experience in FPGA programming (VHDL/Verilog).
- Background in machine learning.
Contact:
In case of interest or any questions, please contact Xiaorang Guo (xiaorang.guo@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.
Background: Neutral atom (NA) detection plays a crucial role in NA quantum computers for state preparation and readout. Traditional detection methods often rely on computationally intensive image processing techniques that can be inefficient for real-time applications. Implementing an ML-based detection algorithm on an FPGA provides a hardware-accelerated solution with low latency, high throughput, and energy efficiency, making it suitable for large-scale atom array experiments.
Tasks
1. Evaluate various ML models (e.g., CNNs, Mamba models) and train them on labeled atom datasets.
2. Hardware Implementation:
- Optimize the trained model for FPGA deployment, e.g., with tiling or tensor core-based architectures
- Aim to balance resource utilization (logic units, memory) and latency, leveraging FPGA parallelism.
3. Performance Evaluation:
- Compare accuracy and inference latency with C++/Python-based methods
Required Skills & Resources:
- Background in ML
- Experience with FPGA programming (e.g., VHDL, Verilog, or HLS)
- Familiarity with PYNQ, Xilinx tools, or equivalent platforms
Contact: Xiaorang Guo (xiaorang.guo(at)tum.de), Jonas Winklmann (jonas.winklmann(at)tum.de), Prof. Martin Schulz
Various Thesis Topics in Collaboration with Leibniz Supercomputing Centre
We have a variety of open topics. Get in contact with Josef Weidendorfer or Amir Raoofy
Benchmarking of (Industrial-) IoT & Message-Oriented Middleware
DDS (Data Distribution Service) is a message-oriented middleware standard that is being evaluated at the chair. We develop and maintain DDS-Perf, a cross-vendor benchmarking tool. As part of this work, several open theses regarding DDS and/or benchmarking in general are currently available. This work is part of an industry cooperation with Siemens.
Please see the following page for currently open positions.
Note: If you are conducting industry or academic research on DDS and are interested in collaborations, please check the open positions above or contact Vincent Bode directly.
Applied mathematics & high-performance computing

There are various topics available in the area bridging applied mathematics and high-performance computing. Please note that this will be supervised externally by Prof. Dr. Martin Schreiber (a former member of this chair, now at Université Grenoble Alpes).
This is just a selection of some topics to give some inspiration:
(MA=Master in Math/CS, CSE=Comput. Sc. and Engin.)
- HPC tools:
- Automated Application Performance Characteristics Extraction
- Portable performance assessment for programs with flat performance profile, BA, MA, CSE
- Projects targeting Weather (and climate) forecasting
- Implementation and performance assessment of ML-SDC/PFASST in OpenIFS (collaboration with the European Centre for Medium-Range Weather Forecasts), CSE, MA
- Efficient realization of fast Associated Legendre transformations on GPUs (collaboration with the European Centre for Medium-Range Weather Forecasts), CSE, MA
- Fast exponential and implicit time integration, BA, MA, CSE
- MPI parallelization for the SWEET research software, MA, CSE
- Semi-Lagrangian methods with Parareal, CSE, MA
- Non-interpolating Semi-Lagrangian Schemes, CSE, MA
- Time-splitting methods for exponential integrators, CSE, MA
- Machine learning for non-linear time integration, CSE, MA
- Exponential integrators and higher-order Semi-Lagrangian methods
- Ocean simulations:
- Porting the NEMO ocean simulation framework to GPUs with a source-to-source compiler
- Porting the Croco ocean simulation framework to GPUs with a source-to-source compiler
- Health science project: Biological parameter optimization
- Extending a domain-specific language with time integration methods
- Performance assessment and improvements for different hardware backends (GPUs / FPGAs / CPUs)
If you're interested in any of these projects or are looking for other projects in this area, please send an email to Prof. Dr. Martin Schreiber for further information.
AI or Deep Learning Related Topics
If you have interests regarding the following topics, and would like to know how to implement efficient AI or how to implement AI on different hardware setups, such as:
- DL Applications on Heterogeneous Systems
- Network Compression
- ML and Architecture
- AI for Quantum
- AI on Hardware (e.g. restricted edge devices, Cerebras)
Please feel free to contact dai.liu@tum.de for MA, BA, GR.
Compiler & Language Tools

Background:
Created in the 1950s, Fortran is still the prevailing language in many high-performance computing applications today. Most of the quantum chemistry codes that form the foundations of modern materials science are written in Fortran, and it is not realistically feasible to rewrite these massive and complex computer programs in another language. In light of today’s advances in software development, Fortran might be viewed as a dinosaur, but the language has a rich history and inspired many of the programming paradigms that we take for granted today. In more recent iterations of Fortran’s standardization process, features such as object-oriented programming were added to bring the language closer to current software development practice [1]. One consequence of Fortran having fallen out of fashion for contemporary projects is that the palette of developer tooling remains fairly limited.
Around the mid 2000s, a change of attitude occurred in the industry, which began to take software safety and security much more seriously [2]. This hugely impacted the approach to development, and tremendous innovation and research effort has since been spent on improving code quality, mainly through automated testing for regressions during the development lifecycle. Another successful and nowadays popular technique is static analysis of the source code, beyond the warnings and errors emitted by the compiler, to facilitate the identification of programming mistakes and performance problems and the enforcement of a coherent coding style upfront.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies [3]. The clang-tidy program is a C++ “linter” tool based on the LLVM Clang C/C++ compiler with the intended purpose to provide an extensible framework for diagnosing and fixing typical programming errors, like style violations, interface misuse, or bugs that can be deduced via static analysis. LLVM also ships with a Fortran compiler named Flang [4], which is currently under active development, but is already being used in production as the basis for the AMD Optimizing C/C++ and Fortran Compilers (AOCC). At the moment there does not exist a “flang-tidy” program.
In this thesis we will implement a “flang-tidy” program based on LLVM Flang. This tool will allow us to perform static analysis on a vast number of high-performance applications with focus on electronic structure codes. We offer an interesting hands-on project that deals with the inner workings of one of the most advanced industry-standard compilers and thereby teaches important transferable skills. The ideal candidate has a good knowledge of compiler construction and a strong background in concepts of programming languages, ideally with good knowledge of C++ and some basic familiarity with Fortran.
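To illustrate the kind of diagnostic such a linter could emit, here is a deliberately naive sketch of a single check: flagging program and module units that lack `implicit none`. It works on raw text with regular expressions, which is an assumption made to keep the example short; a real “flang-tidy” would operate on the parse tree produced by Flang, and none of the names below come from LLVM.

```python
import re

# Simplified patterns: only 'program' and 'module' units with an explicit
# 'end program' / 'end module' statement are recognized.
UNIT_START = re.compile(r"^\s*(program|module)\s+(\w+)", re.IGNORECASE)
UNIT_END = re.compile(r"^\s*end\s*(program|module)\b", re.IGNORECASE)
IMPLICIT_NONE = re.compile(r"^\s*implicit\s+none\b", re.IGNORECASE)


def missing_implicit_none(source):
    """Return (line_number, unit_name) for each unit without 'implicit none'."""
    findings = []
    current = None  # [name, start line, saw 'implicit none'] of the open unit
    for line_no, line in enumerate(source.splitlines(), start=1):
        code = line.split("!", 1)[0]  # drop comments; ignores '!' inside strings
        if current is None:
            match = UNIT_START.match(code)
            if match:
                current = [match.group(2), line_no, False]
        elif IMPLICIT_NONE.match(code):
            current[2] = True
        elif UNIT_END.match(code):
            if not current[2]:
                findings.append((current[1], current[0]))
            current = None
    return findings


example = "\n".join([
    "program good",
    "  implicit none",
    "  print *, 'ok'",
    "end program good",
    "",
    "module bad",
    "  integer :: n",
    "end module bad",
])
findings = missing_implicit_none(example)
print(findings)
```

The many cases this text-based version gets wrong (a bare `end`, continuation lines, `!` inside string literals) are exactly why building on a compiler front end is the more robust design.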
References:
[1] J. Reid, The new features of Fortran 2018, ACM SIGPLAN Fortran Forum 37, 5 (2018).
[2] B. Taylor and S. Azadegan, Threading secure coding principles and risk analysis into the undergraduate computer science and information systems curriculum, in Proceedings of the 3rd annual conference on Information security curriculum development, InfoSecCD06 (ACM, 2006).
[3] C. Lattner and V. Adve, LLVM: A compilation framework for lifelong program analysis & transformation, in International Symposium on Code Generation and Optimization, 2004. CGO 2004. (IEEE).
[4] https://flang.llvm.org/
Contact:
Henri Menke (henri.menke(at)mpcdf.mpg.de), Prof. Erwin Laure