Bachelor's Theses
Open Topics
Interested in a student research project or thesis?
Our working groups often have thesis topics in preparation that are not yet listed here. In some cases, it is also possible to define a topic according to your particular interests. To do so, simply contact a staff member in the relevant research area. If you have further general questions about carrying out a thesis at LIS, please contact Dr. Thomas Wild.
Theses in Progress
Implementation of a Data Flow Classification Mechanism for Chiplet Interconnects
Description
In the BCDC project, a working group at TUM collaborates on designing a RISC-V-based chiplet demonstration chip; two of these chips are connected via an interposer to represent a system of interconnected chiplets. At LIS, we work on an efficient, low-latency chiplet interconnect with additional application-specific features managed by a Smart Chiplet Interconnect layer stack. It bridges the underlying physical layer, which handles data transmission across the interposer, and the system bus, which connects the inter-chiplet interface to the other components of the demonstration chip. The design is based on the PULP platform's Serial Link.
The additional features of the Smart Chiplet Interconnect should be applied data-selectively. Therefore, data flows need to be classified and forwarded to the intended modules, such as those responsible for data compression or encryption. A classification mechanism in our interconnect stack will handle this for both outgoing and incoming data. In this project, the student will implement the required classifiers and integrate them into the Smart Chiplet Interconnect's additional feature layer as sub-layers.
The classification mechanism should combine explicit feature flags with heuristics based on transmission metadata and detectable data characteristics, such as the destination address, the payload size, or recognized compressible patterns. Classified data flows are then forwarded to the appropriate feature modules. If multiple features are to be applied, proper ordering must be ensured, and incoming and outgoing traffic must be differentiated. Classification and forwarding should introduce minimal latency overhead to avoid negatively impacting the interconnect's performance. For future extensibility, the design should be modular and allow for easy addition of new classification strategies and additional features.
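As a rough illustration of such a combined flag/heuristic decision, the logic could look like the Python software model below. All names, flag encodings, and thresholds here are invented for this sketch and are not part of the actual design.

```python
from dataclasses import dataclass

@dataclass
class Flit:
    dest_addr: int
    payload: bytes
    flags: int = 0        # explicit feature request flags set by software

FLAG_COMPRESS = 0b01
FLAG_ENCRYPT  = 0b10

def repetition_score(payload: bytes) -> float:
    """Crude compressibility heuristic: fraction of repeated adjacent bytes."""
    if len(payload) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(payload, payload[1:]) if a == b)
    return repeats / (len(payload) - 1)

def classify(flit: Flit, min_size: int = 64) -> list:
    """Return the ordered list of feature modules the flit is routed to."""
    features = []
    # Explicit flags always request a feature; heuristics may add compression.
    if flit.flags & FLAG_COMPRESS or (
        len(flit.payload) >= min_size and repetition_score(flit.payload) > 0.5
    ):
        features.append("compress")
    if flit.flags & FLAG_ENCRYPT:
        features.append("encrypt")
    # Fixed ordering: compression must precede encryption, since
    # encrypted data is effectively incompressible.
    return features
```

In hardware, the same decision would be a combinational function over header fields plus a cheap streaming statistic, but the ordering constraint (compress before encrypt) carries over directly.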
After implementing the classifier sub-layers in SystemVerilog, the student will evaluate their impact on the interconnect stack. The test environment should provide the extended interconnect stack with realistic input data as it arrives from the AXI system bus or the connected chiplet. The main evaluation metrics will be classification accuracy, classification/end-to-end latency, and hardware cost. As smart classification should reduce wasted latency and power consumption when feature application is not beneficial, the mechanism will be compared to a naive strategy. To estimate resource usage and the maximum achievable clock frequency, the student will synthesize the design for the VCU118 FPGA evaluation board.
The project will be accompanied by another Bachelor's thesis on the Smart Chiplet Interconnect. Depending on the progress of both projects, a combination and joint evaluation of the two designs may be possible and is encouraged.
Prerequisites
- Experience with hardware design in (System)Verilog
- Motivation to familiarize oneself with a complex existing design
- Structured way of working and strong problem-solving skills
- Interest in novel system architectures
Contact
michael.meidinger@tum.de
Generic AXI Traffic Analyzer for Characterizing Memory and Bus Accesses in Embedded Systems
Description
The goal of this Bachelor's thesis is the design and prototypical implementation of a generic AXI traffic analyzer that can be instantiated at arbitrary points of a system with an AXI interface. The analyzer shall make it possible to flexibly capture and evaluate the traffic on the bus at different observation points. The result shall be a reusable hardware component that can be used to characterize communication and memory access patterns.
In addition to classical metrics such as data rate, access rate, or the ratio of read to write accesses, a particular focus is on investigating and implementing more advanced metrics for describing access behavior. These include, among others, the distribution of strides between consecutive memory accesses, the stability of these strides over time, the dominance or distribution of individual cores or initiators on the bus, and an estimate of the current working set, for example via the number of memory pages in concurrent use. Furthermore, it shall be analyzed whether an observed system exhibits rather constant or strongly fluctuating access behavior.
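The metrics listed above can be prototyped in software before committing to a hardware implementation. The following Python sketch illustrates three of them (stride distribution, stride stability, and a page-based working-set estimate); the address-trace format and page size are assumptions made for illustration.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size for the working-set estimate

def stride_histogram(addresses: list) -> Counter:
    """Distribution of strides between consecutive memory accesses."""
    return Counter(b - a for a, b in zip(addresses, addresses[1:]))

def dominant_stride_share(addresses: list) -> float:
    """Stability indicator: share of the most frequent stride."""
    hist = stride_histogram(addresses)
    total = sum(hist.values())
    return max(hist.values()) / total if total else 0.0

def working_set_pages(addresses: list) -> int:
    """Working-set estimate: number of distinct pages touched."""
    return len({addr // PAGE_SIZE for addr in addresses})
```

In hardware, the histogram would typically be a small set of counters over binned strides and the page set a small CAM or Bloom-filter-like structure, but the definitions above carry over.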
The AXI traffic analyzer shall initially be designed so that its measurement data can be read out via the bus. However, the architecture and interface definition shall already account for a later extension toward a non-intrusive Ethernet streaming variant. In the long term, the analyzer shall thus not only provide locally readable statistics, but also support continuous transmission of measurement data over Ethernet as soon as a corresponding interface is available in the overall system.
Within the scope of this thesis, the requirements for such an analyzer shall be defined, a hardware architecture designed, and a first prototype implemented and evaluated on examples.
Prerequisites
- Good knowledge of MPSoCs
- Good C programming skills
- Very good VHDL programming skills
- High motivation
- Independent, self-reliant working style
Contact
Oliver Lenke
o.lenke@tum.de
Implementation of a Lossless Data Compression Algorithm for Chiplet Interconnects
Description
In the BCDC project, a working group at TUM collaborates on designing a RISC-V-based chiplet demonstration chip; two of these chips are connected via an interposer to represent a system of interconnected chiplets. At LIS, we work on an efficient, low-latency chiplet interconnect with additional application-specific features managed by a Smart Chiplet Interconnect layer stack. It closes the gap between the underlying physical layer, which handles data transmission across the interposer, and the system bus, which attaches the inter-chiplet interface to the other components of the demonstration chip. The design is based on the PULP platform's Serial Link.
As one of the key features of the Smart Chiplet Interconnect, we are developing an on-the-fly lossless data compression module to reduce the amount of data transmitted across the interposer and thus increase the effective bandwidth of the low-pin-count interface. A Python version of the LZ4-based algorithm is available; it extends the baseline with features such as an inserted encoding stage and preloaded or fixed dictionary entries.
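For orientation only, the following Python sketch shows the general class of LZ-style back-reference compression the project builds on. It is a deliberately minimal toy, not the project's LZ4 variant: the token format, window size, and minimum match length are invented here, and none of the extensions (encoding stage, preloaded dictionaries) are modeled.

```python
def compress(data: bytes, window: int = 255) -> list:
    """Emit (offset, length) back-references or literal bytes."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= 4:                  # minimum match length, as in LZ4
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])            # literal byte
            i += 1
    return out

def decompress(tokens: list) -> bytes:
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)
        else:
            off, length = tok
            for _ in range(length):        # byte-wise copy allows overlapping
                out.append(out[-off])      # matches (run-length-like behavior)
    return bytes(out)
```

The byte-wise copy in `decompress` is what makes overlapping matches legal, which is also why the hardware decompressor mentioned above can be considerably simpler than the compressor: it needs no match search, only a history buffer and a copy engine.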
In this project, the student will be responsible for implementing the module in SystemVerilog. This includes realizing hardware-specific optimizations for performance and resource usage. Alongside the compression module, the student will also implement the simpler corresponding decompression module for optional decompression on the receiving chiplet.
After verifying matching functionality with the Python reference, the student will evaluate the performance of the implemented modules. For this, the modules should be integrated into a minimal version of the chiplet interconnect stack and fed with realistic data patterns as they would arrive from the system bus or the interconnect. The evaluation will focus on the achievable compression ratio and latency of the hardware implementation. To estimate resource usage and the maximum achievable clock frequency, the student will synthesize the design for the VCU118 FPGA evaluation board.
The project will be accompanied by another Bachelor's thesis on the Smart Chiplet Interconnect. Depending on the progress of the two projects, a combination and joint evaluation of the two designs may be possible and is encouraged.
Prerequisites
- Experience with hardware design in (System)Verilog
- Ideally, familiarity with data compression algorithms
- Structured way of working and strong problem-solving skills
- Interest in novel system architectures
Contact
michael.meidinger@tum.de
TriCore architecture instruction sequence cycle estimation tool
Description
Static analysis plays an important role in understanding how software is expected to behave and in identifying potential defects early in the development process. While functional correctness can often be assessed statically, precise timing analysis is typically performed through dynamic methods, as execution time strongly depends on architectural features and shared hardware resources that influence temporal behavior.
Nevertheless, even when abstracting from certain dynamic effects, such as shared resource contention and detailed memory access behavior, it is possible to estimate the number of processor cycles required to execute a given sequence of instructions, provided that the target architecture is well understood. Hardware simulators such as Spike or gem5 follow this principle by modeling processor behavior to approximate execution timing.
The objective of this thesis is to develop a tool that estimates the cycle count required to execute specific instruction sequences on Infineon’s TriCore architecture. The focus will be on modeling the architectural pipelines, taking into account their structure, parallelism capabilities, and constraints. In particular, the work will analyze how different pipeline configurations influence instruction throughput and latency, as well as the benefits and limitations introduced by the multi-pipeline design.
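To illustrate the estimation approach in miniature, a toy in-order, dual-issue pipeline model can be written in a few lines of Python. The pipeline split, latencies, and issue rules below are invented assumptions for this sketch, not TriCore's actual microarchitecture; the real tool would derive them from the architecture documentation.

```python
# Assumed per-opcode latencies and pipe assignment (integer vs load/store pipe).
LATENCY = {"add": 1, "mul": 2, "ld": 2, "st": 1}
PIPE    = {"add": "ip", "mul": "ip", "ld": "lsp", "st": "lsp"}

def estimate_cycles(seq):
    """seq: list of (opcode, dest_reg, src_regs). Returns estimated cycles."""
    ready = {}                         # register -> cycle its value is available
    pipe_free = {"ip": 0, "lsp": 0}    # next cycle each pipe can accept an op
    last_issue = -1                    # enforce in-order issue
    cycles = 0
    for op, dest, srcs in seq:
        pipe = PIPE[op]
        # Earliest issue: pipe free, operands ready, not before the previous
        # instruction issued (same-cycle dual issue to distinct pipes allowed).
        issue = max(pipe_free[pipe], last_issue,
                    *(ready.get(r, 0) for r in srcs), 0)
        pipe_free[pipe] = issue + 1    # one instruction per pipe per cycle
        last_issue = issue
        if dest is not None:
            ready[dest] = issue + LATENCY[op]
        cycles = max(cycles, issue + LATENCY[op])
    return cycles
```

Even this toy exhibits the effects the thesis will study: a dependent instruction stalls on its producer's latency, while independent instructions on different pipes issue in the same cycle.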
Prerequisites
- Programming skills (preferably in C/C++ or Python)
- Understanding of computer architecture fundamentals
- Knowledge of pipelining concepts, instruction scheduling, and processor microarchitecture
- Familiarity with version control systems (e.g., Git)
- Basic understanding of compilers, assembly language, or low-level software development is advantageous
- Ability to read and interpret technical hardware documentation
- Analytical thinking and structured problem-solving skills
Contact
Technische Universität München
TUM School of Computation, Information and Technology
Lehrstuhl für Integrierte Systeme
Arcisstr. 21
80333 München
Phone: +49.89.289.22963
Fax: +49.89.289.28323
Building: N1 (Theresienstr. 90)
Room: N2138
Email: ibai.irigoyen@tum.de
Evaluation of a Page-Based Memory Preload Architecture Using Standardized Embedded Benchmarks
Description
Modern MPSoC architectures are increasingly limited by off-chip memory latency. To mitigate this bottleneck, a page-based hardware preload unit has been developed that speculatively transfers DRAM pages upon last-level cache misses in order to hide memory access latency.
The goal of this bachelor thesis is to perform a systematic and scientifically sound evaluation of this architecture using internationally recognized embedded benchmark suites. The work will focus on identifying, porting, and executing suitable bare-metal benchmarks on an FPGA-based RISC-V platform (CVA6 architecture). Candidate benchmark suites include Embench, CoreMark, PolyBench/C, MiBench, and other memory-intensive workloads. The final selection will be made during the course of the thesis based on feasibility and relevance.
The thesis involves implementing the benchmarks in the existing hardware/software framework, conducting structured performance measurements, and comparing different system configurations (e.g., with and without the preload unit). Particular emphasis will be placed on analyzing memory behavior, working-set characteristics, and access patterns.
Beyond implementation, the thesis will provide a scientific evaluation of how different workload classes interact with page-based preloading. Results will be analyzed quantitatively and presented in a clear and reproducible manner using normalized speedups and workload classifications.
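The normalized-speedup bookkeeping itself is straightforward and can be sketched as below. The benchmark names and cycle counts are placeholders, not measurement results; the geometric mean is used because it is the customary aggregate for normalized ratios.

```python
from statistics import geometric_mean

def speedups(baseline_cycles: dict, preload_cycles: dict) -> dict:
    """Per-benchmark speedup of the preload configuration over the baseline."""
    return {b: baseline_cycles[b] / preload_cycles[b] for b in baseline_cycles}

def summarize(sp: dict) -> float:
    """Aggregate per-benchmark speedups with the geometric mean."""
    return geometric_mean(sp.values())
```

Keeping the raw cycle counts alongside the normalized figures makes the evaluation reproducible, since any later re-normalization (e.g., per workload class) can be recomputed from them.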
The outcome of this work will provide a solid experimental foundation for further research and potential publications in the area of memory-optimized MPSoC architectures.
Prerequisites
- Good knowledge of MPSoCs
- Good C programming skills
- Basic understanding of hardware-oriented programming style
- High motivation
- Independent, self-reliant working style
Contact
Oliver Lenke
o.lenke@tum.de
Balancing Preload Efficiency and Responsiveness through Adaptive Burst Lengths
Description
Page-based memory preloading typically relies on fixed burst lengths to transfer data efficiently from DRAM. While long bursts maximize preload throughput, they reduce responsiveness to demand-driven CPU memory accesses. Short bursts improve reactivity but underutilize available memory bandwidth.
This thesis builds on the existing page-based preload unit and investigates a hardware-based mechanism for dynamically adjusting the preload burst length according to current memory system utilization. The goal is to balance preload efficiency and fast reaction to demand accesses at runtime. The proposed mechanism adapts the burst length based on simple runtime indicators such as DRAM activity or the presence of competing CPU requests. The implementation extends the existing preload FSM and does not require any modifications to the CPU microarchitecture.
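One possible shape of such an adaptation rule is sketched below as a Python software model. The utilization signal, the thresholds, and the burst-length bounds are illustrative assumptions, not the actual FSM extension to be developed.

```python
def next_burst_len(current: int, demand_pending: bool,
                   bus_utilization: float,
                   min_len: int = 4, max_len: int = 64) -> int:
    """Halve the preload burst when demand traffic competes for the memory
    system; grow it again when the memory system is mostly idle."""
    if demand_pending or bus_utilization > 0.8:
        return max(min_len, current // 2)   # back off for responsiveness
    if bus_utilization < 0.3:
        return min(max_len, current * 2)    # exploit idle bandwidth
    return current                          # hold steady in between
```

In hardware, this maps onto a few comparators and a shift of the burst-length register, evaluated once per preload burst, which keeps the overhead of the mechanism low.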
The evaluation on an FPGA-based platform will analyze execution time, interference with demand accesses, and bandwidth utilization under different memory-intensive workloads. The results aim to demonstrate that adaptive burst sizing is an effective, low-overhead technique for improving the robustness of memory-side preloading.
Prerequisites
- Good knowledge of MPSoCs
- Good C programming skills
- High motivation
- Independent, self-reliant working style
Contact
Oliver Lenke
o.lenke@tum.de
Exploration of Multi-Centroid Models in Hyperdimensional Computing
Hyperdimensional Computing, Hardware acceleration, Anomaly Detection
Description
Hyperdimensional Computing (HDC) is drawing attention as a novel brain-inspired computing paradigm. By exploiting ultra-high-dimensional random vectors, it can deliver comparable accuracy while consuming little energy. This makes the approach ideal for energy-limited scenarios in which cognitive tasks are performed. Due to the algorithm's high parallelism (as in vector computation), it is commonly implemented on a dedicated hardware accelerator.
However, the raw HDC algorithm cannot achieve the same level of accuracy as DNNs in many tasks, such as MNIST or face recognition. This is considered a major disadvantage of HDC. Therefore, many training techniques have been developed to improve the accuracy of HDC models, such as online and adaptive learning. While these techniques can significantly improve the model's performance, they use real numbers or multi-bit precision during training, which deviates from a hardware-friendly algorithm. Moreover, they typically require retraining the model to achieve higher accuracy, which can be burdensome on computation-limited devices.
Here, we focus on an underexplored technique: multi-centroid models. By introducing multiple centroids per class into the model, it can address data imbalance and low memory parallelism. The technique is independent of the encoding and can be employed in essentially all feature-based classification tasks.
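A minimal software sketch of a multi-centroid HDC classifier is shown below. The dimensionality, the random-projection encoder, the order-based cluster split, and the number of centroids per class are all assumptions made for illustration; a real model would, for example, run k-means on the hypervectors instead of splitting by sample order.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048                                  # assumed hypervector dimensionality

def encode(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Random-projection encoding of a feature vector to a bipolar hypervector."""
    return np.sign(proj @ x)

def train(samples, labels, proj, k=2):
    """Build k centroids per class: split each class's encodings into k
    clusters and bundle (sum + binarize) each cluster into one centroid."""
    centroids = []                        # list of (label, hypervector)
    for c in sorted(set(labels)):
        hvs = np.array([encode(x, proj)
                        for x, l in zip(samples, labels) if l == c])
        for chunk in np.array_split(hvs, k):   # crude split by sample order
            if len(chunk):
                centroids.append((c, np.sign(chunk.sum(axis=0))))
    return centroids

def classify(x, proj, centroids):
    """Nearest centroid by dot-product similarity; its label wins."""
    hv = encode(x, proj)
    return max(centroids, key=lambda t: hv @ t[1])[0]
```

Because each class keeps several centroids, a class whose samples form multiple modes (or a majority class drowning out a minority one) is no longer forced into a single bundled vector, which is exactly the data-imbalance effect the thesis will explore.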
The student will implement such a model and explore its benefits and drawbacks across different tasks, such as MNIST, hand gesture recognition, and ISOLET.
Prerequisites
- Solid Python and C programming skills
- Experience with PyTorch
Contact
Yiming Lu
yiming_p.lu@tum.de