Seminars
Time-Division-Multiplexed Networks-on-Chip
Description
Overview:
Modern MPSoCs rely heavily on efficient and scalable interconnects. However fast or numerous the processors may be, the system cannot take advantage of these compute resources unless data and messages can be shared effectively. For this reason, networks-on-chip (NoCs) are a vital part of the design of modern SoCs. NoCs are highly scalable while still achieving low latency and high bandwidth utilisation.
However, current NoCs are not always suited for time-sensitive applications. Standard NoC designs use a "best effort" approach; this offers good average performance and can be used with the vast majority of workloads without requiring any modification of NoC components. However, best-effort NoCs offer no guarantee that a given transaction completes within a given timeframe, which makes them wholly unsuited for real-time systems with hard deadlines.
A well-known alternative to best effort is time-division multiplexing (TDM). In TDM NoCs, a global schedule is constructed and each node is allocated certain time slots in which it may transmit. Transmission times for a given program can therefore be determined exactly at compile time.
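The determinism of a TDM schedule can be illustrated with a small sketch. The slot table below is a hypothetical example (nodes "A" to "D", period of six slots), not taken from any of the cited NoCs; it shows how a node's worst-case wait for its next allocated slot follows directly from the table and is thus known at compile time.

```python
# Hypothetical TDM slot table: each entry of the repeating schedule
# names the node allowed to transmit in that slot.
SLOT_TABLE = ["A", "B", "C", "A", "D", "B"]  # period = 6 slots

def worst_case_wait(node, table):
    """Worst-case number of slots `node` waits until its next
    allocated slot, over all release times within one period.
    `node` must appear in `table` at least once."""
    period = len(table)
    waits = []
    for start in range(period):
        # distance from `start` to the next slot owned by `node`
        wait = next(d for d in range(period)
                    if table[(start + d) % period] == node)
        waits.append(wait)
    return max(waits)

# Node "A" owns two slots, so its worst-case wait is short;
# node "C" owns only one slot, so it may wait nearly a full period.
print(worst_case_wait("A", SLOT_TABLE))  # 2
print(worst_case_wait("C", SLOT_TABLE))  # 5
```

Because the schedule is fixed, such bounds hold regardless of the traffic other nodes generate, which is exactly the property hard real-time systems require.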
Task:
For this seminar, the student will investigate TDM and mixed best-effort/TDM NoCs, with the goal of exploring and summarising state-of-the-art TDM NoC techniques, as well as the performance trade-offs of TDM NoCs compared to standard best-effort NoCs.
Relevant literature:
R. A. Stefan, A. Molnos and K. Goossens, "dAElite: A TDM NoC Supporting QoS, Multicast, and Fast Connection Set-Up," in IEEE Transactions on Computers, vol. 63, no. 3, pp. 583-594, March 2014
M. Schoeberl, F. Brandner, J. Sparsø and E. Kasapaki, "A Statically Scheduled Time-Division-Multiplexed Network-on-Chip for Real-Time Systems," 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, Lyngby, Denmark, 2012
S. Hesham, D. Goehringer and M. A. Abd El Ghany, "HPPT-NoC: A Dark-Silicon Inspired Hierarchical TDM NoC with Efficient Power-Performance Trading," in IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 3, pp. 675-694, 1 March 2020
Contact
William Wulff
william.wulff@tum.de
Supervisor:
On-The-Fly Lossless Data Compression Techniques
Description
As systems-on-chip (SoCs) integrate increasing numbers of processing elements, whether separate CPU cores in traditional SoCs or distinct processing dies in chiplet-based architectures, the bandwidth requirements between these and other system elements keep rising. While higher bandwidth can be achieved by scaling the transfer rate and the number of parallel transmission lanes, as seen across PCIe's generations, the area and power consumption of the interconnect rise as well.
Another approach to managing interconnect load is to reduce the amount of data to transfer in the first place. Besides optimizing the system architecture and applications to require fewer transfers between system components, on-the-fly data compression can be used in systems where area and power consumption are more critical than transmission latency. After data is generated or received as input, it can be compressed before being transmitted, particularly over longer-distance links. On the receiving end, it can either be decompressed or used as is, depending on the application and the destination element. Example use cases are compressing sensor values or camera images before they are processed or stored in memory on the other side of the link, or compressing (sparse) matrices into a format suitable for processing by AI applications.
Many algorithms for lossless data compression exist. Those prioritizing compression and/or decompression speed over compression ratio are most relevant for a system as described. A potential candidate is the LZ4 (Lempel-Ziv 4) algorithm. This seminar work should investigate the viability of this and other lossless compression algorithms, how data should be structured for efficient operation, and which applications could especially benefit from this approach compared to more classical methods of handling high interconnect bandwidth requirements. Further literature research could look into hardware implementations of the considered algorithms.
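As a minimal illustration of the lossless roundtrip property and the speed/ratio trade-off, the sketch below uses Python's standard-library zlib (DEFLATE) as a stand-in for fast codecs such as LZ4; the sample data and compression levels are illustrative choices, not measurements from the cited works.

```python
import zlib

def roundtrip(data: bytes, level: int) -> int:
    """Compress, then verify exact reconstruction; return compressed size."""
    compressed = zlib.compress(data, level)
    restored = zlib.decompress(compressed)
    assert restored == data  # lossless: bit-exact reconstruction
    return len(compressed)

# Repetitive sensor-like data compresses well; level 1 favors speed,
# level 9 favors compression ratio.
sample = b"sensor:0421;" * 256
size_fast = roundtrip(sample, level=1)
size_best = roundtrip(sample, level=9)
print(len(sample), size_fast, size_best)
```

A system favoring link bandwidth would pick the higher ratio; a system where compression latency sits on the critical path would pick the faster setting, which is the trade-off the seminar is meant to explore.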
Potential starting points could be the following papers:
https://ieeexplore.ieee.org/abstract/document/1549812
https://ieeexplore.ieee.org/abstract/document/7818601
https://koreascience.kr/article/JAKO201313660603091.page
https://cdn.zeropoint-tech.com/f/174713/x/2ef77c7d31/ziptilion-memorycompression-ip-zeropoint-technology-whitepaper-2023-10-18-ver-2-6.pdf
Contact
michael.meidinger@tum.de
Supervisor:
High Dynamic Range Camera Sensors for Advanced Driver Assistance Systems and Autonomous Drive
Description
Camera sensors are an important input to Advanced Driver Assistance Systems (ADAS) and Autonomous Drive (AD) in cars. A challenge for these sensors is the very high dynamic range of the input signal and the variation in illumination of the environment. The candidate should work on understanding the principles of high dynamic range (HDR) image capture, different pixel technologies for HDR sensing, exposure control for HDR images, relations to LED flicker mitigation, algorithms to create HDR images from the captured input data, and algorithms to compress high dynamic range images for display to a human driver or a vision processing system.
Contact
Dr. Stephan Herrmann
NXP Semiconductors Germany, Munich
Email: stephan.herrmann@nxp.com
Supervisor:
Modern GPU Synchronization Methods in Parallel Computing
GPU, multi-threading, synchronization
Description
As GPU architectures continue to evolve, their ability to execute thousands of parallel threads has become fundamental to accelerating workloads in fields such as deep learning, scientific computing, and real-time graphics. However, this massive parallelism introduces significant challenges in coordinating thread execution and data access across GPU cores and multiple GPUs. Effective synchronization is therefore critical to ensure correct program behaviour, maximize hardware utilization, and achieve optimal performance.
This seminar topic focuses on investigating modern GPU synchronization methods, which provide the necessary mechanisms to coordinate parallel execution while minimizing overhead. A starting point of literature will be provided.
Through this seminar, participants are expected to gain deeper insight into parallel execution and GPU synchronization, preparing them to tackle synchronization challenges in high-performance computing, heterogeneous system design, and GPU programming.
Prerequisites
A fundamental understanding of how GPUs work
Contact
shichen.huang@tum.de
Supervisor:
Categorization of Ethernet-Detected Anomalies Induced by Processing Unit Deviations
Description
Sporadic anomalies in automotive systems can degrade performance over time and may originate from various system components. In automotive applications, anomalies are often observed at the sensor and ECU levels, with potential propagation through the in-vehicle network via Ethernet. Such anomalies may be the result of deviations in electronic control units, highlighting the importance of monitoring these signals over Ethernet.
Not all processing anomalies are equally detectable over Ethernet due to inherent limitations in the monitoring techniques and the nature of the anomalies. This seminar will explore various anomaly categories, investigate their potential causes, and assess the likelihood of their propagation through the network.
The goal of this seminar is to provide a comprehensive analysis of these anomaly categories, evaluate the underlying causes, and discuss the potential for their detection and mitigation when monitored over Ethernet.
Contact
Zafer Attal
zafer.attal@tum.de
Supervisor:
Comparative Analysis of Local vs. Cloud Processing Approaches
Description
In today’s data-driven world, processing approaches are typically divided between cloud-based solutions—with virtually unlimited resources—and localized processing, which is constrained by hardware limitations. While the cloud offers extensive computational power, localized processing is often required for real-time applications where latency and data security are critical concerns.
To bridge this gap, various algorithms have been developed to pre-process data or extract essential information before it is sent to the cloud.
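The idea of local pre-processing can be sketched with a hypothetical example: instead of streaming raw samples to the cloud, the local node extracts a few compact features and transmits only those. The feature set here (count, mean, maximum) is an illustrative placeholder for the algorithms the seminar will survey.

```python
def extract_features(samples: list[float]) -> dict:
    """Reduce a raw sample stream to a small summary for upstream transfer."""
    return {
        "count": len(samples),
        "mean": sum(samples) / len(samples),
        "max": max(samples),
    }

raw = [0.1 * i for i in range(1000)]  # 1000 raw values captured locally
features = extract_features(raw)      # only 3 values sent to the cloud
print(features)
```

The trade-off is exactly the one the seminar examines: the summary costs local compute but shrinks the transmitted payload by orders of magnitude, at the price of discarding information the cloud side can no longer recover.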
The goal of this seminar is to explore and compare these algorithms, evaluating their computational load on local hardware and their overall impact on system performance.
Contact
Zafer Attal
zafer.attal@tum.de
Supervisor:
Analysis Algorithms for Processor Traces and Instructions
Description
Modern CPUs execute a vast number of instructions while managing large volumes of data. On-chip debugging modules, located adjacent to the CPU, play a critical role in capturing valuable execution information. This data is essential for analyzing system behavior and detecting anomalies—such as timing issues or execution faults—that may occur in the processing unit.
Over time, various algorithms have been developed to analyze processor traces and instructions. These algorithms not only deepen our understanding of system behavior but also support the debugging of potential faults and anomalies.
The goal of this seminar is to explore and compare different trace analysis algorithms, evaluating their efficiency, performance, and potential applications in debugging and optimizing processor operations.
Contact
Zafer Attal
zafer.attal@tum.de