Seminar on VLSI Design Methods (Seminar VLSI-Entwurfsverfahren)

Lecturer (assistant)
Number: 0820073263
Type: Seminar
Duration: 3 SWS (weekly hours per semester)
Term: Winter semester 2024/25
Language of instruction: German
Position within curricula: See TUMonline

Dates

* Date cancelled

Admission information

See TUMonline
Note: Students choose a topic BEFORE the introductory session. To do so, they contact the respective supervisor. Topics are assigned on a "first come, first served" basis. A student counts as registered only once the supervisor has confirmed the chosen topic. A list of topics is available at the following link: https://www.ce.cit.tum.de/eda/lehrveranstaltungen/seminare/wissenschaftliches-seminar-vlsi-entwurfsverfahren/

Learning objectives

Upon successful completion of the seminar, students are able to present a new idea or an existing approach in the field of computer-aided circuit and system design in a clear and convincing manner. To this end, the following skills are acquired:
• The participant can independently acquire a scientific topic from the field of computer-aided circuit and system design.
• The participant is able to present a topic structured by problem statement, state of the art, goals, methods, and results.
• The participant is able to present a topic orally in this structure, to visualize it in a set of slides, and to describe it in writing in a scientific report.
• The participant is familiar with the basics of constructive reviewing and can apply them to someone else's work.

Description

Specific seminar topics from the field of design automation for electronic circuits and systems are offered. Examples are analog design methodology, design methodology for digital circuits, layout synthesis, and system-level design methodology. Participants work independently on a scientific topic and write a 4-page paper. Finally, participants present their topic in a talk. The topic is then treated in detail in a subsequent discussion.

Prerequisites

No specific prerequisites.

Teaching and learning methods

Learning method: Students work out a scientific topic independently, advised by a research assistant. Teaching method: In introductory sessions, participants receive guidance on the technical work, the written report, the preparation of the presentation, and the oral talk. During an additional interactive presentation training, students can learn and rehearse techniques for a successful talk. Further details are discussed individually between students and research assistants. All common techniques for preparing and presenting papers and talks are used, e.g.:
- Classic blackboard, whiteboard
- Electronic slides, projector
- Electronic word processing
- Electronic slide editing

Examination

The examination takes the form of a scientific report. It consists of a written part (50%), namely a paper (4 pages), and an oral part (50%) in the form of a presentation of approx. 30 minutes (including the subsequent discussion). With the scientific report, students demonstrate that they can, for example, prepare, structure, and present the scientific state of the art, a new idea, or an existing approach in the field of computer-aided circuit and system design for an expert audience.

Recommended literature

A set of topics and associated literature is provided at the beginning of the course. Students choose their topic themselves.

Links

Topic selection - open

The topic list for the winter semester 24/25 can be found below.

Topics are assigned on a first-come, first-served (FCFS) basis. Once you have decided on a topic, please contact the supervisor directly by e-mail and make sure that you receive a confirmation from your supervisor.

Seminars

Performance and energy aware wavelength allocation on ring-based WDM 3D optical NoC

Description

Optical Network-on-Chip (ONoC) is a promising communication medium for large-scale Multiprocessor System on Chip (MPSoC). ONoC outperforms classical electrical NoC in terms of throughput and latency. The medium can support multiple transactions at the same time on different wavelengths by using Wavelength Division Multiplexing (WDM). Moreover, multiple wavelengths can be used as a high-bandwidth channel to reduce transmission time. However, multiple signals simultaneously sharing a waveguide can lead to inter-channel crosstalk noise. This problem impacts the Signal to Noise Ratio (SNR) of the optical signal, which leads to an increase in the Bit Error Rate (BER) at the receiver side. In this paper we first formulate the crosstalk noise and execution time models and then propose a Wavelength Allocation (WA) method for a ring-based WDM ONoC that allows searching for performance and energy trade-offs based on the application constraints. As a result, the most promising WA solutions are highlighted for a defined application mapping onto a 16-core WDM ONoC.

Contact

liaoyuan.cheng@tum.de

Supervisor:

Liaoyuan Cheng

Silicon Photonic Microring Resonators: A Comprehensive Design-Space Exploration and Optimization Under Fabrication-Process Variations

Description

Silicon photonic microring resonators (MRRs) offer many advantages (e.g., compactness) and are often considered as the fundamental building block in optical interconnects and emerging photonic nanoprocessors and accelerators. Such devices are, however, sensitive to inevitable fabrication-process variations (FPVs) stemming from optical lithography imperfections. Consequently, silicon photonic integrated circuits (PICs) integrating MRRs often suffer from the high power overhead required to compensate for the impact of FPVs on MRRs and, hence, to realize reliable operation. On the other hand, the design space of MRRs is complex, including several correlated design parameters, which further complicates the design optimization of MRRs under FPVs. In this article, we present, for the first time, a comprehensive design-space exploration in passive and active MRRs under FPVs. In addition, we present design optimization in MRRs under FPVs while considering different performance metrics, such as tolerance to FPVs, quality factor, and 3-dB bandwidth in MRRs. Simulation and fabrication results obtained by measuring multiple fabricated MRRs designed using our design-space exploration demonstrate a significant 70% improvement on average in the MRRs’ tolerance to different FPVs. Furthermore, we apply the proposed design optimization to a case study of a wavelength-selective MRR-based demultiplexer, where we show considerable channel-spacing accuracy within 0.5 nm even when the MRRs are placed 500 µm apart on a chip. Such improvements indicate the efficiency of the proposed design-space exploration and optimization to enable power-efficient and variation-resilient PICs and optical interconnects integrating MRRs.

Contact

liaoyuan.cheng@tum.de

Supervisor:

Liaoyuan Cheng

Methodologies for Accelerating Deep Learning Inference on Different Tensor Machines

Description

The proliferation of deep learning (DL) applications has created an unprecedented demand for high-performance hardware accelerators capable of handling the computationally intensive tasks involved in DL processing. To address this need, various instruction set architectures (ISAs) have introduced specialized matrix extensions, such as Arm's Scalable Matrix Extension (SME) [2] and Intel's Advanced Matrix Extensions (AMX) [1], to accelerate matrix operations that are at the heart of DL computations.

However, the proprietary nature of these implementations has limited their adoption and customization, highlighting the need for open-source and flexible solutions. The RISC-V ISA, with its open-source architecture, has emerged as a promising platform for developing custom extensions for tensor operations [3] [4]. Researchers have proposed various methodologies for both dependent and independent matrix extensions, including the use of matrix registers and accumulator registers, to improve performance, efficiency, and scalability.

The seminar will provide a comprehensive overview of the current state of custom extensions for tensor operations, highlighting the advantages and limitations of existing programming models, design methodologies, and performance evaluation techniques.

The seminar should cover the following topics:

  1. Existing programming models for custom extensions for tensor operations, particularly for RISC-V, including their advantages and limitations.

  2. Design methodologies for different tensor extensions, covering DL compilation, design, simulation, and deployment, highlighting their strengths and weaknesses.

  3. Analysis and evaluation of the performance of different programming models for custom extensions for tensor operations, considering factors such as parallelism, latency, and data transfer.

References:

[1] Intel® Architecture Instruction Set Extensions and Future Features Programming Reference

[2] Arm® Architecture Reference Manual Supplement, The Scalable Matrix Extension (SME), for Armv9-A

[3] V. Verma, T. Tracy II, and M. R. Stan, “EXTREM-EDGE - EXtensions To RISC-V for Energy-efficient ML inference at the EDGE of IoT,” Sust. Comp.: Informatics and Systems, vol. 35, p. 100742, 2022.

[4] M. Perotti, Y. Zhang, M. Cavalcante, E. Mustafa, and L. Benini, “MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication,” 2024.

Contact

Supervisor:

Conrad Foik

Compression and Decompression Techniques for Activation Data

Description

Deep Neural Networks (DNNs) offer possibilities for tackling practical challenges and broadening the scope of Artificial Intelligence (AI) applications. The considerable computational and memory needs of current neural networks are attributed to the increasing complexity of network structures, which involve numerous layers containing millions of parameters. The energy consumption during DNN inference is predominantly attributed to the access and processing of these parameters. One of the main areas to tackle is the storage and access of activation data computed during inference.

The objective of this seminar is to conduct a comprehensive literature survey of the compression and decompression techniques available for activation data and to gather the advantages and disadvantages of the available solutions. Depending on the time and the reviewed contents, the survey can be extended to find a hardware-efficient technique.
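As a starting point for the survey, the following minimal sketch illustrates one simple family of techniques: zero run-length encoding of activations. It assumes that the activation tensor contains many exact zeros (for example after a ReLU layer), which is an assumption of this example and not a statement about any specific accelerator. The function names `compress` and `decompress` are hypothetical.

```python
def compress(activations):
    """Encode a flat list as (zero_run_length, nonzero_value) pairs.

    Returns the pairs plus the number of trailing zeros, which have no
    following non-zero value to attach to.
    """
    pairs, zeros = [], 0
    for a in activations:
        if a == 0:
            zeros += 1
        else:
            pairs.append((zeros, a))
            zeros = 0
    return pairs, zeros


def decompress(pairs, trailing_zeros):
    """Invert compress(): expand each zero run, then append the value."""
    out = []
    for zeros, value in pairs:
        out.extend([0] * zeros)
        out.append(value)
    out.extend([0] * trailing_zeros)
    return out


acts = [0, 0, 3, 0, 0, 0, 7, 1, 0, 0]
pairs, tail = compress(acts)
restored = decompress(pairs, tail)
print(pairs, tail)
```

The scheme is lossless, and its benefit depends entirely on how sparse the activations are; dense activations can even grow in size, which is one of the trade-offs a survey would compare.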

[1] Chen, Yu-Hsin, Yang, Tien-Ju, Emer, Joel and Sze, Vivienne (2019): Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9(2): 292–308.

[2] Wang, Cong, Xiao, Yanru, Gao, Xing, Li, Li and Wang, Jun (2023): A Framework for Behavioral Biometric Authentication Using Deep Metric Learning on Mobile Devices, IEEE Transactions on Mobile Computing 22(1): 19–36.

[3] Hawlader, Faisal, Robinet, François and Frank, Raphaël (2023): Poster: Lightweight Features Sharing for Real-Time Object Detection in Cooperative Driving, 2023 IEEE Vehicular Networking Conference (VNC), pp. 159–160, ISSN 2157-9865.

[4] Lee, Minjae, Park, Seongmin, Kim, Hyungmin, Yoon, Minyong, Lee, Janghwan, Choi, Jun Won, Kim, Nam Sung, Kang, Mingu and Choi, Jungwook (2024): SPADE: Sparse Pillar-based 3D Object Detection Accelerator for Autonomous Driving, 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 454–467, ISSN 2378-203X.

[5] Price, Ilan and Tanner, Jared (2023): Improved Projection Learning for Lower Dimensional Feature Maps, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, ISSN 2379-190X.

Contact

Andrew.Stevens@infineon.com ; Mounika.Vaddeboina@infineon.com

Supervisor:

Conrad Foik

Tensor program/graph rewriting-based optimization techniques

Description

In machine learning (ML), tensor kernels often translate into pure mathematical expressions. This presents an interesting prospect for optimization through term rewriting [1]. A fundamental optimization technique used by deep learning frameworks is graph rewriting [2] [3]. Within production frameworks, the decision to apply rewrite rules and in what sequence rests heavily on heuristics. Research indicates that seeking a more optimal sequence of substitutions, rather than relying solely on heuristics, can lead to the discovery of better tensor computation graphs.
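To make the idea of rewriting concrete, here is a minimal sketch of rule-based rewriting on a tiny tensor expression. The tuple encoding, the two rules, and the helper names `rewrite_once` and `rewrite` are assumptions made for illustration; they do not reproduce the rule sets or search strategies of the tools in [2] to [4].

```python
# An expression is either a variable name (str) or a tuple ("op", arg, ...).

def rewrite_once(expr):
    """Apply the first matching rule at the root, or return expr unchanged."""
    if isinstance(expr, tuple):
        op = expr[0]
        # Rule 1: transpose(transpose(X)) -> X
        if (op == "transpose" and isinstance(expr[1], tuple)
                and expr[1][0] == "transpose"):
            return expr[1][1]
        # Rule 2: add(matmul(A, B), matmul(A, C)) -> matmul(A, add(B, C))
        if (op == "add"
                and all(isinstance(e, tuple) and e[0] == "matmul"
                        for e in expr[1:3])
                and expr[1][1] == expr[2][1]):
            return ("matmul", expr[1][1], ("add", expr[1][2], expr[2][2]))
    return expr


def rewrite(expr):
    """Rewrite children first, then the root, until nothing changes."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(e) for e in expr[1:])
    new = rewrite_once(expr)
    return new if new == expr else rewrite(new)


graph = ("add",
         ("matmul", "A", ("transpose", ("transpose", "B"))),
         ("matmul", "A", "C"))
optimized = rewrite(graph)
print(optimized)  # ('matmul', 'A', ('add', 'B', 'C'))
```

This sketch applies rules greedily in a fixed order, which is exactly the kind of heuristic choice the paragraph above refers to; searching over alternative substitution sequences is what distinguishes the research approaches.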

Moreover, term rewriting techniques prove beneficial in optimizing low-level tensor programs [4] alongside tensor graphs. Traditionally, application programmers manually add hardware function calls, or compilers incorporate them through handcrafted accelerator-specific extensions. Integrating domain-specific instruction or operation support into an existing compiler typically involves custom pattern matching to map resource-intensive tensor operations from applications to hardware-specific invocations. Despite these modifications related to pattern matching, users may still need to manually adjust their applications to aid the compiler in identifying opportunities for dispatching operations to target hardware, such as by altering data types or optimizing loops.

Leveraging term rewriting techniques offers a promising approach for streamlining various transformation and mapping tasks for both tensor graphs and tensor programs. This approach not only enhances efficiency but also holds the potential for simplifying the deployment of DSLs, thus advancing the field of machine learning and computational optimization.

This seminar topic should cover literature research on existing rewriting techniques for tensor programs and graphs, which includes:

  1. Research on existing rewriting techniques
  2. Their application to tensor programs and graphs
  3. Challenges and the relations between different rewriting techniques
  4. Applications in and with existing machine learning compiler frameworks

[1] Franz Baader et al. 1998. Term Rewriting and All That. Cambridge University Press. https://doi.org/10.1017/CBO9781139172752.

[2] Zhihao Jia et al. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 47–62. https://doi.org/10.1145/3341301.3359630.

[3] Y. Yang et al. 2021. Equality Saturation for Tensor Graph Superoptimization. arXiv preprint arXiv:2101.01332.

[4] Gus Henry Smith et al. 2021. Pure Tensor Program Rewriting via Access Patterns (Representation Pearl). In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming (Virtual, Canada) (MAPS 2021). Association for Computing Machinery, New York, NY, USA, 21–31. https://doi.org/10.1145/3460945.3464953.

Contact

Supervisor:

Conrad Foik

Physically Aware RTL Generation: Navigating Design and Performance Challenges

Description

Currently, digital IPs are implemented using hardware description languages at the Register Transfer Level (RTL) abstraction. In principle, RTL models are independent of physical features and constraints, as they are mapped to hardware implementations through synthesis and RTL-to-Gates (RTL2GDSII) tools. However, the chosen RTL implementation still significantly impacts the quality of the result, particularly in technology-dependent features such as timing, power consumption, and overall performance.

This seminar delves into how to bridge the gap between technology-independent RTL code and technology-dependent features by making RTL generators aware of physical characteristics. By doing so, it becomes possible to optimize designs for real-world applications, ensuring that they meet both performance and power efficiency targets. Additionally, generation should integrate further constraints to refine and enhance the implementation outcomes. This dual approach of awareness and constraint-driven design aims to elevate current design and generation methodologies in several key areas:

  • More efficient IPs through physical-aware RTL: Incorporating physical features directly into RTL can lead to the development of digital IPs that are not only high-performing but also power-efficient. This ensures that designs can meet stringent operational requirements while maintaining energy efficiency.
  • Increased automation in IP generation: By enhancing automation in the IP generation process, it becomes possible to reduce manual effort, minimize errors, and accelerate time-to-market. This increased level of automation can streamline workflows and enable designers to focus on innovation rather than routine tasks.
  • Enhanced quality of results: By making RTL generators cognizant of physical features, the resulting hardware implementations can achieve superior quality in terms of both timing and power consumption. This ensures that the end products are reliable and meet the required specifications.
  • Reduced design iterations: With physical-aware RTL and automated constraint integration, the number of design iterations needed to achieve optimal results can be significantly reduced. This not only saves time but also reduces costs associated with prolonged design cycles.

By addressing these areas, the seminar aims to provide insights and practical solutions for overcoming the inherent challenges in the current design and generation approaches, paving the way for more efficient, automated, and high-quality digital IPs.

[1] Alex Carsello, James Thomas, Ankita Nayak, Po-Han Chen, Mark Horowitz, Priyanka Raina, and Christopher Torng. 2021. Enabling Reusable Physical Design Flows with Modular Flow Generators. arXiv preprint arXiv:2111.14535 (2021).

[2] Alon Amid, David Biancolin, Abraham Gonzalez, Daniel Grubb, Sagar Karandikar, Harrison Liew, Albert Magyar, Howard Mao, Albert Ou, Nathan Pemberton, et al. 2020. Chipyard: Integrated design, simulation, and implementation framework for custom SoCs. IEEE Micro 40, 4 (2020), 10–21.

[3] Edward Wang, Colin Schmidt, Adam Izraelevitz, John Wright, Borivoje Nikolić, Elad Alon, and Jonathan Bachrach. 2020. A methodology for reusable physical design. In 2020 21st International Symposium on Quality Electronic Design (ISQED). IEEE, 243–249.

Contact

Mohamed.Badawy@infineon.com, Wolfgang.Ecker@tum.de

Supervisor:

Conrad Foik

Survey of the Annual Reactive Synthesis Competition (SYNTCOMP)

Description

Today, digital hardware design is most commonly done on the Register-Transfer Level (RTL). This low abstraction level uses sequential logic to describe the behavior and structure of digital circuits using Hardware Description Languages (HDLs) like VHDL or SystemVerilog. To lower design and verification complexity, researchers have proposed moving beyond RTL and using so-called temporal logic instead, which allows for a higher-level expression of temporal correlations. For example, implementing a multi-cycle delay between two actions in VHDL or SystemVerilog requires manually describing this behavior in sequential logic, such as FSMs, buffers, or counters. Conversely, temporal logic can directly specify that action A implies action B after N cycles or even that an event will eventually happen.
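The property "action A implies action B after N cycles" can be illustrated with a small trace checker. This is only a sketch of what such a specification means on a finite simulation trace; it is not a synthesis procedure, and the function name `holds_after_n` as well as the choice to ignore obligations that fall beyond the end of the trace are assumptions of this example.

```python
def holds_after_n(trace, n):
    """Check 'whenever a is high, b is high exactly n cycles later'.

    trace is a list of (a, b) booleans, one entry per clock cycle.
    Obligations that would fall beyond the end of the finite trace are
    treated as not violated.
    """
    for t, (a, _) in enumerate(trace):
        if a and t + n < len(trace) and not trace[t + n][1]:
            return False
    return True


# b follows a after exactly two cycles.
good = [(True, False), (False, False), (False, True), (False, False)]
# b arrives one cycle too late.
bad = [(True, False), (False, False), (False, False), (False, True)]

print(holds_after_n(good, 2), holds_after_n(bad, 2))  # True False
```

Expressing the same behavior at RTL would require an explicit counter or shift register, which is the contrast the paragraph above draws.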


Despite the advantages of easily expressing complex temporal correlations, synthesizing actual circuits from temporal logic remains an open challenge. Acknowledging this gap, the annual Reactive Synthesis Competition (SYNTCOMP) added a track to compare and benchmark logic synthesis tools for Linear Temporal Logic (LTL) in 2016 [1].

This seminar aims to conduct a comprehensive literature survey of the SYNTCOMP problem statements, benchmarks, and submitted synthesis tools [2, 3]. Students will critically analyze the advantages and potential drawbacks of these tools and algorithms to provide a perspective on their applicability in digital design. Depending on the findings, the seminar can also be extended with other (more theoretical) approaches to temporal logic synthesis beyond SYNTCOMP.

[1] S. Jacobs and R. Bloem, “The Reactive Synthesis Competition: SYNTCOMP 2016 and Beyond,” 2016, https://arxiv.org/abs/1611.07626
[2] S. Jacobs, G. A. Pérez, and P. Schlehuber-Caissier, “The Reactive Synthesis Competition,” 2023, https://www.syntcomp.org/
[3] P. J. Meyer, S. Sickert, and M. Luttenberger, “Strix: Explicit Reactive Synthesis Strikes Back!,” 2018, https://strix.model.in.tum.de/publications/MeyerSL18.pdf

Contact

RobertNiklas.Kunzelmann@infineon.com, Wolfgang.Ecker@tum.de

Supervisor:

Conrad Foik

Efficient Transformer Models Using Low-Rank Representations

Description

Transformer models have become the cornerstone of many state-of-the-art solutions in natural language processing (NLP) [1,2] and computer vision tasks [3,4]. These models, known for their self-attention mechanisms, excel at capturing complex dependencies within data. However, the computational and memory demands of transformer models present significant challenges, particularly for deployment in resource-constrained environments.

Low-rank approximation methods offer a promising solution to mitigate these challenges by reducing the dimensionality of the model parameters, effectively pruning the model while retaining most of its performance. These methods decompose the weight matrices of the model into products of lower-rank matrices, thereby reducing the number of parameters and computational complexity [5].
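The parameter and compute savings of a factorized weight matrix can be sketched in a few lines. In this example the dense matrix `W` is constructed as an exact product of two thin factors `A` and `B`, so the factorized result matches exactly; in practice the factors would be obtained by an approximation method such as a truncated SVD and would only approximate the original weights. All names and sizes are assumptions chosen for illustration.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]


m, n, r = 8, 8, 2
A = [[i + j + 1 for j in range(r)] for i in range(m)]          # m x r factor
B = [[(i + 1) * (j - 2) for j in range(n)] for i in range(r)]  # r x n factor
W = matmul(A, B)  # dense m x n matrix of rank <= r by construction

x = [1, -1, 2, 0, 3, 1, -2, 4]
dense = matvec(W, x)                 # m * n multiplications
factored = matvec(A, matvec(B, x))   # r * (m + n) multiplications

params_dense = m * n
params_factored = r * (m + n)
print(params_dense, params_factored)  # 64 32
```

The saving grows with the matrix size: for a fixed rank r, the factorized form scales with m + n instead of m * n.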


This seminar will delve into approaches that use low-rank methods for pruning transformer models, highlighting the latest research [6,7,8] and future directions.


[1] A. Vaswani et al., “Attention is All you Need”, NeurIPS, 2017, vol. 30.
[2] OpenAI et al., “GPT-4 Technical Report”, arXiv [cs.CL]. 2023.
[3] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR, 2021.
[4] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation.” CVPR, 2022.
[5] X. Liu and K. K. Parhi, “Tensor Decomposition for Model Reduction in Neural Networks: A Review,” IEEE Circuits Syst. Mag., vol. 23, no. 2, pp. 8–28, 2023, doi: 10.1109/MCAS.2023.3267921.
[6] S. Ren and K. Q. Zhu, “Low-Rank Prune-And-Factorize for Language Model Compression,” ICCL, 2024, pp. 10822–10832.
[7] Y. Guo et al., “PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,” CVPR, 2024, pp. 15699–15709.
[8] C.-C. Chang et al., “FLORA: Fine-Grained Low-Rank Architecture Search for Vision Transformer”, WACV, 2024, pp. 2482–2491.

Contact

Please contact Moritz Thoma (Moritz.Thoma@bmw.de)

Supervisor:

Conrad Foik

Optimizing Vision Transformer Models: Techniques for Memory and Time-Efficient Pruning

Description

In recent years, vision transformers (ViTs) have revolutionized the field of computer vision, achieving remarkable success across a wide range of tasks such as image classification [1], object detection [2], and semantic segmentation [3]. Despite their impressive performance, vision transformers come with significant computational costs and memory requirements, making them challenging to deploy in resource-constrained environments. This is where model pruning - a technique aimed at reducing the size and complexity of neural networks - comes into play. By selectively removing less important weights and neurons, pruning can substantially reduce the computational burden and memory footprint of ViTs without significantly affecting their accuracy. However, traditional pruning methods can be time-consuming and computationally intensive, which limits their practicality - especially for very large models. Hence, memory- and time-efficient pruning techniques are essential to make large models deployable with reasonable compute effort.
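As a baseline for what "removing less important weights" can mean, the sketch below implements unstructured magnitude pruning: the weights with the smallest absolute value are set to zero. Using magnitude as the importance criterion is an assumption of this example; the methods in [4] and [5] use more elaborate criteria. The function name `magnitude_prune` is hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    weights is a list of rows. If several weights share the threshold
    magnitude, all of them are pruned, so the result can be slightly
    sparser than requested.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


W = [[0.5, -0.1, 0.9],
     [0.05, -0.7, 0.2]]
pruned = magnitude_prune(W, 0.5)
print(pruned)  # [[0.5, 0.0, 0.9], [0.0, -0.7, 0.0]]
```

This one-shot variant needs only a sort over the weights and no retraining, which is why it is a common reference point when comparing the time and memory cost of pruning methods.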


This seminar aims to explore these advanced pruning strategies specifically tailored for vision transformers [4,5], focusing on achieving a balance between model size reduction and computational efficiency.


[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR, 2021.
[2] N. Carion et al., “End-to-end object detection with transformers.” ECCV, 2020.
[3] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation.” CVPR, 2022.
[4] W. Kwon et al., “A Fast Post-Training Pruning Framework for Transformers.” NeurIPS, 2022.
[5] M. Sun et al., “A Simple and Effective Pruning Approach for Large Language Models,” ICLR, 2024.

Contact

Please contact Moritz Thoma (Moritz.Thoma@bmw.de)

Supervisor:

Conrad Foik

On Memory Optimization of Tensor Programs

Short Description:
In this seminar the student will review state-of-the-art memory-aware optimization techniques applied to tensor-level AI programs.

Description

Compact electronic edge devices have limited memory resources. As AI models can require large amounts of memory, running AI models on edge devices becomes challenging. Thus, AI programs must be optimized so that they can be deployed on edge devices while avoiding costly memory transfers.

This need has motivated current works exploring different memory-aware optimization techniques that reduce memory utilization but do not modify the DNN parameters (as during compression or network architecture search (NAS)), such as fused tiling, memory-aware scheduling, and memory layout planning [1]. For instance, DORY (Deployment Oriented to memoRY) is an automated tool designed for deploying deep neural networks (DNNs) on low-cost microcontroller units with less than 1 MB of on-chip SRAM. It tackles the challenge of tiling by framing it as a Constraint Programming (CP) problem, aiming to maximize the utilization of L1 memory while adhering to the topological constraints of each DNN layer. DORY then generates ANSI C code to manage the transfers between off-chip and on-chip memory and the computation phases [2]. DORY has been integrated with TVM to ease the support for heterogeneous compilation and to offload operations not supported by the accelerator to a regular host CPU [3].
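The tiling problem described above can be illustrated with a brute-force search in place of a real constraint solver. The sketch below is not DORY's formulation: the memory model (one byte per element, a 3x3 convolution with stride 1, tiling over output channels and output rows only) and the function name `best_tile` are simplifying assumptions made for this example.

```python
from itertools import product


def best_tile(c_in, c_out, height, width, l1_bytes, kernel=3):
    """Pick (tile_c_out, tile_rows) that maximizes L1 usage within the budget.

    Returns (used_bytes, tile_c_out, tile_rows), or None if even the
    smallest tile does not fit.
    """
    best = None
    for t_co, t_h in product(range(1, c_out + 1), range(1, height + 1)):
        # Input tile including the halo needed by the convolution window.
        in_bytes = c_in * (t_h + kernel - 1) * (width + kernel - 1)
        w_bytes = t_co * c_in * kernel * kernel
        out_bytes = t_co * t_h * width
        used = in_bytes + w_bytes + out_bytes
        if used <= l1_bytes and (best is None or used > best[0]):
            best = (used, t_co, t_h)
    return best


L1_BYTES = 32 * 1024
tile = best_tile(c_in=16, c_out=32, height=32, width=32, l1_bytes=L1_BYTES)
print(tile)
```

With these numbers the whole layer would need 55872 bytes, so it has to be split into tiles. A constraint solver explores the same space more efficiently and can add further constraints and objectives.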

This seminar topic reviews state-of-the-art approaches for memory-aware optimization techniques of ML tensor programs targeting constrained edge devices. The different methods and results shall be reviewed and compared.

References:

[1] Rafael Christopher Stahl. Code Optimization and Generation of Machine Learning and Driver Software for Memory-Constrained Edge Devices. 2024. Technical University of Munich, PhD Thesis. URL: https://mediatum.ub.tum.de/doc/1730282/1730282.pdf

[2] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi and F. Conti, "DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs," in IEEE Transactions on Computers, vol. 70, no. 8, pp. 1253-1268, 2021, https://doi.org/10.1109/TC.2021.3066883

[3] Van Delm, Josse, et al. "HTVM: Efficient neural network deployment on heterogeneous TinyML platforms." 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023. https://doi.org/10.1109/DAC56929.2023.10247664

Contact

Andrew.stevens@infineon.com

Daniela.sanchezlopera@infineon.com

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies)

On-device learning

Short Description:
In this seminar the student will review state-of-the-art contributions to the on-device learning research area.

Description

TinyML is a research area aiming to bring machine learning models to resource-constrained IoT devices and microcontrollers. Current research mainly focuses on enabling inference on such devices, tackling challenges such as limited memory and computation resources available. But for specific sensing and IoT applications, on-device learning would allow retraining and refining ML models directly on small and low-power devices. However, on-device learning on edge devices is much more challenging than inference due to larger memory footprints and increased computing operations to store intermediate activations and gradients [1].

To tackle those challenges, different strategies involving, among others, quantization, sparse backpropagation, or new layer types have been proposed and summarized [1, 2]. This seminar will review state-of-the-art approaches for on-device learning techniques targeting constrained edge devices. The different methods and results shall be reviewed and compared.

References:

[1] J. Lin, L. Zhu, W.-M. Chen, W.-C. Wang and S. Han, "Tiny Machine Learning: Progress and Futures [Feature]," in IEEE Circuits and Systems Magazine, vol. 23, no. 3, pp. 8-34, 2023, https://doi.org/10.1109/MCAS.2023.3302182

[2] Shuai Zhu, Thiemo Voigt, Fatemeh Rahimian, and JeongGil Ko. 2024. On-device Training: A First Overview on Existing Systems. ACM Trans. Sen. Netw. Just Accepted (September 2024). https://doi.org/10.1145/3696003

Contact

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies)

Innovative Memory Architectures in DNN Accelerators

Description

With the growing complexity of neural networks, more efficient and faster processing solutions are vital to enable the widespread use of artificial intelligence. Systolic arrays are among the most popular architectures for energy-efficient and high-throughput DNN hardware accelerators.
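To show why memory access patterns matter for such arrays, the following sketch simulates the timing of an output-stationary systolic array computing a matrix product. The output-stationary dataflow and the skewed operand schedule (the operand pair with index k reaches processing element (i, j) in cycle i + j + k) are assumptions of this example; real accelerators use a variety of dataflows.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A * B.

    Processing element (i, j) keeps C[i][j] locally. Rows of A enter from
    the left and columns of B from the top, each skewed by one cycle per
    row or column, so PE (i, j) sees A[i][k] and B[k][j] in cycle i + j + k.
    Returns the result matrix and the number of cycles until the last
    multiply-accumulate has been issued.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                k = t - i - j  # operand pair arriving at PE (i, j) now
                if 0 <= k < k_dim:
                    C[i][j] += A[i][k] * B[k][j]
    return C, cycles


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, cycles = systolic_matmul(A, B)
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Every cycle, each row and column port consumes one new operand, so the memory system has to deliver skewed operand streams without stalls; this is the data-availability requirement discussed in the following paragraphs.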

While many works implement DNN accelerators using systolic arrays on FPGAs, several ASIC designs from industry and academia have also been presented [1-3]. To fulfill the requirements that such accelerators place on memory accesses, both in terms of data availability and latency hiding, innovative memory architectures can enable more efficient data access, reducing latency and bridging the gap towards even more powerful DNN accelerators.

One example is the Eyeriss v2 ASIC [1], which uses a distributed Global Buffer (GB) layout tailored to the demands of its row-stationary systolic array dataflow.

In this seminar, a survey of state-of-the-art DNN accelerator designs and design frameworks shall be created, focusing on their memory hierarchy.

References and Further Resources:

[1] Y.-H. Chen, T.-J. Yang, J. Emer and V. Sze. 2019. "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019. https://doi.org/10.1109/JETCAS.2019.2910232

[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. "DianNao family: energy-efficient hardware accelerators for machine learning." In Commun. ACM 59, 11 (November 2016), 105–112. https://doi.org/10.1145/2996864

[3] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. "In-Datacenter Performance Analysis of a Tensor Processing Unit." In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246

[4] Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Comput. Surv. 56, 1, Article 20 (January 2024), 37 pages. https://doi.org/10.1145/3604802

[5] Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, and Tiejun Li. 2024. "SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow." ACM Trans. Des. Autom. Electron. Syst. 29, 2, Article 26 (March 2024), 32 pages. https://doi.org/10.1145/3634703
 

Contact

benedikt.schaible@tum.de

Supervisor:

Benedikt Schaible

From Tree to Bus: Modifying Obstacle-Avoiding Steiner Tree Algorithms for the Synthesis of Bus Topology

Description

The ultimate goal of this study is to generate a bus topology that minimizes wire length while avoiding obstacles. For general obstacle-aware routing problems with wire length minimization, the most widely acknowledged automatic routing method is the Obstacle-Avoiding Steiner Minimum Tree (OASMT). OASMT algorithms are typically used to generate tree topologies, which connect nodes through branching structures. To obtain a bus topology instead, we aim to modify an existing OASMT algorithm by adjusting the node connection order. The task focuses solely on this modification, that is, changing the node connections to achieve a bus structure, without further wire length minimization.
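
As a rough illustration of what "adjusting the node connection order" could mean, the following Python sketch takes a small tree given as an adjacency list and chains its nodes in depth-first preorder, so that every node ends up with at most two neighbours. The coordinates, the helper names `tree_to_bus_order` and `manhattan_length`, and the choice of a preorder traversal are assumptions made purely for illustration. Obstacles are ignored entirely, and this is not the OASMT modification the topic asks for.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]

def tree_to_bus_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Return the nodes of `tree` in depth-first preorder, starting at `root`."""
    order: List[str] = []
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours in reverse so they are visited in their listed order.
        for nxt in reversed(tree[node]):
            if nxt not in seen:
                stack.append(nxt)
    return order

def manhattan_length(order: List[str], pos: Dict[str, Point]) -> int:
    """Total Manhattan length of the chain that connects consecutive nodes."""
    return sum(
        abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
        for a, b in zip(order, order[1:])
    )

# Hypothetical four-pin example: a tree with edges A-B, A-C, B-D.
tree = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
pos = {"A": (0, 0), "B": (2, 0), "C": (0, 3), "D": (4, 0)}

bus = tree_to_bus_order(tree, "A")
bus_length = manhattan_length(bus, pos)
print(bus, bus_length)
```

In this toy instance the tree edges have a total Manhattan length of 7, while the chain produced by the sketch is longer, which shows why the connection order matters once the topology is restricted to a bus.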

Contact

m.lian@tum.de

Supervisor:

Meng Lian

Design Space Exploration Methods for Neural Network Accelerators

Description

The efficiency of an accelerator depends on three factors: the mapping, the deep neural network (DNN) layers, and the hardware. Hardware design space exploration requires both the hardware parameters and the mappings of the algorithm onto the target hardware to be discovered and optimized.
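
To make the joint search over hardware parameters and mappings concrete, here is a minimal Python sketch of an exhaustive exploration. The cost model, the parameter names (`num_pes`, `buffer_words`, `tile`), and the candidate values are all invented for illustration. Real exploration methods use far more detailed models and smarter search strategies.

```python
import math
from itertools import product

def latency_cycles(macs, num_pes, tile, buffer_words):
    """Toy latency model (an assumption, not a validated one)."""
    if tile > buffer_words:
        return None  # this mapping does not fit into the on-chip buffer
    compute = math.ceil(macs / num_pes)
    reloads = math.ceil(macs / tile)  # larger tiles mean fewer buffer reloads
    return compute + 10 * reloads     # fixed, made-up penalty per reload

def explore(macs, pe_options, buffer_options, tile_options):
    """Try every hardware/mapping combination and keep the lowest score."""
    best = None
    for num_pes, buf, tile in product(pe_options, buffer_options, tile_options):
        cycles = latency_cycles(macs, num_pes, tile, buf)
        if cycles is None:
            continue
        area = num_pes + buf // 64    # crude area proxy, also an assumption
        score = cycles * area         # latency-area product to be minimized
        if best is None or score < best[0]:
            best = (score, {"num_pes": num_pes, "buffer_words": buf, "tile": tile})
    return best

# Hypothetical layer with 4096 multiply-accumulate operations.
best = explore(4096, pe_options=[16, 64], buffer_options=[256, 1024],
               tile_options=[128, 512])
print(best)
```

The hardware choices (`num_pes`, `buffer_words`) and the mapping choice (`tile`) are searched together because a mapping can be infeasible or inefficient on one hardware configuration and optimal on another.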

This project aims to identify the most prominent approaches and compare them.

Contact

samira.ahmadifarsani@tum.de

Supervisor:

Samira Ahmadifarsani

Comparative Study of Hardware Architectures for Neural Network Accelerators

Description

This literature review will focus on exploring and comparing different hardware architectures designed specifically for neural network accelerators, examining how each architecture is optimized for specific neural network tasks (e.g., convolutional neural networks (CNNs)).

The study could highlight the trade-offs between various design choices, such as parallelism, memory hierarchy, dataflow, flexibility, and integration with CPUs.

Contact

samira.ahmadifarsani@tum.de

Supervisor:

Samira Ahmadifarsani

Post-processing Flow-Layer Routing with Length-Matching Constraint for Flow-Based Microfluidic Biochips

Description

 

This project addresses the challenges in the current process of synthesizing microfluidic chips, particularly focusing on the gap in the complete synthesis flow which can lead to reduced performance, resource wastage, or infeasible designs. The general synthesis process typically involves three stages: high-level synthesis, followed by the design of the flow layer, and finally, the design of the control layer.

Current state-of-the-art synthesis methods, primarily operating at the operation- and device-level, make assumptions regarding the availability of fluid transportation paths. They often overlook the physical layout of control and flow channels and neglect the flow rate. This oversight can lead to biased scheduling of fluid transportation time during synthesis.

Our project proposes an innovative approach to bridge this gap. By considering the known physical design of microfluidic chips and the desired experiments, represented as sequence graphs, we aim to improve the physical design. The approach involves adjusting the lengths of the channels according to the required fluid volume. This adjustment is expected to reduce the number of valves and control ports in the original physical design, thereby enhancing the efficiency and feasibility of microfluidic chip designs.
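
The relation between a required fluid volume and a channel length can be sketched with simple arithmetic. The Python snippet below assumes a rectangular channel cross-section, and the function name and all dimensions are hypothetical example values. Real flow channels may have rounded profiles and additional design-rule constraints.

```python
def channel_length_mm(volume_nl: float, width_um: float, height_um: float) -> float:
    """Length of a rectangular channel that holds `volume_nl` nanolitres."""
    area_um2 = width_um * height_um           # cross-section in square micrometres
    length_um = volume_nl * 1e6 / area_um2    # 1 nl equals 1e6 cubic micrometres
    return length_um / 1000.0                 # micrometres to millimetres

# Hypothetical example: 10 nl in a channel that is 100 um wide and 10 um high.
length = channel_length_mm(10.0, 100.0, 10.0)
print(length)
```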

Contact

m.lian@tum.de

Supervisor:

Meng Lian

Pre-training Network Pruning

Short Description:
In this seminar the student will review state-of-the-art pruning techniques applied before training, such as SNIP.

Description

“Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. Conventionally, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters during training, or using statistical heuristics after training. However, using suitable heuristic criteria inspired by the ‘Lottery Ticket’ hypothesis, networks can also be pruned before training. This eliminates the need for both pretraining and complex pruning schedules, makes the approach robust to architecture variations, and is well suited to use in combination with neural architecture search. The canonical method SNIP [1] introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. These methods can obtain extremely sparse networks and are claimed to retain the same accuracy as the reference network on benchmark classification tasks.”

As such, pre-training pruning methods are potentially a highly attractive alternative to post-training or training-time co-optimization methods for use in automated industrial machine learning deployment toolchains.

References:
[1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr: “SNIP: Single-shot Network Pruning based on Connection Sensitivity.” arXiv 2018. https://arxiv.org/abs/1810.02340
[2] Artem Vysogorets and Julia Kempe: “Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity.” https://www.jmlr.org/papers/volume24/22-0415/22-0415.pdf
[3] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin: “Pruning Neural Networks at Initialization: Why are We Missing the Mark?” ICLR 2021. https://arxiv.org/abs/2009.08576
[4] Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania: “Progressive Skeletonization: Trimming more fat from a network at initialization.” https://arxiv.org/abs/2006.09081
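
The connection-sensitivity idea can be illustrated on a deliberately tiny model. The Python sketch below assumes a single linear neuron with a squared-error loss and one training example. The function name `snip_mask` and all numbers are made up for illustration. The sensitivity of connection j is taken as |dL/dw_j * w_j|, normalized over all connections, and the `keep` highest-scoring connections survive.

```python
def snip_mask(weights, inputs, target, keep):
    """Keep the `keep` connections with the largest connection sensitivity."""
    pred = sum(w * x for w, x in zip(weights, inputs))
    # Gradient of L = (pred - target)^2 with respect to each weight.
    grads = [2.0 * (pred - target) * x for x in inputs]
    # Sensitivity of a multiplicative mask c_j at c_j = 1: |dL/dw_j * w_j|.
    sens = [abs(g * w) for g, w in zip(grads, weights)]
    total = sum(sens)
    saliency = [s / total for s in sens] if total > 0 else sens
    ranked = sorted(range(len(weights)), key=lambda j: saliency[j], reverse=True)
    kept = set(ranked[:keep])
    mask = [1 if j in kept else 0 for j in range(len(weights))]
    return mask, saliency

# Hypothetical data: the largest weight (2.0) sees a zero input.
weights = [0.5, -1.0, 0.1, 2.0]
inputs = [1.0, 2.0, 3.0, 0.0]
mask, saliency = snip_mask(weights, inputs, target=0.0, keep=2)
print(mask)
```

In this example the weight with the largest magnitude is pruned because it does not influence the loss on the given data, which is the main difference from plain magnitude-based pruning.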

Contact

Andrew.stevens@infineon.com

Daniela.sanchezlopera@infineon.com 

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies AG)

Compression Techniques for Floating-Point Weights in Machine Learning Models

Description

Deep Neural Networks (DNNs) offer possibilities for tackling practical challenges and broadening the scope of Artificial Intelligence (AI) applications. The considerable computational and memory needs of current neural networks are attributed to the increasing complexity of network structures, which involve numerous layers containing millions of parameters. The energy consumed during DNN inference is predominantly attributed to the access and processing of these parameters. To tackle the significant size of models integrated into Internet of Things (IoT) devices, a promising strategy is to reduce the bit-width of the weights.

 

The objective of this seminar is to conduct a comprehensive literature survey of compression techniques for floating-point weights and to gather the advantages and disadvantages of the available solutions. Depending on the time and the reviewed content, the survey can be extended to identify a hardware-efficient technique for compressing floating-point weights.
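
As one simple example of reducing the bit-width of a floating-point weight, the Python sketch below truncates an IEEE 754 single-precision value to its upper 16 bits, which corresponds to the bfloat16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits). The function names are hypothetical, and truncation (rather than round-to-nearest) is chosen only to keep the example short. It is meant as a starting point for intuition, not as one of the techniques the survey should necessarily cover.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 value to its upper 16 bits (bfloat16 layout)."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16  # drop the 16 least significant mantissa bits

def from_bfloat16_bits(bits16: int) -> float:
    """Expand 16 bfloat16 bits back to a float by zero-filling the mantissa."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

weight = 3.14159
stored = to_bfloat16_bits(weight)        # 16 bits instead of 32
restored = from_bfloat16_bits(stored)
print(hex(stored), restored, abs(weight - restored))
```

The storage cost is halved, and the difference between `weight` and `restored` is the precision lost to truncation.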

 

Bibliography:

[1] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” 2019.
[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2016.
[3] G. C. Marinò, G. Ghidoli, M. Frasca, and D. Malchiodi, “Compression strategies and space-conscious representations for deep neural networks,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 9835–9842.
[4] G. C. Marinò, A. Petrini, D. Malchiodi, and M. Frasca, “Compact representations of convolutional neural networks via weight pruning and quantization,” CoRR, vol. abs/2108.12704, 2021. [Online]. Available: https://arxiv.org/abs/2108.12704

 

 

 

Contact

Supervisor:

Conrad Foik

Reliability-Aware Design Flow for Silicon Photonics On-Chip Interconnect

Description

Intercore communication in many-core processors presently faces scalability issues similar to those that plagued intracity telecommunications in the 1960s. Optical communication promises to address these challenges now, as then, by providing low latency, high bandwidth, and low power communication. Silicon photonic devices presently are vulnerable to fabrication and temperature-induced variability. Our fabrication and measurement results indicate that such variations degrade interconnection performance and, in extreme cases, the interconnection may fail to function at all. In this paper, we propose a reliability-aware design flow to address variation-induced reliability issues. To mitigate effects of variations, limits of device design techniques are analyzed and requirements from architecture-level design are revealed. Based on this flow, a multilevel reliability management solution is proposed, which includes athermal coating at fabrication-level, voltage tuning at device-level, as well as channel hopping at architecture-level. Simulation results indicate that our solution can fully compensate variations thereby sustaining reliable on-chip optical communication with power efficiency.

Contact

zhidan.zheng@tum.de

Supervisor:

Zhidan Zheng

Simultaneously Tolerate Thermal and Process Variations Through Indirect Feedback Tuning for Silicon Photonic Networks

Keywords:
thermal tolerant; process variations; optical networks-on-chip

Description

Silicon photonics is the leading candidate technology for high-speed and low-energy-consumption networks. Thermal and process variations are the two main challenges of achieving high-reliability photonic networks. Thermal variation is due to the heat issues created by application, floorplan, and environment, while process variation is caused by fabrication variability in the deposition, masking, exposition, etching, and doping. Tuning techniques are then required to overcome the impact of the variations and efficiently stabilize the performance of silicon photonic networks. We extend our previous optical switch integration model, BOSIM, to support the variation and thermal analyses. Based on device properties, we propose indirect feedback tuning (IFT) to simultaneously alleviate thermal and process variations. IFT can improve the BER of silicon photonic networks to 10^-9 under different variation situations. Compared to state-of-the-art techniques, IFT can achieve an up to 1.52 × 10^8 times bit-error-rate improvement and 4.11× better heater energy efficiency. Indirect feedback does not require high-speed optical signal detection, and thus, the circuit design of IFT saves up to 61.4% of the power and 51.2% of the area compared to state-of-the-art designs.

Contact

zhidan.zheng@tum.de

Supervisor:

Zhidan Zheng

A polynomial time optimal diode insertion/routing algorithm for fixing antenna problem

Description

Abstract: The antenna problem is a phenomenon of plasma-induced gate oxide degradation. It directly affects the manufacturability of VLSI circuits, especially in deep-submicron technology using high-density plasma. Diode insertion is a very effective way to solve this problem. Ideally, diodes are inserted directly under the wires that violate antenna rules. But in today's high-density VLSI layouts, there is simply not enough room for "under-the-wire" diode insertion for all wires. Thus it is necessary to insert many diodes at legal "off-wire" locations and extend the antenna-rule-violating wires to connect to their respective diodes. Previously, only simple heuristic algorithms were available for this diode insertion and routing problem. In this paper we show that the diode insertion and routing problem for an arbitrary given number of routing layers can be optimally solved in polynomial time. Our algorithm guarantees to find a feasible diode insertion and routing solution whenever one exists. Moreover, we can guarantee to find a feasible solution that minimizes a cost function of the form α · L + β · N, where L is the total length of extension wires and N is the total number of vias on the extension wires. Experimental results show that our algorithm is very efficient.
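
To make the cost function α · L + β · N tangible, the following Python sketch evaluates it for every assignment of violating wires to free diode sites in a tiny, made-up instance. The function name `best_assignment`, the `options` table, and all numbers are hypothetical. The brute-force enumeration is exponential and only serves to illustrate the objective. It is not the polynomial-time algorithm described in the abstract.

```python
from itertools import permutations

def best_assignment(options, alpha, beta):
    """Enumerate all wire-to-site assignments and return the cheapest one.

    options[w][s] is a (length, vias) pair for the extension wire from
    wire w to diode site s, or None if that connection is infeasible.
    """
    n_wires = len(options)
    n_sites = len(options[0])
    best_cost, best_sites = None, None
    # Each diode site can serve at most one wire in this simplified model.
    for sites in permutations(range(n_sites), n_wires):
        pairs = [options[w][s] for w, s in enumerate(sites)]
        if any(p is None for p in pairs):
            continue
        total_len = sum(p[0] for p in pairs)    # L: total extension wire length
        total_vias = sum(p[1] for p in pairs)   # N: total number of vias
        cost = alpha * total_len + beta * total_vias
        if best_cost is None or cost < best_cost:
            best_cost, best_sites = cost, sites
    return best_cost, best_sites

# Hypothetical instance: two violating wires, three candidate diode sites.
options = [
    [(4, 0), (2, 2), None],
    [(3, 1), (5, 0), (1, 1)],
]
result = best_assignment(options, alpha=1, beta=2)
print(result)
```

With α = 1 and β = 2, the cheapest solution here is not the one with the shortest total wire length, because vias are weighted more heavily than wire length.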

Contact

alex.truppel@tum.de

Supervisor:

Alexandre Truppel

A general multi-layer area router

Description

Abstract: This paper presents a general multi-layer area router based on a novel grid construction scheme. The grid construction scheme produces more wiring tracks than the normal uniform grid scheme and accounts for differing design rules of the layers involved. Initial routing performed on the varying capacity grid is followed by a layer assignment stage. Routing completion is ensured by iterating local and global modifications in the layer assignment stage. Our router has been incorporated into the Custom Cell Synthesis project at MCC and has shown improved results for cell synthesis problems when compared with the router Mighty which was used in earlier versions of the project.

Contact

alex.truppel@tum.de

Supervisor:

Alexandre Truppel