Seminars
Methodologies for Accelerating Deep Learning Inference on Different Tensor Machines
Description
The proliferation of deep learning (DL) applications has created an unprecedented demand for high-performance hardware accelerators capable of handling the computationally intensive tasks involved in DL processing. To address this need, various instruction set architectures (ISAs) have introduced specialized matrix extensions, such as Arm's Scalable Matrix Extension (SME) [1] and Intel's Advanced Matrix Extension (AMX) [2], to accelerate matrix operations that are at the heart of DL computations.
However, the proprietary nature of these implementations has limited their adoption and customization, highlighting the need for open-source and flexible solutions. The RISC-V ISA, with its open-source architecture, has emerged as a promising platform for developing custom extensions for tensor operations [3] [4]. Researchers have proposed various methodologies for both dependent and independent matrix extensions, including the use of matrix registers and accumulator registers, to improve performance, efficiency, and scalability.
The seminar will provide a comprehensive overview of the current state of custom extensions for tensor operations, highlighting the advantages and limitations of existing programming models, design methodologies, and performance evaluation techniques.
The seminar should cover the following topics:
-
Research existing programming models for custom extensions for tensor operations particularly for RISCV, including their advantages and limitations.
-
Design methodologies for different tensor extensions, from DL compilation, design, simulation, and deployment, highlighting their strengths and weaknesses.
-
Analysis and evaluation of the performance of different programming models for custom extensions of tensor operations, considering factors such as parallelism, latency, and data transfer.
References:
[1] Intel®Architecture Instruction Set Extensions and Future Features Programming Reference
[2] Arm® Architecture Reference Manual Supplement, The Scalable Matrix Extension (SME), for Armv9-A
[3] V. Verma, T. Tracy II, and M. R. Stan, “EXTREM-EDGE - EXtensions To RISC-V for Energy-efficient ML inference at the EDGE of IoT,” Sust. Comp.: Informatics and Systems, vol. 35, p. 100742, 2022.
[4] Perotti, Matteo & Zhang, Yichao & Cavalcante, Matheus & Mustafa, Enis & Benini, Luca. (2024). MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication.
Contact
Supervisor:
Comparison of ARM and RISC-V ISA Bit-Manipulation Instructions
Description
Different processors of ARM are de-facto standards in embedded SoCs and are on their way to being established in data centers and high-performance computing applications. Since about 15 years ago, a competitor is showing up which is following a completely different approach: An open and free to use instruction set definition that leaves space for various implementations. ISA subset support even allows special instruction extension and both open source and proprietary solutions. Only one thing must be guaranteed, the support of the I32 (respectively I64) instruction set.
The scope of this seminar is the comparison of the RV Zb* instruction set with instructions supported by ARM.
- Starting point are the ISA descriptions in [1-4]. Further, literature research shall be performed that searches publications comparing the two ISAs.
- As a first step the instruction definition relevant system states (e.g. Program Counter, Register Files shall be enumerated and compared.
- As a next step, a summary of the RISC-V B and related ARM bit-manipulation Instructions should be made. An important aspect is the size of the instructions.
- Next a comparison strategy shall be setup, e.g. by defining instructions with the same behavior, a similar behavior and a different behavior or by registers being involved.
- Further, it should be elaborated how instructions supported by ARM can be mimicked by a sequence of RISC-V instructions.
The findings should be summarized in a 4-page paper which might be submitted to a workshop or conference with RISC-V scope and should be presented in an EDA seminar.
[1] Cortex-M0 Devices Generic User Guide Version 1.0, Chapter 3. The Cortex-M0 Instruction Set https://developer.arm.com/documentation/dui0497/a/the-cortex-m0-instruction-set
[2] RISC-V Specifications https://riscv.org/technical/specifications/
[3] RISC-V Extension Specifications https://wiki.riscv.org/display/HOME/Ratified+Extensions
[4] A Comparative Analysis of ARM and RISC-V ISAs for Deeply Embedded Systems https://portal.fis.tum.de/en/publications/a-comparative-analysis-of-arm-and-risc-v-isas-for-deeply-embedded
Contact
Supervisor:
Survey of the Annual Reactive Synthesis Competition (SYNTCOMP)
Description
Today, digital hardware design is most commonly done on the Register-Transfer Level (RTL). This low abstraction level uses sequential logic to describe the behavior and structure of digital circuits using Hardware Description Languages (HDLs) like VHDL or SystemVerilog. To lower design and verification complexity, researchers have proposed moving beyond RTL and using so-called temporal logic instead, which allows for a higher-level expression of temporal correlations. For example, implementing a multi-cycle delay between two actions in VHDL or SystemVerilog requires manually describing this behavior in sequential logic, such as FSMs, buffers, or counters. Conversely, temporal logic can directly specify that action A implies action B after N cycles or even that an event will eventually happen.
Despite the advantages of easily expressing complex temporal correlations, synthesizing actual circuits from temporal logic remains an open challenge. Acknowledging this gap, the annual Reactive Synthesis Competition (SYNTCOMP) added a track to compare and benchmark logic synthesis tools for Linear Temporal Logic (LTL) in 2016 [1].
This seminar aims to conduct a comprehensive literature survey of the SYNTCOMP problem statements, benchmarks, and submitted synthesis tools [2, 3]. Students will critically analyze the advantages and potential drawbacks of these tools and algorithms to provide a perspective on their applicability in digital design. Depending on the findings, the seminar can also be extended with other (more theoretical) approaches to temporal logic synthesis beyond SYNTCOMP.
[1] S. Jacobs and R. Bloem, “The Reactive Synthesis Competition: SYNTCOMP 2016 and Beyond,” 2016, https://arxiv.org/abs/1611.07626
[2] S. Jacobs, G. A. Pérez, and P. Schlehuber-Caissier, “The Reactive Synthesis Competition,” 2023, https://www.syntcomp.org/
[3] P. J. Meyer, S. Sickert, and M. Luttenberger, “Strix: Explicit Reactive Synthesis Strikes Back!,” 2018, https://strix.model.in.tum.de/publications/MeyerSL18.pdf
Contact
RobertNiklas.Kunzelmann@infineon.com, Wolfgang.Ecker@tum.de
Supervisor:
Optimizing Vision Transformer Models: Techniques for Memory and Time-Efficient Pruning
Description
In recent years, vision transformers (ViTs) have revolutionized the field of computer vision, achieving
remarkable success across a wide range of tasks such as image classification [1], object detection [2],
and semantic segmentation [3]. Despite their impressive performance, vision transformers come with
significant computational costs and memory requirements, making them challenging to deploy in
resource-constrained environments. This is where model pruning - a technique aimed at reducing the size
and complexity of neural networks - comes into play. By selectively removing less important weights and
neurons, pruning can substantially reduce the computational burden and memory footprint of ViTs
without significantly affecting their accuracy. However, traditional pruning methods can be timeconsuming
and computationally intensive, which limits their practicality - especially for very large models.
Hence, memory and time-efficient pruning techniques are essential to make large models deployable
with reasonable compute effort.
This seminar aims to explore these advanced pruning strategies specifically tailored for vision
transformers [4,5], focusing on achieving a balance between model size reduction and computational
efficiency.
[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR, 2021.
[2] N. Carion et al., “End-to-end object detection with transformers.” ECCV, 2020.
[3] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation.” CVPR, 2022.
[4] W. Kwon et al., “A Fast Post-Training Pruning Framework for Transformers.” NeurIPS, 2022.
[5] M. Sun et al., “A Simple and Effective Pruning Approach for Large Language Models,” ICLR, 2024.
Contact
Please contact Moritz Thoma (Moritz.Thoma@bmw.de)
Supervisor:
Compression Techniques for Floating-Point Weights in Machine Learning Models
Description
Deep Neural Networks (DNNs) offer possibilities for tackling practical challenges and broadening the scope of Artificial Intelligence (AI) applications. The considerable computational and memory needs of current neural networks are attributed to the increasing complexity of network structures, which involve numerous layers containing millions of parameters. The energy consumption during the inference execution of deep neural networks (DNNs) is predominantly attributed to the access and processing of these parameters. To tackle the significant size of models integrated into Internet of Things (IoT) devices, a promising strategy involves diminishing the bit-width of weights.
The objective of this seminar is to conduct a comprehensive literature survey around compression techniques available for floating-point weights. Gather the advantages and disadvantages posed by the available solutions. Depending on the time and reviewed contents, the survey can be extended to find a hardware-efficient technique for the compression of floating-point weights.
Bibliography:
[1] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” 2019.
[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,”
2016.
[3] G. C. Marin ?o, G. Ghidoli, M. Frasca, and D. Malchiodi, “Compression strategies and space-conscious representations for deep neural networks,” in
2020 25th International Conference on Pattern Recognition (ICPR), 2021,
pp.9835–9842.
[4] G. C. Marin`o, A. Petrini, D. Malchiodi, and M. Frasca, “Compact representations of convolutional neural networks via weight pruning and
quantization,” CoRR, vol. abs/2108.12704, 2021. [Online]. Available: https://arxiv.org/abs/2108.12704
Contact
Supervisor:
On Memory Optimization of Tensor Programs
In this seminar the student will review state-of-the art memory-aware optimization techniques applied to tensor-level AI programs.
Description
Compact electronic edge devices have limited memory resources. As AI models can require large amounts of memory, running AI models on edge devices becomes challenging. Thus, optimizing AI programs that can be deployed on edge devices is necessary while saving costly memory transfers.
This need has motivated current works exploring different memory-aware optimization techniques that reduce memory utilization but do not modify the DNN parameters (as during compression or network architecture search (NAS)), such as fused tiling, memory-aware scheduling, and memory layout planning [1]. For instance, DORY (Deployment Oriented to memoRY) is an automated tool designed for deploying deep neural networks (DNNs) on low-cost microcontroller units with less than 1MB of on-chip SRAM memory. It tackles the challenge of tiling by framing it as a Constraint Programming (CP) problem, aiming to maximize the utilization of L1 memory while adhering to the topological constraints of each DNN layer. DORY then generates ANSI C code to manage the transfers between off-chip and on-chip memory and the computation phases [2]. DORY has been integrated with TVM to ease the support for heterogeneous compilation and offloading operations not supported by the accelerator to a regular host CPU [3].
This seminar topic reviews state-of-the-art approaches for memory-aware optimization techniques of ML tensor programs targeting constrained edge devices. The different methods and results shall be reviewed and compared.
References:
[1] Rafael Christopher Stahl.Code Optimization and Generation of Machine Learning and Driver Software for Memory-Constrained Edge Devices. 2024. Technical University of Munich, PhD Thesis. URL: https://mediatum.ub.tum.de/doc/1730282/1730282.pdf
[2] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi and F. Conti, "DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs," in IEEE Transactions on Computers, vol. 70, no. 8, pp. 1253-1268, 2021, https://doi.org/10.1109/TC.2021.3066883
[3] Van Delm, Josse, et al. "HTVM: Efficient neural network deployment on heterogeneous TinyML platforms." 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023. https://doi.org/10.1109/DAC56929.2023.10247664
Contact
Andrew.stevens@infineon.com
Daniela.sanchezlopera@infineon.com
Supervisor:
Innovative Memory Architectures in DNN Accelerators
Description
With the growing complexity of neural networks, more efficient and faster processing solutions are vital to enable the widespread use of artificial intelligence. Systolic arrays are among the most popular architectures for energy-efficient and high-throughput DNN hardware accelerators.
While many works implement DNN accelerators using systolic arrays on FPGAs, several (ASIC) designs from industry and academia have been presented [1-3]. To fulfill the requirements that such accelerators place on memory accesses, both in terms of data availability and latency hiding, innovative memory architectures can enable more efficient data access, reducing latency and bridging the gap towards even more powerful DNN accelerators.
One example is the Eyeriss v2 ASIC [1], which uses a distributed Global Buffer (GB) layout tailored to the demands of their row-stationary systolic array dataflow.
In this seminar, a survey of state-of-the-art DNN accelerator designs and design frameworks shall be created, focusing on their memory hierarchy.
References and Further Resources:
[1] Y. -H. Chen, T. -J. Yang, J. Emer and V. Sze. 2019 "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019, doi: https://doi.org/10.1109/JETCAS.2019.2910232
[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. "DianNao family: energy-efficient hardware accelerators for machine learning." In Commun. ACM 59, 11 (November 2016), 105–112. https://doi.org/10.1145/2996864
[3] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. "In-Datacenter Performance Analysis of a Tensor Processing Unit." In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246
[4] Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Comput. Surv. 56, 1, Article 20 (January 2024), 37 pages. https://doi.org/10.1145/3604802
[5] Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, and Tiejun Li. 2024. "SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow." ACM Trans. Des. Autom. Electron. Syst. 29, 2, Article 26 (March 2024), 32 pages. https://doi.org/10.1145/3634703
Contact
benedikt.schaible@tum.de
Supervisor:
From Tree to Bus: Modifying Obstacle-Avoiding Steiner Tree Algorithms for the Synthesis of Bus Topology
Description
The ultimate goal of this study is to generate a bus topology that minimizes wire length while considering obstacles. When examining general obstacle-aware routing problems considering wire length minimization, the most widely acknowledged automatic routing method is the Obstacle-Avoiding Steiner Minimum Tree (OASMT). The OASMT algorithm is typically used to generate tree topologies, connecting nodes through branching structures. To achieve bus topology, we aim to modify the existing OASMT algorithm by adjusting the node connection order so that it produces a bus topology structure. The task will focus solely on this modification process, changing the node connections to achieve a bus structure without involving further wire length minimization.
Contact
m.lian@tum.de
Supervisor:
Pre-training Network Pruning
In this seminar the student will review state-of-the art pruning techniques applied before training such as SNIP.
Description
“Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. Conventionally, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters during training or using statistically heuristics after training. However, using suitable heuristic criteria, inspired by the “Lottery Ticket” hypothesis networks can also be pruned before training. This eliminates the need for both pretraining and the complex pruning schedules and is well suited to use in combination with neural architecture search. making it robust to architecture variations. The canonical method SNIP [1] introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. These methods can obtain extremely sparse networks and are claimed to retain the same accuracy as reference network on benchmark classification tasks.” As such pre-training pruning methods are potentially a highly attractive alternative to post-training training-time co-optimization methods for use in automated industrial machine learning deployment toolchains. References: [1] Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip HS Torr. "Snip: Single-shot network pruning based on connection sensitivity." arXiv 2018. https://arxiv.org/abs/1810.02340 [2] Artem Vysogorets and Julia Kempe .“Connectivity Matters: “Neural Network Pruning Through the Lens of Effective Sparsity.” https://www.jmlr.org/papers/volume24/22-0415/22-0415.pdf [3] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin: “Pruning Neural Networks at Initialization: Why are We Missing the Mark?” ICLR 2021. https://arxiv.org/abs/2009.08576 [4] Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H.S. Torr, Gregory Rogez, Puneet K. Dokania: “Progressive Skeletonization: Trimming more fat from a network at initialization”. https://arxiv.org/abs/2006.09081
Contact
Andrew.stevens@infineon.com
Daniela.sanchezlopera@infineon.com
Supervisor:
Tensor program optimization for machine learning models
In this seminar the student will review state-of-the art tools allowing to perform tensor optimizations for machine learning (ML) models specially focusing on TVM-related approaches.
Description
The widespread use of ML in many real-life applications is enabled by deep learning models optimized and deployed to specific hardware platforms and devices. Typically, deep learning frameworks depend on libraries that have been manually optimized. Engineers need to choose from many tensor programs that are logically equivalent but differ significantly in performance due to memory access, threading, and the use of specialized hardware primitives. This selection and optimization process are quite labor-intensive. Thanks to the increase of model architectures, their sizes and hardware targets or backends, tensor program optimization has become a key factor for the efficient deployment of ML models. In this seminar paper, an extensive survey of state-of-the-art tools for tensor program optimization shall be done specially focusing on tools leveraging the ML compiler TVM, such as AutoTVM, Ansor [2], MetaSchedule[3], or others comparing against TVM, such as TensorIR [4]. Key areas of focus are: - Understanding core idea of the reviewed methods - Examining and comparing reported results - Assessing the advantages of using those methods and potential drawbacks References: [1] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs (NIPS'18). https://proceedings.neurips.cc/paper_files/paper/2018/hash/8b5700012be65c9da25f49408d959ca0- Abstract.html [2] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: generating high-performance tensor programs for deep learning (OSDI'20). https://www.usenix.org/conference/osdi20/presentation/zheng [3] Shao, Junru, et al. "Tensor program optimization with probabilistic programs." Advances in Neural Information Processing Systems 35 (2022): 35783-35796. https://proceedings.neurips.cc/paper_files/paper/2022/hash/e894eafae43e68b4c8dfdacf742bcbf3- Abstract-Conference.html [4] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. (ASPLOS 2023). https://dl.acm.org/doi/abs/10.1145/3575693.3576933
Contact
- Daniela.sanchezlopera@infineon.com
- Samira.ahmadifarsani@tum.de
Supervisor:
Post-processing Flow-Layer Routing with Length-Matching Constraint for Flow-Based Microfluidic Biochips
Description
Here's a consolidated project description based on your provided information:
This project addresses the challenges in the current process of synthesizing microfluidic chips, particularly focusing on the gap in the complete synthesis flow which can lead to reduced performance, resource wastage, or infeasible designs. The general synthesis process typically involves three stages: high-level synthesis, followed by the design of the flow layer, and finally, the design of the control layer.
Current state-of-the-art synthesis methods, primarily operating at the operation- and device-level, make assumptions regarding the availability of fluid transportation paths. They often overlook the physical layout of control and flow channels and neglect the flow rate. This oversight can lead to biased scheduling of fluid transportation time during synthesis.
Our project proposes an innovative approach to bridge this gap. By considering the known physical design of microfluidic chips and the desired experiments, represented as sequence graphs, we aim to improve the physical design. The approach involves adjusting the lengths of the channels according to the required fluid volume. This adjustment is expected to reduce the number of valves and control ports in the original physical design, thereby enhancing the efficiency and feasibility of microfluidic chip designs.
Contact
m.lian@tum.de
Supervisor:
Alternative Optimization Methods for Training of Neural Networks
Description
To this day, the go to algorithms for training neural netowrks are stochastic gradient descent or some flavour of the Adam family. Other algorithms, such as quasi-Newton or derivative free methods are less requently used, although they can, under certain circumstances, bring better results. The goal of this work is to write a survey on the usage of alternative methods for taining, such as quasi-Newton or derivative free algorithms.
Prerequisites
Knowledge in optimization and basic linear algebra is recommended.
Contact
markus.leibl@tum.de
Supervisor:
Performance Comparison of Derivative Free Optimization Algorithms
Description
One of the most common approaches in optimization is to make use of first and second derivatives, in order to find a minimum of a target function. In many cases though, this is not a viable option, as derivatives might not be available, or the function might not be differentiable at all.
For such problems, we need algorithms, that work without derivatives. This can be achieved for example using finite differences or interpolation.
One important algorithm family are the algorithms NEWUOA and BOBYQA by M. J. D. Powell.
The goal of this seminar is to gather works, that evaluate the performance of those algorithms. Furthermore, we want to create a performance matrix that puts Powell’s algorithms into relation with traditional algorithms over a range of different optimization problems.
Paper for BOBYQA (Has not to be read enirely!):
Prerequisites
Knowledge in optimization and basic linear algebra is recommended.
Contact
markus.leibl@tum.de
Supervisor:
A polynomial time optimal diode insertion/routing algorithm for fixing antenna problem
Description
Abstract— Antenna problem is a phenomenon of plasma induced gate oxide degradation. It directly affects manufacturability of VLSI circuits, especially in deep-submicron technology using high density plasma. Diode insertion is a very effective way to solve this problem Ideally diodes are inserted directly under the wires that violate antenna rules. But in today's high-density VLSI layouts, there is simply not enough room for "under-the-wire" diode insertion for all wires. Thus it is necessary to insert many diodes at legal "off-wire" locations and extend the antenna-rule violating wires to connect to their respective diodes. Previously only simple heuristic algorithms were available for this diode insertion and routing problem. In this paper we show that the diode insertion and routing problem for an arbitrary given number of routing layers can be optimally solved in polynomial time. Our algorithm guarantees to find a feasible diode insertion and routing solution whenever one exists. Moreover we can guarantee to find a feasible solution to minimize a cost function of the form /spl alpha/ /spl middot/ L + /spl beta/ /spl middot/ N where L is the total length of extension wires and N is the total number of Was on the extension wires. Experimental results show that our algorithm is very efficient.
Contact
alex.truppel@tum.de