Seminar Media Technology
| Lecturer (assistant) | |
| --- | --- |
| Number | 220906570 |
| Type | Advanced seminar |
| Duration | 3 SWS |
| Term | Wintersemester 2024/25 |
| Language of instruction | German |
| Position within curricula | See TUMonline |
| Dates | See TUMonline |
Admission information
Note: registration via TUMonline
Umbrella topic for WS24/25: "Mastering Data Completion: Superresolution, Inpainting, and Beyond with Machine Learning"
The kick-off meeting for the seminar is on 16.10.2024 at 13:15 in Seminar Room 0406.
Attendance is mandatory to get a fixed place in the course!
This semester's Media Technology scientific seminar is focused on data completion. The aim is to investigate its potential, advancements, and future directions in various application domains. More details will be provided during the kick-off meeting.
Important notice regarding "Fixed places" in the seminar:
Registering in TUMonline will change your status to "Requirements met." During the seminar kick-off, the fixed places are assigned to the students according to a priority list; attending the kick-off session is therefore mandatory to secure a fixed place in the seminar. However, please note that due to the high demand, not all students will necessarily get a fixed place, as the spots are limited. It is therefore crucial to register as soon as possible.
Radar sensors are widely used in machine perception due to their low cost and good sensing properties. However, noise and multipath reflections are known to impair the radar signal and thus the measurements obtainable from it, such as Range-Doppler maps or point clouds.
Point clouds obtained from radar data are used in several areas, such as automotive radar and radar imaging.
In general, algorithms for radar systems often produce low-resolution and sparse point clouds due to hardware limitations and noise, impacting object detection and scene understanding.
Super-resolution methods based on different machine learning architectures address these challenges by reconstructing high-resolution radar point clouds from the obtained radar data, as shown in [1] and [2].
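As a purely illustrative sketch (not the architecture of [1] or [2], and with made-up hyperparameters), a learned point cloud upsampler can be as simple as a shared per-point MLP that predicts several coordinate offsets per sparse input point:

```python
# Minimal sketch: PointNet-style upsampling of a sparse radar point cloud (N, 3)
# into a denser one (N*ratio, 3) by predicting ratio xyz offsets per input point.
# All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PointUpsampler(nn.Module):
    def __init__(self, ratio: int = 4, feat_dim: int = 64):
        super().__init__()
        self.ratio = ratio
        # Per-point feature extraction (shared MLP).
        self.encoder = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # Predict `ratio` xyz offsets per point from local + global features.
        self.offset_head = nn.Linear(feat_dim * 2, ratio * 3)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) sparse input cloud
        feats = self.encoder(points)                          # (B, N, F)
        global_feat = feats.max(dim=1, keepdim=True).values   # (B, 1, F) global context
        global_feat = global_feat.expand_as(feats)
        offsets = self.offset_head(torch.cat([feats, global_feat], dim=-1))
        offsets = offsets.view(points.shape[0], points.shape[1], self.ratio, 3)
        dense = points.unsqueeze(2) + offsets                 # replicate each point and displace
        return dense.reshape(points.shape[0], -1, 3)          # (B, N*ratio, 3)

# Usage: upsample a batch of 128-point radar clouds by 4x.
sparse = torch.randn(2, 128, 3)
dense = PointUpsampler(ratio=4)(sparse)
print(dense.shape)  # torch.Size([2, 512, 3])
```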
The goal of this seminar work is to conduct a literature survey that goes beyond the sources given here and to evaluate and compare different methods for high-resolution point cloud generation from radar data. This can, of course, include applications in different areas.
Supervision: Stefan Hägele (stefan.haegele@tum.de)
References:
[1] A. Prabhakara et al., "High Resolution Point Clouds from mmWave Radar," 2023 IEEE International Conference on Robotics and Automation (ICRA), London, United Kingdom, 2023
[2] K. Luan, C. Shi, N. Wang, Y. Cheng, H. Lu and X. Chen, "Diffusion-Based Point Cloud Super-Resolution for mmWave Radar Data," 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 11171-11177
Bilateral teleoperation systems equipped with haptic feedback enable human operators to interact with objects or perform intricate tasks in remote or otherwise unreachable environments. The model-mediated teleoperation (MMT) approach ensures both the stability and transparency of the system, even when faced with unpredictable communication delays [1]. In MMT, multiple sensors, such as visual cameras and haptic devices, are combined with machine learning and neural networks to establish a Digital Twin (DT) between the host and the remote environment, thus improving the performance of the teleoperation system.
This topic aims at well-performing DT environment restoration using combined sensors to handle various situations, such as grabbing and adding new objects, giving users an immersive visual and haptic experience. Using a depth camera and a robot arm, object updates and position changes can be detected and the environment restored for the grasping task [2-3]. When no depth information is available, the environment can be reconstructed from a 2D camera to obtain a visual display [4]; furthermore, a mesh model that provides force-feedback information can be obtained [5].
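As an illustration of the kind of sensor data such a pipeline starts from (not the method of any of the cited works; intrinsics below are made-up values), a depth image can be back-projected into a point cloud with the standard pinhole camera model:

```python
# Minimal sketch: back-project a metric depth image into a 3D point cloud,
# a common first step before updating a digital-twin scene from a depth camera.
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx, fy, cx, cy):
    """depth: (H, W) metric depth in meters; returns (M, 3) points, invalid depth dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)
    return pts[z > 0]  # keep only valid measurements

# Usage with a synthetic 480x640 depth map and assumed intrinsics.
depth = np.full((480, 640), 1.5, dtype=np.float32)
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (307200, 3)
```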
For this topic, the expected achievements include comprehending the environment restoration strategies presented in the articles and finding other related articles on your own. A summary and comparison of the advantages and disadvantages of each method should be included in the report and the presentation. Moreover, you should envision the direction and content of future research based on your work.
Supervision: Siwen Liu (siwen.liu@tum.de)
References:
[1] X. Xu, B. Cizmeci, C. Schuwerk and E. Steinbach, "Model-Mediated Teleoperation: Toward Stable and Transparent Teleoperation Systems," in IEEE Access, vol. 4, pp. 425-449, 2016, doi: 10.1109/ACCESS.2016.2517926.
[2] L. Rustler, J. Lundell, J. K. Behrens, V. Kyrki and M. Hoffmann, "Active Visuo-Haptic Object Shape Completion," in IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5254-5261, April 2022, doi: 10.1109/LRA.2022.3152975.
[3] L. Rustler, J. Matas and M. Hoffmann, "Efficient Visuo-Haptic Object Shape Completion for Robot Manipulation," 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 3121-3128, doi: 10.1109/IROS55552.2023.10342200.
[4] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 42, 4, Article 139 (August 2023), 14 pages.
[5] A. Guédon and V. Lepetit, "SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering," 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2024, pp. 5354-5363, doi: 10.1109/CVPR52733.2024.00512.
Most research on object inpainting focuses on images with clear foreground-background separation. However, in hand-object interaction, hands "wrap" around the object, and a clean separation between the two is difficult to achieve. Under this topic, the student should survey methods and models for inpainting (removing) hands from video sequences.
Supervision: Rahul Chaudhari (rahul.chaudhari@tum.de)
References:
[1] Ke et al., Occlusion-aware Video Object Inpainting, ICCV 2021
[2] Zhang et al., Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications, ECCV 2022.
[3] Chang et al., Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos, NeurIPS 2023.
[4] Tulsiani et al., PixelTransformer: Sample Conditioned Signal Generation, 2021.
3D sensors like LiDAR and RGB-D cameras have been widely applied to capture 3D point clouds of the surrounding world. However, due to occlusion, lighting conditions, and hardware limitations, the point clouds are often noisy and incomplete, resulting in a loss of geometric and semantic information. The performance of downstream tasks, such as semantic segmentation and indoor localization, can be significantly affected by erroneous scans. Thus, inferring the original object geometries from an incomplete 3D scan is an indispensable task in 3D reconstruction.
Many supervised point completion models have been proposed in recent years. However, supervised learning requires datasets with paired incomplete-complete point clouds, which are hard to obtain in the real world. Therefore, these models are often trained on scans sampled from CAD models that may not represent real-world conditions. To overcome this issue, researchers attempt to train point completion models on unpaired data. The unsupervised works are mostly based on Generative Adversarial Nets (GANs) [1]. In [2], a generator learns to create a complete point cloud from the noisy scan, and a discriminator learns to judge whether the created point cloud is fake. [3] trains the network via so-called GAN inversion. [5] incorporates the symmetric nature of real-world shapes into the completion framework. [4] and [6] use another branch of unsupervised learning: autoencoders with self-supervision. Some of the above-mentioned models already achieve results comparable to earlier supervised models.
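A heavily simplified sketch of the unpaired adversarial idea behind [2] is shown below: a generator completes a partial cloud while a discriminator is trained on unpaired complete clouds, so no incomplete-complete ground-truth pairs are needed. The architectures and data here are toy placeholders, not the models from the paper.

```python
# Minimal sketch of unpaired adversarial point cloud completion (toy setup).
import torch
import torch.nn as nn

N_PARTIAL, N_FULL = 256, 512

generator = nn.Sequential(           # maps a flattened partial cloud to a complete one
    nn.Linear(N_PARTIAL * 3, 512), nn.ReLU(),
    nn.Linear(512, N_FULL * 3),
)
discriminator = nn.Sequential(       # scores whether a complete cloud looks real
    nn.Linear(N_FULL * 3, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(3):  # a few toy iterations
    partial = torch.randn(8, N_PARTIAL * 3)   # incomplete scans
    real = torch.randn(8, N_FULL * 3)         # unpaired complete clouds from another source
    fake = generator(partial)

    # Discriminator: real clouds -> 1, generated clouds -> 0.
    d_loss = bce(discriminator(real), torch.ones(8, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring its completions as real.
    g_loss = bce(discriminator(fake), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```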
During the seminar, we will review the state-of-the-art algorithms for unsupervised point cloud completion. You will analyze and compare different approaches in terms of their model architecture, training strategy, loss function, etc. Furthermore, you will explore future prospects and potential improvements for further research.
Supervision: Zhifan Ni (zhifan.ni@tum.de)
References:
[1] I. Goodfellow et al., “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[2] X. Chen, B. Chen, and N. J. Mitra, “Unpaired Point Cloud Completion on Real Scans using Adversarial Training,” in International Conference on Learning Representations (ICLR), 2020.
[3] J. Zhang et al., “Unsupervised 3D Shape Completion Through GAN Inversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[4] H. Mittal, B. Okorn, A. Jangid, and D. Held, “Self-Supervised Point Cloud Completion via Inpainting,” in Proceedings of the British Machine Vision Conference (BMVC), 2021.
[5] C. Ma, Y. Chen, P. Guo, J. Guo, C. Wang, and Y. Guo, “Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[6] F. Liu et al., “CloudMix: Dual Mixup Consistency for Unpaired Point Cloud Completion,” IEEE Transactions on Visualization and Computer Graphics, 2024.
To compress an image, an encoder transforms the data into a bitstream, and a decoder recovers a realistic approximation of the original. The fewer bits in the bitstream, the worse the quality of the reconstruction. This puts a natural upper bound on how much an image can be compressed without losing too much quality. However, if the goal is not to reconstruct the original but a substitute with high visual fidelity, images can be compressed even further. Generative learned image compression aims to find extremely short bitstreams to represent images such that the decoder can produce satisfying images that preserve the semantic meaning by reimagining the lost data.
Generative learned image codecs focus on perceptually satisfying reconstructions [1] [8]. Most techniques incorporate some kind of Generative Adversarial Network (GAN) as the decoder. Initial attempts in this research area integrated a rate-distortion loss into GAN training [2] [3]. Some models can operate at low and high realism at the same time, allowing a trade-off between high accuracy and high fidelity [4] [5]. Incorporating classical coding techniques such as vector quantization [6] and region-of-interest extraction [7] allowed further gains in compression and reconstruction quality.
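The coupling of rate, distortion, and realism can be written as a single training objective. The sketch below is only an illustration of that idea, not the loss of any specific cited codec; the weights and tensors are placeholders.

```python
# Minimal sketch: rate + weighted distortion + weighted adversarial (realism) loss.
import torch
import torch.nn.functional as F

def generative_compression_loss(rate_bpp, x, x_hat, disc_logits_fake,
                                lambda_d=0.01, lambda_g=0.001):
    """rate_bpp: estimated bits per pixel of the latent; x_hat: decoder/generator output."""
    distortion = F.mse_loss(x_hat, x)                       # pixel-fidelity term
    adversarial = F.binary_cross_entropy_with_logits(       # realism term: fool the discriminator
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return rate_bpp + lambda_d * distortion + lambda_g * adversarial

# Usage with dummy tensors.
x, x_hat = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
loss = generative_compression_loss(torch.tensor(0.15), x, x_hat, torch.randn(1, 1))
print(loss.item())
```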
Your tasks include conducting a thorough literature review to find relevant state-of-the-art techniques and explaining core principles behind them. You will assess their effectiveness across metrics like compression performance, reconstruction realism, and computational efficiency. You are expected to give a summary of the most prevalent methods while emphasizing major developments in the field.
Supervision: Burak Dogaroglu (burak.dogaroglu@tum.de)
References:
[1] S. Santurkar, D. Budden, and N. Shavit, “Generative Compression,” 2017, arXiv. doi: 10.48550/ARXIV.1703.01467.
[2] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-Fidelity Generative Image Compression,” 2020, arXiv. doi: 10.48550/ARXIV.2006.09965.
[3] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative Adversarial Networks for Extreme Learned Image Compression,” 2018, arXiv. doi: 10.48550/ARXIV.1804.02958
[4] S. Iwai, T. Miyazaki, Y. Sugaya, and S. Omachi, “Fidelity-Controllable Extreme Image Compression with Generative Adversarial Networks,” arXiv, 2020, doi: 10.48550/ARXIV.2008.10314.
[5] E. Agustsson, D. Minnen, G. Toderici, and F. Mentzer, “Multi-Realism Image Compression with a Conditional Generator,” 2022, arXiv. doi: 10.48550/ARXIV.2212.13824.
[6] Z. Jia, J. Li, B. Li, H. Li, and Y. Lu, "Generative Latent Coding for Ultra-Low Bitrate Image Compression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26088-26098.
[7] J. Luo, Y. Wang, and H. Qin, “Super-High-Fidelity Image Compression via Hierarchical-ROI and Adaptive Quantization,” 2024, arXiv. doi: 10.48550/ARXIV.2403.13030.
[8] B. Chen, S. Yin, P. Chen, S. Wang, and Y. Ye, “Generative Visual Compression: A Review,” 2024, arXiv. doi: 10.48550/ARXIV.2402.02140.
By iteratively refining noise into meaningful content, diffusion models (DMs) allow for highly realistic and context-aware image generation [1]. This approach is crucial for tasks like image restoration, content manipulation, and creative applications. The generative capability of DMs can also be used for inpainting, in which unknown or empty regions are filled in a way that blends seamlessly with the surrounding content, enhancing visual quality. The methodologies used for inpainting range from training the DM itself for inpainting to utilizing the generative capabilities of DMs without any additional training [3].
One important capability of DMs is that they can be conditioned on various modalities to control the synthesis. This feature is especially utilized by latent diffusion models (LDMs) [2]. The conditioning is not necessarily made available by training the LDM from scratch; it can also be added by developing adapters that extend the conditioning to the desired modality. These adapters can be either attention-based [4] or residual-based [5]. One common control modality for synthesis is semantic maps, widely used for semantic image synthesis with LDMs. However, especially in real-life applications, semantic maps can end up being incomplete, hindering visual quality.
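As a rough illustration of the residual-based adapter idea in the spirit of [5] (not the actual ControlNet architecture; all channel counts are assumptions), features extracted from a conditioning image such as a semantic map can be injected into a frozen backbone block through a zero-initialized projection, so training starts without disturbing the pretrained model:

```python
# Minimal sketch: residual-style conditioning adapter with a zero-initialized projection.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.cond_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        self.zero_proj = nn.Conv2d(channels, channels, 1)
        nn.init.zeros_(self.zero_proj.weight)   # residual contribution is zero at initialization
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_feat: torch.Tensor, cond_img: torch.Tensor) -> torch.Tensor:
        # backbone_feat: (B, C, H, W) from the frozen denoiser; cond_img: (B, 3, H, W)
        return backbone_feat + self.zero_proj(self.cond_encoder(cond_img))

# Usage: inject a (possibly incomplete) semantic map into one feature level.
feat = torch.randn(1, 64, 32, 32)
semantic_map = torch.rand(1, 3, 32, 32)
out = ResidualAdapter(64)(feat, semantic_map)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```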
In this topic, the student is expected to dive into the literature on semantic image synthesis and inpainting with LDMs, compare and contrast the methods, and discuss how the methodologies could be combined to obtain semantic image synthesis that is robust against empty or missing regions.
Supervision: Cem Eteke (cem.eteke@tum.de)
References:
[1] Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in neural information processing systems 34 (2021): 8780-8794.
[2] Po, Ryan, et al. "State of the art on diffusion models for visual computing." Computer Graphics Forum. Vol. 43. No. 2. 2024.
[3] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[4] Ye, Hu, et al. "Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models." arXiv preprint arXiv:2308.06721 (2023).
[5] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
High Dynamic Range (HDR) content differs from lower-dynamic-range content by providing higher contrast, with brighter highlights and darker shadows. The quality of HDR visual content significantly improves the human viewing experience. Nevertheless, the vast majority of existing images and videos were not stored in HDR (e.g., JPEG images and H.264-coded videos); they lack detail across the colour range and exhibit a narrow colour gamut. The task of HDR reconstruction is to fuse multiple images into a single HDR image/video, or to inversely tone-map a lower-dynamic-range input image/video by reversing the image processing pipeline, in order to recover and/or reconstruct the details and provide a better visual experience.
A typical framework for HDR reconstruction includes multi-frame alignment of the input lower-dynamic-range images, exposure adjustment, and further refinement, or, for single-image/video reconstruction, inversion of the imaging pipeline. Two-stage HDR reconstruction first aligns the input images and then refines the coarse results in a second stage to recover details [1, 2]. Luminance-based alignment networks also show promising HDR video reconstruction results [3], while other works adopt a global-to-local alignment strategy [4], selective alignment fusion [5], or transformers for ghost-free HDR images [6]. In the case of larger motion in video, flow networks have been used for better reconstruction [7]. Approaches investigating raw images, including methods that recover HDR from a single image input, are demonstrated in [8, 9]. Research on inverse tone mapping also shows promising results on HDR videos [10].
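For orientation, the classic hand-crafted exposure merge that such learned two-stage methods refine can be sketched as follows. This is only an assumed baseline for illustration, not any of the referenced networks; the gamma value and the weighting function are simplifying assumptions.

```python
# Minimal sketch: merge bracketed LDR frames into an HDR radiance map.
import numpy as np

def merge_ldr_to_hdr(frames, exposure_times, gamma=2.2, eps=1e-6):
    """frames: list of (H, W, 3) arrays in [0, 1]; exposure_times: seconds per frame."""
    numerator = np.zeros_like(frames[0], dtype=np.float64)
    denominator = np.zeros_like(frames[0], dtype=np.float64)
    for img, t in zip(frames, exposure_times):
        linear = img.astype(np.float64) ** gamma          # undo display gamma (assumed 2.2)
        radiance = linear / t                             # normalize by exposure time
        weight = 1.0 - np.abs(2.0 * img - 1.0)            # trust well-exposed mid-tones most
        numerator += weight * radiance
        denominator += weight
    return numerator / (denominator + eps)                # (H, W, 3) HDR radiance map

# Usage with three synthetic exposures.
frames = [np.clip(np.random.rand(4, 4, 3) * s, 0, 1) for s in (0.5, 1.0, 2.0)]
hdr = merge_ldr_to_hdr(frames, exposure_times=[1/125, 1/30, 1/8])
print(hdr.shape)
```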
This seminar topic provides an in-depth overview of the latest techniques in HDR reconstruction for image and video. The student should read and comprehend the current state-of-the-art works before drafting a seminar summary in the form of an oral presentation and a written paper. The student is expected to investigate the typical approaches in HDR reconstruction, provide insights into metric comparisons, demonstrate their knowledge of benchmarking the quality of reconstructed HDR images/videos, and point out possible artefacts. Reproducing the reported results and comparing the readily available network architectures is encouraged. At the end of the semester, the student is to report their findings and share them with the group in a presentation and paper.
Supervision: Hongjie You (hongjie.you@tum.de)
References:
[1] Chen, Guanying, et al. "HDR video reconstruction: A coarse-to-fine network and a real-world benchmark dataset." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[2] Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network
[3] Chung, Haesoo, and Nam Ik Cho. "Lan-hdr: Luminance-based alignment network for high dynamic range video reconstruction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Cui, Tengyao, et al. "GLHDR: HDR video reconstruction driven by global to local alignment strategy." Computers & Graphics (2024): 103980.
[5] Kong, Lingtong, et al. "SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging." arXiv preprint arXiv:2407.16308 (2024). [ECCV 2024]
[6] Liu, Zhen, et al. "Ghost-free high dynamic range imaging with context-aware transformer." European Conference on computer vision. Cham: Springer Nature Switzerland, 2022.
[7] Xu, Gangwei, et al. "HDRFlow: Real-Time HDR Video Reconstruction with Large Motions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[8] Yang, Qirui, et al. "Efficient hdr reconstruction from real-world raw images." arXiv preprint arXiv:2306.10311 (2023).
[9] Zou, Yunhao, Chenggang Yan, and Ying Fu. "Rawhdr: High dynamic range image reconstruction from a single raw image." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[10] Huang, Peihuan, et al. "Video inverse tone mapping network with luma and chroma mapping." Proceedings of the 31st ACM International Conference on Multimedia. 2023.
Pleasant playback of high-fidelity video usually requires smooth motion, which is enabled by consecutive frames in a video track. However, capture shutter speed and frame rate may be limited during source acquisition, resulting in missing frames and jerky motion at the user end. To create smoother object movements in videos, video frame interpolation (VFI) plays a vital role by estimating, compensating, and generating intermediate frames. Wide applications of VFI include motion compensation in video compression, temporal up-sampling for higher frame rates, slow-motion video creation, etc.
To generate intermediate frames, researchers have recently proposed techniques based on a range of neural network and machine learning approaches, including pyramid feature decomposition and extraction [1], convolution-based all-pairs multi-field transforms [2], extraction of motion and appearance information via inter-frame attention [3], bilateral correlation without the limitation of receptive fields [4], diffusion models [5], global inter-frame attention for motion estimation [6], state space models [7], event-aware motion-estimation-free VFI [8], asymmetric synergistic blending [9], and so on.
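A heavily simplified sketch of the flow-based core shared by several of these methods (not any specific cited paper) is shown below: both neighbouring frames are backward-warped toward the target time with linearly scaled optical flows and then blended. The flows here are placeholders; a real system would estimate them with a dedicated network.

```python
# Minimal sketch: flow-based intermediate frame synthesis by warping and blending.
import torch
import torch.nn.functional as F

def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, sampling frame at (x+u, y+v)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1      # normalize to [-1, 1]
    grid_y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)             # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def interpolate(frame0, frame1, flow_0to1, flow_1to0, t=0.5):
    # Approximate the intermediate-time flows by linear scaling, warp, then blend.
    warped0 = backward_warp(frame0, t * flow_1to0)
    warped1 = backward_warp(frame1, (1 - t) * flow_0to1)
    return (1 - t) * warped0 + t * warped1

f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow01, flow10 = torch.zeros(1, 2, 64, 64), torch.zeros(1, 2, 64, 64)
mid = interpolate(f0, f1, flow01, flow10)
print(mid.shape)  # torch.Size([1, 3, 64, 64])
```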
The student is expected to research and investigate the state-of-the-art approaches listed in the references and produce a comprehensive study of the different approaches. Reproducing the reported results using common VFI metrics and comparing the readily available network architectures is encouraged. This seminar topic should provide the student with a substantial overview of VFI; a short summary paper is required at the end of the semester, as well as an oral presentation to share the findings with the group.
Supervision: Hongjie You (hongjie.you@tum.de), co-supervised by Nicola Giuliani (Nicola.giuliani@tum.de)
References:
[1] Kong, Lingtong, et al. "Ifrnet: Intermediate feature refine network for efficient frame interpolation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Li, Zhen, et al. "Amt: All-pairs multi-field transforms for efficient frame interpolation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[3] Zhang, Guozhen, et al. "Extracting motion and appearance via inter-frame attention for efficient video frame interpolation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[4] Zhou, Chang, et al. "Video frame interpolation with densely queried bilateral correlation." Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 2023.
[5] Jain, Siddhant, et al. "Video interpolation with diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[6] Liu, Chunxu, et al. "Sparse Global Matching for Video Frame Interpolation with Large Motion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[7] Zhang, Guozhen, et al. "VFIMamba: Video Frame Interpolation with State Space Models." arXiv preprint arXiv:2407.02315 (2024). [NeurIPS 2024]
[8] Liu, Yuhan, et al. "Video Frame Interpolation via Direct Synthesis with the Event-based Reference." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[9] Wu, Guangyang, et al. "Perception-Oriented Video Frame Interpolation via Asymmetric Blending." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Gaussian Splatting [1] has recently emerged as a promising technique for 3D scene representation and rendering, with applications in areas such as virtual reality, visual effects, and digital twin creation. The method leverages the flexibility of Gaussian primitives to approximate complex surfaces by splatting them into a scene. However, one of the significant challenges in Gaussian Splatting is accurate surface reconstruction, especially when dealing with sparse or incomplete data. Reconstructing fine surface details while maintaining computational efficiency remains an open problem.
This seminar topic aims to explore the reconstruction and completion of surfaces from 3D scenes using Gaussian Splatting. During the seminar, your tasks will include researching, gathering, and conducting a comparative analysis of the current state-of-the-art in Gaussian Splatting-based surface reconstruction [2,3,4]. Furthermore, you will be expected to comprehensively explain the fundamental principles underpinning these methods, assess and contrast their effectiveness, and ultimately deliver a state-of-the-art summary. Additionally, your presentation should encompass future prospects and potential areas for further research within this domain.
It is expected that the student fundamentally understands the concept of Gaussian Splatting [1] and can describe common approaches for conventionally extracting meshes from the Gaussian Splatting representation. Additionally, the student should be able to present at least one or two examples from [2,3,4] of how state-of-the-art methods improve or change this procedure.
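One naive conventional route, shown only as an illustration (it is not the method of [1]-[4], and it assumes isotropic Gaussians), is to evaluate the density of the Gaussians on a voxel grid and extract a level-set surface with marching cubes; real pipelines use anisotropic covariances and opacity-aware level sets.

```python
# Minimal sketch: density of isotropic Gaussians on a grid + marching cubes surface.
import numpy as np
from skimage import measure

def gaussians_to_mesh(centers, sigmas, grid_res=48, level=0.5):
    """centers: (K, 3) in [0, 1]^3; sigmas: (K,) isotropic standard deviations."""
    lin = np.linspace(0.0, 1.0, grid_res)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)   # (R, R, R, 3)
    density = np.zeros((grid_res,) * 3)
    for c, s in zip(centers, sigmas):
        d2 = np.sum((grid - c) ** 2, axis=-1)
        density += np.exp(-0.5 * d2 / s**2)
    verts, faces, _, _ = measure.marching_cubes(density, level=level)
    return verts / (grid_res - 1), faces                                  # vertices back in unit cube

# Usage: two overlapping blobs merged into a single surface.
verts, faces = gaussians_to_mesh(np.array([[0.4, 0.5, 0.5], [0.6, 0.5, 0.5]]),
                                 np.array([0.08, 0.08]))
print(verts.shape, faces.shape)
```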
Supervision: Driton Salihu (driton.salihu@tum.de)
References:
[1] Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Trans. Graph. 42.4 (2023): 139-1.
[2] Guédon, Antoine, and Vincent Lepetit. "SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Dai, Pinxuan, et al. "High-Quality Surface Reconstruction Using Gaussian Surfels." ACM SIGGRAPH 2024 Conference Papers. 2024.
[4] Huang, Binbin, et al. "2D Gaussian Splatting for Geometrically Accurate Radiance Fields." ACM SIGGRAPH 2024 Conference Papers. 2024.
Depth completion is a crucial task in computer vision that focuses on generating dense depth maps from sparse depth data, commonly captured by LiDAR or depth sensors. It has broad applications in fields like robotics and 3D reconstruction. Since the depth data collected by sensors is often incomplete, depth completion must accurately infer missing values and recover fine details. Deep learning approaches significantly improve accuracy by leveraging multi-modal data such as RGB images. This topic will explore the latest deep learning methods for depth completion from RGB and LiDAR data.
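As an illustrative baseline only (not any of the cited models; channel counts and layer choices are assumptions), a depth completion network can be sketched as an encoder-decoder that takes the RGB image, the sparse LiDAR depth map, and its validity mask as input and regresses a dense depth map:

```python
# Minimal sketch: RGB + sparse depth + validity mask -> dense depth.
import torch
import torch.nn as nn

class DepthCompletionNet(nn.Module):
    def __init__(self, base: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(5, base, 3, stride=2, padding=1), nn.ReLU(),          # RGB(3)+depth(1)+mask(1)
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base, 1, 4, stride=2, padding=1), nn.ReLU(),  # depth >= 0
        )

    def forward(self, rgb, sparse_depth):
        mask = (sparse_depth > 0).float()               # which pixels carry a LiDAR measurement
        x = torch.cat([rgb, sparse_depth, mask], dim=1)
        return self.decoder(self.encoder(x))

# Usage with a toy image where roughly 5% of pixels carry depth.
rgb = torch.rand(1, 3, 64, 64)
sparse = torch.rand(1, 1, 64, 64) * (torch.rand(1, 1, 64, 64) > 0.95)
dense = DepthCompletionNet()(rgb, sparse)
print(dense.shape)  # torch.Size([1, 1, 64, 64])
```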
Supervision: Xin Su (Xin.Su@tum.de)
References:
[1] Bartoccioni, Florent, et al. "LiDARTouch: Monocular metric depth estimation with a few-beam LiDAR." Computer Vision and Image Understanding 227 (2023): 103601.
[2] Long, Chen, et al. "SparseDC: Depth Completion from sparse and non-uniform inputs." Information Fusion 110 (2024): 102470.
[3] Eldesokey, Abdelrahman, et al. "Uncertainty-aware cnns for depth completion: Uncertainty from beginning to end." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Remote photoplethysmography (rPPG) is a non-invasive technique that uses video cameras to detect changes in blood volume [1,2]. It offers a promising alternative to traditional contact photoplethysmography (cPPG) methods [3] and has been applied in areas such as heart rate monitoring, stress detection, sleep analysis [4], and hypertension management [5]. However, the accuracy of rPPG signals can be influenced by factors like motion artifacts, changes in lighting, and variations in skin tone [6,7]. These factors can introduce noise and distort the captured signal, resulting in less reliable physiological measurements.
Most existing studies have focused on computing specific physiological parameters from rPPG signals rather than capturing the raw signal itself [8]. However, if a model is trained to obtain an rPPG signal only for a specific parameter such as heart rate (HR), other information may be lost. For instance, only the systolic peak is relevant for HR calculation, resulting in the loss of information about the diastolic peak. In addition, the morphology of the signal itself, such as its first and second derivatives, can provide valuable information for detecting cardiovascular diseases (CVDs) [9].
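For context, a classical rPPG baseline (not one of the AI approaches targeted by this topic; the signal below is synthetic) spatially averages the green channel of a face region per frame, band-passes the trace to plausible heart-rate frequencies, and reads the dominant spectral peak:

```python
# Minimal sketch: classical green-channel rPPG trace -> band-pass -> heart-rate estimate.
import numpy as np
from scipy.signal import butter, filtfilt

fps = 30.0
t = np.arange(0, 20, 1 / fps)                       # 20 s of video at 30 fps
# Stand-in for the per-frame mean green intensity of the face ROI: 1.2 Hz pulse + noise.
green_trace = 0.5 * np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)

# Band-pass 0.7-4 Hz (42-240 bpm) to isolate the pulsatile component.
b, a = butter(3, [0.7, 4.0], btype="band", fs=fps)
rppg = filtfilt(b, a, green_trace)

# Heart rate from the dominant spectral peak of the filtered signal.
spectrum = np.abs(np.fft.rfft(rppg))
freqs = np.fft.rfftfreq(rppg.size, d=1 / fps)
hr_bpm = 60.0 * freqs[np.argmax(spectrum)]
print(f"Estimated heart rate: {hr_bpm:.1f} bpm")    # ~72 bpm for the synthetic pulse
```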
This seminar topic aims to explore the integration of novel temporal or spatio-temporal AI techniques into rPPG reconstruction, such as CNN-LSTM, Liquid Networks, or Bi-LSTM, to name a few.
During the seminar, your tasks will include researching, gathering, and conducting a comparative analysis of the current generative AI-based approaches to improve rPPG reconstruction. Furthermore, you will be expected to comprehensively explain the fundamental principles underpinning these methods, assess and contrast their effectiveness, and ultimately deliver a state-of-the-art summary. Additionally, your presentation should encompass future prospects and potential areas for further research within this domain.
Supervision: Fabian Seguel (fabian.seguel@tum.de)
References:
[1] Xiao, H. et al. Remote photoplethysmography for heart rate measurement: a review. Biomed. Signal Process. Control 88, 105608 (2024).
[2] Frey, L., Menon, C. & Elgendi, M. Blood pressure measurement using only a smartphone. npj Digit. Med. 5, 86 (2022)
[3] Elgendi, M. On the analysis of fingertip photoplethysmogram signals. Current Cardiol. Rev. 8, 14–25 (2012).
[4] Premkumar, S. & Hemanth, D. J. in Informatics Vol. 9, 57 (MDPI, 2022).
[5] Elgendi, M. et al. The use of photoplethysmography for assessing hypertension. NPJ Digit. Med. 2, 60 (2019).
[7] Dasari, A., Prakash, S. K. A., Jeni, L. A. & Tucker, C. S. Evaluation of biases in remote photoplethysmography methods. NPJ Digit. Med. 4, 91 (2021).
[8] Schrumpf, F., Frenzel, P., Aust, C., Osterhoff, G. & Fuchs, M. Assessment of deep learning based blood pressure prediction from ppg and rPPG signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3820–3830 (2021).
[9] Park, Y.-J., Lee, J.-M. & Kwon, S.-H. Association of the second derivative of photoplethysmogram with age, hemodynamic, autonomic, adiposity, and emotional factors. Medicine 98, e18091 (2019).
Hand-Object Interactions over time (4D HOI) are animation sequences of human hands interacting with objects; the interaction differs based on the object surface, the object trajectory, and the initial relative hand-object poses. Inpainting/completion of these sequences is important for generating animations from a few key frames.
This seminar topic aims to explore the different types of conditional completion models that could be used to generate 4D HOI. Some works in the literature cover inpainting/completion of full human-object interactions, such as Diller et al. [1]. Other works, such as Shimada et al. [2], use object weight to condition the interaction. Finally, some other works focus more on the object surface and use text to condition the models [3].
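As a purely illustrative sketch of conditional sequence completion in this spirit (not the models of [1]-[3]; the pose dimensionality and conditioning scheme are assumptions), a recurrent network can fill in masked hand-pose frames of a key-framed sequence conditioned on a scalar object property such as mass:

```python
# Minimal sketch: complete masked hand-pose frames conditioned on an object property.
import torch
import torch.nn as nn

class ConditionalCompleter(nn.Module):
    def __init__(self, pose_dim: int = 21 * 3, cond_dim: int = 1, hidden: int = 128):
        super().__init__()
        # Input per frame: (possibly masked) pose + keyframe mask + condition.
        self.rnn = nn.GRU(pose_dim + 1 + cond_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, poses, keyframe_mask, condition):
        # poses: (B, T, D); keyframe_mask: (B, T, 1) with 1 at known frames; condition: (B, 1)
        cond = condition.unsqueeze(1).expand(-1, poses.shape[1], -1)
        x = torch.cat([poses * keyframe_mask, keyframe_mask, cond], dim=-1)
        out, _ = self.rnn(x)
        return self.head(out)                     # completed pose for every frame

# Usage: complete a 60-frame sequence from a few keyframes, conditioned on object mass.
poses = torch.randn(2, 60, 63)
mask = torch.zeros(2, 60, 1); mask[:, ::20] = 1.0; mask[:, -1] = 1.0
completed = ConditionalCompleter()(poses, mask, condition=torch.tensor([[0.5], [2.0]]))
print(completed.shape)  # torch.Size([2, 60, 63])
```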
During the seminar, your task is to survey the literature and compare the different types of conditioning modalities (i.e., text/weight/shape of the object) as well as the completion methodology (diffusion models, VAEs, ...). Finally, you should compare the different papers conditioned on similar modalities and explain the advantages and disadvantages of each method, especially from the point of view of inpainting.
Supervision: Marsil Zakour (marsil.zakour@tum.de)
References:
[1] Diller, C., & Dai, A. (2024). CG-HOI: Contact-Guided 3D Human-Object Interaction Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 19888-19901).
[2] Shimada, S., Mueller, F., Bednarik, J., Doosti, B., Bickel, B., Tang, D., Golyanik, V., Taylor, J., Theobalt, C., & Beeler, T. (2024). Macs: Mass conditioned 3d hand and object motion synthesis. In 2024 International Conference on 3D Vision (3DV) (pp. 1082–1091).
[3] Cha, J., Kim, J., Yoon, J., & Baek, S. (2024). Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1577-1585).