M.Sc. Frederik Walter
Technical University of Munich
Associate Professorship of Coding and Cryptography (Prof. Wachter-Zeh)
Postal address
Theresienstr. 90
80333 München
- Phone: +49 (89) 289 - 23492
- Room: 0104.04.403
- frederik.walter@tum.de
Biography
Frederik is currently a doctoral candidate in the Associate Professorship of Coding and Cryptography (Prof. Wachter-Zeh) with the Institute for Communications Engineering, School of Computation, Information and Technology, TUM, Munich, Germany.
Before joining the institute, he was part of a medical technology startup that was funded by the eXist program of the German Ministry of Economic Affairs. From 2017 to 2020, he worked in a management consultancy, advising clients on complex strategic transformations.
Frederik received his M.Sc degree with high distinction (1.0) in Electrical Engineering and Information Technology from TUM in 2016. During this time, he spent a semester at the National University of Singapore. He received his B.Sc. also with high distinction (1.0) in Electrical Engineering from Ulm University in 2014.
Research Interests
Frederik's research interests focus on coding and information theory applied to the area of DNA data storage within the DiDAX project. In particular, he studies improvements in the synthesis of DNA strands.
One field is the use of composite DNA as a tool to increase the alphabet size and, therefore, the information density. Instead of synthesizing only one nucleotide in each cycle, mixing several nucleotides and encoding information in the occurring ratio is possible. Therefore, many interesting coding-related challenges arise when the strands are exposed to substitution, deletion, and/or insertion errors.
Another area in his research is the efficient synthesis for gene expression analysis. He applies coding theoretic concepts to minimize the synthesis time while ensuring all genes are uniquely represented on a testing array.
Teaching
Teaching Assistant at TUM
Teaching Assistant at Ulm University
- Fundamentals of Electrical Engineering I
- Fundamentals of Electrical Engineering II
- Signals and Systems
Available Theses
Theses in Progress
Marker Codes for Strand Breaks in Composite DNA
Description
DNA Data Storage
Data storage on DNA molecules is a promising approach for archiving massive data.
In classical DNA storage systems, binary information is encoded into sequences consisting of the four DNA bases {A, C, G, T}. The encoded sequences are used to generate DNA molecules called strands using the biochemical process of DNA synthesis. The synthesized strands are stored together in a tube. To retrieve the binary information, the strand must be read via DNA sequencing and decoded back into the binary representation.
The synthesis and the sequencing procedures are error-prone, and with the natural degradation of DNA they introduce errors to the DNA strands. To ensure data reliability, the errors have to be corrected by algorithms and error-correcting codes (ECCs).
A 5min video with an overview of DNA storage: https://youtu.be/r8qWc9X4f6k?si=Yzm5sOW-a6VDnBu3
Composite DNA
Recently, to allow higher potential information capacity, [1,2] introduced the DNA composite synthesis method. In this method, the multiple copies created by the standard DNA synthesis method are utilized to create composite DNA symbols, defined by a mixture of DNA bases and their ratios in a specific position of the strands. By defining different mixtures and ratios, the alphabet can be extended to have more than 4 symbols. More formally, a composite DNA symbol in a specific position can be abstracted as a quartet of probabilities {p_A, p_C, p_G, p_T}, in which p_X, 0 ≤ p_X ≤ 1, is the fraction of the base X in {A, C, G, T} in the mixture and p_A+p_C+ p_G+ p_T =1. Thus, to identify composite symbols it is required to sequence multiple reads and then to estimate p_A, p_C, p_G, p_T in each position.
Problem description
ECCs for DNA data storage differ in many aspects from classical error correction codes. In this model, new error type gain relevance, like deletions and insertions which affect the synchronization of the sequences.
Another important error model, especially for long-term storage, is dealing with breaks in the DNA sequence. The problem of identifying fragments caused by a single break is addressed using marker sequences, which help determine if a resulting DNA fragment is a prefix or suffix of the original sequence. Marker sequences are inserted at the beginning and end of DNA strands to locate breaks and reconstruct the original sequence. To prevent marker sequences from appearing within the data, Run-Length Limited (RLL) coding is used, which limits consecutive identical symbols and ensures marker sequences do not occur in the data part. The goal is to reconstruct the original random composite matrix from individual fragments after a single break, restoring the stored information by identifying the position of each fragment and capturing base frequencies to restore the original distribution for each column of the DNA sequence.
References
[1] L. Anavy, I. Vaknin, O. Atar, R. Amit, and Z. Yakhini, “Data storage in DNA with fewer synthesis cycles using composite DNA letters,” Nat Biotechnol, vol. 37, no. 10, pp. 1229–1236, Oct. 2019, doi: 10.1038/s41587-019-0240-x.
[2] Y. Choi et al., “High information capacity DNA-based data storage with augmented encoding characters using degenerate bases,” Sci Rep, vol. 9, no. 1, Art. no. 1, Apr. 2019, doi: 10.1038/s41598-019-43105-w.
[3] I. Shomorony and A. Vahid, “Torn-Paper Coding,” IEEE Transactions on Information Theory, vol. 67, no. 12, pp. 7904–7913, Dec. 2021, doi: 10.1109/TIT.2021.3120920.
[4] D. Bar-Lev, E. Yaakobi, Y. Yehezkeally, and S. Marcovich, “Adversarial Torn-paper Codes,” IEEE Trans. Inform. Theory, pp. 1–1, 2023, doi: 10.1109/TIT.2023.3292895.
[5] W. Zhang, Z. Chen, and Z. Wang, “Limited-Magnitude Error Correction for Probability Vectors in DNA Storage,” in ICC 2022 - IEEE International Conference on Communications, Seoul, Korea, Republic of: IEEE, May 2022, pp. 3460–3465. doi: 10.1109/ICC45855.2022.9838471.
Supervisor:
Coding for Composite DNA
Description
DNA Data Storage
Data storage on DNA molecules is a promising approach for archiving massive data.
In classical DNA storage systems, binary information is encoded into sequences consisting of the four DNA bases {A, C, G, T}. The encoded sequences are used to generate DNA molecules called strands using the biochemical process of DNA synthesis. The synthesized strands are stored together in a tube. To retrieve the binary information, the strand must be read via DNA sequencing and decoded back into the binary representation.
The synthesis and the sequencing procedures are error-prone, and with the natural degradation of DNA they introduce errors to the DNA strands. To ensure data reliability, the errors have to be corrected by algorithms and error-correcting codes (ECCs).
A 5min video with an overview of DNA storage: https://youtu.be/r8qWc9X4f6k?si=Yzm5sOW-a6VDnBu3
Composite DNA
Recently, to allow higher potential information capacity, [1,2] introduced the DNA composite synthesis method. In this method, the multiple copies created by the standard DNA synthesis method are utilized to create composite DNA symbols, defined by a mixture of DNA bases and their ratios in a specific position of the strands. By defining different mixtures and ratios, the alphabet can be extended to have more than 4 symbols. More formally, a composite DNA symbol in a specific position can be abstracted as a quartet of probabilities {p_A, p_C, p_G, p_T}, in which p_X, 0 ≤ p_X ≤ 1, is the fraction of the base X in {A, C, G, T} in the mixture and p_A+p_C+ p_G+ p_T =1. Thus, to identify composite symbols it is required to sequence multiple reads and then to estimate p_A, p_C, p_G, p_T in each position.
Problem description
ECCs for DNA data storage differ in many aspects from classical error correction codes. In this model, new error type gain relevance, like deletions and insertions which affect the synchronization of the sequences. Especially for composite DNA data storage, these error types received only little attention.
The most related work to this problem was recently studied by Zhang et al. in [6]. The authors initiated the study of error-correcting codes for DNA composite. They considered an error model for composite symbols, which assumes that errors occur in at most t symbols, and their magnitude is limited by l. They presented several code constructions as well as bounds for this model. In this thesis, we want to analyse a different way to model the composite synthesis method and studies additional error models. We already have some results for substitution and single deletion errors. This thesis should focus on evaluating more error models in the channel model.
This should only roughly introduce the problem. No need to review all references. If you are interested, please reach out to me, and we can discuss a suitable direction for you.
References
[1] L. Anavy, I. Vaknin, O. Atar, R. Amit, and Z. Yakhini, “Data storage in DNA with fewer synthesis cycles using composite DNA letters,” Nat Biotechnol, vol. 37, no. 10, pp. 1229–1236, Oct. 2019, doi: 10.1038/s41587-019-0240-x.
[2] Y. Choi et al., “High information capacity DNA-based data storage with augmented encoding characters using degenerate bases,” Sci Rep, vol. 9, no. 1, Art. no. 1, Apr. 2019, doi: 10.1038/s41598-019-43105-w.
[3] V. Guruswami and J. Håstad, “Explicit Two-Deletion Codes With Redundancy Matching the Existential Bound,” IEEE Transactions on Information Theory, vol. 67, no. 10, pp. 6384–6394, Oct. 2021, doi: 10.1109/TIT.2021.3069446.
[4] J. Sima, N. Raviv, and J. Bruck, “Two Deletion Correcting Codes From Indicator Vectors,” IEEE Trans. Inform. Theory, vol. 66, no. 4, pp. 2375–2391, Apr. 2020, doi: 10.1109/TIT.2019.2950290.
[5] I. Smagloy, L. Welter, A. Wachter-Zeh, and E. Yaakobi, “Single-Deletion Single-Substitution Correcting Codes,” IEEE Transactions on Information Theory, pp. 1–1, 2023, doi: 10.1109/TIT.2023.3319088.
[6] W. Zhang, Z. Chen, and Z. Wang, “Limited-Magnitude Error Correction for Probability Vectors in DNA Storage,” in ICC 2022 - IEEE International Conference on Communications, Seoul, Korea, Republic of: IEEE, May 2022, pp. 3460–3465. doi: 10.1109/ICC45855.2022.9838471.
Supervisor:
Publications
- Coding for Composite DNA to Correct Substitutions, Strand Losses and Deletions. Summer Doctoral Seminar 2024, 2024 more…
- Towards a Cryptographic Framework for DNA Data Storage. Coding Theory and Algorithms for DNA-based Data Storage - ISIT2024 Satellite Workshop, 2024 more…
- Coding for Synthesis of Composite DNA. Meeting on DNA Data Applications, 2023 more…