Coding Theory and Algorithms for DNA-based Data Storage

ISIT2024 Satellite Workshop

Schedule

Time	Activity
08:30	Breakfast and coffee
08:55 - 09:00	Opening remarks
09:00 - 09:40	Keynote Robert Grass (ETH Zurich) - In-depth analysis of the error sources of DNA data storage routines: challenges for error correction codes
09:40 - 10:10	Coffee break – with ice breaker activity
10:10 - 12:00	Plenary talks (25 minutes) Olgica Milenkovic (University of Illinois Urbana-Champaign) - DNA Tails for Molecular Flash Memory, Watermarking and Metadata Storage Daniel Bedau (WD Research) - Solid State Nanopores – A Channel with Unique Properties - short break - Zohar Yakhini (Technion / Reichman University) - Information and data science challenges in using synthetic nucleic acids James Diggans (Twist Bioscience) - Apportioning error: all current sequencing options are bad
12:00 - 12:15	Partition into groups for open problem discussion
12:15 - 14:00	Lunch and poster session
14:00 - 15:20	Plenary talks (15 minutes) Daniella Bar-Lev (Technion) - Universal Framework for Parametric Constrained Coding Natalio Krasnogor (Newcastle University) - Error Correcting Strategies for Storing Short (<1KB) DNA Barcodes In Vivo - short break - Adrian Vidal (Monash University) - Learning the Hidden Markov model of the nanopore sequencing channel Hsin-Po Wang (UC Berkeley) - Geno-Weaving: Low-Complexity Capacity-Achieving Data Storage on DNA
15:20 - 15:35	Coffee break
15:35 - 16:35	Think tanks: discussion of open problems in groups
16:35 - 17:15	Presentation of summary
17:15 - 19:00	Concluding remarks
19:00	Dinner

Talks

Robert Grass (ETH Zurich)

In-depth analysis of the error sources of DNA data storage routines: challenges for error correction codes

DNA data storage is a promising technology for the storage of large sets of digital data over long time horizons, and for the imprinting of digital information into products. It has been shown that error correction codes are instrumental in enabling robust storage of data in DNA, and in the last years many codes have been developed for this task. However, several important challenges remain, which have to deal with highly corrupted DNA sequences, for which existing codes are inefficient. These scenarios encompass both DNA generated by low-cost synthesis techniques, as well as DNA pools after prolonged storage and decay. Both of these aspects are important for the advancement of DNA data storage routines: Currently applied DNA synthesis techniques are significantly too expensive, and long storage horizons are one of the promised advantages of DNA as a data storage technology. I will be presenting the error profiles and probabilities of our currently utilised data storage pipelines together with a chemical description of the underlying problems. In conjunction with this presentation and meeting we will be making well-defined encoding/decoding challenges publicly available, that are aimed at addressing above described scenarios.

Olgica Milenkovic (University of Illinois Urbana-Champaign)

DNA Tails for Molecular Flash Memory, Watermarking and Metadata Storage

DNA-based data storage has emerged as a strong candidate for massive archival data storage due to its durability, exceptional storage density and versatility of content duplication techniques. Still, it is not widely used in practice because of the very high cost of writing in the form of DNA synthesis. To mitigate this cost issue, DNA Punchcard systems were proposed and implemented to work with native (bacterial) DNA, but with the drawback of reducing the storage capacity of the system. We will discuss an "upgrade" of DNA Punchcards, termed DNA Tails, where the punches in double-stranded DNA are augmented by enzymatically synthesized single-stranded DNA tails whose lengths encode nonbinary symbols. In particular, we will describe how the length of the tails can be controlled via staggered nicking and labeling, and how the approach lends itself for developing the first Molecular Flash Memory and Physical DNA Watermarking. We will conclude the presentation with a review of several relevant coding theory problems.

This is a join work with Chao Pan, Kasra Tabatabei, Jin Sima, Alvaro Hernandez and Charles Schroeder.

Daniel Bedau (WD Research)

Solid State Nanopores – A Channel with Unique Properties

Nanopores are one of the prime candidates for reading DNA in storage applications: The pore itself is tiny, takes up only a few nanometers of space, can be very fast, and no chemical cycling is required. Nanopores can be made using normal semiconductor processing technology in large quantities, and there is a clear path towards VLSI-pores. It is thus tempting to think of pores the same way as we do of any other electronic component, but that is far from the truth: nanopores are still part of a complex electrochemical system with many unique properties.

The goal of this talk is to give a general introduction to solid state nanopores compared to other state-of-the-art sequencing technologies and cover the whole process from pore fabrication to measurements from the view of a coding channel and systems integration viewpoint. Starting from a brief overview of the process integration required for pore fabrication, I intend to discuss the constraints that these current lithographic processes impose onto the overall system design. I will then discuss the physical layer that translates the motion of a strand of DNA into a detectable electrical signal and also cover the physical processes that lead to a signal and explain the distortion and intersymbol interference caused by the geometric constraints of the system and the random motion of the DNA strand, as well as the signal distortion caused by parasitic effects. Unfortunately, one of the major challenges of solid state nanopores is the low signal-to-noise-ratio, some of which is again linked to physical constraints of the system. I will explain the sources of this noise and discuss possible ways of mitigation through pore design and signal filtering and will conclude with a discussions of challenges and next steps.

Zohar Yakhini (Technion / Reichman University)

Information and data science challenges in using synthetic nucleic acids

Synthetic DNA opens up the development of new approaches to studying biology as well as new applications such as data storage in DNA. We will review some information theory and data science challenges and directions related to this development. In particular we will discuss the use of Bayes classification in measuring CRISPR off target activity and the use of composite nucleic acid mixtures to drive more efficient data coding into DNA.

James Diggans (Twist Bioscience)

Apportioning error: all current sequencing options are bad

Codecs for DNA data storage would ideally make use of the distributions of errors in both synthesis and sequencing. However, error models vary wildly among current sequencing technologies (and even among sequencer models within a single technology vendor). Assumptions around sequencing error must be baked in at the time of encoding and synthesis which may be years or decades removed from sequencing. With available technologies, using additional sequencing depth to reduce error is expensive, highly dependent on the quality of library preparation and difficult to price appropriately. This talk will consider these challenges and options available to address them.

Daniella Bar-Lev (Technion)

Universal Framework for Parametric Constrained Coding

Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While channels with fixed constraints (e.g., a fixed set of substrings may not appear in transmitted messages) have a general optimal solution, there is an increasing demand for supporting parametric constraints that are dependent on the message length and portray some property (e.g., no log(n) consecutive zeros). Several works have tackled such parametric constraints through iterative algorithms, yet they require complex constructions specific to each constraint to guarantee convergence through a monotonic progression.

In this work, we present a universal framework for tackling any parametric constrained-channel problem through a novel, simple iterative algorithm, particularly applicable to DNA storage systems. By reducing the execution of this iterative algorithm to an acyclic graph traversal, we show a surprising result that ensures convergence with efficient average time complexity, even without requiring any monotonic progression. To demonstrate the effectiveness of this universal framework in the context of DNA-based data storage, we apply it to both local and global channel constraints, such as minimal periodicity, repeat-free, and global almost-balanced encoding. These applications showcase state-of-the-art results in DNA storage construction, emphasizing the simplicity and versatility of our approach. Overall, our proposed framework enables the generation of state-of-the-art constructions with significant ease and facilitates the simultaneous integration of multiple constraints for the first time, while also providing advancements applicable across various constrained coding scenarios.

Natalio Krasnogor (Newcastle University)

Error Correcting Strategies for Storing Short (<1KB) DNA Barcodes In Vivo

In this work we look at DNA storage from a somewhat different application perspective. Here, relatively short (<1KB) DNA barcodes are inserted into the genomes of cells for two purposes: 1) to physically link the biological sample to its digital footprint within a specialised version control system and 2) to assert biological & digital rights (IP) over a cell line [1,2]. Here, one is concerned with errors such as those emerging from DNA sequencing, amplification, synthesis or natural evolution/mutation but also with (a) constraints introduced by the genomes of the different strains one may want to asserts IP rights on (e.g. GC content, genome architecture, codon usage differences, etc) as well as (b) adversarial attacks (e.g. barcode removal, barcode tampering, etc). To deal with species specific constraints our coding scheme consults a “taboo list” of features that any generated DNA barcode (or subsequence) must avoid. To mitigate for errors introduced by an adversary, we partition the barcode into several sections that could be distributed into (a) different genomic locations within a single cell and/or (b) different cells from an initially identical monoclonal culture. We use a concatenated scheme, with an inner code for edit errors, and an outer locally repairable code (LRC). This enables us to recover a sufficiently accurate missing subsection of a barcode to ascertain IP rights. The LRCs can operate from data stored and partially recovered from within a single genome or from distributed parts across multiple genomes of a barcoded population.

[1] Tellechea-Luzardo, J., Winterhalter, C., Widera, P., Kozyra, J., de Lorenzo, V., & Krasnogor, N. (2020). Linking Engineered Cells to Their Digital Twins: A Version Control System for Strain Engineering. ACS Synthetic Biology, 9(3), 536–545. doi.org/10.1021/acssynbio.9b00400

[2] Tellechea-Luzardo, J., Hobbs, L., Velázquez, E., Pelechova, L., Woods, S., de Lorenzo, V., & Krasnogor, N. (2022). Versioning biological cells for trustworthy cell engineering. Nature Communications, 13(1). doi.org/10.1038/s41467-022-28350-4

Adrian Vidal (Monash University)

Learning the Hidden Markov model of the nanopore sequencing channel

In nanopore sequencers, DNA molecules are guided through a small constriction, and a current signal (squiggle) results from the variable blockage of the pore. The HMM is a relatively simple parametric model that can be used to characterize the main components of the nanopore channel. The state of the channel (with memory) is defined by the k nucleotides (k-mer) that occupy the nanopore and determine the ionic current levels. The state space $\Sigma$ contains $4^k$ possible nanopore states. Measurement noise and residual modeling errors jointly affect the current levels with additive noise. The motor protein controls the dwell times of each state.

Given a sequence of $m$ nucleotides as input, the nanopore channel will transition through a sequence of states ${s_i}_1^m$ of a Markov chain and output a noisy piecewise-constant discrete-time current signal (squiggle) under the following assumptions: (1) the random sequence of current levels $x_i(S)$ for all $S\in\Sigma$ follow independent Gaussian distributions with stationary means $\mu(S)$ (i.e. independent of the time index $i=1…m$) and standard deviations $\sigma_i(S)$ for all $S\in\Sigma$, and (2) the random sequence of dwell times $\delta_i(S)$ for all $S \in \Sigma$ follow independent geometric distributions Geom(p_i(S)) with E{\delta(S)}=1/p_i(S) being the mean dwell times.

We present a method based on the Baum-Welch algorithm that estimates the parameters $\mu(s_i), \sigma_i(s_i),$ and $p_i(s_i)$ for $i=1…m$ of the HMM by learning from a sample collection of squiggles produced by sequencing with ONT MinIon a known sequence of m nucleotides. The entire HMM of the nanopore channel can be learned by sequencing DNA strands that visit the entire state space. A constrained subset of the state space can be used for barcodes and primers in DNA strands.

Hsin-Po Wang (UC Berkeley)

Geno-Weaving: Low-Complexity Capacity-Achieving Data Storage on DNA

In one possible implementation of data storage using DNA, multiple strands of DNA are stored in one liquid container so that, in the future, they can be read by an array of DNA readers in parallel. These readers will sample the strands with replacement to create a random number of noisy reads for each strand. Thus one essential component of such a data storage system is to reconstruct the data out of these unsorted and noisy reads.

It is known that if a single DNA read can be modeled by a substitution channel $W$, then the overall capacity can be expressed by that of the \emph{Poissonization} of $W$. In this draft, we put a rateless code along each strand to encode its index and put a capacity-achieving code at the same position across all strands to protect data. We show that, in a parameter regime where DNA coding is interesting, our low-complexity construction achieves DNA's capacity.

Posters

Adir Kobovich - M-DAB: An Input-Distribution Optimization Algorithm for Composite DNA Storage by the Multinomial Channel
Omer Sabary - Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems
Inbal Preuss - Efficient DNA-based data storage using shortmer combinatorial encoding
Anisha Banerjee - Error-Correcting Codes for Nanopore Sequencing
Jan Voges - PEARL-DNA — Revolutionizing Sustainable DNA-Based Data Storage
Chen Wang - Improving the Singleton-type Upper Bounds for Non-Linear Deletion Correcting Codes
Avital Boruchovsky - DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems
Hadas Abraham - Accurate Accelerated Clustering for DNA Storage
Oriel Limor - SOLQC: Synthetic Oligo Library Quality Control Tool
Frederik Walter - Towards a Cryptographic Framework for DNA Data Storage
Tomer Cohen - Optimizing the Decoding Probability and Coverage Ratio of Composite DNA
Asad Raza Usmani - Modelling for Efficient Scientific Data Storage Using Simple Graphs in DNA
Alex El-Shaikh - The five-minute rule for DNA data storage
Vivian (Paraskevi) Papadopoulou - On Exact Sequence Reconstruction Over a Stochastic t-Error Channel
Pham Van Long Phuoc - Group Testing for Spatial Genomics
Melpomeni Dimopoulou - An overview of JPEG DNA standard