Natural scienceBioinformaticsBioinformatics algorithmsAssembly

De Bruijn Graph

5 minutes read

The field of mathematics known as Graph theory studies the properties and applications of graphs. There are several ways to use Graph theory in many different branches of research, including bioinformatics. This topic will go over the de Bruijn graph, which is an important part of de novo genome assembly.

De-what graph?

A de Bruijn (pronounced as de Brown or de Broin) graph is a directed graph, which is named after the Dutch mathematician Nicolaas Govert de Bruijn, who described this type of graph in 1946. The vertices in this graph correspond to all the possible sequences of m symbols with length n, where the symbols can be repeated. This means that the graph has mnm^n vertices, each of them representing a unique sequence of m symbols from the set of those symbols (S). An edge is drawn from one vertex to another if the two vertices share a common suffix and prefix.

For example, let's have a set S={A,B}S=\{A,B\} and n = 3, meaning that each vertex is a sequence of length 3 which consists of symbols from the set S. If the vertices represent, let's say, the sequences "ABB" and "BBB", there would be an edge between them because the suffix of "ABB" is "BB" and the prefix of "BBB" is "BB". In this case, we would have 8 vertices in total.

All the possible sequences of A and B symbols with length 3 constitute the graph

Another crucial aspect of the de Bruijn graph is that it is an Eulerian graph, which implies it has an Eulerian cycle. An Eulerian cycle implies that there exists an Eulerian path that visits each edge exactly once and returns to the starting point without visiting any vertex twice.

How to check fast if the graph is Eulerian? Keep in mind, that every vertex in an Eulerian cycle has to have the same amount of edges coming in and out. For example, the following graph is NOT Eulerian — vertex 0 has zero incoming edges:

Non Eulerian graph: vertex 0 has zero incoming edges

Math and graph theory in particular is all fun and games, but how can we apply the de Bruijn graph to the genome assembly?

Applications in Bioinformatics

In bioinformatics, a de Bruijn graph is constructed from a set of k-mers, which are short substrings of a longer string. The vertices in the graph then represent these k-mers, and the edges represent the relationships between the k-mers. In bioinformatics, the ultimate purpose of the de Bruijn graph is to find an Eulerian path in a graph that represents the final sequence.

One of the key properties of a de Bruijn graph, which makes it an amazing algorithm for genome assembly, is that it is a compact representation of the original string from which the k-mers were derived. This means that the graph can be used to reconstruct the original string by following the edges from one k-mer to the next. In this way, a de Bruijn graph can be thought of as a kind of "map" of the original string.

Let's break it down!

Imagine we have small pieces of DNA, 3 nucleotides long each:

DNA fragments with length 3

These pieces kinda remind us of the vertices of a de Bruin graph we discussed earlier, right? Now, Let's break each piece into two parts — "prefix" and "suffix":

Prefix and suffix are placed into separate nodes

Let's say that one "prefix" or "suffix" is a vertex — then we can merge the same vertices together, and the edge between them is the broken-down piece of 3 nucleotides:

Merging vertices to create a chain

The gluing of vertices together continues...

Connecting vertices, the edges represent oligos with length 3

And here we are, having merged everything together. Is it the end?

Chain is remodeled by merging similar vertices

Not really! We can still spot similar vertices, meaning we have to stick them together into one:

The resulting graph: all vertices are connected

And a bit more vertices to merge:

Chain is remodeled by merging similar vertices

The graph becomes complex and branched

And now, we have our graph. Using this graph, we can find a sequence by simply following the edges of a graph.
However, we can find not one, but TWO Eulerian paths in the graph! Which one is the correct one? We will discuss it in the Advanced de Bruijn graph topic.

Existence of two Eulerian paths

Conclusion

A de Bruijn graph is a type of graph that is commonly used in the field of bioinformatics. It is a directed graph constructed from a set of k-mers, which are short substrings of a longer string. One of the key properties of a de Bruijn graph is that it can be used to reconstruct the original string from which the k-mers were derived. This makes it a useful tool in the process of genome assembly.

2 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo