The field of mathematics known as Graph theory studies the properties and applications of graphs. There are several ways to use Graph theory in many different branches of research, including bioinformatics. This topic will go over the de Bruijn graph, which is an important part of de novo genome assembly.
De-what graph?
A de Bruijn (pronounced as de Brown or de Broin) graph is a directed graph, which is named after the Dutch mathematician Nicolaas Govert de Bruijn, who described this type of graph in 1946. The vertices in this graph correspond to all the possible sequences of m symbols with length n, where the symbols can be repeated. This means that the graph has vertices, each of them representing a unique sequence of m symbols from the set of those symbols (S). An edge is drawn from one vertex to another if the two vertices share a common suffix and prefix.
For example, let's have a set and n = 3, meaning that each vertex is a sequence of length 3 which consists of symbols from the set S. If the vertices represent, let's say, the sequences "ABB" and "BBB", there would be an edge between them because the suffix of "ABB" is "BB" and the prefix of "BBB" is "BB". In this case, we would have 8 vertices in total.
Another crucial aspect of the de Bruijn graph is that it is an Eulerian graph, which implies it has an Eulerian cycle. An Eulerian cycle implies that there exists an Eulerian path that visits each edge exactly once and returns to the starting point without visiting any vertex twice.
How to check fast if the graph is Eulerian? Keep in mind, that every vertex in an Eulerian cycle has to have the same amount of edges coming in and out. For example, the following graph is NOT Eulerian — vertex 0 has zero incoming edges:
Math and graph theory in particular is all fun and games, but how can we apply the de Bruijn graph to the genome assembly?
Applications in Bioinformatics
In bioinformatics, a de Bruijn graph is constructed from a set of k-mers, which are short substrings of a longer string. The vertices in the graph then represent these k-mers, and the edges represent the relationships between the k-mers. In bioinformatics, the ultimate purpose of the de Bruijn graph is to find an Eulerian path in a graph that represents the final sequence.
One of the key properties of a de Bruijn graph, which makes it an amazing algorithm for genome assembly, is that it is a compact representation of the original string from which the k-mers were derived. This means that the graph can be used to reconstruct the original string by following the edges from one k-mer to the next. In this way, a de Bruijn graph can be thought of as a kind of "map" of the original string.
Let's break it down!
Imagine we have small pieces of DNA, 3 nucleotides long each:
These pieces kinda remind us of the vertices of a de Bruin graph we discussed earlier, right? Now, Let's break each piece into two parts — "prefix" and "suffix":
Let's say that one "prefix" or "suffix" is a vertex — then we can merge the same vertices together, and the edge between them is the broken-down piece of 3 nucleotides:
The gluing of vertices together continues...
And here we are, having merged everything together. Is it the end?
Not really! We can still spot similar vertices, meaning we have to stick them together into one:
And a bit more vertices to merge:
And now, we have our graph. Using this graph, we can find a sequence by simply following the edges of a graph.
However, we can find not one, but TWO Eulerian paths in the graph! Which one is the correct one? We will discuss it in the Advanced de Bruijn graph topic.
Conclusion
A de Bruijn graph is a type of graph that is commonly used in the field of bioinformatics. It is a directed graph constructed from a set of k-mers, which are short substrings of a longer string. One of the key properties of a de Bruijn graph is that it can be used to reconstruct the original string from which the k-mers were derived. This makes it a useful tool in the process of genome assembly.