Computer scienceSoftware development foundationsSQL and DatabasesDB TheoryRelational DB theory

Normalization

10 minutes read

Normalization is the process of improving the properties of a database by sequentially dividing one table into several tables. The reason for normalization is the presence of anomalies in the unnormalized database. An anomaly is a situation in which there is a complication of data processing and a violation of consistency, i.e. the correctness or adequacy of information storage. In this topic, we will start with a very badly organized database: Children of Employees and work our way together through increasing normalization to a stage where the database is more or less well structured. Stay tuned!

First normal form

Let's start with the first normal form (1NF). The relationship is in 1NF, when all data in its cells is atomic, i.e. simple and non-separable.

Consider an example of an unnormalized relation in the table Employees' Children. Peter has two children, BUT their names are in the same cell, which leads to a violation of 1NF.

ID	Name	Child	Department	Department phone
				Employees' Children
1	Peter	Harry, Ed	Sales	555-55-55
2	Nick	Ann	Sales	555-55-55
3	Bob	Fred	Development	333-33-33

In order for this relationship to be in 1NF, you need to make sure that each value is located in a separate cell, i.e. there is a separate row for each child.

ID	Name	Child	Department	Department phone
				Employees' Children
1	Peter	Harry	Sales	555-55-55
1	Peter	Ed	Sales	555-55-55
2	Nick	Ann	Sales	555-55-55
3	Bob	Fred	Development	333-33-33

Violation of the first normal form complicates data processing. Therefore, in all relational DBMSs, all relations are stored in 1NF.

At the same time, you can notice that we have a duplication of information, the name Peter, and its corresponding ID are duplicated twice. This leads to an anomaly in the data update. When the information about an employee changes, Peter will need to go through the entire table to find all the rows with the information about him.

Note that atomicity is determined by the business logic of the domain and operations on data in that particular database. For example, last name and first name can be stored in one cell if they are considered as a single unit in the subject area.

Second normal form

As you can see above, the relation in 1NF contains data duplication and anomalies. An important concept with 2NF is functional dependence. Functional dependence is more of a semantic concept and is denoted as x->y (read as x functionally defines y, or y is functionally dependent on x). If there is such a dependence between attributes, then this means that for any two identical values of x, exactly the same value of y will correspond to them (i.e., x1->y1 and x1->y2, then y1 = y2). In this case, x is called the dependency determinant.

To further explore the examples, let's define the primary key for the employees' children relation from the previous example. Note that the primary key must uniquely identify each row in the table. In this case, the primary key will be composite and will consist of two attributes id and child.

The relation has a number of functional dependencies:

id->department,
id->name,
name->id,
name->department,
name->department phone,
id->department phone,
department->department phone.

The relation is in 2NF when it is in 1NF, and all non-key attributes are entirely dependent on the key, but aren't a part of it. The dependence on the entire composite key is called complete. If a non-key attribute depends only on a portion of a composite key, then the dependency is incomplete.

In this case, we mean the dependence on the potential key, not the primary one. Potential keys will be discussed further in the context of the BCNF. (Here, the potential key is the primary key itself — id and child, as well as name and child; we assume that employees' names are unique).

Then, since there are dependencies of the form id->department, id->department phone, it can be argued that the attributes department and department phone do not depend on the entire composite key (id and child) but on its part of the id.

To bring the relation to 2NF, we will divide it into two relations, Employees' Children and Employees, agreeing with these dependencies.

ID	Name	Child
		Employees' Children
1	Peter	Harry
1	Peter	Ed
2	Nick	Ann
3	Bob	Fred

ID	Department	Department phone
		Employees
1	Sales	555-55-55
2	Sales	555-55-55
3	Development	333-33-33

In this case, we have reduced the duplication of the data, but there are still some problems. For example, the information about the phone number is duplicated, which also leads to an update anomaly when trying to update the information about the department.

Third normal form

Moving on to 3NF, it is necessary to introduce the concept of transitive dependence. A transitive dependence exists when x->y and y->z, which leads to x->z (say x depends transitively on z).

A relation is in 3NF when it is in 2NF and there is no transitive dependency on non-key attributes. In this case, in Employees relation, we have the following dependencies:

id->department,
id->department phone,
department->department phone.

The non-key attribute phone of the department depends on the non-key attribute, department, which depends on the id i.e. there is a transitive dependence id->department phone.

To bring this relation to 3NF, let's divide it into two relations, Employees and Department, according to the existing dependencies.

ID	Department
	Employees
1	Sales
2	Sales
3	Development

Department	Department phone
	Department
Sales	555-55-55
Development	333-33-33

Thus, we have further improved the properties of our relation.

Boyce-Codd normal form

BCNF appears when it comes to having multiple super keys. In the relational data model a super key is a set of attributes that uniquely identifies each tuple of a relation.

It is different from the primary key, but only that the primary key is the key chosen by the programmer for implementation as a key, therefore any primary key is also potential.

In this case, in our Employees' Children relation, there are two super keys, name and child, or id and child, and therefore there are the following functional dependencies:

id->name,
name->id.

A relation is in BCNF when it is in 3NF and any determinant is the key of any dependence. The key is composed of two attributes id and child, in the dependencies defined above, the id and name attributes, are not the key of the dependence. Therefore, it is required to bring the relation towards BCNF. We get two relations: Children and Employees.

ID	Child
	Children
1	Harry
1	Ed
2	Ann
3	Fred

ID	Name
	Employees
1	Peter
2	Nick
3	Bob

Advantages and disadvantages

In addition to those considered above, there are other normal forms, such as 4NF, 5NF, etc. But in practice, most often the normalization process goes until 3NF.

The reason that the normalization process is not brought to the last normal forms (5NF, 6NF) is that the database structure can grow to huge sizes with a large number of tables, which subsequently makes it difficult to modify the structure and scaling, as well as adding new tables. In addition, it will be extremely difficult to analyze the structure of the database and draw up reports. Although, in spite of this, the normalization process does eliminate a large number of insert, edit and delete anomalies, as well as eliminate data redundancy.

Therefore, at each step of normalization, it is necessary to decide whether to continue this process or stop it.

Denormalization

In addition to normalization, there is a denormalization process, during which you can get the original relations by performing the reverse procedures. It is worth noting that during denormalization information from the original relationship should not be lost and new information, that was not there originally, should not be added.

Summary

Thus, we conclude that with the normalization process we can get rid of a number of anomalies, such as data duplication, which in turn will lead to faster data processing in calculations.

Now you are able to bring your database to a normalized form. Remember, that only you, as a developer, decide at what stage it's better to stop the normalization of the relation to optimize the structure of your database.

68 learners liked this piece of theory. 18 didn't like it. What about you?

Report a typo