Database Normalization
What is database normalization?
Database Normalization is the process of organizing columns and tables in such a way to reduce data redundancy and improve data integrity. It is a way to simplify the design of a database so that everyone and anyone is able to set up a new database to achieve the most optimal structure that we have currently come up with.
(Side Note: Did you know that markdown doesn’t support tables? Hence my super old school ASCII art tables.)
A Non Normalized Database example
Why do we do it?
In our example above, you can see that for each row that represents a Sales Rep, we are also storing the Sales Office they work in, and the phone number for that particular office. Employees 1 and 3 both work in the same office, and in order to represent that with our current database design we had to store that information twice. When we duplicate information we are not only are wasting space in our database but we are also setting ourselves up to run into problems down the line when trying to perform certain actions on our database.
In order to understand why we normalize a database, let’s take a look at some of the things that can go wrong if we don’t.
Update Anomaly
What would happen if the Office Number for the New York office changes? With our current database design, we’d have to update the Office Number for every row where the Office Location is New York. It’d be much more efficient to only have to update the phone number one time, rather than have to update multiple columns in your database.
Insert Anomaly
What would happen if we opened a new office in Chicago and wanted to store that new office location’s phone number? With our current setup, we cannot add a new office location unless we also record a new employee. Every new record needs a “Primary Key” and because we are using “Employee ID” as the primary key, we need a new Employee ID in order to save this new phone number.
Deletion Anomaly
Turns out our employee “Mary Martin” has just been found guilty of embezzling millions from our company and has been fired. Let’s go ahead and delete her from the database. Someone should call our San Francisco office to let them know about Mary. We’ll just look up their number from our database and… Oh no! When we delete the row pertaining to “Mary Martin”, we also deleted our Office Number for the San Francisco office. A normalized database would store these Office Locations and their respective numbers in such a way that when we delete an employee that works at that location, we still have access to that location’s information, even if there are not currently any employees working there.
Searching Difficulties and Horizontal Growth
We can see in our non-normalized database, each employee has to have a new row created in order to store a new customer. This presents a couple problems. First, our current structure only allows each employee to have a max of 3 customers. If an employee had more than 3, we’d have to add an entire new column to our database for each new customer. As the number of customers each employee has grows, our database has to grow horizontally in order to store these new customers. This is not what we want. Furthermore, how would we query our database for a specific customer using SQL? We’d have to set up a query like this: SELECT *
FROM Employees
WHERE Customer1 = 'Google' OR
WHERE Customer2 = 'Google' OR
WHERE Customer3 = 'Google' OR
WHERE Customer4 = 'Google' OR
WHERE Customer5 = 'Google'
But everytime we added a new row for a new customer, we’d also have to modify our search query in order to also search this new row. This isn’t efficient.
1st Normal Form (1NF)
The rules that your database design needs to follow in order to be considered in “First Normal Form” is:
- The table stores information in rows and columns where the one or more columns uniquely identify each row.
- Each column contains atomic values, and there are not repeating groups of columns.
Atomic values are values that cannot be further subdivided. “Google” is an atomic value because it refers to one customer, but a column called “Customers” which contained all the customer for an employee delimited by commas, “Google,Apple,Gerber” , would not be atomic because that one value in the table actually represents three different customers. This extends to another requirement which is that a table should not contain repeating groups of columns, such as the Customer1, Customer2, etc. columns we have in our table.
Let’s go ahead and redesign our database in such a way that we satisfy the rules of 1NF.
We have now created a separate table, Customers, so that there are no longer repeating customer columns. This will make sure our database grows vertically as new customers are added, not horizontally as it did in our non normalized design.
What did this do for us? Well, now every time we want to add a new customer for an employee we simply add a new row to our customer table and use the Employee ID column to associate it to the correct employee. We no longer run into the insertion anomaly we encountered with our original database design when trying to add new customers to the databse.
The deletion anomaly is also no longer an issue for this new database design, at least in terms of Employees being related to customers. We can delete a customer from the database without having to delete the entire employee.
2nd Normal Form (2NF)
The rules that your database design needs to follow in order to be considered in “Second Normal Form” is:
- The table is in 1NF.
- All the non-key columns are dependent on the table’s primary key (which in our example is Employee ID).
We have now almost completely eliminated redundancy. Remember our update anomaly we ran into before? Now if we needed to update the office number for our San Francisco office, we only need to make one update to our table, while before we would have had to update it everywhere it appeared in the 1NF of the customer table.
The point of 2NF is that each table serves a single purpose. Before, our Employees table was not only storing information about an employee, but also information about the contact information for a particular location. By making our database design conform to 2NF, we separated this into two tables, one that stores information strictly about the employee, and another which stores information about an office. We then simply add a foreign key to the Employee table that references which row from the Office table they should correspond to.
3rd Normal Form (3NF)
The rules that your database design needs to follow in order to be considered in “Second Normal Form” is:
- The table is in 2NF.
- For a relation table R, every non-prime attribute of R is non-transitively dependent on every key of R.
Well, what does non-transitively dependent mean? The best definition I could find was “A column’s value relies upon another column through a second intermediate column.”
Let’s formalize this definition by using a 3 column table as our example. For three columns, A, B and the Primary Key. If the value of A relies on the Primary Key, and the value of B relies on A, then it is true that B relies on the Primary Key through A.
Let’s use this new table to explain this further: Let PK = Office ID, A = Zip code and B = Office Location. A, the zip code, relies on the primary key because the zip code is a representation of where that Office is located. B, the office location, relies on A because the zip code points to the city in which the office is located. Because of this, A relies on the Primary Key through B.
In order for this table to conform to 3NF, we need to have the zip code point to the office location in a separate table.
Now, every non-prime attribute of our table is non-transitively dependent on every key for our table.
Let’s look at another example from this site: http://www.1keydata.com/database-normalization/third-normal-form-3nf.php Our original table, “TABLE_BOOK_DETAIL” has both a “Genre ID” and “Genre Type” column. Book ID, which we will refer to as the primary key, is used to determine the Genre ID. Then the Genre ID is used to determine the Genre Type column. Because of this, we can see that the Primary Key determines the Genre Type through Genre ID. This violates our second rule of 3NF, and thus we need to separate out Genre Type into it’s own table which is referenced by the Genre ID.
Conclusions…
Work in progress…
To be added: BCNF 4NF 5NF Why Normalization is not always necessary and when to use it.