Friday, April 16, 2010

What is Normalization

(Q) What is Normalization? What are the Different Types of Normalization?
Note: A regular .NET programmer working on projects often stumbles on this question, which is but obvious. The bad part is sometimes the interviewer can take this as a very basic question to be answered and it can be a turning point for the interview. So let's cram it.

It is set of rules that have been established to aid in the design of tables that are meant to be connected through relationships. This set of rules is known as Normalization.

Benefits of Normalizing your database include:

•Avoiding repetitive entries
•Reducing required storage space
•Preventing the need to restructure existing tables to accommodate new data
•Increased speed and flexibility of queries, sorts, and summaries
Note: During an interview, people expect to answer a maximum of three normal forms and that's what is expected practically. Actually you can normalize database to fifth normal form. But believe this book, answering three normal forms will put you in a decent shape during an interview.

The three normal forms as follows:

First Normal Form
For a table to be in first normal form, data must be broken up into the smallest units possible. In addition to breaking data up into the smallest meaningful values, tables in first normal form should not contain repetitions groups of fields.


Figure 1.8: Repeating groups example
In the above example, city1 and city2 are repeating. In order for these tables to be in First normal form, you have to modify the table structure as follows. Also note that the Customer Name is now broken down to first name and last name (First normal form data should be broken down to the smallest unit).


Figure 1.9: Customer table normalized to first normal form
Second Normal Form
The second normal form states that each field in a multiple field primary key table must be directly related to the entire primary key. In other words, each non-key field should be a fact about all the fields in the primary key.

In the above table of customer, city is not linked to any primary field.


Figure 1.10: Normalized customer table.

Figure 1.11: City is now shifted to a different master table.
That takes our database to a second normal form.

Third Normal Form
A non-key field should not depend on another Non-key field. The field Total is dependent on Unit price and qty.


Figure 1.12: Fill third normal form
So now the Total field is removed and is the multiplication of Unit price * Qty.

(Q) What is Denormalization?
Denormalization is the process of putting one fact in numerous places (it is vice-versa of normalization). Only one valid reason exists for denormalizing a relational design - to enhance performance. The sacrifice to performance is that you increase redundancy in a database.

(DB) Can you Explain Fourth Normal Form?
Note: Whenever the interviewer is trying to go above the third normal form, there can be two reasons, ego or to fail you. Three normal forms are really enough, practically anything more than that is an overdose.

In fourth normal form, it should not contain two or more independent multi-valued facts about an entity and it should satisfy �Third Normal form�.

So let us try to see what multi-valued facts are. If there are two or more many-to-many relationship in one entity and they tend to come to one place, it is termed as �multi-valued facts�.


Figure 1.13: Multi-valued facts
In the above table, you can see that there are two many-to-many relationships between Supplier / Product and �Supplier / Location (or in short multi-valued facts). In order for the above example to satisfy the fourth normal form, both the many-to-many relationships should go in different tables.


Figure 1.14: Normalized to Fourth Normal form.
(DB) Can you Explain Fifth Normal Form?
Note: UUUHHH if you get this question after joining the company, do ask him if he himself really uses it?

Fifth normal form deals with reconstructing information from smaller pieces of information. These smaller pieces of information can be maintained with less redundancy.

Example: Dealers sell Product which can be manufactured by various Companies. Dealers in order to sell the Product should be registered with the Company. So these three entities have a mutual relationship within them.


Figure 1.15: Not in Fifth Normal Form.
The above table shows some sample data. If you observe closely, a single record is created using lot of small information. For instance: JM Associate can sell sweets under the following two conditions:

•JM Associate should be an authorized dealer of Cadbury
•Sweets should be manufactured by Cadbury company
These two smaller bits of information form one record of the above given table. So in order for the above information to be �Fifth Normal Form� all the smaller information should be in three different places. Below is the complete fifth normal form of the database.


Figure 1.16: Complete Fifth Normal Form
(DB) What is the Difference between Fourth and Fifth normal form?
Note: There is a huge similarity between Fourth and Fifth normal form, i.e. they address the problem of �Multi-Valued facts�.

�Fifth normal form� multi-valued facts are interlinked and �Fourth normal form� values are independent. For instance in the above two questions Supplier/Product and Supplier/Location are not linked. While in fifth form, the Dealer/Product/Companies are completely linked.

(DB) Have you Heard about Sixth Normal Form?
Note: Arrrrggghhh yes there exists a sixth normal form also. But note guys you can skip this statement. Just in case you want to impress the interviewer...

If you want a relational system in conjunction with time, you use sixth normal form. At this moment SQL Server does not support it directly.

(Q) What is Extent and Page?
Twist: What is the relationship between Extent and Page?

Extent is a basic unit of storage to provide space for tables. Every extent has a number of data pages. As new records are inserted new data, pages are allocated. There are eight data pages in an extent. So as soon as the eight pages are consumed, it allocates a new extent with data pages.

While extent is basic unit storage from a database point of view, page is a unit of allocation within extent.

(DB) What are the Different Sections in Page?
Page has three important sections:

•Page header
•Actual data i.e. Data row
•Row pointers or Row offset
Page header has information like timestamp, next page number, previous page number etc.

Data rows are where your actual row data is stored. For every data row, there is a row offset which points to that data row.


Figure 1.17: General view of a Extent
(Q) What are Page Splits?
Pages are contained in extent. Every extent will have around eight data pages. But all the eight data pages are not created at once; they are created depending on data demand. So when a page becomes full it creates a new page, this process is called as �Page Split�.

(Q) In which Files does SQL Server Actually Store Data?
Any SQL Server database is associated with two kinds of files: *.mdf and *.ldf. *.mdf files are actual physical database files where your data is stored finally. *.ldf (LOG) files are actually data, which is recorded from the last time data was committed in the database.


Figure 1.18: MDF and LDF files.
(Q) What is Collation in SQL Server?
Collation refers to a set of rules that determine how data is sorted and compared. Character data is sorted using rules that define the correct character sequence, with options for specifying case-sensitivity, accent marks, kana character types, and character width.


Figure 1.19: Collation according to language
Note: Different languages will have different sort orders.

Case Sensitivity
If A and a, B and b, etc. are treated in the same way, then it is case-insensitive. A computer treats A and a differently because it uses ASCII code to differentiate the input. The ASCII value of A is 65, while a is 97. The ASCII value of B is 66 and b is 98.

Accent Sensitivity
If a and A, o and O are treated in the same way, then it is accent-insensitive. A computer treats a and A differently because it uses ASCII code for differentiating the input. The ASCII value of a is 97 and A 225. The ASCII value of o is 111 and O is 243.

Kana Sensitivity
When Japanese kana characters Hiragana and Katakana are treated differently, it is called Kana sensitive.

Width Sensitivity
When a single-byte character (half-width) and the same character when represented as a double-byte character (full-width) are treated differently then it is width sensitive.

No comments:

Post a Comment