A Normalization Example for Access
I see it all too often. Many companies keep track of information in Excel spreadsheets when Access would be more effective, simply because “it is easier to use Excel” or they “don’t know Access.” Unfortunately, keeping a list of things in Excel can over time, result in redundant information, errors, and endless headaches. If only end users would invest a little time in learning something about databases they could develop a much better solution.
Basically the problem comes from “data redundancy.” In the Excel spreadsheet below, we can see that the entries under PROJ_NAME repeat as do the employee names (EMP_NUMBER, EMP_NAME). Secondly, the fields JOB_CLASS and CHG_HOUR repeat as employees are assigned to different projects (click to enlarge):
What are some potential problems with this spreadsheet? In addition to a great deal of unnecessary data entry, anything that is entered more than once opens the possibility for a mistake. For example, the JOB_CLASS “Programmer” has an hourly charge of $35.75. If we have to enter this data 10 times because we have assigned 10 programmers to projects, that increases our chances of a data entry error someplace--like entering $350.75, or $53.75. Also what if we want to assign someone to another project? For example, if we wanted to assign Alice K. Johnson (who is already assigned to “Evergreen” and “Rolling Tide”) to “Starflight,” we would have to re-enter all of her information again. What if she quits? We will have to go back in and delete all of her assignments and then assign someone else! What looks “easy” on the surface gets messy very quickly.
The solution lies in “normalizing” the data from our Excel spreadsheet, then creating a database in Access that will allow us to store everything one time only. Normalization was first described by E.F. Codd in 1970 and much has been written about the process since then. Typically there are 3 stages called "normal forms":
First (1NF) - Remove all repeating data or groups. Put them in separate tables.
Second (2NF) - Eliminate any partial dependencies. If anything still might repeat, break it out to its own table.
Third (3NF) - Inspect each table to ensure that all data in the table is dependent on the primary key.
Although higher levels of normalization have been described, for practical purposes, at "3NF" we are good to go.
Our Excel spreadsheet is "un-normalized" in that it has repeating groups--the person tracking the projects simply typed in everything as it was--a simple list if you will. We would like to create an Access database that is in the “third normal form” or 3NF. In third normal form, each piece of information in a table is dependent on the “primary key” that defines it. An employee’s name or address or salary, for example, could be found by looking up the “EMP_NUM.” In this example, the Projects repeat, the Employees repeat, and the Jobs repeat. Using Access, we create a separate table for each so that the information need only be entered once.
Creating the necessary tables in Access (including an “Assignment” table) and defining the appropriate "key fields" we end up with the database structure shown below (click to enlarge):
Now we will be able to enter new projects and employees, make assignments, and update information on employees and jobs without the risk of mistakes or “anomalies.” If an employee resigns, they can be deleted from all projects automatically. If we change the rate of pay for Programmers or Systems Analysts, the new rate will be reflected in all assignments. The database can grow quite large and still retain the original structure. With Access, we can also do queries and reports.
For a detailed description of Normalization, see
Basically the problem comes from “data redundancy.” In the Excel spreadsheet below, we can see that the entries under PROJ_NAME repeat as do the employee names (EMP_NUMBER, EMP_NAME). Secondly, the fields JOB_CLASS and CHG_HOUR repeat as employees are assigned to different projects (click to enlarge):
What are some potential problems with this spreadsheet? In addition to a great deal of unnecessary data entry, anything that is entered more than once opens the possibility for a mistake. For example, the JOB_CLASS “Programmer” has an hourly charge of $35.75. If we have to enter this data 10 times because we have assigned 10 programmers to projects, that increases our chances of a data entry error someplace--like entering $350.75, or $53.75. Also what if we want to assign someone to another project? For example, if we wanted to assign Alice K. Johnson (who is already assigned to “Evergreen” and “Rolling Tide”) to “Starflight,” we would have to re-enter all of her information again. What if she quits? We will have to go back in and delete all of her assignments and then assign someone else! What looks “easy” on the surface gets messy very quickly.
The solution lies in “normalizing” the data from our Excel spreadsheet, then creating a database in Access that will allow us to store everything one time only. Normalization was first described by E.F. Codd in 1970 and much has been written about the process since then. Typically there are 3 stages called "normal forms":
First (1NF) - Remove all repeating data or groups. Put them in separate tables.
Second (2NF) - Eliminate any partial dependencies. If anything still might repeat, break it out to its own table.
Third (3NF) - Inspect each table to ensure that all data in the table is dependent on the primary key.
Although higher levels of normalization have been described, for practical purposes, at "3NF" we are good to go.
Our Excel spreadsheet is "un-normalized" in that it has repeating groups--the person tracking the projects simply typed in everything as it was--a simple list if you will. We would like to create an Access database that is in the “third normal form” or 3NF. In third normal form, each piece of information in a table is dependent on the “primary key” that defines it. An employee’s name or address or salary, for example, could be found by looking up the “EMP_NUM.” In this example, the Projects repeat, the Employees repeat, and the Jobs repeat. Using Access, we create a separate table for each so that the information need only be entered once.
Creating the necessary tables in Access (including an “Assignment” table) and defining the appropriate "key fields" we end up with the database structure shown below (click to enlarge):
Now we will be able to enter new projects and employees, make assignments, and update information on employees and jobs without the risk of mistakes or “anomalies.” If an employee resigns, they can be deleted from all projects automatically. If we change the rate of pay for Programmers or Systems Analysts, the new rate will be reflected in all assignments. The database can grow quite large and still retain the original structure. With Access, we can also do queries and reports.
For a detailed description of Normalization, see
Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.
Comments