database's containing hierarchy
In Chapter 1, I mention tables and schemas, noting that a schema is an overall structure that includes tables within it. Tables and schemas are two elements of a relational database’s containment hierarchy. You can break down the containment hierarchy as follows:
- Tables contain columns and rows.
- Schemas contain tables and views.
- Catalogs contain schemas.
The database itself contains catalogs. Sometimes the database is referred to as a cluster.
Keep in mind the following procedures when planning your database:
- Identify all tables.
- Define the columns that each table must contain.
- Give each table a primary key that you can guarantee is unique. (I discuss primary keys in Chapters 4 and 5.)
- Make sure that every table in the database has at least one column in common with one other table in the database. These shared columns serve as logical links that enable you to relate information in one table to the corresponding information in another table.
- Put each table in third normal form (3NF) or better to ensure the preven-
- tion of insertion, deletion, and update anomalies. (I discuss database normalization in Chapter 5.)
After you complete the design on paper and verify that it is sound, you’re ready to transfer the design to the computer by using SQL CREATE commands.
a room with view:
- CREATE TABLE CUSTOMER (
- CustomerID INTEGER NOT NULL,
- FirstName CHARACTER (15),
- LastName CHARACTER (20) NOT NULL,
- Street CHARACTER (25),
- City CHARACTER (20),
- State CHARACTER (2),
- Zipcode CHARACTER (10),
- Phone CHARACTER (13) ) ;
- CREATE VIEW NH_CUST AS
- SELECT FirstName, LastName, Phone
WHERE STATE = ‘NH’;
hierarchy of the database:
data definitio language: DDL
- CREATE: You use the various forms of this command to build the essen-
- tial structures of the database.
- ALTER: You use this command to change structures that you have
- DROP: You apply this command to a table to destroy not only the table’s
- data, but its structure as well.
Data manipulation language: DML
numeric value expressions
- To combine numeric values, use the addition (+), subtraction (-), multiplication (*), and division (/) operators. The following lines are examples of numeric
- value expressions:
string value expression
String value expressions may include the concatenation operator (||).
datetime and interval value expressions
- AT LOCAL
DueDate + INTERVAL ‘7’ DAY
boolean value expressions
(Class = SENIOR) IS TRUE
If this was a condition on the retrieval of rows from a student table, only rows containing the records of seniors would be retrieved. To retrieve the records of all non-seniors, you could use the following:
NOT (Class = SENIOR) IS TRUE
Alternatively, you could use:
- (Class = SENIOR) IS FALSE
- To retrieve all rows that have a null value in the CLASS column, use
(Class = SENIOR) IS UNKNOWN
The COUNT function returns the number of rows in the specified table. To count the number of precocious seniors in my example high-school database, use the following statement:
- SELECT COUNT (*)
- FROM STUDENT
- WHERE Grade = 12 AND Age < 14 ;
- Use the MAX function to return the maximum value that occurs in the specified
- column. Say that you want to find the oldest student enrolled in your school.
- The following statement returns the appropriate row:
SELECT FirstName, LastName, Age
WHERE Age = (SELECT MAX(Age) FROM STUDENT);
The SUM function adds up the values in a specified column. The column must be one of the numeric data types, and the value of the sum must be within the range of that type. Thus, if the column is of type SMALLINT, the sum must be no larger than the upper limit of the SMALLINT data type. In the retail database from earlier in this chapter, the INVOICE table contains a record of all sales. To find the total dollar value of all sales recorded in the database, use the SUM function as follows:
- SELECT SUM(TotalSale) FROM INVOICE;
- The AVG function returns the average of all the values in the specified column. As does the SUM function, AVG applies only to columns with a numeric data type. To find the value of the average sale, considering all transactions in the database, use the AVG function like this:
SELECT AVG(TotalSale) FROM INVOICE
Data control language: DCL
Designing a Database
- 1. Decide what object you want to include in your Database
- 2. Determine which of these object should be tables and which should be columns within those tables
- 3. Define tables based on how you need to organize the objects.
- The first step in designing a database is deciding which aspects of the system are important enough to include in the model. Treat each aspect as an object and create a list of all the objects you can think of. At this stage, don’t try to decide how these objects relate to each other. Just try to list
- them all.
- You may find it helpful to gather a team of people who are familiar with the system you’re modeling. These people can brainstorm and respond to each other’s ideas. Working together, you’ll probably develop a more complete and
- accurate set of objects than you would on your own.
When you have a reasonably complete set of objects, move on to the next step: deciding how these objects relate to each other. Some of the objects are major entities, crucial to giving you the results that you want. Others are subsidiary to those major entities. You ultimately may decide that some objects don’t belong in the model at all.
- Major entities translate into database tables. Each major entity has a set of associated attributes, which translate into the table columns. Many business databases, for example, have a CUSTOMER table that keeps track of cus-
- tomers’ names, addresses, and other permanent information. Each attribute of a customer, such as name, street, city, state, zip code, phone number, and e-mail address, becomes a column in the CUSTOMER table.
- If you’re hoping to find a set of rules to help you identify which objects should be tables and which of the attributes in the system belong to which table, think again. You may have some reasons for assigning a particular attribute to one table and other reasons for assigning the attribute to another
- table. You must make your judgment based on what information you want to get from the database and how you want to use that information.
When deciding how to structure database tables, involve the future users of the database as well as the people who will make decisions based on database information. If you come up with what you think is a reasonable structure, but it isn’t consistent with the way that people will use the information, your system will be frustrating to use at best — and could even produce wrong information, which is even worse. Don’t let this happen! Put careful effort into deciding how to structure your tables.
Take a look at an example to demonstrate the thought process that goes into creating a multitable database. Say that you just established VetLab, a clinical microbiology laboratory that tests biological specimens sent in by veterinarians. You want to track several things, including the following:
- Tests that you perform
- Each of these entities has associated attributes. Each client has a name, address, and other contact information. Each test has a name and a standard
- charge. Employees have contact information as well as a job classification and pay rate. For each order, you need to know who ordered it, when it was ordered, and what test was ordered. For each test result, you need to know the outcome of the test, whether the results were preliminary or final, and the test order number.
These tables relate to each other by the attributes (columns) that they share, as the following list describes:
- The CLIENT table links to the ORDERS table by the ClientName column.
- The TESTS table links to the ORDERS table by the TestName (TestOrdered) column.
- The EMPLOYEE table links to the ORDERS table by the Employee Name (Salesperson) column.
- The RESULTS table links to the ORDERS table by the OrderNumber column.
For a table to serve as an integral part of a relational database, link that table to at least one other table in the database by using a common column. Figure 5-1 illustrates the relationships between the tables.
- CREATE TABLE CLIENT (
- ClientName CHARACTER (30) PRIMARY KEY,
- Address1 CHARACTER (30),
- Address2 CHARACTER (30),
- City CHARACTER (25),
- State CHARACTER (2),
- PostalCode CHARACTER (10),
- Phone CHARACTER (13),
- Fax CHARACTER (13),
- ContactPerson CHARACTER (30)
- You usually can’t guarantee that a particular data item in a database is correct, but you can determine whether a data item is valid. Many data items have a limited number of possible values. If you make an entry that is not one of the
- possible values, that entry must be an error. The United States, for example, has 50 states plus the District of Columbia, Puerto Rico, and a few possessions. Each of these areas has a two-character code that the U.S. Postal Service recognizes. If your database has a State column, you can enforce domain integrity by requiring that any entry into that column be one of the recognized two-character codes. If an operator enters a code that’s not on the list of valid codes, that entry breaches domain integrity. If you test for domain
- integrity, you can refuse to accept any operation that causes such a breach.
Domain integrity concerns arise if you add new data to a table by using either the INSERT statement or the UPDATE statement. You can specify a domain for a column by using a CREATE DOMAIN statement before you use that column in a CREATE TABLE statement , as shown in the following example:
- CREATE DOMAIN LeagueDom CHAR (8)
- CHECK (LEAGUE IN (‘American’, ‘National’));
- CREATE TABLE TEAM (
- TeamName CHARACTER (20) NOT NULL,
- League LeagueDom NOT NULL
- The domain of the League column includes only two valid values: American and National. Your DBMS doesn’t enable you to commit an entry or update to the TEAM table unless the League column of the row you’re adding has a
- value of either ‘American’ or ‘National’.
- referential integrity:
- Even if every table in your system has entity integrity and domain integrity, you may still have a problem because of inconsistencies in the way one table relates to another. In most well-designed databases, every table contains at
- least one column that refers to a column in another table in the database. These references are important for maintaining the overall integrity of the database. The same references, however, make update anomalies possible.
- Update anomalies are problems that can occur after you update the data in a row of a database table. The relationships among tables are generally not bidirectional. One table is usually dependent on the other. Say, for example, that you have a database with a CLIENT table and an ORDERS table. You may conceivably enter a client
- into the CLIENT table before she makes any orders. You can’t, however, enter an order into the ORDERS table unless you already have an entry in the CLIENT table for the client who’s making that order. The ORDERS table is
- dependent on the CLIENT table. This kind of arrangement is often called a parent-child relationship, where CLIENT is the parent table and ORDERS is the child table. The child is dependent on the parent.
- Generally, the primary key of the parent table is a column (or group of columns) that appears in the child table. Within the child table, that same column (or group) is a foreign key. A foreign key may contain nulls and need not be
- Update anomalies arise in several ways. A client moves away, for example, and you want to delete her from your database. If she has already made some orders, which you recorded in the ORDERS table, deleting her from the
- CLIENT table could present a problem. You’d have records in the ORDERS (child) table for which you have no corresponding records in the CLIENT (parent) table. Similar problems can arise if you add a record to a child table without making a corresponding addition to the parent table. The corresponding foreign keys in all child tables must reflect any changes to the primary
- key of a row in a parent table; otherwise, an update anomaly results.
- You can eliminate most referential integrity problems by carefully controlling the update process. In some cases, you need to cascade deletions from a parent table to its children. To cascade a deletion, when you delete a row
- from a parent table, you also delete all the rows in its child tables that have foreign keys that match the primary key of the deleted row in the parent table. Take a look at the following example:
The constraint NameFK names ClientName as a foreign key that references the ClientName column in the CLIENT table. If you delete a row in the CLIENT table, you also automatically delete all rows in the ORDERS table that have the same value in the ClientName column as those in the ClientName column of the CLIENT table. The deletion cascades down from the CLIENT table to the ORDERS table. The same is true for the foreign keys in the ORDERS table that refer to the primary keys of the TESTS and EMPLOYEE tables.
You may not want to cascade a deletion. Instead, you may want to change the child table’s foreign key to a NULL value. Consider the following variant of the previous example:
The constraint SalesFK names the Salesperson column as a foreign key that references the EmployeeName column of the EMPLOYEE table. If a salesperson leaves the company, you delete her row in the EMPLOYEE table. New salespeople are eventually assigned to her accounts, but for now, deleting her name from the EMPLOYEE table causes all of her orders in the ORDER table to receive a null value in the Salesperson column.
Refuse to permit an addition to a child table until a corresponding row exists in its parent table. If you refuse to permit rows in a child table without a corresponding row in a parent table, you prevent the occurrence of “orphan” rows in the child table. This refusal helps maintain consistency across tables.
Refuse to permit changes to a table’s primary key. If you refuse to permit changes to a table’s primary key, you don’t need to worry about updating foreign keys in other tables that depend on that primary key.
Adding a column to an existing table:
- Suppose your company institutes a policy of having a party for every employee on his or her birthday. To give the party coordinator the advance warning she needs to plan these parties, you need to add a Birthday column to the
- EMPLOYEE table. As they say in the Bahamas, “No problem!” Just use the ALTER TABLE statement. Here’show:
- ALTER TABLE EMPLOYEE
- ADD COLUMN Birthday DATE ;
- Now all you need to do is add the birthday information to each row in the
- table, and you can party on.
Delecting a column from an existing table
- Now suppose that an economic downturn hits your company and it can no longer afford to fund lavish birthday parties. Even in a bad economy, DJ fees have gone through the roof. No more parties means no more need to retain
- birthdaydata. WiththeALTER TABLE statement, you can handle this situation too.
- ALTER TABLE EMPLOYEE
- DROP COLUMN Birthday ;
Ah, well, it was fun while it lasted.
potential problem areas:
Bad imput data
- Bad imput data
- The source documents or data files that you use to populate your database may contain bad data. This data may be a corrupted version of the correct data, or it may not be the data you want. A range check tells you whether the
- data has domain integrity. This type of check catches some — but not all — problems. Field values that are within the acceptable range, but are nonetheless
- incorrect, aren’t identified as problems.
- Operator error
- Your source data may be correct, but the data entry operator may incorrectly transcribe the data. This type of error can lead to the same kinds of problems as bad input data. Some of the solutions are the same, too. Range checks help, but they’re not foolproof. Another solution is to have a second operator independently validate all the data. This approach is costly, because independent validation takes twice the number of people and twice the time. But in some cases where data integrity is critical, the extra effort and expense may prove worthwhile.
- mechanical failure
- If you experience a mechanical failure, such as a disk crash, the data in the table may be destroyed. Good backups are your main defense against this problem.
- Consider the possibility that someone may want to intentionally corrupt your data. Your first line of defense is to deny database access to anyone who may have a malicious intent, and restrict authorized users so that they can access only the data they need. Your second defense is to maintain data backups in a safe place. Periodically reevaluate the security features of your installation. Being just a little paranoid doesn’t hurt
- data redundancyData redundancy is a big problem with the hierarchical database model, but the problem can plague relational databases, too. Not only does such redundancy waste storage space and slow down processing, but it can also lead
- to serious data corruption. If you store the same data item in two different tables in a database, the item in one of those tables may change, while the corresponding item in the other table remains the same. This situation generates a discrepancy, and you may have no way of determining which version is correct. Keep data redundancy to a minimum.
A column constraint imposes a condition on a column in a table.
A table constraint is a constraint on an entire table.
An assertion is a constraint that can affect more than one table.
Normalizing the Database:
A host of problems — called modification anomalies — can plague a database if you don’t structure the database correctly. To prevent these problems, you can normalize the database structure. Normalization generally entails split-
ting one database table into two simpler tables.
- Your company sells household cleaning and personal-care products, and you charge all customers the same price for each product. The SALES table keeps track of everything for you. Now assume that customer 1001 moves out of the
- area and no longer is a customer. You don’t care what he’s bought in the past, because he’s not going to buy anything from your company again. You want to delete his row from the table. If you do so, however, you don’t just lose the
- fact that customer 1001 has bought laundry detergent; you also lose the fact that laundry detergent costs $12. This situation is called a deletion anomaly. In deleting one fact (that customer 1001 bought laundry detergent), you inad-
- vertently delete another fact (that laundry detergent costs $12).
- You can use the same table to illustrate an insertion anomaly. For example, say that you want to add stick deodorant to your product line at a price of $2. You can’t add this data to the SALES table until a customer buys stick
- The problem with the SALES table in the figure is that this table deals with more than one thing: It covers not just which products customers buy, but also what the products cost. You need to split the SALES table into two tables, each dealing with only one theme or idea, as shown in Figure 5-3.
- Figure 5-3 shows that the SALES table is divided into two tables:
- CUST_PURCH, which deals with the single idea of customer purchases.
- PROD_PRICE, which deals with the single idea of product pricing.
You can now delete the row for customer 1001 from CUST_PURCH without losing the fact that laundry detergent costs $12 (the cost of laundry detergent is now stored in PROD_PRICE). You can also add stick deodorant to PROD_PRICE, whether anyone has bought the product or not. Purchase information is stored elsewhere, in the CUST_PURCH table.
The process of breaking up a table into multiple tables, each of which has a single theme, is called normalization. A normalization operation that solves one problem may not affect other problems. You may need to perform several successive normalization operations to reduce each resulting table to a single theme. Each database table should deal with one — and only one — main theme. Sometimes, determining that a table really deals with two or more themes is difficult.
- You can classify tables according to the types of modification anomalies to which they’re subject. In a 1970 paper, E. F. Codd, the first to describe the relational model, identified three sources of modification anomalies and
- defined first, second, and third normal forms (1NF, 2NF, 3NF) as remedies to those types of anomalies. In the ensuing years, Codd and others discovered additional types of anomalies and specified new normal forms to deal with
- them. The Boyce-Codd normal form (BCNF), the fourth normal form (4NF), and the fifth normal form (5NF) each afforded a higher degree of protection against modification anomalies. Not until 1981, however, did a paper, written
- by Ronald Fagin, describe domain-key normal form (DK/NF). Using this last normal form enables you to guarantee that a table is free of modification
The normal forms are nested in the sense that a table that’s in 2NF is automatically also in 1NF. Similarly, a table in 3NF is automatically in 2NF, and so on. For most practical applications, putting a database in 3NF is sufficient to
- ensure a high degree of integrity. To be absolutely sure of its integrity, you must put the database into DK/NF.
- After you normalize a database as much as possible, you may want to make selected denormalizations to improve performance. If you do, be aware of the types of anomalies that may now become possible.
First normal form (1FN)
To be in first normal form (1NF), a table must have the following qualities:
- The table is two-dimensional, with rows and columns.
- Each row contains data that pertains to some thing or portion of a thing.
- Each column contains data for a single attribute of the thing it’s describing.
- Each cell (intersection of a row and a column) of the table must have only a single value.
- Entries in any column must all be of the same kind. If, for example, the entry in one row of a column contains an employee name, all the other rows must contain employee names in that column, too.
- Each column must have a unique name.
- No two rows may be identical (that is, each row must be unique).
- The order of the columns and the order of the rows is not significant.
A table (relation) in first normal form is immune to some kinds of modification anomalies but is still subject to others. The SALES table shown in Figure 5-2 is in first normal form, and as discussed previously, the table is subject to deletion and insertion anomalies. First normal form may prove useful in some applications but unreliable in others.
Second normal form:
To appreciate second normal form, you must understand the idea of functional dependency. A functional dependency is a relationship between or among attributes. One attribute is functionally dependent on another if the value of the second attribute determines the value of the first attribute. If you know the value of the second attribute, you can determine the value of the first attribute.
Suppose, for example, that a table has attributes (columns) StandardCharge, NumberOfTests, and TotalCharge that relate through the following equation:
TotalCharge = StandardCharge * NumberOfTests
TotalCharge is functionally dependent on both StandardCharge and NumberOfTests. If you know the values of StandardCharge and NumberOfTests, you can determine the value of TotalCharge
- Every table in first normal form must have a unique primary key. That key may consist of one or more than one column. A key consisting of more than one column is called a composite key. To be in second normal form (2NF), all
- non-key attributes (columns) must depend on the entire key. Thus, every relation that is in 1NF with a single attribute key is automatically in second normal form. If a relation has a composite key, all non-key attributes must depend on all components of the key. If you have a table where some non-key attributes don’t depend on all components of the key, break the table up into two or more tables so that, in each of the new tables, all non-key attributes depend on all components of the primary key.
- Sound confusing? Look at an example to clarify matters. Consider a table like the SALES table back in Figure 5-2. Instead of recording only a single purchase for each customer, you add a row every time a customer buys an item for the first time. An additional difference is that charter customers (those with Customer_ID values of 1001 to 1007) get a discount off the normal price. Figure 5-4 shows some of this table’s rows.
In Figure 5-4, Customer_ID does not uniquely identify a row. In two rows, Customer_ID is 1001. In two other rows, Customer_ID is 1010. The combination of the Customer_ID column and the Product column uniquely identifies a row. These two columns together are a composite key.
- If not for the fact that some customers qualify for a discount and others don’t, the table wouldn’t be in second normal form, because Price (a non-key attribute) would depend only on part of the key (Product). Because some customers do qualify for a discount, Price depends on both CustomerID
- and Product, and the table is in second normal form.
third normal form
Tables in second normal form are subject to some types of modification anomalies. These anomalies come from transitive dependencies.
A transitive dependency occurs when one attribute depends on a second attribute, which depends on a third attribute. Deletions in a table with such a dependency can cause unwanted information loss. A relation in third normalform is a relation in second normal form with no transitive dependencies.
Look again at the SALES table in Figure 5-2, which you know is in first normal form. As long as you constrain entries to permit only one row for each Customer_ID, you have a single-attribute primary key, and the table is in second normal form. However, the table is still subject to anomalies. What if customer 1010 is unhappy with the chlorine bleach, for example, and returns the item for a refund? You want to remove the third row from the table, which records the fact that customer 1010 bought chlorine bleach.
You have a problem: If you remove that row, you also lose the fact that chlorine bleach has a price of $4. This situation is an example of a transitive dependency.Price depends on Product, which, in turn, depends on the primary key Customer_ID.
Breaking the SALES table into two tables solves the transitive dependency problem. The two tables shown in Figure 5-3, CUST_PURCH and PROD_PRICE,make up a database that’s in third normal form.
Domain-key normal form (DK/NF)
- After a database is in third normal form, you’ve eliminated most, but not all, chances of modification anomalies. Normal forms beyond the third are defined to squash those few remaining bugs. Boyce-Codd normal form (BCNF), fourth
- normal form (4NF), and fifth normal form (5NF) are examples of such forms. Each form eliminates a possible modification anomaly but doesn’t guarantee
- prevention of all possible modification anomalies. Domain-key normal form (DK/NF), however, provides such a guarantee.
- A relation is in domain-key normal form (DK/NF) if every constraint on the relation is a logical consequence of the definition of keys and domains.
- A constraint in this definition is any rule that’s precise enough that you can evaluate whether or not it’s true. A key is a unique identifier of a row in a table. A domain is the set of permitted values of an attribute.
Look again at the database in Figure 5-2, which is in 1NF, to see what you must do to put that database in DK/NF.
- To enforce Constraint 3 (that Customer_ID must be an integer greater than1,000), you can simply define the domain for Customer_ID to incorporate
- this constraint. That makes the constraint a logical consequence of the domain of the CustomerID column. Product depends on Customer_ID, and Customer_ID is a key, so you have no problem with Constraint 1, which
- is a logical consequence of the definition of the key. Constraint 2 is a problem. Price depends on (is a logical consequence of) Product, and Product isn’t a key. The solution is to divide the SALES table into two tables. One table uses Customer_ID as a key, and the other uses Product as a key. This setup is what you have in Figure 5-3. The database in Figure 5-3, besides being in
- 3NF, is also in DK/NF.
- Design your databases so they’re in DK/NF if possible. If you do so, enforcing key and domain restrictions causes all constraints to be met, and modification anomalies aren’t possible. If a database’s structure is designed so that
- you can’t put it into domain-key normal form, you must build the constraints into the application program that uses the database. The database doesn’t guarantee that the constraints will be met
- Sometimes being abnormal pays off. You can get carried away with normalization and go too far. You can break up a database into so many tables that the entire thing becomes unwieldy and inefficient. Performance can plummet.
- Often, the optimal structure is somewhat denormalized. In fact, practical databases are almost never normalized all the way to DK/NF. You want to normalize the databases you design as much as possible, however, to eliminate the possibility of data corruption that results from modification anomalies.
After you normalize the database as far as you can, make some retrievals. If performance isn’t satisfactory, examine your design to see whether selective denormalization would improve performance without sacrificing integrity. By carefully adding redundancy in strategic locations and denormalizing, you can arrive at a database that’s both efficient and safe from anomalies.