Database Design - are you Novice, Ninja or Nincompoop?

By Tony Marston

23rd September 2012
Amended 9th November 2012

Introduction
Basic Database Rules
Secondary Database Conventions
Improvements which may not be for the better
Rules which do not exist
Common mistakes
Conclusion
References
Amendment History

Introduction

In my long career in IT I have designed and built many applications using different languages, different paradigms, and different storage systems. Among these storage systems have been flat files, indexed files, hierarchical databases, network databases and relational databases. The vast majority of my work over the past 20 years has been with relational databases which are accessed using the ubiquitous SQL language. Although the technology is constantly evolving, with new features and new ways of doing things, there are some concepts which have stood the test of time and attempts by newcomers to supplant them with new and "cool" ideas often have unintended consequences further down the line. Whenever I see a database which has been designed by someone who has yet to realise that these "cool" ideas are not so cool as he would like to think I always give a deep sigh and say "Will these novices never learn?" (I tell a lie. What I actually thinks is "WTF! What idiot designed this pile of cr*p?" but I'm too polite to say so.) I am therefore writing this article in the hope that my experiences will provide some guidance to those who are new to the world of database design.

There are some people out there who can design but never build. I have come across quite a few who cannot build what they design, and sometimes they design things which cannot actually be built without immense effort, so excuse me if I don't place any value on their opinions. It is only by building what you design that you actually experience the consequences of every design decision. It is only by using databases, by writing the queries to read and store data, that you get to understand the pros and cons of the different ideas that are floating around. I have written code which accesses databases which were designed the "old fashioned" way, and I have also written code which accesses databases which were designed with some of the "new fangled" ideas. This means that I have personally witnessed the consequences of each change from the "old" way to the "new" way, which in turn means that I am speaking from direct practical experience and not quoting some rumour, gossip, hearsay or old wive's tale.

Let me start by providing a definition of the terms which I used in this article's title:

The big problem here is when a novice receives advice he innocently assumes that the speaker actually knows what he is talking about. Unfortunately the world is full of charlatans, cowboys and purveyors of snake oil, so the poor novice is unlikely to spot when he is being directed up the wrong path. Searching the internet is just as bad as there is nothing to stop these eejits from spreading their false wisdom. If you wish to be directed down the path of enlightenment then read on.

Basic Database Rules

These are the basic database rules that have existed for decades. Some of these rules may have been relaxed in recent DBMS software, but in my humble opinion some of these changes are not for the better and should be kicked into the long grass.

  1. Every table requires a primary key. Some databases allow a table to be created without a primary key, but there is no good reason to use this "feature".
  2. The primary key can be of almost any data type, but should not exceed the size limit for the database (which is usually 255 characters). The primary key can also span more than one column, and need not be the first column(s) to be defined for the table.
  3. If more than one unique key is available then one must be picked as the primary. The remainder are known as candidate keys.
  4. Primary keys, once created, cannot (or should not) be changed. That is why surrogate/technical keys were invented.
  5. Candidate keys (additional unique keys) can be updated as this is not supposed to have any effect on any relationships with records on child tables.
  6. Foreign keys on a child table link to a primary key on a parent table. A child table may update its foreign key to point to a different parent, but a parent cannot (or should not) change its primary key as this would affect all child records.
  7. All database, table and column names should use the standard character set, which is comprised of letters, numbers and underscores. Attempting to use symbols or spaces may seem "cool" at first, but eventually you will feel the pain.
  8. All databases contain a list of reserved words which should not be used in any database, table, column or procedure names. Some databases allow reserved words if they are enclosed in special characters such as double quotes (the ANSII standard), backticks (MySQL), or '[' and ']' (SQL Server).
  9. All databases, both past and present, are case INsensitive when it comes to names. This means that names such as "foo_bar", "FOO_BAR" and "Foo_Bar" will all resolve to the same column. This means that using mixed case names is pointless.

Secondary Database Conventions

Secondary to the basic rules are some standard practices which have stood the test of time:

  1. After identifying the columns that need to be held on a database table the first step is to look for a natural primary key, a column (or columns) whose value will never change. Something like a person's name would not be a good choice as name changes are quite possible. Something like an ISO country code would be a good choice as a country's code never changes. New countries may emerge and old countries may disappear, but the codes are always unique. Try to keep the number of columns in the primary key as small as possible. Two or three is OK, nine or ten is very questionable.
  2. If a natural key is not apparent it is permissible to add an extra column for this purpose. This is often known as a "surrogate" or "technical" key. The key may or may not be exposed to the outside world, and may be a number or a string, or any other permissible data type. All you have to do is supply it with a value which is unique and will never change.
  3. All tables should be named in the singular, not the plural. Thus you should have "product" instead of "products". This is because it is referred to as "the product table" and not "the products table". When writing SQL queries it is more logical to write "product.cost" than "products.cost". It is common practice to have a table's primary key comprised of the table name with a suffix of "_id", and the name "product_id" is generally more acceptable than "products_id".

    It was pointed out in the comp.databases.mysql newsgroup that SELECT * FROM customers (plural) would seem to be more logical than SELECT * FROM customer. But when you reference a column within a table the opposite would be true, so customer.customer_id would seem to be more logical than customers.customer_id. It is not possible to please everybody all of the time, so I tend to stick with the old convention which used the term "customer file" and "product file" instead of "customers file" and "products file".

  4. All columns should have meaningful names and not short and unintelligible codes. I once worked with a company which had migrated its data from a very primitive database which had a limit of 8 characters for every name, and this habit still persisted even though this restriction no longer applied. The result of this was that all names were totally meaningless, and it was impossible to work with the database without a cheat sheet.

    Some useful conventions I have followed for years are as follows:

    While such conventions as these are not absolutely necessary, they do make it easier for a programmer to look at a column name and have a better understanding of what data it contains.

  5. Following on from the fact that all databases are case INsensitive, trying to enforce specific case would be counter-productive, so don't try to use CamelCaps to identify different components in a name. The standard convention for many years has always been to use "lower_under" (lower case with underscores) or Snake Case, so use "foo_bar" and not "FooBar". This is even more important in languages such as PHP when referencing column names in the query result as they become case sensitive, and they use the same case that was specified in the SELECT part of the query which may not be the same as what was specified in the DDL statement. As the case could vary depending on how the string was actually written, any lookups by name on the result set would have to have exactly the same case otherwise the lookup would fail. There is only one way to circumvent this problem, and that is to force all column names in the result set to lower case. This does not cause a problem for the DBMS or the programming language, so it shouldn't cause a problem for you.
  6. Wherever possible fields with the same content should have the same name. This means that where the same content appears on more than one table it should have exactly the same name with the same spelling, but subject to certain exceptions. Thus it is unwise to use ACCOUNT_ID on one table, ACCNO on another and ACC_NUM on a third. By using column names which are identical across tables you enable the option of using the USING clause on JOIN statements, so that
    SELECT tableA.id, tableA.name, tableB.comment
    FROM tableA
    LEFT JOIN tableB ON (tableB.id=tableA.id)
    
    can be reduced to:
    SELECT tableA.id, tableA.name, tableB.comment
    FROM tableA
    LEFT JOIN tableB USING (id)
    
    As you can see this only works if the two ID columns share the same value as well as the same name.
  7. Columns with the same name should also have the same type and size. It would be very confusing to see a column defined as an integer on one table and a decimal or float on another.
  8. Wherever possible fields with different content should have different names. This means that different columns should not share the same name if they have different meanings. Thus if you have an ORDER table and an ORDER_ITEM table and both contain a column called STATUS, they are both assumed to have the same range of values with the same meaning. There are also some database tools which, when encountering the same column name on different tables, will assume that they indicate a relationship based on the column values. As this is not the case these two columns should have unique names such as ORDER_STATUS and ORDER_ITEM_STATUS. This removes all confusion from both a human observer and any software tools.

    Naming Exceptions: As with almost every rule there may be exceptions in certain circumstances. The same content can have different names:

  9. There is no advantage in using the misunderstood and much abused Hungarian Notation to prefix each database name with "db", each table with "tbl" and each column with "col". The database software is intelligent enough to work without these prefixes, and so should you.
  10. Other bad naming conventions which have absolutely no merit and should be avoided:
  11. A practice which I have always found useful is for one person to design the database to satisfy the known business rules, then for a second person to reverse engineer the database schema and reconstruct those business rules. If there is any discrepancy between the original and reconstructed business rules then perhaps either the rules or the database schema needs a little tweaking.

Improvements which may not be for the better

Here are some changes or "improvements" which have been introduced into some modern databases, but which should not necessarily be adopted as a "good idea":

Rules which do not exist

As well as identifying some rules which do exist, here are some rules which do not exist in modern DBMS software:

Some novices quote the observation "numeric keys are faster than string keys" in order to justify their use of single integer keys where a viable string key or compound key already exists. While this observation may have had some merit decades ago when the hardware was considerably slower and the software less sophisticated, it has much less merit today. Quoting an out of date observation which is not substantiated with sufficient proof in the form of proper benchmarks could lead to wasted effort. If the effort of introducing an artificial integer key is greater than the speed improvement gained by using an integer key instead of a string key, then the net improvement is actually negative and so is not an improvement after all. This is especially true when you have to perform a lookup on a natural string key in order to obtain the integer primary key.

Common Mistakes

Here are some of the mistakes which I regularly encounter in databases which were designed by those I consider to be nincompoops:

  1. Failure to normalise the database properly

    Before moving from the draft stage of your database design to the final physical implementation it is absolutely essential that you go through a process called 'data normalisation'. This ensures that you have the right data in the right place and makes the insertion and subsequent retrieval of data as efficient and effective as possible. Without it your application could run like a pig with a wooden leg. This process is explained in more detail in The Relational Data Model, Normalisation and effective Database Design.

  2. Every table has a technical primary key called "id" which is from an automatic sequence.

    Having a column called 'id' appearing on every table breaks the different content should have different name rule. If you ever bother to look at the sample SQL queries in the database vendor's documentation you will always see column names such as 'product_id' and 'customer_id' and never just 'id'. Why is this I wonder? I once asked a novice why he followed this dubious practice, and his answer was: "It avoids duplication when typing queries. Why use 'product.product_id' when you can type less with 'product.id'?" A novice may think that this is a good idea, but a ninja knows that this "saving" comes with several gotchas:

    Gotcha #1: Multi-table JOINS

    What this novice failed to realise was that when the database returns a result set none of the column names is qualified with a table name. This may not matter too much with a single-table query, but what happens with a multi-table JOIN? Take a look at the following query:

    SELECT *
    FROM order AS o 
    LEFT JOIN deal AS d ON (o.deal_id = d.id) 
    LEFT JOIN user AS u ON (d.user_id = u.id) 
    LEFT JOIN address_book AS ab ON (o.address_id = ab.id) 
    LEFT JOIN product_supplier AS ps ON (d.prod_supplier_id = ps.id) 
    LEFT JOIN product_price AS pp ON (pp.product_supplier_id = ps.id) 
    LEFT JOIN product AS p ON (ps.product_id = p.id) 
    LEFT JOIN supplier AS s ON (ps.supplier_id = s.supplier_id) 
    WHERE o.status = 'A'
    

    This query references eight tables which means that the result will contain eight references to a column called 'id'. This is bad enough when you look at the result in your SQL client program as you will find it impossible to relate each occurrence of 'id' back to its table of origin, but when the result is made available inside your program you will not see multiple values for a column name as all column names will become unique, which means that there will only be one value for each column name. This means that each separate value for the 'id' column will overwrite the previous value, so the value for 'id' in the result set will be from the last table in the query. So what happens when you want the 'id' value for a specific table which may or may not be the last in the query? The only solution is to specify an alias in the select list, such as:

    SELECT *, o.id AS order_id, d.id AS deal_id, u.id AS user_id, ab.id AS address_id, p.id AS product_id, ...
    

    The observant among you will see that the extra effort required to get around the problem caused by using the universal name of 'id' is greater than the savings which were assumed in the first place. The saving of *NOT* including the table name in the primary key name is more than eaten up when you have to modify your queries to include alias names which *DO* include the table name. So what you gain on the swings you lose twice over on the roundabouts. Although a human may be able to cope with the extra effort required to get around this "saving", with a proper framework the number of hand coded queries should be small to non-existent, but a computerised query generator might find it more difficult to deal with such violations of proper naming conventions.

    If you think the previous idea is clever, there are those who double the cleverness by using a universal 'name' column to go with the 'id' column. This means that the SELECT string has to be enhanced even more to include the following:

    u.name AS user_name, p.name AS product_name, ...
    

    So as well as having to specify aliases for every 'id' column you now have to double the effort by including aliases for every 'name' column. Double the savings! Double the fun!

    Gotcha #2: Passing context to child programs

    As well as having to modify queries to deal with multiple columns which have the same name, a ninja will spot a serious problem when trying to use a framework which passes context (selection criteria) from one program to another. Suppose the database has two tables called CUSTOMER and ORDER where the ORDER table has a foreign key which links back to an entry on the CUSTOMER table. The application will therefore have two programs called "List Customer" and "List Order". Without any selection criteria the "List Order" program will retrieve all orders in the system irrespective of the customer. Now suppose in the "List Customer" program there is the ability to select a customer then press a button to go to the "List Order" program to show only those orders which are associated with that customer. The "List Customer" program has to pass the relevant selection criteria to the "List Order" program so that it can retrieve the desired records. Because these two tables are related the primary key on the CUSTOMER table is also a foreign key on the ORDER table, so the context (selection criteria) which is passed from the "List Customer" program to the "List Order" program is the primary key of the selected customer record. This context can be passed as a simple string in the format <pkey_name>='<customer_id>' such as customer_id='12345'. Provided that the column name(s) in the primary key are *exactly* the same as the column name(s) in the foreign key then the receiving program can use the passed context without the need for any additional processing.

    This entire process becomes more complicated if the column names are not *exactly* the same, such as when every primary key column of every table has the same name. This means that the 'id' column in the CUSTOMER table cannot be used as the 'id' column in the ORDER table as one contains the customer identity while the other contains the order identity. The child program which receives context from a parent program now has to have additional code to translate the column name(s), so in this example 'id' has to be translated into 'customer_id'. This translation process usually means extra work for the programmer as it would not be possible for it to be automated unless the receiving program had the following information:

    1. The name of the table from which the primary key was generated.
    2. Details of the relationship with that table so that the name(s) of the primary key column(s) can be translated into the corresponding foreign key column(s).

    This information is not usually available, so the process cannot easily be automated. It is usually the case that the effort required to implement the manual solution is much greater than the "savings" that the original design decision was supposed to generate, so that original design decision does not look so rosy in the cold light of day, now does it?

  3. Using common names in the database makes it possible to use common methods in the code.

    The novice designers like to justify their use of universal column names of 'id' and 'name' by saying that it makes it possible to use standard method names such as getByID($id) and getByName($name) in their code. A ninja will laugh at this argument as he knows that in large systems the ability to retrieve records where the selection criteria is limited to either id=n or name='x' is so restrictive that it will soon cause the system to come to a grinding halt. In the real world the variety of selection criteria is limitless, and as it would be impossible to generate a different method to deal with each possibility a ninja uses a single getData($where) method in which the $where argument is a string which is acceptable as the WHERE clause in an SQL SELECT statement. This single method would take *ANY* valid string value and would therefore deal with *ANY* selection criteria.

  4. Unnecessary technical keys.

    If there is already a suitable primary key then you will be wasting both time and disk space by creating an additional technical key. This usually follows on from using the universal 'id' column. Take a look at the following:

    CREATE TABLE country (
      id INT(11) NOT NULL AUTO_INCREMENT,
      iso_country_code CHAR(3) NOT NULL,
      country_name VARCHAR(50) NOT NULL,
      PRIMARY KEY (id),
      INDEX iso_country_code (iso_country_code)
    );
    

    The column 'iso_country_code' is guaranteed to always be unique, so that should be the primary key. The addition of a separate 'id' column is just wasting space in the data table and its own index. Why use a separate table to convert a 4-byte integer into a 3-byte string? What is even worse is that the 'iso_country_code' column is defined as a non-unique index instead of a unique index (candidate key) which means that it is possible for the same country code to appear more than once.

    Similar mistakes can be made with relationships between tables:

  5. A foreign key is linked to a unique but non-primary key field.

    This "feature" has only appeared recently in some databases, so is not universally adopted. I can see no good reason to link to a column which is NOT the primary key and is therefore capable of being changed.

  6. A foreign key is linked to a non-key field.

    This means that it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.

  7. A foreign key is linked to only part of the associated primary key.

    If the primary key contains multiple columns then the foreign key should have the same number of columns. Without this one-to-one match it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.

  8. Foreign key constraints are set to ON UPDATE CASCADE.

    This allows the key to be changed and the changed value to cascade down to all related records on child tables. This totally negates the purpose of a technical key, and if used unwisely could cause a large processing overhead.

  9. The same name used to mean different things.

    Take a look at the following DDL statements:

    CREATE TABLE IF NOT EXISTS `brand` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `brand` varchar(255) NOT NULL,
      PRIMARY KEY (`id`)
    );
    
    CREATE TABLE IF NOT EXISTS `product` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      ...
      `brand` int(11) DEFAULT NULL,
      ...
      PRIMARY KEY (`id`),
      CONSTRAINT `product_fk_1` FOREIGN KEY (`brand`) REFERENCES `brand` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
    );
    

    Do you see the mistake? There is a column called brand which exists on two tables, but on one it has the data type VARCHAR while on the other it is an INTEGER.

    The sensible way to do this would be as follows:

    CREATE TABLE IF NOT EXISTS `brand` (
      `brand_id` int(11) NOT NULL AUTO_INCREMENT,
      `brand_name` varchar(255) NOT NULL,
      PRIMARY KEY (`brand_id`)
    );
    
    CREATE TABLE IF NOT EXISTS `product` (
      `product_id` int(11) NOT NULL AUTO_INCREMENT,
      ...
      `brand_id` int(11) DEFAULT NULL,
      ...
      PRIMARY KEY (`product_id`),
      CONSTRAINT `product_fk_1` FOREIGN KEY (`brand_id`) REFERENCES `brand` (`brand_id`) ON DELETE NO ACTION ON UPDATE NO ACTION
    );
    
  10. Different names used to mean the same thing.

    Take a look at the following DDL statements

    CREATE TABLE IF NOT EXISTS `product_price` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `product_supplier_id` int(11) NOT NULL,
      `price` decimal(7,2) NOT NULL,
      PRIMARY KEY (`id`),
      UNIQUE KEY `product_supplier_id` (`product_supplier_id`),
      CONSTRAINT `product_price_fk_1` FOREIGN KEY (`product_supplier_id`) 
                 REFERENCES `product_supplier` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
    );
    
    CREATE TABLE IF NOT EXISTS `product_price_history` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `prod_supp_id` int(11) NOT NULL,
      `price` decimal(7,2) NOT NULL,
      `start_date` datetime NOT NULL,
      PRIMARY KEY (`id`),
      KEY `idx_product_price_history` (`prod_supplier_id`),
      CONSTRAINT `product_price_history_fk_1` FOREIGN KEY (`prod_supplier_id`) 
                 REFERENCES `product_supplier` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
    );
    

    Do you see the mistake? There are two columns with the same content, but on one it is called product_supplier_id and on the other it is called prod_supp_id. These are wrong because they are different spellings of the same name, but in certain circumstances it may be permissible to use completely different names.

  11. Numeric fields which cannot contain negatives are not marked as 'unsigned'.

    This may seem only a minor point, but as an unsigned integer has a maximum value which is twice that of a signed integer, you are effectively doubling the range of possible values without any increase in storage space. If the software has been written to only expect positive values then it would be good practice to prevent negative values being allowed in by the database.

  12. Improper definition of BOOLEAN fields.

    MySQL does not have a native BOOLEAN data type, so it is necessary to use the nearest alternative. The recommended method is as follows:

    `is_boolean` TINYINT(1) UNSIGNED NOT NULL DEFAULT '0',
    

    Note that the column name begins with 'is_', which provides an enormous hint that the column is boolean as well as what the TRUE condition actually means. I once came across a situation where a table had a column called ACTIVE_OR_INACTIVE where it was unclear whether TRUE meant 'is active' or 'is inactive'. A name such as IS_ACTIVE does not cause such confusion.

    Although this allows a range of values other than 0 or 1 which are the limits for a true BOOLEAN column, 0 is always FALSE while any non-zero value (which includes 1 of course) is treated as TRUE. MySQL will allow you to test for the values TRUE or FALSE as well as the values 1 or 0 in your SELECT statement, but under some circumstances a non-zero value other than 1 may not be included in a search for TRUE. Note the following slightly different statements:

    WHERE is_boolean IS TRUE               // will include any non-zero value
    WHERE is_boolean IS NOT TRUE           // will exclude any value apart from '0'
    WHERE is_boolean = TRUE                // will include the value '1' only
    WHERE is_boolean != TRUE               // will exclude any value which is not '1'
    WHERE is_boolean IS FALSE              // will include the value '0' only
    WHERE is_boolean IS NOT FALSE          // will exclude any value which is not '0'
    WHERE is_boolean = FALSE               // will include the value '0' only
    WHERE is_boolean != FALSE              // will exclude any value which is not '0'
    WHERE is_boolean                       // same as IS TRUE
    WHERE !is_boolean                      // same as IS FALSE
    

    It is also possible to define a numerical BOOLEAN as follows:

    `some_field` INT(11) DEFAULT NULL,
    

    The use of signed integers with a size greater than one is a sure sign of a sloppy mind. The specification TINYINT(1) UNSIGNED makes it more obvious that the range of possible values is severely limited.

    Other possibilities are as follows:

    `is_boolean` CHAR(1) NOT NULL DEFAULT 'N' COMMENT 'N=No, Y=Yes',
    `is_boolean` ENUM('No','Yes') DEFAULT 'No',
    

    Note that with these non-numeric data types it is not possible to test for the values TRUE or FALSE - you must use the actual character strings. Note also that with ENUM columns the valid values are given index numbers which start at 1, which allows an invalid value to be given the index number of 0. Such invalid values will be excluded from any search using either of the following:

    WHERE is_boolean = 'No'                // will test for index=1 only
    WHERE is_boolean = 'Yes'               // will test for index=2 only
    
  13. Incorrect use of ENUM fields.

    It seems that some novices do not understand how certain data types work before they try using them. Take the following which I encountered recently:

    `type` enum('A','M') NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
    

    As anyone who has read the manual will tell you, each entry in the list of permissible values is given an index number, and it is the index number which is stored in the database. The software can then retrieve the value associated with each index number from the ENUM list. It is not usual practice to provide a list of values which are in turn keys to yet another set of values.

    The correct way to use an ENUM data type is as follows:

    `type` enum('Admin','Member') NOT NULL DEFAULT 'Member',
    

    An alternative method would be as follows:

    `type` CHAR(1) NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
    

    Trying to mix the two methods is a sure sign of a very confused mind.

  14. Too many indexes

    This was identified by Brian Cryer in the comp.databases.mysql newsgroup who used to work with someone who insisted on adding an index for every field that was used as part of a query, as in the following table:

    CREATE Table Arrivals (
        ArrivalID int AUTO_INCREMENT,
        CountryID int,
        SiteID int,
        StationID int,
        PRIMARY KEY (ArrivalID),
        Index I1 (CountryID),
        Index I2 (SiteID),
        Index I3 (StationID)
    );
    

    Most of the queries against this table were of the form: Select * from Arrivals where CountryID=X and SiteID=Y;

    What he didn't appreciate is that the database will only use one index (per table), so defining multiple indexes like this - especially ones which would never actually be used - is wasteful. Each index will consume disk space and processor cycles to maintain its contents. What he should have done was to define a compound index, such as:

        Index I1(CountryID,SiteID);
    

    He went on to compound his error by simply adding compound indexes to his previous mistakes:

        Index I1 (CountryID),
        Index I2 (CountryID, SiteID)
    

    The first of these is totally redundant as it is the leading part of the second index.

  15. Columns containing compound values

    This was identified by Herman Viaene in the comp.databases.mysql newsgroup.

    While it is permissible to have compound keys which can be comprised of more than one column, it is not a good idea to have a column with compound values. By this I mean a value such as AAAA-BBBB-CCCC-NNNN where each of the different parts has a different meaning such as product type, subtype, et cetera. This used to be quite common in the pre-relational database days when compound keys were not allowed, so as much data as was necessary was packed into a single column. This would cause problems whenever one of the substrings became too short for the addition of new codes, and would also make searching on anything other than the leading substring a bit of a headache. With modern databases each of those components would be in its own column, so none of those old problems would exist.

  16. Storing values which can be calculated or derived

    In some applications there may be values which can be calculated from other values, so there is little point in storing all of them in the database. For example, in a sales order processing system you will have order items which have UNIT_PRICE, NUMBER_OF_UNITS and EXTENDED_PRICE. Only two of these values need be stored in the database as the third can be obtained from a simple calculation, such as:

    EXTENDED_PRICE = UNIT_PRICE * NUMBER_OF_UNITS
    or
    UNIT_PRICE = EXTENDED_PRICE / NUMBER_OF_UNITS

    So which values do you store in the database, and which do you calculate as and when necessary? When there is a choice between two possible calculations, one involving a multiplication and another involving a division, you should always choose the multiplication. Why? The simple answer is "accuracy". If the values contain different numbers of decimal places then the result of a multiplication, even if it is rounded afterwards, will always be more accurate than dividing with a value which may have been rounded already. If the UNIT_PRICE and EXTENDED_PRICE both have two decimal places then it might be difficult to prove this point, but suppose the UNIT_PRICE has three decimal places? Look at the following examples where the UNIT_PRICE is 4.005:

    4.005 * 7 = 28.035, which can become either 28.04 (rounded up) or 28.03 (rounded down).

    What happens when you store the EXTENDED_PRICE and try to calculate the UNIT_PRICE?

    28.04 / 7 = 4.005714285714286 (which is rounded UP to 4.006)
    28.03 / 7 = 4.004285714285714 (which is rounded DOWN to 4.004)

    This becomes even more critical if you ever do currency conversions using an exchange rate which has 5 or more decimal places. The two possible calculations are:

    HOME_CURRENCY_AMOUNT * EXCHANGE_RATE = FOREIGN_CURRENCY_AMOUNT
    or
    FOREIGN_CURRENCY_AMOUNT / HOME_CURRENCY_AMOUNT = EXCHANGE_RATE

    The difference between the two calculations may appear small to you, but in the world of finance where you are expected to balance your books to the nearest penny on sums which go into millions, such minor differences are not acceptable.

  17. Performing unnecessary database reads for values which should be stored

    This is the opposite of the previous tip. A junior designer who heard about the idea of not storing values which could be calculated went further than necessary and decided that in his accounting system he would not bother to store the account balance anywhere, he would simply calculate it by summing all the values on the associated records in the transaction table which stored individual debit and credit amounts. When the number of transactions was quite small this overhead was quite small, but as the number of accounts grew and the number of transactions grew the time it took to perform this calculation became longer and longer, eventually becoming totally unacceptable.

    Where a value can be calculated easily by reading a single record then the cost of that calculation is acceptable, but where the number of records you need to read can grow and grow over time then it really should be stored for speed of access, and should only be recalculated as and when necessary.

  18. Database changes which break existing code.

    One thing worse than producing a less-than-efficient design before a single line of code has been written is to then change a table's structure in such a way that it breaks that code which has been written. A typical case is when a new column is added to a table, but instead of being made optional (nullable) it is made required (not nullable) but without a default value. Any existing query which attempts to insert a record into that table will therefore fail. This is even worse in those situations where an application contains hard-coded queries in multiple places as each of those places will have to be identified and changed.

Conclusion

Every decision has consequences, some intended, some unintended. Every design decision has consequences for those who have to implement that design. I have little respect for designers who cannot build what they design as they have a tendency to design things which are either difficult or impossible to build. While none of the mistakes I have identified above makes it impossible to write code which uses the database, each one acts as a speed bump which slows the developer down. As any driver will tell you, too many speed bumps will have a drastic effect on your journey time. While the distance remains the same the time to reach the destination has been increased, so when a development schedule fails to be met because of too many designer-instigated speed bumps, why do the creators of those speed bumps have the audacity to blame the developers?

If I have little respect for designers who make such basic mistakes, I have absolutely zero respect for those designers who, after being told of the consequences, often unintended, of their poor design, have the arrogance to say "My design is perfect. You'll have to code around it!" In my humble opinion such arrogant nincompoops should be forced to spend time working as lavatory attendants so they get to experience what it is like dealing with other people's cr*p instead of forcing other people to deal with theirs. I once worked on a new application where the UI, application and database designers had absolutely no knowledge of how the development language worked, which meant that nothing they designed could be implemented easily. Their combined designs were so bad that it took 10 days to build 2 simple components which I can now implement, using my own framework, in under 10 minutes.

"How can you possibly reduce development times from 10 days to 10 minutes?" I hear you say. The simple answer is that I am still a hands-on developer and not a day-dreaming designer or an architecture astronaut. I design what can be built because I regularly build from my designs. When I spot a speed bump I do what I can to flatten it and reduce its effect. It doesn't matter what part of the project causes the problem - it could be the naming standards, the development standards, the application design, the database design or the UI design - anything which slows down the developer will be put under the microscope, dissected and reassembled without the problem.

This is a technique which I have used on all my projects, and I have regularly made little savings here and little savings there. Sometimes what used to be a laborious piece of coding can be converted into a simple subroutine. Sometimes whole swathes of code can be replaced by several subroutines. Sometimes a collection of subroutines can be merged into a full-blown framework. I don't have to write any SQL statements anymore because my framework does it for me, but that is only possible because I tweaked my design until it was possible. My framework can do lots of things automatically simply because I kept tweaking my design until it was possible.

Lots of little savings soon accumulate into big savings.

Good design pays off, bad design costs both in initial development and ongoing maintenance.

References


© Tony Marston
23rd September 2012

http://www.tonymarston.net
http://www.radicore.org

Amendment history:

09 Nov 2012 Updated list of common mistakes to contain Too many indexes and Columns containing compound values following feedback from the comp.databases.mysql newsgroup.

counter