Tony Marston's Blog
About software development, PHP and OOP

The Case Against Intelligent Databases

By Tony Marston

Posted on 1st January 2012 by Tony Marston

Amended on 10th January 2012

Introduction
Opinions and Counter Opinions
Conclusion
Responses to my article
Amendment History
Comments

Introduction

This is in response to the article titled The Case for Intelligent Databases. As soon as I read this article I thought to myself "Why are you making things more complicated than they need be?" I have seen designs like these created by many programmers who are so obsessed by what they consider to be the "proper" way of doing things in the object oriented world that they are totally blind to other approaches which are, in my humble opinion, much simpler and more practical.

Before I go any further let me say that my approach to OO programming has been ridiculed by many so-called "experts" as being "the wrong way" and "impure", as you can see in the following articles:

I do not consider my approach to be wrong for the simple reason that it works, and it works very well thank you very much. It is not wrong, just different. It is based on years of experience gained in the pre-OO era, and I have simply taken methods which have proved themselves to be solid in the past and updated them for more modern technologies.

Opinions and Counter Opinions

As I read through the article I came across more and more statements which I believe are the result of muddled thinking, and which are taking the author and his readers in the wrong direction, so these statements, and my personal counter-opinions, are listed below.

While most programmers primarily use the database as a place to simply store data, this reduces a database to basically an ACID-compliant data store with some reporting capability.

Well, isn't that what it is supposed to be? The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security. Just because it is possible for it to do other things as well is no reason to say that it should do other things.

This approach saves the programmer from worrying about advanced database functionality, but it also prevents the database system from being used to do a number of useful tasks that databases can really be leveraged to do.

Just because a database has more features does not mean that you should bend over backwards to use them. The English language contains a huge number of words that I have never used, but does that make my communication with others unintelligible? I have used many programming languages, yet I have rarely used every function listed in the manual; does that mean that my programs don't work? With a relational database it is possible to use views, stored procedures, triggers or other advanced functionality, but their use is not obligatory. If I find it easier to do something in my application code, then that is where I will do it.

The fundamental idea of encapsulating a database inside a defined API is that usually the application should hit the API and not the underlying relations where the data is stored if at all possible.

From where did this idea originate? The application has to access the database at some point in time anyway, so why do people like you keep insisting that it should be done indirectly through as many intermediate layers as possible? I have been writing database applications for several decades, and the idea that when you want to read from or write to the database you should go indirectly through an intermediate component instead of directly with an SQL query just strikes me as sheer lunacy. This is a continuation of the old idea that programmers must be shielded completely from the workings of the database, mainly because SQL is not object oriented and therefore too complicated for their closed minds. If modern programmers who write for the web are supposed to know HTML, CSS and Javascript as well as their server-side language, then why is it unreasonable for a programmer who uses a database to know the standard language for getting data in and out of a database?

Many current development environments seek to save the programmer from spending heavy engineering time on a database, particularly in the early stages of a project where requirements can change. The database is relegated to a simple data store with ACID compliance.

If your programmers have to spend large amounts of time writing boilerplate code for basic read/write/update/delete operations for your database access then you are clearly not using the right tools. A proper RAD (Rapid Application Development) toolkit should take care of the basics for you, and should even ease the pain of dealing with changes to the database structure. Using an advanced toolkit it should be possible to create working read/write/update/delete transactions for a database table without having to write a single line of SQL. I have done this with my RADICORE toolkit, so why can't you? Simple JOINs can be handled automatically as well, but more complex queries need to be specified within the table class itself. I have implemented different classes for different database servers, so customers of my software can choose between MySQL, PostgreSQL, Oracle and SQL Server.
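As a simple illustration of the technique, here is a minimal sketch (my own illustration, not the actual RADICORE code; the class name and column names are invented for the example) of how a toolkit can construct an INSERT query from nothing more than table metadata, so that the developer never writes the SQL by hand. The per-DBMS variations would live in separate classes, one for each database server.

    <?php
    // A sketch of metadata-driven SQL generation (illustrative names only).
    // Given nothing more than the table's metadata, the same code can build
    // and execute an INSERT for ANY table, so no hand-written SQL is needed
    // for the basic operations.
    class TableGateway
    {
        public function __construct(
            private PDO $pdo,
            private string $tableName,
            private array $fieldNames   // column names known from the schema
        ) {}

        public function insertRecord(array $fieldarray): bool
        {
            // Discard anything which is not a real column on this table.
            $data = array_intersect_key($fieldarray, array_flip($this->fieldNames));

            $cols         = implode(', ', array_keys($data));
            $placeholders = implode(', ', array_fill(0, count($data), '?'));

            $sql = "INSERT INTO {$this->tableName} ($cols) VALUES ($placeholders)";
            return $this->pdo->prepare($sql)->execute(array_values($data));
        }
    }

    // The same class works for any table; $pdo is an existing PDO connection.
    $customer = new TableGateway($pdo, 'customer', ['customer_id', 'name', 'email']);
    $customer->insertRecord($_POST);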

Frameworks like Rails will even generate a database schema from programming data structures.

The very idea that your software data structure should be used to generate your database schema clearly shows that you consider your software structure, as produced from your implementation of Object Oriented Design (OOD), to be far superior to that of the database. This is where I consider you, and others of the same mental persuasion, to be barking up the wrong tree. In all my years of writing database applications it has been proved to me time and time again that the starting point is always the database, and the database must be properly structured using the process of normalisation. A badly designed and un-normalised database will always be the root cause of so many problems that you will encounter down the line.

While it is possible to have a software structure which is not in line with the database structure, I was taught at an early stage in my programming career that if the software and database structures are kept synchronised then a lot of problems will simply not appear. This was the basis of the Jackson Structured Programming course I attended in the 1970s. This simple idea has proved its worth over the years, so I will not abandon it without an exceptional reason. I love this idea so much that I have created my own framework where the programming structures are created from the database schema, not the other way around. I have built several applications using this framework, including an ERP package which has 450+ tables, 1,200+ relationships and 4,000+ user transactions, so I know how effective it can be.
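The following sketch shows how that direction of travel can work in practice. It is my own illustration, not the actual RADICORE data dictionary: it reads a table's column metadata from the INFORMATION_SCHEMA (MySQL is assumed here; other DBMS expose equivalent catalog views) and exports it as a skeleton class which extends an assumed GenericTable superclass, so the classes are generated from the schema and can simply be regenerated whenever the schema changes.

    <?php
    // A sketch of generating a table class FROM the database schema.
    // The database name, credentials, file path and class name are
    // invented for illustration.
    $pdo = new PDO('mysql:host=localhost;dbname=information_schema', 'user', 'password');

    $stmt = $pdo->prepare(
        'SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
           FROM COLUMNS
          WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?'
    );
    $stmt->execute(['erp', 'customer']);
    $fieldspec = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // Export the structure as a small PHP class so that the software
    // structure is always in step with the database structure.
    $code = "<?php\n"
          . "class Customer extends GenericTable\n"
          . "{\n"
          . "    protected string \$tableName = 'customer';\n"
          . "    protected array  \$fieldspec = " . var_export($fieldspec, true) . ";\n"
          . "}\n";
    file_put_contents('classes/customer.class.php', $code);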

If one goes with an intelligent design for the database, one cannot use agile programming methodologies to design the database.

You shouldn't use agile programming methodologies to design the database, nor any other methodology EXCEPT the rules of data normalisation, as documented in The Relational Data Model, Normalisation and effective Database Design. The rules of normalisation are much more scientific than those of OOD with their flimsy "is-a" and "has-a" concepts which can be implemented in so many different arbitrary ways, with their overly complex inheritance hierarchies, and you will find that given the same data requirements a collection of database designers are more liable to produce matching designs than a similar collection of OO designers.

Instead the database requires some real engineering in its own right. This takes some time and effort in its own right and can slow down development compared to ORM-based code generators such as we see in Rails.

You may think that it slows down development, but I do not. You still have to analyse an application's data requirements before you can go through the design process, but instead of going through Object Oriented design as well as database design I prefer to ignore OOD completely and spend my time in getting the database right. Once this has been done I can use my RAD toolkit to generate all my software classes, so effectively I have killed two birds with one stone. It also means that I don't have to waste my time with mock objects while I'm waiting for the database to be built as it is already there, so I can access the real thing instead of an approximation. As for using an ORM, I never have the problem that they were designed to solve, so I have no need of their solution.

Such engineering requires taking a clear look at what data is expected to be collected, what internal structure it has, designing the database storage first and foremost around that analysis, and then building an API based on how it will be used.

This is another area where OO people, in my humble opinion, make a fundamental mistake. If you build your classes around the database structure and have one class per table, and you understand how a database works, you should realise that there are only four basic operations that can be performed on a database table - Create, Read, Update and Delete (which is where the CRUD acronym comes from). In my framework this means that every class has an insertRecord(), getData(), updateRecord() and deleteRecord() method by default, which in turn means that I do not have to waste my time inventing unique method names which are tied to a particular class. Where others have a createCustomer(), createProduct(), createOrder() and createWhatever() method I have the ubiquitous insertRecord() method which can be applied to any object within my application.

Experienced database developers will also tell you that, regardless of the contents of any particular table, its data can only be accessed using one of four different queries. The rules for constructing each of those queries are fixed, which means that each query can be constructed and executed using code which is similar if not identical. This "similar" code is often known as boilerplate code, and a competent programmer should be able to reduce the amount of boilerplate code he has to write either by putting it into a subroutine or, with the advent of OOP, into an abstract class which can then be inherited multiple times.

The methods in the abstract class then become instantly available to every subclass, which means that any of those subclasses can be accessed using the same methods. This is the basis of the OO concept called polymorphism, so it should be familiar to every OO programmer. Because of this I have a single page controller for each of the methods, or combination of methods, which can be performed on a subclass, and I can reuse the same page controller on any subclass within my application. I have yet to see the same level of reuse in other application frameworks, but isn't a high level of code reuse what OOP is supposed to be about?
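The following sketch shows this pattern in miniature. It is my own illustration rather than actual RADICORE code: one abstract class supplies the four standard methods, each concrete table class inherits them, and a single page controller can therefore be reused with any table class, which is polymorphism at work.

    <?php
    // A sketch of the pattern described above (illustrative names only).
    // One abstract class supplies the four CRUD methods; every table class
    // inherits them; one page controller works with ANY table class.
    abstract class GenericTable
    {
        protected string $tableName;
        protected array  $fieldspec = [];
        protected array  $errors    = [];

        // The bodies are stubs here; in a real framework each one would
        // construct and execute the appropriate SQL query.
        public function insertRecord(array $fieldarray): array { /* INSERT */ return $fieldarray; }
        public function getData(string $where): array          { /* SELECT */ return []; }
        public function updateRecord(array $fieldarray): array { /* UPDATE */ return $fieldarray; }
        public function deleteRecord(array $fieldarray): array { /* DELETE */ return $fieldarray; }

        // Hook for secondary validation; a subclass overrides this to
        // implement its own business rules.
        protected function validateInsert(array $fieldarray): array { return $fieldarray; }

        public function getErrors(): array { return $this->errors; }
    }

    class Customer extends GenericTable { protected string $tableName = 'customer'; }
    class Product  extends GenericTable { protected string $tableName = 'product'; }

    // One "insert" page controller, reusable with any table class.
    function insertController(GenericTable $table, array $postData): array
    {
        return $table->insertRecord($postData);
    }

    insertController(new Customer(), $_POST);   // the same controller...
    insertController(new Product(), $_POST);    // ...on a different table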

I am often told "but that is not the way it is done!" which is another way of saying "it is not the way I was taught". This tells me that either the critic's education was deficient, or he has a closed mind which is not open to other techniques, other methods or other possibilities.

An early critic of my approach pointed out that if I have a separate class for each table then I must have a great amount of duplicate code. The poor dear had obviously not understood the concept of inheritance. All the code which is common to every database table is contained within an abstract class, and this is inherited by every concrete class. In fact there is so much code in the abstract class that each concrete class starts off by being very small indeed. All it contains is the database name, the table name, the table structure, and details of any relationships, both as a parent and as a child. Standard validation (which I refer to as primary validation) is carried out by the framework in order to verify, before a query is sent to the database, that all the data items are of the correct size and type; this ensures that the execution of that query will not fail. Additional (or secondary) validation is implemented by the developer inserting code into the relevant hook method, which is defined in the abstract class but which can be overridden in any subclass, so all the programmer has to do is amend each table class to deal with its own business rules.
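Building on the GenericTable sketch above, here is what a concrete table class might look like (again my own illustration; the class, column and hook-method names are invented). Note how little it contains, and how a business rule is added by overriding a hook method:

    <?php
    // A sketch of a concrete table class: nothing but its own metadata,
    // plus any business rules inserted via a hook-method override.
    class SupplierOrder extends GenericTable
    {
        protected string $tableName = 'supplier_order';
        protected array  $fieldspec = [
            'order_id' => ['type' => 'integer', 'required' => true],
            'quantity' => ['type' => 'integer', 'required' => true],
        ];

        // Secondary validation: called by the framework after the primary
        // (size and type) validation has passed, before any query is built.
        protected function validateInsert(array $fieldarray): array
        {
            if ($fieldarray['quantity'] < 1) {
                // Record the error so the framework can report it to the
                // user instead of sending an invalid query to the database.
                $this->errors['quantity'] = 'Quantity cannot be less than 1';
            }
            return $fieldarray;
        }
    }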

Another critic said that he had tried the one table-one class approach but had to abandon it because it didn't work. If his implementation did not work then it was simply because his implementation was faulty, probably because he got lost and confused in all that non-essential OO cruft such as multiple inheritance, design patterns, and all that other gaudiness which is high in complexity but low on effectiveness. My simpler implementation has been working very well since 2003, so that just reinforces my view that "simple works, complicated works harder to achieve the same result".

For simple schema development this doesn't take a long time but the complexity does grow with the data collected.

Simply adding more database tables does not make an application more complex, it just gives it more components. I can add new tables and components without increasing the complexity of any existing components. Each component is only affected by those tables that it touches, so if it doesn't touch a new table then it is not affected by its addition. I can also change the business rules within an object without having to amend other objects. This is a simple extension of the modular programming I learnt in my pre-OO days, but with OO I have the opportunity to implement that concept in different ways.

In the LedgerSMB project this is not the end of the world since we are re-engineering the database incrementally and so each major branch includes at least a few areas where old db structures have been pulled out and new, better engineered ones added.

I can even make changes to a table's structure, such as adding or deleting a column, or changing a column's size or type, without having to perform major surgery on my table class. I simply change the database, import the new structure into my data dictionary, and then export the changes to my application. I don't have to change any code unless the structural changes affect any business rules. My ERP application started off with 20 database tables but this has grown to over 400 over the years, so I am speaking from direct experience.

Other projects may find that this adds significant development time beyond what is reasonable, but these must still be weighed against the benefits.

It depends on what you call "reasonable". I have worked with applications where even a slight change to the database caused days of work, whereas with my own framework I can time my changes with a wrist watch instead of a calendar. Any application architecture which requires significantly more effort to implement database changes than I have become used to with my own I would consider to be unreasonable.

Conclusion

The idea that some OO programmers have that all their problems would be solved if only they had a "proper" database with more intelligence, such as an object oriented database into which they could just dump their objects and have the database sort out their mess, just strikes me as pure arrogance. They believe that their blinkered viewpoint is the "only" viewpoint worth having, and that everybody else is wrong. In my humble opinion what the world really needs is not more intelligent databases, but more intelligent programmers. They should learn how to design applications properly and write code properly instead of trying to offload everything to the database.

The design decisions they make may seem appropriate for their particular circumstances, but all too often they automatically assume that those same decisions are appropriate for all circumstances. I have to disagree. The circumstances for which the article The Case for Intelligent Databases was written appear to be that a single database is shared by the author's own "home" application as well as a number of third-party applications.

Attempting to implement the same business rules in the home application as well as all those third-party applications would result in duplicate code which could easily get out of sync. In these circumstances putting the business rules in a single place, the database, would seem to be a good idea.

The circumstances under which my own applications operate are totally different:

My application has to support multiple DBMS systems, so maintaining different sets of stored procedures and triggers for each DBMS would result in duplication of effort. As my application has exclusive use of my database(s) it therefore makes sense (to me at least) to maintain the business rules in a single place, which is the application. It is also the easiest and cheapest option.

So you see, it's different horses for different courses. I rest my case.

Responses to my article

Chris Travers, the author of the original article, responded in A Reply to Tony Marston's advocacy of unintelligent databases. In it he said the following:

My own belief is that a large number of database systems out there operate according to the model of one database to many applications.

That has not been my experience. I have spent all of my career in writing applications which have sole access to their databases. Having multiple applications trying to perform the same function on the same database is something which I have never encountered.

The above opinion works quite well in a narrow use case, namely where one application and only one application uses the database. In this case, the database tables can be modelled more or less directly after the application's object model.

I do things the other way round - my application's classes are built from the database schema by way of my data dictionary. If I change my schema then I regenerate the necessary application code so that they are always in step.

then the only thing the RDBMS brings to the table is some level of ad hoc reporting ...

Incorrect. I use an RDBMS as a reliable and flexible data store, with backup, recovery and replication options, not just as a reporting tool. As an object's data may be spread across multiple tables as a result of normalisation I make extensive use of JOINs in my SQL queries so that I can gather this data into a single result set. This was impossible to do with the hierarchical and network databases that I used previously.
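As an illustration of the kind of query I mean (the table and column names are invented for this example), a single SELECT with JOINs can pull an order's data out of the several tables that normalisation has spread it across:

    <?php
    // An illustrative JOIN query: normalisation has spread the order's data
    // across four tables, but one query gathers it into a single result set.
    $sql = "SELECT ord.order_id, ord.order_date,
                   cust.name        AS customer_name,
                   prod.description AS product_description,
                   line.quantity
              FROM order_line AS line
              JOIN orders     AS ord  ON ord.order_id     = line.order_id
              JOIN customer   AS cust ON cust.customer_id = ord.customer_id
              JOIN product    AS prod ON prod.product_id  = line.product_id
             WHERE line.order_id = ?";

    $stmt = $pdo->prepare($sql);        // $pdo is an existing PDO connection
    $stmt->execute([$orderId]);
    $resultSet = $stmt->fetchAll(PDO::FETCH_ASSOC);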

... at the cost of added development time and complexity compared to some NoSQL solutions.

I disagree completely. It is implementing business logic in the database which is more time consuming, difficult and expensive. And please don't muddy the waters with NoSQL databases as they are a recent phenomenon which has yet to reach full maturity and live up to its hype.

The problem is that when you start moving from single applications into enterprise systems, the calculation becomes very different. It is not uncommon to have several applications sharing a single database, and the databases often must be designed to make it quick and easy to add new applications on top of the same database.

It has been my experience that when I design and write something like an Order Entry application then that application is the only one which allows orders to be entered. I have never had to design a database that can be used by multiple applications, so all your arguments on that matter are irrelevant to me, as I suspect they would be to a lot of other programmers who only ever deal with the "one database, one application" scenario.

Chris Travers also posted a Further Reply to Tony Marston.

There was also a response from Joe Abbate in Business Logic in the database. As he describes himself as a "database technologist" it is no wonder that he is a champion of that particular cause.

In his opening message he said the following:

If a database (actually, a shortening of DBMS - combination of the data and the software that manages it) has always been dumb, then presumably one would never specify UNIQUE indexes. It is a business requirement that invoice or employee numbers be unique, so if all the business logic should reside in the application, then the DBA should only create a regular index and the application - all the applications and tools! - should enforce uniqueness.

You misunderstand me. When I said "dumb" data store I meant the basic properties of any DBMS (whether relational or not) which is the ability to specify tables, columns within tables with data types and sizes, null/not null attributes, primary keys, unique keys and indexes. More advanced features, such as foreign key constraints, column constraints, triggers, stored procedures and user-defined functions were later additions. It is the use of these "advanced" features which I regard as optional and therefore I have the right to exercise that option and not use them. All my business logic, the "intelligence" as it were, is defined within my application code, and it is my application which tells the database what to do and it is the database's job to do what it is told. It is only because this "intelligence" does not lie within the database that I describe the database as "dumb".

He also said:

Tony believes that primary keys, and unique and referential constraints should be enforced by the DBMS, but apparently an integrity constraint such as "No supplier with status less than 20 supplies any part in a quantity greater than 500" should instead only be code in an application (make that all applications that access that database).

That is precisely what I mean. I have always implemented business rules such as these in my application so that I need not generate and execute an SQL query until I know all the data is valid. If the query fails then it is a catastrophic error and my application terminates immediately. I do not need to examine the database response in order to determine what to do next - either abort, or tell the user his data was bad and needs correcting. I am not the only programmer who follows this practice - if you read the comments to the blog entry Database Constraints: Use Them or Lose Them? you will see that the general consensus is quite simple: validate the data in the application so that the user can be given intelligible error messages, and treat any constraints in the database as nothing more than a safety net.

The first of these is quite important, as the messages that are generated by database constraint violations can be horribly arcane from a developer's perspective and totally unbearable from a user's perspective. The only way to do the job properly is with application code.
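As a sketch of that practice (my own illustration; the function, table and column names are invented), the business rule is checked in application code before any SQL is constructed, so a failure of the query itself can be treated as catastrophic rather than as a user error:

    <?php
    // Validate first, execute second: the query is only run once the data
    // is known to be valid, so any failure after that point is a bug or an
    // environment fault, not something the user can correct.
    function updateStatus(PDO $pdo, int $supplierId, int $status): array
    {
        $errors = [];

        // Business rule checked in the application, not in a database constraint.
        if ($status < 0 || $status > 100) {
            $errors['status'] = 'Status must be between 0 and 100';
            return $errors;             // report back to the user for correction
        }

        $stmt = $pdo->prepare("UPDATE supplier SET status = ? WHERE supplier_id = ?");
        if (!$stmt->execute([$status, $supplierId])) {
            // Catastrophic: the data was already validated, so abort.
            trigger_error('SQL UPDATE failed: ' . implode(' ', $stmt->errorInfo()), E_USER_ERROR);
        }
        return $errors;                 // an empty array means success
    }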

Some people seem to think that if you have data validation in your application as well as in the database then you are violating the DRY principle. These people are using twisted logic - the fact that the database performs type checking as well as the application has *NEVER* been seen as a duplication of effort which should be avoided; it is a safety net which verifies that the application is generating valid queries. If a query fails then it indicates an error in the application.

When it comes to "all applications that access that database" I have never written separate applications which share a database and which have duplicate functionality. What I *DO* write are applications which are comprised of integrated subsystems which were written under the same framework where the functionality provided by one subsystem can be shared with any of the others. Thus supplier orders are only ever generated within a single subsystem, and the business rules for supplier orders, such as the one you mentioned, need only be enforced within a single subsystem within the larger application.

There was this interesting point made by Roger Hunwicks in comment #518:

The problem that I see with pushing business logic into the database, is that database development is much less efficient than java/python/ruby/php development - modern OO languages and their associated IDEs, unit test frameworks, continuous integration servers, etc. make developing and testing business logic fast (and therefore cheap). Developing the same functionality inside the database generally takes longer (and therefore costs more).

My sentiments exactly!

JMA made this remark in comment #529:

To take an extreme, one could define a quantity on hand (qoh) column as a TEXT datatype. Then it would be up to the application to validate that the qoh entered was all numeric, without decimals, positive or zero, and within a range suitable to the particular business. Of course, most sane DBAs and developers would define qoh as INTEGER in the database and int or similar in the application.

Only an idiot would take it to that extreme. I always design my databases with the correct data type for each column, but my application also checks the data is of the correct type before the SQL query is constructed and issued. If the query fails it is a catastrophic error and not a minor validation error which the user can correct.

He also said the following:

Where do you draw the line on ensuring that qoh isn't changed to a negative value or an enormous value for a particular item? If your answer is "only in the application", I contend you'll be surprised when a user who has been given access via ODBC/JDBC spreadsheet updates qoh to -1 or 99999, because he wants to attribute some "special" meaning to those values.

Wrong again. If any user manipulates values in a spreadsheet then that spreadsheet is uploaded onto my server as a CSV file and processed by a component within my application, thus ensuring that the business rules within my application code are not bypassed.

In comment #531 JMA also said the following:

Yes, of course it's a ridiculous suggestion. I did say it was extreme, like my earlier mention of not defining unique keys. The point I'm trying to make is that those are business requirements, those are the semantics of the data from a business perspective, and there are gradations of specifying and checking those requirements.

Then your definition of "business requirement/rule" is different from mine. It is not a "business rule" that dates are stored as dates, integers are stored as integers, decimals are stored as decimals, etc. That is plain common sense and basic data analysis. A "business rule" is something which goes beyond a column's data type, such as "this value cannot be less than 1" or "this date cannot be less than that date". It is this type of business rule which has traditionally been implemented within the application and not the database.

Roger Hunwicks in comment #588 said the following:

... there is value in enforcing data integrity in the database. To me the only question is how much value, for what cost? I.e. we need to consider the cost-benefit analysis of putting data integrity rules inside the database. Writing trigger code to enforce business rules is generally less efficient than writing application code.

Which is another reason why I prefer the cost-effective method of implementing business rules within my application and not the database.

In comment #525 Richard Gibson said the following:

His articles appear to be adverts for his PHP framework which doesn't appear to accommodate complex table designs or where data is pulled from multiple tables by a single object (as far as I can tell), which is odd given his push for normalisation (which is good). It looks very limiting.

Such limitations do not exist in my framework. In my ERP package I have implemented several of the designs contained in Len Silverston's Data Model Resource Book, and they can be quite complex. As for pulling data from multiple tables within a single object, I use SQL JOINs just like everyone else.

In comment #544 Richard Gibson asked the following:

How can his PHP framework enforce single access to the background data store? It can't, unless it is using a single connection to the data store, which is useless for concurrent users.

My application is running on a server which contains no other applications, and the database can only be accessed from 'localhost'. Access to the database management tool is not given out to dumb users, so dumb users can only get to the data through my application. I am *NOT* limited to a single connection and I *DO* support multiple concurrent users.

In comment #546 Robert Young began to turn towards the ridiculous with the following:

Considering that you're championing the failed COBOL/VSAM/IMS paradigm that Dr. Codd put a stake in, why aren't you the one making such comments?

I am most definitely *NOT* advocating a return to the COBOL/VSAM/IMS paradigm. I am simply questioning the argument that because you *CAN* implement business rules in the database then you *MUST*. I don't have to, and I choose not to. I put my business rules in my application code because that is the most tried, tested and cost-effective method.

In comment #549 he followed with this statement:

Lots of folks, take Struts as an example, bifurcate classes between function and data (they give the classes names like FooAction and FooData) which is semantically identical to COBOL/VSAM (or FORTRAN or Flow-Matic); one can stamp one's foot and claim the mantel of OOD, but it's just a lie. It just is. This bifurcated mentality then infiltrates the rest of the application code, and it is just a bunch of "Action" classes (the functions) and a bunch of "Data" classes (the data): that's what your grand daddy wrote back in 1965. Admit it. If you want to see real OOD/OOP in action, read Allen Holub.

That may be how YOU write code, but I certainly don't. To me the term encapsulation means "the act of placing data and the operations that are performed on that data in the same class", and that is precisely what I do. My framework is built around the 3 Tier Architecture, and every class in the business layer encapsulates both the data and the operations (which includes the business rules) for each database table. I certainly do *NOT* have "actions" and "data" in separate classes.

A separate issue from whether your paradigm/framework is relational.

"Relational" applies to a database, not the application software which accesses that database.

Your data is still flatfile-in-engine.

The data in my applications exists in a relational database, not flat files, so stop making such stupid claims.

But, pretty clearly, your "schema" is very little different from a flatfile, and certainly not what Dr. Codd designed.

I design my databases according to relational theory, according to the rules of normalisation, and implement those designs in a variety of relational databases. Only an idiot such as you would claim that they are still no more than flat files.

And who ever said that PHP was OO??

It *IS* by virtue of the fact that it supports (and has done since version 4) the concepts of encapsulation, inheritance and polymorphism, and regardless of what you or anyone else thinks those are the *ONLY* factors which decide if a language is OO or not.

Amendment History

10 Jan 2012 Added Responses to my article.
