Tony Marston's Blog About software development, PHP and OOP

How NOT to Validate Data

Posted on 1st April 2016 by Tony Marston
Introduction
You must initialise an object through its constructor
You must use setters for optional parameters
You could create a validate() method ...
... but this would be dangerous
Invalid objects
Can the Profile class run the validator?
Put each validation rule into its own class
Data Validation and Business Rules are different
Build lots of small objects
Summary
References
Comments

Introduction

I recently came across an article called How to Validate Data which gives advice to PHP programmers on what is supposed to be "the right way" to perform data validation in their PHP code. As a developer who has been writing database applications in several different languages for several decades, and who has built a development framework for building such applications which totally automates the data validation process, I consider this article to be yet another failure in that it gives bad advice. The whole idea behind OOP is supposed to be as follows:

Object Oriented Programming is programming which is oriented around objects, thus taking advantage of Encapsulation, Inheritance and Polymorphism to increase code reuse and decrease code maintenance.

The article fails to fulfil this objective simply because it requires too much hand-written and non-reusable code. This is probably a follow-on from the fact that the author's ideas regarding encapsulation and the creation of classes and objects was flawed from the very start, more than likely because of a dogmatic approach to OOP which has been passed down by other members of the "paradigm police". I do not follow the established view, instead I am a non-conformist, a heretic and proud of it. I design and build enterprise applications which use large numbers of database tables, and my "alternative" approach has allowed me to automate the process of data validation to such a degree that I don't have to write any code at all to perform primary data validation. As far as I am concerned any programmer who cannot do the same is doing it wrong.

If you are not yet aware of my heretical views you might like to read the following articles:

In the article it shows how the class for Profile is defined, and another piece of code which instantiates that class into an object so that its methods can be executed. In my application framework, which is a combination of both the 3-Tier Architecture and the Model-View-Controller design pattern, I refer to these classes as the Model and the code which does the instantiating and method calling as the Controller.

Simply creating classes with lots of methods will often not produce the most efficient or maintainable code. Just as the data in a database is not stored in a single table, neither should the code for an application be stored in a single class. The application should be broken down into smaller classes, but this means that you will end up will classes/objects that call other classes/objects, and this will create dependencies between those objects. If object "A" calls a method on object "B" then there is a dependency between "A" and "B" as "A" is dependent on "B" to perform its function. When creating classes in an application you should aim to achieve a balance between cohesion and coupling with the target being a combination of high cohesion and low coupling

I will now examine every step in that article and explain why I belive that the underlying theories are wrong, and show how my alternative approach produces better results. I am a pragmatic programmer who is results-oriented, not a dogmatic programmer who is rules-oriented, so I place more importance in the results that I achieve and not the rules that I follow to achieve them.

You must initialise an object through its constructor

The article starts with the following statement:

Let us assume we need a profile which holds some user-related data. We will start small, and ignore the validation in the first step. Ideally, we can initialize an object through its constructor:
<?php
class  Profile
{
    private  $firstName ;
    private  $lastName ;
    private  $email ;
    private  $nickname ;
    private  $homepage ;

    public  function  __construct (
        $firstName ,
        $lastName ,
        $email ,
        $nickname ,
        $homepage
    )
    {
        $this -> firstName  =  $firstName ;
        $this -> lastName   =  $lastName ;
        $this -> email      =  $email ;
        $this -> nickname   =  $nickname ;
        $this -> homepage   =  $homepage ;
    }

    // ...
}

$profile  =  new  Profile (
    'John' ,
    'Doe' ,
    'user@example.com' ,
    'johnny' ,
    'http://example.com'
) ; 

I disagree with this approach for the simple reason that there is no rule which says that I must instantiate an object and at the same time populate it with data through the class constructor. As far as I am concerned these are two separate operations for the simple reason that the data can come from two possible sources - the user or the database. The above code is also a shining example of tight coupling, which is supposed to be avoided, because of the following:

This is how I would do it. Step #1 - Define the Profile class:

<?php
require_once 'std.table.class.inc';
class profile extends Default_Table
{
    function profile ()
    {
        $this->dirname     = dirname(__file__);
        $this->dbname      = 'foobar';
        $this->tablename   = 'profile';
        $this->fieldspec = $this->getFieldSpec_original();
        
    } // profile
    
} // end class
?>

Notice the following:

Step #2 - Define the controller to access the Profile class:

script: std.add1.inc
<?php
require "classes/$table_id.class.inc";
$dbobject = new $table_id;
$fieldarray = $dbobject->insertRecord($_POST);
if ($dbobject->errors) {
    $errors = $dbobject->errors;
    $dbobject->rollback();
} else {
    $dbobject->commit();
} // if
?>

If I want to read data from the database before displaying it the screen I would use code similar to the following:

script: std.enquire1.inc
<?php
require "classes/$table_id.class.inc";
$dbobject = new $table_id;
$fieldarray = $dbobject->getData($where);
?>

Notice here that this code does NOT specify any class name. Instead it uses a variable $table_id which was passed down from the previous component script which looks like the following:

script: profile(add1).php
<?php
$table_id = "profile";                      // identify the Model
$screen   = 'profile.detail.screen.inc';    // identify the View
require 'std.add1.inc';                     // activate the Controller
?>

Notice the following:

This approach clearly demonstrates loose coupling, thus making the controller completely reusable.

You must use setters for optional parameters

The article continues with the following statement:

To work around the optional constructor parameters, we can create setters for all optional parameters:
<?php
class  Profile
{
    // ...

    public  function  __construct ( $lastName )
    {
        $this -> lastName  =  $lastName ;
    }

    public  function  setFirstName ( $firstName )
    {
        $this -> firstName  =  $firstName ;
    }

    public  function  setEmail ( $email )
    {
        $this -> email  =  $email ;
    }

    // ...
} 

As I stated previously I do not use separate class variables for each table column, and I do not use getters and setters for each table column. Instead I pass the data around as a single array which can contain data for any number of rows and any number of columns. If the array does not contain data that the object needs then it will generate an error. If the array contains data that the object does not need then it will be ignored. This means that the amount of data which I pass into an object is determined by the contents of this array, and I can alter the volume of data without having to change the number of setters which I call.

You could create a validate() method ...

The article contains the following statement:

Now when and where do we validate? We could create a validate() method that returns an array or collection of error messages if the validation has failed. I have seen this approach quite often:
<?php
class  Profile
{
    // ...

    public  function  validate ( )
    {
        $errors  =  [ ] ;

        // add error if last name is not empty
        // add error if email address is invalid
        // ...

        return  $errors ;
    }

    // ...
} 

Believe it or not I actually agree with this approach, but rather than having to manually build a separate method within each class I have taken it further by generating a single and reusable validateInsert() method which exists within my abstract table class:

script: std.table.class.inc
<?php
class  Default_Table
{
    // ...

    function  validateInsert ($fieldarray, $fieldspecs )
    {
        $errors = array();

        foreach ($fieldspecs as $fieldname => $specs) {
            // compare contents of $fieldarray[$fieldname] with specifications contained in $specs
            // and add to $errors if any check fails
        } // foreach
			
        $this->errors = $errors;

        return  $fieldarray;
    }

    // ...
} 

Note the following:

This approach makes it easy to step through all the items in the array and to check that the value for each field conforms to its specifications. It does not use exceptions as this would terminate the validation process on the first error, and there could be any number of errors. Note that I have a validateInsert() method which starts with the $fieldspecs argument so that it can detect when a required field is missing from $fieldarray. I have a separate validateUpdate() method which starts with the $fieldarray argument as it only needs to validate data which is actually present - a required field which is not in the array does not cause an error because it is not being changed.

... but this would be dangerous

The article then states that this approach would be dangerous:

This approach is extremely dangerous: before the validate() method has been called, you cannot tell whether the object is in a valid or invalid state. Even worse: the validate() method might never be called. Plus, the object has setters, so its state might change after validation.

The article goes on to show how the data is inserted via the constructor, then the validation method is called, followed by another call which then inserts invalid data.

This approach is only dangerous when the controller calls separate methods to insert the data, validate it, then store it in the database. The solution is not to call separate methods - the controller calls the insertRecord() method which itself calls the validateInsert() method. If the validation passes it then calls another method to write the data to the database, otherwise it returns back to the controller with messages in the $errors array. This idea is shown in this code sample, with the full details explained in this UML diagram. In other words, the call to the insertRecord() method will either update the database or return with errors. It is simply not possible to insert data into an object and have it written to the database without it passing through the validation routine.

For those of you who struggle to understand simple concepts when explained in words, let me draw you a diagram. The approach which is preferred by the author in that article has multiple operations between the Controller and the Model, as in the following:

Table 1 - the purist approach to inserting data
Controller Model class DML class
1 call--> load data using setters (one call per column)
2 call--> validate data
3 call--> store data ->insertRecord()

Here there are three method calls from the Controller to the Model. The first step will actually involve a separate setter for each item of data. The second step will return an array of validation errors, and the Controller will only activate the third step if there are no errors. It is possible to perform steps 1 and 3 without step 2 which could result in ether corrupt data or termination caused by invalid SQL. It is also possible to load more data after step 2 has been performed with the same effect.

Table 2 - my heretical approach to inserting data
Controller Model class DML class
1 call--> insertRecord($_POST) (loads all data in one go)
2 validate data
3 store data ->insertRecord()

Here the Controller will only make a single call on the Model where it passes all the data in a single array instead of having a separate setter for each column. The Model will automatically call the validation method, and if there are errors it will return to the Controller immediately and skip over the following steps. If there are no errors then it will automatically activate the insertRecord() method on the database object and update the database. Here there are fewer calls between the Controller and the Model, and there is no opportunity to load in additional data between the validation step and the database update. There is less code and fewer opportunities for programming errors, so which approach should be regarded as "better"?

Invalid objects

The article then contains the following statement:

The main problem with this approach is that we allow an object to enter an invalid state in the first place. This is a deadly sin, because it forces us back into procedural programming as we cannot pass around object references safely.

It must not be possible for an object to enter an invalid state.

The rule that it must not be possible for an object to enter an invalid state simply does not exist in my universe. There is only one way in which an object can be said to be "invalid":

An invalid object is one that cannot respond meaningfully to any of its public operations.

It should be obvious therefore that the "validity" of an object applies to the instance itself and not to the data that it may or may not contain. It is wrong to say that a valid object is one that contains valid data as this implies that the data must be validated outside of the object before it is inserted into it. Why can't the data validation be performed inside the object itself? If it is supposed to be performed externally then this implies that the validation rules are also held externally, and if information which is vital to the processing of an object does not exist within the object itself then surely this breaks encapsulation? Using a public method other than the constructor is a perfectly valid mechanism of populating the object with data, and that data may come from either the user or the database. When working with a database there is only one rule:

Data for each column must be validated against that column's definition before it is written to the database otherwise the operation will fail.

When each table class has an insertRecord() method which acts as described above, the data will not be written to the database unless it passes all its validation. If there are any validation errors then the object will immediately exit back to the controller so that it can inform the user about the errors and wait for fresh data. It is also possible to use a method such as getData() to obtain data from the database, and any data which already exists in the database has already been validated.

His remark it forces us back into procedural programming I consider to be totally bogus. If I am programming with objects then my program IS object oriented. The only way that any program can be described as being "procedural" (i.e. not object oriented) is when it does NOT involve the use of objects.

The article goes on to say the following:

We need to be able to pass around references to it. And if we do not know whether we hold a reference to a valid or an invalid object, we cannot rely on the object. Even if we re-validate the object whenever we work with it, what should we do with the error messages that we get back? Those error messages exist to provide feedback to the user, and somewhere deep down inside our object graph, we cannot even pass those messages back to the user.

As for passing around references to potentially invalid objects, I never have this problem simply because I never pass around references to objects, only data. Provided that this data gets validated before it is written to the database there is never a problem. As for the problem with returning errors messages from deep within your object graph, the solution is simple - do what I do and don't have a deep object graph.

Can the Profile class run the validator?

The article then asks the question:

Can we fix this problem by having the Profile object itself run the validator?

It then shows some sample code where the validator is run within the class constructor. This approach will not work because the constructor is not allowed to return anything (except SUCCESS or FAILURE), so any error messages would be lost. My approach does not suffer from this problem as I do NOT insert data into an object via its constructor. I use a separate insertRecord() method which can return any number of values. This means that I can run the validator within the insertRecord() method and pass back as many error messages as I want.

The article then tries a different approach where the validate() method is called in every setter, but this idea is dismissed as the entire validate() is run again with each setter, but still cannot be run for values which are set in the constructor. It also has the problem that the first validation error will terminate the input sequence, which means that if there are multiple errors only the first error message will be returned.

The next approach is to split the validate() method into separate methods for each field, and to call the relevant method within each setter. The problem with this is that the setters for each column within each class would need to be modified to perform the relevant validation. Special validation, such as for email addresses, would require having that code duplicated within each class that had an email address. This would be code duplication, which is supposed to be a bad thing.

Put each validation rule into its own class

The article goes on to say the following:

This works nicely, but indeed means code duplication, because we will have to copy this method to every other object that needs to validate an email address. But wait - why does a profile object validate an email address in the first place? We could move this code into a separate object. Why not call it EmailAddress?
<?php
class  Profile
{
    // ...

    public  function  setEmail ( EmailAddress  $email )
    {
        $this -> email  =  $email ;
    }

    // ...
}

class  EmailAddress
{
    private  $email ;

    public  function  __construct ( $email )
    {
        $this -> ensureEmailAddressIsValid ( $email ) ;

        $this -> email  =  $email ;
    }

    private  function  ensureEmailAddressIsValid ( $email )
    {
        // throw exception when email address is invalid
    }

    // ...
} 

This is followed with:

It does pay off to create small objects that encapsulate business rules. They are meaningful from a business perspective. They are easy to reuse. And the help to prevent code duplication.

Although the article says that such objects are easy to reuse, it completely ignores the fact that each class would still have to be manually coded to call the relevant object. The idea of having to manually write code to validate each field in each database table is something which I got rid of over a decade ago as shown in this code sample.

Data Validation and Business Rules are different

The article contains the following statement:

From the viewpoint of a business object, the concept of validation does not really exist. Business objects (objects representing things that matter to your business) enforce business rules. They do not validate data.

This implies that data validation and business rules should be processed in different objects. I completely disagree. At the start of a project the client identifies requirements which may be dealt with either by constructing the database in a particular way, or by writing code within the application. It does not matter that the former can be called "data validation" and the latter can be called "business rules" as they both arise out of user requirements. In a database application each business entity is represented as a table in a database, and the class which is built for each table is responsible for handing all the operations which can be performed on that table. These operations should also ensure that all user requirements, whether they be called "data validation" or "business rules", are satisfied before the data is written to the database. Just because they can be given different names does not mean that they have to be processed in different objects.

Build lots of small objects

The article contains the following statement:

So it turns out that code duplication is not a real issue if we build small and meaningful objects. They are reusable, thus avoiding the duplicate code. By the way: the EmailAddress object in the example above is a so-called value object.

For our example, think about objects such as Name (which might be composed of a FirstName and a LastName object), or a Homepage object (which might make use of a more generic URL object). Basically, you can (and should) create an object for everything that is meaningful from a business perspective, at least everything that has rules attached to it.

No matter where data comes from, either from the user via the $_POST array, or from the database as an array of columns, each value in the array is a primitive and not an object. Any effort spent in writing code to convert each primitive into an object would, in my humble opinion, be completely wasted. Not only would it add to the development times, but the instantiation of all those objects and subsequent method calls would require more processing at runtime, which means that the application would be slower. It would also make reading the code more difficult as the processing flow would require far too many jumps to other objects to execute a few lines of code instead of executing those few lines of code in situ.

Summary

The author of the article produces code which is less than optimum (a polite way of saying "crap") simply because he is trying to write code in a way than satisfies a bunch of artificial rules. Those rules were not carved in stone and handed down from the mountain top by some supreme being, they were scratched on the lavatory wall by some minor nonentities. Those rules do not exist in my universe, so I do not have to be constrained by them. Consequently I can write code which has more reusability and requires less developer effort in order to get things done. As "getting things done" by producing code which is cost-effective and as efficient as possible is supposed to be the name of the game, any complaints from the "paradigm police" that I am doing things wrong are simply ignored. I can write better code by not following those rules, therefore those rules are crap. Among the "rules" which I do not obey are:

  1. There is no rule which says that I cannot have a separate class for each database table.

    Each software object interacts with an external entity, so each class should encapsulate the methods and properties of one of these entities. However, when writing a database application the code does not interact with external entities such as Customer, Supplier, Product, Order, Invoice, Inventory or Shipment, it interacts instead with tables in a database. Each table does nothing but hold the data for an entity which is essential for the application to fulfil its purpose, and this data may be a small subset of the total which is available. Regardless of the operations which may be performed on a real world object, for a database table the only operations which can be performed are Create, Read, Update and Delete. So when I am writing a database application it makes sense to me to use objects which represent database tables instead of objects which don't. That is why I don't waste my time with OOD as it is incompatible with database design.

  2. There is no rule which says that the software must not know about the database structure.

    Too many programmers design their objects with total disregard to how the database is constructed, and refuse to learn SQL so that they can interact with the database in the most efficient manner. Instead they employ that abomination called an Object-Relational Mapper (ORM) and write code which is totally ignorant of the database structure. This is absurd as the application MUST know the structure of the database so that it can construct the relevant SQL queries, and trying to get your ORM to indirectly deal with a particular database structure is much harder than dealing directly with that structure yourself. In my humble opinion a database programmer who does not now SQL is as useless as a web programmer who does not know HTML. Even Robert C. Martin in this article thinks that programmers should know how to use SQL more efficiently.

  3. There is no rule which says that I must populate an object with data using the constructor.

    Each table in the database has its own object, and data for that object can come from two sources:

    I use the constructor to do nothing but construct an instance, then use one of several alternative methods to populate it with data, thus avoiding any problems that trying to do both in the constructor would generate.

  4. There is no rule which says that I must have a separate class variable for each column in the database.

    The SQL language uses sets of data which may contain any number of columns from any number of rows, so I prefer to pass this data around as a single array. It is just as easy to access data as $this->fieldarray['varname'] as it is with $this->varname, so there is no loss of functionality.

  5. There is no rule which says that I must access column data individually using getters and setters.

    I can put data into an object as an argument on any method, not just a setter. This is why I use such methods as insertRecord($array) and updateRecord($array). This means that I do not have to use getters and setters to access an object's data one column at a time, which in turn means that I can make any changes to the columns in a table without having to change any getters and setters. This then makes my code as loosely coupled as it can possibly be, which is supposed to be a good thing.

  6. This is no rule which says that an object cannot contain invalid data.

    While it is true that a constructor should not leave the object in an invalid state, here the word "state" means "condition" and not "data". A constructor MUST leave an object in a valid "condition" so that it can respond to calls on any of its public methods. The fact that I may insert invalid data into an object is NOT an issue. The ONLY rule with database applications is that I should not write invalid data to the database, so provided that AFTER the data is inserted into an object but BEFORE it is written to the database it is validated then this rule is obeyed. Any attempt to validate the data OUTSIDE of the object would be clearly wrong as it would an example of low cohesion, which is supposed to be bad.

  7. There is no rule which says that a controller must use separate methods to load, validate and store data.

    Using one method to load the data, a second to validate it, then a third to store it in the database is not an approach that I have ever used, even though I have seen it quite often. I have found if much more convenient to perform all three steps by calling a single insertRecord() method which automatically performs the validation and only updates the database if there are no validation errors.

  8. There is no rule which says that data validation rules must be hand-coded within each class.

    Any experienced database programmer will tell you that validating data to ensure that it matches the column specifications for that particular table follows a standard pattern, and because it follows a standard pattern it should be able to be automated. When a database table is constructed all the column/field specifications are defined with the CREATE TABLE script, and this data can easily be extracted from the INFORMATION_SCHEMA and made available to the software. In my framework this metadata is made available to each Model object in the $fieldspec array, and all primary validation is performed automatically by calling the generic validation class.

  9. There is no rule which says that "data validation" and "business rules" should be processed in different objects.

    Application requirements or "rules" can be processed in different ways, either in the way that the database is constructed, or by writing code in the application. How the rules for a particular entity can be satisfied is irrelevant as the definition of encapsulation is that ALL methods and ALL properties for an entity are encapsulated in the SAME class. Splitting the code across several objects would therefore violate encapsulation which would not be a good thing.

  10. There is no rule which says that validation errors should throw exceptions.

    This is a complete abuse of exceptions as they were designed to deal with unexpected and rare errors, and once an object throws an exception it is terminated and control returned to the calling object. Validation errors should not be dealt with in this way for the following reasons:

    In my framework when error messages are created they are simply added to the $this->errors array, which is indexed by field name. This array can contain any number of errors. After the validation process has completed the subsequent database update is not performed unless this array is empty.

  11. There is no rule which says that once data has been loaded into an object it cannot be changed.

    In every database application that I have written it has been a common scenario to have a function which reads data from the database so that it can be displayed to the user, amended, then the changes stored in the database. The whole idea of immutable objects is alien to me.

  12. There is no rule which says that class variables should be constructed as a value objects.

    All data which goes into or comes out of an object does so in the form of an associative array which contains values for any number of columns. Each of these values is a primitive and not an object. It would require too much effort to convert these primitives into objects, and it would also make the application run slower, so I'm not going to bother.

References

The following articles describe aspects of my framework:

The following articles express my heretical views on the topic of OOP:

These are reasons why I consider some ideas to be complete rubbish:

Here are my views on changes to the PHP language and Backwards Compatibility:

The following are responses to criticisms of my methods:

Here are some miscellaneous articles:


counter