(Integrity) Constraints in a datawarehouse

24 06 2009

In data warehouse land it is not very common to see constraints in the database. I never felt very comfortable with that, but until now I did not get around to analysing why I felt that way. Until I read this article by Tom Kyte. In the article Tom Kyte shows that the CBO (Cost Based Optimizer) can profit from the information that is derived from the presence of constraints by generating better query plans. Better in this case is defined as ‘producing result sets faster’. The examples in the article are not exactly ‘real world’ data warehouse examples. Following Tom Kyte’s line of reasoning I do agree that constraints are capable of improving the performance of queries.

The reasons for not having constraints in a data warehouse are along the lines of ‘I have checked the integrity when I did my ETL, so why would I need constraints to confirm that? And besides, constraints would only delay my ETL because they have to be checked before they are really enabled’. I see a couple of flaws in this reasoning:

  • I suspect that most constraints in a data warehouse cannot be enabled when actually applied. The quality of the ETL might be good, but is it just as good as a constraint would be? I think not.
  • Enabling constraints might take time, but how often do you have to check constraints? Only when doing the ETL, of course. I hope that in your DWH, doing ETL will be during a small part of the time your DWH is being used. Otherwise your DWH will have a problem. The rest of the time your DWH will be used for querying and Tom Kyte just showed that querying can be sped up by applying constraints.

Summing up my pros and cons of applying constraints.

Pro:

  • it will improve the data quality of the DWH
  • it can speed up the queries in your DWH (querying it is the purpose of your DWH anyway)

Con:

  • it will take more time to do your ETL (which is only a means to create your DWH)

My conclusion is that I wil try to incorporate as many constraints as possible in my next DWH. It also means I will have to be smart enough to enable the constraints at just the right moment during my ETL to have an acceptable loading performance.


Actions

Information

4 responses

24 06 2009
Pete Scott

Don’t forget that RELY constraints can also be used with Oracle – you tell the CBO to “trust me, the rule is valid for this data” – of course if you can’t trust the data then that’s another thing 🙂

The other thing is where NOT NULL is going to be valid for the column, then tell the CBO by putting in the NOT NULL constraint

24 06 2009
ebuddelm

Thanks for the additions, Pete.
RELY is a very useful option when enabling a constraint. Personally, I think it would be wise to really check the constraints now and then. For instance by temporarily creating a second constraint with the same syntax under another constraint name and see if that becomes valid. Something to do in the spare time if there is some. But who is wise all the time?

I put the ‘Integrity’ between brackets because indeed not null constraints can also improve the results of the CBO. It is something Tom Kyte mentioned explicitly in his article. I, on the other hand, left it to be read in between the lines. You made it more explicit with your comment.

17 02 2010
king4aday

We at our company use foreign key constraints with the following options:
DISABLE NOVALIDATE RELY
This gives the best performance, and lowest downtime.

17 02 2010
Buddel

Interesting, disable novalidate rely. Have you checked the CBO is still using the foreign key when coming up with a plan when they are in this state? Currently I have no database available to test this. Just hoping you have tested it, and I can profit 🙂

Leave a comment