(Integrity) Constraints in a data warehouse

24 06 2009

In data warehouse land it is not very common to see constraints in the database. I never felt very comfortable with that, but I never got around to analysing why I felt that way, until I read this article by Tom Kyte. In the article Tom Kyte shows that the CBO (Cost Based Optimizer) can profit from the information derived from the presence of constraints by generating better query plans, where ‘better’ means ‘producing result sets faster’. The examples in the article are not exactly ‘real world’ data warehouse examples, but following Tom Kyte’s line of reasoning I do agree that constraints are capable of improving the performance of queries.

The reasons for not having constraints in a data warehouse usually run along the lines of ‘I have checked the integrity when I did my ETL, so why would I need constraints to confirm that? And besides, constraints would only delay my ETL because they have to be checked before they are really enabled’. I see a couple of flaws in this reasoning:

  • I suspect that most constraints in a data warehouse would not pass validation if you actually tried to enable them. The quality of the ETL might be good, but is it just as good as a constraint would be? I think not.
  • Enabling constraints takes time, but how often do you have to check them? Only when doing the ETL, of course. In a healthy DWH the ETL runs during only a small part of the time the DWH is in use; otherwise your DWH has a problem. The rest of the time the DWH is being queried, and Tom Kyte just showed that querying can be sped up by constraints (see the sketch after this list).
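
To make this concrete, here is a minimal sketch of the ‘metadata only’ style of constraint that Tom Kyte’s argument enables. All table and column names are invented, and it assumes QUERY_REWRITE_INTEGRITY is set to TRUSTED so the CBO is allowed to trust RELY constraints:

    -- Hypothetical star-schema fragment; all names are invented.
    -- RELY DISABLE NOVALIDATE adds the constraint as pure metadata:
    -- nothing is checked, but a CBO running with
    -- QUERY_REWRITE_INTEGRITY = TRUSTED may rely on it.
    ALTER TABLE dim_customer ADD CONSTRAINT dim_customer_pk
      PRIMARY KEY (customer_id) RELY DISABLE NOVALIDATE;

    ALTER TABLE fact_sales ADD CONSTRAINT fact_sales_cust_fk
      FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
      RELY DISABLE NOVALIDATE;

    -- With the trusted PK/FK pair in place the CBO can, for example,
    -- eliminate the join to dim_customer when a query touches only
    -- fact columns:
    SELECT SUM(s.amount)
    FROM   fact_sales s, dim_customer c
    WHERE  c.customer_id = s.customer_id;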

Summing up, my pros and cons of applying constraints:

Pro:

  • it will improve the data quality of the DWH
  • it can speed up the queries in your DWH (querying it is the purpose of your DWH anyway)

Con:

  • it will take more time to do your ETL (which is only a means to create your DWH)

My conclusion is that I will try to incorporate as many constraints as possible in my next DWH. It also means I will have to be smart enough to enable the constraints at just the right moment during my ETL to keep the loading performance acceptable.
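
For what it is worth, a sketch of what ‘just the right moment’ might look like. The names are invented, and whether ENABLE NOVALIDATE is good enough or a full ENABLE VALIDATE is needed depends on how much you trust your ETL:

    -- Hypothetical load cycle; constraint and table names are invented.
    ALTER TABLE fact_sales MODIFY CONSTRAINT fact_sales_cust_fk DISABLE;

    -- ... bulk load into fact_sales ...

    -- Option 1, cheap: trust the ETL for existing rows, police only
    -- future DML, and keep RELY so the CBO can still use the constraint.
    ALTER TABLE fact_sales MODIFY CONSTRAINT fact_sales_cust_fk
      RELY ENABLE NOVALIDATE;

    -- Option 2, expensive but certain: validate every existing row too.
    ALTER TABLE fact_sales MODIFY CONSTRAINT fact_sales_cust_fk
      ENABLE VALIDATE;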





Little known but useful facts about Oracle

31 03 2009

Laurent Schneider has a blog that I follow; it is full of tiny facts that you think you know the details of. But when you look closely, the details are somewhat different from what you thought they were.

One such gem today: 1 <> 1, and another from about two weeks ago: commit suicide.





UKOUG: My schedule and, gradually, some notes

3 12 2008

Monday, December 1st

Keynote    ‘Oracle Corporate and Product Strategy Update’, David Callaghan, Oracle

As keynotes go, general and fluffy. Main point: Oracle is (trying to be) complete, open and integrated.

Beginners’ Guide to Partitioning, Mr Jonathan Lewis, JL Computer Consultancy
Indeed a beginners’ guide. But Mr Lewis could not help himself and put in some valuable advice regarding performance aspects, such as:

  • There is a penalty to partitioned indexes: a lookup may have to visit every index partition, which is not the behaviour most people expect
  • Bind variables may make it difficult for the CBO to do partition pruning; dates can cause the same problem, so make sure you use 4-digit years to prevent the optimizer from thinking you are talking about the first century (see the sketch below)
  • New to me, but probably not to others: Bloom filters (in 11g, maybe already in 10gR2?); something to investigate
  • Partition exchange when a global (unique) index is present will require checking of the index, even when using the NOVALIDATE clause.

And a teaser/challenge to check whether it is possible to solve the performance problem of “where ‘2007-01-01’ between start_date and end_date” with a range-range composite partitioned table.
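
To illustrate the 4-digit-year point from the list above, a minimal sketch with an invented range-partitioned table:

    -- Invented example of a table range partitioned by date.
    CREATE TABLE sales (
      sale_date DATE,
      amount    NUMBER
    )
    PARTITION BY RANGE (sale_date) (
      PARTITION p2006 VALUES LESS THAN (DATE '2007-01-01'),
      PARTITION p2007 VALUES LESS THAN (DATE '2008-01-01')
    );

    -- An unambiguous 4-digit-year literal lets the CBO prune to p2007.
    SELECT SUM(amount)
    FROM   sales
    WHERE  sale_date >= DATE '2007-01-01'
    AND    sale_date <  DATE '2008-01-01';

    -- A sloppy 2-digit-year literal such as TO_DATE('01-01-07') leans
    -- on NLS settings and, with a YY mask, lands in the year 7 AD,
    -- leaving the optimizer unable to prune sensibly.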

Oracle Business Intelligence Road Map and Strategy, Mr Nick Tuson, Oracle

General introduction of where Oracle thinks the place of BI is: it is a key supporting part of Enterprise Performance Management (EPM), and EPM is supported by virtually all Oracle products, not only the BI products. Oracle uses something they have called the Common Enterprise Information Model (CEIM). I will try to find a reference or white paper for it later.

BI applications have a new component, which will be the integration of Oracle Data Integrator (ODI, formerly Sunopsis).

Lots of new and changed interfaces: a personalised start page, ideas borrowed from server technology, and Fusion Middleware integration.

Probably a useful presentation if you are ‘into’ OBIEE, Hyperion and the like. Unfortunately I am not.

Anatomy of Data Breaches: How They Occur and How to Protect Against Them, Ms Tammy Bednar, Oracle

Excellent material on how to characterise the types of people that are a threat to your systems, including the measures you can take to prevent data breaches from happening. If the presenter had been a bit less nervous, the presentation would have been even stronger. I hope to see the slides on the UKOUG conference website very soon, so I can build on them to win over more people to consider security.

Flash Recovery Area, Mr Tuomas Pystynen, Miracle Finland Oy

As can be expected of Tuomas, a thorough investigation of what the FRA is, what kinds of files it holds, what the drawbacks of using it are, and even a novel use of the FRA for making a copy of a database. Provided you have enough disk space, of course; nothing is free.

Below is a short list of the kinds of files the FRA can hold, plus a warning (by Tuomas): you can place other files in the FRA, but they are not seen by the processes that guard against the FRA filling up, so you could end up with a low-level error along the lines of ‘file system full’.

• Current control file
• Online logs
• Archived logs
• Control file autobackups and copies
• Datafile copies
• Backup pieces
• Flashback logs

Only the last kind, flashback logs, is actually used by a Flashback feature (Flashback Database, to be precise), so why it is called the Flash Recovery Area is a matter of opinion.
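
For reference, the FRA itself is switched on with just two parameters, and its fill level can be watched from inside the database. A minimal sketch, with an invented path and size:

    -- Invented path and size; both parameters are needed to enable the FRA.
    ALTER SYSTEM SET db_recovery_file_dest_size = 100G;
    ALTER SYSTEM SET db_recovery_file_dest = '/u01/fra';

    -- Watch the fill level before the 'file system full' moment arrives.
    SELECT name, space_limit, space_used, space_reclaimable
    FROM   v$recovery_file_dest;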

The best way… , Thomas Kyte, Database Product Evangelist, Oracle

I would like to call the presentation a kind of ‘rant’ against the people who, just like that, want to know the BEST way to do something, and who don’t want to spend time thinking about whether the way they have been given is indeed the best way in their circumstances. Which it almost never is. I think it is safe to say that we all know the standard answer to the standard ‘best way’ question: it depends. All of the above was illustrated in a manner that we would all love to be able to copy. But alas, there is only one Tom Kyte.

And then the UKOUG 2008 Opening Party …….

Tuesday, December 2nd

‘Oracle Exadata – Extreme Performance’ , Andrew Mendelsohn, Oracle

A short reprise of why we need a database machine or even a storage server, and all that from Oracle. Very interesting, but it will be some time before most of us will see such a database machine from close by. At least for me: the data warehouses I am working on are not that big.

Monitoring Oracle using Open source software, Mr Jason Lester, DSP Managed Services

A good, well-structured presentation of why you would choose Open Source Software (OSS), in this case not from a technical viewpoint but from a business viewpoint, the main points being vendor lock-in and dependency on key technicians with commercial software. Also a good overview of which OSS products come into play and how you can integrate several products into a valuable ‘dashboard’. There must be something there for the techies as well. Looking forward to this document being uploaded too.

Change, Change, Change, Mr Tom Kyte, Oracle

What can I say, good as ever. Read his blog.

Oracle 11g’s SuperModels: Predictably Enhanced, Ms Melanie Caffrey, Oracle

Weird PL/SQL, Mr Steven Feuerstein, Quest Software

Database Constraints – A Woeful State of Affairs and a Polite Excuse, Dr Hugh Darwen, University of Warwick; Mr Toon Koppelaars, RuleGen BV

Socializing in the COMMUNITY FOCUS PUBS …..

Wednesday, December 3rd

Designing Data Warehouse in Oracle 11g, Mr Vincent Chazhoor, Lafarge NA

Data Warehouse modelling for dummies, Mr Andrew Bond, Oracle

Sort Merge: Coming to a join near you, Ric Van Dyke, Hotsos

Active Statistics, Mr Wolfgang Breitling, Centrex Consulting Corporation

Performance Tuning Basics, Mr Doug Burns, Bayvale

Visual Data Modelling with SQLDeveloper, Sue Harper, Oracle

Last, but not least: 25th YEAR CELEBRATION PARTY

Thursday, December 4th

A complete BI+W Architecture, Mr. Andrew Bond, Oracle

What’s in Oracle Warehouse Builder 11GR2, Miss Nathalie Nunn, Oracle

Comparing Materialized Views and Analytic Workspaces in Oracle Database 11g, Keith Laker, Oracle

Deploying a Dimensional Model on the Oracle Database, Stewart Bryson, Rittman Mead Consulting

Managing OWB 10gR2 in a deployed environment, Wolfgang Scherrer, infomArt Gmbh

Variance as a tool: Measuring for Robust Performance, Ms Robyn Sands, Cisco

Optimizing Systems by Eliminating Throwaway, Mr Daniel Fink, OptimalDBA

Early night, much needed.

Friday, December 5th: Sinterklaas, pakjesavond (the Dutch presents evening)

Not home this night so I won’t get any presents from Sinterklaas this year. Oh well, I don’t know if I have been that good a boy this year.

Oracle 11g Data Warehousing Masterclass, Mr Mark Rittman, Rittman Mead Consulting

Advanced SQL Features, Mr Daniel Fink, OptimalDBA





Oracle VM

13 11 2007

Interesting and confusing: Oracle now has its own virtualisation software, Oracle VM. Completely free for download, it says, as of November 14, with free downloads of virtual machines as well. Oracle VM does not need a host operating system, according to the FAQ. But Windows runs considerably slower on Oracle VM than natively on the hardware; of course they are ‘working on it’. More questions than answers here. Let’s see where this is going.





Oracle 11g, from the Sixties to the Nineties

8 10 2007

Since 10g, Oracle has had a feature called Flashback. It allows you to recover from loss of data in a somewhat easier way. I have seen some presentations about it and have used it occasionally to recover from inadvertently dropping a table. Dropped by me or by others; I am no exception there. One of the presentations was by Lex de Haan, called ‘Flash me back, Scotty’, with a nice reference to this and all that it stands for in the sixties. Apparently Oracle marketing has seen that presentation as well, and now the feature is advertised as ‘Total Recall‘, totally nineties. A leap into the future using the past.

Now marketing is useful to get attention, but as a true non-believer one needs to see some ‘proof of the pudding’ (I don’t know where I learned that phrase, but it seems appropriate here). Lucas Jellema is currently publishing an excellent series (some examples here, here and here) on the way 11g new features can be put to use, including some ideas and code to demonstrate their usefulness. The most recent one is about a significant change in the implementation of the Flashback feature. This might also be a convenient way to keep a store of all the changes that have been applied to your data warehouse. A lot of those data warehouses already keep a complete history, and Total Recall could be an interesting feature to simplify the maintenance of that history.
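
In 11g terms, Total Recall is the Flashback Data Archive. A minimal sketch of the idea, with invented archive, tablespace and table names:

    -- Invented names; needs the 11g Total Recall option.
    CREATE FLASHBACK ARCHIVE dwh_history
      TABLESPACE fba_ts
      RETENTION 10 YEAR;

    ALTER TABLE customer_dim FLASHBACK ARCHIVE dwh_history;

    -- The kept history is then queryable with plain flashback syntax:
    SELECT *
    FROM   customer_dim AS OF TIMESTAMP SYSTIMESTAMP - INTERVAL '6' MONTH;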

Of course we are currently in the “zero-ities” or whatever the current epoch is called, so I think Oracle will probably not be completely up to speed yet.





Just for myself (Shared server vs Dedicated)

4 10 2007

Jonathan Lewis posted a link to a forum post from Howard Rogers about the difference between Shared Server and Dedicated Server. And as Jonathan mentioned, it was short and clear. I have no idea how long forum posts are preserved, but I have some control over the lifetime of the postings I make here, so I copied the text of Howard’s response below. I hope I am not committing some illegal act by copying it, but I do believe I have put the credits where they belong. Here it comes:

 I’ll just summarise, because I can see some rather iffy advice being offered in this thread.

If you are running in 9i or above, there are two parameters (not one) which govern whether or not you are capable of running in shared server mode.

DISPATCHERS governs whether a job dispatcher runs. You have to have at least one of them configured before shared server is possible. However, the mere fact that a dispatcher runs does NOT mean your database is running in shared server mode. There also have to be shared server processes capable of handling the work dispatched by the dispatcher(s), and those are configured with the SHARED_SERVERS parameter. If that’s set to any number greater than 1, you have shared server processes running on your instance. But that STILL doesn’t mean you’re running in shared server mode! If you have SHARED_SERVERS=57 and no dispatcher, you simply have 57 processes sitting around doing nothing whatsoever (and incapable of doing useful work!)

In short, you have to have DISPATCHERS and SHARED_SERVERS set.

Note, for example, that 10g configures a single dispatcher for all databases by default (if they’re created with DBCA and you don’t get in there to stop it happening), but it does NOT configure SHARED_SERVERS, so by default a 10g database does not run in shared server mode.

The other thing I’d clarify is that a database doesn’t really run in shared server mode anyway! The fact that your instance has a dispatcher and shared server processes running doesn’t necessarily mean your users will end up connected to the dispatcher and having shared server processes handling their job requests. They will by default, but if the tnsnames.ora they use to connect (or its centralised equivalent) contains the line SERVER=DEDICATED, then they will get to use dedicated server processes, no matter what the dispatcher or shared server processes might think about it!

With dispatchers and shared server processes configured, in other words, an instance can “support shared server connection requests”. That’s rather different than “running in shared server mode”. The distinction is important because privileged actions (startup, shutdown, backup and recover commands) cannot be processed by a shared server process, so it’s important for an instance that is configured for normal users to use shared server processes to still support the connection to dedicated server processes by suitably credentialled users.

If a user does end up connected to a shared server process, there is usually a performance penalty to pay compared to using a dedicated server process. A user submits a query and instead of it being immediately processed by a server process, it gets submitted to a dispatcher …which promptly sticks it on a job queue! You then have to wait for a shared server process to become free and decide to pick your job off the queue. That’s inevitably slower than doing it the dedicated way.

People use shared server as the first line of scaling up their databases… and you’re right that it primarily depends on the number of users connected to the server concurrently. In dedicated server mode, a new connection means a new process gets spawned (or a new thread on Windows) and a new connection socket is opened. Servers can only handle so many connection sockets, processes or threads before they start to keel over under the strain. Shared server, as the name suggests, means that new connections do not cause new server processes to be spawned. So 300 users can be processed with, maybe, 30 or 40 processes in total. If your box would normally keel over handling 300 dedicated connections, then clearly with that sort of sharing ratio, you’d be able to scale to nearer 3000 users before it starts wilting by using shared processes.

But it’s also a bit subtler than that: a data warehouse would be daft to implement shared server, even if it did have 300+ concurrent users. That’s because the users of such systems typically run queries that run for hours… and a shared process that is nabbed to perform one job for hours on end isn’t really a shared process any more, is it?! So the rule of thumb as to when to implement shared server is yes, (a) when your concurrent user count starts reaching levels that your server just doesn’t seem able to sustain any more AND (b) when you can be sure that the users tend to issue short, sharp queries (say, about 3 seconds or so to process, max).

Again, there are mixed states to get through, too. You might have lots of OLTP-type sub-3-second transactions on the same database on which one or two users regularly run big reports. In that case, you make sure the reporters have a tnsnames.ora that says SERVER=DEDICATED and the OLTP-type people use one that has SERVER=SHARED in it; configure the DISPATCHERS and SHARED_SERVERS parameters for the instance and then those that can benefit from shared servers can do so and those that wouldn’t won’t be stealing shared processes from those that can!

The alternative approach for those with more cash is to go and buy better server hardware that can cope with the user community numbers! Shared Server configuration, however, comes free. You pays your money and you takes your choices!
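
To summarise the server side of Howard’s point in parameter form, my own minimal sketch (the numbers are invented; the per-connection client-side choice lives in tnsnames.ora and is shown as comments):

    -- Invented values: one TCP dispatcher definition and a pool of
    -- shared server processes. Both must be set for shared server.
    ALTER SYSTEM SET dispatchers = '(PROTOCOL=TCP)(DISPATCHERS=2)';
    ALTER SYSTEM SET shared_servers = 30;

    -- Client side, per connection, e.g. in tnsnames.ora:
    --   OLTP    ... (CONNECT_DATA = (SERVICE_NAME = orcl)(SERVER = SHARED))
    --   REPORTS ... (CONNECT_DATA = (SERVICE_NAME = orcl)(SERVER = DEDICATED))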





Plan stability in Oracle 11g

28 09 2007

I noticed this post on Blogging about oracle, which sounds to me a little bit like a good opportunity for achieving and maintaining plan stability. One more thing to try out.
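
If that post is about what I think it is, the 11g mechanism behind it is SQL Plan Management. A minimal sketch, with a placeholder sql_id:

    -- Let 11g capture baselines for repeated statements automatically ...
    ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = TRUE;

    -- ... or load the plan of one statement by hand (placeholder sql_id).
    DECLARE
      n PLS_INTEGER;
    BEGIN
      n := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(sql_id => 'abcd1234xyz90');
    END;
    /

    -- Plans outside the accepted baselines are then not used, as long as
    -- this parameter is TRUE (the default).
    ALTER SYSTEM SET optimizer_use_sql_plan_baselines = TRUE;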