(Integrity) Constraints in a datawarehouse

24 06 2009

In data warehouse land it is not very common to see constraints in the database. I never felt very comfortable with that, but I never got around to analysing why I felt that way, until I read this article by Tom Kyte. In the article Tom Kyte shows that the CBO (Cost Based Optimizer) can benefit from the information derived from the presence of constraints by generating better query plans. Better in this case is defined as ‘producing result sets faster’. The examples in the article are not exactly ‘real world’ data warehouse examples, but following Tom Kyte’s line of reasoning I do agree that constraints are capable of improving the performance of queries.

The reasons for not having constraints in a data warehouse are usually along the lines of ‘I have already checked the integrity during my ETL, so why would I need constraints to confirm that? And besides, constraints would only delay my ETL because they have to be checked before they are really enabled’. I see a couple of flaws in this reasoning:

  • I suspect that in most data warehouses a good number of constraints would fail validation if you actually tried to enable them. The quality of the ETL might be good, but is it just as watertight as a constraint would be? I think not.
  • Enabling constraints takes time, but how often do you have to check them? Only while doing the ETL, of course. I hope that in your DWH the ETL runs during only a small part of the time the DWH is in use; otherwise your DWH has a bigger problem. The rest of the time the DWH is being queried, and Tom Kyte just showed that querying can be sped up by applying constraints.

Summing up, my pros and cons of applying constraints:

Pro:

  • it will improve the data quality of the DWH
  • it can speed up the queries in your DWH (querying it is the purpose of your DWH anyway)

Con:

  • it will take more time to do your ETL (which is only a means to create your DWH)

My conclusion is that I will try to incorporate as many constraints as possible in my next DWH. It also means I will have to be smart enough to enable the constraints at just the right moment during my ETL to keep the loading performance acceptable. A sketch of what I have in mind follows below.
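A minimal sketch of that idea, assuming a fact table SALES_FACT with a foreign key FK_SALES_PRODUCT (both names invented for this example): disable the constraint before the load, and afterwards either validate it in full or, if you trust the ETL enough, re-enable it with RELY NOVALIDATE so the CBO can still make use of it.

-- Example objects only; table and constraint names are made up.
-- Before the load: disable the foreign key so it does not slow down the ETL.
ALTER TABLE sales_fact MODIFY CONSTRAINT fk_sales_product DISABLE;

-- ... run the ETL load here ...

-- After the load, option 1: full validation (safest, but reads the whole table).
ALTER TABLE sales_fact MODIFY CONSTRAINT fk_sales_product ENABLE VALIDATE;

-- Option 2: skip the validation, but tell the CBO it may RELY on the constraint
-- (for example for query rewrite and join elimination).
ALTER TABLE sales_fact MODIFY CONSTRAINT fk_sales_product RELY ENABLE NOVALIDATE;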





Little known but useful facts about Oracle

31 03 2009

Laurent Schneider writes a blog that I follow; it is full of tiny facts you think you know the details of, but when you look closely the details turn out to be somewhat different from what you thought they were.

One such gem today: 1 <> 1, and another from about two weeks ago: commit suicide.





UKOUG; My schedule and gradually some notes

3 12 2008

Monday, December 1st

Keynote    ‘Oracle Corporate and Product Strategy Update’, David Callaghan, Oracle

As keynotes go, general and fluffy. Main point: Oracle is (trying to be) complete, open and integrated.

Beginners’ Guide to Partitioning, Mr Jonathan Lewis, JL Computer Consultancy
Indeed a beginners’ guide. But Mr Lewis could not help himself and slipped in some valuable advice on performance aspects, such as:

  • There is a penalty to partitioned indexes: a query may have to visit every index partition, which is not the behaviour most people expect
  • Bind variables may make it difficult for the CBO to do partition pruning; dates can do the same, so make sure you use 4-digit years to prevent the optimizer from thinking you are talking about the first century
  • New to me, but probably not to others: Bloom filters (in 11g, maybe already in 10gR2?), something to investigate
  • Partition exchange when a global (unique) index is present will require checking of the index even when using the NOVALIDATE clause.

And a teaser/challenge to check whether it is possible to solve the performance problem of “where ‘2007-01-01’  between start_date and end_date” with a range-range composite partitioned table.
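Purely as a sketch of the starting point for that challenge (11g syntax, all names invented by me; whether it actually solves the BETWEEN problem is exactly the homework):

-- Hypothetical range-range composite partitioned table.
CREATE TABLE date_ranges (
  id         NUMBER,
  start_date DATE,
  end_date   DATE
)
PARTITION BY RANGE (start_date)
SUBPARTITION BY RANGE (end_date)
(
  PARTITION p_2006 VALUES LESS THAN (TO_DATE('2007-01-01','YYYY-MM-DD'))
  ( SUBPARTITION p_2006_e2007 VALUES LESS THAN (TO_DATE('2008-01-01','YYYY-MM-DD')),
    SUBPARTITION p_2006_emax  VALUES LESS THAN (MAXVALUE)
  ),
  PARTITION p_max VALUES LESS THAN (MAXVALUE)
  ( SUBPARTITION p_max_emax VALUES LESS THAN (MAXVALUE) )
);

-- The query from the teaser; note the explicit 4-digit-year date literal.
SELECT *
FROM   date_ranges
WHERE  DATE '2007-01-01' BETWEEN start_date AND end_date;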

Oracle Business Intelligence Road Map and Strategy, Mr Nick Tuson, Oracle

A general introduction to where Oracle thinks BI belongs. It is a key supporting part of Enterprise Performance Management (EPM), and EPM is supported by virtually all Oracle products, not only the BI products. Oracle uses something they call the Common Enterprise Information Model (CEIM); I will try to find a reference or white paper for it later.

The BI applications get a new component, which will be the integration of Oracle Data Integrator (ODI, formerly Sunopsis).

Lots of new and changed interfaces, a personalised start page borrowed from server technology, and Fusion Middleware integration.

Probably a useful presentation if you are ‘into’ OBIEE, Hyperion and the like. Unfortunately I am not.

Anatomy of Data Breaches: How They Occur and How to Protect Against Them, Ms Tammy Bednar, Oracle

Excellent material on how to characterise the types of people that are a threat to your systems, including the measures you can take to prevent data breaches from happening. If the presenter had been a bit less nervous the presentation would have been even stronger. I hope to see the slides on the UKOUG conference website very soon, so I can build on them to win over more people to consider security.

Flash Recovery Area, Mr Tuomas Pystynen, Miracle Finland Oy

As can be expected of Tuomas, a thorough investigation of what the FRA is, what kinds of files it holds, what the drawbacks of using it are, and even a novel use of the FRA for making a copy of a database. Provided you have enough disk room, of course; nothing is free.

Below is a short list of the kinds of files the FRA can hold, and a warning (by Tuomas): you can place other files in the FRA, but they are not seen by the processes that guard against the FRA filling up, so you could end up with a low-level error along the lines of ‘file system full’.

• Current control file
• Online logs
• Archived logs
• Control file autobackups and copies
• Datafile copies
• Backup pieces
• Flashback logs

Only the last kind, the flashback logs, is actually used by a Flashback feature (Flashback Database, to be precise), so whether ‘Flash Recovery Area’ is a well-chosen name is a matter of opinion.
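As a note to self, the two parameters that define the FRA and a quick way to see how full it is; the size and path below are just examples:

-- Example values only; set the size before the destination.
ALTER SYSTEM SET db_recovery_file_dest_size = 20G SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest = '/u02/fra' SCOPE=BOTH;

-- How much of the FRA is in use, per file type.
SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
FROM   v$flash_recovery_area_usage;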

The best way… , Thomas Kyte, Database Product Evangelist, Oracle

I would like to call the presentation a kind of ‘rant’ against people who, just like that, want to know THE best way to do something, and who don’t want to spend time thinking about whether the way they have been given is indeed the best way in their circumstances. Which it almost never is. I think it is about safe to say that we all know the standard answer to the standard ‘best way’ question: it depends. All of the above was illustrated in a manner that we would all love to be able to copy. But alas, there is only one Tom Kyte.

And then the UKOUG 2008 Opening Party …….

Tuesday, December 2nd

‘Oracle Exadata – Extreme Performance’ , Andrew Mendelsohn, Oracle

A short reprise of why we need a database machine or even a storage server, and all that from Oracle. Very interesting, but it will be some time before most of us see such a database machine up close. At least for me: the data warehouses I’m working on are not that big.

Monitoring Oracle using Open source software, Mr Jason Lester, DSP Managed Services

A good, well-structured presentation of why you would choose Open Source Software (OSS), in this case not from a technical viewpoint but from a business viewpoint, the main points being vendor lock-in and dependency on key technicians with commercial software. Also a good overview of which OSS products come into play and how you can integrate several products into a valuable ‘dashboard’. There must be something there for the techies as well. Looking forward to this document being uploaded too.

Change, Change, Change, Mr Tom Kyte, Oracle

What can I say, good as ever. Read his blog.

Oracle 11g’s SuperModels: Predictably Enhanced, Ms Melanie Caffrey, Oracle

Weird PL/SQL, Mr Steven Feuerstein, Quest Software

Database Constraints – A Woeful State of Affairs and a Polite Excuse, Dr Hugh Darwen, University of Warwick; Mr Toon Koppelaars, RuleGen BV

Socializing in the COMMUNITY FOCUS PUBS …..

Wednesday, December 3rd

Designing Data Warehouse in Oracle 11g, Mr Vincent Chazhoor, Lafarge NA

Data Warehouse modelling for dummies, Mr Andrew Bond, Oracle

Sort Merge: Coming to a join near you, Ric Van Dyke, Hotsos

Active Statistics, Mr Wolfgang Breitling, Centrex Consulting Corporation

Performance Tuning Basics, Mr Doug Burns, Bayvale

Visual Data Modelling with SQLDeveloper, Sue Harper, Oracle

Last, but not least; 25th YEAR CELEBRATION PARTY

Thursday, December 4th

A complete BI+W Architecture, Mr. Andrew Bond, Oracle

What’s in Oracle Warehouse Builder 11GR2, Miss Nathalie Nunn, Oracle

Comparing Materialized Views and Analytic Workspaces in Oracle Database 11g, Keith Laker, Oracle

Deploying a Dimensional Model on the Oracle Database, Stewart Bryson, Rittman Mead Consulting

Managing OWB 10gR2 in a deployed environment, Wolfgang Scherrer, infomArt Gmbh

Variance as a tool: Measuring for Robust Performance, Ms Robyn Sands, Cisco

Optimizing Systems by Eliminating Throwaway, Mr Daniel Fink, OptimalDBA

Early night, much needed.

Friday, December 5th; Sinterklaas, pakjesavond (the Dutch presents evening)

Not home this night so I won’t get any presents from Sinterklaas this year. Oh well, I don’t know if I have been that good a boy this year.

Oracle 11g Data Warehousing Masterclass, Mr Mark Rittman, Rittman Mead Consulting

Advanced SQL Features, Mr Daniel Fink, OptimalDBA





Oracle VM

13 11 2007

Interesting and confusing: Oracle now has its own virtualisation software, Oracle VM. Completely free for download, it says, as of November 14, along with free downloads of virtual machines. According to the FAQ, Oracle VM does not need a host operating system. But Windows runs considerably slower on Oracle VM than natively on the hardware; of course they are ‘working on it’. More questions than answers here. Let’s see where this is going.





Oracle 11g, from the Sixties to the Nineties

8 10 2007

Since 10g Oracle has a feature called Flashback. It allows you to recover from loss of data in a somewhat easier way. I have seen some presentations about it and have used it occasionally to recover from inadvertently dropping a table. Dropped by me or by others; I am no exception there. One of the presentations was by Lex de Haan, called ‘Flash me back, Scotty’, with a nice reference to this and all that it stands for in the sixties. Apparently Oracle marketing has seen that presentation as well, and now the feature is advertised as ‘Total Recall‘, totally nineties. A leap into the future using the past.
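The kind of thing I used it for, as a minimal sketch (table name invented, and it only works as long as the table is still in the recycle bin and was not dropped with PURGE):

-- Oops.
DROP TABLE important_dim;

-- Bring it back from the recycle bin.
FLASHBACK TABLE important_dim TO BEFORE DROP;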

Now marketing is useful to get attention, but as a true non-believer one needs to see some ‘proof of the pudding’ (I don’t know where I learned that phrase, but it seems appropriate here). Lucas Jellema is currently publishing an excellent series (some examples here, here and here) on the way 11g’s new features can be put to use, including some ideas and code to demonstrate their usefulness. The most recent one is about a significant change in the implementation of the Flashback feature. This might also be a convenient way to keep a store of all the changes that have been applied to your data warehouse. A lot of data warehouses already keep a complete history, and Total Recall could be an interesting feature to simplify the maintenance of that history.
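A minimal sketch of what that could look like with Total Recall in 11g; the archive name, tablespace and retention period are made up for the example:

-- Create an archive that keeps row history for five years.
CREATE FLASHBACK ARCHIVE dwh_history
  TABLESPACE fba_ts
  RETENTION 5 YEAR;

-- Start tracking history for a dimension table.
ALTER TABLE customer_dim FLASHBACK ARCHIVE dwh_history;

-- Later, query the table as it was at some point in the past.
SELECT *
FROM   customer_dim AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '30' DAY);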

Of course we are currently in the “zero-ities” or whatever the current epoch is called, so I think Oracle will probably not be completely up to speed yet.





Just for myself (Shared server vs Dedicated)

4 10 2007

Jonathan Lewis posted a link to a forum post by Howard Rogers about the difference between Shared Server and Dedicated Server. And as Jonathan mentioned, it was short and clear. I have no idea how long forum posts are preserved, but I do have some control over the lifetime of the postings I make here, so I copied the text of Howard’s response below. I hope I am not committing some illegal act by copying it, but I do believe I have put the credits where they belong. Here it comes:

 I’ll just summarise, because I can see some rather iffy advice being offered in this thread.

If you are running in 9i or above, there are two parameters (not one) which govern whether or not you are capable of running in shared server mode.

DISPATCHERS governs whether a job dispatcher runs. You have to have at least one of them configured before shared server is possible. However, the mere fact that a dispatcher runs does NOT mean your database is running in shared server mode. There also have to be shared server processes capable of handling the work dispatched by the dispatcher(s), and those are configured with the SHARED_SERVERS parameter. If that’s set to any number greater than 1, you have shared server processes running on your instance. But that STILL doesn’t mean you’re running in shared server mode! If you have SHARED_SERVERS=57 and no dispatcher, you simply have 57 processes sitting around doing nothing whatsoever (and incapable of doing useful work!)

In short, you have to have DISPATCHERS and SHARED_SERVERS set.

Note, for example, that 10g configures a single dispatcher for all databases by default (if they’re created with DBCA and you don’t get in there to stop it happening), but it does NOT configure SHARED_SERVERS, so by default a 10g database does not run in shared server mode.

The other thing I’d clarify is that a database doesn’t really run in shared server mode anyway! The fact that your instance has a dispatcher and shared server processes running doesn’t necessarily mean your users will end up connected to the dispatcher and having shared server processes handling their job requests. They will by default, but if the tnsnames.ora they use to connect (or its centralised equivalent) contains the line SERVER=DEDICATED, then they will get to use dedicated server processes, no matter what the dispatcher or shared server processes might think about it!

With dispatchers and shared server processes configured, in other words, an instance can “support shared server connection requests”. That’s rather different than “running in shared server mode”. The distinction is important because privileged actions (startup, shutdown, backup and recover commands) cannot be processed by a shared server process, so it’s important for an instance that is configured for normal users to use shared server processes to still support the connection to dedicated server processes by suitably credentialled users.

If a user does end up connected to a shared server process, there is usually a performance penalty to pay compared to using a dedicated server process. A user submits a query and instead of it being immediately processed by a server process, it gets submitted to a dispatcher …which promptly sticks it on a job queue! You then have to wait for a shared server process to become free and decide to pick your job off the queue. That’s inevitably slower than doing it the dedicated way.

People use shared server as the first line of scaling up their databases… and you’re right that it primarily depends on the number of users connected to the server concurrently. In dedicated server mode, a new connection means a new process gets spawned (or a new thread on Windows) and a new connection socket is opened. Servers can only handle so many connection sockets, processes or threads before they start to keel over under the strain. Shared server, as the name suggests, means that new connections do not cause new server processes to be spawned. So 300 users can be processed with, maybe, 30 or 40 processes in total. If your box would normally keel over handling 300 dedicated connections, then clearly with that sort of sharing ratio, you’d be able to scale to nearer 3000 users before it starts wilting by using shared processes.

But it’s also a bit subtler than that: a data warehouse would be daft to implement shared server, even if it did have 300+ concurrent users. That’s because the users of such systems typically run queries that run for hours… and a shared process that is nabbed to perform one job for hours on end isn’t really a shared process any more, is it?! So the rule of thumb as to when to implement shared server is yes, (a) when your concurrent user count starts reaching levels that your server just doesn’t seem able to sustain any more AND (b) when you can be sure that the users tend to issue short, sharp queries -say, about 3 seconds or so to process, max.

Again, there are mixed states to get through, too. You might have lots of OLTP-type sub-3-second transactions on the same database on which one or two users regularly run big reports. In that case, you make sure the reporters have a tnsnames.ora that says SERVER=DEDICATED and the OLTP-type people use one that has SERVER=SHARED in it; configure the DISPATCHERS and SHARED_SERVER parameters for the instance and then those that can benefit from shared servers can do so and those that wouldn’t won’t be stealing shared processes from those that can!

The alternative approach for those with more cash is to go and buy better server hardware that can cope with the user community numbers! Shared Server configuration, however, comes free. You pays your money and you takes your choices!
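Not part of Howard’s text, but as a note to self, a minimal sketch of the configuration he describes (host, service and alias names are made up):

-- Instance side: one dispatcher and a pool of shared server processes.
ALTER SYSTEM SET dispatchers = '(PROTOCOL=TCP)(DISPATCHERS=1)' SCOPE=BOTH;
ALTER SYSTEM SET shared_servers = 20 SCOPE=BOTH;

-- Client side (tnsnames.ora): the SERVER setting decides which kind of process
-- a session gets, regardless of what the instance supports.
--
--   OLTP_USERS = (DESCRIPTION = (ADDRESS = (PROTOCOL=TCP)(HOST=dbhost)(PORT=1521))
--                 (CONNECT_DATA = (SERVICE_NAME=orcl)(SERVER=SHARED)))
--   REPORTERS  = (DESCRIPTION = (ADDRESS = (PROTOCOL=TCP)(HOST=dbhost)(PORT=1521))
--                 (CONNECT_DATA = (SERVICE_NAME=orcl)(SERVER=DEDICATED)))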





Plan stability in oracle 11g

28 09 2007

I noticed this post on Blogging about Oracle, which sounds to me like a good opportunity for achieving and maintaining plan stability. One more thing to try out.





Installing Oracle 11g Database

23 09 2007

Oracle 11g has been out for a couple of months now and I found some time to play around with it a bit.

One thing that immediately stood out to me is the size of the download: a 1.7GB zip file. That’s quite different from the promise Oracle made when introducing 10g, where you only needed one CD and 15 minutes to install an Oracle database. That was keynote stuff back then. Apparently not so important anymore.

I do not have that much hardware available to play around with, so it had to be a virtual machine. I had created a CentOS 4.4 VM some months back, so that seemed a logical choice. Of course it is wise to upgrade the installed packages when you have not used a CentOS installation recently, so up2date was the first thing to start in the VM. Starting it was no trouble; it just took quite some time to finish downloading and installing all the packages. Another evening of not doing very much, just a lot of waiting.

The time I had to wait was eventually well spent, however, because it gave others the time to finish the same exercise and blog about it. All I had to do next was follow in the footsteps of Howard to prepare Linux and install Oracle. Of course the installation failed when the starter database was to be created, because there was not enough disk space in the VM. I am not a Unix or Linux administrator, so that meant another evening of puzzling and googling. But I succeeded eventually.

The next install (after removing the original one the ‘hard’ way) went smoothly, so now I have a virtual machine with a DWH11 and a REPO11 Oracle 11g database. Next up: installing the Warehouse Builder environment in the Kadenza style.

(To lift a tip of the veil: it is installed, but the description will have to wait till after preparing and savouring dinner and watching a ‘super Sunday’ of Dutch soccer (AZ-Ajax and PSV-Feijenoord, no 1-4 currently).)





Bitmap index size, or, Size does matter.

18 09 2007

For the past couple of years I have been working with the notion that Oracle’s bitmap indexes are a cool feature, perfect for an index on low-cardinality columns. And even the definition of ‘low’ in this regard has been stretched a lot: we have a data warehouse where we create an index on every column of the dimension tables, and the index is created as a bitmap index when the cardinality is below 50,000. Queries that are restricted on these columns fly.

The biggest downside to using bitmap indexes is (or so I believed) that they tend to grow enormously in size when DML is applied to the underlying table. This results in frequently rebuilding the bitmap indexes to reclaim space (and thus performance, by reducing the required I/O), or in not using bitmap indexes at all.

It appears I have been sleeping or did not read the ‘What’s New’ guides thoroughly enough, because in the What’s new guide for 10gR1 this is mentioned:

Enhanced Bitmap Index Performance and Space Management

Bitmap indexes now perform better and are less likely to be fragmented when subjected to large volumes of single-row data manipulation language (DML) operations.

Let’s get dirty and do some querying. Here’s what happens in 9.2.0.7 (script shamelessly copied from Jonathan Lewis’s blog):
SQL> select * from v$version where banner like 'Oracle%';

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
SQL>
SQL>
SQL> create table t1 as
2 select
3 rownum id,
4 mod(rownum,2) bit_col
5 from
6 all_objects -- {or your favourite row source}
7 where
8 rownum <= 10000
9 ;
Table created.

SQL>
SQL> create unique index t1_pk on t1(id);

Index created.

SQL> create bitmap index t1_bi on t1(bit_col);

Index created.

SQL>
SQL> -- gather statistics on the table
SQL>
SQL> validate index t1_bi;

Index analyzed.

SQL>
SQL> select
2 name, height, lf_blks, lf_rows, btree_space, used_space
3 from
4 index_stats
5 ;

NAME                               HEIGHT    LF_BLKS    LF_ROWS BTREE_SPACE USED_SPACE
------------------------------ ---------- ---------- ---------- ----------- ----------
T1_BI                                   1          1          2       16188       2891

SQL>
SQL> begin -- update 250 rows
2 for i in reverse 9500..10000 loop
3 update t1
4 set bit_col = 0
5 where
6 id = i
7 and bit_col = 1;
8 end loop;
9 commit;
10 end;
11 /

PL/SQL procedure successfully completed.

SQL>
SQL> validate index t1_bi;

Index analyzed.

SQL>
SQL> select
2 name, height, lf_blks, lf_rows, btree_space, used_space
3 from
4 index_stats
5 ;

NAME                               HEIGHT    LF_BLKS    LF_ROWS BTREE_SPACE USED_SPACE
------------------------------ ---------- ---------- ---------- ----------- ----------
T1_BI                                   3         82        496     1505836     810761

SQL>
SQL> drop table t1;

Table dropped.

SQL>
SQL>

So what I see here is that the bitmap index has grown from a single 8K block to 0.77MB, just by updating 250 rows of a 10,000-row table. Imagine what that will do to the bitmap indexes on your multi-million row DWH tables.
Now, I only have a 10gR2 database available and nothing is mentioned in that documentation regarding this subject, but let’s see what happens in 10gR2 (assuming the same will be true for 10gR1):

SQL> select * from v$version where banner like 'Oracle%';

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bi
SQL>
SQL>
SQL> create table t1 as
2 select
3 rownum id,
4 mod(rownum,2) bit_col
5 from
6 all_objects -- {or your favourite row source}
7 where
8 rownum <= 10000
9 ;
Table created.

SQL>
SQL> create unique index t1_pk on t1(id);

Index created.

SQL> create bitmap index t1_bi on t1(bit_col);

Index created.

SQL>
SQL> -- gather statistics on the table
SQL>
SQL> validate index t1_bi;

Index analyzed.

SQL>
SQL> select
2 name, height, lf_blks, lf_rows, btree_space, used_space
3 from
4 index_stats
5 ;

NAME                               HEIGHT    LF_BLKS    LF_ROWS BTREE_SPACE USED_SPACE
------------------------------ ---------- ---------- ---------- ----------- ----------
T1_BI                                   1          1          2        7996       2916

SQL>
SQL> begin -- update 250 rows
2 for i in reverse 9500..10000 loop
3 update t1
4 set bit_col = 0
5 where
6 id = i
7 and bit_col = 1;
8 end loop;
9 commit;
10 end;
11 /

PL/SQL procedure successfully completed.

SQL>
SQL> validate index t1_bi;

Index analyzed.

SQL>
SQL> select
2 name, height, lf_blks, lf_rows, btree_space, used_space
3 from
4 index_stats
5 ;

NAME                               HEIGHT    LF_BLKS    LF_ROWS BTREE_SPACE USED_SPACE
------------------------------ ---------- ---------- ---------- ----------- ----------
T1_BI                                   1          1          2        7996       2853

SQL>
SQL> drop table t1;

Table dropped.

SQL>
SQL>

The index has not grown at all!

Now don’t go throwing away all your regular indexes and replacing them all with bitmap indexes. There is still another downside to updating bitmap indexes, as Jonathan Lewis mentions in his post: they used to create a lot of redo in 9i and they still use a lot of redo in 10g (note to self: create a script to demonstrate this; a first sketch follows below). But there are certainly more cases where a bitmap index can save your day, or the performance of your data warehouse.
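A first rough sketch of such a script, to be run in a single session against the t1 table from above; ‘redo size’ is the session statistic to watch:

-- Redo generated by this session so far.
select n.name, s.value
from   v$mystat s, v$statname n
where  n.statistic# = s.statistic#
and    n.name = 'redo size';

-- Now run the same row-by-row update block as above (once against the table with
-- the bitmap index, and once against a copy with a normal b-tree index) ...

-- ... then run the query again; the difference between the two values is the
-- redo generated by the update.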

And, as always, check whether the results for this kind of use are the same in your own environment, or in IM speak: YMMV.





It is not only out, it is available for download

10 08 2007

Yesterday, over at Eddie Awad’s blog, was the first time I read that Oracle 11g on Linux is available for download. And the documentation is there as well. I shamelessly copied the links from Eddie, mainly so I won’t have to find them myself when I find the time to download and install.

Of course the first to download, install AND publish about it is to be found here.

Someone has got to do the dirty job :-).