I am using Oracle 11g and I have a user table, a transaction table and a user_transaction table. The number of users is around 75,000, and the number of unique transactions possible in the application (rows in the transaction table) is between 1 and 3 million. user_transaction is the join of the above two tables, storing which transactions users performed and at what date and time. So this table is going to be huge for 1 year of data (we are going to purge the active data from the table and archive it after 1 year). We are expecting the count to be around 50-60 million rows; this will be the final data size at the end of the year.
I would say the average size is about 30 million records. Also, a nightly import job updates all these tables, and that is the only time inserts are done in them; our application only reads data (SELECT queries) from them.
Now our application queries this table for the following (mostly aggregations):
1) Sending out last-7-days activity reports to all 75,000 users by email: docs accessed, dates, counts, etc. This is the most data- and process-intensive task of our app (a rough sketch of this query follows the list).
2) Sum of all transactions in a date range, by location, resource, etc.
3) Count of all users performing a particular transaction.
4) Document count for each user, and a distinct list of docs.
5) Pulling up graphs and searching user activity separately through the UI.
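To give an idea of the shape of these queries, item 1 boils down to something roughly like this (the column names here are illustrative, not our exact schema):

    SELECT ut.user_id,
           ut.doc_id,
           TRUNC(ut.txn_date) AS activity_date,
           COUNT(*)           AS access_count
    FROM   user_transaction ut
    WHERE  ut.txn_date >= TRUNC(SYSDATE) - 7
    GROUP  BY ut.user_id, ut.doc_id, TRUNC(ut.txn_date)
    ORDER  BY ut.user_id, activity_date;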
What would be the best way to design the join table to make retrieval from the huge transaction table faster? We have added many fields to the table to denormalize it and reduce joins, and almost all data is now available in the transaction and user_transaction tables alone.
If we want to partition the table, how do we go about it? The application queries the more recent data most frequently.
We are thinking of partitioning the transaction table month-wise, so we would have one table for each month.
The other option we were considering is to have 7 tables, one for each day of the week, but that greatly increases the complexity of the queries, considering we are using Hibernate.
How do we design this huge table of around 60 million rows?
asked Jan 12 '12 at 03:08 PM in Default
This seems like more of an OLAP scenario than OLTP, so I would guess all the usual OLAP strategies can be applied here. You did the right thing by denormalizing the tables. Since, as you said, you will mostly be aggregating the data, the first thing to decide is whether you need real-time data or whether you can import the data already aggregated for the day. If you only need data at a daily granularity, your number of rows could decrease immensely. On top of that, materialized views could do wonders for you. Furthermore, since the data is mostly static, you have the leverage to create more indexes than in OLTP, where more indexes could slow you down. Index maintenance should be done nightly to get the best throughput.
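For example (just a sketch; table and column names such as user_transaction, txn_date and transaction_type are assumptions about your schema), a daily pre-aggregation could be kept in a materialized view that is refreshed after the nightly import:

    CREATE MATERIALIZED VIEW mv_daily_user_activity
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT user_id,
           TRUNC(txn_date)  AS activity_day,
           transaction_type,
           COUNT(*)         AS txn_count
    FROM   user_transaction
    GROUP  BY user_id, TRUNC(txn_date), transaction_type;

Reports that only need day-level numbers can then read this much smaller aggregate instead of tens of millions of detail rows.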
As far as partitioning is concerned, tables containing historical data, where only the recent month's data is updatable and the rest of the months are read-only, are always good candidates for it. Not only can you partition the data, you can have partitioned indexes as well. But the most valuable feature for your environment would be partition-wise joins; they could boost your performance, since you would mostly be needing only the recent month's data. Having a partition for the last 7 days would be hard to manage and maintain; sticking with month-wise partitions is much easier to handle. Moreover, I am not certain, but I have heard Oracle can manage monthly partitions by itself (I have not worked on Oracle for quite a while, so I am a bit out of date on that). :)
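If I remember correctly, that feature is what 11g calls interval partitioning: you define one initial range partition and Oracle creates each new monthly partition automatically as data arrives. A rough sketch (the column names are only placeholders for your real schema):

    CREATE TABLE user_transaction (
        user_id          NUMBER  NOT NULL,
        transaction_id   NUMBER  NOT NULL,
        doc_id           NUMBER,
        location_id      NUMBER,
        txn_date         DATE    NOT NULL
    )
    PARTITION BY RANGE (txn_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (
        -- everything before this date goes into the first partition;
        -- Oracle adds one partition per month automatically after that
        PARTITION p_before_2012 VALUES LESS THAN (DATE '2012-01-01')
    );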
answered Jan 12 '12 at 11:10 PM
That's a big question to ask. Firstly, I agree with Usman - sounds like an OLAP system. Materialised Views could be excellent too - although you need to be aware of the penalty of keeping them up to date (making them "on commit refresh" can cause an absolutely massive performance hit on inserts and updates, but you should be able to use a "fast" or "complete" refresh model here following each data load).
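Concretely (just a sketch, assuming an aggregate materialized view named mv_daily_user_activity as suggested above), you would refresh it on demand at the end of the nightly import rather than on commit:

    -- run once after the nightly import job completes
    BEGIN
      -- method => 'C' forces a complete refresh; use 'F' for a fast (incremental) refresh
      DBMS_MVIEW.REFRESH(list => 'MV_DAILY_USER_ACTIVITY', method => 'C');
    END;
    /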
Taking just one of the subjects, if you have partitioning licensed then it is a very efficient way to manage data as you can simply drop off old partitions when they are no longer required. You can even swap them out of your Fact table and swap them into an Archive table if you design it well.
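As a sketch (the partition and table names here are made up), ageing out a month then becomes a quick DDL operation rather than a huge DELETE:

    -- drop a partition that has passed its retention period
    ALTER TABLE user_transaction DROP PARTITION p_2011_01;

    -- or swap it into a pre-created archive table with the same structure (and matching indexes)
    ALTER TABLE user_transaction
      EXCHANGE PARTITION p_2011_01 WITH TABLE user_transaction_archive
      INCLUDING INDEXES WITHOUT VALIDATION;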
You need to think carefully about your partition (and possibly sub-partition) keys. You need to pick a column or columns which are queried frequently and will allow "partition elimination". Date is usually up there, possibly with a sub-partition by transaction type in your case. I would look to partition into as small a logical chunk as is reasonable, and it looks like daily partitioning might be a good idea for you. A few hundred partitions is very manageable and normal; subpartition by, say, 20 transaction types and you still have fewer than 10,000 partitions, which is not unusual (I know one site which does 30,000 partition swaps per day). If you are querying by a day or two, Oracle will automatically exclude the other 364 daily partitions and not consider them. Add the transaction-type subpartition predicate as well and you might only be reading 50,000 rows instead of 50,000,000. That can provide considerable performance benefits.
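Purely as a sketch (the column names and transaction types are guesses at your schema), a composite daily-range / list-by-transaction-type design would look something like this, and a query on one day and one type then only has to scan a single subpartition:

    CREATE TABLE user_transaction (
        user_id           NUMBER       NOT NULL,
        doc_id            NUMBER,
        transaction_type  VARCHAR2(30) NOT NULL,
        txn_date          DATE         NOT NULL
    )
    PARTITION BY RANGE (txn_date)
    INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
    SUBPARTITION BY LIST (transaction_type)
    SUBPARTITION TEMPLATE (
        SUBPARTITION sp_doc_view VALUES ('DOC_VIEW'),
        SUBPARTITION sp_doc_edit VALUES ('DOC_EDIT'),
        SUBPARTITION sp_other    VALUES (DEFAULT)
    )
    (
        PARTITION p_initial VALUES LESS THAN (DATE '2012-01-01')
    );

    -- partition elimination: only one day's partition (and one subpartition of it) is scanned
    SELECT COUNT(*)
    FROM   user_transaction
    WHERE  txn_date >= DATE '2012-01-11'
    AND    txn_date <  DATE '2012-01-12'
    AND    transaction_type = 'DOC_VIEW';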
You also need to think about your indexing strategy and your statistics-gathering strategy. Do you need global indexes across all partitions, or can you get away with partition-level (local) indexes, which confer some benefits? As for gathering stats, this can be a minefield, and you need to look at how you get good-enough stats on each partition and at what point. Getting that right will really help your query optimisation. I think Doug Burns did some good blog posts on this (google his website).
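To make that concrete (again only a sketch; the schema name APP, the index name and the column choices are assumptions), a local index plus 11g incremental statistics might look like:

    -- local index: one index segment per partition, so untouched old partitions need no maintenance
    CREATE INDEX ix_ut_user_date ON user_transaction (user_id, txn_date) LOCAL;

    -- incremental stats: only partitions changed by the nightly load are re-analysed
    BEGIN
      DBMS_STATS.SET_TABLE_PREFS(ownname => 'APP',
                                 tabname => 'USER_TRANSACTION',
                                 pname   => 'INCREMENTAL',
                                 pvalue  => 'TRUE');
      DBMS_STATS.GATHER_TABLE_STATS(ownname     => 'APP',
                                    tabname     => 'USER_TRANSACTION',
                                    granularity => 'AUTO');
    END;
    /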