ETL Consideration: Constraint vs. Index vs. Check in query
I am running an ETL process that may bring in between 5 and 10 million rows a day to a staging schema. I will run my calculations, pivots, etc., and move this data into the warehouse in the dbo schema. Certain tables in this schema will have columns with unique combinations (for example: there can only be one ActingSupervisorId per ShiftId). I have a multipart question: where do I check the uniqueness — staging or warehouse? And how do I check it?

1. Constraint
2. Index
3. Check in query
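To make the three options concrete, here is a small sketch. Python's `sqlite3` stands in for the actual SQL Server environment; `ShiftId` and `ActingSupervisorId` come from the question above, and all other table and index names are illustrative.

```python
import sqlite3

# SQLite stands in for SQL Server here; ShiftId/ActingSupervisorId come from
# the question, all other names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 1 -- constraint: UNIQUE(ShiftId) enforces one supervisor per shift,
# and any violating INSERT fails outright.
cur.execute("""
    CREATE TABLE ShiftConstrained (
        ShiftId INTEGER UNIQUE,
        ActingSupervisorId INTEGER
    )""")

# Option 2 -- index: same enforcement, expressed as a unique index on an
# otherwise unconstrained table.
cur.execute("CREATE TABLE ShiftIndexed (ShiftId INTEGER, ActingSupervisorId INTEGER)")
cur.execute("CREATE UNIQUE INDEX ux_shift ON ShiftIndexed (ShiftId)")

# Option 3 -- check in query: no enforcement at all; duplicates are found
# after the fact with a GROUP BY ... HAVING query.
cur.execute("CREATE TABLE ShiftStaging (ShiftId INTEGER, ActingSupervisorId INTEGER)")
cur.executemany("INSERT INTO ShiftStaging VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 20)])
dupes = cur.execute("""
    SELECT ShiftId, COUNT(*) AS n
    FROM ShiftStaging
    GROUP BY ShiftId
    HAVING COUNT(*) > 1""").fetchall()
print(dupes)  # [(1, 2)] -- ShiftId 1 appears twice
```

The practical difference: options 1 and 2 make bad rows impossible to insert (the load fails), while option 3 lets everything land and leaves the decision about what to do with duplicates to you.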
My philosophy on this is pretty simple: your warehouse should be a reflection of reality, however dirty that reality is. You should not need to put constraints on warehouse tables for integrity purposes, because you want to display the data as it actually is. If the data you have violates business rules, or analysts look at the sheets and say that the numbers can't possibly be right, then the source system needs to fix it. Otherwise, if you spend your time fixing the problem today, you'll need to fix it again the next time the data loads, and in perpetuity. Also, people looking at source-system reports could never tie the numbers to what your warehouse shows, which will make them doubt both systems.

Regarding constraints in the warehouse, I would only put constraints on if they provide better read performance; otherwise, I would leave them off. I tend to leave off any unique constraints or check conditions. The reason I can get away with this is that I know there is one and only one way data gets into my warehouse tables: through a single ETL process. Users can't input data, developers and DBAs don't run insert or update statements, and there aren't parallel load processes with different business logic. When you have a single source of loading truth, you can push any checks, data conversions, calculations, etc. back a step or so in the pipeline.

I should probably note that there are times when I violate my simple philosophy: in our warehouse, there is one particular date range that I fix because I know the data will never be corrected in the source system (that source system is dead), and fixing it really helps solve a lot of problems. I don't like doing it, though...

Anyhow, I do these checks in staging tables, because I want my inserts into the warehouse to go quickly and smoothly. If you do end up performing some types of checks, your method should be determined by your requirements.
If you do have duplicate rows or check violations, what do you want to do? Fail the whole process, skip just those rows, or push the data along and send yourself a reminder to tell the analysts to look at the bad records? Any of those are legitimate options. If you want the process to fail, use constraints and roll back the transaction when a violation occurs. If you want to skip the rows, you could use a windowing function like ROW_NUMBER() to pick out only valid records. If you want to load the data because that's what it really is, you could do similar ROW_NUMBER()-based checks and store the offending rows in a separate table for later analysis, to let people know just how bad the source records are.
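The skip-and-log pattern could be sketched like this (Python/SQLite standing in for the T-SQL equivalent; the `LoadedAt` tie-breaker column and the `Rejects` table are my assumptions, not anything from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Staging (ShiftId INT, ActingSupervisorId INT, LoadedAt TEXT)")
cur.executemany("INSERT INTO Staging VALUES (?, ?, ?)",
                [(1, 10, '2024-01-01'), (1, 11, '2024-01-02'), (2, 20, '2024-01-01')])
cur.execute("CREATE TABLE Warehouse (ShiftId INT, ActingSupervisorId INT)")
cur.execute("CREATE TABLE Rejects (ShiftId INT, ActingSupervisorId INT)")

# Rank duplicates per ShiftId; keep the most recently loaded row (rn = 1)
# and divert the rest to a rejects table for later analysis.
ranked = """
    SELECT ShiftId, ActingSupervisorId,
           ROW_NUMBER() OVER (PARTITION BY ShiftId ORDER BY LoadedAt DESC) AS rn
    FROM Staging
"""
cur.execute(f"INSERT INTO Warehouse SELECT ShiftId, ActingSupervisorId FROM ({ranked}) WHERE rn = 1")
cur.execute(f"INSERT INTO Rejects   SELECT ShiftId, ActingSupervisorId FROM ({ranked}) WHERE rn > 1")
```

After the load, the warehouse holds one row per ShiftId and the duplicate (ShiftId 1 with supervisor 10) sits in `Rejects`, ready to be reported back to the analysts.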
I will take a shot at this even though I am not a database developer. Part of this really depends on what you want to accomplish and what you want to be able to do if the check fails. Personally, I would handle this check between the staging table and the insert into the warehouse. If you use a constraint and the uniqueness requirement is not met, won't the data load fail? What then? I would run the ETL to bring the data into my staging tables, then use a query to check for uniqueness. That way, if the uniqueness check fails and there is a way to de-dup or cleanse the data, you already have it in a staging table. If all is well, then you can insert into the warehouse. Again, it really depends on your scenario.
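That gating approach — check in staging, load only if clean — might look like this sketch (Python/SQLite as a stand-in; table names are assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Staging (ShiftId INT, ActingSupervisorId INT)")
cur.execute("CREATE TABLE Warehouse (ShiftId INT, ActingSupervisorId INT)")
cur.executemany("INSERT INTO Staging VALUES (?, ?)", [(1, 10), (2, 20)])

# Gate the load: only move staging rows into the warehouse if no ShiftId repeats.
dupe_count = cur.execute("""
    SELECT COUNT(*) FROM (
        SELECT ShiftId FROM Staging GROUP BY ShiftId HAVING COUNT(*) > 1
    )""").fetchone()[0]

if dupe_count == 0:
    cur.execute("INSERT INTO Warehouse SELECT * FROM Staging")
else:
    # Bad data stays in staging, where it can be de-duped or cleansed
    # before a retry -- the warehouse insert never runs.
    print(f"{dupe_count} duplicated ShiftId value(s); load halted")
```

The key property is the one the answer describes: a failed check leaves all the data sitting in staging, where it can be cleaned up, rather than aborting a half-finished warehouse load.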
@Kevin Feasel has a great answer. I'd just like to add one thing: you say as an example that there can only be one ActingSupervisorId per ShiftId. In the source system this is probably true. But in the warehouse, that relationship is one of the things that will change over time. The warehouse therefore has a time dimension that probably does not exist in the source system, and for that reason it can be a good idea to check data integrity when inserting into the warehouse — not necessarily because there are or could be errors in the source system, but because the warehouse is a different creature from the source system.
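One way to read this point: once the warehouse tracks history, the unique key widens to include a validity date. The following sketch assumes a slowly-changing-dimension-style layout with hypothetical `ValidFrom`/`ValidTo` columns (SQLite standing in for the warehouse platform):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# In the warehouse, "one supervisor per shift" only holds at a point in time,
# so the unique key widens from ShiftId to (ShiftId, ValidFrom). Table and
# column names here are hypothetical.
cur.execute("""
    CREATE TABLE ShiftSupervisor (
        ShiftId            INTEGER,
        ActingSupervisorId INTEGER,
        ValidFrom          TEXT,
        ValidTo            TEXT,
        UNIQUE (ShiftId, ValidFrom)
    )""")

# Two rows for the same shift are fine as long as their validity periods differ:
cur.execute("INSERT INTO ShiftSupervisor VALUES (1, 10, '2024-01-01', '2024-06-30')")
cur.execute("INSERT INTO ShiftSupervisor VALUES (1, 11, '2024-07-01', '9999-12-31')")

# A second supervisor starting on the same date violates the widened key:
try:
    cur.execute("INSERT INTO ShiftSupervisor VALUES (1, 12, '2024-07-01', '9999-12-31')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

This is why a rule that holds trivially in the source system can still be worth checking at warehouse-insert time: the warehouse's time dimension creates integrity conditions the source never had to enforce.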