An application I inherited tracks lab test results performed on material samples. Data is stored in a single table (tblSampleData) with a primary key of SampleID and 235 columns representing potential test results. The problem is that only a few tests are performed per sample, so each row contains over 200 nulls. Actually, there is a second similar table (tblSampleData2) with another 215 primarily null columns and a primary key of SampleID. The two tables have a one-to-one relationship and most SampleIDs have some data in both tables. For every SampleID, however, there are easily 400 null columns.
Is this bad database design? If so, which normal form rule is broken? How can I query this table to identify which groups of columns are typically filled together with data? My goal would be to have, say 45 tables with 10 columns and fewer null values. How can I do this? How do I avoid breaking existing applications?
These table have about 200,000 sample records so far. Users are asking me to add more columns for more tests, but I'd rather build a new table. Is this wise?
asked Jul 01 '10 at 09:16 AM in Default
I really depends on the data for a sample. Say for instance there is a group called Analysis 1234 and it has 10 results for columns. You might create a related table called Analysis1234 and have those 10 columns in that table and relate throught a foreign key back to the tblSampleData.
If that is not the case, you at the idea of Entity Variable Attributes. Search for EVA tables at Bing.com
answered Oct 17 '12 at 06:05 PM
I really wouldn't go for 45 tables each of which with 10 columns. I would do the job properly and have a very narrow table
If you do any more than that, you're going to have to write code to search multiple tables, different rules to address what goes where, and it's going to be just as big (if not more) of a maintenance headache than you have now.
Yes, doing this is going to break existing code - as is making any change to the number of tables. The easiest way around this is to encapsulate data access by use of stored procedures in the short term, and migrate client code to use the stored procedures to access data; then, once all code is migrated, you can safely change the back-end database structures with no impact on the client code, as long as you remember to change the SPs at the same time!
PS: In terms of Normalization / Normal Forms, technically, given that you allow null values, you violate the original definition of 1NF!
I'll take the questions out of order.
As for the normal forms, technically any nullable column violates the first normal form and so cannot be considered normalized. However, many people consider that requirement controversial and use a definition of normalization that permits nulls in 1NF-3NF. Putting a table into 4th or 5th normal form however will address multivalued dependencies and create a situation where there is never a need for nulls.
As 3NF is often redefined (to allow nulls) there is nothing inherently wrong with having either a large number of columns or having numerous null columns. Such a design is unusual, but as long as everything depends, to borrow a phrase, "on the key, the whole key, and nothing but the key" it can be considered normalized as the term is normally used in practice.
Maybe. It depends on how it is being used. There are certainly disadvantages to having huge numbers of columns that go beyond whether the table is technically normalized or not. But, it could be a perfectly acceptable design, especially if the calling application really needs different subsets of those columns on a regulat basis and was smart enough to call for only the columns it needs rather than using "select *." If the number of different subsets of columns needed are large, having it all in one table can avoid a lot of joins.
I will say this sounds like a great place to use the relatively new "sparse column" feature if you have a newer eddition of SQL Server.
But if there really are clumps of columns that are almost always called together as a group, it might be more effecient to break it up into multiple tables even if it is technically normalized.
How to find the clumps?
So, that brings up the question of how do you identify which groups of columns are normally populated together. I can't think of a way to write a single query to do this, and handwriting it would be tedious anyway. You can certainly write a loop in T-SQL that will query each possible group of columns and see which ones normally go together. It wouldn't be that hard using dynamic SQL.
Personally, I wouldn't do that though. I would bring those tables (or perhaps a simplified version that just showed each column as 0 for null and 1 for not null) into a programming language that had good matrix support, such as Python with Numpy or matlab or R. Then I would start doing matrix analysis to see which columns where normally null or 1 there. It would probably be much easier to write that code and would certainly run faster.
Of course, that does assume that you can bring the whole (possibly simplified) table into memory. But after simplifying it to a matrix of integers it would have to be huge for a modern computer not to be able to bring it into memory. And if that is still a real problem, just take a representative sample to do your analysis.
Of course, you might not even want to do that. A better question might be "Which columns are normally queried together?" For that, you might be able to look over the source code of the calling application, just ask your users outright which ones go together logically, or run a trace for a while to get the queries. The answer to which columns are queried together will tell you how to set up your tables to minimize the number of joins, not which ones are normally populated together. (Of course there's a good chance that there is a strong correlation between what is populated together and what is queried together, but that is not necessarily true.)
Should you add more tables or more columns
This basically boils down to the answers to the other questions. If you would wind up joining them all together anyway, and the tests are at least in a sense independent (so they can be together without violating 3NF) then you just might want to keep adding columns. It comes at a price, but then so does adding tables and requiring a lot more joins.
If the table is so large that it is causing problems, these columns are dependent on something else (so it would violate 3NF), or if you know they won't be quiered alot with your other columns (so there wouldn't be a join concern) then you probably want to add tables.
What data type are the results? Are they all the same or do they vary? Just out of curiosity, I see that this is an older question, have you resolved it by now and if so how?
I'll throw in my .02 for what it's worth. I would be hesitant to add more columns to an already wide table. Depending on the size of the columns, if the columns do get populated one day, you could have another issue with page splits. Without knowing the data types of the columns, you very well might have that issue already.
Shooting from the hip & without knowing anything further, I would probably lean towards having a test table to track each individual test, a results table that would track each result type and then a sample table that would store the results from each respective sample taken.
Something like this:
There are some overhead issues with this type of design, such as implicit conversions in the event that your result value isn't a VARCHAR. Something to just keep in mind.
This type of design would allow the addition of future tests & results without changing the underlying structure of the tables.
In either case, this will most likely take some serious work to turn this around, but it can be done.
Hope this helps!
answered Oct 17 '12 at 07:15 PM