
Drowning in a Sea of Nulls

An application I inherited tracks lab test results performed on material samples. Data is stored in a single table (tblSampleData) with a primary key of SampleID and 235 columns representing potential test results. The problem is that only a few tests are performed per sample, so each row contains over 200 nulls. Actually, there is a second similar table (tblSampleData2) with another 215 primarily null columns and a primary key of SampleID. The two tables have a one-to-one relationship and most SampleIDs have some data in both tables. For every SampleID, however, there are easily 400 null columns.

Is this bad database design? If so, which normal form rule is broken? How can I query this table to identify which groups of columns are typically filled together with data? My goal would be to have, say, 45 tables with 10 columns each and far fewer null values. How can I do this? How do I avoid breaking existing applications?

These tables have about 200,000 sample records so far. Users are asking me to add more columns for more tests, but I'd rather build a new table. Is this wise?

asked Jul 01, 2010 at 09:16 AM by DeveloperDan

4 answers (oldest first)

It really depends on the data for a sample. Say, for instance, there is a test group called Analysis 1234 that produces 10 result columns. You might create a related table called Analysis1234, put those 10 columns in it, and relate it through a foreign key back to tblSampleData.
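A minimal sketch of that idea (the table name comes from the example above; the result column names, types, and constraint names are assumptions):

CREATE TABLE dbo.Analysis1234 (
    SampleID INT NOT NULL
        CONSTRAINT PK_Analysis1234 PRIMARY KEY
        CONSTRAINT FK_Analysis1234_Sample REFERENCES dbo.tblSampleData (SampleID),
    Result01 DECIMAL(10, 4) NULL,
    Result02 DECIMAL(10, 4) NULL
    -- ... Result03 through Result10
);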

If that is not the case, you might look at the Entity-Attribute-Value (EAV) model. Search for EAV tables on Bing.com.

Thomas
answered Oct 17, 2012 at 06:05 PM by TheSmilingDBA

I really wouldn't go for 45 tables, each with 10 columns. I would do the job properly and have a very narrow table:

SampleID
TestID
Result
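For illustration, a minimal sketch of that narrow table (the table name, data types, and key are assumptions, not part of the answer):

CREATE TABLE dbo.SampleResult (
    SampleID INT NOT NULL,         -- references tblSampleData.SampleID
    TestID   INT NOT NULL,         -- references a lookup table of tests
    Result   VARCHAR(50) NULL,     -- or a more specific type, if all results share one
    CONSTRAINT PK_SampleResult PRIMARY KEY (SampleID, TestID)
);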

If you do any more than that, you're going to have to write code to search multiple tables and maintain different rules about what goes where, and it's going to be just as big a maintenance headache (if not a bigger one) as the one you have now.

Yes, doing this is going to break existing code, as would any change to the number of tables. The easiest way around this is to encapsulate data access in stored procedures in the short term and migrate client code to use those procedures; then, once all code is migrated, you can safely change the back-end database structures with no impact on the client code, as long as you remember to change the stored procedures at the same time!
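A rough sketch of that encapsulation step (the procedure name and the test columns are hypothetical; only the wide table name comes from the question):

CREATE PROCEDURE dbo.usp_GetSampleResults
    @SampleID INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Today this reads the existing wide table; after the migration only this
    -- body changes to read the new structure, and callers are unaffected.
    SELECT SampleID, Test1, Test2 /* , ... */
    FROM dbo.tblSampleData
    WHERE SampleID = @SampleID;
END;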

Have fun.

PS: In terms of Normalization / Normal Forms, technically, given that you allow null values, you violate the original definition of 1NF!
answered Oct 17, 2012 at 07:04 PM by ThomasRushton ♦

That's a great point, and this could be a great design depending on how the data is used. If the normal question is "What were the results for this test on this sample?" this is perfect.

If the normal question is more like "I want to do statistical analysis on the correlations between test1, test5, and test7 across all samples where test8 gave a result of 2", then the "column per test" design might make it easier to write the queries and give you faster results.
Oct 17, 2012 at 07:53 PM TimothyAWiseman
Good point, but you could always expose the column-per-test reporting version through a VIEW, and leave the raw data as is.
Oct 18, 2012 at 11:24 AM ThomasRushton ♦
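For reference, a reporting view like that might look roughly like this, assuming the narrow SampleResult sketch above and made-up test IDs:

CREATE VIEW dbo.vw_SampleResultsWide
AS
SELECT SampleID,
       MAX(CASE WHEN TestID = 1 THEN Result END) AS Test1,
       MAX(CASE WHEN TestID = 5 THEN Result END) AS Test5,
       MAX(CASE WHEN TestID = 7 THEN Result END) AS Test7
FROM dbo.SampleResult
GROUP BY SampleID;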

Very true, but I've written views like that. They can get complicated and come at a performance price.

Under certain use scenarios I think your suggestion is brilliant and the way to go. But I can see other scenarios where it makes sense to break it out, especially if they call for different datatypes (some tests may need float, others may need varchar, others may be straight integers).
Oct 18, 2012 at 04:15 PM TimothyAWiseman

I'll take the questions out of order.

Normal Forms

As for the normal forms, technically any nullable column violates the first normal form and so cannot be considered normalized. However, many people consider that requirement controversial and use a definition of normalization that permits nulls in 1NF through 3NF. Putting a table into 4th or 5th normal form, however, will address multivalued dependencies and create a situation where there is never a need for nulls.

As 3NF is often redefined (to allow nulls), there is nothing inherently wrong with having either a large number of columns or numerous null columns. Such a design is unusual, but as long as everything depends, to borrow a phrase, "on the key, the whole key, and nothing but the key", it can be considered normalized as the term is normally used in practice.

Bad Design?

Maybe. It depends on how it is being used. There are certainly disadvantages to having huge numbers of columns that go beyond whether the table is technically normalized or not. But it could be a perfectly acceptable design, especially if the calling application really needs different subsets of those columns on a regular basis and is smart enough to ask for only the columns it needs rather than using "select *". If the number of different subsets of columns needed is large, having it all in one table can avoid a lot of joins.

I will say this sounds like a great place to use the relatively new "sparse column" feature if you have a newer edition of SQL Server.
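A quick sketch of what that looks like (sparse columns are available from SQL Server 2008 onward; the table and column names here are made up):

CREATE TABLE dbo.SampleDataSparse (
    SampleID INT NOT NULL PRIMARY KEY,
    Test001  DECIMAL(10, 4) SPARSE NULL,   -- sparse columns take no storage when NULL
    Test002  DECIMAL(10, 4) SPARSE NULL,
    AllTests XML COLUMN_SET FOR ALL_SPARSE_COLUMNS   -- optional column set over the sparse columns
);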

But if there really are clumps of columns that are almost always called together as a group, it might be more efficient to break it up into multiple tables even if it is technically normalized.

How to find the clumps?

So that brings up the question of how you identify which groups of columns are normally populated together. I can't think of a way to write a single query to do this, and hand-writing it would be tedious anyway. You can certainly write a loop in T-SQL that will query each possible group of columns and see which ones normally go together. It wouldn't be that hard using dynamic SQL.
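Here is the flavor of that dynamic SQL approach (a sketch only; it assumes dbo.tblSampleData from the question and generates one count per pair of columns, so with 235 columns expect it to be slow):

DECLARE @sql NVARCHAR(MAX) = N'';

-- Build one count per pair of result columns: how many rows have both populated.
SELECT @sql = @sql
    + N'SELECT ''' + c1.name + N''' AS Col1, ''' + c2.name + N''' AS Col2, '
    + N'COUNT(*) AS BothPopulated FROM dbo.tblSampleData WHERE '
    + QUOTENAME(c1.name) + N' IS NOT NULL AND '
    + QUOTENAME(c2.name) + N' IS NOT NULL UNION ALL '
FROM sys.columns AS c1
JOIN sys.columns AS c2
  ON  c2.object_id = c1.object_id
  AND c2.column_id > c1.column_id
WHERE c1.object_id = OBJECT_ID(N'dbo.tblSampleData')
  AND c1.name <> N'SampleID'
  AND c2.name <> N'SampleID';

-- Trim the trailing UNION ALL and run it; pairs with high counts are the "clumps".
SET @sql = LEFT(@sql, LEN(@sql) - LEN(N'UNION ALL'));
EXEC sys.sp_executesql @sql;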

Personally, I wouldn't do that though. I would bring those tables (or perhaps a simplified version that just showed each column as 0 for null and 1 for not null) into a programming language that has good matrix support, such as Python with NumPy, MATLAB, or R. Then I would start doing matrix analysis to see which columns are normally 0 or 1 together. It would probably be much easier to write that code, and it would certainly run faster.

Of course, that does assume that you can bring the whole (possibly simplified) table into memory. But after simplifying it to a matrix of integers it would have to be huge for a modern computer not to be able to bring it into memory. And if that is still a real problem, just take a representative sample to do your analysis.

Of course, you might not even want to do that. A better question might be "Which columns are normally queried together?" For that, you might be able to look over the source code of the calling application, just ask your users outright which ones go together logically, or run a trace for a while to get the queries. The answer to which columns are queried together will tell you how to set up your tables to minimize the number of joins, not which ones are normally populated together. (Of course there's a good chance that there is a strong correlation between what is populated together and what is queried together, but that is not necessarily true.)
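If setting up a trace is more than you want to take on, one lighter-weight alternative (a sketch; it only sees statements still in the plan cache) is to pull the most frequently executed statements that mention the table:

SELECT TOP (50) st.text AS QueryText, qs.execution_count
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%tblSampleData%'
ORDER BY qs.execution_count DESC;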

Should you add more tables or more columns?

This basically boils down to the answers to the other questions. If you would wind up joining them all together anyway, and the tests are at least in a sense independent (so they can be together without violating 3NF) then you just might want to keep adding columns. It comes at a price, but then so does adding tables and requiring a lot more joins.

If the table is so large that it is causing problems, if these columns are dependent on something else (so it would violate 3NF), or if you know they won't be queried a lot with your other columns (so there wouldn't be a join concern), then you probably want to add tables.

answered Oct 17, 2012 at 07:11 PM by TimothyAWiseman

+1 for the sparse columns, if the table cannot be converted to 3NF, and of course for the good explanation.
Oct 17, 2012 at 08:36 PM Pavel Pawlowski

What data type are the results? Are they all the same, or do they vary? Just out of curiosity, since I see that this is an older question: have you resolved it by now, and if so, how?

I'll throw in my .02 for what it's worth. I would be hesitant to add more columns to an already wide table. Depending on the size of the columns, if the columns do get populated one day, you could have another issue with page splits. Without knowing the data types of the columns, you very well might have that issue already.

Shooting from the hip & without knowing anything further, I would probably lean towards having a test table to track each individual test, a results table that would track each result type and then a sample table that would store the results from each respective sample taken.

Something like this:

CREATE TABLE #Test(
 ID INT IDENTITY(1,1) NOT NULL
 , TestDescription VARCHAR(10) NOT NULL
)
GO
CREATE TABLE #Result(
 Result_ID INT IDENTITY(1,1) NOT NULL
 , ResultDescription VARCHAR(10) NOT NULL
)
GO
CREATE TABLE #Sample(
 Sample_ID INT IDENTITY(1,1) NOT NULL
 , ID INT NOT NULL            -- references #Test.ID
 , Result_ID INT NOT NULL     -- references #Result.Result_ID
 , ResultValue VARCHAR(10) NOT NULL
)

There are some overhead issues with this type of design, such as implicit conversions in the event that your result value isn't a VARCHAR. Something to just keep in mind.

This type of design would allow the addition of future tests & results without changing the underlying structure of the tables.
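To illustrate that point (using the temp tables above; the values are made up), adding a new test later is just a row, not a schema change, and reads are plain joins:

INSERT INTO #Test (TestDescription) VALUES ('pH');
INSERT INTO #Result (ResultDescription) VALUES ('Numeric');

SELECT t.TestDescription, r.ResultDescription, s.ResultValue
FROM #Sample AS s
JOIN #Test   AS t ON t.ID = s.ID
JOIN #Result AS r ON r.Result_ID = s.Result_ID;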

In either case, it will most likely take some serious work to turn this around, but it can be done.

Hope this helps!
answered Oct 17, 2012 at 07:15 PM by JohnM