I have a data structure used for storing information about animals. Types of data (columns) depends on animal type, only a few are common. There can be many animal types with different feature sets (one type can implement one or more feature sets).
What I want to achieve is: - to have a possibility to create new animal types, based on existing feature sets without modifying data structure later - to have a possibility to make simple selects of all data about animal with known animal_id or all animals with known type - to have a data control (to make sure that inserting data e.g. about flying_animal wouldn't be allowed for animal type 'Dog') - to Keep It as Simple and Stupid as possible (of course now I can write some stored procedures and triggers "before insert" to make sure that all data is correct, but maybe there is more elegant way?)
Existing data structure:
animal_id (Primary Key)
animal_type (Foreign Key from animal_types table)
Example data row: 1, 'Rex', 3, 2
animal_type_id : int (Primary Key)
animal_type_name : char
house_animal : boolean
flying_animal : boolean
wild_animal : boolean
Example data rows:
1, 'Wild parrots', false, true, true
2, 'Dogs', true, false, false
Every feature set is now implemented as table, e.g.:
Table: house_animal animal_id (Foreign Key from animals table) favorite_food last_vet_visit_date
Table: flying animal
animal_id (Foreign Key from animals table)
Current data structure makes things too complicated.
You can ask me why not to make one big table with all possible columns and one dictionary table with animal types. Of course I can, but that makes data control almost impossible, and how to define a new type with only some of the columns allowed..
And why wouldn't I create one table for every new type? Because I don't want to have e.g. 200 tables with only a few rows in each of them.
The short answer is to design to 3NF when using an RDBMS.
You may want to look at using Phyla or Class as base tables to avoid your 200+ scenarios and to limit a really wide table with tons of NULL values.
answered Nov 13, 2012 at 03:41 PM
You have a couple of options. Let me start with the enticing but entirely unsatisfactory one just to get it out of the way: Entity-Attribute-Value (EAV).
It's so sublime: two tables (one for attribute types and one for entity attributes) and you never have to make any other changes again. Unfortunately, if you ever want to pull that data out or query it by anything other than animal ID, it's a terrible design which performs miserably. You also lose the ability to do things like data type checking and subsequent foreign key checks.
You also wouldn't want to denormalize those records and put them into one big table, I agree. It's "easy" but lazy, difficult to understand, and can be very difficult to maintain (especially if you want to put in constraints like ensuring that non-flying animals don't have wingspan or max_speed entries).
So I would recommend keeping separate subtype tables for different classifications if you want users to be able to create relatively complex queries, like looking for all animals with 48"+ wingspans whose favorite food is beetles (so all animals who are in a type which has a record in house animal and flying animal). I understand your concern regarding having so many tables, but if you have a complex data model with a lot of entities (or subtypes, as you have), there are negative repercussions to making your data model easier than reality. EAV and one-big-table are such simplifications. You may end up with a large number of tables with relatively few rows, but if you are trying to collect specialized attributes about a large number of subtypes, that's where you'll want to end up. As the saying goes, make the model as simple as possible, but no simpler.
One way potentially to simplify without losing valuable domain knowledge could be to have higher-level classifications, going down into lower-level classifications only when necessary and possibly aggregating up to more generic classifications. A silly example would be if you had "winged mammals," "flying birds," and "flightless birds." If these three tables had similar attributes (wingspan for all three and maximum flight speed for the first two), they could be aggregated into "flying animals." Without knowing the complete details of your data domain, I couldn't tell you if you have any of those potential gains, but if you start seeing several subtypes which are very similar, there might be a way to collapse those into a single table (with another attribute or two to differentiate records). You don't lose necessary modeling complexity, but can get rid of some implementation complexity.
There is one exception to having all of those subclasses: if you have one application which uses this data and you do not want to filter using particular sets of data (e.g., yes, you want flying animal stats, but don't want to let users filter where max_speed > 15 MPH), you could store this as a single, formatted column (XML, JSON, that kind of thing) off of the animal table. This isn't necessarily the wrong answer, but only if you know that this data should never be part of a filter and instead should just be displayed to end users. When there's any doubt, though, I'd still break it out, and honestly, when I see something like this, I automatically cringe because for every time it's done correctly, there are a dozen examples of it being misused.