Taxonomies and Data Models

Originally published 29 July 2010

Over many years, an intellectual discipline has grown up around data models. There have been conversations about entities and definitions, about cardinality and physical characteristics, and about granularity and keys. In a word, the world has learned a great deal about the intellectual activity of data modeling. What has gone unstated is that data modeling applies to structured data. Without anyone quite realizing it, the discipline of data modeling was created for structured, repetitive data.

Structured data of course refers to the world of transaction processing. In transaction processing, the same type of data occurs over and over again. When a bank withdrawal is made, the information is repeated from one transaction to the next, or at least the same type of information is repeated. Standard DBMSs are geared to handling the repetitive occurrence of data. That is why this kind of data is called structured data.
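
As a rough illustration of what "repetitive" means here, the sketch below models a bank withdrawal as a fixed record type in Python. The field names (account_id, amount, timestamp) and the sample values are assumptions made up for the example; they do not come from any particular banking system.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class Withdrawal:
    # Hypothetical fields, chosen only for illustration.
    account_id: str
    amount: float
    timestamp: datetime


# Made-up sample transactions.
withdrawals = [
    Withdrawal("ACC-001", 50.0, datetime(2010, 7, 1, 9, 30)),
    Withdrawal("ACC-002", 120.0, datetime(2010, 7, 1, 9, 45)),
    Withdrawal("ACC-001", 20.0, datetime(2010, 7, 2, 14, 5)),
]

# The values differ from one transaction to the next, but the set of
# fields is identical for every occurrence.
field_names = [f.name for f in fields(Withdrawal)]
```

Every record has the same shape, which is what lets a DBMS store and index the occurrences uniformly.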

Now that the world recognizes the wealth of information to be found in unstructured textual information, it is natural to ask this question: Does data modeling apply to unstructured data?

The answer is that INDIRECTLY data modeling applies to unstructured data. But DIRECTLY data modeling does not apply to unstructured data. An explanation is in order.

Consider one very significant difference between structured and unstructured data. Structured data can be changed, but unstructured text cannot be changed. Suppose that an analyst building a data model in the structured world discovers that a piece of data is missing. The analyst has the power to add the missing data to the system specifications, and the structured system will then include and handle it.

But the analyst has no such power when it comes to unstructured data. An analyst who disagrees with a piece of text cannot go and change it. In some cases, changing the text is actually illegal. In other cases, it is merely unethical. So when an error is found in the unstructured textual environment, the analyst cannot go in and correct the data.

Because the text itself cannot be changed, the analyst instead uses taxonomies to organize and understand textual, unstructured data. A taxonomy is merely a large classification of data. A simple taxonomy might be:

A car is:
– a Porsche
– a Honda
– a Ford
– a Volkswagen

Taxonomies do not change text, but they do classify text.
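
A minimal sketch of that idea in Python follows. It assumes the car taxonomy above is held as a plain dictionary and that matching is done word by word, ignoring case; the function name classify and the sample sentence are hypothetical, and real text analytics would need far more than this.

```python
import re

# The simple car taxonomy from above: one category and its member terms.
CAR_TAXONOMY = {"car": ["Porsche", "Honda", "Ford", "Volkswagen"]}


def classify(text, taxonomy):
    """Return (word, category) pairs for every taxonomy term found in text."""
    # Build a lowercase lookup from each member term to its category.
    lookup = {
        term.lower(): category
        for category, terms in taxonomy.items()
        for term in terms
    }
    matches = []
    for word in re.findall(r"[A-Za-z]+", text):
        category = lookup.get(word.lower())
        if category is not None:
            matches.append((word, category))
    return matches


original = "She traded her Honda for a Porsche last spring."
labels = classify(original, CAR_TAXONOMY)
# labels == [("Honda", "car"), ("Porsche", "car")]
```

The classification is kept alongside the text as a separate list of labels. The sentence itself is never edited.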

A word of warning is in order here. Taxonomies are deceptively simple. But when you get into implementing them, you find that they have their own set of complexities. For example:
  • A word may be classified by many taxonomies, not just one.
  • Taxonomies may contain recursive relationships. Care must be taken with recursive relationships.
  • Taxonomies may have their own pecking order. One taxonomy may be more appropriate than another taxonomy.
  • Some taxonomies may be inappropriate for certain text, and so forth.
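
Three of these complexities can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two taxonomies, their ordering, the parent links, and the function names are all hypothetical, and ranking taxonomies by list position is only one simple way to express a pecking order.

```python
# Hypothetical taxonomies, listed in pecking order: earlier entries win.
TAXONOMIES = [
    ("car", {"Ford": "car", "Honda": "car", "Porsche": "car"}),
    ("surname", {"Ford": "person", "Porsche": "person"}),
]


def all_classifications(word):
    """Every (taxonomy, category) pair that classifies the word."""
    return [(name, terms[word]) for name, terms in TAXONOMIES if word in terms]


def preferred_classification(word):
    """The classification from the highest-ranked taxonomy, or None."""
    matches = all_classifications(word)
    return matches[0] if matches else None


def ancestors(term, parent_of):
    """Walk child -> parent links, raising if a term repeats (a cycle)."""
    seen = [term]
    while term in parent_of:
        term = parent_of[term]
        if term in seen:
            raise ValueError(f"recursive relationship at {term!r}")
        seen.append(term)
    return seen[1:]


# Hypothetical parent links; the second mapping contains a deliberate loop.
PARENTS = {"Porsche": "car", "car": "vehicle"}
LOOPED = {"car": "vehicle", "vehicle": "car"}
```

Here all_classifications("Ford") finds the word in both taxonomies, preferred_classification("Ford") keeps only the higher-ranked one, and ancestors("car", LOOPED) raises an error instead of looping forever. A word that no taxonomy covers simply gets no classification.
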

In many ways, then, structured data has a data model and unstructured data has a taxonomy. By analogy, a data model is to structured data what a taxonomy is to unstructured data.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.
