Originally published 2 June 2010

Venn diagrams are familiar to most data management professionals. They are a symbolic notation that describes how things can belong (or not belong) to two different classes. Many books on mathematics and modern logic present Venn diagrams as if they apply to all of reality. I have never seen data discussed in any of these publications, and I suspect that the authors simply assumed that data has no choice but to fall into the order that Venn diagrams prescribe. Yet is this really the case? Or is there something special about data such that its characteristics cannot be fully captured by Venn diagrams? After many years of reconciling datasets, I have begun to think that data really is different.

Mr. John Venn (1834-1923) lectured in moral sciences at Cambridge, England, and in 1881 he wrote a book called Symbolic Logic in which he introduced his diagrammatic technique. Venn was chiefly interested in propositional logic, but today most people are more familiar with the extended use of Venn diagrams in set theory.

Let us illustrate this basic case. Suppose we have a source and target with the Customer records shown in Figure 1.

Figure 1: Sample Source and Target Customer Tables

From this, we can see that Aristotle, Diogenes, and Plato are in both the source and the target, but Socrates and Thales are only in the source. Likewise, Marx and Engels exist only in the target. We can use a Venn diagram to illustrate this situation, showing the number of records in the overlap, in the source only, and in the target only, as in Figure 2.

Figure 2: Venn Diagram with Record Counts for Tables in Figure 1
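
The counts in this basic case can be reproduced with ordinary set operations. The following is a minimal sketch in Python, assuming each table reduces to a set of customer names with one entry per customer; the variable names are illustrative only.

```python
# Customer names from the basic case, treated as sets (one entry per customer).
source = {"Aristotle", "Diogenes", "Plato", "Socrates", "Thales"}
target = {"Aristotle", "Diogenes", "Plato", "Marx", "Engels"}

overlap = source & target        # customers in both source and target
source_only = source - target    # customers in the source only
target_only = target - source    # customers in the target only

print(len(source_only), len(overlap), len(target_only))  # 2 3 2
```

Every customer lands in exactly one of the three regions, which is what makes the Venn diagram work here.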

Figure 3: Source at Account Level and Target at Customer Level

Now everything is more complex. Five records in the source overlap with three records in the target. Two records in the target still do not overlap with anything in the source, and five records in the source do not overlap with any record in the target. Of the latter, two are duplicates for the same customer and two are nulls. Let us try to put all this into a Venn diagram. It is not easy, and the best I can do is shown in Figure 4.

Figure 4: Attempt at Venn Diagram for Records in Figure 3

Figure 4 is not a Venn diagram in any real sense. It violates the principles of a Venn diagram for two main reasons: the existence of duplicates, and the existence of nulls. Neither of these occurs in reality, which is what the Venn diagram is based on, but both occur in data.
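
A short sketch shows where the set model loses information. The actual rows of Figure 3 are not reproduced here, so the source list below is a hypothetical arrangement chosen only to match the counts given in the text: ten account-level records, of which two are duplicates for one unmatched customer and two have a null customer.

```python
# Hypothetical account-level source: one entry per record; None marks a null customer.
source_records = [
    "Aristotle", "Aristotle", "Diogenes", "Diogenes", "Plato",  # customers also in the target
    "Socrates", "Socrates", "Thales",                           # not in the target; Socrates twice
    None, None,                                                 # records with a null customer
]
target_records = ["Aristotle", "Diogenes", "Plato", "Marx", "Engels"]

source_set = set(source_records)
target_set = set(target_records)

# A set keeps one entry per distinct value, so duplicates and repeated nulls vanish.
print(len(source_records), len(source_set))  # 10 6

# Counting at the record level recovers the figures quoted in the text.
matched_source = [r for r in source_records if r in target_set]
unmatched_source = [r for r in source_records if r not in target_set]
print(len(matched_source), len(unmatched_source))  # 5 5
```

Python happens to treat the two `None` values as equal and keeps one of them in the set. In SQL a null is not equal to anything, itself included, so there is not even that much to hold on to.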

Lewis Carroll (his real name was the Rev. Charles Dodgson) wrote a book in 1896, also called Symbolic Logic, and wrote letters to Mr. Venn about how Venn would use his diagrams to represent combinations of various propositions. One can only imagine how mortified Venn must have been to be asked to represent premises like "Some mermaids smoke cigars" and to deduce conclusions such as "Some persons who are not gamblers are not philosophers." However, Venn did reply to Carroll and produced some remarkable variants of his diagrams that Carroll included in his book. Even so, Carroll did identify that there are things that are in neither of the two classes (the source and target in our case) that the Venn diagrams cover. Carroll created an alternative diagrammatic method that has never been popularized but did cope with nulls. As Carroll put it:

My method...differs from his [Venn's] method in assigning a closed area to the Universe of Discourse, so that the Class [the nulls in Fig. 4] which, under Mr. Venn's liberal sway has been ranging at will through Infinite Space, is suddenly dismayed to find itself "cabin'd, cribb'd, confined" in a limited Cell like any other Class!

Carroll taught logic and mathematics at Oxford University, and knew what he was doing. Therefore, we can conclude that Venn diagrams do not support nulls, and since nulls are a persistent feature of data, Venn diagrams do not support data.

It could also be objected that we should only consider the distinct values in the source, just as distinct values are populated into the target. But, again, why should we? We are looking at records, which are a fact, rather than distinct values, which are an abstraction from this fact. We have to reconcile records between the source and target. This is not at all like the basic idea that underlies a Venn diagram, where a single instance of a thing can belong to two or more distinct classes. In data, a single instance of a thing can be represented many times, so there really is nothing corresponding to a single instance of a thing as conceived for a Venn diagram. Once again, Venn diagrams do not work for data.
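
One way to keep a reconciliation at the record level is to count records per customer on each side instead of testing set membership. The sketch below uses Python's `collections.Counter` as a multiset. It is an illustration, not a method proposed in this article, and the records are hypothetical, chosen only to be consistent with the counts quoted for Figure 3.

```python
from collections import Counter

# Hypothetical records; None marks a null customer.
source_records = ["Aristotle", "Aristotle", "Diogenes", "Diogenes", "Plato",
                  "Socrates", "Socrates", "Thales", None, None]
target_records = ["Aristotle", "Diogenes", "Plato", "Marx", "Engels"]

source_counts = Counter(source_records)
target_counts = Counter(target_records)

# For every customer value seen on either side, pair up the record counts.
# A Counter returns 0 for a missing key, so absent customers show as 0.
all_keys = sorted(set(source_counts) | set(target_counts), key=str)
comparison = {key: (source_counts[key], target_counts[key]) for key in all_keys}

for key, (n_source, n_target) in comparison.items():
    print(key, n_source, n_target)
```

Each line of output carries two record counts per customer value, which is exactly the information a two-circle diagram has no place for.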

SOURCE: Why Venn Diagrams Don't Work for Data

**Recent articles by Malcolm Chisholm**
