This is a continuation of my thread on columnar databases. In the last entry, as a foundation, I introduced the row-wise data page. In this post, I will address the changes at the data page level that occur for a columnar database.
In the row-wise data page, all of the columns are found sequentially within the table for each row and rows are stored consecutively. In a columnar database, each data page is populated with data from only one COLUMN. I will refer to this data structure as Sybase IQ does, as a vector. So, each vector contains a column's worth of data. Vectors will take up more or less pages depending on the size of the column. A table will have 1 vector per column.
Queries will usually need multiple columns from a table. So those columns must be fit together appropriately. It would not work to put column 1 from row 1 together with column 2 from row 2 and column 3 from row 3 and present it as a "row". The columns have to be put back together (what I call "glued") appropriately. To facilitate this, the columns are stored in the order of the rows they belong to. Hence, a row ID map at the end of the page is not required in columnar and consequently, there can be less wasted space.*
For example purposes, I will use a 3-column table with a customer number, first name and last name columns. The rows contain:
1. 123-William-McKnight
2. 456-Joe-Smith
3. 789-Joe-Doe
There are 4 ways that columnar databases store data:
1. For each column, the row ID is stored alongside the column value.
a. Vector 1: 1-123,2-456,3-789
b. Vector 2: 1-William,2-Joe,3-Joe
c. Vector 3: 1-McKnight, 2-Smith, 3-Doe
2. For each column, a row "range" is used where applicable.
a. Vector 1: 1-123,2-456,3-789
b. Vector 2: 1-William,2-3-Joe
c. Vector 3: 1-McKnight, 2-Smith, 3-Doe
3. For each column, the row is indicated by position.
a. Vector 1: 123,456,789
b. Vector 2: William, Joe, Joe (or William, Joe (2))
c. Vector 3: McKnight, Smith, Doe
4. Bitmap representation with abstracted values corresponding to entries in a separate map structure are used.
The first way is wasteful since the row ID is unnecessary to store since it is implied by the order of the values. The second way does not improve upon that, but does compress space when there are recurring values in sequential rows. The third ways shows removal of the row ID. Notice Vector 2 may or may not actually use the range representation for the recurring values. It's usually better if it does. Finally, while there is nothing new about columnar bitmaps (vs. row-wise bitmaps), the acceptable cardinality for the fit of a column as a bitmap is much higher in columnar than row-wise. In columnar, up to 1500-2000 values is considered "low cardinality" and "worth it" to apply bitmapping to (as opposed to < 50-100 in row-wise DBMS). This is useful when you have recurring values that are not stored in sequential records. A column like "country" can reduce its size 30-fold with bitmap encoding.
In some columnar DBMS, the columns/vectors can utilize different storage methods. The 3rd and 4th ways shown here are best in a mixed strategy.
Again, columnar is attempting to give you more bang for the buck in your I/Os when you are seeking a minimal set of columns from the table. In the next entry, I will describe why getting the most out of your I/Os has becoming increasingly very important.
*Wasted space can occur in row-wise when you run out of Row IDs before storage space on the page. Since each row must have a map entry, the page will be considered full when the Row ID limit is reached.