So far in this series on columnar, I have attempted to distinguish the columnar database from the row-wise database from the bottoms up, drawing distinctions in how the data is stored on disk, which is the major physical difference. Once the data is stored in its new and unusual way in a column-based DBMS, it, of course, must be retrieved.
In row-wise data retrieval, we frequently ask for multiple tables to be joined in the result set. Of course, in columnar (which works with SQL too by the way), there are joins, but there is also an analogous concept to joins in columnar within a single table. It's a process I call "gluing" and it's the process of pulling together the necessary columns from within a table that are needed by the query, either for selection or projection (where clause).
In row-wise, unnecessary columns must be removed from the result set because inevitably (unless the query is a SELECT *) more columns were retrieved than are interesting to the query. In columnar, since just enough columns were retrieved, that is not necessary. However, what is necessary is a strategy that determines when to glue (materialize) the columns together. There are two choices - early materialization or late materialization.
Early materialization retrieves all the columns necessary and glues them together up front, effectively creating a "row" with which to do the necessary processing. If you asked...
SELECT CUSTID, PRICE
FROM SALES
WHERE PRODID = 4 AND STOREID = 1
...it would pull CUSTID, PRICE, PRODID, and STOREID from their respective vectors, glue them together*, apply the WHERE clause and return the selected columns by trimming PRODID and STOREID from the "row". This strategy works best for complex joins.
The same query could also look into the PRODID vector and return a bitmap of the rows that met the condition (=4), do the same for STOREID (=1), then "AND" the bitmaps to determine which rows met both conditions. The columns that matched the rows in those vectors can then be pulled and glued together at the very end of the processing, just before presentation. This is called late materialization and it is usually best for less complex and non-join queries.
Ideally, the columnar optimizer is able to execute either strategy. Ideally, it can analyze the query and determine and execute the best strategy. It's important to analyze your workload in order to determine the value of this aspect of the columnar optimizer to your shop. And that will lead to the next entry, which will be on the fit of columnar databases in an enterprise.
*Recalling from this post how it carefully connects columns from the same row
Posted June 1, 2010 1:51 PM
Permalink | No Comments |




Leave a comment