Blog: William McKnight Subscribe to this blog's RSS feed!

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author >

William is President of McKnight Consulting Group. His practice focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures.

William has more than 20 years of information management experience, nearly half of which was gained in IT leadership positions, dealing firsthand with the challenging issues his clients now face.  His IT and consulting teams have won best practice competitions for their implementations. In 11 years of consulting, he has been a part of 150 client programs worldwide, has over 300 articles, white papers and tips in publication and is a frequent international speaker. William is the author of 90 Days to Success in Consulting. Contact William at william@williammcknight.com or (214) 514-1444.

Editor's Note: More articles, news and resources are available in William's BeyeNETWORK expert channel. Be sure to visit today.

Do we need business intelligence (BI) tools to be successful?
Get William's take on whether you need business intelligence (BI) tools to be successful and learn how about low cost BI including cheap and open source BI tools.

Fastest way to learn business intelligence (BI)
Find out how you can learn business intelligence (BI), get business intelligence training and discover why analytics is the key to marketing efforts in this BI tip.

Data architect careers: The benefits of working at a System Integrator
Working as a data architect at a consultancy or System Integrator can be challenging, but has benefits. Find out the career value of working at a System Integrator, as advised by William McKnight.


Posted August 20, 2010 3:27 PM
Permalink | No Comments |

Now that the B-Eye Network is fully ensconced in the Tech Target family, I wanted to bring to your attention the "tips" I've been providing to searchdatamanagement.com (also part of Tech Target) over the years.

A consolidated list of tips can be found here on MDM/DW/BI topics such as ETL order, slowly changing dimensions, quick BI, customer matching, data quality, data modeling, standards, using cubes, star vs. snowflake, data mining techniques, planning, project management, and archiving data.

A few favorites are:

Should business intelligence analysis tie in with operational reports?
Read about William's rules for tying business intelligence (BI) system content back to the sources and what "percentage of tolerance" he recommends.

Should a data steward have direct SQL access for reporting purposes?
Find out when a data steward should have direct SQL access and how to request SQL access to get data directly from the data warehouse for reporting and analytical purposes.

Who should the business intelligence team report to?
Learn which department the business intelligence (BI) team should report to -- and why.

I've recently begun the Data Warehouse Strategy Q&A Section.  You can pose questions to me there.  Here are my first two:

How will the Sun acquisition affect Oracle data warehouse strategy?
Find out how Oracle's acquisition of Sun may affect Oracle data warehouse strategy and what it means for users. Also, see what role packaged BI software plays in Oracle's strategy.

When to use a DBMS and the future of data warehousing data management
Find out when you should use a DBMS with your data warehouse, and learn about the future of data warehousing data management and current state of data warehouse architecture.

 


Posted August 13, 2010 8:50 AM
Permalink | No Comments |

Links to the Columnar Database series:

 

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This final installment in my series on columnar will focus on what, to many, is the bottom line - when to use one.  I have built up to this point from the ground up - starting with the major difference between columnar and row-based - the orientation of the data in its blocks/pages.  Understanding those physical differences is the foundation for understanding the queries and workloads that will perform better or worse, and by how much. To be sure, you would use columnar for selective improved performance of data retrieval queries.

Seldom are such platform decisions based on a few queries.  The overall workload would need to be analyzed.  There may be a view to split the workload accordingly.  This may mean duplicating some of the data across columnar and row-based systems. 

So, anything is possible when it comes to system selection.  It all depends on the value of performance.  Since columnar databases will not run an enterprise, they are up against entrenched row-based skill sets and mindsets.  It must be the value of performance that overrides this if columnar is going to be considered.  Many times this comes after exhaustive tuning of the row-based DBMS. 

Based on theory, client-side experience in platform selection and building information management systems, personal testing and benchmarks, I offer these observations.  Clearly, these must be query observations and not workload observations.  Block reads are a primary barometer of performance, although you have the post-read "gluing" process I spoke of in an earlier post to deal with in columnar.  You also have prefetch strategies in both row-based and column-based, which speeds up scans when it is triggered.  Also the row makeup (size, number/type columns), as discussed earlier, factors into what you're getting with the block reads.

Some things are clear however. 

Posted June 24, 2010 10:59 AM
Permalink | No Comments |

So far in this series on columnar, I have attempted to distinguish the columnar database from the row-wise database from the bottoms up, drawing distinctions in how the data is stored on disk, which is the major physical difference.  Once the data is stored in its new and unusual way in a column-based DBMS, it, of course, must be retrieved.

In row-wise data retrieval, we frequently ask for multiple tables to be joined in the result set.  Of course, in columnar (which works with SQL too by the way), there are joins, but there is also an analogous concept to joins in columnar within a single table.  It's a process I call "gluing" and it's the process of pulling together the necessary columns from within a table that are needed by the query, either for selection or projection (where clause).

In row-wise, unnecessary columns must be removed from the result set because inevitably (unless the query is a SELECT *) more columns were retrieved than are interesting to the query.  In columnar, since just enough columns were retrieved, that is not necessary.  However, what is necessary is a strategy that determines when to glue (materialize) the columns together.  There are two choices - early materialization or late materialization.

Early materialization retrieves all the columns necessary and glues them together up front, effectively creating a "row" with which to do the necessary processing.  If you asked...

SELECT CUSTID, PRICE

FROM SALES

WHERE PRODID = 4 AND STOREID = 1


Posted June 1, 2010 1:51 PM
Permalink | No Comments |

This is a continuation entry in my series on columnar databases.  Today, I will talk about how columnar databases address the modern architectural bottlenecks.

Row-based MPP is a large cluster of thousands of processors.  Each node can be SMP shared-nothing or shared-disk. The interconnect connects the nodes, allowing for processing to span nodes.  Row-based MPP architectures support some of the largest and most successful data warehouses in the world.  And this is not going to change any time soon.

In row-based MPP, one of the most important design choices we make include how to distribute data amongst the nodes.  Typically some form of 'randomization' is used, like hash or round-robin.  We also have slowly made our block sizes larger over the years and the reason is the block is the unit of I/O (notwithstanding prefetch) and we want to get as much data as possible in a single I/O.  I/Os have become the bottleneck in our environments. 

While disk density has gone up significantly in the last 20 years or so, packing much more data down into smaller spaces, I/O is still limited by the physical head movement of the arm.  Physics simply won't allow such a small component to move much faster without the arm flying right off its handle (which could ultimately be why solid state disk and phase change memory become the norm over time).

Other than the larger block sizes, our designs have not changed much over the years to accommodate this I/O-bound reality.  We have tried to make the OLTP database more analytic with specialized indices, OLAP cubes, summary tables and partitioning, but we would need hundreds of drives to keep 1 CPU truly busy in a robust, complex, utilized data warehouse environment.  It's not feasible.  Incidentally, because of this bottleneck, random I/O has sped up much more slower over the years than sequential I/O, which doesn't require nearly as much head movement.

Naturally, on a project-by-project basis, you just make the best of the systems you have to work with.  The reason to discuss this in my columnar series is because columnar is one technique that makes the most of the I/O.

Regardless of row- or column-orientation, the data is stored on disk and there are various gates that store and process data until it gets to the CPU.  Each works on ever-decreasing data sizes.  Think of it as a pyramid.

pyramid.JPG

 


Posted May 17, 2010 9:01 PM
Permalink | No Comments |

This is a continuation of my thread on columnar databases.  In the last entry, as a foundation, I introduced the row-wise data page.  In this post, I will address the changes at the data page level that occur for a columnar database. 

In the row-wise data page, all of the columns are found sequentially within the table for each row and rows are stored consecutively.  In a columnar database, each data page is populated with data from only one COLUMN.  I will refer to this data structure as Sybase IQ does, as a vector.  So, each vector contains a column's worth of data.  Vectors will take up more or less pages depending on the size of the column.  A table will have 1 vector per column.

Queries will usually need multiple columns from a table.  So those columns must be fit together appropriately.  It would not work to put column 1 from row 1 together with column 2 from row 2 and column 3 from row 3 and present it as a "row".  The columns have to be put back together (what I call "glued") appropriately.  To facilitate this, the columns are stored in the order of the rows they belong to.  Hence, a row ID map at the end of the page is not required in columnar and consequently, there can be less wasted space.*

For example purposes, I will use a 3-column table with a customer number, first name and last name columns.  The rows contain:

1.       123-William-McKnight

2.       456-Joe-Smith

3.       789-Joe-Doe

There are 4 ways that columnar databases store data:

1.       For each column, the row ID is stored alongside the column value.

a.       Vector 1:  1-123,2-456,3-789

b.      Vector 2: 1-William,2-Joe,3-Joe

c.       Vector 3: 1-McKnight, 2-Smith, 3-Doe

2.       For each column, a row "range" is used where applicable.

a.       Vector 1: 1-123,2-456,3-789

b.      Vector 2: 1-William,2-3-Joe

c.       Vector 3: 1-McKnight, 2-Smith, 3-Doe

3.       For each column, the row is indicated by position.

a.       Vector 1: 123,456,789

b.      Vector 2: William, Joe, Joe (or William, Joe (2))

c.       Vector 3: McKnight, Smith, Doe

4.       Bitmap representation with abstracted values corresponding to entries in a separate map structure are used.

The first way is wasteful since the row ID is unnecessary to store since it is implied by the order of the values.  The second way does not improve upon that, but does compress space when there are recurring values in sequential rows.  The third ways shows removal of the row ID.  Notice Vector 2 may or may not actually use the range representation for the recurring values.  It's usually better if it does.  Finally, while there is nothing new about columnar bitmaps (vs. row-wise bitmaps), the acceptable cardinality for the fit of a column as a bitmap is much higher in columnar than row-wise.  In columnar, up to 1500-2000 values is considered "low cardinality" and "worth it" to apply bitmapping to (as opposed to < 50-100 in row-wise DBMS).  This is useful when you have recurring values that are not stored in sequential records.  A column like "country" can reduce its size 30-fold with bitmap encoding.

In some columnar DBMS, the columns/vectors can utilize different storage methods. The 3rd and 4th ways shown here are best in a mixed strategy.

Again, columnar is attempting to give you more bang for the buck in your I/Os when you are seeking a minimal set of columns from the table.  In the next entry, I will describe why getting the most out of your I/Os has becoming increasingly very important.

*Wasted space can occur in row-wise when you run out of Row IDs before storage space on the page.  Since each row must have a map entry, the page will be considered full when the Row ID limit is reached.

 


Posted May 6, 2010 9:33 PM
Permalink | No Comments |

I received several direct questions on my latest blog posting regarding the row-oriented data page.  Before I move on to more discussion of columnar, I thought I'd answer those questions here.

1.       How does the DBMS know where fields begin that are after variable length fields since it will not be in the same place for every record?

It's correct that any field that follows a variable length field will not be in the same position on each record.  The DBMS actually reads the field length of the variable length field in its first 2 bytes and uses that to "jump" to the start of the next field.  All subsequent fields, provided they are not variable length also, will begin at fixed offsets from the end of the variable length field.  This is why you should put variable length fields at or towards the end of the record and why you should only use variable length fields when the field size really does vary.  Variable length fields save space, but keep in mind 2 bytes are added to each field for the length bytes.

2.       If a field is null, wouldn't it be 0 or 1 bit in length?

First of all, nothing inside actual fields is a bit in length.  Everything in the DBMS is done on byte (8 bit) boundaries.  So, there will be at least 1 byte for each field, even if that field were a decimal (1), which could only contain values up to 9 and could be held in 4 bits.  For nulls, there is an ADDITIONAL byte pre-pended to the field.  If that byte is 'on', the field is null and the DBMS will ignore whatever leftover data may happen to be in the field.  If the byte is 'off', the value is not null and the field value is legitimate.  So, if a value is set to null, space is not saved!  For any nullable field, you actually have to ADD a byte to storage.  However, nullability does not make the column variable length (see answer #1 above.)  Of course, a variable length column can be nullable.

3.       What happens when a record is deleted?

Not much really happens when a record is deleted.  There is a bit in the record header that is on/off depending on whether it's a valid record or not.  That bit is set to 'off' when a record is deleted.  I'll refer to invalid records as holes.  Holes just sit there taking up space.  Holes are actually linked together within the page!  Why?  Because if this is the page that the DBMS wants to insert a record, it can use an existing hole or compress holes together to make a bigger hole to use for the new record.  If you have a clustering index, the DBMS MUST put the records in order.  However, that 'order' is only according to the Row ID map.  Physically on the page, they can still be out of order.  This is fixed, as well as all holes removed, during reorganization processes, either explicitly called or, in the case of some DBMS, as background processes.

 


Posted April 30, 2010 8:08 AM
Permalink | No Comments |

For 99% of you, the database management systems you have been working with have been "row oriented".  You may never have applied that label to them, know why it's important or that alternatives exist.  And for many, even DBAs, the "need to know" down to the data page layout has not been there.  Even the vendors who are columnar don't necessarily promote their products that way, but rather promote their systems as solving "analytic" queries.

The current main alternative to "row oriented" is "column oriented".  "Column oriented" databases now permeate the landscape and claim to provide a better value proposition for a specialized analytic workload or even the entire data warehouse. 

This is the orientation of....

  • Sybase IQ
  • SAND
  • Vertica
  • ParAccel
  • InfoBright
  • Kickfire
  • Exasol
  • MonetDB
  • Microsoft SQL Server 2008 R2 Gemini/Vertipaq
  • Oracle Exadata2 (both row- and columnar-)
  • Calpont's InifiniDB

Microsoft  is using columnar in PowerPivot for SharePoint (Vertipaq) client-side collaborative result set manipulation.  Oracle's Exadata2 accommodates both columnar and row orientation and Oracle has announced a "hybrid future".  While scant on details, this should merit some attention.  Also there is at least one vendor, Illuminate, which is a hybrid of row and column implementation, which I'll discuss later.

 

So what is columnar?  Well, today, I'm going to start the literal answer and in later posts, I'll expand on the answer, the byproducts and ramifications.  The literal answer will expand from a knowledge of the row-based data page layout.

 

Row- and columnar- have to do with how the data is laid out on a data page.  DBAs will know about data page size, which can range from 4K to 64K.  Every DBMS is slightly different, but inside the page is a page header, the records (with record headers), a row ID map and a small page footer.  The row ID map is significant because it contains the offsets from the start of the page to the beginning of the record (well actually the record header.)  If the DBMS wants to read the 4th column of the 3rd record on the 123rd page, it will go to the 123rd page and read the (let's say 2-byte) entry that is 8 bytes from the end of the page (2 byte footer plus 2 bytes for each row ID map entry times 3) to get to the 3rd record.  That takes it to the offset and after (let's say) 6 bytes for the record header, the actual data begins.

 

In a row-oriented DBMS, EVERY field for the record is then found (in column order within the table.)  If a certain column (say, the 4th column) is interesting, the DBMS will advance the number of bytes that's contained in the first 3 fields (according to the catalog notwithstanding variable length fields, for which the field length is contained IN the first 2 bytes of field itself) and there the read engine finds it's desired 4th column of the 3rd record on the 123rd page.

 

The time-consuming part of the operation was reading the data page in the first place.  If you're interested in just one column, it would be great to be able to find a bunch of that column's values on the page, without having to skip over those other columns that aren't interesting.  And that will set us up for the next entry in this series, which will be on the columnar data page.

 

Row-oriented data page:datapage.JPG


Posted April 26, 2010 7:45 PM
Permalink | No Comments |

Platform research and development in large enterprises is necessary today more than ever.  The reasons:

1.       No shortage of data consolidation and re-platforming opportunities as a byproduct of the years and years of departmental and otherwise non-centralized data warehouse development

The drive for a reduced infrastructure footprint from the boardroom is finally giving the directors and vice presidents who preside over the various overlapping data warehouses in the environment reason to point out the redundancy.  Data consolidation projects have straightforward ROI at some level - the lower carrying costs of the multiple environments.  

2.       The emerging specialized future, providing new and confusing opportunities

A new variety of legitimate options in database orientation, architecture and hosting mean that research is necessary more now than ever.

3.       Growing cynicism in vendor messages

However confused you may believe your end user shop to be, that is as confused, or more, that the newly consolidated enterprise company sales force finds itself.  The options can be overwhelming and the advice largely needs to be updated and consistent.  Likewise, the sales quotas need the same alignment.  Consequently, the vendor messages are received with more cynicism and suspicion than ever.

4.       An increased realization that 2010 will not be the last year

2010 is already tuning out to be a progressive year for information management and there is seemingly no end to these projects and, as such, platform decisions of today may need to be revisited soon if not made optimally.  This could be costly.  The rate of data collection is beyond expectations.  Companies who never considered themselves to have terabyte level needs now have 5 terabytes under management.  Fortune 50 enterprises have hundreds of terabytes.  At this rate, petabytes are not far off.  There's a need to get it optimal now.  The alternative could be fatal to the business, not just IT.

These reasons are leading to more and more internal research.  The need is for a huge sandbox that supports practical information management platform investigation, both as a matter of curiosity as well as pre-planning specific upcoming platform needs.  Decisions today around platform, modeling approach (dimensional/normalized), columnar, hosting, memory use, etc. will be highly leveraged for years and your internal research, supported by independent unbiased consulting, is required to make the best decisions today.


Posted April 12, 2010 6:50 AM
Permalink | No Comments |

Just when you thought it was safe to catch your breath and hunker down with your general purpose DBMS data warehouse, data and business requirements are escalating everywhere past the point where this strategy is tenable.  The market has filled this gap nicely over the past few years, providing us with a plethora of alternatives - none of which would seem to me to indicate a continued allegiance to a single platform enterprise data warehouse being the elegant sole end game.  Database specialization is upon us and the primary determinant of an end client's portfolio is the workload.

End clients face an array of possibilities today to accomplish common objectives of getting leaner, more agile, becoming more real-time and reducing costs.  The general purpose DBMS, hosted and supported in-house, will, of course, remain an anchor of operational and post-operational environments.    Multidimensional OLAP will also continue to play a key role in many environments.  Playing on the edges now in many environments are data warehouse appliances.  Some have made it their enterprise data warehouse platform.  There are memory resident DBMS.  There are DBMS that store data in columns, instead of rows.  Actually so many "analytic" specialist DBMS dot the platform today, such as NeoView, Aster Data, Exadata, ParAccel, and DB2 BCUs, garnering so much interest, that it would be unusual for a shop to not be considering one or more now.

All of these categories are not mutually exclusive!  There are appliances that are columnar, memory resident and hosted in the cloud.  Go figure.  It's important to know the building blocks that solutions are made of, ensure they are all compatible with your workload and goals, and not look solely at how the platform delivers a single application.


Posted March 23, 2010 7:33 AM
Permalink | No Comments |
   VISIT MY EXPERT CHANNEL

Search this blog
Categories ›
Archives ›
Recent Entries ›