Blog: William McKnight http://beyenetwork.nl/blogs/mcknight/

Hello and welcome to my blog! I will periodically share my thoughts and observations on information management here. I am passionate about the effective creation, management and distribution of information in service of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope these entries are of some modest benefit to that goal. Please share your thoughts and input on the topics.

Setting up Shop with Master Data Management, Part 3

As applications discover that they need to work with MDM for their projects to be successful, and the data and the parties are identified, the MDM team needs to be able to engage that requirement.  This is what I call setting up shop.

You will need to balance structure and agility.

Early in a project's planning/research phase, a general conversation about the project and its data needs should be held with MDM Leadership.  Once enough information is available to complete the project plan tasks, the MDM team can provide the project team with the appropriate tasks for inclusion in their project plan.  A description of a superset of those tasks follows.

All projects that use MDM data need to share the following documents with the MDM team, or give MDM Leadership outright sign-off on them, during the Requirements phase:
  • Business Requirements  

  • Non-functional Requirements  

 


The Business Requirements should contain diagrams and commentary on the interface(s) the project will have to MDM.  Depending on your working model, either the MDM team or the application team provides the technical aspects of the application's integration with MDM.  Figure out which in the MDM manifesto, or be prepared for MDM to be labeled "hard to work with."

http://www.beyenetwork.nl/blogs/mcknight/archives/2011/02/setting_up_shop_2.php  Sat, 19 Feb 2011 09:39:48 -0700
Mobile Business Intelligence: When Mobility Matters, a BeyeNetwork Custom Research Report Prepared for MicroStrategy  

I have completed this paper, in which I make the case for one of the largest trends in BI: mobility.

 

"No matter what business you are in, you are in the business of information. And it's business intelligence that has long been the discipline to deliver the needed information. Demand for business intelligence as a means to get maximum value from information has never been higher as businesses increasingly compete in real time and require information that is integrated from across the enterprise. The old saw about business intelligence is that it gets "the right information to the right people at the right time." It's really time to add "right medium" to that mix.

 

Automating business decisions and action is one path to business intelligence maturity. Determining what actions to trigger automatically based on changes in corporate data can come from a solid understanding of how decisions are made today. However, many decisions are multifaceted, and a knowledge worker's analysis will continue to be a part of effective business intelligence.

 

Effective analysis is getting more complicated for knowledge workers. The more complicated aspects include sensing what is happening and combining that with summarized historical data to build a set of possible actions. These "analytics" are the basis of competitive advantage for organizations today. Once calculated, they must be put to effective use, again utilizing the best medium available for real-time delivery."

 

Please see here for the full paper.  The contents include:

 

Business Intelligence Deployment Option History

Business Mobility

Mobile Business Intelligence Deployed

GUESS? Store Managers Don't Have to Second Guess Data

PriceLYNX: Going Mobile to Curb Supply Spend

What These Stories Tell Us Tips

Approaches to Mobile Business Intelligence

MicroStrategy Mobile

 

http://www.beyenetwork.nl/blogs/mcknight/archives/2011/02/mobile_business.php  Tue, 01 Feb 2011 08:33:49 -0700
Setting up Shop with Master Data Management, Part 2

MDM programs are generally designed to provide the data needed by a cross-section of applications, or for data that can use MDM's workflow capabilities for its origination and updates.  It's an approach usually not taken for data needed by a single application, although it may be done as a set-up for future applications.

 

The MDM manifesto must include how teams will source data from MDM.  Over 75% of the post-implementation requests made of MDM will be around this question.

 

In order to acquire the data, it must be mapped to the data structures of the target application.  Who does this? The MDM team, the application team, a separate integration team, or a separate architecture team?

 

Regardless, all new projects should meet with MDM Leadership in a very early phase of their project to determine:

 

  1. Data available in MDM that should be used in the project
  2. Data not available in MDM that should be sourced by the MDM team for the project (and other projects)
  3. Data that the project is generating that the MDM team should source into MDM
  4. Time and resource estimate for the MDM team contribution

 

Getting data into MDM that is not already there can be done in one of two ways:

 

  1. MDM (the MDM team usually) can source the data from its origination point or a third party system
  2. MDM can update or add to its workflow environment, which incorporates manual entry of the information at the right point  

 

 

 

http://www.beyenetwork.nl/blogs/mcknight/archives/2011/01/setting_up_shop_1.php  Sat, 29 Jan 2011 09:45:10 -0700
Setting up Shop with Master Data Management, Part 1

Several MDM programs out there are in development and about to go to production.  Several others are struggling in production as they try to move the program into a second subject area or to engage more publishing or subscribing systems with the information.  Others need to extend data governance beyond a single business group.

 

Few have made the leap to successfully setting up their MDM program as a fully functioning member of the 'major system' ecosystem of the company.  The guidelines in these blog entries will help those shops make that transition and address the questions that the REST of the company may have about MDM.  It is absolutely essential that MDM be properly positioned to these important evaluators of the program success.

 

Enterprise MDM cannot be successful "in a vacuum" - built to meet the need of a single application/subject area that is well-known.  Building MDM with this hyper-focus to the exclusion of all concerns for scalability results in just what I am seeing now: MDM re-dos and multiple MDMs where there could be one, enterprise MDM.

 

These questions include (again, from the perspective of those not in the MDM team):

 

What is MDM?

What data is available there?

Do I have to use MDM's data?

Do I really have to use MDM's data?  Who will care if I don't?

What if the MDM data is not suitable for my application needs?

How long does it take to incorporate my data?

Whose role is it to add data to MDM?

Is it push or pull?

I'm just going to bring this third-party data into my application, not MDM, OK?

Is my need an extension of a subject area or a new one?

Who do I talk to about MDM?

Do I have to contribute my data?

How do I modify the existing MDM workflows?

Does the MDM team carry a separate project plan for my need?

Who builds the plan and manages those tasks?

How do I unit test, do quality assurance testing, etc. with MDM data?

 

Just knowing these questions could trigger the necessary action, but in case it doesn't, I'll keep posting here (and you can as well) some tips on setting up shop with MDM.

http://www.beyenetwork.nl/blogs/mcknight/archives/2011/01/setting_up_shop.php  Sun, 23 Jan 2011 21:29:17 -0700
Microsoft Parallel Data Warehouse

In my last post, I talked about Microsoft's new, upcoming columnar offering, Apollo.   I said it was designed to take some pressure off the core DBMS to do it all and do it all fast.  That's doubly true for Parallel Data Warehouse (PDW), the new MPP offering from Microsoft.  This is probably one of the last times you'll hear the name DATAllegro, but that technology, acquired by Microsoft in 2008, is what PDW is based on.  Microsoft has spent the last two years replacing the core (Ingres) DBMS with SQL Server and the Linux/Java stack with Windows/C#.  PDW currently runs on HP hardware and is in early release.

Microsoft is giving its users 2 major additional data storage options in Denali - columnar and MPP.  Microsoft is going down the path of functional parity between the core SMP offering and PDW, which is already integrated with the SQL BI stack.  It hopes to keep some of those SMP customers hitting its scalability limits in the Microsoft tent.

There is a lot of overlap in capabilities among SMP, columnar and MPP.  It's your job to sort through your workloads and make a plan.  I have found MPP much more advantageous the larger the data is and columnar useful for those high column selectivity workloads.

I'll be part of a virtual seminar focused on PDW on Tuesday.  I'll be talking about data consolidation strategies, a topic Microsoft is ready to take on with PDW.

Abstract:

As budgets languish, data growth balloons and business demand intensifies, BI and data warehousing professionals are under immense pressure to squeeze every last dollar of value from existing investments, while providing 24/7 access to mission-critical business information. That's the bad news.

The good news is you're invited to join renowned visionaries Bill Inmon (the father of data warehousing) and William McKnight (leading information management consultant), for our LIVE, interactive virtual seminar on November 16th (9:00 AM - 1:30 PM (EDT))  - designed to help you leverage next-generation data warehousing technologies for maximum gain.

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/11/microsoft_paral.php  Sun, 14 Nov 2010 08:44:26 -0700
Microsoft Unveils Apollo Column Store

It was columnar day for me at SQL PASS on Wednesday.  On Tuesday, Microsoft announced that Denali, the code name for its next release, would have a columnar data store option.  My talk on Wednesday was on columnar databases.  Here are some of the details I shared about Denali's column store, which has a project name of Apollo.  If you're interested in columnar databases in general, see my blog entries here.

In Denali, there will be an index type of "COLUMN STORE".  I find this an interesting use of the term index, because the resulting data stores are not like traditional indexes.  However, Microsoft has never been a conformist organization.   The column stores are non-clustered indexes.  No indexes can be created on top of the column stores.

Where the column store is like an index is that you still need the accompanying row-wise database.  The column stores cannot be created independently of the row-wise database.  To my knowledge, this is the only columnar database that requires this.  I don't expect this to be a long-term requirement.  While this may seem like it's expanding your storage needs (and it is), it may not be as much as you initially think, because some non-clustered indexes might become redundant in this architecture.

The good news about this is that the optimizer has been updated to route queries to the column stores or the row store accordingly.  This could prove to be a competitive differentiator.  Few other database systems have this.  An intelligent hybrid optimizer will be key to the success of databases that are at least partly columnar. 

Apollo's vectors (per my Sybase IQ language in my earlier posts) are called column segments, although there can be multiple segments per column, as explained below.  You can only have one column store index per table, but you can name as many columns as you want.  Of course, it doesn't matter what order you use because each column forms an independent segment.  Only single-column segments are supported in Apollo.
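To make that concrete, here is a rough sketch of what defining such an index could look like. The DDL below is an assumption on my part, modeled on how the feature surfaced in later public builds rather than on documented Denali syntax, and the table and column names are invented for illustration.

-- Hypothetical example: one column store index on an invented fact table.
-- Syntax assumed from later public documentation of the feature.
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
ON dbo.FactSales (SaleDate, StoreID, ProdID, CustID, Price);
-- Only one column store index is allowed per table, and the order of the
-- columns listed does not matter - each column becomes its own segment(s).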

Apollo leverages the intellectual property, patents, and good experiences that Microsoft has had with Vertipaq, the client-side, columnar, in-memory structure used with PowerPivot.  Columnar remains the preferred - and, going forward, the only - format for Vertipaq.

In Apollo, no inserts, updates or deletes are allowed on the tables that have a COLUMN STORE (this is the part of the talk where I did a mock exit).  You can, however, do incremental loads and you can switch partitions to add data.  You can also sandwich DISABLE and REBUILD of the segments around your updates.  I expect this will improve over time.
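As a sketch of that work-around pattern (again assuming the DDL shapes of later public builds, with invented object names), the index can be disabled, the data modified, and the index rebuilt; new data can also arrive by switching in a loaded partition:

-- Hypothetical disable/update/rebuild sandwich around a data change.
ALTER INDEX csi_FactSales ON dbo.FactSales DISABLE;

UPDATE dbo.FactSales
SET Price = Price * 1.05
WHERE ProdID = 4;

ALTER INDEX csi_FactSales ON dbo.FactSales REBUILD;

-- Hypothetical incremental load via partition switching: load a staging
-- table, then switch it in as a new partition of the fact table.
-- ALTER TABLE dbo.FactSales_Stage SWITCH TO dbo.FactSales PARTITION 12;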

As long as I'm on limitations, the columns selected have data type restrictions.  The columns must be integer, real, string, money, datetime or a decimal that is 18 digits or less.  No other data types are supported.

As we know, getting more (relevant) information in the I/O is one of the advantages of columnar data store.  Microsoft has taken this to a new level.  While data is still stored in pages (blocks), the unit of I/O is actually one million data values.  That forms a "segment."  You read it right - the unit of I/O is not a certain number of "K" but has to do with the NUMBER of data values.  Inside those pages, the data is stored in blobs.  Bitmapping is part of the storage somehow as well although columnar data page layouts are not public information.  Neither is how it's doing materialization.  As for I/O, compression algorithms have been reengineered for columnar.  These are not the same compression algorithms from the row-wise database.
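If you want to see the segmentation for yourself, the released product later exposed catalog views for it. The query below is a sketch that assumes the sys.column_store_segments view (and its row_count and on_disk_size columns) behaves as later documented; it is not something I verified against the Denali CTP.

-- Hypothetical peek at segment metadata; each populated column should show
-- roughly one segment per million rows.
SELECT column_id,
       COUNT(*)          AS segment_count,
       SUM(row_count)    AS total_rows,
       SUM(on_disk_size) AS total_bytes
FROM sys.column_store_segments
GROUP BY column_id
ORDER BY column_id;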

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/11/microsoft_unvei.php  Fri, 12 Nov 2010 10:22:28 -0700
Building the Case for Data Mart Consolidation

For much of the last decade, conventional theories surrounding decision support architectures have focused more on cost than business benefit. Lack of Return on Investment (ROI) quantification has resulted in platform selection criteria being focused on perceived minimization of initial system cost rather than maximizing lasting value to the enterprise. Often these decisions are made within departmental boundaries without consideration of an overarching data warehousing strategy.

This reasoning has led many organizations down the eventual path of data mart proliferation. This represents the creation of non-integrated data sets developed to address specific application needs, usually with an inflexible design. In the vast majority of cases, data mart proliferation is not the result of a chosen architectural strategy, but a consequence of the lack of an architectural strategy.

To further complicate matters, the recent economic environment and ensuing budget reduction cycles have forced IT managers to find ways of squeezing every drop of performance out of their systems while still managing to meet users' needs. In other words, we're all being asked to do more with less. Wouldn't it be great to follow in others' footsteps and learn from their successes while still being considered a thought leader?

The good news is that the data warehousing market is now mature enough that there are successes and best practices to be leveraged. There are proven methods to reduce costs, gain efficiencies, and increase the value of enterprise data. Pioneering organizations have found a way to save millions of dollars while providing their users with integrated, consistent, and timely information. The path that led to these results started with a rapidly emerging trend in data warehousing today - Data Mart Consolidation (DMC). I've learned that companies worldwide are embracing DMC as a way to save large amounts of money while still providing high degrees of business value with ROI. DMC is an answer to the issues many face today.

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/11/building_the_ca.php  Sun, 07 Nov 2010 16:44:50 -0700
Analyzing the ROI Potential of the Contact Center Data Mart

Customer contact center data contains hidden nuggets of insight about customers, products, and business operations, and it provides the foundation for effective customer relationship management (CRM). Mining this data for insights can be daunting, however.

The databases that support operational activities such as call center operations are tuned for the performance of those operations and are usually inappropriate for data analysis. The database structures are designed for transaction processing. The databases themselves contain limited historical content because data retention is typically limited to a maximum of three to six months. And the data in them is only a subset of the total contact activities the business handles, either because geographically dispersed contact centers handle enterprise contact activities or because different applications or divisions handle telephone, e-mail, and Web-based contacts.

Deriving the full value from customer contact data requires the integration of all contact records, regardless of how or where they were received. Data might represent phone contacts routed to a regional call center, e-mails sent to a service organization, or Web interactions between sales agents and prospects surfing a company's Web site. The ability to optimize the business to better meet the needs of customers depends on knowing what those customers are doing, regardless of the communication channel they use and regardless of how or where their contacts were routed within an enterprisewide contact infrastructure.

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/10/analyzing_the_r.php  Sun, 31 Oct 2010 11:08:58 -0700
Documenting Information Management Projects

For any well-done information management project, there exists a set of documentation.


I've recently come into many shops to help clients overcome process, organizational and technical challenges and many times documentation is non-existent.  Of course, we all know many benefits of good documentation but also know that it helps consultants be a quick-study and get straight to advice or action.


But what documentation is necessary, and is it necessary at all if you are deploying an agile methodology?  I'll answer the first part below, but my answer on documentation and agile is that yes, documentation is important in agile.  If you are going to support the system, potentially do something similar again, hire consultants or have employee turnover, you need documentation - at the appropriate times and with appropriate, not necessarily excruciating, detail.


You also probably do not want to be the person who repeats himself repeatedly, which is what you'll do without documentation available to hand out.  Repeating oneself is hardly the most advanced use of time.


The documentation can be built with agility just like the systems, but here is a list to think about delivering (or, as the case may be, retrofitting).  It's a STARTER list for what is necessary.

  • Non-Functional Requirements - Describes the environment in which the system operates
  • Decisions Capture - A place to catch all those important decisions that are made and a vehicle for making those decisions visible in the culture
  • Logical Data Model - The logical data model should not exist without narrative
  • Test Approach - Describes how the development will be tested - systems, data, user involvement, players, etc.
  • Interface Specification - Interfaces are largely what these projects are all about - either interfacing existing systems or adding multiple components that need interfacing themselves
  • Startup Plan - Day 1 of production is hardly when data can begin accruing in new information management projects; planning for and loading the data backlog needs a plan
  • Data Access Specification - Describes how the new data being made available will be accessed by users and systems


Do yourself a favor.  Inventory your documentation set and make sure it's the right level for sustaining the system.


http://www.beyenetwork.nl/blogs/mcknight/archives/2010/09/documenting_inf.php  Tue, 21 Sep 2010 11:58:43 -0700
Some more favorites from searchdatamanagement

Do we need business intelligence (BI) tools to be successful?
Get William's take on whether you need business intelligence (BI) tools to be successful and learn about low-cost BI, including cheap and open source BI tools.

Fastest way to learn business intelligence (BI)
Find out how you can learn business intelligence (BI), get business intelligence training and discover why analytics is the key to marketing efforts in this BI tip.

Data architect careers: The benefits of working at a System Integrator
Working as a data architect at a consultancy or System Integrator can be challenging, but has benefits. Find out the career value of working at a System Integrator, as advised by William McKnight.

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/08/some_more_favor.php  Fri, 20 Aug 2010 15:27:10 -0700
William McKnight's contributions to SearchDataManagement.com

Now that the B-Eye Network is fully ensconced in the Tech Target family, I wanted to bring to your attention the "tips" I've been providing to searchdatamanagement.com (also part of Tech Target) over the years.

A consolidated list of tips can be found here on MDM/DW/BI topics such as ETL order, slowly changing dimensions, quick BI, customer matching, data quality, data modeling, standards, using cubes, star vs. snowflake, data mining techniques, planning, project management, and archiving data.

A few favorites are:

Should business intelligence analysis tie in with operational reports?
Read about William's rules for tying business intelligence (BI) system content back to the sources and what "percentage of tolerance" he recommends.

Should a data steward have direct SQL access for reporting purposes?
Find out when a data steward should have direct SQL access and how to request SQL access to get data directly from the data warehouse for reporting and analytical purposes.

Who should the business intelligence team report to?
Learn which department the business intelligence (BI) team should report to -- and why.

I've recently begun the Data Warehouse Strategy Q&A Section.  You can pose questions to me there.  Here are my first two:

How will the Sun acquisition affect Oracle data warehouse strategy?
Find out how Oracle's acquisition of Sun may affect Oracle data warehouse strategy and what it means for users. Also, see what role packaged BI software plays in Oracle's strategy.

When to use a DBMS and the future of data warehousing data management
Find out when you should use a DBMS with your data warehouse, and learn about the future of data warehousing data management and current state of data warehouse architecture.

 

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/08/william_mcknigh_1.php  Fri, 13 Aug 2010 08:50:17 -0700
The Value of Performance and When to Consider Columnar Databases

Links to the Columnar Database series:

 

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This final installment in my series on columnar will focus on what, to many, is the bottom line - when to use one.  I have built up to this point from the ground up, starting with the major difference between columnar and row-based systems: the orientation of the data in its blocks/pages.  Understanding those physical differences is the foundation for understanding the queries and workloads that will perform better or worse, and by how much.  To be sure, you would use columnar to selectively improve the performance of data retrieval queries.

Seldom are such platform decisions based on a few queries.  The overall workload would need to be analyzed.  There may be a case for splitting the workload accordingly.  This may mean duplicating some of the data across columnar and row-based systems.

So, anything is possible when it comes to system selection.  It all depends on the value of performance.  Since columnar databases will not run an enterprise, they are up against entrenched row-based skill sets and mindsets.  It must be the value of performance that overrides this if columnar is going to be considered.  Many times this comes after exhaustive tuning of the row-based DBMS. 

Based on theory, client-side experience in platform selection and building information management systems, and personal testing and benchmarks, I offer these observations.  Clearly, these must be query observations and not workload observations.  Block reads are a primary barometer of performance, although in columnar you also have the post-read "gluing" process I spoke of in an earlier post to deal with.  You also have prefetch strategies in both row-based and column-based systems, which speed up scans when triggered.  Also, the row makeup (size, number/type of columns), as discussed earlier, factors into what you're getting with the block reads.
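To make the query-level distinction concrete, here is an illustration against a made-up, very wide SALES table. The first query touches only two of its many columns and scans broadly - the kind of retrieval where columnar tends to shine because only those vectors are read. The second fetches every column of a few rows by key, which a row store typically serves better because a columnar engine would have to glue many columns back together.

-- Hypothetical wide table; names invented for illustration.

-- Columnar-friendly: a broad scan that reads only 2 of the table's columns.
SELECT STOREID, SUM(PRICE) AS TOTAL_SALES
FROM SALES
GROUP BY STOREID;

-- Row-store-friendly: a narrow lookup that needs every column of a few rows.
SELECT *
FROM SALES
WHERE ORDERID = 123456;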

Some things are clear however.

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/06/the_value_of_pe.php  Thu, 24 Jun 2010 10:59:46 -0700
Columnar Materialization

So far in this series on columnar, I have attempted to distinguish the columnar database from the row-wise database from the bottom up, drawing distinctions in how the data is stored on disk, which is the major physical difference.  Once the data is stored in its new and unusual way in a column-based DBMS, it, of course, must be retrieved.

In row-wise data retrieval, we frequently ask for multiple tables to be joined in the result set.  Of course, in columnar (which works with SQL too, by the way), there are joins, but there is also an analogous concept to joins within a single table.  It's a process I call "gluing": pulling together the columns from within a table that are needed by the query, whether for the select list or for the where clause.

In row-wise, unnecessary columns must be removed from the result set because inevitably (unless the query is a SELECT *) more columns were retrieved than are interesting to the query.  In columnar, since just enough columns were retrieved, that is not necessary.  However, what is necessary is a strategy that determines when to glue (materialize) the columns together.  There are two choices - early materialization or late materialization.

Early materialization retrieves all the columns necessary and glues them together up front, effectively creating a "row" with which to do the necessary processing.  If you asked...

SELECT CUSTID, PRICE

FROM SALES

WHERE PRODID = 4 AND STOREID = 1
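With early materialization, the engine would pull the CUSTID, PRICE, PRODID and STOREID vectors, glue them into rows up front, and then apply the WHERE clause to those rows. With late materialization, the predicates are applied to the PRODID and STOREID vectors first, and only the qualifying row positions have their CUSTID and PRICE values fetched and glued. A conceptual sketch of the late approach, written as SQL purely for illustration (the engine does this internally, and ROW_POSITION is a hypothetical identifier, not a real column):

-- Step 1 (inner query): use only the predicate vectors to find qualifying positions.
-- Step 2 (outer query): fetch the projected columns for just those positions.
SELECT s.CUSTID, s.PRICE
FROM SALES s
JOIN (SELECT ROW_POSITION
      FROM SALES
      WHERE PRODID = 4 AND STOREID = 1) q
  ON s.ROW_POSITION = q.ROW_POSITION;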

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/06/columnar_materi.php  Tue, 01 Jun 2010 13:51:35 -0700
How Columnar Databases Address Modern Bottlenecks

This is a continuation entry in my series on columnar databases.  Today, I will talk about how columnar databases address the modern architectural bottlenecks.

Row-based MPP is a large cluster of thousands of processors.  Each node can be SMP shared-nothing or shared-disk. The interconnect connects the nodes, allowing for processing to span nodes.  Row-based MPP architectures support some of the largest and most successful data warehouses in the world.  And this is not going to change any time soon.

In row-based MPP, one of the most important design choices we make is how to distribute data amongst the nodes.  Typically some form of 'randomization' is used, like hash or round-robin.  We have also slowly made our block sizes larger over the years; the reason is that the block is the unit of I/O (notwithstanding prefetch) and we want to get as much data as possible in a single I/O.  I/Os have become the bottleneck in our environments.
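As an illustration of that distribution choice, here is a rough sketch of the kind of DDL an MPP appliance uses to hash-distribute a table across its nodes. The WITH (DISTRIBUTION = ...) clause is my assumption, modeled on how PDW-style products later documented it, and the table is invented.

-- Hypothetical hash-distributed fact table on an MPP appliance.
CREATE TABLE dbo.FactSales
(
    OrderID  bigint   NOT NULL,
    CustID   int      NOT NULL,
    StoreID  int      NOT NULL,
    SaleDate datetime NOT NULL,
    Price    money    NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderID));  -- rows are spread across nodes by a hash of OrderID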

While disk density has gone up significantly in the last 20 years or so, packing much more data down into smaller spaces, I/O is still limited by the physical head movement of the arm.  Physics simply won't allow such a small component to move much faster without the arm flying right off its handle (which could ultimately be why solid state disk and phase change memory become the norm over time).

Other than the larger block sizes, our designs have not changed much over the years to accommodate this I/O-bound reality.  We have tried to make the OLTP database more analytic with specialized indices, OLAP cubes, summary tables and partitioning, but we would need hundreds of drives to keep one CPU truly busy in a robust, complex, well-utilized data warehouse environment.  It's not feasible.  Incidentally, because of this bottleneck, random I/O has sped up much more slowly over the years than sequential I/O, which doesn't require nearly as much head movement.

Naturally, on a project-by-project basis, you just make the best of the systems you have to work with.  The reason to discuss this in my columnar series is that columnar is one technique that makes the most of the I/O.

Regardless of row- or column-orientation, the data is stored on disk and there are various gates that store and process data until it gets to the CPU.  Each works on ever-decreasing data sizes.  Think of it as a pyramid.

[Figure: pyramid.JPG - the pyramid of storage and processing layers between disk and CPU, each working on ever-decreasing data sizes]

 

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/05/post_2.php  Mon, 17 May 2010 21:01:36 -0700
The Columnar Data Page

This is a continuation of my thread on columnar databases.  In the last entry, as a foundation, I introduced the row-wise data page.  In this post, I will address the changes at the data page level that occur for a columnar database.

In the row-wise data page, all of the columns for each row are stored sequentially, and rows are stored consecutively.  In a columnar database, each data page is populated with data from only one COLUMN.  I will refer to this data structure as Sybase IQ does, as a vector.  So, each vector contains a column's worth of data.  Vectors will take up more or fewer pages depending on the size of the column.  A table will have one vector per column.

Queries will usually need multiple columns from a table.  So those columns must be fit together appropriately.  It would not work to put column 1 from row 1 together with column 2 from row 2 and column 3 from row 3 and present it as a "row".  The columns have to be put back together (what I call "glued") appropriately.  To facilitate this, the columns are stored in the order of the rows they belong to.  Hence, a row ID map at the end of the page is not required in columnar and consequently, there can be less wasted space.*

For example purposes, I will use a 3-column table with customer number, first name and last name columns.  The rows contain:

  1. 123-William-McKnight
  2. 456-Joe-Smith
  3. 789-Joe-Doe

There are 4 ways that columnar databases store data:

  1. For each column, the row ID is stored alongside the column value.
     a. Vector 1: 1-123, 2-456, 3-789
     b. Vector 2: 1-William, 2-Joe, 3-Joe
     c. Vector 3: 1-McKnight, 2-Smith, 3-Doe
  2. For each column, a row "range" is used where applicable.
     a. Vector 1: 1-123, 2-456, 3-789
     b. Vector 2: 1-William, 2-3-Joe
     c. Vector 3: 1-McKnight, 2-Smith, 3-Doe
  3. For each column, the row is indicated by position.
     a. Vector 1: 123, 456, 789
     b. Vector 2: William, Joe, Joe (or William, Joe (2))
     c. Vector 3: McKnight, Smith, Doe
  4. A bitmap representation is used, with abstracted values corresponding to entries in a separate map structure.

The first way is wasteful: the row ID is unnecessary to store because it is implied by the order of the values.  The second way does not improve upon that, but it does compress space when there are recurring values in sequential rows.  The third way shows removal of the row ID.  Notice Vector 2 may or may not actually use the range representation for the recurring values.  It's usually better if it does.  Finally, while there is nothing new about columnar bitmaps (vs. row-wise bitmaps), the acceptable cardinality for the fit of a column as a bitmap is much higher in columnar than row-wise.  In columnar, up to 1500-2000 values is considered "low cardinality" and "worth it" to apply bitmapping to (as opposed to < 50-100 in a row-wise DBMS).  This is useful when you have recurring values that are not stored in sequential records.  A column like "country" can reduce its size 30-fold with bitmap encoding.

In some columnar DBMS, the columns/vectors can utilize different storage methods. The 3rd and 4th ways shown here are best in a mixed strategy.

Again, columnar is attempting to give you more bang for the buck in your I/Os when you are seeking a minimal set of columns from the table.  In the next entry, I will describe why getting the most out of your I/Os has become increasingly important.

*Wasted space can occur in row-wise when you run out of Row IDs before storage space on the page.  Since each row must have a map entry, the page will be considered full when the Row ID limit is reached.

 

http://www.beyenetwork.nl/blogs/mcknight/archives/2010/05/the_columnar_da.php  Thu, 06 May 2010 21:33:59 -0700