Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author >

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in Design Category

Here's another "what I've learned." When we have a deadline or a business-critical issue, we tend to underemphasize the business requirements, due diligence, experience and, sometimes, the spending that is going to be necessary.

One area, just to pick something out, is Dun & Bradstreet numbers. DNB is practically ubiquitous in my clientele, yet it’s always harder to match to than initially believed. Another truth about DNB is that not every company has a DNB number. Plan for it!

I have a collection of data warehousing “famous last words.” Here’s one: “We’ll just link our customers to DNB tomorrow and it’ll be ready for you on Thursday.”


Posted September 11, 2007 7:22 AM
Permalink | No Comments |

Many systems have struggled to add maximum value to their organizations as an indirect result of lacking classical OSS architectural standards, such as:
• Reasonably scaled and separate test, development, and QA platforms
• Change management
• Business continuance and disaster recovery planning
• Robust backup management
• Physical isolation, or thoughtful co-location based on shared characteristics
• User care measures such as support, training and built-in descriptive information
• Proactive performance planning
• Business roles and responsibilities
• Ongoing business tuning of program direction

They therefore may not seem stable enough for business user dependence and continued development.

However, there are several aspects of technology that enable the progression of these standards:

• The continued reduction in hardware costs, especially in disk technology, makes it economically possible to have separate development, QA, and recovery systems.
• Improvements in database technology now allow massive amounts of data to be inserted while the system is being accessed, enabling occasional (near) real-time data feeds.
• Sophisticated ETL tools and “hub and spoke” systems and standards facilitate the establishment and operation of systems that move data from source systems to targets, and vice versa.

The main inhibitor, however, will probably continue to be the perception of such systems as strictly technical in nature, when in reality the fit into the business situation is just as important as the technical aspects.


Posted July 27, 2007 8:37 AM
Permalink | No Comments |

One of the most difficult things to do in data warehousing is to engage a new source system. Learning about the fields the system has to offer the data warehouse, when they are populated, how “clean” the fields are and when you can get after them with your extract job can be daunting. Then, after going through the process of attaching the extract jobs, scheduling and beginning the cycles, you would want to be set for a while.

Not so fast. Usually 1 day to 2 weeks after putting a data warehouse – any iteration – into production (or prototype), users who previously communicated requirements in abstract terms are now seeing the results and requiring changes. New fields and new transformations are not unheard of at this point.

Although data warehousing is very dynamic, it is possible for a practitioner to think beyond initial, spoken requirements and “prime the pump” by bringing additional fields into the ETL process. This concept, known as “triage,” works very well if you have a staging area where the initial load from the source is “dropped” prior to the majority of the transformations.

With triage and a staging area, the staging area can contain many more fields than are moved forward to the actual data warehouse. Then, if a new field is needed in the warehouse, there is no effect on the source extracts (and no accompanying disruption of source operation and negotiation with the source system team).
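
To make the triage idea concrete, here is a minimal sketch, with hypothetical table and column names (not from any particular project), of a staging table that lands more source fields than the warehouse initially consumes:

-- Hypothetical names; a minimal sketch of triage in a staging area.
-- The staging table lands every field the extract can cheaply carry...
create table Stg_Customer (
  Source_Customer_Id  varchar(20)   not null,
  Customer_Name       varchar(100)  null,
  Customer_Segment    varchar(30)   null,   -- required today
  Credit_Limit        decimal(12,2) null,   -- triaged: not yet required
  Sales_Rep_Code      varchar(10)   null,   -- triaged: not yet required
  Load_Date           datetime      not null
)

-- ...while the warehouse dimension initially carries only the spoken requirements.
create table Customer_D (
  Customer_Key        int identity(1,1) not null primary key,
  Source_Customer_Id  varchar(20)   not null,
  Customer_Name       varchar(100)  null,
  Customer_Segment    varchar(30)   null
)

If Credit_Limit is later requested, only the staging-to-warehouse step changes; the source extract, its schedule and the source system team are untouched.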

But wait, you say. "What about the historical data that usually accompanies such new data sourcing?"

The concept of the persistent staging area is to keep all data in the staging area, from both a “triaged” (see yesterday’s tip) and a historical perspective. That way, when requirements change post-production (again, see yesterday’s tip), you not only have the ETL “primed,” you also have the historical data primed – in the persistent staging area – and ready to be moved forward to the warehouse.

Persistent staging areas almost always require a separate DBMS instance from the data warehouse DBMS due to the volume that will accumulate in them.

Since historical data is also kept in the warehouse, what distinguishes the persistent staging area is its capture of triaged data, ready for historical loading of newly required data post-implementation. It will be bigger than the warehouse itself.
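
As a sketch of that payoff, again with hypothetical names and assuming Stg_Customer has been loaded persistently since go-live, promoting a triaged field later is an ALTER plus a backfill from the persistent staging area:

-- Hypothetical names; assumes a persistent Stg_Customer with full history.
alter table Customer_D add Credit_Limit decimal(12,2) null
-- run as a separate batch (GO) so the new column is visible to the backfill
go

-- Backfill from the most recent persistent-staging row per customer.
update d
   set d.Credit_Limit = s.Credit_Limit
  from Customer_D d
  join Stg_Customer s
    on s.Source_Customer_Id = d.Source_Customer_Id
   and s.Load_Date = (select max(Load_Date)
                        from Stg_Customer s2
                       where s2.Source_Customer_Id = s.Source_Customer_Id)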

Although I usually do not use this technique in my data warehouses, if there were a high likelihood that requirements would be very dynamic after production and disk cost were not an issue, it would be very applicable.


Posted July 4, 2007 10:49 AM
Permalink | No Comments |

Here is some code I’ve been dragging around in my toolbox for many years. Some data warehouses populated their date dimension only through 2007 or 2008 because that seemed a long way off when the data warehouse project began. Well, now it’s time to add a few more years of data to the date dimension. However, with the date dimension auto-population capabilities in current releases of database management systems, manually populating the date dimension has become a lost art. So, whether it’s useful or a stroll into the past, this code populates Year, Quarter, Month, Week, Day and Date dimensions.

The tables and their columns are as follows:

Year – Year_D
Year_Key - SK
The_Year - Int

Quarter – Quarter_D
Quarter_Key - SK
The_Quarter - Int

Month – Month_D
Month_Key - SK
Quarter_Key - Int [Ref: Quarter_D.Quarter_key]
The_Month - Int

Week – Week_D
Week_Key - SK
The_Week - Int

Day – Day_D
Day_Key - SK
The_Day - Int

Date – Date_D
Date_Key - SK
Year_Key - Int [Ref: Year_D.Year_Key]
Month_Key - Int [Ref: Month_D.Month_Key]
Week_Key - Int [Ref: Week_D.Week_Key]
Day_Key - Int [Ref: Day_D.Day_Key]
The_Date - Date

Code:

create procedure Load_Date_Dimension_Sp @year_start int, @year_end int
as
begin
-- load one row per year into year_d for the requested range
declare @yr int
set @yr = @year_start
while (@yr <= @year_end)
begin
insert into year_d (the_year) values (@yr)
set @yr = @yr + 1
end


declare @qt int
set @qt = 1
while (@qt <= 4)
begin
insert into quarter_d (the_quarter) values (@qt)
set @qt = @qt + 1
end


declare @mo int
set @mo = 1
while (@mo <= 12)
begin
insert into month_d (the_month) values (@mo)
set @mo = @mo + 1
end

update month_d
set quarter_key = (select quarter_key from quarter_d
where the_quarter=1)
where the_month in (1,2,3)

update month_d
set quarter_key = (select quarter_key from quarter_d
where the_quarter=2)
where the_month in (4,5,6)

update month_d
set quarter_key = (select quarter_key from quarter_d
where the_quarter=3)
where the_month in (7,8,9)

update month_d
set quarter_key = (select quarter_key from quarter_d
where the_quarter=4)
where the_month in (10,11,12)


declare @wk int
set @wk = 1
-- datepart(wk, ...) can return 53 for dates late in the year, so load 53 weeks
while (@wk <= 53)
begin
insert into week_d (the_week) values (@wk)
set @wk = @wk + 1
end


declare @dt1 int
set @dt1 = 1
while (@dt1 <= 31)
begin
insert into day_d (the_day) values (@dt1)
set @dt1 = @dt1 + 1
end


declare @dt datetime
declare @dt_end datetime
declare @day_k int
declare @week_k int
declare @month_k int
declare @year_k int


set @dt_end = '12/31/' + cast(@year_end as varchar(4))
set @dt = '01/01/' + cast(@year_start as varchar(4))
while (@dt <= @dt_end)
begin
-- look up the surrogate keys for this date's year, month, day and week
select @year_k=year_key from year_d where the_year = year(@dt)
select @month_k=month_key from month_d where the_month = month(@dt)
select @day_k=day_key from day_d where the_day = day(@dt)
select @week_k=week_key from week_d where the_week = datepart(wk,@dt)
insert into date_d (the_date, day_key, week_key, month_key, year_key) values (@dt, @day_k, @week_k, @month_k, @year_k)
-- advance one day
set @dt = dateadd(dd, 1, @dt)
end

end

The procedure can be executed as:

Exec Load_Date_Dimension_Sp 1985, 2030

…which will populate the date dimension for the years 1985 through 2030.

The syntax is SQL Server T-SQL.

Technorati tags: Data warehouse, Dimensional Model, Date Dimension


Posted June 15, 2007 8:51 AM
Permalink | No Comments |

I’ve been putting some thought into data warehouse design issues in my recent three (with one more to go) DM Review columns, but today I was thinking about what inhibits good data modeling. In other words, what are the practices leading to poor data warehouse data modeling?

The main ones really all have to do with the same thing – not doing a logical model before the physical model. The major reasons for this are:


· Thinking physical too early - The hardest part of modeling is getting the grain of each entity and the relationships right. It is difficult to tell the grain and “real” relationships (versus those that are being added for performance) when physical constructs are thrown in too early.

· Thinking usage over relationships too early - Sure, there are those target queries that you will test the model with. But if you model to them, you can easily make it very difficult for the model to accommodate future requirements – those sitting below the tip of the proverbial iceberg.

· Thinking denormalization before the normalization is done - Anticipating downstream query performance shortcomings and considering them too early into the modeling process is a reason models fail. By the way, I'm using normalization here generically to mean normalizing or dimensional modeling - whichever you are doing.

As a rule, you should try to stay close to the logical model during implementation. Physical modeling should be iterative and based on performance testing; it’s not something you necessarily do with your modeling tool. If you “over-spec” the machine a little and arm yourself with the ability to index, cluster and partition (none of which strays from the logical model), build cubes and summaries, and push complexities into a mart or at least outside the core model, what you gain in model understandability and manageability usually outweighs the added complexity of numerous rule exceptions, such as denormalizations, in the model.
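
For instance (object names are hypothetical, and this is only a sketch of the kind of physical help meant here), rather than folding an attribute into a fact table to speed one hot query, a covering index or a summary table outside the core model can often carry the load while the logical model stays clean:

-- Hypothetical names; physical tuning that leaves the logical model intact.
-- A covering index for the hot query path instead of denormalizing.
create nonclustered index IX_Sales_F_Customer_Date
on Sales_F (Customer_Key, Date_Key)
include (Sales_Amount)

-- A summary table outside the core model, rebuilt on a schedule,
-- instead of exception-riddled denormalization inside it.
select Customer_Key, Date_Key, sum(Sales_Amount) as Sales_Amount
into Sales_Daily_Sum
from Sales_F
group by Customer_Key, Date_Key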

Technorati tags: Data Modeling, Data Warehouse, Normalization


Posted April 13, 2007 2:41 PM
Permalink | No Comments |

You can learn a lot about a data warehouse program by analyzing how it uses nulls. As most of you know, null means "unknown" or "irrelevant". The nullability, or ability to take on the null value, is a dimension of every column in the data warehouse - or any database for that matter. Most columns should not be nullable, but a few should be.

Nulls add extra storage to the column so that the DBMS can record whether the value is null or not. If the null bit/byte is "on", then whatever value may reside in the actual field is irrelevant and will not display. Of course, the other factor is the proper assignment of the null value to those columns that are nullable.

Nulls do not equal zero or spaces. They actually have an entirely different meaning from either of those. There may be actual values like "not supplied" or "invalid value" that should be used in place of null (or zeroes or spaces). These descriptive terms are actually more explanatory about the field than null. So nulls get overused sometimes.

But mostly nulls are underused, taking a backseat to zeros and spaces. Nulls don't equal other nulls. And the manner in which nulls participate in aggregate functions like SUM, AVG and COUNT is very logical, but can be tricky. You also can't join on nulls. These basic facts discourage many from using nulls in data warehouses at all. But a little investment in knowledge of nulls can go a long way and afford your program the power that nulls bring.
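
A quick T-SQL illustration of those behaviors, using a hypothetical table:

-- Hypothetical example of how nulls behave in comparisons and aggregates.
create table #Orders (Order_Id int, Discount_Amt decimal(9,2) null)
insert into #Orders values (1, 10.00)
insert into #Orders values (2, null)
insert into #Orders values (3, 0.00)

-- Nulls don't equal other nulls (or anything else): this returns no rows,
-- not even order 2.
select * from #Orders where Discount_Amt = null

-- Aggregates quietly skip nulls: count(*) = 3, count(Discount_Amt) = 2,
-- and the average is 5.00 (over the two non-null values), not 3.33.
select count(*) as All_Rows,
       count(Discount_Amt) as NonNull_Rows,
       avg(Discount_Amt) as Avg_Discount
from #Orders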

Thinking about nulls makes me think about... country music of course.

Here are some of my favorite songs when looking at the effective use of nulls in data warehousing:

Stand by Your Null
If My Heart Were Nullable
Kentucky Null
Don't be Null
The Null Road
At the Gas Station of Love, I Got the Self-Service Null
Her Cheatin' Heart Made A Null Out Of Me
You Turned my Lullaby into a Nullaby


Posted April 9, 2006 8:36 PM
Permalink | No Comments |

eWeek reports that the European Union has passed a contentious data-retention directive that requires all telephone and Internet traffic to be logged and stored for between six months and two years in order to help combat organized crime and terrorism.

Data to be retained include both incoming and outgoing phone numbers, how long calls last, and the location of calls, for both successful calls and those that get dropped. Also covered are IP addresses for SMS and Internet activity, as well as login and logoff times.


Posted December 23, 2005 1:30 PM
Permalink | No Comments |

I remember the day, not many years ago, when data warehouse programs would annually undergo a disaster recovery plan test. This usually involved a weekend and usually involved some important revelations in terms of readiness! I've noticed in the past year that those tests are seldom done anymore. Not being a server expert, I assumed that there were some important new built-in capabilities to servers or more failover environments in place that removed the need for the exercise.

So, I began to make it a point to find out more and, while surely the servers have improved in this area and there are more failover arrangements in place, the primary reason data warehouse disaster recovery plans aren't done as much seems to be that the exercise is out-prioritized. This is not a great reason. Of course, averting a disaster is always a tough justification in light of the many clearly progressive things we like to do for our data warehouse environments.

But consider these simple things that form the basis of a sound disaster recovery plan, as given to me by a systems expert...

1. Have a backup strategy that you actually execute; usually this will mean daily full backups of all production servers (my expert discouraged incremental backups) and hourly log backups (see the T-SQL sketch after this list)
2. Diagram the network in detail with specifications and contact numbers (incidentally, CSI has capabilities here we call ClarityPath)
3. Prioritize systems and connections by acceptable downtime; this prioritizes the recovery effort focus
4. Build a recovery team including these roles - project leader, communication leader and technology experts who are on call 24 x 7
5. Test the plan (the premise of this blog entry) which includes testing those old tape drives to make sure they can read the old tapes still needed
6. Review the plan systematically once per month - make it a living document
7. Put up a secure website where status can be communicated in the event in-house systems are down and people are working from off-network computers
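
As a minimal sketch of item 1 (the database name and backup paths are hypothetical, and retention, devices and scheduling will vary by shop):

-- Hypothetical database and path; a nightly full backup that overwrites the
-- previous file on that device.
backup database EDW
to disk = N'\\backuphost\sqlbackups\EDW_full.bak'
with init

-- Hourly transaction log backups between fulls; requires the FULL recovery model.
backup log EDW
to disk = N'\\backuphost\sqlbackups\EDW_log.trn'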

It's obviously difficult to evaluate these team members on the basis of something that will likely and hopefully never occur. Therefore, readiness metrics are more appropriate for the evaluation.


Posted November 15, 2005 1:21 PM
Permalink | 5 Comments |

Today, I'm thinking about building the DW/BI data model and the various approaches builders take to its relationship with the source system(s). This is one of about 35 different aspects of a DW/BI methodology that are often handled in a shop in de facto fashion, and often suboptimally. With this entry, I'd like to point out the importance not only of this decision, but of making all these decisions in a heads-up fashion.

Theoretically, the source system(s) should have little impact on the DW/BI data model. However, let’s take a look at real world data warehouse model building strategies.

Data warehouse data models are built with one of three strategies in mind...

1. Mimic the operational data model(s) completely. This saves time in the modeling process; you know you are building something for which you will have the data, and you know where you’ll get it. With this approach, you also solve a problem you are probably having: limited query access to operational systems.
2. Model the data warehouse based entirely on user access requirements, irrespective of the source systems and the availability of data. This will create a data model that is fit for purpose and possibly technically elegant, but the data acquisition piece will then need to be tied in with the modeling effort. Since the source data has not been considered, the mapping effort may be long and there may be unrealistic constructs in the model.
3. Model the data warehouse based on user access requirements, but with knowledge of and an eye on the source systems. Data acquisition design is a parallel activity in this strategy, with source system identification and analysis having previously been done, so the modeler understands what is available and doesn’t stray beyond its bounds during the modeling effort.

I recommend the third approach for DW/BI modeling efforts. This strategy strikes the best balance between designing for user access and building that which you can reasonably populate. This illustrates the need for a close working relationship between the data modeler(s) and the data acquisition programmer(s), whose work eventually needs to be brought together.


Posted November 13, 2005 7:46 PM
Permalink | No Comments |