Data Profiling: Familiarise yourself with your data

In a recent project, I came across an all too familiar scenario: how to get a feel for a new data set?

I received monthly snapshot data for a Life Insurance project, but the data came from 5 different providers. The data was similar, but the way they represented it differed and I wanted to hide this complexity from the user by building a data model that would standardise how the data is tied together. This meant that I needed to know which fields the providers had in common, and how well populated those fields were across multiple months. I needed a data profiler.

A quick web search brought me to Steve Dark’s wonderful website. He tackles the issue by copying his data profiling objects to existing QlikView applications, thus checking data integrity and getting a feel for the data contained within fields.

However, my needs were different.

I didn’t have an existing data model (yet). So I created an app that focuses on pulling in 1 or more documents (be it in txt, xls, etc.) on the “Get Data” tab by setting up the load script and then output summary statistics on all the fields, contained in the multiple tables loaded under “Get Data”.

The data profiling statistics from Steve has also been extended, thanks to a helpful comment from Barry Harmsen of QlikFix, so the measures are quite comprehensive:

  • The type of data: numeric, text, special characters, or a mix
  • The completeness: number of rows, proportion unique values (cardinality), proportion non-null and non-blank values (information density), null and blank counts.
  • Characteristics: minimum and maximum length of values within the field
  • Data range: max-, min-, and median values
  • Sample: 10 or 30 first loaded values to get a feel of what the data contained within the field truly looks like

I foresee three use cases for this approach:

  1. the data exploration and -modeling phases of projects,
  2. the validation of recurring monthly data sets,
  3. and the documentation of data sources.

I might flesh out these use cases in more detail in future blogposts.

The resulting app can be downloaded via my post on QlikCommunity: http://community.qlik.com/docs/DOC-8028

A big thank you goes out to Steve for his excellent blog and his model, which served as impetus for my attempt.

Advertisements

About Charles Crous

Data visualisation in support of better decision-making is my passion. I'm a data enthusiast. An ardent knowledge-seeker, trying to understanding the relationships between data points one data set a time.

2 comments

  1. Hi Charles – many thanks for the mention and the link. Glad you found my app a useful springboard for you own. The other thing I tend to do, if the profiler objects alone are not enough, is to export the data model to text files (via Settings \ Document Properties \ Tables) and bring those into QlikView. This makes it very quick to see which fields only have one possible value, or even only contain nulls.

    Thanks for sharing your thoughts and your app (presently pending moderation, but hopefully it will be there shortly).
    -Steve

  2. Hi Steve, thank you very much for the response.

    Your tip about making use of Settings>>Document Properties>>Tables is a good one, and definitely something I should make use of more often. I’m still looking at ways to get a better handle on new data sets, and if you are interested on exploring the data visually there is a wonderful free tool called Raw (http://app.raw.densitydesign.org/#%2F) that allows just that.

    I see the moderation has been completed. Would love to hear your thoughts should you get a change to use it. =)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: