In a recent project, I came across an all too familiar scenario: how to get a feel for a new data set?
I received monthly snapshot data for a Life Insurance project, but the data came from 5 different providers. The data was similar, but the way they represented it differed and I wanted to hide this complexity from the user by building a data model that would standardise how the data is tied together. This meant that I needed to know which fields the providers had in common, and how well populated those fields were across multiple months. I needed a data profiler.
A quick web search brought me to Steve Dark’s wonderful website. He tackles the issue by copying his data profiling objects to existing QlikView applications, thus checking data integrity and getting a feel for the data contained within fields.
However, my needs were different.
I didn’t have an existing data model (yet). So I created an app that focuses on pulling in 1 or more documents (be it in txt, xls, etc.) on the “Get Data” tab by setting up the load script and then output summary statistics on all the fields, contained in the multiple tables loaded under “Get Data”.
The data profiling statistics from Steve has also been extended, thanks to a helpful comment from Barry Harmsen of QlikFix, so the measures are quite comprehensive:
- The type of data: numeric, text, special characters, or a mix
- The completeness: number of rows, proportion unique values (cardinality), proportion non-null and non-blank values (information density), null and blank counts.
- Characteristics: minimum and maximum length of values within the field
- Data range: max-, min-, and median values
- Sample: 10 or 30 first loaded values to get a feel of what the data contained within the field truly looks like
I foresee three use cases for this approach:
- the data exploration and -modeling phases of projects,
- the validation of recurring monthly data sets,
- and the documentation of data sources.
I might flesh out these use cases in more detail in future blogposts.
The resulting app can be downloaded via my post on QlikCommunity: http://community.qlik.com/docs/DOC-8028
A big thank you goes out to Steve for his excellent blog and his model, which served as impetus for my attempt.