Automating the Path from Raw Data to Analysis
Updated: Nov 7, 2019
by: Emily Goldhammer
“The goal is to turn data into information, and information into insight” – Carly Fiorina, former president, CEO, and chair of Hewlett-Packard Co.
Effective data management is part of a holistic solution to leverage the power of analytics to drive better, more timely, and proactive decisions. Developing data infrastructure (such as relational databases) and processes (such as data extraction templates) enables efficient manipulation of data for future analysis and model development.
Raw Data → Database → Modeling
Creating data infrastructure & processes focuses on efficiency, which means minimizing the amount of manual extraction required. It is also proactive: infrastructure should be built with an eye toward future data sources and analyses, not just the task at hand.
Various validation steps ensure that data limitations are known and data integrity is respected (think – how does this data connect to an authoritative or primary source? Can the scope of this data source be mapped to other sources?).
I. Raw Data
The process for managing a Raw Data source includes the following:
Collection → Assessment → Extraction → Validation
Collection: Collect raw data values in primary source files.
Assessment: Assess primary source files for consistent reporting format. Example questions shown below:
What lower level details are reported?
What subtotals are reported?
Extraction: Based on the reporting format, create a template to extract values from the primary source files.
Templates allow the data extraction process to be automated – they can be built to populate dynamically & automatically, so that when the template tabs are copied & pasted into new raw data files, they pull and format the necessary data on their own.
Macro-free Excel functions allow for consistent reference to primary source tabs and avoid creating links that refer to a specific workbook.
Validation: Additional formulas are included in the template to cross check for human error and internal accuracy. Examples explained below:
A combo of INDEX / MATCH & INDIRECT functions can check that the locations of rows & columns are consistent in each tab.
A combo of SUMIF & INDIRECT functions can validate that the subtotal of lower-level values equals the reported total values.
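The two template checks above can be mirrored in plain Python. Here is a minimal sketch (the tab layout, row labels, and numbers below are hypothetical, not from the article): one check verifies that every extracted tab reports the same rows in the same order, the other that each reported subtotal equals the sum of its lower-level values.

```python
# Hypothetical extracted tabs: each maps row labels to reported values.
# The layout and numbers are illustrative only.
tabs = {
    "Q1": {"Shirts": 120, "Pants": 80, "Shoes": 50, "Subtotal": 250},
    "Q2": {"Shirts": 100, "Pants": 90, "Shoes": 60, "Subtotal": 250},
}

def check_layout(tabs):
    """Mirrors the INDEX / MATCH & INDIRECT check: every tab must
    report the same row labels in the same order."""
    layouts = [tuple(tab.keys()) for tab in tabs.values()]
    return all(layout == layouts[0] for layout in layouts)

def check_subtotals(tabs):
    """Mirrors the SUMIF & INDIRECT check: the reported subtotal
    must equal the sum of the lower-level values."""
    return all(
        sum(v for k, v in tab.items() if k != "Subtotal") == tab["Subtotal"]
        for tab in tabs.values()
    )

print(check_layout(tabs))     # True: both tabs share one layout
print(check_subtotals(tabs))  # True: 120+80+50 == 250 and 100+90+60 == 250
```

Either check returning False flags a tab for manual review before its values are loaded downstream.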
II. Database Development
The steps taken to develop a database are:
Relationship Development → Data Upload → Maintenance
Relationship Development: Establish links between the data sets based on raw data assessment and data set commonalities.
Create tables for main data sets – what sort of data are you looking to compare? Each major data type should have its own table.
Create tables for category details – is there additional information available for the data you have collected? Store the details for each type of data.
Create unique IDs to allow for consistent reference to data that appears more than once. *This step is important if you are using a database platform such as Microsoft Access, because you will need to refer to different categories numerically – e.g. if you have collected data about different clothing items, these items may be grouped into categories such as shirts, pants, and shoes. You could create a unique ID for each clothing category:
∙ 1 = shirt
∙ 2 = pants
∙ 3 = shoe
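The relationship-development steps above can be sketched in SQL. This is a minimal example using Python's built-in sqlite3 rather than Access, and the table and column names are illustrative: a category-detail table assigns each clothing category its unique numeric ID, and the main data table refers to categories only through that ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Category-detail table: one row per category, with a unique numeric ID
# (1 = shirt, 2 = pants, 3 = shoe, as in the article's example).
cur.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.executemany("INSERT INTO categories VALUES (?, ?)",
                [(1, "shirt"), (2, "pants"), (3, "shoe")])

# Main data table: items refer to their category by ID, never by name,
# so a category rename is a change in exactly one place.
cur.execute("""CREATE TABLE items (
    item TEXT, category_id INTEGER REFERENCES categories(id))""")
cur.executemany("INSERT INTO items VALUES (?, ?)",
                [("oxford", 1), ("chinos", 2), ("sneaker", 3)])

# Join back to category names when reporting.
rows = cur.execute("""SELECT items.item, categories.name
                      FROM items JOIN categories
                        ON items.category_id = categories.id
                      ORDER BY items.rowid""").fetchall()
print(rows)  # [('oxford', 'shirt'), ('chinos', 'pants'), ('sneaker', 'shoe')]
```

The same two-table pattern carries over to Access or any other relational platform.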
Data Upload: Normalize data extracted from different raw data sources.
Format the data – create tables such that the data is organized with each unique entry in a new row and each column referring to a specific category. Make sure data is formatted to have a consistent number of columns.
Map category names consistently.
Identify unique IDs.
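The upload steps above can be sketched as a small normalization pass. In this sketch the raw rows, alias map, and ID map are all hypothetical: it pads rows to a consistent column count, maps variant category spellings to one canonical name, and resolves each name to its unique ID.

```python
# Hypothetical raw rows extracted from two sources; column counts and
# category spellings are inconsistent, as often happens across sources.
raw_rows = [
    ["oxford", "Shirts"],       # missing the quantity column
    ["chinos", "pants", 4],
    ["sneaker", "Shoe", 2],
]

# Map variant spellings to one canonical category name.
aliases = {"shirts": "shirt", "shirt": "shirt",
           "pants": "pants", "shoe": "shoe"}

# Unique IDs, as in the clothing example (1 = shirt, 2 = pants, 3 = shoe).
category_ids = {"shirt": 1, "pants": 2, "shoe": 3}

def normalize(rows, n_cols=3):
    out = []
    for row in rows:
        row = list(row) + [None] * (n_cols - len(row))  # consistent column count
        name = aliases[row[1].lower()]                  # consistent category names
        out.append({"item": row[0],
                    "category_id": category_ids[name],  # resolve unique ID
                    "quantity": row[2]})
    return out

print(normalize(raw_rows))
```

Rows that come out of this pass share one shape, so they can be uploaded to the database tables without per-source handling.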
Maintenance: Keep the database up to date and valid.
Ensure accuracy by adding latest values from primary sources.
Enhance database capability through review of possible tools and queries.
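One way to sketch the maintenance step, continuing the sqlite3 example (table name and figures are illustrative): an INSERT OR REPLACE upsert adds the latest values from a primary source while overwriting any month that was revised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE monthly_totals (month TEXT PRIMARY KEY, total REAL)")
cur.executemany("INSERT INTO monthly_totals VALUES (?, ?)",
                [("2019-09", 250.0), ("2019-10", 245.0)])

# Latest pull from the primary source: one revised month, one new month.
latest = [("2019-10", 260.0), ("2019-11", 255.0)]

# INSERT OR REPLACE keeps the table current: existing months are
# overwritten with revised values, new months are appended.
cur.executemany("INSERT OR REPLACE INTO monthly_totals VALUES (?, ?)", latest)

print(cur.execute("SELECT * FROM monthly_totals ORDER BY month").fetchall())
# [('2019-09', 250.0), ('2019-10', 260.0), ('2019-11', 255.0)]
```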
III. Analysis & Modeling
Storing valid and accurate data in one place allows for data pulls (queries) that can feed directly into dynamic analysis and modeling. A query from the database can feed automated charts and graphs – once the structure for the charts and graphs is developed, refreshed query data is pasted into a single tab, and from there it flows through to all other tabs, enabling efficient analysis and consistent modeling.
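The query-to-analysis flow can be sketched with sqlite3 (the schema and figures are illustrative, not from the article): a single query produces the summary rows a chart template would consume, so refreshing the analysis is just re-running the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (category TEXT, month TEXT, units INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("shirt", "2019-10", 120), ("shirt", "2019-11", 100),
    ("pants", "2019-10", 80),  ("pants", "2019-11", 90),
])

# One query produces the chart-ready rows; downstream tabs and plots read
# only this result, so a refresh is a single re-run of the query.
chart_data = cur.execute("""SELECT category, SUM(units)
                            FROM sales GROUP BY category
                            ORDER BY category""").fetchall()
print(chart_data)  # [('pants', 170), ('shirt', 220)]
```

In the Excel workflow the article describes, `chart_data` is what gets pasted into the single source tab that all other tabs reference.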