Self-Made Data Taxonomy

In August 2014, the NY Times reported that the vast majority of highly paid data scientists in the US spend a disproportionate amount of time cleansing data. The same article refers to the “janitor work” of cleansing data sets and to the increasingly worrying amount of time spent on it.

Data cleansing and integrity checks are a big part of the role of data analysts, data scientists or anyone manipulating data sets that are too large to be exhaustively audited line by line.

This piece should help with the scenario in which one receives a large amount of data from an unknown or unverified source. Before diving into the business problem, hypotheses and analysis, triple-checking the validity of the underlying data is crucial to the soundness of the analysis.

Understanding the expected nature of a structured data set

The first step is to assess the type of data, the expected values and how the values were collected in a structured data set. Practically speaking, it is all about carefully reading and summarizing the notes, data taxonomy or related written information about a given data set. In some cases, none of the above is documented and an in-depth discussion with the individual(s) who pulled the data is needed.

Below is my suggested approach to writing a data taxonomy and related notes for the data set you are working with.

I would distinguish between dimension fields, core data fields, and derived data fields.

Dimensions are, for me, any non-numerical or alphanumerical values, which are nonetheless very helpful in qualifying, segmenting, or later on analysing the data. A dimension can be, but is not limited to, any of the following: product identifier, customer identifier, time of day, region, city, day of the week, product category, store name, etc.

Core data fields are fields that contain raw numerical values; in other words, numerical values that are taken directly, without alteration, from or generated by an activity, whether that activity is a transaction or a non-monetary activity such as a visit to a website.

Derived data fields are fields that contain numerical values that are themselves constructed on the back of core data, which may or may not be present in the data set you have access to. Examples of derived values are ratios, scores, percentages, etc.
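As a rough illustration of how this three-way split can be kept alongside the data, here is a minimal sketch in Python; the enum and the column names are purely hypothetical:

```python
from enum import Enum

class FieldType(Enum):
    DIMENSION = "dimension"   # non-numerical / alphanumerical qualifiers (store name, region, ...)
    CORE = "core"             # raw numerical values taken directly from an activity
    DERIVED = "derived"       # values constructed from core data (ratios, scores, percentages)

# Hypothetical classification of the columns in a retail data set
field_types = {
    "store_name":    FieldType.DIMENSION,
    "day_of_week":   FieldType.DIMENSION,
    "daily_sales":   FieldType.CORE,
    "profit_margin": FieldType.DERIVED,
}
```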

Suggested data taxonomy documentation template

Field 1: Field location | Source | Header’s presence | Header’s complementary info | Date of data pulled | Dimension or core data or derived data | Expected value range | Definition & caveats | Latest data verification date or author | Other notes
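The same template can also be kept as a structured record so it can be versioned alongside the analysis. A minimal sketch, assuming Python; the class and attribute names simply mirror the slots above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaxonomyEntry:
    field_location: str              # e.g. "1st field from left to right"
    source: str                      # e.g. SQL database and the infrastructure it sits on
    has_header: bool                 # header's presence
    header_info: str                 # what the header abbreviation stands for
    date_pulled: str                 # date of the data pull and how it was confirmed
    field_type: str                  # "dimension", "core data" or "derived data"
    expected_value_range: str
    definition_and_caveats: str
    verification: Optional[str]      # latest data verification date or author, if any
    other_notes: Optional[str] = None
```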

In practice, I would expect a document that looks like the one below, describing the taxonomy of 3 fields of a large data set for example purposes.

Field “strNm”: 1st field from left to right |
SQL Database: dmtdb.pwvmt.ltjk on client’s infrastructure |
Has header |
Header’s name is a system-generated abbreviation for Store Name |
Date of data pulled is 17 January 2015 as per client’s email sent on 21 January 2015 |
Dimension |
Client has a total of 120 active stores in the UK; expect up to 120 distinct store names |
Store name is unique; whether these 120 active stores were all active during the date range the data was pulled for is unknown as of 15 March 2015 |
No known data verification date and author |
No further notes

 

Field “Sales$”: 2nd field from left to right |
SQL Database: dmtdb.pwvmt.ltjk on client’s infrastructure |
Has header |
Header’s name is a system-generated abbreviation for Sales Generated |
Date of data pulled is 17 January 2015 as per client’s email sent on 21 January 2015 |
Core data |
Expect positive decimal values in £, without the £ sign, when the store is operating. If a store is not operating on a given day, no record is present |
A £ value, without the currency sign, for the sales generated by a store on a given day. Which of the 120 active stores are open 7 days a week is unknown |
No known data verification date and author |
The $ in the field header is misleading: the currency is £, but it is not explicitly quoted in the field’s values

 

Field “Profitaly$”: 3rd field from left to right |
SQL Database: dmtdb.pwvmt.ltjk on client’s infrastructure |
Has header |
Header’s name is a system-generated abbreviation for Profitability Ratio |
Date of data pulled is 17 January 2015 as per client’s email sent on 21 January 2015 |
Derived data |
Anywhere between 0% and 100%, without the % sign. Client’s CEO says that its most profitable store operates at a gross profit margin of 50%; expect most stores to operate in the 20% range |
Constructed as Sales divided by Gross Store Costs, but Gross Store Costs is not a field in this data set |
No known data verification date and author |
High reliance on how Gross Store Costs was evaluated and computed, for which we have no visibility
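Once the taxonomy is written down, each documented expectation can be turned into a quick integrity check on the pull itself. A minimal sketch, assuming the data is loaded into pandas and keeps the raw headers strNm, Sales$ and Profitaly$; the file name is hypothetical:

```python
import pandas as pd

# Hypothetical file name for the client's pull of 17 January 2015
df = pd.read_csv("client_pull_2015-01-17.csv")

# strNm: at most 120 distinct store names expected
distinct_stores = df["strNm"].nunique()
print(f"Distinct store names: {distinct_stores} (expected <= 120)")

# Sales$: positive decimal values in £, currency sign omitted
sales = pd.to_numeric(df["Sales$"], errors="coerce")
print(f"Non-numeric or non-positive Sales$ rows: {int((sales.isna() | (sales <= 0)).sum())}")

# Profitaly$: documented as a percentage between 0 and 100, % sign omitted
profit = pd.to_numeric(df["Profitaly$"], errors="coerce")
print(f"Profitaly$ rows outside 0-100: {int(((profit < 0) | (profit > 100)).sum())}")
print(f"Maximum observed profitability: {profit.max()} (CEO claims the best store runs at 50%)")
```

Any check that fails here is worth raising with the client before the analysis starts, since the taxonomy records exactly what was expected and why.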