In the ever-evolving landscape of data analytics, the phrase "clean data" holds immense significance. It is the backbone of any successful transformation, analysis, and visualisation endeavor. To truly understand its impact, let's delve into the rules of data structure and why they are paramount for unlocking meaningful insights.
There are a couple clear and simple rules to follow to check if data is clean:
- One Column (Data Field) for Each Category or Measure.
- One Data Type for Each Column. Some examples include strings and numeric.
- A Single Date Column. This might change if you have multiple date fields e.g order date and shipping date that can't be combined.
- One row should be a record containing all the values in each data field possible.
However, in the real world, data is never this clean immediately, and we are stuck with dirty data. This insidious adversary can take many forms that violates the rules of clean data. Some examples of this include:
- Multiple Data Points in a Single Field.
- Mis-spelling Categorical Field Values. 'High,' 'Highh,' and 'Hig' might all represent the same category but introduce unnecessary confusion.
- Records Split Across Multiple Rows.
- Missing Headers. Without proper labeling, interpreting the purpose of each column becomes a daunting task.
Some other aspects of dirty data can be more nuanced. For example, Out-of-Date Data can mislead analyses and decisions, however, how can we tell if data is out-of-date? It is vital that we know and understand the sources of our dataset.
The second issue is Missing Values. This can impede the completeness required for robust analyses and visualizations. However, it is important to tell the difference between Missing Values, Nulls and Zeros.
- Missing Values are values that you expect to exist but are not within the dataset.
- Nulls are values that don't exist. This could mean they are missing and could be sourced elsewhere, or they simply don't exist.
- Zeros (0) are a valid and known value, which contribute to calculations and analysis. It is important to make sure you don't interpret Nulls as Zeros!
There are many different ways to clean data for analysis using software like Tableau Prep or Alteryx. Each method would require multiple blog posts so I'll briefly explain popular techniques:
- Renaming headers accurately.
- Splitting columns in order to create a data field with a single data type, and a category or measure.
- Filtering to only retain accurate and necessary data.
- Removing spelling mistakes by grouping algorithms.
- Sourcing new data or creating new calculations in order to fit the needs of the task
Embracing the rules of data structure and steering clear of dirty data pave the way for enhanced analytics, efficient transformations and impactful visualisations. Clean data streamlines processes, fosters accurate insights, and cultivates a foundation for informed decision-making.