One of the promises of “AI” is that it will eliminate the mundane, redundant tasks we don’t like to do. And yes, I put “AI” here in quotes to reflect the buzzword that it has become. When you dig a little deeper, there is a sense among data and analytics experts that perhaps AI could automate the mundane work that no data analyst likes to do: data cleaning.
A popular quote is that 80% of the analyst’s time is spent on that boring, dreaded task of dealing with dirty data. I don’t know where the data is to back up that 80% figure. I suspect it is just a phrase that everyone has heard someone else say and assumes it must be true because it corresponds to their own experience. I doubt you could get a reliable estimate of that number, but suffice it to say, data cleaning does seem to be something that many agree takes a lot of time.
Can something be done to reduce that time? Can data cleaning be fully automated, allowing the analyst to spend their time solely on the analysis? I’m not so sure that it can. Why? Let’s dig into it.
To start, we should define what we mean by data cleaning. Based on my experience, I take data cleaning to be the work you do on the raw data from its source to prepare it for the desired analysis. In no particular order, here are some of the possible elements of data cleaning along with some example questions to ask for those elements.
Data formatting/recoding - Do I need to remove characters that weren’t read correctly? Are my column names in all caps? Are there abbreviations or acronyms in the data that the average user wouldn’t understand? Do the data types (e.g., character, numeric) match the analysis? Did the data types get passed through correctly as they were input? (A short code sketch after this list illustrates a few of these elements.)
Data transformations - Do I need to convert values? Do I need to protect sensitive individual information so that it isn’t identifiable in the results? Do I need to change the scale of the data? Are all the dates consistent to be able to use them in analysis?
Determining and refining the output - Do I have the relevant response for my model? Which method should I use to calculate the response? Which response is relevant to my business problem?
Data subsetting - Have I removed blank columns and rows? Should I remove summary rows or unnecessary header information? Are there columns that have some constant value? Do I need to remove irrelevant columns and rows to make the data more manageable for the analysis? Which rows correspond to the business question of interest?
Scoping of the analysis - How do I translate the broad business question into a statistical question? Have I done an exploratory analysis of the data? Do I understand the data context and how it was collected? How does the way the data were collected influence the analysis?
Grouping data - Should I group numerical data into categories? Are there too many categories that make analysis less effective? Should I combine some categories based on the analysis goals?
Outliers - How should I handle those unusual values? Do I just delete them? Ignore them? Verify them? Are those outlying points representative of what I might see elsewhere in the data? Is there an explanation for the unusual values?
Missing or duplicate data values - Have I removed duplicate data values? Similar to the questions about outliers, what should I do about missing values? Delete them, ignore them, or investigate them? Should I use imputation methods to fill in the blanks?
Data integration - How do I connect some of the data in one table to some in another table? How do I link data sources that may be in different databases that require different connections? What type of join should I use? What is the best way to integrate the disparate data sources which may also create additional missing values?
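To make a few of these elements concrete, here is a minimal sketch of what some of this work often looks like in code. I’m using Python with pandas, and the file names, column names, and recodes are entirely hypothetical; the point is the kinds of operations involved and how many of them hide small judgment calls, not a recipe for your data.

```python
import pandas as pd

# Hypothetical raw extract; the file name, column names, and codes are illustrative only.
raw = pd.read_csv("employees_raw.csv")

# Formatting/recoding: lowercase ALL-CAPS column names and expand cryptic codes.
df = raw.rename(columns=str.lower)
df["dept"] = df["dept"].replace({"HR": "Human Resources", "ENG": "Engineering"})

# Data types: parse dates that arrived as text so they can be used in the analysis.
df["hire_dt"] = pd.to_datetime(df["hire_dt"], errors="coerce")

# Subsetting: drop fully blank rows/columns and columns holding a single constant value.
df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
df = df.loc[:, df.nunique(dropna=False) > 1]

# Duplicates: remove exact duplicate records.
df = df.drop_duplicates()

# Grouping: bin a numeric field into categories; the cut points are a judgment call.
df["tenure_band"] = pd.cut(df["tenure_years"], bins=[0, 2, 5, 10, 50],
                           labels=["<2", "2-5", "5-10", "10+"])

# Integration: a left join keeps every employee, but anyone missing from the
# hypothetical compensation table ends up with NaN pay (new missing values
# created by the join itself).
comp = pd.read_csv("compensation_raw.csv").rename(columns=str.lower)
df = df.merge(comp, how="left", on="emp_id", validate="one_to_one")
```

Even in this tiny sketch, several lines encode assumptions: which codes to expand, where to put the bin edges, which join to use. That is exactly the point.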
There may be more, but these are the common elements I’ve encountered as part of data cleaning. So can these be automated? Some are better candidates than others. But notice how many of these elements require some judgment.
For example, you can’t just delete every outlier you find because those outliers may be valid, legitimate data. Sometimes outliers are the most informative pieces of data you have because they indicate something unusual going on. Of course, if they are due to data entry errors, you fix them. I’m wary of outlier detection algorithms that eliminate data based on statistical rules or distance to other points in the dataset.
To further expand the example of outliers, consider tenure data on employees in an organization. Suppose you find one employee who has been at the organization much longer than anyone else. Upon investigation, you find out this data point represents the company’s founder. If I were doing some analysis of that tenure data against compensation to study how compensation changes as a function of tenure, I would likely want to exclude the founder, whose compensation trajectory doesn’t follow that of the typical employee. But if that founder decides to retire, then I would include that data point in a voluntary turnover calculation to determine our replacement hiring needs. Whether or not to include an outlier depends on the analysis at hand.
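To show how mechanical the statistical rules can be, here is a tiny, entirely made-up illustration of that tenure example. A standard 1.5×IQR rule flags the founder’s tenure either way; whether the row stays or goes depends on which analysis you are running, and that decision lives outside the rule.

```python
import pandas as pd

# Made-up tenure data (in years); the 35-year value stands in for the founder.
tenure = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "tenure_years": [1.5, 3.0, 4.2, 6.0, 35.0],
    "left_voluntarily": [False, True, False, False, True],
})

# A mechanical 1.5*IQR rule happily flags the founder as an outlier...
q1, q3 = tenure["tenure_years"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ((tenure["tenure_years"] < q1 - 1.5 * iqr) |
              (tenure["tenure_years"] > q3 + 1.5 * iqr))

# ...but what to do with the flagged row depends on the question being asked.
compensation_model_data = tenure[~is_outlier]       # exclude for a tenure-vs-compensation model
turnover_count = tenure["left_voluntarily"].sum()   # keep the retiring founder in the turnover count
```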
Some of these data-cleaning elements don’t require as much judgment as others. Some of them are preventable with higher-quality data. There are two potential solutions to reduce the amount of data cleaning needed. One underrated solution is data governance. That means getting clear on the data definitions, determining consistent ways of calculating metrics, and setting boundaries and controls for the data and how it's collected and stored. This should reduce the amount of reformatting and grouping needed.
A second solution is data engineering. Any task you find yourself repeating is a candidate for coding. Eagerness to analyze the data sometimes causes analysts to take shortcuts in data cleaning that are not repeatable. And yes, I’ve been guilty of that eagerness. So some data cleaning is self-inflicted by poor habits. Good data engineering with automated data workflows should automate more of the data integration and reduce the time needed for data cleaning.
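As a rough sketch of what coding the repeated work can look like, here is a hypothetical cleaning function (the file and column names are again invented). The value isn’t the specific steps; it’s that the cleaning assumptions live in one place and run the same way on every refresh instead of being redone by hand.

```python
import pandas as pd

def clean_employee_extract(path: str) -> pd.DataFrame:
    """Encode the cleaning assumptions once so every refresh repeats them exactly."""
    df = pd.read_csv(path)
    df = df.rename(columns=str.lower)                               # consistent column names
    df["hire_dt"] = pd.to_datetime(df["hire_dt"], errors="coerce")  # dates usable in analysis
    df = df.dropna(how="all").drop_duplicates()                     # blank rows and exact dupes
    return df

# Each refresh runs the same documented steps.
weekly = clean_employee_extract("employees_latest.csv")
```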
However, even with those solutions in place, there will still be some data cleaning needed. The time required may be significantly less than the 80% we noted earlier. But I don’t think it could be reduced to anything close to 0%, even with the promises of AI.
Why? In short, data cleaning is the process of making a series of judgments and assumptions about your data. It depends. There’s not a clear-cut, one-size-fits-all answer. So data cleaning can only be automated to the extent that you can automate judgments and assumptions. Can a machine or AI make those judgments? To be honest, I don’t know. I’m not sure that I see much of that ability in AI technology today. Until I see that, I will remain skeptical of the ability of AI to significantly reduce the time devoted to data cleaning.
You can easily spot a novice data analyst by how little data cleaning they’ve done. The novice does no cleaning and just dumps whatever data they have into a software tool and starts interpreting the garbage results. The expert analyst understands all the ways the analysis can be misleading depending on how you clean and prepare the data. Careful data cleaning and preparation is what makes the insights of an expert analyst so valuable. A clean dataset is one where the assumptions are explicit and have been carefully checked.
So rather than begrudge excessive amounts of data cleaning, take some steps to make the process better by applying good data governance and data engineering. Partner with other experts in IT, data management, data quality, and data governance to drive down that time. I’ve found that these departments often don’t know how the data is being used and welcome input on how it needs to be prepared to maximize its value.
Then when you’ve automated what you can, recognize that the remainder is the part where the expert analyst shines. Use that expert judgment and skill. Perhaps we should not consider such an important task so mundane.
Consider how a sculptor views a rough-hewn block of marble. Knowing that there is a beautiful piece of art inside, the sculptor doesn’t see the laborious work of chipping away at the marble as wasted time. Perhaps good tools can make that work easier, but there will always be some chipping away to do. It is the core of what it means to be a sculptor.
Similarly, data cleaning and preparation are core to good analysis. While some tools can make the data cleaning easier, there will always be some amount that requires expert judgment. And just like the sculptor, the analyst can look beyond the rough-hewn data to see the beauty hiding within.