Part 2: Artificial Intelligence Fundamentals: Get Your Data Quality Sorted First!

In 2023, Intech hosted a series of webinars on the role that data quality plays in driving effective and reliable artificial intelligence tools. The following article is the first in a series that will explore the themes that emerged from those webinars, as presented by our panellists, who addressed the deep questions raised by the increasing prevalence of AI.

There’s a presumption that artificial intelligence means that computer programs are intelligent in their own right. This is simply not true.

AI only acts on information that it’s been given. This means there’s no real ‘intelligence’ in AI at all. Rather, AI learns from data and uses that to make predictions.

Because it does not have intelligence of its own, there are some clear examples where AI gets things wrong.

For example, driverless cars have been known to fail to recognise new situations (such as pedestrians unexpectedly crossing roads), resulting in accidents.

Similarly, AI tools used in healthcare can perpetuate unfair or discriminatory treatment practices, and medical chatbots have been known to provide bad advice.

On a larger scale, Meta – the company that owns Facebook and its related social media platforms – recently barred political advertisers from using generative AI tools in ads because there was too much risk that the ads would be either inappropriate or misinformed. The company now requires political advertisers to disclose any use of AI so Facebook can decide whether or not to publish those ads.

All these examples are a result of feeding AI bad data.

Ultimately, AI safety is about not feeding it bad data, which means protecting your input data from several key problems that can arise when training an AI. What sort of problems do we mean?

Firstly, your data can simply be incorrect: badly structured, meaningless or invalid. Data can also be duplicated, making it look like one person is actually two different people or entities. This can leave an AI tool unable to identify correlations between one event and another event happening to the same customer.
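
To make this concrete, here is a rough sketch (using Python and the pandas library, with made-up column names and records rather than anything from a real Intech project) of the kind of check that catches invalid values and duplicates hiding behind formatting differences:

    import pandas as pd

    # Hypothetical customer extract showing the problems described above: an
    # impossible date of birth, and the same person captured twice with
    # slightly different formatting.
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "name": ["Jane Citizen", "JANE  CITIZEN", "John Smith"],
        "date_of_birth": ["1985-03-12", "1985-03-12", "2090-01-01"],
    })

    # Check 1: plainly incorrect values (here, a date of birth in the future).
    dob = pd.to_datetime(customers["date_of_birth"], errors="coerce")
    print(customers[dob.isna() | (dob > pd.Timestamp.today())])

    # Check 2: duplicates hiding behind formatting differences. Normalising case
    # and whitespace before comparing shows that 101 and 102 are the same person.
    normalised = customers["name"].str.upper().str.replace(r"\s+", " ", regex=True)
    print(customers[normalised.duplicated(keep=False)])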

Secondly, bias in data can make AI unsafe – and there are all kinds of bias to be aware of:

  • Sampling bias – This inevitably occurs when you choose who you are going to sample to collect data about;
  • Availability bias – This occurs when not all relevant data is available to use;
  • Historical bias – As time passes and generations change, we have to be careful about teaching AI algorithms using yesterday’s data, which doesn’t always reflect tomorrow;
  • Confirmation bias – This occurs when we favour data that confirms our own currently held opinions.

Clearly, we need to mitigate, or at least be aware of, our biases as we collect and cleanse data for AI purposes.
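
As a simple illustration of what checking for sampling bias might look like in practice, the sketch below compares the age profile of a hypothetical training sample against the population it is meant to represent; all of the figures are invented for the example:

    # Illustrative check for sampling bias: compare the age profile of a training
    # sample against the population it is meant to represent. All figures are
    # invented for the example.
    population_share = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
    sample_share = {"18-34": 0.55, "35-54": 0.35, "55+": 0.10}

    for band, expected in population_share.items():
        gap = sample_share[band] - expected
        warning = "  <-- unrepresentative" if abs(gap) > 0.10 else ""
        print(f"{band}: population {expected:.0%}, sample {sample_share[band]:.0%}, "
              f"gap {gap:+.0%}{warning}")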

The process of implementing an AI tool starts with data ingestion. Simply put, this is where we collect a whole lot of data and put it through a preparation process to try and make sense of it.

We do this by creating models of that data to evaluate it, and ultimately deploying it within some sort of AI platform that can make decisions with the given input.
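
To give a feel for how those stages fit together, here is a deliberately simplified sketch in Python; the file name, columns and choice of model are assumptions made for illustration, not a description of any particular Intech platform:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # 1. Ingestion: pull the raw extract into one table. The file name and its
    #    columns ("age", "income", "churned") are hypothetical.
    raw = pd.read_csv("customer_extract.csv")

    # 2. Preparation: the bulk of the effort in practice. Drop records missing
    #    key fields, remove duplicates and coerce types before modelling begins.
    prepared = (
        raw.dropna(subset=["age", "income", "churned"])
           .drop_duplicates(subset=["customer_id"])
           .astype({"age": int, "churned": int})
    )

    # 3. Modelling: fit a simple classifier on the prepared data and evaluate it.
    X_train, X_test, y_train, y_test = train_test_split(
        prepared[["age", "income"]], prepared["churned"], test_size=0.2, random_state=0
    )
    model = LogisticRegression().fit(X_train, y_train)

    # 4. Deployment (in miniature): the fitted model now makes decisions on new input.
    print("Hold-out accuracy:", model.score(X_test, y_test))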

However, it is at the data ingestion and preparation stages that 80% of the work effort sits.

In my experience, the process of ingesting data, then preparing, structuring and making sense of it, is always a much larger job than initially anticipated. (I’ve been doing this for 25 years, and it still surprises me how often this happens.)

While the preparation step accounts for 80% of the work, it’s important to recognise that it’s also the most common point of failure. This is the stage where incorrect or biased data becomes a problem: if it isn’t addressed here, it festers all the way through the process and becomes amplified, and that is the most common cause of poor data quality.

At this all-important preparation stage, data parsing is the simplest method that we use to structure data well. It allows us to take a big string of freeform data and break it down into something that makes sense – with all data classified into its own fields. Once this is achieved, any algorithm – be it an AI, statistical, or rules-based tool – can act on the data and make good sense of it.
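
As a minimal illustration of the idea (assuming a freeform contact record in a fairly regular shape, which is a simplification real parsers cannot rely on), the sketch below breaks one string into named fields that any downstream algorithm can then work with:

    import re

    # Hypothetical freeform input: a name, phone number and address run together
    # in one unstructured string.
    record = "Jane Citizen ph 0412 345 678 Parramatta NSW 2150"

    # A simple pattern that classifies each part of the string into its own field.
    # Real-world parsing needs far more rules than this, but the principle is the same.
    pattern = re.compile(
        r"(?P<name>[A-Za-z ]+?)\s+ph\s+(?P<phone>[\d ]+?)\s+"
        r"(?P<suburb>[A-Za-z ]+?)\s+(?P<state>[A-Z]{2,3})\s+(?P<postcode>\d{4})"
    )

    match = pattern.match(record)
    if match:
        print(match.groupdict())
        # {'name': 'Jane Citizen', 'phone': '0412 345 678',
        #  'suburb': 'Parramatta', 'state': 'NSW', 'postcode': '2150'}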

There’s no doubt this is a complex and difficult process in its own right – but it is an absolutely necessary prerequisite for getting good data for training AI.

 

Terry Goodman

Managing Director & Principal Consultant, Intech Solutions

Intech provides data solutions that lay a secure foundation for robust, cost-effective and timely business transformation. Intech’s products have been successfully deployed to thousands of users, across hundreds of sites. See intechsolutions.com.au
