Digital Economy Dispatch #161 -- Time to Face up to AI’s Data Challenge
10th December 2023

It has always been about data. From my earliest days programming a computer over 40 years ago, I realized that as much as I would like to devise sophisticated algorithms and produce elegant code, all that effort would be wasted without a strong approach to data management. Whether it was more mundane tasks such as creating accounting and stock control systems, or designing advanced real-time avionics components, the situation was the same: Poor data produces poor results. So, while I saw myself as a programmer, I also began to realize that I had to be a data architect.

In those days the challenge consisted of finding ways to produce structured data to provide consistency and uniformity for transaction processing systems. Relational databases became widely popular in the 1980s due to their ability to organize and manage large amounts of data in a structured and efficient manner. This was a significant advancement from the earlier hierarchical and network database models, which were less scalable and had limitations in data integrity and consistency.

However, as organizations began to handle increasingly complex and diverse data types, relational databases faced limitations. Their rigid schema structure and focus on structured data proved challenging in dealing with unstructured data, such as text, images, and videos.

The rise of big data, cloud computing, and the Internet of Things further strained the capabilities of relational databases. The sheer volume and velocity of data generated by these technologies exceeded the processing capacity of relational databases, making them inefficient for handling large-scale data analysis and real-time applications.

To address these limitations, new database architectures emerged, including NoSQL databases, NewSQL databases, and hybrid database systems. These innovative approaches addressed the challenges of unstructured data, scalability, and real-time processing, providing more flexible and efficient data management solutions.
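
As a rough illustration of the flexibility these newer approaches aim for, here is a minimal sketch in Python. It is not tied to any particular NoSQL product and the record contents are made up; it simply shows how a document-style store can hold records with different shapes side by side, something a fixed relational schema handles poorly.

    import json

    # A document-style store accepts records with different shapes,
    # unlike a relational table with a fixed set of columns.
    documents = [
        {"id": 1, "type": "invoice", "amount": 120.50, "currency": "GBP"},
        {"id": 2, "type": "support_ticket", "text": "Screen flickers on startup", "priority": "high"},
        {"id": 3, "type": "image_asset", "filename": "site_plan.png", "width": 4096, "height": 2160},
    ]

    # Each record is stored as self-describing JSON, so no schema migration
    # is needed when a new record type or field appears.
    store = {doc["id"]: json.dumps(doc) for doc in documents}

    # Simple query: find every document that carries a "text" field.
    text_docs = [json.loads(raw) for raw in store.values() if "text" in json.loads(raw)]
    print(text_docs)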

Now, in the AI era, new requirements are being placed on data and data management schemes. AI systems require large volumes of data for training, testing, and tuning. They also need a sophisticated approach to data handling in what Andrew Ng calls “data-centric AI”. He sees the secret to success with AI as “the discipline of systematically engineering the data needed to build a successful AI system.”

Hence, despite the many advances in software and systems delivery, many of the fundamental problems are unchanged. Those working in the ever-evolving landscape of AI are also finding that data is a key determinant of AI’s transformative power. The relationship between data and algorithms is not a straightforward exchange; it is the combination of data quality, quantity, and relevance that determines the effectiveness and reliability of AI systems.

The Search for Data Quality

AI models, much like human learners, rely on data to extract meaningful patterns, make informed predictions, and devise optimal solutions. It is now clear that when data quality is compromised, through errors, inconsistencies, or gaps, models struggle to grasp the underlying nuances of a problem, and the resulting inaccurate or biased decisions undermine business outcomes and reduce trust in AI applications.
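
As a minimal sketch of this effect, assuming the scikit-learn library is available, the example below trains the same classifier twice on a synthetic dataset, once with clean labels and once with a fraction of the labels deliberately corrupted, and compares test accuracy. The dataset, model, and noise level are purely illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic dataset standing in for real business data.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    def train_and_score(labels):
        model = LogisticRegression(max_iter=1000).fit(X_train, labels)
        return accuracy_score(y_test, model.predict(X_test))

    # Corrupt 20% of the training labels to mimic data-entry errors.
    rng = np.random.default_rng(0)
    noisy = y_train.copy()
    flip = rng.random(len(noisy)) < 0.2
    noisy[flip] = 1 - noisy[flip]

    print("accuracy with clean labels:", train_and_score(y_train))
    print("accuracy with noisy labels:", train_and_score(noisy))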

Data quality is paramount for effective data-driven decision-making in all domains. AI models build knowledge from three kinds of data: training data to establish core behaviour, test data to verify that behaviour, and feedback data to refine it. The quality of this data determines a model's ability to make sound judgments and predictions; imperfect or incomplete data introduces biases and inaccuracies, leading to erroneous conclusions.
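
One way to picture these three kinds of data is the hypothetical Python workflow below: a dataset is split into training, test, and a held-out pool that stands in for feedback collected after deployment, and the feedback is then folded back into retraining. This is a sketch of the idea, not a prescribed method, and the dataset is synthetic.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real dataset.
    X, y = make_classification(n_samples=1500, n_features=20, random_state=1)

    # Training data establishes core behaviour; test data verifies it;
    # a held-out pool simulates feedback gathered after deployment.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
    X_test, X_feedback, y_test, y_feedback = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

    model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
    print("accuracy before feedback:", model.score(X_test, y_test))

    # Feedback data refines behaviour: corrected examples are added to the
    # training set and the model is retrained.
    X_refined = np.vstack([X_train, X_feedback])
    y_refined = np.concatenate([y_train, y_feedback])
    model = RandomForestClassifier(random_state=1).fit(X_refined, y_refined)
    print("accuracy after feedback:", model.score(X_test, y_test))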

Unfortunately, in too many situations the quality of available data is far too low. To ensure the success of AI initiatives, organizations must make the availability and quality of their data a higher priority. This requires establishing clear data governance policies, investing in data cleansing and transformation tools, and fostering a culture of data stewardship within the organization. Each of these aspects is important if AI is to bring the benefits expected.

Take a simple example: image recognition, a common task for AI systems. If a model is trained on a dataset of images with colour distortions or low resolution, its ability to classify objects accurately will be compromised. Similarly, models used for text analysis may struggle with poorly formatted or grammatically incorrect input. The quality of data counts.
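
As a small, hypothetical sketch of how such quality problems can be caught before training, the snippet below flags images that fall below a minimum resolution. It assumes the Pillow library is installed; the directory name and the threshold are made up for illustration.

    from pathlib import Path
    from PIL import Image

    MIN_WIDTH, MIN_HEIGHT = 224, 224  # illustrative threshold only

    def flag_low_resolution(image_dir):
        """Return image files below the minimum resolution so they can be
        excluded or re-captured before training."""
        flagged = []
        for path in Path(image_dir).glob("*.jpg"):
            with Image.open(path) as img:
                width, height = img.size
                if width < MIN_WIDTH or height < MIN_HEIGHT:
                    flagged.append(path)
        return flagged

    # Example usage (directory name is hypothetical):
    # print(flag_low_resolution("training_images"))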

The Public Sector Data Challenge

High-quality data is especially important in the public sector, where imprecise, incomplete, or inconsistent data can lead to erroneous conclusions, flawed policies, and wasted resources. For instance, low-quality images in disease detection can produce false positives, and inaccurate data on patient demographics or healthcare outcomes can hinder the development of targeted interventions and treatment protocols.

Government agencies, with large stores of data on citizens, businesses, and infrastructure, hold immense potential for leveraging AI. However, the public sector faces unique challenges in ensuring data quality, quantity, and relevance. Inaccurate or incomplete data, such as inconsistently recorded information or missing records, can distort AI models employed for tasks like welfare eligibility assessment or fraud detection. Limited data on specific populations or regions can hinder the effectiveness of AI models designed to address their unique needs.

The public sector often operates with data fragmented across numerous systems, departments, and agencies. This siloed approach impedes comprehensive understanding of issues and hinders the ability to identify cross-cutting trends and patterns. For example, siloed health data can hinder the tracking of disease outbreaks and the identification of emerging health risks.

In much of the public sector, the source and nature of the data being managed add further to these challenges. Public bodies hold a wealth of sensitive personal information, including health records, financial data, and demographic information. Protecting this data from unauthorized access, breaches, and misuse is of paramount importance, and ensuring the appropriate controls are in place is essential to provide that protection. However, those controls inevitably add cost and complexity to the way data is used.

Health data stands out as a particularly sensitive and valuable asset in the public sector. It is essential for providing quality healthcare, monitoring disease progression, and conducting research to identify new treatments and preventive measures. However, health data management faces unique challenges, such as data fragmentation, duplication, and inconsistency.

Despite the challenges, there are numerous efforts to improve data management in the public sector. Investing in data infrastructure, standardizing data formats, enforcing data privacy laws, and educating data custodians are key strategies being pursued to enhance data quality, integration, and security. The success of these efforts will have a major impact on the speed of AI adoption in the public sector, and will significantly influence public trust in how AI is used.

Addressing Data Challenges for AI Success

Hence, managing appropriate data sources and establishing good data practices are critical to AI. To achieve success with AI in complex areas such as the public sector, it is essential to address the challenges of data quality, quantity, and relevance. In practice, this requires a comprehensive approach that encompasses at least four areas:

  1. Data Cleaning and Harmonization: Identifying and correcting errors, inconsistencies, and missing data to ensure accuracy and consistency across different sources (a simple sketch of this step appears after the list).

  2. Data Augmentation: Expanding the quantity of data by creating synthetic data or combining multiple data sources to address limited data availability.

  3. Data Labelling: Manually classifying or annotating data to provide context and meaning for AI models, especially in tasks requiring domain expertise.

  4. Data Governance and Ethics: Establishing clear protocols for data collection, storage, and usage to ensure data privacy, security, and ethical considerations.
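
As a simple illustration of the first of these areas, the sketch below uses pandas (assumed to be available) on a small, made-up citizen-records table to harmonize inconsistent text formats, remove duplicate rows, and surface missing values. It is indicative only, not a complete cleansing pipeline.

    import pandas as pd

    # Made-up records merged from two sources with inconsistent conventions.
    records = pd.DataFrame({
        "citizen_id": [101, 101, 102, 103],
        "region": ["North East", "north east ", "SOUTH WEST", None],
        "age": [43, 43, None, 33],
    })

    # Harmonize text so the same region is always recorded the same way.
    records["region"] = records["region"].str.strip().str.title()

    # Remove duplicate rows produced by merging overlapping sources.
    records = records.drop_duplicates(subset="citizen_id", keep="first")

    # Report remaining gaps rather than silently imputing them.
    print(records.isna().sum())
    print(records)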

All of this takes considerable time, investment, and effort. But it is only by addressing these data challenges that organizations can unlock the full potential of AI to enhance citizen services, improve decision-making, and optimize resource allocation, ultimately serving the public interest more effectively.

Winning the Data Game

Data is the foundation of AI’s transformative power. AI models rely on data to extract meaningful patterns, make informed predictions, and devise optimal solutions, and they falter when that data contains errors, inconsistencies, or gaps. To ensure the success of AI initiatives, organizations must make the availability and quality of data a top priority: establishing clear data governance policies, investing in data cleansing and transformation tools, and fostering a culture of data stewardship. In complex areas such as the public sector, achieving this demands a comprehensive approach spanning data cleaning and harmonization, data augmentation, data labelling, and data governance and ethics.