Authors

Tina Chace

Akhilesh Kale

Robyn Kiernan

Understanding Data Lineage: Exploring Its Definition and Growing Adoption in Organizations

In today’s data-driven world, understanding the journey of data, from its origin to its final destination, is more crucial than ever. This capability, referred to as data lineage, provides a comprehensive view of how data flows through an organization, detailing its transformations and dependencies along the way. Data lineage varies in granularity: “coarse lineage” captures table-to-table transformations, while “fine lineage” works at the attribute level. These assets can be mapped in tools such as Solidatus, providing an automated method for creating a clear overview of data sources, transformations, and usage. At Artefact, our teams design and build data and AI products for our clients day in and day out, and lineage helps our clients answer questions like: “Which systems are giving us this client balance, since it looks inaccurate?” or “Why does my client lending propensity model produce different results specifically on Wednesdays?” We follow a structured six-step methodology for deploying lineage at scale. It starts with an assessment of the business use case and its value, and it culminates in an implementation in which ongoing maintenance and user adoption are built in from the start.

Companies leverage data lineage for several key reasons, with regulatory compliance and data quality management at the forefront. In the financial services industry, robust data lineage is essential for meeting stringent auditing requirements and principles such as BCBS 239, which covers governance, data architecture, risk data aggregation, and the accuracy, integrity, and frequency of risk reporting. In commercial banking, for example, assessing a mortgage financial reporting workflow allows a company to trace the origins and consumption of sensitive data.

Looking beyond compliance, data lineage is a powerful tool for enhancing data quality, enabling organizations to track data issues, validate accuracy, and maintain trust in their information systems. This article will delve into the intricacies of data lineage, specifically coarse lineage, and explore why it has become a cornerstone of modern data management strategies.
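To make the idea of coarse lineage concrete, the sketch below models table-to-table lineage as a simple dictionary and walks it upstream to answer a question like “Which systems are giving us this client balance?” It is a minimal illustration only: the table names and the `upstream_tables` and `source_systems` helpers are hypothetical, and this is not how Solidatus or any other lineage tool represents lineage internally.

```python
from collections import deque

# Hypothetical coarse (table-to-table) lineage: each key is a table and
# each value lists the tables it is built from directly.
LINEAGE = {
    "reporting.client_balance": ["core.accounts", "core.transactions"],
    "core.accounts": ["crm.clients"],
    "core.transactions": ["payments.ledger"],
    "crm.clients": [],
    "payments.ledger": [],
}


def upstream_tables(table, lineage):
    """Return every table feeding `table`, directly or indirectly (breadth-first)."""
    seen = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def source_systems(table, lineage):
    """Return upstream tables with no parents of their own, i.e. the original sources."""
    return [t for t in upstream_tables(table, lineage) if not lineage.get(t)]


if __name__ == "__main__":
    print(upstream_tables("reporting.client_balance", LINEAGE))
    print(source_systems("reporting.client_balance", LINEAGE))
```

With this toy graph, an inaccurate client balance can be traced back to two candidate sources, `crm.clients` and `payments.ledger`, without inspecting every system in the estate.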

AI’s Rapid Rise in Financial Services: Opportunities, Challenges, and the Path Forward

Alongside the growing importance of understanding data, Artificial Intelligence (AI) is transforming the modern financial services landscape, simulating human intelligence to perform tasks that require learning and decision-making. AI’s applications are diverse and impactful: conversational AI, like chatbots, enhances customer interactions; productivity assistants streamline workflows and automate tasks; and automated data analysis accelerates insights from complex datasets. In August 2024, the European Union’s AI Act entered into force, introducing new regulations aimed at ensuring ethical AI use and protecting user rights and highlighting the global shift towards responsible AI implementation. This development underscores the growing need for organizations not only to harness AI’s power but also to manage it with careful oversight, complementing their efforts in data lineage and quality management.

While using publicly available Generative AI tools like ChatGPT for personal purposes is straightforward, integrating AI into an organization and generating real value for the business is a different ball game. Most financial institutions are racing to churn out Gen AI pilots and POCs. However, real dollars are committed only once it has been proven that the potential benefits are dependable and that the product is suitable for both business and technical users. Many institutions are still struggling to scale these technologies due to concerns about reliability (74%), user adoption (60%)(1), and insufficient technical expertise (60%). Artefact has built a Gen AI scalability framework to address the core scalability dimensions: Output Relevancy, Explainability, Fairness/Bias, Latency, Infrastructure, Organizational Efficiency, and User Experience/Adoption.

In the context of AI, data lineage offers significant business value by ensuring transparency and reliability in data-driven decisions. Today, over 75% of consumers are concerned about misinformation from AI(2). AI is often referred to as a “black box”, meaning end users frequently do not understand the inner workings that produce the output they are regularly using. As AI systems increasingly rely on vast and complex datasets, understanding the origins and transformations of this data is crucial for maintaining accuracy and trustworthiness. Data lineage helps organizations track and validate the data feeding into AI models, which is essential for optimizing model performance and addressing issues like bias or errors. By providing a clear audit trail, data lineage also supports compliance with regulations and enhances data governance, ultimately leading to more informed, reliable, and ethical AI applications that drive better business outcomes.

Data Lineage in Action: How It Could Have Supercharged Real-World AI Development

Data lineage is crucial for meeting regulatory and legal requirements in AI, especially under policies like the California Consumer Privacy Act (CCPA) and the Gramm-Leach-Bliley Act (GLBA). For example, consider a use case involving customer turnover within a financial services firm. In this case, the system lacked standardized practices for anonymizing private information and had no data lineage to track data flows. As a result, the data enrichment that masked sensitive details was performed only as a last step, with minimal governance. This approach not only compromised data privacy but also exposed the system to compliance risks. Had our partner organization had robust data lineage in Solidatus, it could have tracked where data was being used, captured data transformations, ensured proper anonymization at each stage, and met regulatory requirements more effectively, thereby safeguarding privacy and enhancing data governance.
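The check described above, confirming that sensitive fields are anonymized at each stage rather than only at the end, can be sketched in a few lines. This is an illustrative example under stated assumptions: the stage names, column names, and the `exposed_stages` helper are hypothetical and do not describe the client’s actual pipeline or any Solidatus feature.

```python
# Hypothetical set of columns treated as sensitive for this example.
SENSITIVE = {"ssn", "date_of_birth"}

# Hypothetical pipeline: ordered stages, each with the columns it emits unmasked.
PIPELINE = [
    ("ingest", {"client_id", "ssn", "date_of_birth", "balance"}),
    ("feature_engineering", {"client_id", "ssn", "balance", "tenure"}),
    ("model_scoring", {"client_id", "churn_score"}),
]


def exposed_stages(pipeline, sensitive, allowed=("ingest",)):
    """List (stage, columns) pairs where sensitive columns leave a stage unmasked.

    Stages named in `allowed` may handle raw values, e.g. the initial load.
    """
    findings = []
    for stage, clear_columns in pipeline:
        if stage in allowed:
            continue
        leaked = sorted(clear_columns & sensitive)
        if leaked:
            findings.append((stage, leaked))
    return findings


if __name__ == "__main__":
    for stage, columns in exposed_stages(PIPELINE, SENSITIVE):
        print(f"{stage}: unmasked sensitive columns {columns}")
```

In this toy pipeline the check flags `feature_engineering`, where `ssn` is still in clear text, which is the kind of gap that masking only at the last step leaves open.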

The majority of organizations (80%) have claimed their data is ready to use in AI; however, more than half (52%)(3) experienced implementation issues caused by the quality of their data. Data lineage is vital for ensuring data quality in AI development, as it provides a clear view of how data is sourced, transformed, and utilized. At Artefact, we understand the imperative of data readiness and quality. We believe in an AI operating model that develops the technical requirements in parallel with the data preparation and governance required to deploy reliable AI at large scale. Our teams worked on a credit risk prediction model that relied on multiple data tables to assess borrower risk. In their preliminary investigations, the team discovered inconsistencies between these tables, such as discrepancies in data formats and outdated information. Left uncorrected, these inconsistencies would have skewed the model and produced inaccurate risk assessments. By implementing data lineage, an organization can trace the origins of data, identify where inconsistencies arise, and ensure that data transformations align with quality standards. This transparency helps correct issues before they affect the model, ultimately leading to more reliable and accurate predictions and maintaining the overall integrity of the AI system.
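The two kinds of inconsistency mentioned above, mismatched formats and outdated information, lend themselves to simple automated checks once lineage has identified which source tables feed a model. The sketch below is a hypothetical illustration: the `check_table` helper, the field names, and the one-year staleness threshold are assumptions made for the example, not details of the credit risk project.

```python
from datetime import date, datetime


def check_table(rows, date_field, as_of, max_age_days, fmt="%Y-%m-%d"):
    """Return (row_index, issue) pairs for badly formatted or stale dates."""
    issues = []
    for i, row in enumerate(rows):
        try:
            parsed = datetime.strptime(row[date_field], fmt).date()
        except ValueError:
            # The value does not follow the expected date format.
            issues.append((i, "bad_format"))
            continue
        if (as_of - parsed).days > max_age_days:
            # The record has not been refreshed within the allowed window.
            issues.append((i, "stale"))
    return issues


# Hypothetical source table feeding a credit risk model.
BORROWERS = [
    {"borrower_id": 1, "updated": "2024-03-01"},
    {"borrower_id": 2, "updated": "01/03/2024"},  # different date format
    {"borrower_id": 3, "updated": "2022-01-15"},  # outdated record
]

if __name__ == "__main__":
    print(check_table(BORROWERS, "updated", as_of=date(2024, 4, 1), max_age_days=365))
```

Running checks like this per source table, rather than on the final model input, makes it easier to tell which upstream system introduced a problem.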

Data lineage enhances AI model debugging, improvement, and reproducibility by offering a detailed view of data’s journey through the modeling process. If a model is producing unexpected results, data lineage helps trace the data through all stages, from collection to preprocessing and feature engineering. This visibility allows data scientists to identify where issues or inconsistencies may have arisen, facilitating precise debugging and targeted improvements. Additionally, “Domain Adaptation” is a common method for reusing components of algorithms to reduce the time and resources required to build an AI model from scratch. Lineage is essential for reproducibility across different iterations and experiments because it documents which data the algorithms used and how they used it.
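One lightweight way to picture stage-by-stage traceability is to record a fingerprint of the data after every step, so that two runs can be compared and the first diverging stage located. The sketch below is one possible approach under stated assumptions, not a description of any particular lineage product: the `fingerprint`, `run_with_lineage`, and `first_divergence` helpers and the example stages are all hypothetical.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 hash of the records, serialized deterministically as JSON."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_lineage(raw, stages):
    """Apply each (name, function) stage in order and record a fingerprint per stage."""
    trail = [("raw", fingerprint(raw))]
    data = raw
    for name, fn in stages:
        data = fn(data)
        trail.append((name, fingerprint(data)))
    return data, trail


def first_divergence(trail_a, trail_b):
    """Return the name of the first stage whose fingerprints differ, or None."""
    for (name_a, hash_a), (_, hash_b) in zip(trail_a, trail_b):
        if hash_a != hash_b:
            return name_a
    return None


# Hypothetical preprocessing steps used only for illustration.
def drop_missing(rows):
    return [r for r in rows if r["amount"] is not None]


def to_thousands(rows):
    return [{**r, "amount": r["amount"] / 1000} for r in rows]


def to_millions(rows):
    return [{**r, "amount": r["amount"] / 1_000_000} for r in rows]


if __name__ == "__main__":
    raw = [{"id": 1, "amount": 5000}, {"id": 2, "amount": None}]
    _, run_a = run_with_lineage(raw, [("drop_missing", drop_missing), ("scale", to_thousands)])
    _, run_b = run_with_lineage(raw, [("drop_missing", drop_missing), ("scale", to_millions)])
    print(first_divergence(run_a, run_b))
```

Here both runs share the same raw data and cleaning step, so the comparison points directly at the `scale` stage as the place where the two experiments start to differ.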

Powering AI Success: How Artefact and Solidatus Are Revolutionizing Data Management for Financial Institutions

Artefact and Solidatus are collaborative partners, united in their belief that effective data management is essential for the success of AI. Akhilesh Kale, a leader at Artefact in Financial Services, claims that “data lineage is the backbone of trust in financial institutions, providing that integrity that is critical in such a highly complex landscape defined by its regulatory pressures.” This emphasis on data integrity is central to how Artefact’s expertise in data and AI execution complements Solidatus’ structured environment, which helps manage and store coarse lineage. Together, they enable financial institutions to enhance transparency into source data for AI models. As Phil Yeoman, formerly of Cardano, highlights, “with Solidatus, our data estate is now mapped, modeled and cataloged. In a single view, I can show the business where their data resides, how it flows through systems and applications, what data quality rules apply and what data is subject to GDPR.” This seamless integration of tools and services underscores the transformative power of data lineage for AI. It simplifies navigating compliance, ensures top-tier data quality, and enhances model accuracy by tracing data from origin to destination. This traceability is invaluable for meeting regulatory standards, such as the CCPA and GLBA, while also helping to identify data inconsistencies that could compromise AI models. Furthermore, data lineage streamlines debugging, boosts model performance, and ensures consistent, reliable results.

References

(1) Timothy Prestianni, “131 AI Statistics and Trends for (2024),” National University, May 30, 2024, https://www.nu.edu/blog/ai-statistics-trends/.

(2) Matthew Fox, “How Artificial Intelligence Is Shaping Consumer Sentiment,” Forbes, September 22, 2023, https://www.forbes.com/advisor/business/artificial-intelligence-consumer-sentiment/.

(3) Campus Technology, “Report: AI Adoption Hindered by Data Quality,” April 10, 2024, https://campustechnology.com/Articles/2024/04/10/Report-AI-Adoption-Hindered-by-Data-Quality.aspx.