In our first post, we explored the six attributes that define high-quality data in drug development: completeness, granularity, traceability, timeliness, consistency and contextual richness. But even when data comes from authoritative public sources, it rarely meets those standards on its own. So, how do we get there?
The target state, known as AI-ready data, refers to data that is structured, standardized, well-labeled, provenance-tracked, and permission-cleared, allowing it to be used confidently for analytics and machine learning.
Read the first blog in our series, What Makes Clinical Development Data “Good”?, to catch up.
Why Biomedical Data Needs Expert Curation for AI
This second post addresses that challenge by examining a fundamental issue: while vast amounts of biomedical data are publicly available through resources such as ClinicalTrials.gov, FDA.gov, ChEMBL, and DrugBank, this data is rarely ready for direct use. To ensure reliability and structure, it must first undergo careful biomedical data curation.
Public Data: Essential, But Not Enough
Public biomedical databases contain critical information about drugs, clinical trials, mechanisms of action, indications, and regulatory events. They form the backbone of much of the research, risk modeling, and portfolio analysis across the pharmaceutical industry.
However, these databases were not designed for analytics or for powering machine learning (ML) algorithms. Their primary purpose is regulatory visibility and transparency, not the structured interoperability that ML/AI models require.
Transforming this public data into a reliable, structured, and useful form for advanced analytics and AI, and for reuse “beyond its original purpose,” requires significant curation. So while these repositories offer volume and breadth, they often fall short of the structure and standardization that AI models need to work effectively.
Why Biomedical Data Curation Is Critical
The journey from raw public data to AI-ready insights is fraught with challenges. These publicly available datasets, while invaluable, often suffer from several key issues that hinder their direct use in advanced analytics and ML model development:
- Heterogeneity: Data is collected and reported by different stakeholders (researchers, pharmaceutical companies, regulatory bodies) using varying formats, terminologies, and standards. This makes direct comparison and integration difficult. For example, a trial’s phase or indication might be labeled differently across sources, or a single indication might be listed under different names (e.g., “breast cancer” vs. “mammary carcinoma”). Similarly, a drug’s mechanism of action might be vaguely or inconsistently described, e.g., as “tyrosine kinase inhibitor,” “TKI,” or simply “inhibitor.”
AI models require standardized, unambiguous inputs to produce reliable outputs. Without harmonization of this biomedical data, the risk of misclassification or flawed inferences increases.
- Missing or Outdated Information: Public databases are dynamic: information is added, updated, and occasionally removed. Datasets can have gaps, lack full endpoint definitions or detailed population information, or contain outdated entries that do not reflect the current state of research or regulatory status.
- Ambiguity and Noise: A single drug may appear under multiple names. Trials may be updated without clear version control or documentation. Without careful curation, even well-established data sources can yield unreliable or biased predictions.
Curation reduces ambiguity by assigning unique identifiers to each entity (e.g., compound, trial, company), tracking relationships between data points (e.g., linking trial arms to cohorts), and ensuring that the metadata for every record is complete (a brief sketch follows below).
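To make this concrete, here is a minimal, illustrative Python sketch of how a curation step might map synonymous labels to a single canonical entity with a stable identifier. The synonym table, identifiers, and field names are hypothetical examples, not references to any specific vocabulary or product.

```python
# Illustrative synonym table mapping reported labels to a canonical
# indication record with a stable (hypothetical) identifier.
CANONICAL_INDICATIONS = {
    "breast cancer": {"id": "IND-0001", "label": "breast cancer"},
    "mammary carcinoma": {"id": "IND-0001", "label": "breast cancer"},
    "nsclc": {"id": "IND-0002", "label": "non-small cell lung cancer"},
}

def normalize_indication(raw_label: str) -> dict:
    """Map a raw, source-specific label to a canonical entity, or flag it for review."""
    key = raw_label.strip().lower()
    entity = CANONICAL_INDICATIONS.get(key)
    if entity is None:
        # Unknown labels are not guessed; they are queued for expert curation.
        return {"id": None, "label": raw_label, "needs_review": True}
    return {**entity, "needs_review": False}

# Two differently labeled trial records resolve to the same entity ID,
# so downstream analytics can link them unambiguously.
print(normalize_indication("Mammary Carcinoma"))            # -> IND-0001
print(normalize_indication("Breast Cancer"))                # -> IND-0001
print(normalize_indication("cardiomyopathy, unspecified"))  # -> needs_review
```

In practice, such mappings are built and maintained by expert curators against controlled vocabularies; the point here is simply that every record ends up keyed to one unambiguous identifier.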
These challenges underscore why raw public biomedical data, despite its abundance, cannot be directly fed into sophisticated AI models and expected to yield reliable, actionable insights.
What Expert Biomedical Data Curation Means
Expert biomedical data curation is not merely about cleaning up errors; it’s a comprehensive process that transforms raw, disparate data into a harmonized, enriched, and reliable resource for AI and decision-making. This involves:
- Standardization: Applying consistent terminologies, formats and units across different datasets to ensure interoperability. This includes mapping synonyms and variations to a single standard.
- Integration: Combining data from multiple sources (e.g., ClinicalTrials.gov, FDA.gov, scientific literature, internal datasets) into a unified database, carefully resolving conflicts and ensuring accurate linkages.
- Augmentation: Enriching the integrated data with additional relevant information, such as gene ontologies, pathway data, biomarker details, and trial outcomes, to provide deeper context and enhance the predictive power of AI models.
- Validation: Implementing rigorous quality checks to identify and correct errors, inconsistencies, and missing information. This often involves a human-in-the-loop approach, where expert reviewers validate automated processes (a sketch of this step follows below).
- Structuring: Organizing the data in a way that is optimized for analytical queries and ML model training, ensuring that relationships between different data points are clearly defined.
This meticulous process blends automated tools with expert human oversight, and it is what elevates raw data to “decision-grade” quality. Without it, even decision-support systems powered by state-of-the-art AI can falter.
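As a hedged illustration of the validation step, the sketch below shows how automated quality checks might flag incomplete or inconsistent records for expert review rather than silently passing them through. The record structure, field names, and rules are hypothetical, and the NCT numbers are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class TrialRecord:
    # Hypothetical, simplified record structure for illustration only.
    nct_id: str
    phase: str | None
    indication_id: str | None
    primary_endpoint: str | None
    issues: list[str] = field(default_factory=list)

def run_quality_checks(record: TrialRecord) -> TrialRecord:
    """Apply simple automated checks; anything flagged goes to a human review queue."""
    if not record.phase:
        record.issues.append("missing trial phase")
    if not record.indication_id:
        record.issues.append("indication not mapped to a canonical ID")
    if not record.primary_endpoint:
        record.issues.append("primary endpoint definition missing")
    return record

records = [
    TrialRecord("NCT00000001", "Phase 2", "IND-0001", "progression-free survival"),
    TrialRecord("NCT00000002", None, None, None),
]

review_queue = [r for r in map(run_quality_checks, records) if r.issues]
for r in review_queue:
    # Expert curators resolve these flags before the record is marked decision-grade.
    print(r.nct_id, "->", "; ".join(r.issues))
```

The design point is that automation does the repetitive screening at scale, while anything ambiguous or incomplete is routed to a human reviewer instead of being propagated into downstream models.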
Why Automation Alone Isn’t Enough
Automated tools and AI-based data extraction methods, particularly those utilizing Natural Language Processing (NLP), have become indispensable for processing the vast volume of biomedical information. They are highly effective for tasks like identifying key entities, extracting structured data from text, and flagging potential inconsistencies.
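For instance, a simple automated pass can reliably pull well-structured identifiers and known terms out of free text, which is exactly where this tooling shines. The sketch below uses plain regular expressions and a small, hypothetical term list purely for illustration; production systems typically rely on trained biomedical NLP models.

```python
import re

TEXT = (
    "The sponsor registered NCT01234567, a Phase 3 study of a "
    "tyrosine kinase inhibitor (TKI) in metastatic breast cancer."
)

# ClinicalTrials.gov registry numbers follow a fixed 'NCT' + 8-digit pattern,
# so rule-based extraction handles them well.
nct_ids = re.findall(r"NCT\d{8}", TEXT)

# A small, hypothetical term list catches known mechanism-of-action phrases...
MOA_TERMS = ["tyrosine kinase inhibitor", "tki", "proteasome inhibitor"]
moa_hits = [t for t in MOA_TERMS if t in TEXT.lower()]

print(nct_ids)   # ['NCT01234567']
print(moa_hits)  # ['tyrosine kinase inhibitor', 'tki']
# ...but vaguer phrasings (e.g., just "inhibitor") or unseen synonyms slip through,
# which is where expert curation becomes necessary.
```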
However, when it comes to transforming raw public-domain data into decision-grade input, NLP alone struggles with the inherent complexity of biomedical texts.
Several studies and expert reviews have documented the challenges of relying solely on automated methods in biomedical domains, highlighting why human expertise remains essential:
- Complex Terminology and Synonyms: Biomedical language is rich with jargon, abbreviations, and synonyms. Disease names, drug mechanisms, and endpoints can be expressed inconsistently, introducing ambiguity. NLP systems often misinterpret or fail to accurately capture this variability. As a review in the Journal of Biomedical Informatics notes, disease name normalization remains challenging due to “a wide variety of naming patterns for diseases, ambiguity, and term variation.”
- Contextual Nuance: Understanding the true meaning of biomedical data often requires deep contextual understanding. A trial outcome, for example, might be reported differently depending on the specific patient population, study design, or statistical methods used. Automated systems can struggle to interpret these nuances without human guidance.
- Evolving Knowledge: The field of biomedical research is constantly evolving, with new discoveries, terminologies, and guidelines emerging regularly. Expert curators stay abreast of these changes, ensuring that the curated data reflects the latest scientific understanding – a challenge for purely automated systems.
Therefore, while automation is a powerful accelerator and crucial for scale and efficiency, expert human curation provides the critical layer of interpretation, validation, and contextualization necessary to transform raw public data into a reliable foundation for AI-driven drug development.
Curation Is a Strategic Investment
In the race to bring life-changing drugs to market, the quality of data can be the deciding factor. Investing in expert biomedical data curation is not just a technical necessity; it’s a strategic imperative. It ensures that AI models are trained on accurate, comprehensive, and contextually rich data, leading to more reliable predictions, more efficient research, and ultimately, a higher probability of success in clinical development. By transforming the chaos of raw public data into curated, decision-grade insights, pharmaceutical companies can unlock the full potential of AI and gain a critical edge in a competitive landscape.
Coming Up Next
In our next blog, we’ll look at the flip side: what happens when data quality is poor. We’ll explore how unreliable data can distort portfolio decisions, risk models, and resource allocation, and why even small gaps in quality can translate into costly missteps.
Missed Blog 1? Read about the six key attributes of high-quality clinical development data here.