Part 1 | What Makes Clinical Development Data “Good”?

If you listen to the buzz around AI in pharma – or really any discipline – it’s all about the algorithms: their speed, accuracy and scalability. But behind every algorithm is a less glamorous asset: data. In our case, this includes clinical development data. For the algorithms to do their job well, not just any data will do; you need high-quality, well-structured, expertly curated, deeply contextual data.

This is the first blog in a 4-part series on data challenges, From Raw to Refined: The Power of Good Data in Drug Development. The other installments are part 2 (Chaos to Clarity: The Case for Biomedical Data Curation), part 3 (Garbage In, Failure Out: The Cost of Poor-Quality Data) and part 4 (From Principles to Practice: How Intelligencia AI Turns AI-Ready Clinical Data into Better Decisions).

This first blog in the series examines what constitutes “good” data in the context of decision support for drug development.

Let’s start by putting some more context around what truly makes good data good. Here are six core attributes: 

  1. Completeness – good data captures the full picture, including all relevant variables from trial design parameters and endpoints to drug modality and regulatory status. In drug development, missing even a single element — like patient population details or biomarker inclusion — can derail downstream analysis or skew AI predictions. Completeness ensures that decision-making isn’t built on partial or biased inputs.
  2. Granularity – highly granular data provides a detailed, multi-dimensional view of the subject it describes. In the context of drug development, this means capturing information not only at the trial level but also at the level of individual cohorts, endpoints and patient subgroups. This level of detail allows for more nuanced comparisons, targeted benchmarking, and highly tailored AI models. Without sufficient granularity, programs may appear similar on the surface but could differ in ways that impact risk assessment and decision-making.
  3. Traceability – every data point should be traceable back to its source. Traceability is essential everywhere, but especially in high-stakes domains like drug development, where it is needed for both regulatory compliance and internal validation. Good data includes metadata about its origin, when it was captured, and how it was classified. This is particularly crucial when datasets are refreshed regularly or used in evolving predictive models.
  4. Timeliness – outdated data can be far worse than no data at all. Good data is updated continuously, especially in dynamic fields like drug development, where new trial results, updated endpoints or regulatory changes can dramatically shift a program’s risk profile. Only timely data allows proactive decision-making.
  5. Consistency – without consistency, datasets are unusable. Consistency means uniform terminology (e.g., drug names, indications), harmonized ontologies (e.g., MeSH) and standard data formats and taxonomies. Consistency ensures that different data sources can be combined and compared without introducing ambiguity or duplication. In AI systems, inconsistent inputs are a leading cause of poor model performance.
  6. Contextual richness – good data is linked to its clinical and regulatory background, for example, whether a trial included biomarker-selected patients and why specific endpoints were chosen. Context transforms data points into insights, making the difference between predicting technical success and understanding why a program is likely to succeed or fail. One example is understanding whether a past failure was due to flawed trial design, an inappropriate endpoint, or a lack of biomarker-driven patient selection.

Collectively, these six core attributes define data that is robust, reliable and truly fit for driving high-stakes decisions in drug development.
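
To make these attributes concrete, here is a minimal sketch of how a few of them (completeness, consistency, timeliness, traceability) could be turned into automated checks on a single trial record. The schema, field names and thresholds are illustrative assumptions only, not Intelligencia AI’s actual data model.

```python
# A minimal sketch; the schema, field names and thresholds are illustrative
# assumptions, not a real data model.
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional

# Toy controlled vocabulary standing in for a harmonized ontology (consistency).
CANONICAL_INDICATIONS = {"non-small cell lung carcinoma", "breast neoplasms"}

@dataclass
class TrialRecord:
    trial_id: str                     # unique identifier (traceability)
    source_url: str                   # link back to the original registry entry (traceability)
    last_updated: date                # refresh date (timeliness)
    phase: Optional[str] = None
    indication: Optional[str] = None  # expected to use a canonical term (consistency)
    primary_endpoint: Optional[str] = None
    biomarker_selected: Optional[bool] = None  # clinical context (contextual richness)

def quality_issues(record: TrialRecord, max_age_days: int = 90) -> list:
    """Return a list of data-quality issues found in a single record."""
    issues = []
    # Completeness: every field should be populated.
    for f in fields(record):
        if getattr(record, f.name) is None:
            issues.append(f"missing field: {f.name}")
    # Consistency: the indication must map to the controlled vocabulary.
    if record.indication and record.indication.lower() not in CANONICAL_INDICATIONS:
        issues.append(f"non-canonical indication: {record.indication!r}")
    # Timeliness: flag records that have not been refreshed recently.
    if (date.today() - record.last_updated).days > max_age_days:
        issues.append(f"stale record: not updated in the last {max_age_days} days")
    return issues

record = TrialRecord(
    trial_id="NCT00000000",  # placeholder identifier
    source_url="https://clinicaltrials.gov/study/NCT00000000",
    last_updated=date(2024, 1, 15),
    phase="Phase 2",
    indication="NSCLC",      # a synonym rather than the canonical term
)
print(quality_issues(record))
# Flags the missing endpoint and biomarker fields, the non-canonical indication,
# and, once the record ages past the threshold, its staleness.
```

In a real pipeline, checks of this kind would typically run at ingestion time, so gaps and inconsistencies are caught before they reach benchmarking or model training.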

Connecting Data Quality to the FAIR Framework

The six characteristics of high-quality data also align closely with the FAIR principles, a globally recognized standard for scientific data management and stewardship, which emphasizes making data findable, accessible, interoperable and reusable.

  • Findable – to be findable, data must carry rich metadata and a unique identifier. This maps directly to traceability (good data always includes links back to its original sources) and completeness (including all relevant fields makes data easier to index and retrieve).
  • Accessible – data must be retrievable via standardized methods. Accessibility connects with timeliness (data can be pulled or queried when it is needed) and consistency (standard data structures enable seamless, repeatable access).
  • Interoperable – data must use standardized vocabularies and ontologies so that it can be integrated with other data. This draws on consistency (harmonized terms across datasets) and granularity (a level of detail that makes diverse datasets compatible).
  • Reusable – data must be well described, structured and annotated so that it can be reused in different contexts. This aligns with contextual richness, which ensures that data isn’t stripped of the scientific and regulatory insights that make it interpretable, as well as granularity and completeness, both of which are needed for reuse in advanced analytics and AI pipelines.
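
To make the mapping more tangible, below is a hypothetical FAIR-style metadata record: a persistent identifier and a source link support findability and accessibility, a shared vocabulary term supports interoperability, and provenance and terms-of-use fields support reuse. All field names and values are illustrative and not drawn from any particular system.

```python
# A hypothetical FAIR-style metadata record; identifiers, field names and values
# are illustrative only.
trial_metadata = {
    "id": "NCT00000000",                     # placeholder persistent identifier (findable)
    "source": "https://clinicaltrials.gov",  # standardized retrieval point (accessible)
    "retrieved_at": "2024-01-15T00:00:00Z",  # provenance timestamp (reusable)
    "indication": {
        "label": "Carcinoma, Non-Small-Cell Lung",
        "vocabulary": "MeSH",                # shared ontology (interoperable)
        "code": "D002289",                   # MeSH descriptor ID, shown for illustration
    },
    "phase": "Phase 2",
    "primary_endpoint": "Overall survival",
    "terms_of_use": "see source",            # reuse conditions (reusable)
}
```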

Many pharma companies now explicitly require or promote FAIR-compliant data infrastructures. By aligning internal data strategies with these principles, organizations are better equipped to collaborate across silos, accelerate regulatory submissions, and feed their AI models with trustworthy input.

Ontologies Make Data Interoperable

One of the biggest obstacles to integrating public biomedical data is that different sources often describe the same concept in different ways. Without a common language, even the most comprehensive datasets remain siloed and difficult to compare. That’s where curated ontologies like MeSH (Medical Subject Headings) and EFO (Experimental Factor Ontology) play a critical role.

Ontologies are structured, hierarchical vocabularies that define concepts and their relationships within a specific domain. In biomedical research, they serve as the backbone for organizing complex information across diverse data sources.

Ontologies provide standardized terms and logical mapping. They ensure, for example, that “non-small cell lung cancer,” “NSCLC,” and “lung adenocarcinoma” are recognized as related or equivalent. This enables disparate datasets to be interoperable, facilitating consistent classification, cross-source analysis, and scalable AI applications.
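
The toy fragment below shows the basic mechanics of this kind of mapping: a synonym table collapses different surface forms onto one canonical concept, and parent links let specific terms roll up to broader ones. Real pipelines would resolve terms against full MeSH or EFO releases; the handful of hand-written mappings here are purely illustrative.

```python
# A tiny, hand-built ontology fragment for illustration only; production systems
# would resolve terms against complete MeSH or EFO releases.

# Synonyms map surface forms onto a canonical concept label.
SYNONYMS = {
    "nsclc": "non-small cell lung carcinoma",
    "non-small cell lung cancer": "non-small cell lung carcinoma",
    "non small cell lung cancer": "non-small cell lung carcinoma",
}

# Parent links capture the hierarchy: a child is a more specific form of its parent.
PARENTS = {
    "lung adenocarcinoma": "non-small cell lung carcinoma",
    "non-small cell lung carcinoma": "lung neoplasms",
}

def canonicalize(term: str) -> str:
    """Map a raw term to its canonical concept label."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

def is_same_or_descendant(term: str, ancestor: str) -> bool:
    """True if `term` names the same concept as `ancestor`, or a more specific one."""
    concept = canonicalize(term)
    target = canonicalize(ancestor)
    while concept is not None:
        if concept == target:
            return True
        concept = PARENTS.get(concept)
    return False

# Differently labelled trials now group under one concept...
print(canonicalize("NSCLC") == canonicalize("non-small cell lung cancer"))  # True
# ...and a specific histology rolls up to its parent disease.
print(is_same_or_descendant("lung adenocarcinoma", "NSCLC"))                # True
```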

In some cases, publicly available ontologies are not detailed or specific enough to support advanced modeling, particularly when it comes to granular distinctions within drug modalities, biomarker usage, or emerging indications. For this reason, data-driven companies like Intelligencia AI create custom ontologies to extend or refine existing frameworks. These proprietary structures enable a higher level of precision in classification, which in turn enhances the interpretability and accuracy of downstream analytics.

Clinical Development Data Quality in Practice: Industry Relevance and Real-World Impact

Strategic decisions in drug development often hinge on a program’s Probability of Technical and Regulatory Success (PTRS), a metric used to prioritize assets, allocate resources, and model financial outcomes. Traditionally, PTRS has been calculated from limited historical data and expert opinion; now, drug developers are increasingly looking to AI for a more rigorous assessment of this important metric. However, even the most sophisticated AI models are only as accurate as the data they rely on.

Poor data introduces noise and bias, which can distort PTRS assessments in several ways:

  • PTRS scores may be artificially inflated or deflated, leading to misinformed go/no-go decisions.
  • Comparisons between internal and external programs become unreliable, making it harder to prioritize assets or evaluate in-licensing opportunities with confidence.
  • NPV models and scenario analyses lose credibility, impacting budget planning and investor confidence.

Inconsistent or incomplete data also hinders the ability to benchmark accurately across programs, indications, or therapeutic areas. The result is often an overreliance on internal judgment, or worse, a false sense of certainty based on flawed inputs.
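
As a purely synthetic illustration of one such distortion, the sketch below simulates a historical benchmark in which failed programs are more likely than successful ones to be missing from the source data, a common form of incompleteness. The naive success rate computed from the surviving records comes out well above the true rate; every number here is invented for the demonstration.

```python
# A deliberately synthetic example: incomplete reporting inflates a naive
# benchmark success rate. All rates and counts are invented for illustration.
import random

random.seed(0)

TRUE_SUCCESS_RATE = 0.30   # assumed "true" rate in this toy universe
N_PROGRAMS = 10_000

observed_outcomes = []
for _ in range(N_PROGRAMS):
    success = random.random() < TRUE_SUCCESS_RATE
    # Assumption: failed programs are harder to find in public sources.
    missing_prob = 0.10 if success else 0.40
    if random.random() >= missing_prob:
        observed_outcomes.append(success)

naive_rate = sum(observed_outcomes) / len(observed_outcomes)
print(f"true success rate:    {TRUE_SUCCESS_RATE:.2f}")
print(f"naive benchmark rate: {naive_rate:.2f}")  # comes out around 0.39
```

Any PTRS estimate benchmarked against such a skewed sample inherits that inflation, which is exactly the kind of bias that completeness checks and careful curation are meant to catch.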

As predictive modeling becomes a standard part of R&D decision-making, data quality is turning into a strategic imperative.

The FDA Sentinel Initiative, a national electronic surveillance system, provides a real-world example of the importance of good data. It uses data from multiple sources, including electronic health records, claims data, and registries, to monitor the safety of FDA-regulated medical products. Sentinel’s approach parallels what Intelligencia AI does on the development side, underscoring the importance of integrating disparate, high-quality data sources for informed decision-making and risk assessment.

For more details on how bad data impacts decision-making in drug development, check out part 3 of our series.

Clinical Development Data Quality in Practice: The Foundation for AI in Drug Development

High-quality clinical development data is a prerequisite for building effective AI models and determines whether those models yield accurate and interpretable results. Attributes such as completeness, consistency, and contextual richness, combined with careful curation and structured ontologies, transform raw information into a solid foundation for risk analysis and informed decision-making in drug development.

In an industry where the cost of a wrong decision is measured in years and millions of dollars, investing in data quality is not optional; it is essential.

From day one, Intelligencia AI has been laser-focused on creating the industry-leading data platform to power our analytics and AI.

De-risk clinical development and enhance decision-making with accurate, AI-driven probability of success.
