Want to Use AI? Start by Wrangling Your Data!

Here is a common misconception: AI is all about the latest, greatest, really cool algorithms.

Sure, there is that, but to get the most precise and accurate answers from your algorithms, a lot of upfront work needs to go into data wrangling. Experts estimate that up to 80% of the time spent deploying AI goes into preprocessing data.

Using AI, specifically machine learning models, to assess the Probability of Technical and Regulatory Success (PTRS) is no exception. In fact, data preprocessing for this application is especially time-consuming.

Let’s have a look at why this is the case.

Data trouble: inconsistent sources

To put the PTRS scores generated with the help of ML algorithms on a solid footing, dozens of data sources need to be accessed and harmonized. Here are some examples:

  • Publicly available data such as ClinicalTrials.gov and DrugBank
  • Publicly available data that needs significant curation, such as scientific publications, conference abstracts, press releases, and scientific posters
  • Data in the private domain that requires a subscription, such as commercial data-provider databases and full-text publications
  • In-house data owned by pharmaceutical companies, e.g., assay data, patient data, pharmacokinetics, and ADME-Tox data, as well as real-world data, such as claims data, which is generally owned by healthcare providers and insurance companies

A major challenge is that the formats, conventions, and even ontologies used are not consistent across sources. For the data to be “digestible” by AI, it needs to be harmonized and transformed so that it is consistent across all sources.
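As a minimal illustration of what that harmonization can look like, the sketch below maps records from two hypothetical sources, each with its own column names and disease terminology, onto a single common schema. The source data, column names, and synonym table are all invented for the example; a real pipeline would rely on curated vocabularies rather than a hand-written dictionary.

```python
import pandas as pd

# Hypothetical raw extracts: the same kind of information, different conventions.
trials = pd.DataFrame({
    "nct_id": ["NCT01234567"],
    "condition": ["Non-Small Cell Lung Cancer"],
    "phase": ["Phase 2"],
})
pipeline_db = pd.DataFrame({
    "trial_ref": ["NCT01234567"],
    "indication": ["NSCLC"],
    "stage": ["2"],
})

# Toy synonym table; in practice this would be a curated ontology or controlled vocabulary.
DISEASE_SYNONYMS = {
    "non-small cell lung cancer": "NSCLC",
    "nsclc": "NSCLC",
}

def normalize_disease(term: str) -> str:
    """Map a free-text disease name onto a canonical label."""
    return DISEASE_SYNONYMS.get(term.strip().lower(), term)

# Transform each source into one common schema before merging.
common_a = trials.rename(columns={"nct_id": "trial_id", "condition": "disease"})
common_a["disease"] = common_a["disease"].map(normalize_disease)
common_a["phase"] = common_a["phase"].str.extract(r"(\d)", expand=False)

common_b = pipeline_db.rename(columns={"trial_ref": "trial_id", "indication": "disease", "stage": "phase"})
common_b["disease"] = common_b["disease"].map(normalize_disease)

harmonized = pd.concat([common_a, common_b], ignore_index=True).drop_duplicates()
print(harmonized)  # one row: trial_id NCT01234567, disease NSCLC, phase 2
```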

Data preprocessing is a significant challenge initially, as large amounts of historical data are needed to train the models. But the task doesn’t end there: repositories have to be updated regularly to include the latest data, and robust data processing pipelines are needed to ensure the data gets pulled, preprocessed, and added on a regular schedule.
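A minimal sketch of one such update step is shown below. The schema columns and the connector names mentioned in the comments are assumptions made for illustration, not a description of any specific pipeline.

```python
import pandas as pd

def refresh_repository(repo: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Append a freshly pulled, preprocessed batch and de-duplicate on a stable key.

    `repo` is the existing harmonized repository; `incoming` is the latest batch
    already mapped to the same schema (here assumed to include trial_id and source).
    """
    merged = pd.concat([repo, incoming], ignore_index=True)
    # Keep the first occurrence so re-running the job never creates duplicate rows.
    return merged.drop_duplicates(subset=["trial_id", "source"], keep="first")

# A nightly or weekly job would call something like:
#   repo = refresh_repository(repo, preprocess(fetch_latest("clinicaltrials.gov", since=last_run)))
# where fetch_latest and preprocess are hypothetical source-specific connectors.
```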

The human touch

So, once the initial data is wrangled and those pipelines are built it all happens automatically, right? Wrong!

Data is never perfect. To name but a few of a long list of possible problems: it can be missing, entered incorrectly, contain duplicate entries, or be available only in a format that cannot be read automatically. One example of this last issue is clinical trial data: rather than being neatly arranged in a single database, it can be spread over countless scientific publications, conference abstracts, and even press releases.

This is where data curation and quality control come in. While some issues, like missing data, are fairly easy to deal with, other tasks require deep domain expertise. Take the case of clinical trial data: to review the literature and extract the relevant data in a consistent manner, you need biologists or medical professionals, not junior data entry staff or software engineers.
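To make that split concrete, here is a minimal quality-control pass, with invented column names, that handles the “easy” issues automatically (exact duplicates, obviously incomplete rows) and routes everything else to a human curation queue.

```python
import pandas as pd

def quality_control(records: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split incoming records into (clean, needs_review).

    Column names (trial_id, enrollment, primary_endpoint) are illustrative only.
    """
    # Easy, rule-based fix: drop exact duplicate rows.
    records = records.drop_duplicates()

    # Flag rows a script cannot safely fix: missing key fields or implausible values.
    needs_review = records[
        records["trial_id"].isna()
        | records["primary_endpoint"].isna()
        | (records["enrollment"] <= 0)
    ]
    clean = records.drop(needs_review.index)
    return clean, needs_review
```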

While another form of AI, Natural Language Processing (NLP), which gives computers the ability to interpret, manipulate, and comprehend human language, can assist with this time-consuming task, it’s ultimately humans who need to do the bulk of it. There is currently no shortcut to data wrangling and no way to fully automate it.
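As an illustration of the kind of assist NLP can provide, the sketch below uses an off-the-shelf named-entity recognizer (spaCy, chosen here only as an example; the text does not name a specific tool) to pre-highlight candidate entities in an abstract, leaving the actual interpretation to a domain expert.

```python
import spacy

# A general-purpose English model; a biomedical model would be a better fit in practice.
nlp = spacy.load("en_core_web_sm")

abstract = (
    "In a phase 2 trial, 120 patients with NSCLC received the study drug; "
    "median progression-free survival was 8.4 months."
)

doc = nlp(abstract)

# Pre-extract candidate entities for a curator to confirm, correct, or discard.
for ent in doc.ents:
    print(f"{ent.text!r:40} {ent.label_}")
```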

Data wrangling often flies below the radar: it’s long hours of focused work, decidedly less exciting than working on the latest, greatest, really cool algorithms. However, for AI the old rule “garbage in, garbage out” applies in full force, so getting the data right is absolutely critical.

At Intelligencia, we have dedicated serious resources to building an extremely solid data foundation and robust pipelines to integrate new data.

We are proud of our work and happy to talk about how it enables us to provide our customers with objective, data-driven AI-based PTRS scores. Contact us here to learn more.

Further reading: Assessing Drug Development Risk Using Big Data and Machine Learning, Cancer Res (2021) 81 (4): 816-819