In today’s rapidly evolving drug discovery landscape, predictive models have emerged as essential tools to accelerate workflows by simulating and predicting biological activity, drug-target interactions, and much more. The utility of these models is highly dependent on the quality and management of the data upon which they are built. At the forefront of this technological revolution is CAS, whose CAS BioFinder Discovery Platform™ is powered by advanced predictive models. To understand how the accuracy of these models leads to true insights for drug discovery scientists, we spoke with Adam Sanford, Director of the Life Sciences Division, and Orr Ravitz, Senior CAS BioFinder™ Product Manager, to delve into the rigorous data management strategies that make CAS a leader in the field.
CAS: To start, could you talk about the CAS approach to data integration, normalization, and harmonization when building your predictive models?
Adam: We have a couple of core philosophies regarding data management. The first is comprehensiveness. We aim to capture as many relevant sources as possible, casting a wide net to ensure our models are built on a robust foundation of diverse data. But it’s not just about collecting data; it’s about making sure that data is usable. This is where our process of human curation and reconciliation comes into play. While this process may seem mundane or excessive, we believe it is vital for building models that achieve a degree of accuracy unattainable through AI-driven extraction alone.
When we bring data in, we focus on three key areas. First, we ensure that specific kinds of entities—such as small molecules, proteins, or pathways—are reconciled to our authority constructs. This involves resolving the many different expressions of an entity into a single identifier. In published literature, it’s common to see hundreds of different representations of a protein or a chemical structure. If you’re not careful, you could end up with what appear to be many independent observations when, in fact, they all describe the same entity. Our process reconciles these variants into a single cluster.
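To make the reconciliation idea concrete, here is a minimal sketch of resolving literature mentions to a single authority identifier and clustering observations accordingly. The synonym table, identifiers, and values below are hypothetical illustrations, not CAS’s actual authority data:

```python
# A minimal sketch of entity reconciliation: mapping the many surface
# forms of an entity onto one authority identifier. The synonym table
# and IDs below are hypothetical illustrations, not CAS data.
from collections import defaultdict

# Hypothetical authority table: each canonical ID owns a set of synonyms.
AUTHORITY = {
    "PROT:0001": {"egfr", "erbb1", "her1", "epidermal growth factor receptor"},
    "PROT:0002": {"tp53", "p53", "tumor protein p53"},
}

# Invert to a lookup from normalized surface form -> canonical ID.
SYNONYM_INDEX = {
    syn: canonical_id
    for canonical_id, synonyms in AUTHORITY.items()
    for syn in synonyms
}

def reconcile(mention: str) -> str | None:
    """Resolve a literature mention to its authority ID, if known."""
    return SYNONYM_INDEX.get(mention.strip().lower())

# Observations that look independent collapse into one cluster
# once each mention resolves to the same identifier.
observations = [("EGFR", 12.0), ("ErbB1", 9.5), ("HER1", 11.2), ("p53", 3.1)]
clusters = defaultdict(list)
for mention, value in observations:
    entity_id = reconcile(mention)
    if entity_id is not None:
        clusters[entity_id].append(value)

print(dict(clusters))
# {'PROT:0001': [12.0, 9.5, 11.2], 'PROT:0002': [3.1]}
```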
Orr: The disambiguation of entities in the literature is key to ensuring model accuracy. For instance, in biology, a protein can be referred to in numerous ways, and these variations can cause researchers to miss large segments of data if all names and forms are not accounted for. Similar challenges exist in chemistry, and we’ve been able to draw on the expertise acquired through our long history of handling chemical data to disambiguate biological entities with high accuracy.
It is not just about identifying entities correctly but also about capturing the experimental context and ensuring that the actual measurements, including the units and methods used, are harmonized effectively.
We spend a lot of energy creating these underlying authorities. For example, when a protein is referenced in the literature, it can be under various names or identifiers depending on the species or modifications. Our approach ensures that all these variations are captured under a single, consistent identifier within our system. This allows us to maintain a high level of precision in our predictions, which is crucial for drug discovery.
Adam: Another critical aspect of our process is normalizing information. This is not a fully automated task—humans are deeply involved. For instance, when we index data, a real scientist will look at an observation made in the literature and determine whether it’s a numerical observation, an activity, or something else. They’ll then reconcile this data to a standard set of units. It’s a meticulous, detail-oriented process that ensures every piece of data is accurate and consistent with the rest of our content.
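As a simplified illustration of the unit reconciliation Adam describes, the sketch below converts potency values reported in mixed units to a single standard unit. A real curation pipeline handles far more unit systems, assay types, and edge cases:

```python
# A simplified illustration of unit normalization: converting potency
# values reported in mixed units to a single standard unit (nM).
# Real curation handles far more unit systems and assay contexts.
UNIT_TO_NM = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,   # micromolar, often written as µM in the literature
    "nM": 1.0,
    "pM": 1e-3,
}

def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration measurement to nanomolar."""
    try:
        return value * UNIT_TO_NM[unit]
    except KeyError:
        raise ValueError(f"Unrecognized unit: {unit!r}")

# Three reports of the same IC50 become directly comparable.
reports = [(0.05, "uM"), (50.0, "nM"), (5e-8, "M")]
normalized = [to_nanomolar(v, u) for v, u in reports]
print(normalized)  # all three values ≈ 50.0 nM
```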
This rigorous approach to data management is what sets CAS apart from others in the field. We’ve built an entire infrastructure to handle this complexity, and it’s this infrastructure that allows our predictive models to be so effective.
CAS: This rigorous approach to data handling must be unique to CAS. How does it help your models benefit drug discovery researchers?
Orr: Our models are built on a foundation of data that we trust deeply, and this confidence translates directly into more accurate predictions. We began testing our models with publicly available data. When we transitioned to CAS curated content, we saw a significant jump in the accuracy of our predictions. We also discovered that we could create more granular models that can be organism-specific and focused on specific modes of action. This is because we not only ensure that the data is accurate, but we also capture the context in which it was obtained.
We employ informatics-driven models in CAS BioFinder. We look at patterns across the data, which is why scale is critically important. The more data we have, the better our models can perform. We start with a “triple”—the right molecule, the right target, and the right measurement—and build from there. Because we’re diligent about the quality of these triples, our models are inherently more reliable.
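A rough sketch of what such a triple might look like as a structured record; the field names here are illustrative, not the actual CAS schema:

```python
# A rough sketch of the "triple" Orr describes: molecule, target, and
# measurement, plus the experimental context that makes it trustworthy.
# Field names are illustrative, not the actual CAS schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActivityTriple:
    molecule_id: str       # authority ID for the small molecule
    target_id: str         # authority ID for the protein target
    measurement: float     # normalized value, e.g. IC50 in nM
    measurement_type: str  # e.g. "IC50", "Ki", "EC50"
    organism: str          # context: species the assay was run in
    source_doc: str        # provenance: where the observation came from

triple = ActivityTriple(
    molecule_id="MOL:12345",
    target_id="PROT:0001",
    measurement=50.0,
    measurement_type="IC50",
    organism="Homo sapiens",
    source_doc="doi:10.0000/example",
)
```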
Adam: CAS BioFinder uses a cluster of five different predictive models, each with its own methodology. Some are heavily structure-based and leverage our chemical data exceptionally well, while others focus on different characteristics of the data. Using an ensemble approach, with each model making predictions from its own perspective, we can combine those predictions into a consensus. This consensus often provides a higher confidence level than any single model could achieve on its own.
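The ensemble idea can be sketched generically: several models score the same ligand-target pair, and their outputs are combined into a consensus, with agreement among models serving as a rough confidence signal. Simple averaging, shown below, is a common baseline rather than the specific combination scheme CAS BioFinder uses:

```python
# A generic sketch of ensemble consensus: several models score the same
# ligand-target pair, and agreement among them raises confidence.
# Simple averaging is one common baseline; CAS's actual combination
# scheme is not described in the interview.
from statistics import mean, stdev

def consensus(scores: list[float]) -> dict[str, float]:
    """Combine per-model activity scores (0-1) into a consensus."""
    return {
        "consensus_score": mean(scores),
        # Low spread across models suggests higher confidence.
        "spread": stdev(scores) if len(scores) > 1 else 0.0,
    }

# Hypothetical outputs from five models for one ligand-target pair.
model_scores = [0.82, 0.79, 0.88, 0.75, 0.84]
print(consensus(model_scores))  # consensus_score ≈ 0.816, spread ≈ 0.049
```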
For example, ligand-to-target activity predictions are at the heart of what CAS BioFinder does. Whether it’s a novel compound or something within our existing database, our models can predict how likely a ligand is to interact with a target, even if no explicit experimental data is available. This capability is incredibly valuable for medicinal chemists who are trying to understand the potential activity of new compounds.
Additionally, we have models that predict metabolite profiles—how the body is likely to process a compound. Understanding the metabolic pathways of a drug candidate is crucial for assessing its safety and efficacy. These predictions are built on experimentally identified metabolites, making them particularly reliable.
Orr: We’re also working on enabling our clients to augment their own proprietary data with our data collection. Many pharmaceutical companies have a deep history of chemistry data that they’ve developed in-house. With the addition of our data, they can create predictive models that are highly specific to their needs. This is a powerful capability that allows them to leverage their expertise while also benefiting from the breadth and depth of CAS data.
CAS: Given the complexities of managing such a vast amount of data, what are some of the biggest challenges you’ve faced in developing these models?
Adam: Creating the authority constructs I mentioned earlier is a painstaking process that requires a lot of human intervention. It’s not something that can be fully automated, especially when dealing with complex chemical and biological information. This process can be exceedingly challenging, particularly when considering the human investment required to ensure everything is correct.
Another significant challenge is the variability in how data is presented in the literature. For example, in patents, the data can be buried in tables, supplementary information, or scattered throughout the document. A machine alone cannot assemble all these pieces correctly. Human curators must intervene to ensure that the data is accurately extracted and normalized. This is not just a one-time task—it’s an ongoing effort that requires constant attention to detail.
Orr: I can provide an anecdote from a recent experience that illustrates this complexity. I encountered a measurement for a known drug approved in the late 1980s. The data referred to a paper published years earlier, and I was surprised to see the structure appear so long before the drug’s approval. When I looked closer, I realized that the publication contained several structures, but none matched the drug in question exactly. It wasn’t until I examined the text describing various substitutions that I found the correct structure. This level of complexity is beyond what current machines can handle, and it underscores the importance of human expertise in our data management processes.
We joke that we’ve built an "edge case machine" because we often deal with these kinds of complexities. Although these edge cases might make up a smaller percentage of the data, they can have an outsized impact on the accuracy of our models. Ensuring that these cases are handled correctly is critical to the overall success of our predictive models.
CAS: As publications and data are constantly emerging, how does CAS ensure that these models remain current?
Adam: Initially, when building our models, we updated them in large batches as we incorporated new data. We now retrain our models more frequently, in some cases bi-weekly. This ensures that our users are always working with the most up-to-date predictions. We’ve established pipelines for integrating new data, which continue to become more efficient.
Orr: We expect to integrate new data into our models within a few weeks of publication. Previously, we trained models only when there was a significant change in the data landscape—for example, when a new target finally had enough data to support a reliable predictive model. Expectations around data modeling and accuracy have been shifting quickly, and we’ll continue to monitor and retrain our models frequently to meet the needs of drug discovery scientists.
CAS: Is there anything on the horizon for CAS BioFinder and your predictive models that you’re particularly excited about?
Adam: Our data and solutions are constantly evolving, literally every day. But as of October 2024, we’re actively exploring several areas, such as ways to incorporate more advanced therapeutic modalities, like protein-based therapeutics and PROTACs, into our predictive framework. These areas are still developing, and we’re excited about the potential to push the boundaries of what our models can achieve. This includes areas like antibody-drug conjugates, which require a different approach to modeling than small molecules. We’re also looking at toxicity predictions, which are becoming increasingly important as the industry moves towards more complex therapeutic modalities.
Orr: Another exciting area is the use of knowledge graphs for predictive modeling. By expanding the biological context we provide—such as pathway information or biomarkers—we can leverage these relationships to create more sophisticated models. This could allow us to predict new drug-target interactions or identify novel biomarkers for diseases. We’re also experimenting with different methods for building these knowledge graphs, which would allow us to offer even more powerful predictive capabilities.
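In its simplest form, a knowledge graph for this purpose is a set of typed relationships that can be traversed to surface candidate links, such as connecting a drug to a disease through a shared target and pathway. The toy example below uses made-up entities; production systems operate at far larger scale and typically rely on learned link-prediction models:

```python
# A toy knowledge graph: typed (subject, relation, object) edges linking
# drugs, targets, pathways, and diseases. Entities are made up; real
# graphs are vastly larger and use learned methods for link prediction.
edges = [
    ("drugA", "inhibits", "PROT:0001"),
    ("PROT:0001", "participates_in", "pathwayX"),
    ("pathwayX", "implicated_in", "diseaseY"),
    ("drugB", "inhibits", "PROT:0002"),
]

def neighbors(node: str) -> list[tuple[str, str]]:
    """Outgoing (relation, object) pairs for a node."""
    return [(rel, obj) for subj, rel, obj in edges if subj == node]

def diseases_reachable(drug: str) -> set[str]:
    """Naive traversal: drug -> target -> pathway -> disease."""
    found = set()
    for _, target in neighbors(drug):
        for _, pathway in neighbors(target):
            for rel, disease in neighbors(pathway):
                if rel == "implicated_in":
                    found.add(disease)
    return found

print(diseases_reachable("drugA"))  # {'diseaseY'}
```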
CAS: What makes your approach to predictive modeling in drug discovery so unique?
Orr: One of the things that truly sets CAS apart is our commitment to transparency and flexibility. We understand that our users may have different preferences regarding computational methods, so we’ve designed the CAS BioFinder Discovery Platform to be more than a single, closed application. Users can download data from our corpus, combine it with their own data, and use it with their preferred methods. This flexibility is crucial for enabling our clients to get the most out of our content and capabilities.
Adam: Every observation in CAS BioFinder is associated with provenance in the literature, meaning users can trace the data back to its original source. This transparency is essential for building trust with our users. We’re not just asking them to trust our models blindly—we’re providing them with the tools to verify the data themselves. This level of transparency and rigor makes CAS the best organization to tackle the challenges of predictive modeling for drug discovery.
CAS: Lastly, if you had a magic wand to change anything about the current state of data management and model building, what would you change?
Orr: We know there’s a bias in published literature towards positive results. But negative data—such as inactive molecules against a target—are just as valuable for building accurate models. Our machine-learning methods would benefit significantly if we had access to more negative data. However, this remains a significant challenge in the industry. It would be great if there were more incentives for academia and industry to publish this data.
Adam: It seems that many view AI and machine learning as a silver bullet that will solve the most challenging drug discovery problems, but that’s very unlikely without substantive changes. Time and time again, these technologies fail when they aren’t built on a solid data foundation. We’ve been repeating this point because it’s so critical: focus your energy where it matters most, on the data itself.
Reflecting on my past experience in the industry, I wish there had been a greater emphasis on the importance of underlying data structure and knowledge management. Today, it’s widely recognized that data is the foundation of successful experimentation and prediction, but many organizations still aren’t fully investing in this area. They recognize it as a problem but don’t always grasp how much energy and resources it takes to get it right. At CAS, we’re designed to handle this complexity, and we’ve seen the benefits of that investment.