Scientific search engines: Why designing them takes art plus science

Adam Sanford

Search engines have become such a standard tool for locating information that the name of the most common one is now a stand-in for the act of searching itself; hence “googling” something. Yet the vastness of information available via search engines can be a hindrance when querying for specifics, such as scientific concepts and research. Returning useful, relevant results without missing novel pieces of information is a challenging balance, but new tools and design capabilities are improving the process.

Scientific researchers know this challenge all too well, and without the right tools, it can be harder to find relevant publications and materials. Return results that are too broad, and you can’t find what really matters. Return answers that are too narrowly focused, and you may miss an important new idea.

Determining the sweet spot for scientific research can be challenging, but fortunately it’s not out of reach.

Recall vs. precision in search engines

To understand scientific searching, it’s first important to understand the ways that search engines function. Google, for example, prioritizes recall. Put another way, it maximizes the number of answers that can potentially be returned for a query. There is no concern that the user needs to review every possible answer, and the technology attempts to rank the results by relevance so that the most valuable information to the user is at the top of the results page.

Conversely, other search engines can prioritize precision, which maximizes the most relevant answers for a given query. This type of searching is akin to using a library’s catalog search: there is an established set of results with numerous fields for search specification, such as title, author, and date. Maximizing precision can increase your chances of getting a manageable result set, but it also increases the possibility of missing something that falls just outside the tuning of the precision search.
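The trade-off described above is usually quantified with two standard information-retrieval metrics. As a minimal sketch in Python (the numbers below are illustrative, not drawn from any real engine):

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A recall-first engine returns everything remotely related: all three
# relevant documents are in the list, but buried among 97 others.
p, r = precision_recall(returned=range(1, 101), relevant={3, 17, 42})
# p == 0.03, r == 1.0

# A precision-first engine returns a tight set but misses one relevant hit.
p2, r2 = precision_recall(returned={3, 17}, relevant={3, 17, 42})
# p2 == 1.0, r2 is about 0.67
```

Maximizing one metric tends to depress the other, which is exactly the balancing act the rest of this article is about.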

Let’s explore an example of this difference: if a user searches the query “Plasticizer 40T,” a typical search engine will not recognize that the entire query represents a unique commercial substance. Prioritizing recall, it will return generic results for plasticizers that don’t include “40T.” The user will have to run a second, refined query to include “40T” in the results.

A more precision-focused, science-aware search, for example via CAS SciFinder, evaluates the query against curated ontology data and recognizes it as a single substance entity. It therefore immediately returns results for the specific substance Plasticizer 40T, even if a different name for this substance is used in the article text (see Figure 1).

Figure 1. Example CAS SciFinder reference search for “plasticizer 40T”.

Designing the right search capabilities is central to what we do at CAS, which is why CAS SciFinder can readily recognize a commercial substance name within a scientific query. Search tools developed by scientists are inherently more sensitive to the terminology used by fellow researchers. Yet searching for one commercial substance is still a fairly well-defined exercise. What happens when a researcher is looking for relevance but also breadth relating to a disease, field of study, or concept?

How to find the search engine sweet spot

A common way to balance breadth and depth is Boolean searching, which uses operators such as “and” and “or” to connect terms. Using quotation marks to lock search terms together can also help refine queries for more precise results.
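To illustrate how “and,” “or,” and quoted phrases narrow or widen a result set, here is a toy Boolean matcher in Python. This is a sketch of the general idea, not any real engine’s query parser:

```python
def matches(document, query):
    """Evaluate a tiny Boolean query against one document.

    query = {"all": [...], "any": [...]}: every "all" term must appear
    ("and"), and at least one "any" term must appear ("or"). A quoted
    phrase is just a term containing spaces, so it matches only as a whole.
    """
    text = document.lower()
    if not all(term.lower() in text for term in query.get("all", [])):
        return False
    any_terms = query.get("any", [])
    if any_terms and not any(term.lower() in text for term in any_terms):
        return False
    return True

docs = [
    "Synthesis routes for Plasticizer 40T",
    "A general review of plasticizers",
]
# The quoted phrase "plasticizer 40T" matches only the first document.
[matches(d, {"all": ["plasticizer 40t"]}) for d in docs]
```

Treating the quoted phrase as one indivisible term is what keeps the broad plasticizer review out of the result set.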

Regardless of how they craft a certain query, researchers can still find the sweet spot with a search engine that uses structured data and recognizes multi-term text strings. Structured data will have a consistent format and be organized into collections of data based on their characteristics. The search algorithm can then more efficiently identify and evaluate which results to return. Recognizing unique multi-term text strings as single entities and searching as such reduces non-relevant results that would be generated by individually searching the term fragments.
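One simple way to recognize a multi-term text string as a single entity is greedy longest-match lookup against a curated dictionary. The sketch below is hypothetical (the entity list is invented, and this is not how CAS SciFinder is implemented), but it shows why the multi-word name wins over its fragments:

```python
# Invented entity index; a real system would draw these from curated data.
KNOWN_ENTITIES = {"plasticizer 40t", "sonic hedgehog"}

def extract_entities(query, entities=KNOWN_ENTITIES):
    """Greedy longest-match lookup: multi-word names win over fragments."""
    words = query.lower().split()
    found, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            span = " ".join(words[i:j])
            if span in entities:
                found.append(span)
                i = j
                break
        else:
            i += 1  # no known entity starts at this word; move on
    return found

extract_entities("toxicity of plasticizer 40T in polymers")
# → ["plasticizer 40t"]
```

Because “plasticizer 40t” is matched as one unit, the engine never falls back to searching the fragment “plasticizer” on its own, which is what would generate the non-relevant results.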

Figure 2. Graphic depicting the inherent challenge of balancing recall and precision in scientific search.

How does one develop this type of content corpus? Through careful curation and the use of ontologies to build a refined understanding of user queries. The CAS Content Collection™, for example, is the largest human-curated repository of scientific information, and our curation policies identify the most relevant terms and substances to create unique indexed entries. Because of this curation, search results aren’t limited to the contents of a title or abstract of a given publication.

Our indexing includes concepts and terminologies from the experimental sections of scientific literature that are specific to the novelty of the science in that publication. For example, an introductory section of a journal article may define a meta context of interest to the author — such as the next breakthrough cure for a certain disease — but the article’s actual novelty is a new analytical method to evaluate a chemical process. A solution like CAS SciFinder that uses curated data will focus on the analytical method, not the meta context, to better respond to a user’s query.

Ontologies connect the right ideas for relevant results

We build these connections through ontologies: curated collections of terminology that capture synonym relationships. These relationships provide a refined yet extensive list of terms to leverage. If a user searches for a substance by a commercial name, our ontology will include variations of chemical names, other commercial names, and even internal identifiers from patent filings. Without these connections, a typical search engine can’t identify relevant results.
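Synonym expansion of the kind described here can be sketched as a lookup into a synonym ring followed by a search over all known names. The ontology fragment below uses invented placeholder names, not real synonyms of any registered substance:

```python
# Toy ontology fragment: each entry maps a substance to its synonym ring.
# The alternate names are invented placeholders for illustration only.
SYNONYMS = {
    "plasticizer 40t": {"plasticizer 40t",
                        "example chemical name a",  # hypothetical
                        "vendor code xt-40"},       # hypothetical
}

def expand_query(term, synonyms=SYNONYMS):
    """Return every name the ontology knows for a term."""
    key = term.lower()
    return synonyms.get(key, {key})

def search(documents, term, synonyms=SYNONYMS):
    """Match a document if it mentions the substance under ANY known name."""
    names = expand_query(term, synonyms)
    return [d for d in documents if any(n in d.lower() for n in names)]

docs = [
    "Thermal stability of vendor code XT-40",  # query name never appears
    "A general review of plasticizers",
]
search(docs, "Plasticizer 40T")
# → ["Thermal stability of vendor code XT-40"]
```

The first document matches even though the literal query string never appears in it; without the synonym ring, a plain keyword search would miss it entirely.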

This is why search tools built by scientists for scientists can drive more efficient innovation: they provide more relevant results faster than a search engine that prioritizes recall, and their data captures hierarchical relationships between key terminologies.

For example, searching for the term Sonic Hedgehog in CAS SciFinder versus a common search engine yields markedly different results. CAS SciFinder immediately recognizes this as a protein and returns relevant scientific publications (see Figure 3). A general search engine, however, returns the well-known video game character, not the protein information a scientist would be seeking.

Figure 3. A CAS SciFinder search example for Sonic Hedgehog, a protein regulating embryonic morphogenesis in all animals.

Beyond ontologies, human-curated indexing further enables discovery where algorithms commonly fail. A human can recognize the connection between a code and a chemical structure and build relationships defining a chemical entity that non-human-curated indexing approaches might miss. This enables capturing data in diagrams, such as molecules and compounds, and connecting that information to explanations that may appear in tables, graphs, or text elsewhere in the publication (see Figure 4).

Without curation, a typical search engine must rely on optical character recognition (OCR) for graphs and diagrams, and something as simple as poor image resolution could lead to an important discovery being missed. Human-curated solutions like those from our teams at CAS leverage the best of human expertise with the speed and algorithmic advances of cutting-edge technology.


Figure 4. Examples of non-text data for curation of ontological information.

Data quality matters for the future of searching

Scientific searching isn’t like looking up a local restaurant review, and both academic and commercial institutions benefit from solutions that balance recall and precision. As the body of knowledge grows in any scientific field, search capabilities need to evolve. This will only become more critical as AI-driven tools become the standard search solutions. Large language models (LLMs) can struggle with scientific searching unless they’re trained on the proper data and architected to handle non-text data.

With expert curation, robust ontologies, and the ability to leverage non-text data, specialized solutions can rise to the challenge of continually refining and improving scientific searching and innovative discoveries.
