Though an increasing number of novel chemical compounds are being synthesized each year, there is growing concern that innovation may be stagnating in small molecule discovery. However, new research published by CAS scientists in the October 2019 Journal of Organic Chemistry reveals that the pace of small molecule innovation is actually accelerating from a structural perspective and offers insights into finding fruitful new areas for further investigation as chemists navigate a vast and largely unexplored chemical space.
Our analysis highlights changes in the framework or scaffold diversity of a large set of organic compounds over time. The findings have practical relevance for future chemical space exploration and support further investment in traditional and emerging approaches to small molecule discovery. As organizations increasingly strive for the most effective and efficient innovation investments, these findings can help inform future small molecule discovery strategies across industries.
Harnessing the power of curated data
The number of potentially synthesizable stable organic molecules is estimated to be 10⁶³ for those with molecular weights of less than 500 Da and 10¹⁸⁰ for those below 1000 Da. With such an immense number of possible compounds, chemists will only ever be able to sample a very small part of the structural diversity of chemical space. This raises the question of whether they are doing so efficiently and productively. In drug discovery, the question is especially pressing: the potential human impact of innovation is high, but pressure to contain costs is growing, so researchers must find more efficient ways to explore a structurally diverse set of compounds.
One way to determine the extent to which the chemical space has been explored in the search for new compounds is to analyze the diversity of known substances and look at trends over time. Since chemical frameworks are a conceptually simple way to understand the common features of molecules, CAS utilized frameworks or scaffolds (i.e., all of a compound’s ring systems and all the linkers that connect them) to assess the diversity of known substances. Each framework can be thought of as a region of the vast chemical space consisting of similar molecules, and the number of known compounds that share a given framework indicates the extent to which that region has been explored.
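To make the framework idea concrete, here is a minimal Python sketch. It is not the method used in the CAS study. It assumes each molecule is supplied as a plain list of bonds between integer atom indices (a hypothetical input format) and finds the framework by repeatedly stripping terminal atoms, which leaves only the ring systems and the linkers between them. The grouping key, a sorted tuple of atom degrees, is a crude stand-in chosen for brevity; a real analysis would need a canonical graph representation to decide when two frameworks are truly identical.

```python
from collections import Counter


def six_ring(start):
    """Bonds of a six-membered ring on atoms start..start+5 (illustrative helper)."""
    return [(start + i, start + (i + 1) % 6) for i in range(6)]


def prune_to_framework(bonds):
    """Strip side chains; return the remaining atoms mapped to their neighbors."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Atoms with at most one neighbor are side-chain ends, not ring or linker atoms.
    terminal = [atom for atom, nbrs in neighbors.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbors:
            continue  # already removed in an earlier pass
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            if len(neighbors[nbr]) <= 1:
                terminal.append(nbr)  # removal exposed a new chain end
    return neighbors


def framework_key(bonds):
    """Crude grouping key (assumption): sorted atom degrees within the framework."""
    return tuple(sorted(len(nbrs) for nbrs in prune_to_framework(bonds).values()))


# Hand-built toy molecules, heavy atoms only.
compounds = {
    "toluene": six_ring(0) + [(0, 6)],
    "ethylbenzene": six_ring(0) + [(0, 6), (6, 7)],
    # Two rings joined by a one-atom linker (atom 6), plus a methyl on atom 10.
    "methyl-diphenylmethane": six_ring(0) + [(0, 6), (6, 7)] + six_ring(7) + [(10, 13)],
}

# Number of known compounds per framework: a proxy for how explored a region is.
counts = Counter(framework_key(bonds) for bonds in compounds.values())
for key, n in counts.items():
    print(f"framework with {len(key)} atoms: {n} compound(s)")
```

In this toy set, toluene and ethylbenzene reduce to the same bare six-membered ring, while the third compound keeps both rings and the linker atom between them but loses its methyl substituent.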
The CAS REGISTRY℠ is a comprehensive collection of more than 150 million chemical substances extracted by our scientific analysts from journal articles, patents, and other sources dating back more than 150 years. This large body of consistently curated substance data uniquely lends itself to detailed comparative analysis across an extended period of time. In this study, we analyzed a subset of 30 million organic compounds for which framework data was available and a clear disclosure date could be validated to identify changes in structural diversity that have occurred over the last ten years. Extensive analysis of this dataset allowed us to determine which scaffolds are used most frequently, along with scaffold types, diversity, and distributions.
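For a rough sense of scale, the short calculation below compares the figures quoted in this article: more than 150 million registered substances against an estimated 10⁶³ possible organic molecules under 500 Da. It is only an order-of-magnitude illustration. The registry also contains substances outside that class, so the result is best read as a generous upper bound, and the variable names are ours.

```python
# Figures quoted in the article; the comparison itself is illustrative only.
known_substances = 150_000_000      # "more than 150 million" registered substances
estimated_space = 10 ** 63          # estimated stable organic molecules under 500 Da

# Upper bound on the fraction of that space represented by known substances.
fraction_sampled = known_substances / estimated_space
print(f"fraction of estimated space sampled: {fraction_sampled:.1e}")
```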
What was revealed and why does it matter?
This study builds on our previously published framework analysis, which took a snapshot view of the CAS REGISTRY. More than ten years have elapsed since the publication of that first study, allowing us to compare data over a decade.
The key finding is that the number of new scaffolds at the graph/node level in CAS REGISTRY almost doubled in the 10-year interval between 2008 and 2018. This shows a high degree of innovation and demonstrates that scientists are increasingly venturing into unexplored regions of chemical space. Figure 1 shows the number of scaffolds, categorized by their year of first report, in ten-year intervals from pre-1949 to 2018. It is extremely encouraging to have such clear confirmation that innovation continues to accelerate, with the volume of new scaffolds being utilized doubling almost every 10 years.
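To put a doubling time in more familiar terms, the sketch below converts "doubling in roughly 10 years" into an equivalent compound annual growth rate. It assumes smooth exponential growth, which is our simplification rather than a claim from the study, and the function name is illustrative.

```python
def annual_growth_rate(factor, years):
    """Constant yearly growth rate that multiplies a quantity by `factor` over `years`."""
    return factor ** (1 / years) - 1


# Doubling (factor of 2) over a 10-year interval.
rate = annual_growth_rate(2, 10)
print(f"equivalent annual growth: {rate:.2%}")  # about 7.18% per year
```

Under that assumption, a doubling each decade corresponds to the number of new scaffolds growing by about 7% per year.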
The study also highlights that exploration of the chemical space is proceeding along two tracks: the re-use of previously used scaffolds (resulting in molecules with some structural similarity to previous ones) and the creation of new scaffolds (producing structurally novel molecules). This is a rational and effective strategy which is commonly used in drug discovery.
It is also clear that scaffold diversity has increased, as the addition of a large number of new scaffolds more than offset the extensive re-use of a relatively small number of existing scaffolds. Most of the new scaffolds were based on relatively new topological shapes rather than being new variations on old shapes. This suggests scientists are pushing the boundaries of the known chemical space.
Mapping future exploration
These findings provide scientists with evidence of the progress to date and the tremendous innovation opportunity still remaining in small molecule chemistry. In addition, they highlight the importance of pushing the boundaries even further. That said, figuring out which areas of chemical space to explore remains a critical decision to ensure efficient innovation. Knowing what areas are widely explored and under-explored guides scientists in navigating the landscape and improves the odds of success by identifying promising areas with limited activity to date. A clearer view of the current innovation landscape on a structural level also informs emerging efforts to leverage advanced data analytics and machine learning to accelerate innovation and identify whitespace by utilizing the known chemical space to chart the unknown.
Interested in discussing how this approach can be customized to inform your strategies for exploring chemical space? Contact us