Parsing 20 Years of Public Data by AI Maps Trends in Proteomics and Forecasts Technology


Published in Journal of Proteome Research (Josiah J. Green et al.)

Abstract

The trends of the last twenty years in biotechnology were revealed using artificial intelligence and natural language processing (NLP) of publicly available data. Implementing this “science-of-science” approach, we capture convergent trends in the field of proteomics in both technology development and its application across the phylogenetic tree of life. With major gaps in our knowledge about protein composition, structure and location over time, we report trends in persistent, popular approaches and emerging technologies across 94 ideas from a corpus of 29 journals in PubMed over two decades. New metrics for clusters of these ideas reveal the progression and popularity of emerging approaches like single-cell, spatial, compositional and chemical proteomics designed to better capture protein-level chemistry and biology. This analysis of the proteomics literature with advanced analytic tools quantifies the Rate of Rise for a next generation of technologies to better define, quantify and visualize the multiple dimensions of the proteome that will transform our ability to measure and understand proteins in the coming decade.

Keywords: genomics, proteins, proteomics, proteoforms, single-cell biology, chemical proteomics, spatial biology, structural proteomics, single molecule protein sequencing, artificial intelligence, data mining, literature trends, consilience (convergent evidence), natural language processing, prediction


Introduction

There is tremendous scientific and technological potential in the field of proteomics. Human diseases involve proteins, yet our current understanding of the proteome is incomplete, with major gaps in knowledge about the composition, localization and dynamics of proteins. Identifying the next key breakthroughs in protein-level chemistry and biology is of high interest to government and foundation stakeholders of biomedical research and to companies, as many aspire to see genomics-level impact from proteomics this decade. Based on recent technological advances in science-of-science analysis, we are able to leverage state-of-the-art technology and terabytes of publicly available data to inform and quantify trends in the past and future of proteomics1,2. We begin by introducing our analytic approach against the backdrop of recent advances in science-of-science technologies and scientific funding. We then describe the results in aggregate (i.e., proteomics of biology, the science that underpins technological advancements) and then turn to results pertaining to proteomics technology. We conclude by highlighting case studies such as PROTACs and providing forecasts for technologies from the spatial, compositional, and single-cell proteomics domains.

Today, no automated methods exist that systematically identify hidden relationships3,4 at scale in a given corpus of scientific and technical literature5,6. Recent trends and associated funding priorities7,8 related to science and AI would treat this as a “black-box” problem and apply the latest from Machine Learning. Evidence that this approach could come to fruition is becoming more prevalent9–14. Despite being a solution to the scaling problem, black-box methods have their own shortcomings: (1) Machine learning generates predictions that are difficult to explain without supervision15–17, and (2) Black-box processes do not allow for humans to intervene and correct system errors, whereas human ‘in the loop’ approaches will correct system errors, a vital function in identifying highly idiosyncratic scientific concepts18–20. Recent years have also seen manifold advances in the ability to process and analyze semi-structured and unstructured data at scale through automated text analysis and pattern recognition21. Instead of black-box approaches, tried-and-tested methods from fields like Bayesian statistics and information theory are proving successful in gleaning insights from this new genre of data22,23. Extremely diverse data like the raw text of research publications, auto-generated tags for journal papers, etc. can now be processed using the same methods Claude Shannon applied to structured signal data in the mid-twentieth century24,25.

The success in genomics was a function of recurring, increasing funding in the wake of the Human Genome Project and activities of the National Human Genome Research Institute. Increases in funding drove consistent publications, patents and new company creation leading to next-generation sequencing (NGS)26. As an example of parsing public data, we quantified the extent to which the fields of genomics and proteomics were funded between 2000 and 2021 (Supplemental Fig. S1). This is based on National Institutes of Health (NIH) cumulative grant funding, which is calculated by taking the total cost of NIH projects with abstracts mentioning ‘proteo’ or ‘genomic’ and budget start dates in the years 2000–2021. “Project” and “Abstract” data for NIH grants were pulled from the ExPORTER function of NIH RePORTER (https://reporter.nih.gov/). By 2021, a crescendo of increases in public and private sector support for proteomics and new companies was notable (see Supplemental Table S1). The ability to deal with the Human Proteome in a more deterministic fashion will rely on new technologies like AlphaFold27 that are better matched to capture the molecular complexity of our biology at the protein level. It is against this backdrop of technology advancement and scientific funding that our analysis was conducted.
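The cumulative funding calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`abstract`, `start_year`, `total_cost`) and the toy rows are hypothetical stand-ins for the NIH "Project" and "Abstract" exports, and keyword matching is done by simple case-insensitive substring search.

```python
from collections import defaultdict

# Toy records standing in for joined "Project" and "Abstract" rows.
# Field names and values are hypothetical, not the real export schema.
projects = [
    {"abstract": "Quantitative proteomics of kinase signaling", "start_year": 2001, "total_cost": 250_000},
    {"abstract": "Comparative genomic analysis of yeast", "start_year": 2001, "total_cost": 400_000},
    {"abstract": "Proteogenomic mapping in tumors", "start_year": 2010, "total_cost": 300_000},
    {"abstract": "Behavioral study of sleep", "start_year": 2005, "total_cost": 150_000},
    {"abstract": "Genomics of crop resistance", "start_year": 2022, "total_cost": 500_000},
]

def cumulative_funding(records, keyword, first_year=2000, last_year=2021):
    """Return {year: cumulative total cost} for abstracts containing keyword."""
    per_year = defaultdict(int)
    for rec in records:
        in_window = first_year <= rec["start_year"] <= last_year
        if in_window and keyword in rec["abstract"].lower():
            per_year[rec["start_year"]] += rec["total_cost"]
    running, out = 0, {}
    for year in range(first_year, last_year + 1):
        running += per_year[year]  # years without matches add zero
        out[year] = running
    return out

proteo = cumulative_funding(projects, "proteo")
genomic = cumulative_funding(projects, "genomic")
print(proteo[2021], genomic[2021])
```

With substring matching, an abstract that mentions "proteogenomic" counts toward both totals; whether the original analysis handled such overlaps differently is not stated in the text.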

Methodology

We analyzed a corpus of 29 journals in PubMed (Supplemental Section 2). Selection of the corpus listed in Supplemental Table S2 used the following criteria: (1) occurrence in PubMed, (2) journals that are of domain relevance to the field of proteomics, and (3) articles that contain ‘proteo’ language in the article abstract; (4) impact factors and other metrics were not criteria. Based on this corpus, we applied explainable, information-theoretic AI approaches such as latent Dirichlet allocation (LDA) and Network Analysis to map macro-level trends in proteomics by identifying groupings of similar words, called ideas, and analyzing the growth and development of 94 ideas across the 2000s and 2010s through to 2023.

The purpose of the AI approaches implemented is to identify hidden relationships in large datasets. The system uses the language from journal articles to identify key ideas from the community; it does not use bibliographic techniques such as citations. Ideas are groupings of similar words defined by the latent Dirichlet allocation (LDA) process28–31, a method that uses the probability of word co-occurrence in a document to reveal hidden clusters of co-occurring words, which we call ideas. Once the ideas are defined, a Network Analysis is conducted with the ideas as nodes and co-occurrence of ideas as edges32,33. Each idea is conveyed using a hyphenated list of three words that represents its cluster of related terms found within the corpus from 1999–2023. From the network of ideas, the system computes the Popularity, Network Connection, and Rate of Rise for individual ideas; Actual Connection and Expected Connection are computed for 2-way idea pairs (described in Table 1). From the network of ideas, the system also generates themes using Louvain clustering34 to detect major sub-communities within the network. The ideas within each sub-community are labeled based on their shared terms.
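The network step (ideas as nodes, co-occurrence of ideas as edges) can be illustrated with a small Python sketch. It assumes the topic-modeling step has already assigned a set of ideas to each abstract; the `doc_ideas` input and the `build_network` helper are hypothetical, and the idea labels are borrowed from the paper purely as example strings.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of ideas detected in each abstract.
doc_ideas = [
    {"drug-target-inhibitors", "ubiquitin-proteasome-ligase"},
    {"drug-target-inhibitors", "activation-apoptosis-kinase"},
    {"drug-target-inhibitors", "ubiquitin-proteasome-ligase", "database-spectra-search"},
    {"database-spectra-search"},
]

def build_network(docs):
    """Count idea occurrences (nodes) and pairwise co-occurrences (edges)."""
    node_counts = Counter()
    edge_counts = Counter()
    for ideas in docs:
        node_counts.update(ideas)
        # Sort so each undirected pair is always stored under one key.
        for a, b in combinations(sorted(ideas), 2):
            edge_counts[(a, b)] += 1
    return node_counts, edge_counts

nodes, edges = build_network(doc_ideas)
for pair, weight in edges.most_common():
    print(pair, weight)
```

The resulting edge weights are the raw material for pair-level metrics and for community detection; the Louvain step itself is not reproduced here.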

Table 1: Description of analytic metrics that are output from the AI Engine.

Metric Name | Insight | Description
Popularity (Individual Idea) | Relative presence of an idea | The relative presence of an idea in the corpus
Network Connection (Individual Idea) | How connected is an idea | Average connection of an idea across its pairwise connections in the corpus
Rate of Rise (Individual Idea) | Early sign of idea popularity | Rate of change of Popularity of idea
Actual Connection (2-Way Idea Pair) | Connectedness of two ideas | Percentile of the pairwise co-occurrence between 2-way idea pairs in a corpus
Expected Connection (2-Way Idea Pair) | Emerging connection between two ideas | Link between a 2-way idea pair is strengthened by the presence of a shared relationship to a third idea
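To make the percentile-style metrics in Table 1 concrete, the following Python sketch ranks a handful of placeholder ideas by their share of abstracts (a stand-in for Popularity) and by the change in that share between two epochs (a stand-in for Rate of Rise). The exact formulas used by the AI Engine are not given in the text, so the percentile definition, the `percentile_ranks` helper, and all numbers below are assumptions for illustration only. Expected Connection is not covered.

```python
def percentile_ranks(values):
    """Map each key to the percentage of other keys with a strictly lower value."""
    n = len(values)
    ranks = {}
    for key, v in values.items():
        below = sum(1 for other in values.values() if other < v)
        ranks[key] = 100.0 * below / (n - 1) if n > 1 else 100.0
    return ranks

# Hypothetical share of abstracts mentioning each idea in two epochs.
share_prev = {"idea_a": 0.10, "idea_b": 0.05, "idea_c": 0.02, "idea_d": 0.01}
share_now = {"idea_a": 0.11, "idea_b": 0.04, "idea_c": 0.06, "idea_d": 0.01}

# Popularity: percentile rank of current relative presence.
popularity = percentile_ranks(share_now)

# Rate of Rise: percentile rank of the change in relative presence.
change = {k: share_now[k] - share_prev[k] for k in share_now}
rate_of_rise = percentile_ranks(change)

for idea in share_now:
    print(idea, round(popularity[idea]), round(rate_of_rise[idea]))
```

In this toy data, idea_c is only mid-ranked in Popularity but ranks highest in Rate of Rise, which is the kind of early signal the metric is described as capturing.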


Using metrics such as Rate of Rise and Expected Connection, we further indicate how quickly ideas are changing today and which areas are likely to grow in the 2020s. For technologies to meet the challenges that protein biology presents, next-generation insights will need to come from closing the knowledge gaps that prevent a deeper, holistic understanding of the proteome.

Results

Similar to the foundations of genomics in the early 2000s, proteomics technologies have matured over the past 20 years, and metrics can capture the trends of these two decades. From the 29 journals surveyed here, a corpus of 25,565 proteomics articles (Supplemental Table S2) was found by parsing the abstracts using LDA. From this set of articles, 94 ideas were captured by the analytics engine and described in shorthand using a 3-word summary of a cluster of related terms (Supplemental Table S3). Note that the 3-word summary is shorthand for the full 20-word cluster detected by the model. From the word clusters mapping to the 94 ideas, 8 general themes were found (Fig. 1). The analytics methods used to derive these ideas and themes are described in Table 1. To capture the Popularity of these ideas and themes and track their dynamics in the field over time (vide infra), we note that the 8 themes from articles in the 29-journal corpus broke down into biological domains aligned with the phylogenetic tree and into proteomics technologies. For example, Theme 7 (“Compositional Proteomics”) had a Popularity score of 65% (Fig. 1), meaning that the 14 ideas in this theme averaged this score. From 1999–2014 in general, ideas were being changed and exchanged across many different groups including plant biology, mammalian biology, cancer research, and technology. By the most recent epoch (2020–2023), ideas have started to stabilize around technology and biology themes.

Fig. 1.
Popularity and Network Connection of ideas within the PubMed Proteomics Corpus across the history of data (1999–2023). Idea Popularity: a high score means ideas in this theme are more prevalent in the corpus compared to the average. Network Connection: a high score means ideas in this theme more regularly co-occur with popular ideas compared to the average.

The 94 individual ideas found within the corpus from 1999 to 2023 were assessed for their stability over time (Fig. 2a). The most recent Top Quartile of Popularity of these individual ideas is shown in Fig. 2a; these ideas have become stable science, a signal that a foundation has been laid for more complex concepts. (A full list of Popularity percentile rankings for individual ideas is provided in Supplemental Table S3.) Relatively consistent and stable individual ideas can be thought of as base ingredients at a buffet that enable novel and dynamic combinations of 2-way idea pairs, a construct that can capture specific channels of future innovation. To assess trends in the field, we assembled and used these 2-way idea pairs (Fig. 2b) and visualized them with networks (Fig. 2c and d). The 2-way idea pairs have stabilized into three broad buckets (Fig. 2b): settled science, active research, and concepts trending downward in popularity. From the total of 8,836 2-way idea pairs in the data, many high-potential, high-fidelity ones emerged that showcase the breadth of opportunities in proteomics. The analysis indicates that 41 2-way idea pairs are persistent and in the Top Quartile of Actual Connection over each 3-year epoch of the 24 years studied here (Fig. 2b). These are in the settled science bucket. Of the ~8,800 2-way idea pairs, ~10–15% are making large positive jumps in Expected Connection (defined in Table 1). These are part of the active research bucket. These increases can be driven by technology breakthroughs, effective collaborations, and active, growing communities. Another 10–15% of 2-way idea pairs are making large negative movements in Actual Connection (defined in Table 1). This signals the ‘fizzling-out’ of these 2-way idea pairs, which form the low-impact bucket in the swath of literature surveyed here.

Fig. 2.
Persistence of individual ideas (a, at left) and 2-way idea pairs (b, at right) in the Top Quartile Popularity and Actual Connection strength, respectively, from the beginning of the dataset to the most recent epoch (2020–2023), a 24-year period from Jul. 1999 – Jun. 2023. Stability over time is defined as the number of years an idea or 2-way idea pair persists in the Top Quartile. Full details of the list of 24 individual ideas in the most recent epoch’s Top Quartile (Jul. 2020 – Jun. 2023) are listed in Supplemental Table S3. (c, at left) Network of 24 most popular proteomics ideas and their connections in the Top Quartile in most recent epoch (2020–2023). Size of the nodes indicates relative Popularity of the idea in the corpus. Width of the edge indicates relative strength of the Actual Connection between the 2-way idea pair in the corpus. Colors of nodes associated with the key theme they represent. (See Supplemental Table S3 for a full list of ideas and Fig. 1 for the list of themes.) (d, at right), Network of all 94 proteomics ideas in the most recent epoch (2020–2023). 
The ideas in panels a and c are listed explicitly below (with the key theme assignments in parentheses):
24 years: high-throughput-detection-technology (Single-Cell Proteomics, Compositional Proteomics), digestion-extraction-trypsin (Structural Proteomics), chemical-probes-affinity (Plant Proteomics, Mammalian Proteomics), database-spectra-search (Compositional Proteomics)
21 years: variation-differential-measurements (Mammalian Proteomics, Compositional Proteomics), protein/protein-interaction-network (Bacterial Proteomics, Structural Proteomics), diagnostic-biomarker-therapeutic (Chemical Proteomics)
18 years: serum-fluid-diagnostic (Mammalian Proteomics), breast-tumors-carcinoma (Mammalian Proteomics)
15 years: statistical-algorithm-library (Compositional Proteomics)
12 years: systems-biology-profiling (Single-Cell Proteomics), genome-annotation-predicted (Compositional Proteomics)
9 years: modifications-posttranslational-PTMs (Chemical Proteomics), upregulated-downregulated-DEPs (Plant Proteomics, Compositional Proteomics)
6 years: mRNA-translation-ribosome (Plant Proteomics), chromatin-factors-transcription (Bacterial Proteomics), drug-target-inhibitors (Chemical Proteomics), structural-conformational-folding (Chemical Proteomics, Structural Proteomics), lung-melanoma-tumors (Mammalian Proteomics)
3 years: transcription-inhibition-NPC (Chemical Proteomics), ion-fragmentation-dissociation (Compositional Proteomics), domain-structure-site (Chemical Proteomics), SARS-CoV-immune-inflammatory (Mammalian Proteomics, Structural Proteomics), infection-virus-host (Structural Proteomics)
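The stability measure used in the Fig. 2 caption (years an idea or 2-way idea pair persists in the Top Quartile, in 3-year epochs) can be sketched as follows. The caption does not say whether persistence must be unbroken; this sketch assumes it is counted as consecutive epochs working back from the most recent one, and the membership flags are hypothetical.

```python
EPOCH_YEARS = 3  # each epoch in the study spans three years

def stability_years(in_top_quartile):
    """Years of unbroken Top Quartile membership, counted back from the latest epoch.

    in_top_quartile: list of booleans, one per epoch, oldest first.
    """
    years = 0
    for flag in reversed(in_top_quartile):
        if not flag:
            break
        years += EPOCH_YEARS
    return years

# Hypothetical membership flags for eight 3-year epochs (24 years in total).
always_top = [True] * 8
recent_entry = [False] * 6 + [True, True]
dropped_out = [True] * 7 + [False]

print(stability_years(always_top))    # 24
print(stability_years(recent_entry))  # 6
print(stability_years(dropped_out))   # 0
```

Under this reading, the groups in the list above (24 years, 21 years, and so on down to 3 years) correspond to ideas that entered the Top Quartile eight, seven, and so on down to one epoch ago and have stayed there since.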

Proteomics across technology and the phylogenetic tree

The evolution of proteomics over the past 20 years has seen an overall increase in the connectedness of ideas related to core biology questions that deploy persistent proteomics platforms to diverse bacteria, plants and mammals. These concepts have remained stable in terms of their idea Popularity but have increased in the strength of their connections (Network Connection) to the rest of the ideas (Fig. 1).

Proteomics trends across bacterial, plant and mammalian biology

Of the 94 ideas in the PubMed abstracts, over 30% are directly related to implementation of established proteomics platforms in bacterial, plant or mammalian biology. This represents prior epochs where proteomics (often mass spectrometry-based) was used in both discovery and targeted mode. During these earlier epochs, starting in 1999 and lasting until 2014, proteomics was applied to biology across the phylogenetic tree (Fig. 2a). Note that Bacterial Proteomics had the highest Popularity at 48% (i.e., the average of ideas in this theme had the highest occurrence frequency in the corpus compared to the other two biology themes). Surprisingly, Plant Proteomics had the highest Network Connection at 59%, indicating that the 12 ideas in this theme contain many foundational concepts most frequently connected to the other 82 ideas in the corpus. This evidence shows the extent to which proteomics is becoming established and increasingly utilized to understand the biology of diverse organisms.

Sussing ideas and hidden relationships for technologies and their applications

Foundational biology concepts (ideas in Themes 1, 2 and 3 of Fig. 1) have higher Network Connection scores than the broader proteomics corpus, indicating strong connections and a clear establishment of ideas in the literature. This contrasts with technology and methods concepts (i.e., the ideas in Themes 4, 5, 6, 7 and 8 in Fig. 1), which have higher idea Popularity scores (and lower Network Connection scores) than the broader proteomics corpus. This indicates strong momentum in the development of these ideas. Indeed, the development of methods and technology for each of the three pillars of proteomics (sample processing, data acquisition and informatics) has been ongoing and robust throughout the >20 years studied here. We found that a technology idea entering the Top Quartile, and especially the Top Decile, of Popularity is a signal of a good, established technological platform. For example, the top three technology ideas in the Top Quartile across the entire timeframe (1999–2023) were: high-throughput-detection-technology (97%), chemical-probes-affinity (95%) and database-spectra-search (89%). Notably, the top three ideas in just the most recent epoch (2020–2023) were statistical-algorithm-library (100%), drug-target-inhibitors (99%), and SARS-CoV-immune-inflammatory (98%). Up-and-coming hot topics like single-cell proteomics (63% Popularity average across its eight ideas; average of top four was 80%, Supplemental Table S3) can be tracked using Rate of Rise (below), which addresses the lag between the published literature and what people are submitting in grants or seeing at conferences.

Chemical & structural proteomics trends accurately identified

Once individual ideas are established, we observe an increase in the quality of 2-way idea pair connections to them. An example of this is the 2-way idea pair drug-target-inhibitors (a member of the Chemical Proteomics theme, which ranked second in the 2020–2023 epoch with a Popularity score of 99%) combined with ubiquitin-proteasome-ligase (Structural Proteomics theme, which was a top-third ranked idea in the most recent epoch with a Popularity score of 71%). The most relevant papers for this 2-way idea pair exemplify the concept of PROTACs (Supplemental Table S4, Supplemental Section 4), an acronym for proteolysis targeting chimeras, which gained popularity in the past decade as an approach for small molecule-mediated removal of protein drug targets or of proteins that are undesirable due to dysfunction. The model we developed was able to accurately forecast the beginnings and maturation of this line of research (Supplemental Fig. S3, Supplemental Section 4). The data show that this 2-way idea pair has been steadily gaining traction in the literature (Fig. 3b, top panel) and is now among the highest scoring in the dataset in the most recent epoch (i.e., it was in the Top Decile for Actual Connection and the Top 1% of Expected Connection for the period 2020–2023). With landmark papers in 2015 (ref. 35), this field was set up to grow markedly from 2017–2023. In fact, of the top 50 papers that map to this 2-way idea pair from 2020–2023, 30 map directly to the PROTAC sub-field. Thus, the model and five metrics used here (defined in Table 1) were able to accurately capture the beginnings and recent growth of the PROTAC line of research (Supplemental Fig. S3, Supplemental Section 4).

Fig. 3.
Analysis of three exemplary 2-way idea pairs in drug discovery in kinome and proteasome biology. The upper panels show the evolution of Expected Connection versus Actual Connection of the three 2-way idea pairs across 3-year epochs (last year of the epoch is shown). The lower panels show the count of articles mapping to the 2-way idea pairs across time. An article is counted when both ideas in the 2-way idea pair are among the top 10 most mentioned ideas for that paper (based on the overlap between the language of the article’s abstract and the ideas’ word clusters called out in Supplemental Table S3). Article count is highest when the language of a 2-way idea pair is new in the corpus. As the area matures, fewer textual cues are detected because words tied to the ideas are used less prominently in the text and other 2-way idea pairs become more prominent. Maturing language leads to more regular and consistent use of these ideas in the text. Note for lower panels: 2023 paper count data are as of June 1st.
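The article-counting rule in the Fig. 3 caption can be sketched in Python. This is an illustration under assumptions: the scoring here is a plain count of abstract tokens that fall in an idea's word cluster, which may differ from the overlap measure actually used; the word clusters are shortened to three words each (the real clusters have 20); and `top_ideas` and `count_pair_articles` are hypothetical helper names.

```python
def top_ideas(abstract, idea_words, k=10):
    """Return up to k ideas, ranked by how many abstract tokens hit their word cluster."""
    tokens = abstract.lower().split()
    scores = {
        idea: sum(1 for t in tokens if t in words)
        for idea, words in idea_words.items()
    }
    mentioned = [idea for idea, s in scores.items() if s > 0]
    mentioned.sort(key=lambda idea: (-scores[idea], idea))  # ties broken by name
    return mentioned[:k]

def count_pair_articles(abstracts, idea_words, pair, k=10):
    """Count abstracts in which both ideas of the pair are among the top-k ideas."""
    a, b = pair
    n = 0
    for text in abstracts:
        top = top_ideas(text, idea_words, k)
        if a in top and b in top:
            n += 1
    return n

# Hypothetical, heavily shortened word clusters.
idea_words = {
    "drug-target-inhibitors": {"drug", "target", "inhibitors"},
    "ubiquitin-proteasome-ligase": {"ubiquitin", "proteasome", "ligase"},
    "database-spectra-search": {"database", "spectra", "search"},
}
abstracts = [
    "a ligase recruiting drug degrades its target via the proteasome",
    "database search of tandem mass spectra",
    "kinase inhibitors as drug leads",
]
pair = ("drug-target-inhibitors", "ubiquitin-proteasome-ligase")
print(count_pair_articles(abstracts, idea_words, pair))  # 1
```

Only the first toy abstract uses vocabulary from both clusters, so it is the only one counted for this pair.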

Results from two other 2-way idea pairs relevant to chemical genetics and drug discovery are shown in the top panels of Fig. 3a and 3c. These reflect the large and persistent interest in the kinome, captured in the 2-way idea pair between drug-target-inhibitors and activation-apoptosis-kinase (Fig. 3a) and the 2-way idea pair between drug-target-inhibitors and phosphopeptides-kinase-substrates (Fig. 3c). Kinase inhibitors, tool compounds and drug leads in kinome research encompass a huge swath of activity durably in the Top Decile of Popularity and of both the Actual and Expected Connection rankings over the past 20 years (Fig. 3a). As shown in the lower panels of Fig. 3, the number of papers using key terms for these popular ideas increases ~5-fold moving into the 2010s and 2020s. The interpretation is that the science and its language were relatively new and used often in abstracts, and therefore ranked prominently, around the same time that the Actual Connection between the ideas was increasing. As the area captured by the 2-way idea pair matures, the textual cues are detected less frequently because words tied to these 2-way idea pairs are used less and other language becomes more prominent.

Chemical proteomics: ideas & recent trends

Ideas in chemical proteomics have grown in popularity across the 24 years, from one idea in the Top Quartile of idea Popularity in 2002 to five as of 2023 (Fig. 2a). This suggests that these key chemical proteomics ideas provide a stable foundation from which researchers can explore 2-way idea pairs (e.g., in Fig. 3). This is reflected in the ideas that roll up into the theme of Chemical Proteomics, with its overall Popularity and Rate of Rise shown longitudinally in Fig. 4a and 4b, respectively. The data show that chemical proteomics has established itself as a top field of interest in the literature studied here. Ideas reflected by drug-target-inhibitors have recently emerged near the top (entering the Top Quartile 6 years ago in the 2017–2020 epoch, Fig. 2a), showing signs of momentum in the subset of the field containing activity-based protein profiling and ultra-deep bottom-up proteomics, with the combination of these technologies only now being published29. For the PROTAC area represented by the 2-way idea pair in Fig. 3b, the model used here predicts that this area will stay in the Top Decile of Actual Connection strength for the next 3-year horizon (Supplemental Fig. S3, Supplemental Section 4).

Fig. 4.
Longitudinal view across 3-year epochs from 1999–2023 of the percentile rank of average Popularity (a) and Rate of Rise (b) of all the individual ideas within the key technology themes of Fig. 1 (Compositional Proteomics, Chemical Proteomics, Single-Cell Proteomics, Structural Proteomics and Spatial Proteomics) and the longitudinal view across 3-year epochs from 1999–2023 of the percentile rank of Popularity (c) and Rate of Rise (d) for ideas in the Spatial Proteomics theme.

Structural proteomics: ideas & recent trends

The presence of Structural Proteomics ideas in the Top Quartile of Popularity has been consistent since the early 2000s, with protein/protein-interaction-network having a Popularity score of 75% or higher from 2002–2023 (Fig. 2a). Structural Proteomics had a few key ideas whose metrics moved upward in the most recent three epochs (nine years), as shown in Figs. 2a and 4b. SARS-CoV-immune-inflammatory, structural-conformational-folding, and infection-virus-host all entered the Top Quartile of Popularity in 2020–2023, with recent Popularity scores of 98%, 91% and 85%, respectively.

Discussion

Proteomics Technologies – combinations and future trends

This work proposes an alternative method to prompt and prioritize technology development – one that depends on systematic exploitation of primary data and reliable, human-understandable analytics. Our results assert that such data and analytics can map scientific ideas and their 2-way idea pairs while also grouping them into relevant themes in a field. In addition to stable Popularity scores throughout the timeseries (Fig. 4a), themes have above average Network Connection scores in 4 of the 8 epochs which signals that their ideas are becoming well connected to other themes through many 2-way idea pairs. Many 2-way idea pair connections have enabled past breakthroughs in key areas such as pathway analysis in intracellular signaling, chemoproteomics to advance drug discovery by hitting “undruggable” targets36, and newer themes like spatial and single-cell proteomics. It is with this data-generated map of ideas and metrics in hand that we can look ahead to what technologies are on the rise and what combinations of them are poised for growth and major breakthroughs.

To forecast emergent trends, we look to the most recent epochs (2017–2020, 2020–2023) that show a rise in metrics associated with some proteomics technology themes (see far right of Fig. 4b). Note the particularly high values for the Rate of Rise for Single-Cell Proteomics, Compositional Proteomics and Spatial Proteomics, with the ideas within the latter broken out in Fig. 4d. Many of the ideas under the five general technology themes show these recent upticks in Rate of Rise and are predicted to continue growing in both their Network Connection and idea Popularity in the future. Specifically, ideas that show the largest increases in their Rate of Rise (i.e., 40–75% upticks from 2017 to 2023) map to three themes: single-cell proteomics, improvements in the characterization of protein and proteoform composition, and imaging technologies for spatial proteomics. These three emerging technology themes and their underlying ideas are highlighted in Table 2 and taken in turn below, with supporting data provided in Supplemental Table S3 and key underlying papers listed in Supplemental Table S5.

Table 2: Percentile scores of the metrics Popularity, Rate of Rise and Network Connection for the most recent 3-year epoch (2020–2023) of highlighted ideas (and their technology themes in parentheses).

Idea (Theme) | Rate of Rise | Idea Popularity | Network Connection
systems-profiling-biology (Single-Cell Proteomics) | 92% | 94% | 5%
class-molecules-antigen (Single-Cell Proteomics) | 90% | 52% | 9%
high-throughput-detection-technology (Single-Cell, Compositional Proteomics) | 88% | 97% | 26%
upregulated-downregulated-DEPs (Single-Cell, Compositional Proteomics) | 80% | 90% | 55%
PXD-ProteomeXchange-identifier (Compositional Proteomics) | 76% | 67% | 45%
subcellular-localization-nuclear (Spatial Proteomics) | 75% | 54% | 42%
antibodies-antigens-IgG (Spatial Proteomics) | 60% | 19% | 74%
spatial-imaging-profiles (Spatial Proteomics) | 56% | 25% | 29%
quantification-isobaric-iTRAQ (Single-Cell Proteomics) | 18% | 62% | 31%


Spatial Proteomics: emerging ideas

There has been a major increase over the past half-decade in our ability to multiplex protein imaging and track dozens of protein channels to visualize single-cell biology37. Note that while the Popularity of these ideas in the literature corpus was below the median (33% on average), their Rate of Rise metric from 2017–2023 rose sharply, by an average of 40%. The three ideas found in this theme show drops in Popularity in the corpus (Fig. 4c), yet the Rate of Rise shown at the far right of Figs. 4b and 4d for the theme and for individual ideas like subcellular-localization-nuclear (75% Rate of Rise, Table 2) captures the major increases in this area. This reflects the major innovation and recent growth in spatial and single-cell biology in the most recent epoch (2020–2023).
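As a quick arithmetic check, the 33% Popularity figure quoted above is consistent with a simple mean of the three Spatial Proteomics rows in Table 2. The sketch below reproduces that average in Python; the values are copied from Table 2, while the assumption that theme-level scores are plain unweighted means (and the `theme_average` helper) is ours.

```python
# Percentile scores for the three Spatial Proteomics ideas, copied from Table 2.
spatial = {
    "subcellular-localization-nuclear": {"rate_of_rise": 75, "popularity": 54},
    "antibodies-antigens-IgG": {"rate_of_rise": 60, "popularity": 19},
    "spatial-imaging-profiles": {"rate_of_rise": 56, "popularity": 25},
}

def theme_average(ideas, metric):
    """Unweighted mean of one metric across the ideas in a theme."""
    return sum(v[metric] for v in ideas.values()) / len(ideas)

avg_popularity = theme_average(spatial, "popularity")
avg_rate_of_rise = theme_average(spatial, "rate_of_rise")
print(round(avg_popularity), round(avg_rate_of_rise))  # 33 64
```

The same averaging would apply to any other theme, for example the 14 Compositional Proteomics ideas behind the 65% theme score in Fig. 1, given the per-idea values in Supplemental Table S3.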

Compositional Proteomics: emerging ideas

The high-throughput-detection-technology idea maps to the subarea of proteomics focused on characterizing the primary structure, or composition, of proteins and their proteoforms (i.e., with co-occurring modifications and sequence variations)38–39. This idea aligns with the current challenge that there is still a major gap between the number of alternative transcripts asserted by RNA-seq and what can be measured with confidence using “bottom-up” proteomics (e.g., <0.1% of putative novel splice junctions in cancer xenografts)40. This idea has consistently been near the top of the Popularity and Rate of Rise metrics (97% and 88%, respectively, Table 2). Despite the high idea Popularity, it has a relatively low Network Connection (26%) with other ideas. Future advances come from new combinations of ideas. Therefore, this idea is set either to remain weakly connected to other areas or to sharply increase its Network Connection and thus see impactful, broad uptake in the field, as large swaths of the community make new connections and drive transformative innovations in the process39–41.

Single-Cell Proteomics: emerging ideas

For the theme Single-Cell Proteomics and its 8 ideas, we found that four ideas were top ranked in both Popularity and Rate of Rise in the most recent epoch (Table 2, top four rows). A key idea in this theme was upregulated-downregulated-DEPs, with 2020–2023 metrics as follows: Popularity = 90%, Rate of Rise = 80%, Network Connection = 55%. This idea has strong metrics across the board, which reflect a core technology used to probe >1000 proteins from single cells as processed by mass spectrometry-based platforms using isobaric tags42–43. These metrics indicate a strong uptake by the community and high connectedness to many other ideas in the field. Other popular ideas with high Rate of Rise are also in the Compositional Proteomics theme, indicating a forward-looking combination of it with Single-Cell Proteomics; this is an example of synergistic technologies to watch in the future. For example, single-cell analysis with deeper, more complete coverage from improved compositional proteomics44, or chemoproteomics with proteoform detection45, are promising combinations.

Conclusions and Outlook

Surveying publicly available data and converting it into deep insights about where future innovation and breakthroughs will emerge is a major goal of many stakeholders of science. Breakthroughs that turn into stable, popular and connected science are the basis for enabling a next generation of technologies and for answering broader questions definitively. After more than two decades, proteomics has stabilized in multiple incarnations and enabled progress on higher-order problems, yet major knowledge gaps remain because measurement technologies that match the scale of the challenge biology presents are still needed. Future combinations of technologies have been categorized and new metrics identified to gauge past, present and future trends and to predict areas for next-generation impact as we unlock the proteome in composition, space and time45. The most recent epoch shows evidence that new combinations of popular protein-level measurements will enable non-incremental progress in connecting genes to complex traits throughout this decade.

References

  • 1. Battiston F, Musciotto F, Wang D, Barabási A, Szell M & Sinatra R. Taking census of physics. Nat Rev Phys 1, 89–97 (2019).
  • 2. Lever J, Zhao EY, Grewal J, Jones MR & Jones SJM. CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat Methods 16, 505–507 (2019).
  • 3. Kuhn Thomas. The Structure of Scientific Revolutions. University of Chicago Press: Chicago, 1962; pp 52–65.
  • 4. March JG. Exploration and Exploitation in Organizational Learning. Organization Science 2, 71–87 (1991).
  • 5. Ioannidis JPA, Klavans R & Boyack KW. Thousands of scientists publish a paper every five days. Nature 561, 167–169 (2018).
  • 6. Park M, Leahey E & Funk RJ. Papers and patents are becoming less disruptive over time. Nature 613, 138–144 (2023).
  • 7. Nicholson JM & Ioannidis JPA. Conform and be funded. Nature 492, 34–36 (2012).
  • 8. Fang FC & Casadevall A. Research Funding: the Case for a Modified Lottery. mBio 7, e00422–16 (2016).
  • 9. Hills TT & Dukas R. The Evolution of Cognitive Search. In Cognitive Search: Evolution, Algorithms, and the Brain, Todd PM, Hills TT & Robbins TW, Eds.; The MIT Press: Cambridge, 2012; pp 11–24.
  • 10. Ke Q, Ferrara E, Radicchi F & Flammini A. Defining and identifying Sleeping Beauties in science. Proc. Natl. Acad. Sci. U.S.A. 112, 7426–7431 (2015).
  • 11. Weis JW & Jacobson JM. Learning on knowledge graph dynamics provides an early warning of impactful research. Nature Biotechnology 39, 1300–1307 (2021).
  • 12. Belikov AV, Rzhetsky A & Evans J. Detecting signal from science: The structure of research communities and prior knowledge improves prediction of genetic regulatory experiments. Preprint at http://arxiv.org/abs/2008.09985 (2020).
  • 13. Foster JG, Shi F & Evans J. Surprise! Measuring Novelty as Expectation Violation. Preprint at 10.31235/osf.io/2t46f (2021).
  • 14. Lin Y, Evans JA & Wu L. New directions in science emerge from disconnection and discord. Journal of Informetrics 16, 101234 (2022).
  • 15. Mikolov T, Chen K, Corrado G & Dean J. Efficient Estimation of Word Representations in Vector Space. Preprint at http://arxiv.org/abs/1301.3781 (2013).
  • 16. Mikolov T, Sutskever I, Chen K, Corrado GS & Dean J. Distributed Representations of Words and Phrases and their Compositionality.
  • 17. Pennington J, Socher R & Manning C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014). doi: 10.3115/v1/D14-1162.
  • 18. Simon HA. Designing Organizations for an Information-Rich World. Computers, communications, and the public interest 37, 40–41 (1971).
  • 19. Byrum J. Artificial Intelligence: Machines, man and intelligence. Analytics vol. 12 (2018). doi: 10.1287/LYTX.2018.02.03.
  • 20. Sourati J & Evans J. Accelerating science with human versus alien artificial intelligences. Preprint at http://arxiv.org/abs/2104.05188 (2021).
  • 21. Parhami B. Parallel Processing with Big Data. In Encyclopedia of Big Data Technologies, Sakr S & Zomaya A, Eds.; Springer International Publishing: New York, 2018; pp 1–7.
  • 22. Itti L & Baldi P. Bayesian surprise attracts human attention. Vision Research 49, 1295–1306 (2009).
  • 23. Perry C & DeDeo S. The Cognitive Science of Extremist Ideologies Online. Preprint at 10.48550/arXiv.2110.00626 (2021).
  • 24. Shannon CE. A Mathematical Theory of Communication. The Bell System Technical Journal 27, 379–423 (1948).
  • 25. DeDeo S. Major Transitions in Political Order. In From Matter to Life: Information and Causality 393–428 (Cambridge University Press, 2017).
  • 26. Collins FS, Green ED, Guttmacher AE & Guyer MS. A vision for the future of genomics research. Nature 422, 835–847 (2003).
  • 27. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P & Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
  • 28. McCallum A. MALLET: A Machine Learning for Language Toolkit. https://people.cs.umass.edu/~mccallum/mallet/ (2002).
  • 29. Blei DM, Ng AY & Jordan MI. Latent Dirichlet Allocation. J. Machine Learning Research 3, 993–1022 (2003).
  • 30. Murdock J, Allen C & DeDeo S. Exploration and exploitation of Victorian science in Darwin’s reading notebooks. Cognition 159, 117–126 (2017).
  • 31. Denny MJ & Spirling A. Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis 26, 168–189 (2018).
  • 32. Clauset A, Shalizi CR & Newman MEJ. Power-Law Distributions in Empirical Data. SIAM Review 51, 661–703 (2009).
  • 33. Zhang J & Zhang K. Likelihood and Consilience: On Forster’s Counterexamples to the Likelihood Theory of Evidence. Philosophy of Science 82, 930–940 (2015).
  • 34. Blondel VD, Guillaume J-L, Lambiotte R & Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
  • 35. Bondeson DP, Mares A, Smith IED, Ko E, Campos S, Miah AH, Mulholland KE, Routly N, Buckley DL, Gustafson JL, Zinn N, Grandi P, Shimamura S, Bergamini G, Faelth-Savitski M, Bantscheff M, Cox C, Gordon DA, Willard RR, Flanagan JJ, Casillas LN, Votta BJ, den Besten W, Famm K, Kruidenier L, Carter PS, Harling JD, Churcher I & Crews CM. Catalytic in vivo protein knockdown by small-molecule PROTACs. Nature Chemical Biology 11, 611–617 (2015).
  • 36. Kuljanin M, Mitchell DC, Schweppe DK, Gikandi AS, Nusinow DP, Bulloch NJ, Vinogradova EV, Wilson DL, Kool ET, Mancias JD, Cravatt BF & Gygi SP. Reimagining high-throughput profiling of reactive cysteines for cell-based screening of large electrophile libraries. Nature Biotechnology 39, 630–641 (2021).
  • 37. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
  • 38. Smith LM, Kelleher NL & Consortium for Top Down Proteomics. Proteoform: a single term describing protein complexity. Nat Methods 10, 186–187 (2013).
  • 39. Smith LM & Kelleher NL. Proteoforms as the next proteomics currency. Science 359, 1106–1107 (2018).
  • 40. Ruggles KV, Tang Z, Wang X, Grover H, Askenazi M, Teubl J, Cao S, McLellan MD, Clauser KR, Tabb DL, Mertins P, Slebos R, Erdmann-Gilmore P, Li S, Gunawardena HP, Xie L, Liu T, Zhou JY, Sun S, Hoadley KA, Perou CM, Chen X, Davies SR, Maher CA, Kinsinger CR, Rodland KD, Zhang H, Zhang Z, Ding L, Townsend RR, Rodriguez H, Chan D, Smith RD, Liebler DC, Carr SA, Payne S, Ellis MJ & Fenyő D. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol Cell Proteomics 15, 1060–1071 (2016).
  • 41. Smith LM, Agar JN, Chamot-Rooke J, Danis PO, Ge Y, Loo JA, Paša-Tolić L, Tsybin YO & Kelleher NL, The Consortium for Top-Down Proteomics. The Human Proteoform Project: Defining the human proteome. Science Advances 7, eabk0734 (2021).
  • 42. Slavov N. Single-cell protein analysis by mass spectrometry. Curr Opin Chem Biol 60, 1–9 (2021).
  • 43. Marx V. Proteomics sets up single-cell and single-molecule solutions. Nat Methods 20, 350–354 (2023).
  • 44. Burnum-Johnson KE, Conrads TP, Drake RR, Herr AE, Iyengar R, Kelly RT, Lundberg E, MacCoss MJ, Naba A, Nolan GP, Pevzner PA, Rodland KD, Sechi S, Slavov N, Spraggins JM, Van Eyk JE, Vidal M, Vogel C, Walt DR & Kelleher NL. New Views of Old Proteins: Clarifying the Enigmatic Proteome. Molecular & Cellular Proteomics 21, 1–12 (2022).
  • 45. Kemper EK, Zhang Y, Dix MM & Cravatt BF. Global profiling of phosphorylation-dependent changes in cysteine reactivity. Nat Methods 19, 341–352 (2022).
  • 46. Bamman D, Underwood T & Smith NA. A Bayesian Mixed Effects Model of Literary Character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 370–379 (Association for Computational Linguistics, 2014). doi: 10.3115/v1/P14-1035.
  • 47. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019).