Technological advances have significantly reduced the cost of panel assays and sequencing, yet the classification of genetic variants remains a challenge. Published case reports of patients with germline mutations, together with panel and sequencing studies of patient cohorts, serve as significant evidence under several ACMG criteria (PP4, PM6, PS2, BP5, BS4, and PP1). Individual case reports and case series can also strengthen or refute gene-disease associations. The sheer volume of published literature (more than 30 million articles) makes finding relevant case reports a nontrivial task. Typically, expert professionals collect and curate case reports to ensure accuracy. However, such efforts are expensive and limited in scale.
Nucleati is pioneering technology to fully automate the collection and curation of relevant literature for biomedical needs. One of the early outcomes is the Nucleati Germline Cancer Evidence Base. The knowledge base provides access, through an intuitive UI, to AI-curated evidence categorized as case reports, case series, and GWAS. For each identified case report, Nucleati AI attempts to extract and normalize genes, disease, patient age, sex, ethnicity, variations, variant location, molecular consequences, zygosity, pathogenicity, interventions, and diagnostic process. For case series and GWAS data, Nucleati AI additionally extracts the number of cases and controls as well as the ethnic background of the cohort. Data curated by Nucleati AI is freely available to the community.
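To make the list of extracted attributes concrete, the following is a minimal sketch of how one curated case report could be represented. The class name, field names, and types are assumptions for illustration only and do not describe Nucleati's actual schema. The example values come from one row of Table 3; every attribute not stated there is left empty, which mirrors the fact that the extraction is an attempt and may not succeed for every field.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CaseReport:
    """Hypothetical record layout; field names are illustrative, not Nucleati's schema."""
    pubmed_id: str
    genes: List[str]
    variations: List[str] = field(default_factory=list)
    disease: Optional[str] = None
    age_years: Optional[float] = None
    sex: Optional[str] = None
    ethnicity: Optional[str] = None
    variant_location: Optional[str] = None
    molecular_consequence: Optional[str] = None
    zygosity: Optional[str] = None
    pathogenicity: Optional[str] = None
    interventions: List[str] = field(default_factory=list)
    diagnostic_process: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of attributes that were left empty."""
        return [name for name, value in asdict(self).items()
                if value is None or value == []]


# PubMed ID and variant are taken from Table 3; other attributes are unknown here.
report = CaseReport(
    pubmed_id="23552620",
    genes=["BAP1"],
    variations=["NM_004656.4(BAP1):c.214del (p.Ile72LeufsTer6)"],
    molecular_consequence="frameshift",  # implied by the 'fs' in the protein change
)

print(report.missing_fields())
```

Keeping unextracted attributes as explicit empty values, rather than omitting them, makes it straightforward to report how complete each curated record is.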
Here we provide primary and secondary insights and statistics to justify using the Nucleati Germline Cancer Evidence Base as a complementary resource for identifying clinical evidence in the literature and applying it to patient care.
Leading providers offer hereditary cancer screening panels. Each panel includes multiple genes chosen at the provider's discretion, primarily on the basis of existing evidence. Table 1 summarizes the cancer-predisposing genes offered by nine cancer panel screening providers. While classic cancer-predisposing genes such as MLH1, MSH2, TP53, and POLE are present in all panels, others such as ATR, FANCA, and DDX41 appear in only one. Table 1 also summarizes the number of individual case reports and case series available in the Nucleati Germline Cancer Evidence Base, which has evidence for 95-97% of the genes found in one or more hereditary cancer panels. Columns 2 and 3 of Table 1 provide direct links to AI-curated case reports and case series for a given gene.
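The panel comparison and the coverage figure can be sketched as simple set arithmetic. The snippet below is a toy illustration under stated assumptions: the provider names, the panel memberships, and the evidence set are invented placeholders, not the data behind Table 1, and `HYPOTHETICAL_GENE` stands in for a gene without curated evidence.

```python
from collections import Counter
from typing import Dict, Set

# Toy input: hypothetical providers and memberships, NOT the real Table 1 data.
panels: Dict[str, Set[str]] = {
    "provider_A": {"MLH1", "MSH2", "TP53", "POLE", "ATR"},
    "provider_B": {"MLH1", "MSH2", "TP53", "POLE"},
    "provider_C": {"MLH1", "MSH2", "TP53", "POLE", "HYPOTHETICAL_GENE"},
}
# Toy set of genes assumed to have at least one curated case report or case series.
genes_with_evidence: Set[str] = {"MLH1", "MSH2", "TP53", "POLE", "ATR"}


def panel_counts(panels: Dict[str, Set[str]]) -> Counter:
    """Count in how many panels each gene appears."""
    return Counter(gene for genes in panels.values() for gene in genes)


def coverage(panels: Dict[str, Set[str]], evidence: Set[str]) -> float:
    """Fraction of genes found in one or more panels that have evidence."""
    all_genes = set().union(*panels.values())
    return len(all_genes & evidence) / len(all_genes)


counts = panel_counts(panels)
in_all_panels = sorted(g for g, n in counts.items() if n == len(panels))
in_one_panel = sorted(g for g, n in counts.items() if n == 1)

print(in_all_panels)
print(in_one_panel)
print(f"{coverage(panels, genes_with_evidence):.0%}")
```

A range such as the one quoted above could arise if the coverage is computed under more than one definition of the gene list, but that is an interpretation rather than something the text states.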
Evidence collected through Nucleati AI extends beyond established cancer and related disorder-predisposing genes. Table 2 summarizes the genes that are not present in the selected panels from nine providers, together with their numbers of case reports and case series. Although the genes at the top (e.g., XRCC1, BRAF, KRAS, ERCC2, MDM2, CTNNB1, PIK3CA) are primarily mutated in tumors (profiled along with germline mutations), the articles also include germline mutations in these genes. A few genes with emerging cancer-associated phenotypes are noteworthy: CDKN2B/ETV6 (leukemia), XPC (melanoma, xeroderma, skin neoplasm), and GATA3 (breast carcinoma). Nucleati AI also identifies and curates evidence for genes associated with conditions such as dysplasia, meningioma, and hemochromatosis (e.g., GNAS, TP63, ARMC5, COL2A1, RUNX2, HFE). Exome sequencing has recently become affordable and amenable to routine clinical profiling, so more evidence associating new genes with cancer will likely be published in the coming decade. Resources like the Nucleati Germline Cancer Evidence Base will be essential in identifying and cataloging emerging associations.
There is no direct way to compare AI-curated and human-curated data exhaustively. As an indirect measure, we compare the articles collected by Nucleati AI with those cited in publicly accessible variant classification repositories, ClinVar and ClinGen. Chart 1 summarizes the overlapping and exclusive articles in ClinVar and the Nucleati Germline Cancer Evidence Base for genes present in all nine panels used in the analysis. While there is significant overlap between the two resources, each also contains articles that are absent from the other.
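The overlap analysis amounts to comparing two sets of PubMed IDs for a gene. The following is a minimal sketch of that comparison, not the procedure actually used to produce Chart 1. The function name is an assumption; the four numeric IDs are taken from the BAP1 exclusives listed in Table 3, and `PMID_SHARED_1` is a placeholder for an article cited by both resources.

```python
from typing import Dict, Set


def compare_sources(clinvar: Set[str], nucleati: Set[str]) -> Dict[str, Set[str]]:
    """Split two sets of PubMed IDs into shared and source-exclusive articles."""
    return {
        "shared": clinvar & nucleati,
        "clinvar_only": clinvar - nucleati,
        "nucleati_only": nucleati - clinvar,
    }


# Numeric IDs come from Table 3; "PMID_SHARED_1" is a made-up placeholder.
clinvar_pmids = {"18757409", "23684012", "PMID_SHARED_1"}
nucleati_pmids = {"22889334", "23552620", "PMID_SHARED_1"}

result = compare_sources(clinvar_pmids, nucleati_pmids)
summary = {name: len(ids) for name, ids in result.items()}
print(summary)
```

Running the same comparison per gene gives the three counts needed for each bar or Venn segment in a chart of overlapping and exclusive articles.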
Extending the analysis to the BAP1 gene, for which 56 articles are present only in ClinVar and 95 only in the Nucleati Germline Cancer Evidence Base, we manually collected the variations reported in these exclusive articles. The results are summarized in Table 3 and further support the use of the articles identified by Nucleati AI as evidence.
ClinVar exclusive: PubMed ID | ClinVar exclusive: mutation(s) | Nucleati Germline Cancer Evidence Base exclusive: PubMed ID | Nucleati Germline Cancer Evidence Base exclusive: mutation(s) |
---|---|---|---|
18757409 | NM_004656.4(BAP1):c.2050C>T (p.Gln684Ter) NM_004656.4(BAP1):c.2017G>T (p.Glu673Ter) NM_004656.4(BAP1):c.1986del (p.Ile662fs) | 22889334 | NM_004656.3(BAP1):c.1708C>G (p.Leu570Val) |
23684012 | NM_004656.4(BAP1):c.2050C>T (p.Gln684Ter) | 23552620 | NM_004656.4(BAP1):c.214del (p.Ile72LeufsTer6) |
24728327 | NM_004656.4(BAP1):c.1786A>G (p.Ser596Gly) NM_004656.4(BAP1):c.1735G>A (p.Gly579Arg) NM_004656.4(BAP1):c.1408G>A (p.Gly470Arg) NM_004656.4(BAP1):c.1325C>G (p.Pro442Arg) NM_004656.4(BAP1):c.905C>T (p.Pro302Leu) NM_004656.4(BAP1):c.121G>A (p.Gly41Ser) | 23585512 | NM_004656.4(BAP1):c.758dup (p.Thr254AspfsTer30) |
26467025 | NM_004656.4(BAP1):c.2057-4G>T NM_004656.4(BAP1):c.1838C>T (p.Thr613Met) NM_004656.4(BAP1):c.1786A>G (p.Ser596Gly) NM_004656.4(BAP1):c.1730-1G>A NM_004656.4(BAP1):c.1729+8T>C NM_004656.4(BAP1):c.1413T>G (p.Ala471=) NM_004656.4(BAP1):c.1002A>G (p.Leu334=) NM_004656.4(BAP1):c.783G>A (p.Gln261=) NM_004656.4(BAP1):c.294C>T (p.Ser98=) NM_004656.4(BAP1):c.121G>A (p.Gly41Ser) | 25830670 | NM_004656.4(BAP1):c.2054A>T (p.Glu685Val) |
26689913 | NM_004656.4(BAP1):c.1735G>A (p.Gly579Arg) NM_004656.4(BAP1):c.1337A>G (p.Asn446Ser) | 26140217 | NM_004656.4(BAP1):c.518A>G (p.Tyr173Cys) |
25929848 | NM_004656.4(BAP1):c.1735G>A (p.Gly579Arg) | 25342144 | NM_004656.4(BAP1):c.134G>A (p.Gly45Glu) |
28380455 | NM_004656.4(BAP1):c.1735G>A (p.Gly579Arg) | 27751355 | NM_004656.4(BAP1):c.329_335delinsTC (p.Pro110LeufsTer14) |
26166446 | NM_004656.4(BAP1):c.188C>G (p.Ser63Cys) NM_004656.4(BAP1):c.121G>A (p.Gly41Ser) | 29891518 | NM_004656.4(BAP1):c.371C>T (p.Pro124Leu) |
22683710 | NM_004656.4(BAP1):c.121G>A (p.Gly41Ser) | 31706282 | NM_004656.4(BAP1):c.1265del (p.Gly422GlufsTer8) |
26845104 | NM_004656.4(BAP1):c.1063C>T (p.Gln355Ter) | 30578689 | NM_004656.4(BAP1):c.2001del (p.Thr668ProfsTer24) |
24166983 | NM_004656.4(BAP1):c.1946G>A (p.Cys649Tyr) | 29554022 | BAP1:p.D567X |
32068069 | NM_004656.4(BAP1):c.1147C>T (p.Arg383Cys) | 33093002 | NM_004656.4(BAP1):c.255_255+6del |
29641532 | NM_004656.4(BAP1):c.1147C>T (p.Arg383Cys) | 33330039 | NC_000003.12:g.52406903_52406924del |
30480620 | NM_004656.4(BAP1):c.878C>T (p.Pro293Leu) | 32583627 | NM_004656.4(BAP1):c.1565_1566del (p.Pro522ArgfsTer14) |
28687356 | NM_004656.4(BAP1):c.944A>C (p.Glu315Ala) | 34504799 | NM_004656.4(BAP1):c.1777C>T (p.Gln593Ter) |
28170043 | NM_004656.4(BAP1):c.519T>G (p.Tyr173Ter) | 34725624 | NM_004656.4(BAP1):c.2050C>T (p.Gln684Ter) |
29684080 | NM_004656.4(BAP1):c.1943C>T (p.Ala648Val) NM_004656.4(BAP1):c.1810G>A (p.Val604Met) NM_004656.4(BAP1):c.1249A>G (p.Arg417Gly) NM_004656.4(BAP1):c.1201_1212del (p.Tyr401_Asp404del) | 35381901 | NM_004656.4(BAP1):c.898_899del (p.Arg300GlyfsTer6) |
16199547 | NM_004656.4(BAP1):c.1891-1G>A NM_004656.4(BAP1):c.1730-2A>G NM_004656.4(BAP1):c.1251-2A>G NM_004656.4(BAP1):c.783+2T>C NM_004656.4(BAP1):c.783+1G>A NM_004656.4(BAP1):c.581-1G>T NM_004656.4(BAP1):c.581-2A>G NM_004656.4(BAP1):c.437+1G>T NM_004656.4(BAP1):c.376-2A>G NM_004656.4(BAP1):c.375+2T>A NM_004656.4(BAP1):c.122+1G>T NM_004656.4(BAP1):c.122+1G>A NM_004656.4(BAP1):c.67+1del NM_004656.4(BAP1):c.38-1G>A | 35992853 | NM_004656.4(BAP1):c.535C>T (p.Arg179Trp) |
28034829 | NM_004656.4(BAP1):c.1441C>A (p.His481Asn) NM_004656.4(BAP1):c.1066C>T (p.Arg356Trp) | 35360426 | NM_004656.4(BAP1):c.604T>C (p.Trp202Arg) |
28900502 | NM_004656.4(BAP1):c.1441C>A (p.His481Asn) NM_004656.4(BAP1):c.1066C>T (p.Arg356Trp) | 35483881 | NM_004656.4(BAP1):c.535C>T (p.Arg179Trp) |
17576681 | NM_004656.4(BAP1):c.2057-3C>T NM_004656.4(BAP1):c.2056+5G>C NM_004656.4(BAP1):c.2056+3G>A NM_004656.4(BAP1):c.2056+1G>C NM_004656.4(BAP1):c.1983+6T>C NM_004656.4(BAP1):c.1983+4G>A NM_004656.4(BAP1):c.1890+6C>G NM_004656.4(BAP1):c.1890+5G>A NM_004656.4(BAP1):c.1729+6C>T NM_004656.4(BAP1):c.1729+3G>A NM_004656.4(BAP1):c.1729G>A (p.Glu577Lys) NM_004656.4(BAP1):c.1250+4A>G NM_004656.4(BAP1):c.931+6G>A NM_004656.4(BAP1):c.931+6G>T NM_004656.4(BAP1):c.931+4C>T NM_004656.4(BAP1):c.931+3A>C NM_004656.4(BAP1):c.931+3A>T NM_004656.4(BAP1):c.659+6G>A NM_004656.4(BAP1):c.659+5G>A NM_004656.4(BAP1):c.437G>A (p.Arg146Lys) NM_004656.4(BAP1):c.375+5G>C NM_004656.4(BAP1):c.375+4G>A NM_004656.4(BAP1):c.256-3C>T NM_004656.4(BAP1):c.256-3C>A NM_004656.4(BAP1):c.255+3C>G NM_004656.4(BAP1):c.255G>C (p.Gln85His) NM_004656.4(BAP1):c.123-3C>T NM_004656.4(BAP1):c.122+6T>A NM_004656.4(BAP1):c.122+5G>C NM_004656.4(BAP1):c.122G>T (p.Gly41Val) NM_004656.4(BAP1):c.67+6_67+7del NM_004656.4(BAP1):c.67+5G>C NM_004656.4(BAP1):c.38-3del NM_004656.4(BAP1):c.37+5G>A NM_007294.4(BRCA1):c.134+3A>C | 35814862 | BAP1:c.458_459delCT |
9536098 | NM_004656.4(BAP1):c.2057-3C>T NM_004656.4(BAP1):c.2056+5G>C NM_004656.4(BAP1):c.2056+3G>A NM_004656.4(BAP1):c.2056+1G>C NM_004656.4(BAP1):c.1983+6T>C NM_004656.4(BAP1):c.1983+4G>A NM_004656.4(BAP1):c.1890+6C>G NM_004656.4(BAP1):c.1890+5G>A NM_004656.4(BAP1):c.1729+6C>T NM_004656.4(BAP1):c.1729+3G>A NM_004656.4(BAP1):c.1729G>A (p.Glu577Lys) NM_004656.4(BAP1):c.1250+4A>G NM_004656.4(BAP1):c.931+6G>A NM_004656.4(BAP1):c.931+6G>T NM_004656.4(BAP1):c.931+4C>T NM_004656.4(BAP1):c.931+3A>C NM_004656.4(BAP1):c.931+3A>T NM_004656.4(BAP1):c.659+6G>A NM_004656.4(BAP1):c.659+5G>A NM_004656.4(BAP1):c.437G>A (p.Arg146Lys) NM_004656.4(BAP1):c.375+5G>C NM_004656.4(BAP1):c.375+4G>A NM_004656.4(BAP1):c.256-3C>T NM_004656.4(BAP1):c.256-3C>A NM_004656.4(BAP1):c.255+3C>G NM_004656.4(BAP1):c.255G>C (p.Gln85His) NM_004656.4(BAP1):c.123-3C>T NM_004656.4(BAP1):c.122+6T>A NM_004656.4(BAP1):c.122+5G>C NM_004656.4(BAP1):c.122G>T (p.Gly41Val) NM_004656.4(BAP1):c.67+6_67+7del NM_004656.4(BAP1):c.67+5G>C NM_004656.4(BAP1):c.38-3del NM_004656.4(BAP1):c.37+5G>A NM_007294.4(BRCA1):c.134+3A>C | 35114507 | BAP1:c.1780_1781insT, p.(G549Vfs*49) |
30258054 | NM_004656.4(BAP1):c.1339G>A | 19197335 | NM_004656.4(BAP1):c.294C>T (p.Ser98=) NM_004656.4(BAP1):c.1002A>G (p.Leu334=) NM_004656.4(BAP1):c.1026C>T (p.Ser342=) |
26554828 | NM_004656.4(BAP1):c.606G>T (p.Trp202Cys) NM_004656.4(BAP1):c.604T>C (p.Trp202Arg) | 21956388 | |
29610392 | NM_004656.4(BAP1):c.374A>C (p.Glu125Ala) NM_004656.4(BAP1):c.188C>G (p.Ser63Cys) | 24916674 | NM_004656.4(BAP1):c.605G>A (p.Trp202Ter) |
21642991 | NM_004656.4(BAP1):c.188C>G | 25468148 | NM_004656.4(BAP1):c.1026C>T (p.Ser342=) |
24894717 | NM_004656.4(BAP1):c.188C>G (p.Ser63Cys) | 27494029 | NM_004656.4(BAP1):c.1550C>T (p.Thr517Met) |
26452128 | NM_004656.4(BAP1):c.188C>G (p.Ser63Cys) | 29298805 | NM_004656.4(BAP1):c.233A>G (p.Asn78Ser) NM_004656.4(BAP1):c.1147C>T (p.Arg383Cys) NM_004656.4(BAP1):c.1748C>T (p.Ser583Leu) NM_004656.4(BAP1):c.1695dup (p.Glu566fs*1) NM_004656.4(BAP1):c.1717delC (p.Leu573fs*3) NM_004656.4(BAP1):c.1882_1885delTCAC (p.Ser628fs*8) NM_004656.4(BAP1):c.1717delC (p.Leu573fs*3) NM_004656.4(BAP1):c.1729+1G>A (p.?) NM_004656.4(BAP1):c.1891-1G>A (p.?) |
30039884 | NM_004656.4(BAP1):c.1550C>T (p.Thr517Met) | 29504908 | NM_004656.4(BAP1):c.1135G>A (p.Ala379Thr) |
29478780 | NM_004656.4(BAP1):c.959dup (p.Cys320fs) | 29769598 | NM_004656.4(BAP1):c.79del (p.Val27CysfsTer?) NM_004656.4(BAP1):c.2T>A (p.M1K) NM_004656.4(BAP1):c.505dup (p.His169ProfsTer14) |
27749792 | NM_004656.4(BAP1):c.122+1G>A | 30376426 | NM_004656.4(BAP1):c.1938T>A (p.Tyr646Ter) NM_004656.4(BAP1):c.1882_1885del (p.Ser628ProfsTer8) NM_004656.4(BAP1):c.438-2A>G NM_004656.4(BAP1):c.1717del (p.Leu573TrpfsTer3) NM_004656.4(BAP1):c.659+1G>C NM_004656.4(BAP1):c.604T>C (p.Trp202Arg) NM_004656.4(BAP1):c.1153C>T (p.Arg385Ter) NM_004656.4(BAP1):c.1729G>T (p.Glu577Ter) NM_004656.4(BAP1):c.2050C>T (p.Gln684Ter) NM_004656.4(BAP1):c.122+1G>A NM_004656.4(BAP1):c.1203dup (p.Glu402Ter) NM_004656.4(BAP1):c.133G>A (p.Gly45Arg) |
23032617 | NM_004656.4(BAP1):c.79del (p.Val27fs) | 32012241 | NM_004656.4(BAP1):c.784-1G>A |
29978187 | NM_004656.4(BAP1):c.437G>A (p.Arg146Lys) | 33748184 | BAP1:p.C39fs |
28724667 | NM_004656.4(BAP1):c.132T>G (p.Tyr44Ter) NM_000059.4(BRCA2):c.3860del (p.Asn1287fs) | 34767027 | BAP1:p.K453Rfs* BAP1:p.L573fs* |
29351919 | NM_004656.4(BAP1):c.2116A>G (p.Ile706Val) | 33600035 | NM_004656.4(BAP1):c.783+2T>C |
35920959 | NM_004656.4(BAP1):c.1984-2A>C NC_000003.12:g.52407461_52407462del NM_004656.4(BAP1):c.2057-4G>T | ||
35885614 | NM_004656.4(BAP1):c.46dup (p.Thr16AsnfsTer?) NM_004656.4(BAP1):c.1153C>T (p.Arg385Ter) NM_004656.4(BAP1):c.605G>A (p.Trp202Ter) | ||
35777164 | NM_004656.4(BAP1):c.1337del (p.Asn446ThrfsTer?) NM_004656.4(BAP1):c.326_327dup (p.Pro110AspfsTer4) NM_004656.4(BAP1):c.605G>A (p.Trp202Ter) NM_004656.4(BAP1):c.677del (p.Ile226ThrfsTer5) NM_004656.4(BAP1):c.799_800del (p.Gln267AlafsTer16) | ||
35032816 | NM_004656.4(BAP1):c.38-1G>T | ||
Unlike raw article text, Nucleati-curated evidence provides extracted and normalized attributes such as age, sex, and ethnicity. Aggregated across many case reports, these attributes yield secondary insights such as the age of onset or the distribution of cases by sex and ethnic background. Chart 2 summarizes the age and sex distribution of case reports. The age distribution supports the early onset of disease associated with DICER1 mutations. Similarly, the sex distribution shows that BRCA1-, CHEK2-, and PALB2-driven cancers are reported predominantly in females.
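Once age and sex are normalized, per-gene summaries of this kind reduce to a simple aggregation. The sketch below illustrates the idea under stated assumptions: the record layout, the function name, and every value are invented for illustration (the gene names are placeholders), and none of it reflects the data behind Chart 2.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional

# Toy case reports with placeholder genes; None marks an attribute that was not extracted.
records: List[Dict[str, Optional[object]]] = [
    {"gene": "GENE_A", "age": 3, "sex": "F"},
    {"gene": "GENE_A", "age": 7, "sex": "M"},
    {"gene": "GENE_A", "age": 5, "sex": None},
    {"gene": "GENE_B", "age": 41, "sex": "F"},
    {"gene": "GENE_B", "age": None, "sex": "F"},
    {"gene": "GENE_B", "age": 52, "sex": "F"},
]


def summarize(records):
    """Per gene: median age and female fraction, ignoring missing values."""
    ages = defaultdict(list)
    sexes = defaultdict(list)
    for rec in records:
        if rec["age"] is not None:
            ages[rec["gene"]].append(rec["age"])
        if rec["sex"] is not None:
            sexes[rec["gene"]].append(rec["sex"])
    summary = {}
    for gene in sorted(set(ages) | set(sexes)):
        gene_ages = ages.get(gene, [])
        gene_sexes = sexes.get(gene, [])
        summary[gene] = {
            "n_with_age": len(gene_ages),
            "median_age": median(gene_ages) if gene_ages else None,
            "female_fraction": (gene_sexes.count("F") / len(gene_sexes)
                                if gene_sexes else None),
        }
    return summary


stats = summarize(records)
for gene, row in stats.items():
    print(gene, row)
```

Missing values are skipped rather than imputed, so each statistic is based only on the case reports in which that attribute was actually reported.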
The Nucleati Germline Cancer Evidence Base is a product of a first-of-its-kind, fully automated, medical-grade evidence curation pipeline. The pipeline consists of automated literature collection, curation, data normalization, and ontology mapping. The data presented above indicate that the evidence in the Nucleati Germline Cancer Evidence Base aids variant classification and gene-validity assessment for well-established as well as emerging cancer-predisposing genes. Additionally, the data curated by the in-house AI-driven pipeline provide secondary insights that cannot be derived from other comparable resources. Nucleati is developing a scaled-up product to offer evidence for any genetically predisposed disorder as curated case reports, case series, GWAS, and experimental studies.