Integrated Bioinformatics Resource for Rare Diseases

FAQ

RARe-SOURCE® provides details on rare diseases with genetic etiology and their associated genes. Information on the diseases and their associated genes is obtained from the Genetic and Rare Diseases (GARD) database; the current version does not include all rare diseases with genetic etiology. For the literature AI and variant annotations, additional genes and known associations were obtained from OrphaNet and Ehrhart et al., Sci Data, 2021.

RARe-SOURCE® implements artificial intelligence (AI) algorithms to identify rare disease and associated gene mentions in the titles and/or abstracts of published literature. The algorithms do not yet identify all possible mentions and might miss some valid results. We are actively validating and improving the algorithms, and the search results are expected to improve over time.

The results obtained from the AI algorithms are combined with information on disease and gene aliases to retrieve articles in which the disease or gene is mentioned under a different naming convention. Although this allows RARe-SOURCE® to find more articles related to the rare disease or gene of interest, it also increases the probability of returning results that are not relevant. In the current version, RARe-SOURCE® favors returning more articles over missing any that might be relevant.
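The trade-off described above can be illustrated with a minimal Python sketch. This is not the RARe-SOURCE® implementation: the function names, the alias table, the article records, and the plain substring matching are all assumptions chosen for illustration. It shows how adding aliases finds more articles while also admitting an irrelevant one.

```python
# Illustrative sketch only; names, data, and matching rule are hypothetical.

def expand_with_aliases(term, alias_table):
    """Return the term plus its known aliases, lowercased and de-duplicated."""
    names = [term] + alias_table.get(term, [])
    unique = []
    for name in names:
        key = name.lower()
        if key not in unique:
            unique.append(key)
    return unique


def find_articles(term, alias_table, articles):
    """Return ids of articles whose title or abstract contains the term or an alias."""
    names = expand_with_aliases(term, alias_table)
    hits = []
    for article in articles:
        text = (article["title"] + " " + article["abstract"]).lower()
        if any(name in text for name in names):
            hits.append(article["id"])
    return hits


# Hypothetical alias table and article records.
aliases = {"SLC6A8": ["CRTR", "CT1"]}
articles = [
    {"id": "a1", "title": "SLC6A8 variants in creatine transporter deficiency", "abstract": ""},
    {"id": "a2", "title": "CRTR expression in brain", "abstract": ""},
    {"id": "a3", "title": "OCT1 and drug uptake", "abstract": "Organic cation transport."},
    {"id": "a4", "title": "Unrelated study", "abstract": "Nothing here."},
]

without_aliases = find_articles("SLC6A8", {}, articles)
with_aliases = find_articles("SLC6A8", aliases, articles)
```

In this toy data, the alias search picks up article a2, which the gene symbol alone would miss, but also article a3, because the short alias "CT1" occurs inside "OCT1". That is the kind of irrelevant hit that alias expansion can introduce.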

RARe-SOURCE® automatically prioritizes results within each publication year, so that the most relevant results appear at the top. The title of each result is prominently displayed and each article is linked to PubMed, so any result can be quickly reviewed and verified.
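As a rough illustration of ranking within each publication year, the sketch below groups results by year and orders each group by a relevance score. The `score` field, the newest-year-first ordering, and the sample records are assumptions; the source does not describe how relevance is computed.

```python
from itertools import groupby

# Illustrative sketch only; the "score" field and the records are hypothetical.

def prioritize(results):
    """Group results by year (newest first), highest score first within a year."""
    ordered = sorted(results, key=lambda r: (-r["year"], -r["score"]))
    return {
        year: [r["title"] for r in group]
        for year, group in groupby(ordered, key=lambda r: r["year"])
    }


results = [
    {"title": "Paper A", "year": 2022, "score": 0.4},
    {"title": "Paper B", "year": 2023, "score": 0.7},
    {"title": "Paper C", "year": 2022, "score": 0.9},
    {"title": "Paper D", "year": 2023, "score": 0.2},
]

ranked = prioritize(results)
```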

We strive to comprehensively cover all concepts from the literature. However, text variations in disease names can make accurate identification difficult. Our approach uses name recognition and relies on specific data sources, including the Genetic and Rare Diseases (GARD) resource, for identifying rare diseases. Despite these efforts, we may not identify every variation of a disease name. To address this limitation, we respond to requests for the inclusion of specific diseases not identified by our methods. We have integrated additional diseases and gene associations from OrphaNet and Ehrhart et al., Sci Data, 2021, but we realize they do not cover all rare diseases of interest. Feel free to contact us if you do not see your disease or gene in our resource, and we will make every effort to include it manually. Our goal is to enhance accuracy and inclusivity in biomedical information.

RARe-SOURCE® combines variant data integration with manual curation to provide genomic variant details. Variants for all genes in RARe-SOURCE® are obtained by integrating public variant databases and annotating them using OpenCravat. Manual curation of published literature is performed for specific diseases and associated genes. Manual curation has been completed, and results are available, for SLC6A8 (X-linked Creatine Transporter Deficiency). We are currently manually curating ASAH1 (Farber's Disease) variants.

Variants from several public databases, including ClinVar, gnomAD, COSMIC, ICGC, GDC, and ALFA are included. Details with version numbers for each of the data sources will be added soon.

Variants can be filtered by making selections in the interactive charts displayed above the tables and then clicking on ‘Refresh Data’.

The ‘Variant Type’ chart uses data from the ‘Consequences’ column and is based on Sequence Ontology (SO) terms, which provide standardized descriptions of variant effects (e.g., missense_variant, frameshift_variant, splice_donor_variant). If a variant falls under multiple categories, the OpenCravat annotator uses the most severe consequence based on the SO-defined hierarchy. For more information about Sequence Ontology, please refer to the SO website (http://www.sequenceontology.org/).
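Selecting the most severe consequence can be sketched as a lookup against a ranked list of SO terms. The ordering below is a short illustrative ranking, not the hierarchy that OpenCravat actually applies, and the function name is hypothetical.

```python
# Illustrative sketch only. SEVERITY_ORDER is an assumed ranking (most severe
# first) and is NOT taken from OpenCravat.
SEVERITY_ORDER = [
    "splice_donor_variant",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
]
RANK = {term: index for index, term in enumerate(SEVERITY_ORDER)}


def most_severe(consequences):
    """Return the highest-ranked known SO term, or None if none is recognized."""
    known = [c for c in consequences if c in RANK]
    if not known:
        return None
    return min(known, key=RANK.__getitem__)


example = most_severe(["missense_variant", "frameshift_variant"])
```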

The ‘Allele Frequency’ chart categorizes variants by the frequency at which they occur in a population. Variants that occur at a higher frequency are usually common or polymorphic variants, while rare variants are typically the ones carried forward for further analysis in disease research.

A variant is identified as pathogenic if it satisfies any of the following criteria: a REVEL score ≥ 0.685; a CADD PHRED score ≥ 25.6; a PolyPhen-2 annotation of “damaging”; a ClinVar annotation of “pathogenic”; or an AlphaMissense classification of “likely pathogenic.” These variants are also presented in a separate ‘Pathogenic Variants’ tab for further exploration.
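The "any of the following" rule can be expressed as a short Python sketch. The thresholds and labels come from the criteria above; the dictionary keys (`revel`, `cadd_phred`, `polyphen2`, `clinvar`, `alphamissense`), the case-insensitive comparison, and the treatment of missing values as not satisfying a criterion are assumptions made for illustration.

```python
# Illustrative sketch only; field names and missing-value handling are assumed.

def _label_is(variant, key, expected):
    """Case-insensitive comparison of a text annotation; missing values fail."""
    value = variant.get(key)
    return isinstance(value, str) and value.strip().lower() == expected


def is_pathogenic(variant):
    """Return True if the variant meets at least one of the listed criteria."""
    revel = variant.get("revel")
    cadd = variant.get("cadd_phred")
    return any([
        revel is not None and revel >= 0.685,
        cadd is not None and cadd >= 25.6,
        _label_is(variant, "polyphen2", "damaging"),
        _label_is(variant, "clinvar", "pathogenic"),
        _label_is(variant, "alphamissense", "likely pathogenic"),
    ])


# Hypothetical records: the first passes on the CADD PHRED score alone.
by_cadd = is_pathogenic({"revel": 0.30, "cadd_phred": 27.1})
by_none = is_pathogenic({"revel": 0.30, "cadd_phred": 12.0, "clinvar": "benign"})
```

Because the criteria are combined with a logical OR, a single strong predictor is enough to flag a variant, even when the other scores are low or absent.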

"MAF" stands for Minor Allele Frequency. It represents the frequency at which the less common allele, in this case a variant, occurs in a given population. It measures how often the variant appears among all the individuals sequenced as part of a study. MAF is used for understanding genetic diversity and for identifying alleles that may be associated with specific traits or diseases. A higher MAF (generally > 5%) indicates that the variant is relatively common in the population, whereas a lower MAF suggests that the variant is rarer. MAF is often used in genetic research to filter variants, prioritize findings, and assess the potential significance of variants in disease association and risk assessment.

The highest allele frequency is the greatest minor allele frequency reported for the variant across all population studies included in our database. These studies are gnomAD2, gnomAD3, 1000 Genomes, Complete Genomics 69, NCI60, HGDP European, GME, ESP6500, ExAC non-TCGA, UK10K Cohort, and ALFA. The population/study corresponding to the maximum MAF is reported in the Max MAF Source.
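A minimal sketch of this selection, assuming the per-study frequencies for one variant are available as a dictionary keyed by study name (the function name, the data layout, and the sample values are hypothetical):

```python
# Illustrative sketch only; data layout and values are hypothetical.

def max_maf(frequencies):
    """Return (highest MAF, study that reported it); skip studies with no value."""
    present = {study: maf for study, maf in frequencies.items() if maf is not None}
    if not present:
        return None, None
    source = max(present, key=present.get)
    return present[source], source


variant_frequencies = {
    "gnomAD3": 0.0012,
    "ESP6500": None,        # variant not observed or not reported in this study
    "UK10K Cohort": 0.004,
}

highest, highest_source = max_maf(variant_frequencies)
```

Here `highest` corresponds to the highest allele frequency and `highest_source` to the value shown as the Max MAF Source.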

Variants annotated by RARe-SOURCE® are displayed as ‘Annotated variants’ in the 2D and 3D protein visualizations. The clinical significance value from ClinVar is used for the pathogenicity impact values. Variants not in ClinVar are annotated as ‘No ClinVar annotation’.

At this time, RARe-SOURCE® allows users to access rare disease and gene information without logging in. In the future, RARe-SOURCE® may require certain users to be authenticated to access sensitive data, save dashboard information or share research with other users of the resource.

RARe-SOURCE® focuses on genotype-phenotype correlations for rare diseases and integrates related data from a variety of databases, with the goals of:
  • Providing easy access to published literature on rare diseases without having to know and/or search for all associated disease and gene names and aliases.
  • Making variant annotations and impact assessments available for genes associated with rare diseases.
  • Mapping variants on the three-dimensional structure to assist researchers in investigating structure-function relationships and the impact of the variant on the protein’s ability to interact with signaling partners.

RARe-SOURCE® allows for data export in multiple formats:
  • All tabular data can be copied to the system clipboard or exported in CSV or Microsoft Excel compatible formats.
  • 2-D protein structure visualization data can be exported from ProtVista in JSON format.
  • 3-D protein structure visualizations can be exported or downloaded in PNG image format.

Currently, RARe-SOURCE® integrates data from multiple internal and external sources that are frequently updated. Future plans include user submission features for variants and rare diseases.