In our previous work (L. Wiel et al., Human Mutation, 2017) we observed that the presence of pathogenic missense variants at an aligned homologous domain position is often paired with the absence of population variation, and vice versa. We realized that this type of information could be of great benefit to genetic diagnostics, and that an easy-to-use web server would make it accessible without the need for a bioinformatics intermediary.
The MetaDome web server is a further extension to our framework that maps population variation and known pathogenic mutations onto “meta-domains”. MetaDome takes as input the gene of interest and allows the user to select the preferred transcript. Using this information MetaDome provides protein domain and pathogenic variant annotation, and generates a ‘tolerance landscape’ for the gene’s protein, which visualizes regional tolerance to normal genetic variation. Furthermore, MetaDome uses homologous protein domain relations to aggregate population-based and pathogenic variants found across the genome that are aligned to the same position for the domain in the gene of interest. The use of these annotations can improve the interpretation of genetic variation.
Software architecture of MetaDome
MetaDome is developed primarily in Python and uses the Flask framework for the web server and for communication between the front end, the back end, and the database. The software architecture follows the domain-driven design paradigm. The code is open source and can be found at our GitHub repository, which also contains detailed instructions on how to deploy the MetaDome web server. To ensure MetaDome can be deployed to any environment, we have containerized the application via Docker.
Datasets of population and disease-causing genetic variation
Population variation is obtained from the Genome Aggregation Database (gnomAD). MetaDome uses the gnomAD VCF file and selects all synonymous and missense variants that meet the PASS filter criteria. Disease-causing missense variants are taken from the VCF file of ClinVar, the public archive of clinically relevant variants; only variants with a disease-causing (Pathogenic) status are used.
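The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not MetaDome's actual code: it assumes a simplified, hypothetical INFO key named CONSEQUENCE that holds a single consequence term, whereas real gnomAD files encode consequence annotations in a more complex pipe-delimited field. The names Variant, parse_info, and select_variants are likewise illustrative.

```python
from typing import Dict, Iterable, Iterator, NamedTuple

# Consequence terms kept for the population dataset.
KEPT_CONSEQUENCES = {"missense_variant", "synonymous_variant"}


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    consequence: str


def parse_info(info: str) -> Dict[str, str]:
    """Turn 'A=1;B=2;FLAG' into {'A': '1', 'B': '2', 'FLAG': ''}."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def select_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield PASS-filtered missense and synonymous records from VCF lines."""
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt, _qual, filt, info = line.rstrip("\n").split("\t")[:8]
        if filt != "PASS":
            continue
        # CONSEQUENCE is a hypothetical, simplified INFO key (assumption).
        consequence = parse_info(info).get("CONSEQUENCE", "")
        if consequence in KEPT_CONSEQUENCES:
            yield Variant(chrom, int(pos), ref, alt, consequence)
```

Because the function consumes any iterable of lines, it can be fed an open file handle as well as an in-memory list.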
A mapping between the world of genomics and proteomics
MetaDome features a PostgreSQL relational database in which a complete mapping between genomic and protein positions is stored together with domain region annotation. The mapping is auto-generated by the MetaDome web server from the GENCODE Basic set and the UniProtKB/Swiss-Prot databank. The auto-generation is performed for each translation in the GENCODE set via a Protein-Protein BLAST against human Swiss-Prot canonical and isoform sequences. Only identical sequences are used for the mapping; for all other translations, only the existence of the transcript is registered in the database.
Next, for each identical match between translation and Swiss-Prot sequence a ClustalW2 alignment is made between the two sequences. Then, for each nucleotide a mapping is made between the genomic position and the protein position that is stored in the database. As only the protein-coding information of a gene is needed for MetaDome each mapping represents a part of a codon. Each mapping is linked to a gene translation and a Swiss-Prot entry.
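The per-nucleotide mapping can be illustrated with a short Python sketch. It assumes that the translation and the Swiss-Prot sequence are identical, so that protein positions correspond one-to-one, and that the coding segments of the transcript are available as inclusive genomic ranges. The names Mapping and map_cds_to_protein are hypothetical and do not come from the MetaDome code base.

```python
from typing import List, NamedTuple, Tuple


class Mapping(NamedTuple):
    genomic_pos: int  # 1-based chromosome coordinate
    cds_pos: int      # 1-based position within the coding sequence
    codon_base: int   # 0, 1, or 2: which base of the codon this is
    protein_pos: int  # 1-based amino-acid position


def map_cds_to_protein(cds_exons: List[Tuple[int, int]], strand: str) -> List[Mapping]:
    """Map every coding nucleotide to its codon base and protein position.

    cds_exons holds inclusive (start, end) genomic ranges of the coding
    segments; strand is '+' or '-'.
    """
    if strand not in {"+", "-"}:
        raise ValueError("strand must be '+' or '-'")

    # Collect coding positions in ascending genomic order.
    positions: List[int] = []
    for start, end in sorted(cds_exons):
        positions.extend(range(start, end + 1))

    # On the reverse strand, transcription runs from high to low coordinates.
    if strand == "-":
        positions.reverse()

    return [
        Mapping(genomic_pos=g, cds_pos=i + 1, codon_base=i % 3, protein_pos=i // 3 + 1)
        for i, g in enumerate(positions)
    ]
```

Each returned entry corresponds to one part of a codon, so three consecutive entries share the same protein position even when an intron separates them on the genome.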
After the mapping process is complete, each Swiss-Prot sequence in the database is annotated via InterProScan for Pfam-A protein domains, and each of these results is stored in the database. This step completes the construction of the database; it is followed by the construction of all meta-domain alignments. If you require a pre-built version of our database, please contact us.
Composing a meta-domain
The meta-domains consist of homologous Pfam protein domain instances, which are annotated for all protein sequences in our database via InterProScan. All domains that have multiple instances annotated to proteins are considered candidates for meta-domains. We consider protein domain instances to be homologues when they share the same Pfam domain identifier and occur more than once at different regions in the genome. For each domain that meets this criterion, we generate a multiple sequence alignment (MSA) in the following manner. We retrieve all sequences for these domain instances, then retrieve the Pfam HMM corresponding to the identifier and use the HMMER tool to align these protein sequences. This results in a Stockholm-formatted MSA file, which can be read by alignment visualization software such as Jalview. In this Stockholm-formatted file, all columns that correspond to the domain consensus represent the same homologous positions.
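To illustrate how consensus columns tie residues from different domain instances together, the sketch below reads a Stockholm alignment and reports, for every consensus column, which residue of each sequence sits in it. It assumes the file carries a "#=GC RF" annotation line in which insert columns are marked with "." and consensus columns with any other character, as HMMER alignments typically do. The function names are hypothetical, and this is not MetaDome's own parser.

```python
from typing import Dict, List, Optional, Tuple


def parse_stockholm(text: str) -> Tuple[Dict[str, str], str]:
    """Return (aligned sequences by name, RF annotation string)."""
    sequences: Dict[str, str] = {}
    rf = ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC RF"):
            # Reference annotation: marks which columns are consensus columns.
            rf += line.split()[2]
        elif line.startswith("#"):
            continue  # format header and other markup lines
        else:
            name, aligned = line.split()
            # Concatenate so that interleaved blocks are handled as well.
            sequences[name] = sequences.get(name, "") + aligned
    return sequences, rf


def consensus_residue_index(
    sequences: Dict[str, str], rf: str
) -> Dict[str, List[Optional[int]]]:
    """For each consensus column, give each sequence's 0-based residue
    index within its domain instance, or None where it has a gap."""
    columns = [i for i, char in enumerate(rf) if char not in ".-~"]
    table: Dict[str, List[Optional[int]]] = {}
    for name, aligned in sequences.items():
        residue = -1
        per_column: List[Optional[int]] = []
        for char in aligned:
            if char.isalpha():
                residue += 1
                per_column.append(residue)
            else:
                per_column.append(None)
        table[name] = [per_column[i] for i in columns]
    return table
```

The indices are relative to the start of each aligned domain instance; adding the instance's start offset in the full protein would give the protein position needed to look up genomic coordinates.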
These Stockholm files are retrieved by the MetaDome web server when a user requests meta-domain information for a position of interest. Upon retrieval of a Stockholm file, the mapping database is used to obtain the corresponding genomic positions for each residue. These genomic positions are then used to annotate each alignment column with the gnomAD or ClinVar single nucleotide variants found at those positions.
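The aggregation of variants over an alignment column can be sketched as follows. The data structures are assumptions made for illustration: codons_by_column lists, for each consensus column, the gene and the genomic positions of the codon aligned to that column, and variants is a lookup from genomic position to variant labels. Neither reflects MetaDome's actual database schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GenomicPos = Tuple[str, int]  # (chromosome, 1-based position)


def variants_per_column(
    codons_by_column: Dict[int, List[Tuple[str, List[GenomicPos]]]],
    variants: Dict[GenomicPos, List[str]],
) -> Dict[int, List[Tuple[str, GenomicPos, str]]]:
    """Collect, per consensus column, every variant that falls on a codon
    aligned to that column, across all homologous domain instances."""
    result: Dict[int, List[Tuple[str, GenomicPos, str]]] = defaultdict(list)
    for column, codons in codons_by_column.items():
        for gene, positions in codons:
            for position in positions:
                for variant in variants.get(position, []):
                    result[column].append((gene, position, variant))
    # Columns without any variant are simply absent from the result.
    return dict(result)
```

Running the same function once with gnomAD variants and once with ClinVar variants would give the population-based and pathogenic annotation for each column separately.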
Computing genetic tolerance and generating a tolerance landscape
We use the non-synonymous over synonymous ratio to quantify genetic tolerance in our Tolerance Landscape visualization. In our setting, this score is based on the missense and synonymous single nucleotide variants (SNVs) from gnomAD in a protein-coding region. The score is corrected for the sequence composition of the protein-coding region, based on the total number of possible missense and synonymous SNVs. A Tolerance Landscape is generated by computing this ratio in a sliding window of 21 residues over the entire protein of interest (i.e., for each residue, the ratio is calculated over that residue and the ten residues to its left and right).
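A minimal Python sketch of such a sliding-window computation is given below. It assumes the corrected ratio takes the form (observed missense / possible missense) / (observed synonymous / possible synonymous), which is one way to apply the correction described above. It also assumes that windows are truncated at the protein ends and that a window with an undefined ratio yields None. The function and parameter names are illustrative.

```python
from typing import List, Optional, Sequence


def tolerance_landscape(
    obs_missense: Sequence[int],
    obs_synonymous: Sequence[int],
    possible_missense: Sequence[int],
    possible_synonymous: Sequence[int],
    flank: int = 10,
) -> List[Optional[float]]:
    """Per-residue ratio over a window of 2 * flank + 1 residues.

    Each input holds one count per residue: observed gnomAD SNVs and the
    number of SNVs that are possible given the codon at that residue.
    """
    n = len(obs_missense)
    if not (len(obs_synonymous) == len(possible_missense) == len(possible_synonymous) == n):
        raise ValueError("all inputs must have one value per residue")

    scores: List[Optional[float]] = []
    for i in range(n):
        # Window is truncated at the protein ends (assumption).
        lo, hi = max(0, i - flank), min(n, i + flank + 1)
        mis = sum(obs_missense[lo:hi])
        syn = sum(obs_synonymous[lo:hi])
        mis_bg = sum(possible_missense[lo:hi])
        syn_bg = sum(possible_synonymous[lo:hi])
        if mis_bg == 0 or syn_bg == 0 or syn == 0:
            scores.append(None)  # ratio undefined for this window
        else:
            scores.append((mis / mis_bg) / (syn / syn_bg))
    return scores
```

With the default flank of 10, each score covers 21 residues. Lower values indicate regions where missense variation is depleted relative to synonymous variation.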