Immunome database for marsupials and monotremes
© Wong et al; licensee BioMed Central Ltd. 2011
Received: 7 April 2010
Accepted: 19 August 2011
Published: 19 August 2011
To understand the evolutionary origins of our own immune system, we need to characterise the immune system of our distant relatives, the marsupials and monotremes. The recent sequencing of the genomes of two marsupials (opossum and tammar wallaby) and a monotreme (platypus) provides an opportunity to characterise the immune gene repertoires of these model organisms. This was required as many genes involved in immunity evolve rapidly and fail to be detected by automated gene annotation pipelines.
We have developed a database of immune genes from the tammar wallaby, red-necked wallaby, northern brown bandicoot, brush-tail possum, opossum, echidna and platypus. The resource contains 2,235 newly identified sequences and 3,197 sequences which had been described previously. This comprehensive dataset was built from a variety of sources, including EST projects and expert-curated gene predictions generated through a variety of methods including chained-BLAST and sensitive HMMER searches. To facilitate systems-based research we have grouped sequences based on broad Gene Ontology categories as well as by specific functional immune groups. Sequences can be extracted by keyword, gene name, protein domain and organism name. Users can also search the database using BLAST.
The Immunome Database for Marsupials and Monotremes (IDMM) is a comprehensive database of all known marsupial and monotreme immune genes. It provides a single point of reference for genomic and transcriptomic datasets. Data from other marsupial and monotreme species will be added to the database as they become available. This resource will be useful to marsupial and monotreme immunologists as well as to researchers interested in the evolution of mammalian immunity.
Recently, two marsupial genomes and one monotreme genome have been sequenced: the grey short-tailed opossum (Monodelphis domestica; 7× coverage), the tammar wallaby (Macropus eugenii; 2×) (in prep.), and the platypus (Ornithorhynchus anatinus; 6×). Marsupial and monotreme lineages branched off approximately 148 My and 166 My ago, respectively, from the lineage leading to eutherian mammals. They hold a unique evolutionary position, providing a link to the reptilian phase of our ancestry. Together with their unusual biological traits, this position allows them to provide important insights into mammalian biology and evolution.
Genome sequencing has generated huge amounts of genomic data and has expedited the identification of genes in these species. Despite the availability of genome assemblies, only the most phylogenetically conserved immune genes have been identified using automated gene annotation pipelines. Genes involved in the immune response are subject to intense selective pressure due to the need to overcome pathogenic challenges. As a result, it is common for immune genes, particularly those with immunomodulatory roles, to show very low levels of sequence conservation between species [4, 5]. This has led to many key immune molecules being missed by the Ensembl and NCBI Gnomon (http://www.ncbi.nlm.nih.gov) genome annotation platforms. Less than a third of all opossum immune genes that were annotated using specialized search strategies by Wong et al. 2006, Belov et al. 2006 and Belov et al. 2007 were predicted by the Ensembl pipeline. Aside from high levels of sequence divergence, many immune gene families have also evolved through rapid successions of gene loss and gain, resulting in a lack of direct orthologs. Hence, these genes are difficult to characterize through local pairwise similarity search algorithms, such as BLAST, which use a single gene sequence to query a database.
To overcome the lack of annotated sequence information for immune genes, targeted, manually-curated strategies were applied [7, 9, 11, 12]. Identification of the most highly divergent sequences required an intensive combination of strategies incorporating hidden Markov model searches, exploitation of conserved syntenic regions, sensitive local search algorithms and gene prediction integrating extrinsic information [7, 9, 11, 12]. Less divergent genes missed by Ensembl could be identified and annotated using chained-BLAST searches .
Here, we present a database of curated marsupial and monotreme immune sequences. We have included novel predicted and expressed sequences as well as previously annotated genes [7, 9, 11–45]. Examples of gene groups represented in the database include chemokines, interleukins, Natural Killer (NK) receptors, Major Histocompatibility Complex (MHC) antigens, surface receptors and antimicrobial peptides. Annotations derived from a transcriptomic analysis of a primary lymphoid organ have also been included. Many of these genes (e.g. 209 expressed tammar genes) have not been annotated by Ensembl, and their sequences are not curated by other public databases. The database has a simple interface and offers several methods for querying the sequences. On entry to the database, sequences were further annotated to provide searchable functional information. The availability of a comprehensive gene set assists large-scale projects such as transcriptomic analyses and microarray studies. It also facilitates the development of marsupial- and monotreme-specific reagents, allowing for detailed analyses of metatherian and prototherian immune responses.
IDMM was implemented using the Python web framework Django (version 1.1) with a SQLite3 (version 3.6.3) database. Data can be easily updated by approved managers through a simple web interface. Once sequences are added, they are automatically matched to HGNC names and GO terms through a BLAST search. Amino acid sequences are additionally searched against the Conserved Domain Database (CDD) to create protein domain annotations. Sequences are stored in FASTA format and are identified by their sequence header description, which includes the gene name and species name.
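The storage scheme described above, with FASTA records keyed by their header line, can be illustrated with a minimal sketch that uses only Python's standard `sqlite3` module. This is not the IDMM code: the table name, column names, header layout and example residues are all assumptions made for illustration, and the real application uses Django models rather than raw SQL.

```python
import sqlite3


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_sequences(conn, fasta_text):
    """Store each FASTA record; the header doubles as the unique identifier."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequence ("
        "header TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO sequence (header, residues) VALUES (?, ?)",
            list(parse_fasta(fasta_text)),
        )


# Hypothetical records: header layout and residues are invented.
EXAMPLE = """>IL10 tammar wallaby
MHSSALLC
CLVLLTG
>CD4 opossum
MNRGVPF
"""

conn = sqlite3.connect(":memory:")
load_sequences(conn, EXAMPLE)
```

Using the header as the primary key mirrors the statement that each sequence is identified by its header description; a production schema would more likely add a surrogate key and separate gene and species columns.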
A total of 2,935 genes, 602 expressed (538 tammar wallaby, 24 opossum, 16 platypus, 11 echidna, 6 red-necked wallaby, 4 brushtail possum and 3 bandicoot) and 2,333 predicted (1,639 opossum, 694 platypus), are currently stored in the database. The database includes 1,985 published sequences. We have integrated data from various published resources, which include expressed and predicted genes from opossum (1,663) [7–9, 18, 34, 35, 38, 39, 43–45, 50], tammar (37) [21–26, 33, 35–37, 45], brushtail possum (4) [18, 20, 27–29, 40], echidna (11) [14, 17, 19, 30, 32], bandicoot (3), red-necked wallaby (6) and platypus (261) [11–14, 16, 18, 19, 31, 41]. Manually annotated gene families include: major histocompatibility complex (MHC), leucocyte receptor complex (LRC), cytokine, defensin, cathelicidin, natural killer complex (NKC) and Fc receptor genes. Both opossum and platypus sequences were annotated using a curated list of human immune genes from the IRIS database. For predicted genes, candidate gene regions were first identified using either BLAST or HMMER hidden Markov model searches. Following this, best hits were either concatenated into genes or used to predict a full gene model using a gene prediction program. A total of 516 wallaby genes were annotated based on opossum genes identified in Wong et al. 2006 and Belov et al. 2007. Of these, at least 217 were not annotated by Ensembl (version 58). Wallaby reads were derived from the pyrosequencing of wallaby thymus transcriptomes and annotated using the wallaby (v1.0) genome assembly. For each annotated wallaby gene there were often multiple, overlapping reads; these were assembled and included in the database (1,786 wallaby reads in total). A further 449 platypus gene sequences were obtained by concatenation of the highest-scoring IRIS BLAST hits against the platypus genome assembly (v5.0) (unpublished). Of these, 366 genes were not annotated by Ensembl (version 58).
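The concatenation of best hits into a gene sequence can be pictured with the following sketch. The text does not specify the chaining procedure, so the rule used here is an assumption: hits are taken in order of decreasing score, any hit that overlaps an already chosen hit on the query is discarded, and the surviving segments are joined in query order. The field names (`q_start`, `q_end`, `score`, `segment`) and the example values are hypothetical.

```python
def chain_hits(hits):
    """Join non-overlapping, highest-scoring hit segments in query order.

    Each hit is a dict with query coordinates (q_start, q_end, inclusive),
    an alignment score, and the aligned subject segment.
    """
    chosen = []
    # Consider the strongest hits first so they win any overlap.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        overlaps = any(
            hit["q_start"] <= c["q_end"] and c["q_start"] <= hit["q_end"]
            for c in chosen
        )
        if not overlaps:
            chosen.append(hit)
    # Restore query order before concatenating the segments.
    chosen.sort(key=lambda h: h["q_start"])
    return "".join(h["segment"] for h in chosen)


# Hypothetical hits for one query protein; values are invented.
HITS = [
    {"q_start": 1, "q_end": 40, "score": 80, "segment": "MKT"},
    {"q_start": 35, "q_end": 60, "score": 30, "segment": "LLV"},
    {"q_start": 61, "q_end": 100, "score": 55, "segment": "GSA"},
]
CHAINED = chain_hits(HITS)
```

A sequence built this way can differ from the true transcript, for instance when an exon yields no hit, so chained predictions are best treated as approximations of the expressed sequence.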
All reads were defined by their species name, a gene symbol, the method of identification and sequence type (nucleotide or amino acid). To facilitate the retrieval of genes associated with specific immune roles, we categorized genes based on nine functional terms. These include the broad categories of humoral and cellular immunity and components of the innate (inflammation and complement system) and adaptive (antigen processing and presentation, and phagocytosis) immune responses, as well as genes with regulatory functions such as chemokines and transcription factors. To provide additional sequence-based and functional information, all sequences were automatically annotated upon submission to the database. Automatic annotation was performed by searching the human SWISS-PROT database at NCBI with the submitted sequences using the network BLAST client (netblast). This resulted in the association of sequences with the official human gene names, GO terms and, for protein sequences, domain names. The accession of the best hit from each BLAST search was retrieved and matched to a list of pre-generated accession-specific tags if the E-value was less than 1e-3. These tags were linked to human gene names and gene ontology annotations using Entrez Gene data.
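The tagging step can be sketched as follows, assuming the BLAST results have already been reduced to one best hit per submitted sequence. Only the 1e-3 cutoff comes from the text; the function name, the data layout, the accessions and the example E-values are hypothetical.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def annotate(best_hits, accession_tags, cutoff=E_VALUE_CUTOFF):
    """Attach the tags of each sequence's best hit when the hit is strong enough.

    best_hits:      dict mapping sequence id -> (accession, e_value)
    accession_tags: dict mapping accession -> pre-generated tags
    """
    annotations = {}
    for seq_id, (accession, e_value) in best_hits.items():
        # Strictly below the cutoff, and only if a tag exists for the accession.
        if e_value < cutoff and accession in accession_tags:
            annotations[seq_id] = accession_tags[accession]
    return annotations


# Hypothetical inputs: accessions, sequence ids and E-values are invented.
ACCESSION_TAGS = {
    "ACC_A": {"gene": "IL10", "go": ["cytokine activity"]},
    "ACC_B": {"gene": "CD4", "go": ["T cell activation"]},
}
BEST_HITS = {
    "wallaby_seq_1": ("ACC_A", 2e-40),
    "platypus_seq_7": ("ACC_B", 0.5),   # too weak, left unannotated
    "opossum_seq_3": ("ACC_Z", 1e-20),  # strong hit, but no tag on record
}
ANNOTATIONS = annotate(BEST_HITS, ACCESSION_TAGS)
```

Sequences without a sufficiently strong hit simply receive no automatic annotation, which is consistent with lineage-specific genes remaining labelled only by their curated names.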
All gene names determined through annotation can be browsed. All sequences have been annotated with a gene name based on the human gene symbol, with the exception of lineage-specific expansions, such as NKC genes and MHC genes. Characterized species-specific expansions (i.e. without human one-to-one orthologs) are labelled using the gene family name followed by a unique set of numeric identifiers.
A simple keyword search permits users to query the database using any string of characters from any description line in FASTA sequences, human gene descriptions and GO names. All FASTA descriptions contain the common name of the species from which the sequences were derived. In addition to terms present in the FASTA header description, users may also search terms generated by automatic annotation which include full gene name (in addition to HGNC symbol) and GO terms. Only sequences of high similarity (E-value < 1e-3) to human genes were automatically annotated. Two keyword searches are available: one for exact but case-insensitive match in sequence headers only and one which matches all terms containing the keyword from all associations, including, for example, GO descriptions.
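The difference between the two keyword searches can be illustrated with a small sketch. It assumes that an "exact" match means the keyword equals a whole whitespace-delimited token of the header, and that the broader search is a substring match over the header, the human gene description and the GO names; both readings, along with the record layout and example records, are assumptions rather than the actual IDMM implementation.

```python
def search_headers_exact(records, keyword):
    """Headers in which the keyword appears as a whole token, ignoring case."""
    key = keyword.lower()
    return [r["header"] for r in records if key in r["header"].lower().split()]


def search_all_associations(records, keyword):
    """Headers of records where any associated text contains the keyword."""
    key = keyword.lower()
    hits = []
    for r in records:
        fields = [r["header"], r.get("gene_description", "")]
        fields.extend(r.get("go_names", []))
        if any(key in field.lower() for field in fields):
            hits.append(r["header"])
    return hits


# Hypothetical records; layout and contents are illustrative only.
RECORDS = [
    {"header": "IL10 tammar wallaby",
     "gene_description": "interleukin 10",
     "go_names": ["cytokine activity"]},
    {"header": "IL10RA opossum",
     "gene_description": "interleukin 10 receptor subunit alpha",
     "go_names": ["signaling receptor activity"]},
    {"header": "CD4 platypus",
     "gene_description": "CD4 molecule",
     "go_names": ["T cell activation"]},
]
```

With these records, the exact search for "il10" returns only the IL10 header, whereas the broad search also returns IL10RA because the keyword is a substring of that symbol.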
Sequence retrieval via the initial sequence identification method (e.g. BLAST) allows simple discrimination between expressed and predicted genes. It is important to note that while chained high scoring BLAST alignments may provide more sequence information, the predicted sequence may not be identical to the actual transcribed sequence. We have also provided information on the identification method used on each sequence label.
To facilitate the retrieval of marsupial and monotreme homologs to human genes, a list of human gene symbols is available for browsing. We queried marsupial and monotreme database sequences against all human proteins and linked the best hits based on the E-value. The resultant annotations are, in effect, reciprocal best hits of predicted genes. By comparison of gene symbols, users can rapidly determine the accuracy of an ortholog assignment. This strategy provides a measure of the level of confidence in the assigned gene name.
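The comparison of gene symbols described above can be sketched as a simple agreement check between the symbol a sequence was labelled with and the symbol of its best human hit. The function names, data layout and example values are hypothetical; the only element taken from the text is that the best hit is selected by E-value.

```python
def best_hit(hits):
    """Return the (symbol, e_value) hit with the lowest E-value, or None."""
    return min(hits, key=lambda h: h[1]) if hits else None


def check_assignments(assigned, hits_by_seq):
    """Report whether each assigned symbol matches the best human hit.

    assigned:    dict mapping sequence id -> symbol used when the gene was predicted
    hits_by_seq: dict mapping sequence id -> list of (human symbol, e_value)
    """
    report = {}
    for seq_id, symbol in assigned.items():
        top = best_hit(hits_by_seq.get(seq_id, []))
        report[seq_id] = top is not None and top[0].upper() == symbol.upper()
    return report


# Hypothetical data: sequence ids, symbols and E-values are invented.
ASSIGNED = {"opossum_1": "IL10", "platypus_2": "CCL19", "wallaby_3": "CD8A"}
HITS = {
    "opossum_1": [("IL10", 1e-50), ("IL22", 1e-8)],
    "platypus_2": [("CCL21", 1e-12), ("CCL19", 1e-10)],
}
REPORT = check_assignments(ASSIGNED, HITS)
```

A mismatch, as for the second sequence here, does not prove the assignment wrong, but it flags a gene whose name deserves closer inspection.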
To facilitate rapid identification of gene family members, users can search for sequences based on annotated protein functional units from the Conserved Domain Database (CDD). CDD names can be browsed by list and by hyperlinks via 'tag cloud'. Conserved domain annotations are only available for amino acid sequences.
Users may interrogate biological and molecular functional processes and structural components through a GO browser (Figure 2). The browser follows the tree-like hierarchy of GO data by linking general terms to specific terms. The GO terms associated with monotreme and marsupial sequences are inferred through sequence similarity to human Entrez gene annotations. For each term the number of associated database sequences is located after the GO name. By clicking on this number users can extract all associated sequences. Note that Entrez GO terms often miss higher level terms which will underestimate the number of genes in a category. Therefore, it is advisable to browse through GO child terms.
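The advice to browse child terms can be made concrete with a sketch that collects the sequences attached to a GO term together with those attached to any of its descendants. The small hierarchy fragment uses real GO term names but is heavily simplified, and the sequence identifiers and function names are hypothetical.

```python
def descendants(term, children):
    """Return the term plus every term reachable through child links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def sequences_under(term, children, direct):
    """Sequences annotated to the term itself or to any descendant term."""
    found = set()
    for t in descendants(term, children):
        found |= direct.get(t, set())
    return found


# Simplified fragment of the GO hierarchy (parent -> child terms).
CHILDREN = {
    "immune system process": ["immune response"],
    "immune response": ["innate immune response", "adaptive immune response"],
}
# Hypothetical direct annotations; sequence ids are invented.
DIRECT = {
    "immune response": {"seq_a"},
    "innate immune response": {"seq_b", "seq_c"},
    "adaptive immune response": {"seq_c", "seq_d"},
}
```

In this example "immune system process" has no directly annotated sequences, yet four distinct sequences fall under it once child terms are included, which is the underestimate the text warns about.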
Users can direct BLAST queries against the sequence database. Users can perform nucleotide, translated nucleotide and protein searches. Results are presented in standard BLAST text output format.
With the exception of BLAST searches, sequences are viewed through a standard retrieval interface. FASTA headers uniquely identify each sequence. In addition to the option of retrieving reads individually, users may choose to retrieve all identified sequences at once. Users can also fetch all associated annotations for each sequence. An option to display amino acid or nucleotide sequences is available.
Targeted search strategies for immune genes and gene families have led to the annotation of previously unidentified marsupial and monotreme genes in the recent genome assemblies of the opossum, tammar wallaby and platypus. Genes involved in immunity are generally poorly annotated in genome assemblies due to their high rate of sequence divergence and gene duplication. This high sequence divergence also renders marsupial and monotreme immune genes difficult to isolate with classical laboratory techniques. IDMM provides easy access to marsupial and monotreme immune sequences. It hosts a catalogue of novel and integrated sets of published genes, searchable through a fast, simple-to-use interface. The availability of these sequences will facilitate the development of species-specific immunological reagents, enabling accurate studies of immune responses in these species. This database will also be useful for comparative studies of immunity.
IDMM is publicly available at http://hp580.angis.org.au/tagbase/gutentag/.
We thank Dao Mai, Lee Render, Dr Matthew Hobbs and Mette Lille for technical advice and suggestions for improving the database. EW was supported by an ARC Kangaroo Genomics and Jean Walker postgraduate scholarship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.