Immunome database for marsupials and monotremes

Background: To understand the evolutionary origins of our own immune system, we need to characterise the immune system of our distant relatives, the marsupials and monotremes. The recent sequencing of the genomes of two marsupials (opossum and tammar wallaby) and a monotreme (platypus) provides an opportunity to characterise the immune gene repertoires of these model organisms. Targeted characterisation is needed because many genes involved in immunity evolve rapidly and fail to be detected by automated gene annotation pipelines.

Description: We have developed a database of immune genes from the tammar wallaby, red-necked wallaby, northern brown bandicoot, brush-tail possum, opossum, echidna and platypus. The resource contains 2,235 newly identified sequences and 3,197 sequences which had been described previously. This comprehensive dataset was built from a variety of sources, including EST projects and expert-curated gene predictions generated through a variety of methods including chained-BLAST and sensitive HMMER searches. To facilitate systems-based research we have grouped sequences based on broad Gene Ontology categories as well as by specific functional immune groups. Sequences can be extracted by keyword, gene name, protein domain and organism name. Users can also search the database using BLAST.

Conclusion: The Immunome Database for Marsupials and Monotremes (IDMM) is a comprehensive database of all known marsupial and monotreme immune genes. It provides a single point of reference for genomic and transcriptomic datasets. Data from other marsupial and monotreme species will be added to the database as they become available. This resource will be utilized by marsupial and monotreme immunologists as well as researchers interested in the evolution of mammalian immunity.


Background
Recently, two marsupial genomes and one monotreme genome have been sequenced: the grey short-tailed opossum (Monodelphis domestica; 7× coverage) [1], the tammar wallaby (Macropus eugenii; 2×) (in prep.), and the platypus (Ornithorhynchus anatinus; 6×) [2]. Marsupial and monotreme lineages branched off approximately 148 My and 166 My ago, respectively, from the lineage leading to eutherian mammals [3]. They hold a unique evolutionary position, providing a link to the reptilian phase of our ancestry. Together with their unusual biological traits, this position means they can provide important insights into mammalian biology and evolution.
Genome sequencing has generated huge amounts of genomic data. This has expedited the identification of genes in these species. Despite the availability of genome assemblies, only the most phylogenetically conserved immune genes have been identified using automated gene annotation pipelines. Genes involved in the immune response are subject to intense selective pressure due to the need to overcome pathogenic challenges. As a result, it is common for immune genes, particularly those with immunomodulatory roles, to show very low levels of sequence conservation between species [4,5]. This has led to many key immune molecules being missed by the Ensembl [6] and NCBI Gnomon (http://www.ncbi.nlm.nih.gov) genome annotation platforms. Fewer than a third of all opossum immune genes that were annotated using specialized search strategies by Wong et al. 2006 [7], Belov et al. 2006 [8] and Belov et al. 2007 [9] were predicted by the Ensembl pipeline [6].
Aside from high levels of sequence divergence, many immune gene families have also evolved through rapid successions of gene loss and gain, resulting in a lack of direct orthologs. Hence, these genes are difficult to characterize through local pairwise similarity search algorithms, such as BLAST [10], which use a single gene sequence to query a database.
To overcome the lack of annotated sequence information for immune genes, targeted, manually-curated strategies were applied [7,9,11,12]. Identification of the most highly divergent sequences required an intensive combination of strategies incorporating hidden Markov model searches, exploitation of conserved syntenic regions, sensitive local search algorithms and gene prediction integrating extrinsic information [7,9,11,12]. Less divergent genes missed by Ensembl could be identified and annotated using chained-BLAST searches [9].
Here, we present a database of curated marsupial and monotreme immune sequences. We have included novel predicted and expressed sequences as well as previously annotated genes [7,9,. Examples of gene groups represented in the database include chemokines, interleukins, Natural Killer (NK) receptors, Major Histocompatibility Complex (MHC) antigens, surface receptors and antimicrobial peptides. Annotations derived from a transcriptomic analysis of a primary lymphoid organ have also been included [46]. Many of these genes (e.g. 209 expressed tammar genes) have not been annotated by Ensembl and their sequences are not curated by other public databases. The database has a simple interface and offers several methods for users to query the sequences. On entry to the database, sequences were further annotated to provide searchable functional information. The availability of a comprehensive gene set assists large-scale projects such as transcriptomic analyses and microarray studies. It also facilitates the development of marsupial- and monotreme-specific reagents, allowing detailed analyses of metatherian and prototherian immune responses.

Construction and content
IDMM was implemented using the Python web framework Django (version 1.1) [47] with a SQLite3 (version 3.6.3) database [48]. Data can be easily updated by approved managers through a simple web interface. Once sequences are added, they are automatically matched to HGNC names and GO terms through a BLAST search. Amino acid sequences are additionally searched against the Conserved Domain Database (CDD) [49] to create protein domain annotations. Sequences are stored in FASTA format and are identified by their sequence header description which includes the gene name and species name.
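To illustrate how a FASTA header can carry the gene name and species name that identify each sequence, the following minimal Python sketch parses FASTA text into simple records. The `>GENE|species` header layout, the `SequenceRecord` class, the `parse_fasta` function and the example residues are all assumptions made for illustration; the actual IDMM header layout and storage code are not described here.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SequenceRecord:
    gene: str       # gene symbol taken from the header
    species: str    # common species name taken from the header
    sequence: str   # residues with line breaks removed


def parse_fasta(text: str) -> Iterator[SequenceRecord]:
    """Yield one record per FASTA entry.

    Assumes a hypothetical header layout '>GENE|species'; the real
    IDMM header layout is not specified in the text.
    """
    gene, species, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if gene is not None:
                yield SequenceRecord(gene, species, "".join(chunks))
            fields = line[1:].split("|")
            gene = fields[0].strip()
            species = fields[1].strip() if len(fields) > 1 else ""
            chunks = []
        elif gene is not None:
            chunks.append(line)
    if gene is not None:
        yield SequenceRecord(gene, species, "".join(chunks))


# Placeholder entries; the residues are invented, not real sequences.
example = """>IL2|tammar wallaby
MYKMQLLSCIALTLVLVANS
APTSSSTKET
>CD4|platypus
MNRGVPFRHLLLVLQLALLP
"""
records = list(parse_fasta(example))
```

Keeping the gene and species in the header means a record remains self-describing when it is exported or used as a BLAST subject, which fits the header-based identification described above.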

Sequence annotation
All sequences were defined by their species name, a gene symbol, the method of identification and sequence type (nucleotide or amino acid). To facilitate the retrieval of genes associated with specific immune roles, we categorized genes based on nine functional terms. These include the broad categories of humoral and cellular immunity and components of the innate (inflammation and complement system) and adaptive (antigen processing and presentation, and phagocytosis) immune responses, as well as genes with regulatory functions such as chemokines and transcription factors. To provide additional sequence-based and functional information, all sequences were automatically annotated upon submission to the database. Automatic annotation was performed by searching the human SWISS-PROT [53] database at NCBI [54] with the submitted sequences using the network BLAST client (netblast) [10]. This resulted in the association of sequences with the official human gene names [55], GO terms [56], and, for protein sequences, domain names. The accession of the best hit from each BLAST search was retrieved and, if the E-value was less than 1e-3, matched to a list of pre-generated accession-specific tags. These tags were linked to human gene names and gene ontology annotations using Entrez Gene data [57].
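The best-hit annotation step can be sketched in Python as follows. This is a minimal illustration, not the IDMM code: it assumes the BLAST results are available in the standard 12-column tab-delimited format (query in column 1, subject accession in column 2, E-value in column 11), and the function names, accessions (`ACC0001` and so on), gene symbol and tag dictionary are hypothetical. Only the 1e-3 E-value cutoff comes from the text.

```python
import csv
import io
from typing import Dict, Tuple

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def best_hits(tabular: str) -> Dict[str, Tuple[str, float]]:
    """Return {query: (subject accession, E-value)} for the lowest E-value hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, e_value = row[0], row[1], float(row[10])
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return best


def annotate(tabular: str, tags: Dict[str, dict]) -> Dict[str, dict]:
    """Attach pre-generated tags to queries whose best hit passes the cutoff."""
    annotations = {}
    for query, (accession, e_value) in best_hits(tabular).items():
        if e_value < E_VALUE_CUTOFF and accession in tags:
            annotations[query] = tags[accession]
    return annotations


# Made-up hits: seqA has two hits, seqB only a weak one.
rows = [
    ["seqA", "ACC0001", "45.0", "120", "60", "2", "1", "120", "5", "124", "1e-30", "110"],
    ["seqA", "ACC0002", "38.0", "110", "65", "3", "1", "110", "9", "118", "1e-12", "70"],
    ["seqB", "ACC0003", "25.0", "60", "45", "1", "1", "60", "2", "61", "0.5", "28"],
]
tabular = "\n".join("\t".join(r) for r in rows)

# Hypothetical accession-specific tags (gene symbol and GO identifiers).
tags = {
    "ACC0001": {"symbol": "GENE1", "go": ["GO:0006955"]},
    "ACC0003": {"symbol": "GENE3", "go": []},
}
result = annotate(tabular, tags)
```

In this sketch `seqA` inherits the tags of its best hit, while `seqB` stays unannotated because its best E-value (0.5) does not pass the cutoff.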

User interface
Users can interrogate the database and retrieve gene sequences through a variety of simple query tools. The search interface spans three webpages. From the main page, users can query the database through keyword, organism name, human gene name, protein domain name and by the method through which sequences were obtained (Figure 1). A link exists to a GO term browser where terms can be examined in a tree structure that supports the natural relationships between GO terms (Figure 2). Finally, the BLAST program is implemented for users to search against sequences in the database.

Search by curated gene symbols
All gene names determined through annotation can be browsed. All sequences have been annotated with a gene name based on the human gene symbol, with the exception of lineage-specific expansions, such as NKC genes and MHC genes. Characterized species-specific expansions (i.e. without human one-to-one orthologs) are labelled using the gene family name followed by a unique set of numeric identifiers.

Keyword search
A simple keyword search permits users to query the database using any string of characters from any description line in FASTA sequences, human gene descriptions and GO names. All FASTA descriptions contain the common name of the species from which the sequences were derived. In addition to terms present in the FASTA header description, users may also search terms generated by automatic annotation, which include the full gene name (in addition to the HGNC symbol) and GO terms. Only sequences of high similarity (E-value < 1e-3) to human genes were automatically annotated. Two keyword searches are available: one for an exact but case-insensitive match in sequence headers only, and one which matches all terms containing the keyword from all associations, including, for example, GO descriptions.
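The difference between the two keyword searches can be sketched in Python as below. This is an illustration under stated assumptions rather than the IDMM implementation: "exact" is interpreted here as a case-insensitive whole-word match in the header, and the `Entry` class, the function names and the example entries are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    header: str                      # FASTA header description
    gene_description: str = ""       # human gene description from annotation
    go_names: List[str] = field(default_factory=list)


def search_headers(entries: List[Entry], keyword: str) -> List[Entry]:
    """Case-insensitive whole-word match, restricted to sequence headers."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [e for e in entries if pattern.search(e.header)]


def search_all(entries: List[Entry], keyword: str) -> List[Entry]:
    """Case-insensitive substring match across headers and all annotations."""
    needle = keyword.lower()
    hits = []
    for e in entries:
        fields = [e.header, e.gene_description] + e.go_names
        if any(needle in f.lower() for f in fields):
            hits.append(e)
    return hits


# Made-up entries for illustration.
entries = [
    Entry("IL2 tammar wallaby", "interleukin 2", ["immune response"]),
    Entry("IL21 platypus", "interleukin 21", ["cytokine activity"]),
    Entry("CD4 opossum", "CD4 molecule", ["T cell activation"]),
]
header_hits = search_headers(entries, "il2")
broad_hits = search_all(entries, "il2")
```

With these entries, the header-only search for "il2" returns just the IL2 record, whereas the broad search also returns IL21 because the keyword occurs inside that term.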

Search by sequence identification method
Sequence retrieval via the initial sequence identification method (e.g. BLAST) allows simple discrimination between expressed and predicted genes. It is important to note that while chained high-scoring BLAST alignments may provide more sequence information, the predicted sequence may not be identical to the actual transcribed sequence. We have also provided information on the identification method in each sequence label.

Search by HGNC gene symbols
To facilitate the retrieval of marsupial and monotreme homologs to human genes, a list of human gene symbols is available for browsing. We queried marsupial and monotreme database sequences against all human proteins and linked the best hits based on the E-value. The resultant annotations are, in effect, reciprocal best hits of predicted genes. By comparison of gene symbols, users can rapidly determine the accuracy of an ortholog assignment. This strategy provides a measure of the level of confidence in the assigned gene name.
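The symbol comparison described above can be illustrated with a short Python sketch that checks whether a curated gene symbol agrees with the symbol of the best human BLAST hit. The function name, the three result labels and the example records are assumptions introduced for illustration; they are not part of the IDMM interface.

```python
from typing import Optional


def symbol_agreement(curated: str, best_hit: Optional[str]) -> str:
    """Classify a record as 'match', 'mismatch' or 'unannotated'."""
    if best_hit is None:
        # No human hit passed the annotation threshold.
        return "unannotated"
    same = curated.strip().upper() == best_hit.strip().upper()
    return "match" if same else "mismatch"


# (label, curated symbol, best-hit human symbol); all values are made up.
examples = [
    ("seq1", "IL2", "IL2"),
    ("seq2", "CD8A", "CD8B"),
    ("seq3", "FAMILY1-01", None),
]
report = {label: symbol_agreement(c, b) for label, c, b in examples}
```

A "match" supports the ortholog assignment, while a "mismatch" or "unannotated" result suggests the curated name should be checked by hand, for example when the gene belongs to a lineage-specific expansion.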

Search by conserved protein domains
To facilitate rapid identification of gene family members, users can search for sequences based on annotated protein functional units from the Conserved Domain Database (CDD). CDD names can be browsed as a list or through hyperlinks in a 'tag cloud'. Conserved domain annotations are only available for amino acid sequences.

Search based on GO terms
Users may interrogate biological and molecular functional processes and structural components through a GO browser (Figure 2). The browser follows the tree-like hierarchy of GO data by linking general terms to specific terms. The GO terms associated with monotreme and marsupial sequences are inferred through sequence similarity to human Entrez Gene annotations. For each term, the number of associated database sequences is shown after the GO name. By clicking on this number, users can extract all associated sequences. Note that Entrez GO annotations often omit higher-level terms, so the number of genes shown for a broad category may be an underestimate. It is therefore advisable to browse through GO child terms as well.
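The undercounting of broad GO categories can be illustrated with a small Python sketch that compares direct counts with counts obtained after propagating each annotation to its ancestor terms. The toy hierarchy, the sequence labels and the function names are assumptions made for illustration; the sketch does not describe how the IDMM browser computes its counts.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy fragment of a GO-like hierarchy: child term -> parent terms.
PARENTS: Dict[str, List[str]] = {
    "immune response": ["immune system process"],
    "innate immune response": ["immune response"],
    "adaptive immune response": ["immune response"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all terms reachable by following parent links upwards."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def count_sequences(direct: Dict[str, List[str]],
                    parents: Dict[str, List[str]],
                    propagate: bool) -> Dict[str, int]:
    """Count sequences per term, optionally including ancestor terms."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for seq, terms in direct.items():
        for term in terms:
            members[term].add(seq)
            if propagate:
                for parent in ancestors(term, parents):
                    members[parent].add(seq)
    return {term: len(seqs) for term, seqs in members.items()}


# Made-up direct annotations, as inherited from a best human hit.
direct = {
    "seq1": ["innate immune response"],
    "seq2": ["adaptive immune response"],
    "seq3": ["immune response"],
}
direct_counts = count_sequences(direct, PARENTS, propagate=False)
propagated_counts = count_sequences(direct, PARENTS, propagate=True)
```

In this toy example, "immune response" has one directly annotated sequence but three once child terms are taken into account, which is why browsing the child terms gives a more complete picture of a category.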

Search through the BLAST interface
Users can direct BLAST queries against the sequence database. Users can perform nucleotide, translated nucleotide and protein searches. Results are presented in standard BLAST text output format.

Sequence retrieval
With the exception of BLAST searches, sequences are viewed through a standard retrieval interface. FASTA headers uniquely identify each sequence. In addition to retrieving sequences individually, users may choose to retrieve all identified sequences at once. Users can also fetch all associated annotations for each sequence. An option to display amino acid or nucleotide sequences is available.

Conclusion
Targeted search strategies for immune genes and gene families have led to the annotation of previously unidentified marsupial and monotreme genes in the recent genome assemblies of the opossum, tammar wallaby and platypus. Genes involved in immunity are generally poorly annotated in genome assemblies due to their high rate of sequence divergence and gene duplication. This high sequence divergence of marsupial and monotreme immune genes also renders them difficult to isolate with classical laboratory techniques. IDMM provides easy access to marsupial and monotreme immune sequences. It hosts a catalogue of novel and integrated sets of published genes, searchable through a fast, simple-to-use interface. The availability of these sequences will facilitate the development of species-specific immunological reagents, enabling accurate studies of immune responses in these species. This database will be useful for comparative studies of immunity.

Availability and requirements
IDMM is publicly available at http://hp580.angis.org.au/tagbase/gutentag/.

Authors' contributions
EW sourced and identified the sequences, designed and implemented the database and web interface. AP, KB and EW conceived the concept. EW wrote the manuscript and KB and AP edited the manuscript. All authors read and approved the final manuscript.