Research article | Open | Published:
Classification of dendritic cell phenotypes from gene expression data
BMC Immunologyvolume 12, Article number: 50 (2011)
The selection of relevant genes for sample classification is a common task in many gene expression studies. Although a number of tools have been developed to identify optimal gene expression signatures, they often generate gene lists that are too long to be exploited clinically. Consequently, researchers in the field try to identify the smallest set of genes that provide good sample classification. We investigated the genome-wide expression of the inflammatory phenotype in dendritic cells. Dendritic cells are a complex group of cells that play a critical role in vertebrate immunity. Therefore, the prediction of the inflammatory phenotype in these cells may help with the selection of immune-modulating compounds.
A data mining protocol was applied to microarray data for murine cell lines treated with various inflammatory stimuli. The learning and validation data sets consisted of 155 and 49 samples, respectively. The data mining protocol reduced the number of probe sets from 5,802 to 10, then from 10 to 6 and finally from 6 to 3. The performances of a set of supervised classification models were compared. The best accuracy, when using the six following genes --Il12b, Cd40, Socs3, Irgm1, Plin2 and Lgals3bp-- was obtained by Tree Augmented Naïve Bayes and Nearest Neighbour (91.8%). Using the smallest set of three genes --Il12b, Cd40 and Socs3-- the performance remained satisfactory and the best accuracy was with Support Vector Machine (95.9%). These data mining models, using data for the genes Il12b, Cd40 and Socs3, were validated with a human data set consisting of 27 samples. Support Vector Machines (71.4%) and Nearest Neighbour (92.6%) gave the worst performances, but the remaining models correctly classified all the 27 samples.
The genes selected by the data mining protocol proposed were shown to be informative for discriminating between inflammatory and steady-state phenotypes in dendritic cells. The robustness of the data mining protocol was confirmed by the accuracy for a human data set, when using only the following three genes: Il12b, Cd40 and Socs3. In summary, we analysed the longitudinal pattern of expression in dendritic cells stimulated with activating agents with the aim of identifying signatures that would predict or explain the dentritic cell response to an inflammatory agent.
Genome-wide screening of expression profiles has provided a broad perspective on gene regulation in health and disease. Gene expression is controlled over a wide range through complex interplay between DNA regulatory proteins, microRNA molecules and epigenetic modifications determining transcript production [1–3]. For example, gene expression profiles in mouse dendritic cells (DCs) in response to microbial organisms and their components have been studied using a functional genomics approach and the molecular patterns involved in DCs activation have been determined [4–7]. However, the high-dimensionality inherent in genome-wide analyses makes it difficult to extract biologically useful information from gene expression data. Early attempts at genome-wide expression analysis used unsupervised methods to identify groups of genes or conditions with similar expression profiles [8–10]; the observation that functionally related or co-regulated genes often cluster together was used to provide biological insight. Classification studies in the field of microarray analysis have become important for the development of diagnostic tests. One of the most common approaches for supervised classification is binary classification, which distinguishes between two types of phenotype: positive, for example compound A-treated samples, and negative, often control or compound B-treated samples. A collection of samples with known type labels is used to train a classifier that is then used to classify new samples. For example, the supervised classification models Support Vector Machines , Classification Trees  and Artificial Neural Networks  have led to the generation of functional gene signatures for haematological malignancies [8, 14–16], and for the identification of molecular markers that provide accurate diagnosis, prognosis and selection of treatment regimens for human diseases [17–20]. These methods are able to identify genes and, consequently gene networks, associated with particular phenotypes. More recently, supervised classification models combining cross validation and heuristic search strategies have been used to discover optimal expression signatures in cancer [21–23]. However, despite the number of classification methods that have been developed for this kind of knowledge extraction, such knowledge has not yet been widely used in diagnostic or prognostic decision-support systems . This is partly due to the variability of the results obtained  and also to the different data sets used [25, 26].
Few methods have been used to identify specific expression signatures that could contribute to the molecular diagnosis of inflammatory-based diseases. The Random Forests method has been used to generate a 44-gene signature in DCs to distinguish between inflammatory and non-inflammatory stimuli, but this gene signature is too large for clinical exploitation . Here, we report a data mining protocol developed through the analysis of a database generated from microarray experiments with DCs exposed to various stimuli able to induce cell activation. This protocol allowed the selection of a small set of genes which were subsequently used by supervised classification models to make inferences concerning the inflammatory state of the samples.
The Knowledge Extraction Protocol (KEP), depicted in Figure 1, was used to select relevant probe sets (genes) and to train supervised classification models to discriminate between "inflammatory" and "not inflammatory" phenotypes of DCs.
Mouse data: two microarray data sets, namely the Learning Data Set and the Validation Data Set, were defined. The Learning Data Set included the results obtained from microarray experiments performed with: Affymetrix MGU74Av2 arrays (89 samples - 9 different stimuli) , Affymetrix MOE430A arrays (44 samples - 4 different stimuli) and MOE430A 2.0 arrays (22 samples - 2 different stimuli). The Validation Data Set the results of microarray experiments performed with: Affymetrix MGU74Av2 arrays (43 samples - 6 different stimuli)  and MOE430A 2.0 arrays (6 samples - 1 stimulus; this stimulus is the only one that was not with the DC cell line D1 , but used bone marrow-derived DCs (BMDC) ).
The differences in array formats required the data to be standardised. GeneChip Mouse Expression 430 (MOE430A 2.0) is the latest version of Affymetrix mouse arrays and contains 22,600 probe sets. All the probe sets of the MOE430A array are included in the MOE430A 2.0 array. The older mouse array, MGU74Av2, contains 12,488 probe sets that only partially match the probe sets of its more recent releases. Affymetrix provides "best match" probe set tables which allow the mapping of equivalent probe sets between different array releases.
The following pre-processing steps were performed: a) Probe set best matching between MOE430A and MGU74Av2. This resulted in 8,904 probe sets, also included in the MOE430A 2.0 array; b) Probe set filtering based on Affymetrix grading A annotation. This step retained 8,349 probe sets out of the 8,904 available; c) Probe set filtering based on expression signals. Every probe set whose expression signal was below 100 was discarded, such that 5,802 probe sets of the 8,349 available were retained; d) per sample Z-score computation.
The pre-processing procedure generated the Pre-processed Learning Data Set, which consisted of 155 samples (15 different stimuli), and the Pre-processed Validation Data Set, which consisted of 49 samples (7 different stimuli). Both data sets contained the same 5,802 probe sets. The class counts for the two data sets are summarised in Table 1 and the detailed list of the experiments and array types is reported in Additional file 1.
Feature selection involves the identification and removal of non significant features. The probe sets which provide no information helping to discriminate between "inflammatory" and "not inflammatory" states of the samples are thereby removed from the analysis.
The Weka software environment was used for feature selection . The feature selection task was performed through an ADTree-based wrapper schema (default parameter values) applied to the Pre-processed Learning Data Set. This step selected an expression signature of ten probe sets (Table 2) from among the initial 5,802, which generated the Selected Features Learning Data Set.
Model Training and Performance Estimation
This task, implemented through the Weka software environment, used the Selected Features Learning Data Set to train, evaluate and compare the performance of the following supervised classification models: ZeroR, IB-3, C4.5, Logistic, Multi Layer Perceptron (MLP), Naïve Bayes (NB), Random Forest (RF), Support Vector Machines (SMO-puk) and Tree Augmented Naïve bayes (TAN).
These models were chosen because they are state-of-the-art for solving supervised classification problems. ZeroR uses the majority criteria to classify a sample, i.e. it classifies each sample according to the majority of the class distribution. The weighted averages, estimated through ten repeated 10-fold cross validations, of the following performance measures are reported in Table 3: Precision, Recall, F-measure, ROC and Accuracy. ZeroR was used as the baseline measure of performance, and the performance of the other models was assessed from ROC values: the ROC values were 97.5% for each C4.5, 100% for MLP 99.9% for IB-3 99.8% for RF, 99.0% for SMO-puk, and 99.2% for TAN, and 98.6% for both Logistic and NB. However, using accuracy to compare the supervised classification models, a different picture is obtained. The model with the highest accuracy value was RF (99.1%). The other accuracy values were 98.6% for both SMO-puk and MLP, 98.1% for IB-3, 96.3% for both TAN and C4.5, 95.5% for Logistic and 94.2%, the lowest value, for NB.
Supervised classification models, which generate the selected gene expression signature, need to be able to classify data sets other than the one they were trained on if they are to be useful. Therefore, the performance of the supervised classification models was evaluated by exploiting the Selected Features Validation Data Set (Table 4). The Bayesian models, NB (93.0%) and TAN (92.8%), attained the highest ROC values and both IB-3 (92.6%) and C4.5 (91.2%) gave good ROC values. However, the ROC values were substantially lower for RF (89.6%), MLP (88.1%), SMO-puk (86.7%) and Logistic (86.6%). The ZeroR model gave an ROC value of 50% confirming, as was expected, that it behaves like a random guessing model. A different picture emerged when the accuracy performance measure was used. Indeed, the best accuracy value (93.9%) was for C4.5 and RF. The accuracy value for the TAN model was 91.8% and that for SMO-puk was 89.8%. The accuracy values were lower for NB (87.8%), IB-3 (85.7%) and Logistic (81.6%). The model with the worst accuracy value was MLP (77.6%).
Functional Gene Selection
The annotations of the ten selected genes (Table 2) indicate that four, namely Socs3, Irgm1, Il12b and Cd40, are associated with known immune-related functions. Expression of six of the ten selected genes differs between the "non inflammatory" and "inflammatory" classes with an absolute Log2 FoldChange (LogFC) greater than 1. A heatmap (Figure 2) was established for the LogFC of the average signal intensities of the selected genes for the "non inflammatory" and "inflammatory" experiments, calculated on the median expression value for that gene. Il2b and Socs3 are up-regulated with LogFC values of 4.1 and 2.7, respectively. Irgm1, Plin2, Lgals3bp and Smarcc1 are down-regulated with LogFC values of -1.1, -5.6, -2.7 and -2.9, respectively in the samples induced with inflammatory stimuli. The remaining four genes, namely Cd40, Dock5, Rnf34 and Rab24, show a level of up-regulation or down-regulation resulting in a value of LogFC which is smaller than 1. To characterize the selected gene expression signature further, the ten genes were examined with Ingenuity® Pathway Analysis (IPA) software and the Ingenuity® Knowledge Base (IKB). The IPA software was queried to find the biological interactions (direct and indirect) among the ten genes. The top network retrieved (IPA score equal to 16), depicted in Figure 3, contains six genes of the selected gene expression signature (grey nodes in Figure 3) and 25 further genes (white nodes in Figure 3) that were added by the IKB to build the network. The biological functions associated with this network are the following: Cellular Growth and Proliferation, Haematological System Development and Function, Humoral Immune Response.
The molecular and cellular functions of the genes included in the selected gene expression signature were analysed with IPA (Table 5). This identified the Infection Mechanism to be the top function related to "Diseases and Disorders", the Cellular Growth and Proliferation to be the top function related to "Molecular and Cellular Functions" and the Haematological System Development and Function to be the top function related to "Physiological System Development and Function".
A smaller set of genes (Table 6) was obtained by removing those genes not included in the IPA top network (Figure 3). The performances of the classification models which exploit this reduced set of genes on the Selected Features Validation Data Set are reported in Table 7. The ROC values of RF, MLP, SMO-puk and IB-3 were not significantly affected by the functional gene selection step. However, the ROC values for NB, TAN and C4.5 increased whereas that for Logistic decreased. The accuracy values of TAN, SMO-puk and NB were not affected by the functional gene selection step; they increased from 85.7% to 91.8% for IB-3, from 77.6% to 81.6% for MLP and from 81.6% to 83.7% for Logistic, but decreased from 93.9% to 85.7% for both C4.5 and RF. The heatmap in Figure 4 shows the modulation of the six genes in the Selected Features Validation Data Set. Il2b, Socs3 and Cd40 were up-regulated in the Selected Features Validation Data Set also; with Cd40 being up-regulated (LogFC = 4.5) in the Selected Features Validation Data Set in comparison with the Selected Features Learning Data Set (LogFC = 0.45). Furthermore, Irgm1 was up-regulated (LogFC = 1.9) in the Selected Features Validation Data Set but down-regulated in the Selected Features Learning Data Set (LogFC = -5.6). Plin2, Lgals3bp and Smarcc1 were not modulated in the Selected Features Validation Data Set but were down-regulated in the Selected Features Validation Data Set (Figure 2). The best classification models, i.e. IB-3 and TAN, misclassified four of the 49 samples belonging to the Selected Features Validation Data Set. One sample was genuinely allocated to the wrong group, whereas two were known to be labelled with the wrong class and one was known to be an outlier.
Reducing the number of genes from ten to six on the basis of the information derived from the top network generated by IPA gave satisfactory accuracy values. Therefore, a further Functional Gene Selection step was performed. Three of the selected genes were directly linked to each other in the IPA top network: Cd40, Il12b and Socs3 (Figure 5). The results of the Validation task, when only the above genes were used, are reported in Table 8. The model that giving the best accuracy value was SMO-puk (95.9%). The second best accuracy value (91.8%) was with IB-3 and NB. Logistic and TAN gave the same, satisfactory, accuracy value (89.8%). That for MLP was 87.8% and the lowest value (85.7%) was for C4.5 and RF. The best model, i.e. SMO-puk, misclassified two of the 49 samples. These samples were those known to be labelled in the wrong class. These findings confirm that the three genes are sufficient for correct classification of all the samples of the Selected Features Validation Data Set.
A 3-gene signature associated with inflammation in Human Dendritic Cells
Human Data. To test the general applicability of the proposed protocol, Affymetrix HGU133A gene expression microarray data for 27 human samples (corresponding to nine time series) was used to validate the performance of the 3-gene signature classifiers, also in human dendritic cells. A data set for human monocyte-derived dendritic cells treated with Mycobacteria tuberculosis was derived from a previous study  and tested (Table 9). All the supervised classification models, with the exception of IB-3 and SMO-puk, achieved an accuracy of 100% indicating that the 3-gene signature selected on mouse DCs indeed corresponds to a general signature of inflammation in dendritic cells in both human and mouse systems. Therefore, we suggest CD40, Il12b and Socs3 can be considered to be the master genes of inflammation and activation in DCs.
In this study, we used advanced supervised analysis to derive specific transcriptional signatures from differentially activated DCs and assessed whether this molecular signatures can define DCs phenotypes in vitro. DCs form the connection between innate and adaptive mechanisms of the immune system. Studies in mice have demonstrated that cellular vaccination with antigen-bearing DCs is efficient in stimulating antigen-specific T cell responses. Because of the immune-regulating functions of DCs, the therapeutic use of DCs in medicine to control immune responses is an attractive strategy. DCs are indeed regarded as a powerful tool for anti-cancer immunotherapy . In addition, to treat patients suffering from autoimmune or inflammatory diseases, it is desirable to downregulate immune responses in an antigen-specific or a tissue-specific manner without causing systemic immunosuppression. Moreover, graft-versus-host disease (GVHD) and graft rejection are the most serious problems in transplantation medicine, and control of alloreactive immune responses is the key to overcoming these problems. Therefore, antigen-specific negative regulation by DCs with immunosuppressive function is considered to be a promising treatment method also in the field of transplantation medicine [32, 33]. In summary, a number of studies describe the generation of DCs from sources aiming at cell therapy [34, 35]. Nevertheless, no methods exist today to test quality of the cell type generated. Therefore, a molecular test that could confirm DCs quality before their use in clinic will provide valuable information into the field of DCs therapies.
The problem of sample classification via gene signatures derived from transcriptional profiling has received increasing attention in the context of DNA microarrays. We used various aspects of the evaluation of gene selection approaches by combining the analysis of different markers of performance. First, we selected a list of genes, from whole-genome profiling of DCs, able to discriminate DC activation state. Second, to reduce the bias due to the classification model, we estimated different parameters through optimisation on an independent validation data set.
The Knowledge Extraction Protocol (KEP) (Figure 1) selected ten genes that, on the Selected Features Validation Data Set, discriminated between "inflammatory" and "not inflammatory" stimuli with an accuracy of 93.9% for C4.5 and RF and of 91.8% for TAN.
Six of the ten genes selected were modulated in the Selected Features Learning Data Set between the "not inflammatory" and "inflammatory" classes with an absolute Log2FoldChange (LogFC) greater than 1. The heatmap of the selected genes is shown in Figure 2 and revealed that two of them were up-regulated and four were down-regulated. Il2b, Socs3 and Cd40 were up-regulated (Figure 4) also in the Selected Features Validation Data Set; notably, Cd40 was up-regulated (4.5 LogFC) in the inflammatory state samples of the Selected Features Validation Data Set, compared to 0.45 LogFC in the Selected Features Learning Data Set. Plin2, Lgals3bp and Smarc1 were not substantially modulated in the Selected Features Validation Data Set and were down-regulated in the Selected Features Learning Data Set. Modulation of these selected genes should be further investigated biologically to validate these findings.
KEP misclassified four of the 49 samples of the Selected Features Validation Data Set; one sample was derived from D1 cells treated with the Listeria monocytogenes EGD for 4 h replicate A, and three samples from D1 treated with the Listeria innocua 0 h replicates A and B and 8 h replicate A. The two time 0 h samples of the Listeria innocua experiment were known to be mislabelled, and the sample 8 h was found to be an outlier. Hierarchical clustering analysis of the samples from this Listeria monocytogenes EGD experiment did not show any anomaly that might provide an explanation for the misclassification (data not shown). Remarkably, in the Selected Features Validation Data Set, samples from experiments involving cells from different sources (e.g. bone-marrow derived DCs) were not misclassified. This suggested that the KEP presented in this work may discriminate inflammatory signatures for DCs from diverse sources.
Several methods, including traditional statistical techniques and state of the art computer-intensive methodologies, have been investigated to predict inflammatory signatures in DCs. Activation of DCs with LPS and with IFN-β have been shown to generate cells prone to produce Th1 attractants that are effective for adoptive immune cancer therapy [36, 37]. It has been also demonstrated that DCs exposed to supernatants derived from tumours treated with some cytotoxic drugs are capable to modulate co-stimulatory markers and to trigger T cell responses . A 44-gene signature in DCs, able to discriminate between different functional states, is described in . Here, we report a significant improvement over the previous work by reducing the number of genes in the signature and by testing their performance with DCs derived from different hosts, namely mouse and human. We selected a signature of inflammation based on the expression of ten genes and demonstrated that this list could be further reduced to three genes without significantly affecting the classification performance. The three genes, namely CD40, Il12b and Socs3, can thus be considered to be the master genes of activation/inflammation in DCs. CD40 mediates a broad variety of immune and inflammatory responses, and the ligand-receptor interaction is responsible for immune activation; Il12b is a part of the IL12 cytokine complex, a cytokine that acts on T and natural killer cells, and has a broad range of biological activities, the most important being the induction of Th1 cells development; the Socs3 gene encodes a member of the STAT-induced STAT inhibitor (SSI) family, also known as the suppressor of cytokine signalling (SOCS) family. SSI family members are cytokine-inducible negative regulators of cytokine signalling [39–42]. Therefore, the regulation of these genes in concert in DCs suggests that they may serve as molecular markers of inflammation/activation both in human and murine DCs.
Experimental and bioinformatics strategies of this type may be used to improve treatment decisions for other inflammatory contexts, particularly chronic diseases. The whole-genome approach holds the promise to define the DCs functional quality that results in a better prediction of the stimulatory capacity of the cells. This approach may become a powerful strategy in personalised medicine.
The Knowledge Extraction Protocol (Figure 1) is based on Data Mining (DM) [43, 44] and consists of the following tasks; Data Selection, Pre-processing, Feature Selection, Model Training and Performance Estimation, Validation and Functional Gene Selection.
Mouse data: all time-series experiments of the Learning Data Set used the murine cell line D1  treated for 0, 2, 4, 8, 12 and 24 hours with "inflammatory" (CpG, Shistosomula eggs, LPS, Leishmania promastigote, Zymosan, polyIC, Listeria monocytogenes, Listeria innocua, Bordetella pertussis, Bordetella parapertussis, Lactobacillus paracasei, Lactobacillus lactis) and "not inflammatory" stimuli (Shistosomula SLA, Leishmania amastigote, dexamethasone) . The Validation Data Set includes experiments performed with "not inflammatory" (cholera toxin) and "inflammatory" stimuli (Listeria monocytogenes EGD-e, EGD-d, EGD-p, Listeria innocua, LPS). Time 0 hours experiments were labelled as "not inflammatory". All the experiments were performed with D1 cells, with the exception of the LPS time series that was produced with bone marrow-derived murine DCs . Most experiments were done on biological duplicates. Total RNA was extracted, labelled and hybridized to an Affymetrix GeneChip® as described in .
Human Data: the human dataset used for the validation for human DCs was obtained from a previous study . Briefly, human DCs were differentiated from human circulating monocytes and treated with M. tuberculosis H37Rv at multiplicity of infection of 1 for 4, 18 and 48 h. Total RNA was extracted, labelled and hybridised to a Human U133A Affymetrix GeneChip® as described in .
For all the arrays, both with human and mouse sets, signal summarisation was performed using the Affymetrix GeneChip Operating Software® (GCOS) and the MicroArray Suite version 5 (MAS 5.0) algorithm with scaling intensity target set to 100.
Mouse Data: three kinds of arrays (Affymetrix® MOE430 2.0, MOE430A 2.0 and MGU74Av2) were used. All probe sets represented on the GeneChip® MOE430A (22,690 probe sets) are included on the GeneChip® MOE430A 2.0 array; the MG-U74Av2 array contains different probe sets (12,488 probe sets). The probe sets associated with the MOE430A, MOE430A 2.0 and MG-U74Av2 arrays mapped with the "mgu74v2_vs_mouse430_best_match" annotation table from Affymetrix http://www.affymetrix.com/support/technical/comparison_spreadsheets.affx?pnl=1_2#1_2. Only the probe sets associated with the Affymetrix annotation Grade "A" were retained (8,349 probe sets). The pre-processing task removes from the Learning Data Set/Validation Data Set those probe sets associated with high levels of noise, and labels samples as inflammatory or not inflammatory and thus generates the Pre-processed Learning Data Set/Pre-processed Validation Data Set. The noisy probe sets are removed by using the probe set filter procedure which selects a probe set in the case where its signal exceeds 100 for at least two samples. The pre-processed data sets consisted of 5,802 features (probe sets). Note that the pre-processing task transforms the Learning Data Set/Validation Data Set in such a way that each measurement of a probe set, associated with a given point in time, becomes an observation in the corresponding Pre-processed Learning Data Set/Pre-processed Validation Data Set. The Pre-processed Learning Data Set consisted of 155 cases (15 stimuli, 30 time series) and the Pre-processed Validation Data Set consisted of 49 cases (7 stimuli, 12 time series). The counts of the class variables are reported in Table 1. Intensity data was used to compute per-sample Z-score.
Human Data: Affymetrix NetAffx tool http://www.affymetrix.com/index.affx was used to retrieve all human corresponding orthologous probe sets for Cd40, Il12b and Socs3 from the Affymetrix® GeneChip® HGU133 A array. In case of multiple probe sets for the same gene, as was the case for Cd40, we chose the most similar in gene sequence mapping between the human and mouse genomes. Intensity data was used to compute per-sample Z-scores. The human dataset resulted from three probe sets and 27 samples (1 stimulus, 9 time series) all labeled as "inflammatory".
KEP performs the feature selection task through the ADTree algorithm  applied to the Pre-processed Learning Data Set. The Weka software environment, Ver. 3.5.6 , was used with 10-fold cross validation to obtain the Selected Features Learning Data Set.
Model Learning and Performance Estimation
Model Learning and Performance Estimation, applied to the Selected Features Learning Data Set, is concerned with the training and estimation of the classification performance of the following DM models; ZeroR, Nearest Neighbour, C4.5, Logistic, Multi Layer Perceptron, Naïve Bayes, Random Forest, Support Vector Machines and Tree Augmented Naïve Bayes. ZeroR uses the majority criteria to classify a sample, i.e. it classifies each sample according to the majority of the class distribution. It is useful to provide a baseline measure of performance. Nearest Neighbour (IB-k with k = 3 and default learning parameter values) is a k nearest neighbour algorithm. C4.5 (J48 with default learning parameter values) is a decision tree, Logistic is a multinomial logistic regression model with a ridge estimator (default learning parameter values), Multi Layer Perceptron (MLP with default learning parameter values) is a feed-forward neural network. Naïve Bayes (NB default learning parameter values) is a widely used supervised classifier. Random Forest (RF with default learning parameter values) is the well-known supervised classification model from Leo Breiman. Support Vector Machines (SOM with the puk kernel and default learning parameter values) are widely used in Bioinformatics and Tree Augmented Naïve bayes (TAN with default learning parameter values) is a parsimonious version of Bayesian Networks. To evaluate and compare the quality of the DM models, the following performance measures were determined: Precision; Recall; F-measure; ROC; and Accuracy. To reduce the risk of overfitting, the n-fold cross validation schema was repeated s times. In brief, each replicate is associated with a different value of the seed responsible for the random partitioning of the Selected Features Learning Data Set. The mean values, across ten replicates (s = 10), of the performance measures estimated through the 10-fold cross validation (n = 10), are summarized in the Learning Performance Report. The minimum (min), mean (mid) and maximum (max) values of the considered performance measures are computed.
DM models were validated, through the Validation task, by exploiting the Selected Features Validation Data Set. This data set was obtained by applying the same filters as applied to the Learning Data Set to the Validation Data Set, and by using only those features which were selected through the Feature Selection task applied to the Pre-processed Learning Data Set.
Functional Gene Selection
Functional Gene Selection (network analysis) determines the biological significance of the selected gene expression signature. This task was performed using the Ingenuity Pathway Analysis (IPA) software package Ver. 8.0 and content Ver. 2802 which returns graphical representations of the molecular relationships between input molecules. The IPA GUI is exploited to perform the following actions; i) to search the corresponding object in the manually curated Ingenuity's Knowledge Base (IKB) in which the gene symbol is associated with probe set identifiers (Table 2), ii) to use the selected genes as input into the IPA Core Analysis; iii) to find direct and indirect relationships between the genes (network parameters: a) number of molecules per net equal to 35; b) 25 nets per analysis) through the analysis algorithm; iv) to edit the retrieved network (e.g. to delete peripheral nodes) and, to provide a statistical report concerning relevant pathway nets together with their functional analysis.
The significance of the association between the list of genes and the canonical pathway retrieved by IPA was assessed in two ways: i) the ratio between the number of molecules from the list that map to the pathway and the total number of molecules that map to the canonical pathway. ii) Fisher's statistic was used to compute the probability value of the null hypothesis, i.e. the probability that the association between the genes included in the list and the canonical pathway is explained by chance alone. The goal of this task is twofold: first to find an explanation for the genes which were selected by the Feature Selection task and which are included in the IPA output, and second to understand the reason why some genes, which were selected by the Feature Selection task, are not included in the IPA output. Then, the Reduced set of Features, consisting of the genes included in the IPA output and/or which are believed to be wrongly not included in the list of the selected genes is formed. The Reduced set of Features is then used to perform a new validation of DM models.
Microarray accession numbers
All microarray data are available from the ArrayExpress database http://www.ebi.ac.uk/arrayexpress/ under the following accession codes. Microarrays data accession number on MGU74Av2 arrays is E-MEXP-2715. Microarray on MOE430A and MOE430A 2.0 have the following accession numbers for reviewer:
Bordetella pertussis (ID: Reviewer_E-MEXP-3160 PW: fsEa22df), Bordetella parapertussis (ID: ID:Reviewer_E-MEXP-3156 PW: o5TTIG2b), Listeria monocytogenes (ID: Reviewer_E-MEXP-3159 PW: habdgpze), Listeria innocua (ID: Reviewer_E-MEXP-3158 PW: ojiep0qb), Lactobacillus paracasei (ID: Reviewer_E-MEXP-3157 PW: mmcbtpma), Lactococcus lactis (ID:Reviewer_E-MEXP-3162 PW: 3vvnihhg)
Arora A, Simpson DA: Individual mRNA expression profiles reveal the effects of specific microRNAs. Genome Biol. 2008, 9 (5): R82-10.1186/gb-2008-9-5-r82.
Hobert O: Gene regulation by transcription factors and microRNAs. Science. 2008, 319 (5871): 1785-1786. 10.1126/science.1151651.
Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet. 2003, 33 (Suppl): 245-254.
Foti M, Ricciardi-Castagnoli P, Granucci F: Gene expression profiling of dendritic cells by microarray. Methods Mol Biol. 2007, 380: 215-224. 10.1007/978-1-59745-395-0_13.
Torri A, Beretta O, Ranghetti A, Granucci F, Ricciardi-Castagnoli P, Foti M: Gene expression profiles identify inflammatory signatures in dendritic cells. PLoS One. 5 (2): e9404
Mortellaro A, Urbano M, Citterio S, Foti M, Granucci F, Ricciardi-Castagnoli P: Generation of murine growth factor-dependent long-term dendritic cell lines to investigate host-parasite interactions. Methods Mol Biol. 2009, 531: 17-27. 10.1007/978-1-59745-396-7_2.
Foti M, Granucci F, Pelizzola M, Beretta O, Ricciardi-Castagnoli P: Dendritic cells in pathogen recognition and induction of immune responses: a functional genomics approach. J Leukoc Biol. 2006, 79 (5): 913-916. 10.1189/jlb.1005547.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96 (12): 6745-6750. 10.1073/pnas.96.12.6745.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000, 97 (1): 262-267. 10.1073/pnas.97.1.262.
Zhang H, Yu CY, Singer B, Xiong M: Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA. 2001, 98 (12): 6730-6735. 10.1073/pnas.111153698.
Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, et al: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001, 7 (6): 673-679. 10.1038/89044.
Jelinek DF, Tschumper RC, Stolovitzky GA, Iturria SJ, Tu Y, Lepre J, Shah N, Kay NE: Identification of a global gene expression signature of B-chronic lymphocytic leukemia. Mol Cancer Res. 2003, 1 (5): 346-361.
Savage KJ, Monti S, Kutok JL, Cattoretti G, Neuberg D, De Leval L, Kurtin P, Dal Cin P, Ladd C, Feuerhake F, et al: The molecular signature of mediastinal large B-cell lymphoma differs from that of other diffuse large B-cell lymphomas and shares features with classical Hodgkin lymphoma. Blood. 2003, 102 (12): 3871-3879. 10.1182/blood-2003-06-1841.
De Vos J, Thykjaer T, Tarte K, Ensslen M, Raynaud P, Requirand G, Pellet F, Pantesco V, Reme T, Jourdan M, et al: Comparison of gene expression profiling between malignant and normal plasma cells with oligonucleotide arrays. Oncogene. 2002, 21 (44): 6848-6857. 10.1038/sj.onc.1205868.
Kononen J, Bubendorf L, Kallioniemi A, Barlund M, Schraml P, Leighton S, Torhorst J, Mihatsch MJ, Sauter G, Kallioniemi OP: Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med. 1998, 4 (7): 844-847. 10.1038/nm0798-844.
Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW: Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci USA. 2004, 101 (7): 2173-2178. 10.1073/pnas.0308512100.
Kuhn A, Goldstein DR, Hodges A, Strand AD, Sengstag T, Kooperberg C, Becanovic K, Pouladi MA, Sathasivam K, Cha JH, et al: Mutant huntingtin's effects on striatal gene expression in mice recapitulate changes observed in human Huntington's disease brain and do not differ with mutant huntingtin length or wild-type huntingtin dosage. Hum Mol Genet. 2007, 16 (15): 1845-1861. 10.1093/hmg/ddm133.
Huang X, Pan W, Grindle S, Han X, Chen Y, Park SJ, Miller LW, Hall J: A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics. 2005, 6: 205-10.1186/1471-2105-6-205.
Deb K, Raji Reddy A: Reliable classification of two-class cancer data using evolutionary algorithms. Biosystems. 2003, 72 (1-2): 111-129. 10.1016/S0303-2647(03)00138-2.
Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics. 2005, 6: 148-10.1186/1471-2105-6-148.
Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics. 2007, 23 (1): 30-37. 10.1093/bioinformatics/btl543.
Dalton WS, Friend SH: Cancer biomarkers--an invitation to the table. Science. 2006, 312 (5777): 1165-1168. 10.1126/science.1125948.
Niijima S, Kuhara S: Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE. BMC Bioinformatics. 2006, 7: 543-10.1186/1471-2105-7-543.
Inza I, Larranaga P, Blanco R, Cerrolaza AJ: Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med. 2004, 31 (2): 91-103. 10.1016/j.artmed.2004.01.007.
Winzler C, Rovere P, Rescigno M, Granucci F, Penna G, Adorini L, Zimmermann VS, Davoust J, Ricciardi-Castagnoli P: Maturation stages of mouse dendritic cells in growth factor-dependent long-term cultures. J Exp Med. 1997, 185 (2): 317-328. 10.1084/jem.185.2.317.
Zanoni I, Ostuni R, Capuano G, Collini M, Caccia M, Ronchi AE, Rocchetti M, Mingozzi F, Foti M, Chirico G, et al: CD14 regulates the dendritic cell life cycle after LPS exposure through NFAT activation. Nature. 2009, 460 (7252): 264-268. 10.1038/nature08118.
Witten IH, Frank E: Practical machine learning tools and techniques. 2005, Elsevier
Tailleux L, Waddell SJ, Pelizzola M, Mortellaro A, Withers M, Tanne A, Castagnoli PR, Gicquel B, Stoker NG, Butcher PD, et al: Probing host pathogen cross-talk by transcriptional profiling of both Mycobacterium tuberculosis and infected human dendritic cells and macrophages. PLoS One. 2008, 3 (1): e1403-10.1371/journal.pone.0001403.
Banchereau J, Palucka AK: Dendritic cells as therapeutic vaccines against cancer. Nat Rev Immunol. 2005, 5 (4): 296-306. 10.1038/nri1592.
Dhodapkar MV, Steinman RM: Antigen-bearing immature dendritic cells induce peptide-specific CD8(+) regulatory T cells in vivo in humans. Blood. 2002, 100 (1): 174-177. 10.1182/blood.V100.1.174.
Fairchild PJ, Brook FA, Gardner RL, Graca L, Strong V, Tone Y, Tone M, Nolan KF, Waldmann H: Directed differentiation of dendritic cells from mouse embryonic stem cells. Curr Biol. 2000, 10 (23): 1515-1518. 10.1016/S0960-9822(00)00824-1.
Senju S, Haruta M, Matsumura K, Matsunaga Y, Fukushima S, Ikeda T, Takamatsu K, Irie A, Nishimura Y: Generation of dendritic cells and macrophages from human induced pluripotent stem cells aiming at cell therapy. Gene Ther.
Salguero G, Sundarasetty BS, Borchers S, Wedekind D, Eiz-Vesper B, Velaga S, Jirmo A, Behrens G, Warnecke G, Knofel AK: Pre-conditioning therapy with lentivirally reprogrammed dendritic cells accelerates the homeostatic expansion of antigen-reactive human T cells in NOD.Rag1 horizontal line/horizontal line. IL-2rgammac horizontal line/horizontal line mice. Hum Gene Ther.
Stroncek DF, Jin P, Ren J, Feng J, Castiello L, Civini S, Wang E, Marincola FM, Sabatino M: Quality assessment of cellular therapies: the emerging role of molecular assays. Korean J Hematol. 45 (1): 14-22.
Jin P, Han TH, Ren J, Saunders S, Wang E, Marincola FM, Stroncek DF: Molecular signatures of maturing dendritic cells: implications for testing the quality of dendritic cell therapies. J Transl Med. 8: 4
Liu WM, Dennis JL, Fowler DW, Dalgleish AG: The gene expression profile of unstimulated dendritic cells can be used as a predictor of function. Int J Cancer.
Ma DY, Clark EA: The role of CD40 and CD154/CD40L in dendritic cells. Semin Immunol. 2009, 21 (5): 265-272. 10.1016/j.smim.2009.05.010.
O'Sullivan B, Thomas R: Recent advances on the role of CD40 and dendritic cells in immunity and tolerance. Curr Opin Hematol. 2003, 10 (4): 272-278. 10.1097/00062752-200307000-00004.
Bastos KR, Marinho CR, Barboza R, Russo M, Alvarez JM, D'Imperio Lima MR: What kind of message does IL-12/IL-23 bring to macrophages and dendritic cells?. Microbes Infect. 2004, 6 (6): 630-636. 10.1016/j.micinf.2004.02.012.
Dimitriou ID, Clemenza L, Scotter AJ, Chen G, Guerra FM, Rottapel R: Putting out the fire: coordinated suppression of the innate and adaptive immune systems by SOCS1 and SOCS3 proteins. Immunol Rev. 2008, 224: 265-283. 10.1111/j.1600-065X.2008.00659.x.
Hand D, Mannila H, Smyth P: Principles of Data Mining. 2001, Massachusets Institute of Technology
Han J, Kamber M, Pei J: Data Mining: Concepts and Techniques. 2006, Morgan Kaufmann
Freund Y, Mason L: The alternating decision tree learning algorithm. 1999, Proceeding of the Sixteenth International Conference on Machine Learning: 1999, 124-133.
Cover TM, Hart PE: Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967, 13 (1): 21-27. 10.1109/TIT.1967.1053964.
Quinlan JR: C4.5: Programs for Machine Learning. 1993, Morgan Kaufmann
Agresti A: Categorical Data Analysis. 2002, Wiley-Interscience, Second
Duda RO, Hart PE, Stork DG: Pattern Classification. 2001, John Wiley & Sons, Second
Breiman L: Random Forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
Vapnik VN: The Nature of Statistical Learning Theory. 1995, Springer-Verlag
Friedman N, Geiger D, Goldszmidt M: Bayesian Network Classifiers. Machine Learning. 1997, 29 (2): 131-163. 10.1023/A:1007465528199.
This work was supported by the grants from European Commission 6th Framework Program (AIDS-CoVAC), European Commission 7th Framework Programs: FIGH-MG 242210 and TOLERAGE 202156, the Italian Ministry of Education and Research PRIN-COFIN20077NFBH8_003. The authors wish to thank Donatella Biancolini, and Angela Papagna from the Genopolis consortium for microarray data generation. We wish to thank Prof. Raffaele Calogero for critical reading of the manuscript and for helpful discussions. The authors would like to thank L.Tailleux, and O. Neyrolles who developed the M.tuberculosis infection model and F. Granucci and I. Zanoni who generated the BMDC dataset.
GTF participated in data collection, pre-processing and participated in writing the manuscript. VV analyzed the data. PRC participated in generating the microarray data. FZ participated in the design of the study, data pre-processing and writing the manuscript. FS designed the KEP protocol, and participated in writing the manuscript. MF participated in designing the study, coordinating the study, generating microarray data and writing of the manuscript. All authors have read and approved the final version of the manuscript.
Giacomo Tuana, Viola Volpato contributed equally to this work.