Gene expression trees in lymphoid development

Background: The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. The lymphoid system is particularly well studied in this respect, owing to its importance both for basic biology and for clinical applications. Gene expression measured in lymphoid cells across several distinguishable developmental stages helps to elucidate the underlying molecular processes, which change gradually over time and lock cells into the B cell, T cell, or Natural Killer cell lineage. Large-scale analysis of these gene expression trees requires computational support for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes.

Results: We present the first statistical framework designed to analyze gene expression data as it is collected in the course of lymphoid development, through clusters of co-expressed genes and additional heterogeneous data. We introduce dependence trees for continuous variates, which naturally model the dependencies inherent in the differentiation process as gene expression trees. Several trees are combined in a mixture model to allow inference of potentially overlapping clusters of co-expressed genes. Additionally, we predict microRNA targets.

Conclusion: Computational results for several data sets from the lymphoid system demonstrate the relevance of our framework. We recover well-known biological facts and identify promising novel regulatory elements of genes and their functional assignments. The implementation of our method (licensed under the GPL) is available at .


Method Implementation
The Mixture of Dependence Trees (MixDTrees) estimation was implemented in Matlab using the Bayesian Network Toolbox (http://bnt.sourceforge.net/). All Matlab source files, Windows binaries, tutorials, and sample data sets are available at the tool webpage http://algorithmics.molgen.mpg.de/MixDTrees. We also used the Bayesian Network Toolbox for the implementations of the mixture of Gaussians with full and diagonal covariance matrices. An implementation of the k-means algorithm in Python was obtained from the Open Source Clustering software (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/), while the SOM implementation was obtained from the SOM Matlab Toolbox (http://www.cis.hut.fi/projects/somtoolbox/).

Mixture of Dependence Trees Estimation
In this section we derive the parameter estimates of a MixDTrees model. We concentrate on the estimates of a mixture of conditional Gaussians, the building blocks of the MixDTrees. For more details on mixture models see [?].

Maximum Likelihood
Consider a continuous bivariate X = (X_u, X_v) and let x = (x_u, x_v) be a realization of X. For clustering purposes, we also define an unobserved discrete variable Y, whose realization y ∈ {1, ..., K} indicates the cluster of a given realization x [?]. Setting pa(u) = v and y = k, the conditional Gaussian density function is defined as
\[
p(x_u \mid x_v, \tau_{uk}) = \frac{1}{\sqrt{2\pi\sigma^2_{u|v,k}}} \exp\left(-\frac{(x_u - \mu_{u|v,k} - w_{u|v,k}\,x_v)^2}{2\sigma^2_{u|v,k}}\right),
\]
where \(\tau_{uk} = (\sigma_{u|v,k}, \mu_{u|v,k}, w_{u|v,k})\) and \(\theta_k = (\tau_{uk}, \tau_{vk})\) are the unknown model parameters. By marginalizing over y, the hidden variable, we obtain the following mixture density, a convex combination of component densities, one component corresponding to one cluster,
\[
p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p(x_u \mid x_v, \tau_{uk}) \, p(x_v \mid \tau_{vk}),
\]
where the mixture weights \(\alpha_k = p[y = k \mid \theta_k]\) satisfy \(\alpha_k \geq 0\) and \(\sum_{k=1}^{K} \alpha_k = 1\), and \(\Theta = (\theta_1, ..., \theta_K, \alpha_1, ..., \alpha_K)\). Let \(x_i = [x_{iu}, x_{iv}]\) be the developmental profile of gene i, and \(x_{iu}\) the expression value of gene i in development stage u, for 1 ≤ i ≤ N. The likelihood of the data, assuming that genes are independently distributed, is given by
\[
L(\Theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \alpha_k \, p(x_i \mid \theta_k),
\]
and the complete likelihood takes the following multinomial form
\[
CL(\Theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left(\alpha_k \, p(x_i \mid \theta_k)\right)^{r_{ik}},
\]
where \(r_{ik} = p[y = k \mid x_i = [x_{iu}, x_{iv}]]\) is the posterior probability (or responsibility) [?] that gene i belongs to cluster k.
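As a concrete numerical illustration of the mixture density and the responsibilities r_ik, here is a minimal Python sketch (not the released Matlab implementation; all function and parameter names are hypothetical, and the conditional mean is written in the centered form adopted below):

```python
import numpy as np

def cond_gauss_pdf(xu, xv, mu_u, mu_v, w, var_uv):
    """Conditional Gaussian for pa(u) = v: N(x_u; mu_u + w * (x_v - mu_v), var_uv)."""
    resid = xu - mu_u - w * (xv - mu_v)
    return np.exp(-resid ** 2 / (2 * var_uv)) / np.sqrt(2 * np.pi * var_uv)

def gauss_pdf(x, mu, var):
    """Marginal Gaussian for the root variable X_v."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def responsibilities(X, alphas, params):
    """r[i, k] = alpha_k p(x_i | theta_k) / sum_l alpha_l p(x_i | theta_l)."""
    xu, xv = X[:, 0], X[:, 1]
    joint = np.column_stack([
        a * cond_gauss_pdf(xu, xv, p["mu_u"], p["mu_v"], p["w"], p["var_uv"])
          * gauss_pdf(xv, p["mu_v"], p["var_v"])
        for a, p in zip(alphas, params)
    ])
    return joint / joint.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one; row i is the soft assignment of gene i over the K clusters.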
To simplify computation, we use the logarithm of the complete likelihood (CLL) in our derivations.
In the following, we find parameters maximizing the logarithm of the complete data likelihood by calculating appropriate derivatives and finding critical points.

Mean
Setting \(\partial CLL / \partial \mu_{u|v,k}\) to zero, we find
\[
\hat{\mu}_{u|v,k} = \bar{\mu}_{u|k} - w_{u|v,k}\,\bar{\mu}_{v|k}, \qquad \text{with} \quad \bar{\mu}_{u|k} = \frac{\sum_{i=1}^{N} r_{ik}\, x_{iu}}{\sum_{i=1}^{N} r_{ik}}.
\]
Note that \(\hat{\mu}_{u|v,k} = \bar{\mu}_{u|k}\) only if \(w_{u|v,k} = 0\). For \(w_{u|v,k} \neq 0\), \(\mu_{u|v,k}\) should be interpreted as the intercept of the linear combination. Since this parameter is hard to interpret, we plug the above definition into the original formula and, as in the original paper, adopt the following formulation of the conditional Gaussian from here on:
\[
p(x_u \mid x_v, \tau_{uk}) = \frac{1}{\sqrt{2\pi\sigma^2_{u|v,k}}} \exp\left(-\frac{\left(x_u - \mu_{u|k} - w_{u|v,k}\,(x_v - \mu_{v|k})\right)^2}{2\sigma^2_{u|v,k}}\right).
\]
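The two parameterizations describe the same conditional mean, which can be checked numerically; this small sketch (hypothetical parameter values) compares the intercept form \(\mu_{u|v,k} + w_{u|v,k} x_v\) with the adopted centered form:

```python
import numpy as np

mu_u, mu_v, w = 1.2, -0.4, 0.7   # hypothetical cluster parameters
xv = np.linspace(-2, 2, 5)       # a few values of the parent variable

# Intercept form: mean = mu_cond + w * x_v, with mu_cond = mu_u - w * mu_v
mu_cond = mu_u - w * mu_v
mean_intercept = mu_cond + w * xv

# Adopted centered form: mean = mu_u + w * (x_v - mu_v)
mean_centered = mu_u + w * (xv - mu_v)

assert np.allclose(mean_intercept, mean_centered)
```

The centered form is easier to read off: \(\mu_{u|k}\) is simply the cluster mean of \(X_u\).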

Regression Parameter Estimation
For the regression parameter \(w_{u|v,k}\) we obtain
\[
\frac{\partial CLL}{\partial w_{u|v,k}} = \frac{1}{\sigma^2_{u|v,k}} \sum_{i=1}^{N} r_{ik} \left(x_{iu} - \mu_{u|k} - w_{u|v,k}(x_{iv} - \mu_{v|k})\right)(x_{iv} - \mu_{v|k}).
\]
By the definitions of \(\bar{\mu}_{u|k}\) and \(\bar{\mu}_{v|k}\) this simplifies to
\[
\frac{\partial CLL}{\partial w_{u|v,k}} = \frac{1}{\sigma^2_{u|v,k}} \sum_{i=1}^{N} r_{ik} \left[ (x_{iu} - \bar{\mu}_{u|k})(x_{iv} - \bar{\mu}_{v|k}) - w_{u|v,k}(x_{iv} - \bar{\mu}_{v|k})^2 \right].
\]
Setting this to zero, we find
\[
\hat{w}_{u|v,k} = \frac{\bar{\sigma}_{u,v|k}}{\bar{\sigma}^2_{v|k}}, \tag{5}
\]
where \(\bar{\sigma}_{u,v|k} = \frac{\sum_i r_{ik}(x_{iu} - \bar{\mu}_{u|k})(x_{iv} - \bar{\mu}_{v|k})}{\sum_i r_{ik}}\) and \(\bar{\sigma}^2_{v|k} = \frac{\sum_i r_{ik}(x_{iv} - \bar{\mu}_{v|k})^2}{\sum_i r_{ik}}\).
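Equation (5) is a responsibility-weighted least-squares slope. A sketch (hypothetical helper name, simulated data) that cross-checks the closed form against an ordinary least-squares fit when all responsibilities are equal:

```python
import numpy as np

def weighted_w_hat(xu, xv, r):
    """w_hat = (responsibility-weighted covariance of X_u, X_v) / (weighted variance of X_v)."""
    n = np.sum(r)
    mu_u, mu_v = np.sum(r * xu) / n, np.sum(r * xv) / n
    cov_uv = np.sum(r * (xu - mu_u) * (xv - mu_v)) / n
    var_v = np.sum(r * (xv - mu_v) ** 2) / n
    return cov_uv / var_v
```

With all weights equal, this reduces to the ordinary least-squares slope of x_u regressed on x_v; non-uniform responsibilities let each cluster fit its own slope from its own soft-assigned genes.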

Variance
To simplify the derivations for the variance parameter we substitute \((\sigma^2_{u|v,k})^{-1}\) by \(\gamma_{u|v,k}\) and obtain
\[
\frac{\partial CLL}{\partial \gamma_{u|v,k}} = \frac{1}{2} \sum_{i=1}^{N} r_{ik} \left[ \frac{1}{\gamma_{u|v,k}} - \left(x_{iu} - \bar{\mu}_{u|k} - w_{u|v,k}(x_{iv} - \bar{\mu}_{v|k})\right)^2 \right].
\]
Setting this to zero we have
\[
\hat{\sigma}^2_{u|v,k} = \frac{\sum_{i=1}^{N} r_{ik} \left(x_{iu} - \bar{\mu}_{u|k} - w_{u|v,k}(x_{iv} - \bar{\mu}_{v|k})\right)^2}{\sum_{i=1}^{N} r_{ik}}.
\]
By the definitions of \(\bar{\sigma}^2_{u|k}\), \(\bar{\sigma}^2_{v|k}\) and \(\bar{\sigma}_{u,v|k}\) we arrive at
\[
\hat{\sigma}^2_{u|v,k} = \bar{\sigma}^2_{u|k} - 2\, w_{u|v,k}\, \bar{\sigma}_{u,v|k} + w^2_{u|v,k}\, \bar{\sigma}^2_{v|k}.
\]
By the definition of \(\hat{w}_{u|v,k}\), this yields
\[
\hat{\sigma}^2_{u|v,k} = \bar{\sigma}^2_{u|k} - \frac{\bar{\sigma}^2_{u,v|k}}{\bar{\sigma}^2_{v|k}}.
\]
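The final identity (conditional variance as marginal variance minus explained variance) can be verified numerically; a sketch with a hypothetical helper computing the responsibility-weighted conditional variance:

```python
import numpy as np

def weighted_cond_var(xu, xv, r):
    """sigma^2_{u|v} = sigma^2_u - cov_uv^2 / sigma^2_v, all responsibility-weighted."""
    n = np.sum(r)
    mu_u, mu_v = np.sum(r * xu) / n, np.sum(r * xv) / n
    var_u = np.sum(r * (xu - mu_u) ** 2) / n
    var_v = np.sum(r * (xv - mu_v) ** 2) / n
    cov_uv = np.sum(r * (xu - mu_u) * (xv - mu_v)) / n
    return var_u - cov_uv ** 2 / var_v
```

This equals the weighted mean squared residual of the regression with slope \(\hat{w} = \bar{\sigma}_{u,v}/\bar{\sigma}^2_v\), so the estimator never needs an explicit residual pass over the data.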

Maximum a Posteriori

We propose maximum a posteriori (MAP) estimation to regularize the parameter \(w_{u|v,k}\) and to avoid over-fitting when there is low evidence for a given model (or low \(\alpha_k\)).

Prior on w u|v,k
We define the prior of \(w_{u|v,k}\) to be proportional to
\[
\exp\left(-\frac{w^2_{u|v,k}}{2\,\beta_{u|v,k}}\right),
\]
where \(\beta_{u|v,k}\) is a hyper-parameter. The prior
\[
\exp\left(-\frac{w^2_{u|v,k}\,\bar{\sigma}^2_{v|k}}{2\,\beta_{u|v,k}\,\sigma^2_{u|v,k}}\right),
\]
which is invariant to the scale of the variates \(X_u\) and \(X_v\), is a more appropriate extension of the above.
The MAP estimate is obtained from the posterior
\[
p(w_{u|v,k} \mid X) \propto p(X \mid w_{u|v,k})\, p(w_{u|v,k}).
\]
Taking the derivative of the log posterior with respect to \(w_{u|v,k}\) and setting it to zero, we obtain the MAP estimate
\[
\hat{w}_{u|v,k} = \frac{\bar{\sigma}_{u,v|k}}{\bar{\sigma}^2_{v|k}\left(1 + (N_k\,\beta_{u|v,k})^{-1}\right)}, \qquad N_k = \sum_{i=1}^{N} r_{ik}.
\]
When \(\beta_{u|v,k} \to \infty\), the prior becomes non-informative; that is, the MAP and ML estimates are equal. The MAP estimator of \(\sigma^2_{u|v,k}\) can be derived analogously (again we substitute \((\sigma^2_{u|v,k})^{-1}\) by \(\gamma_{u|v,k}\)); again, when \(\beta_{u|v,k} \to \infty\), the prior becomes non-informative. In an empirical Bayes approach [2], we can estimate the maximum a posteriori value of \(\beta_{u|v,k}\) from the data. Since the MAP estimates of \(\sigma^2_{u|v,k}\) and \(w_{u|v,k}\) are also dependent on \(\beta_{u|v,k}\), we approximate the estimate with the MLE definitions of \(\hat{\sigma}^2_{u|v,k}\) and \(\hat{w}_{u|v,k}\). This empirical prior penalizes variables with large variances (and low covariances), enforcing a lower \(w_{u|v,k}\) and thus avoiding over-fitting.
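One way to picture the effect of the prior is a ridge-style shrinkage of the ML slope; the sketch below uses a hypothetical shrinkage form chosen to show the limiting behaviour (large β or large effective sample size recovers the ML estimate), not the exact estimator of the paper:

```python
def w_map(cov_uv, var_v, beta, n_k):
    """Shrunken slope (hypothetical ridge-style form): the ML estimate
    cov_uv / var_v is pulled toward 0; the pull weakens as the effective
    sample size n_k (sum of responsibilities) or the hyper-parameter
    beta grows."""
    return (cov_uv / var_v) / (1.0 + 1.0 / (n_k * beta))

w_ml = w_map(0.35, 0.5, beta=1e12, n_k=100)   # beta -> inf: effectively the ML estimate 0.7
w_small = w_map(0.35, 0.5, beta=0.01, n_k=1)  # low evidence + strong prior: heavy shrinkage
```

Clusters with few soft-assigned genes (small n_k, i.e. low α_k) are regularized most strongly, which is exactly the over-fitting regime the MAP estimation targets.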

Results
The normalized data sets Tcell, Bcell and LymphoidTree can be found at http://algorithmics.molgen.mpg.de/MixDTrees/Data. The first column corresponds to the gene identifier and the second column to the cluster id. The MixDTrees reports from the data set analyses are available at http://algorithmics.molgen.mpg.de/MixDTrees/Reports.

Additional Simulated Results SIM2
We use MixDTrees with random parameterizations to generate simulated data. For the tree structure given in Fig. 1, we randomly chose the μ_{u|v,k} from the range [−1.5, 1.5], σ²_{u|v,k} from [0, 0.5] and w_{u|v,k} from [−0.5, 0.5]. We performed experiments with k = 3, 5, 10 and 14 in three component-weight settings: uniform weights, one large component (α = 0.5), and two small components (α = 0.05 each). For each setting, we generated one mixture and sampled 1000 developmental profiles.
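The sampling scheme can be sketched as follows; a chain-shaped tree and a small floor on the variances are assumed here purely for illustration (the actual structure is that of Fig. 1), and all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
K, D, N = 3, 5, 1000          # clusters, development stages (chain tree), profiles

# Random parameterization per cluster: mu in [-1.5, 1.5], sigma^2 in (0, 0.5], w in [-0.5, 0.5]
mu = rng.uniform(-1.5, 1.5, size=(K, D))
var = rng.uniform(0.05, 0.5, size=(K, D))   # floored at 0.05 to keep densities proper
w = rng.uniform(-0.5, 0.5, size=(K, D))     # w[:, 0] is unused (the root has no parent)

alphas = np.full(K, 1.0 / K)                # uniform component-weight setting
labels = rng.choice(K, size=N, p=alphas)    # pick a cluster per profile

X = np.empty((N, D))
for i, k in enumerate(labels):
    X[i, 0] = rng.normal(mu[k, 0], np.sqrt(var[k, 0]))   # sample the root stage
    for u in range(1, D):                                # pa(u) = u - 1 on a chain
        cond_mean = mu[k, u] + w[k, u] * (X[i, u - 1] - mu[k, u - 1])
        X[i, u] = rng.normal(cond_mean, np.sqrt(var[k, u]))
```

Sampling follows the tree top-down: once the root stage is drawn, each later stage only needs its parent's value, which is what makes dependence trees cheap to simulate and to fit.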
To inspect the robustness of the method, we added normally distributed noise with mean zero and variance in the interval [0.0, 1.0]. In all three experimental settings (Fig. 1, Fig. 2 and Fig. 3), sensitivity and specificity do not decrease significantly until a large amount of noise is present (σ_noise/σ_sample > 1).
Figure 4: Specificity and sensitivity vs. noise ratio in the simulated data with two small components. The noise ratio is measured by dividing the variance of the added noise by the mean σ²_{u|v,k} of the original tree.