--- title: "Sparse structure estimation for multivariate count data with PLN-network" author: "PLN team" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 4 bibliography: article/PLNreferences.bib link-citations: yes vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Sparse structure estimation for multivariate count data with PLN-network} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( screenshot.force = FALSE, echo = TRUE, rows.print = 5, message = FALSE, warning = FALSE) set.seed(178643) ``` ## Preliminaries This vignette illustrates the standard use of the `PLNnetwork` function and the methods accompanying the R6 Classes `PLNnetworkfamily` and `PLNnetworkfit`. ### Requirements The packages required for the analysis are **PLNmodels** plus some others for data manipulation and representation: ```{r requirement} library(PLNmodels) library(ggplot2) ``` ### Data set We illustrate our point with the trichoptera data set, a full description of which can be found in [the corresponding vignette](Trichoptera.html). Data preparation is also detailed in [the specific vignette](Import_data.html). ```{r data_load} data(trichoptera) trichoptera <- prepare_data(trichoptera$Abundance, trichoptera$Covariate) ``` The `trichoptera` data frame stores a matrix of counts (`trichoptera$Abundance`), a matrix of offsets (`trichoptera$Offset`) and some vectors of covariates (`trichoptera$Wind`, `trichoptera$Temperature`, etc.) ### Mathematical background The network model for multivariate count data that we introduce in @PLNnetwork is a variant of the Poisson Lognormal model of @AiH89, see [the PLN vignette](PLN.html) as a reminder. Compare to the standard PLN model we add a sparsity constraint on the inverse covariance matrix ${\boldsymbol\Sigma}^{-1}\triangleq \boldsymbol\Omega$ by means of the $\ell_1$-norm, such that $\|\boldsymbol\Omega\|_1 < c$. PLN-network is the equivalent of the sparse multivariate Gaussian model [@banerjee2008] in the PLN framework. It relates some $p$-dimensional observation vectors $\mathbf{Y}_i$ to some $p$-dimensional vectors of Gaussian latent variables $\mathbf{Z}_i$ as follows \begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}\left({\boldsymbol\mu},\boldsymbol\Omega^{-1}\right) & \|\boldsymbol\Omega\|_1 < c \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array} \end{equation} The parameter ${\boldsymbol\mu}$ corresponds to the main effects and the latent covariance matrix $\boldsymbol\Sigma$ describes the underlying structure of dependence between the $p$ variables. The $\ell_1$-penalty on $\boldsymbol\Omega$ induces sparsity and selection of important direct relationships between entities. Hence, the support of $\boldsymbol\Omega$ correspond to a network of underlying interactions. The sparsity level ($c$ in the above mathematical model), which corresponds to the number of edges in the network, is controlled by a penalty parameter in the optimization process sometimes referred to as $\lambda$. All mathematical details can be found in @PLNnetwork. #### Covariates and offsets Just like PLN, PLN-network generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of $d$ covariates $\mathbf{x}_i$ and to a vector $\mathbf{o}_i$ of $p$ offsets in sample $i$. 
#### Alternating optimization

Regularization via sparsification of $\boldsymbol\Omega$ and visualization of the resulting network are the main objectives of PLN-network. To reach this goal, we first need to estimate the model parameters. Inference in PLN-network focuses on the regression parameters $\mathbf{B}$ and the inverse covariance $\boldsymbol\Omega$. Technically speaking, we adopt a variational strategy to approximate the $\ell_1$-penalized log-likelihood function and optimize the resulting sparse variational surrogate with an optimization scheme that alternates between two steps, as sketched after this list:

1. a gradient-ascent step, performed with the CCSA algorithm of @Svan02 implemented in the C++ library **nlopt** [@nlopt], which we link to the package;
2. a penalized log-likelihood step, performed with the graphical-Lasso of @FHT08, implemented in the package **glassoFast** [@glassofast].
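Schematically (a simplified view, omitting some technical details), writing $J(\mathbf{B}, \boldsymbol\Omega)$ for the variational lower bound of the log-likelihood, each penalty level $\lambda$ corresponds to maximizing the surrogate criterion

\begin{equation}
  J(\mathbf{B}, \boldsymbol\Omega) - \lambda \|\boldsymbol\Omega\|_1,
\end{equation}

where step 1 updates $\mathbf{B}$ and the variational parameters with $\boldsymbol\Omega$ held fixed, and step 2 updates $\boldsymbol\Omega$ via the graphical-Lasso.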
More technical details can be found in @PLNnetwork.

## Analysis of trichoptera data with a PLNnetwork model

In the package, the sparse PLN-network model is adjusted with the function `PLNnetwork`, which we review in this section. This function adjusts the model for a series of values of the penalty parameter controlling the number of edges in the network. It then provides a collection of objects with class `PLNnetworkfit`, corresponding to networks with different levels of density, all stored in an object with class `PLNnetworkfamily`.

### Adjusting a collection of networks - a.k.a. a regularization path

`PLNnetwork` finds a hopefully appropriate set of penalties on its own. This set can be controlled by the user, but use it with care and check the details in `?PLNnetwork`. The collection of models is fitted as follows:

```{r simple PLNnetwork}
network_models <- PLNnetwork(Abundance ~ 1 + offset(log(Offset)), data = trichoptera)
```

Note the use of the `formula` object to specify the model, similar to the one used in the function `PLN`.

### Structure of `PLNnetworkfamily`

The `network_models` variable is an `R6` object with class `PLNnetworkfamily`, which comes with a couple of methods. The most basic is the `show/print` method, which displays a very basic summary of the estimation process:

```{r show}
network_models
```

One can also easily access the successive values of the criteria in the collection:

```{r collection criteria}
network_models$criteria %>% head() %>% knitr::kable()
```

A diagnostic of the optimization process is available via the `convergence` field:

```{r convergence criteria}
network_models$convergence %>% head() %>% knitr::kable()
```

A nicer view of this output comes with the option "diagnostic" in the `plot` method:

```{r diagnostic, fig.width=7, fig.height=5}
plot(network_models, "diagnostic")
```

### Exploring the path of networks

By default, the `plot` method of `PLNnetworkfamily` displays the evolution of the criteria mentioned above, and is a good starting point for model selection:

```{r plot, fig.width=7, fig.height=5}
plot(network_models)
```

Note that we use the original definition of the BIC/ICL criterion ($\texttt{loglik} - \frac{1}{2}\texttt{pen}$), which is on the same scale as the log-likelihood. A [popular alternative](https://en.wikipedia.org/wiki/Bayesian_information_criterion) consists in using $-2\texttt{loglik} + \texttt{pen}$ instead. You can do so by specifying `reverse = TRUE`:

```{r plot-reverse, fig.width=7, fig.height=5}
plot(network_models, reverse = TRUE)
```

In this case, the variational lower bound of the log-likelihood is hopefully strictly increasing (or rather decreasing, if using `reverse = TRUE`) with a lower level of penalty (meaning more edges in the network). The same holds true for the penalized counterpart of the variational surrogate. Generally, the smoothness of these criteria is a good sanity check of the optimization process. BIC and its extended high-dimensional version EBIC are classically used for selecting the correct amount of penalization with sparse estimators like the one used by PLN-network. However, we will consider later a more robust, albeit more computationally intensive, strategy to choose the appropriate number of edges in the network.

To pursue the analysis, we can represent the coefficient path (i.e., the values of the edges in the network according to the penalty level) to see if some edges clearly come off. An alternative and more intuitive view consists in plotting the values of the partial correlations along the path, which can be obtained with the option `corr = TRUE`. To this end, we provide the S3 function `coefficient_path`:

```{r path_coeff, fig.width=7, fig.height=7}
coefficient_path(network_models, corr = TRUE) %>%
  ggplot(aes(x = Penalty, y = Coeff, group = Edge, colour = Edge)) +
  geom_line(show.legend = FALSE) +
  coord_trans(x = "log10") +
  theme_bw()
```

### Model selection issue: choosing a network

To select a network with a specific level of penalty, one uses the `getModel(lambda)` S3 method. We can also extract the best model according to the BIC or EBIC with the method `getBestModel()`.

```{r extract models}
model_pen <- getModel(network_models, network_models$penalties[20]) # give some sparsity
model_BIC <- getBestModel(network_models, "BIC")   # if no criterion is specified, the best BIC is used
```

An alternative strategy is to use StARS [@stars], which performs resampling to evaluate the robustness of the network along the path of solutions, in a similar fashion as the stability selection approach of @stabilitySelection, but in a network inference context.

Resampling can be computationally demanding but is easily parallelized: the function `stability_selection` integrates some features of the **future** package to perform parallel computing. We set our plan to speed up the process by relying on 2 workers:

```{r future, eval = FALSE}
library(future)
plan(multisession, workers = 2)
```

We first invoke `stability_selection` explicitly for pedagogical purposes. In this case, we need to build our sub-samples manually:

```{r stability}
n <- nrow(trichoptera)
subs <- replicate(10, sample.int(n, size = n/2), simplify = FALSE)
stability_selection(network_models, subsamples = subs)
```

Requesting "StARS" in `getBestModel` automatically invokes `stability_selection` with its default parameters (20 sub-samples), if it has not yet been run.

```{r extract models stars}
model_StARS <- getBestModel(network_models, "StARS")
```

Once stability selection has been run, the stability path is available from the `plot` function:

```{r plot stability, fig.width=7, fig.height=5}
plot(network_models, "stability")
```

When you are done, do not forget to get back to the standard sequential plan with **future**:

```{r future_off, eval = FALSE}
future::plan("sequential")
```

### Structure of a `PLNnetworkfit`

The variables `model_BIC`, `model_StARS` and `model_pen` are other `R6Class` objects with class `PLNnetworkfit`. They all inherit from the class `PLNfit` and thus own all its methods, with a couple of specific ones, mostly for network visualization purposes. Most fields and methods are recalled when such an object is printed:

```{r show/print}
model_StARS
```

The `plot` method provides a quick representation of the inferred network, with various options (either as a matrix or as a graph; it always sends back the plotted object invisibly, in case users need to perform additional analyses).

```{r extract}
my_graph <- plot(model_StARS, plot = FALSE)
my_graph
```
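Since the object sent back is a regular graph object from the **igraph** package, the usual tools of that package apply for downstream analyses. As a minimal sketch (assuming **igraph** is installed; not evaluated here), node degrees can be inspected to spot the most connected species:

```{r node_degrees, eval = FALSE}
## my_graph is an igraph object: compute the degree (number of neighbors)
## of each node, then display the five most connected species
degrees <- igraph::degree(my_graph)
sort(degrees, decreasing = TRUE)[1:5]
```

Any other **igraph** routine (betweenness, clustering, export to other formats) can be used in the same way.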
```{r stars_network, fig.width=7, fig.height=7}
plot(model_StARS)
plot(model_StARS, type = "support", output = "corrplot")
```

We can finally check that the fitted values of the counts -- even with sparse regularization of the covariance matrix -- are close to the observed ones:

```{r fitted, fig.cap = "fitted value vs. observation", fig.dim=c(7,5)}
data.frame(
  fitted   = as.vector(fitted(model_StARS)),
  observed = as.vector(trichoptera$Abundance)
) %>%
  ggplot(aes(x = observed, y = fitted)) +
    geom_point(size = .5, alpha = .25) +
    scale_x_log10(limits = c(1, 1000)) +
    scale_y_log10(limits = c(1, 1000)) +
    theme_bw() + annotation_logticks()
```

## References