This tutorial shows how to work with CellRank using the low-level mode. We will interact directly with CellRank’s two main modules, kernels and estimators. We assume that you have gone through the high level tutorial already.
The first part of this tutorial is very similar to scVelo’s tutorial on pancreatic endocrinogenesis. This is essentially the same as in the high level tutorial, so feel free to skip the beginning and go directly to the section Run CellRank. The data we use here comes from Bastidas-Ponce et al. (2018). For more info on scVelo, see the documentation or read the article.
This tutorial notebook can be downloaded using the following link.
Import packages & data¶
import scvelo as scv import scanpy as sc import cellrank as cr import numpy as np scv.settings.verbosity = 3 scv.settings.set_figure_params('scvelo') cr.settings.verbosity = 2
First, we need to get the data. The following commands will download the
adata object and save it under
adata = cr.datasets.pancreas() scv.utils.show_proportions(adata) adata
Abundance of ['spliced', 'unspliced']: [0.81 0.19]
AnnData object with n_obs × n_vars = 2531 × 27998 obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime' var: 'highly_variable_genes' uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca' obsm: 'X_pca', 'X_umap' layers: 'spliced', 'unspliced' obsp: 'connectivities', 'distances'
Pre-process the data¶
Filter out genes which don’t have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scvelo functions, for more information about them, see the scVelo API.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000) sc.tl.pca(adata) sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30) scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Filtered out 22024 genes that are detected 20 counts (shared). Normalized count data: X, spliced, unspliced. Exctracted 2000 highly variable genes. Logarithmized X. computing moments based on connectivities finished (0:00:00) --> added 'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
We will use the dynamical model from scVelo to estimate the velocities. The first step, estimating the parameters of the dynamical model, may take a while (~10min). To make sure we only have to run this once, we developed a caching extension called scachepy. scachepy does not only work for
recover_dynamics, but it can cache the output of almost any scanpy or scvelo function. To install it,
pip install git+https://github.com/theislab/scachepy
If you don’t want to install scachepy now, don’t worry, the below cell will run without it as well and this is the only place in this tutorial where we’re using it.
try: import scachepy c = scachepy.Cache('../../cached_files/basic_tutorial/') c.tl.recover_dynamics(adata, force=False) except ModuleNotFoundError: print("You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). ") scv.tl.recover_dynamics(adata)
You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). recovering dynamics finished (0:11:01) --> added 'fit_pars', fitted parameters for splicing dynamics (adata.var)
Once we have the parameters, we can use these to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely two cells are to transition into another, given their velocity vectors and relative positions.
scv.tl.velocity(adata, mode='dynamical') scv.tl.velocity_graph(adata)
computing velocities finished (0:00:02) --> added 'velocity', velocity vectors for each individual cell (adata.layers) computing velocity graph finished (0:00:03) --> added 'velocity_graph', sparse matrix with cosine correlations (adata.uns)
scv.pl.velocity_embedding_stream(adata, basis='umap', legend_fontsize=12, title='', smooth=.8, min_mass=4)
computing velocity embedding finished (0:00:00) --> added 'velocity_umap', embedded velocity vectors (adata.obsm)
CellRank is a package for analyzing directed single cell data, whereby we mean single cell data that can be respresented via a directed graph. Most prominently, this is the case for single cell data where velocities have been computed - we can use these to direct the KNN graph. However, there are other situations in which we can inform the KNN graph of the direciton of the process, using i.e. pseudotime (see Palantir) or information obtained via
mRNA labeling with e.g. scSLAM-seq, scEU-seq or sci-fate. Because we wanted CellRank to be widely applicable, no matter how directionality was introduced to the data, we split it up into two main modules,
estimators. In short,
kernels allow you to compute a (directed) transition matrix, whereas
estimators allow you to analyze
To construct a transition matrix, CellRank offers a number of kernel classes in
cellrank.tl.kernels. Currently implemented are the following:
VelocityKernel: compute transition matrix based on RNA velocity.
ConnectivityKernel: compute symmetric transition matrix based on transcriptomic similarity (essentially a DPT kernel).
PalantirKernel: mimics Palantir.
These kernels can be combined by simply using the
* operator, we will demonstrate this below. To find out more, check out the API. Note that the
kernel classes are designed to be easy to extend to incoporate future kernels based on e.g. mRNA labeling or other sources of directionality. Let’s start with the
from cellrank.tl.kernels import VelocityKernel vk = VelocityKernel(adata)
To lern more about this object, we can print it:
There is not very much there yet. We can change this by computing the transition matrix:
Computing transition matrix based on velocity correlations using `'deterministic'` mode Estimating `softmax_scale` using `'deterministic'` mode Setting `softmax_scale=3.7951` Finish (0:00:07)
To see how exactly this transition matrix was computed, we can print the kernel again:
<Velo[softmax_scale=3.8, mode=deterministic, seed=44219]>
There’s a lot more info now! To find out what all of these mean, check the docstring of
.compute_transition_matrix. The most important bits of information here are -
mode='deterministic: by default, the computation is deterministic, but we can also sample from the velocity distribution (
mode='sampling'), get a 2nd order estimate (
mode='stochastic') or a Monte Carlo estimate (
backward=False: run the process in the forward direction. To change this, set
backward=Truewhen initializing the
softmax_scale: scaling factor used in the softmax to transform cosine similarities into probabilities. The larger this value, the more centered the distribution will be around the most likely cell. If
None, use velocity variances to scale the softmax, i.e. an automatic way to tune it in terms of local variance in velocities. This requires one additional run (always in ‘deterministic’ mode, to quickly estimate the scale).
The velocity kernel we computed above would allow us to reproduce the results from the high level tutorial. However, for the sake of demonstration, let’s suppose that our velocities are very noisy and we want to make the analysis more robust by combining the velocity kernel with a connectivity kernel. This is very easy:
from cellrank.tl.kernels import ConnectivityKernel ck = ConnectivityKernel(adata).compute_transition_matrix()
Computing transition matrix based on connectivities Finish (0:00:00)
Note how it’s possible to call the
.compute_transition_matrix method direcly when initializing the kernel - this works for all kernel classes. Given these two kernels now, we can combine them:
combined_kernel = 0.8 * vk + 0.2 * ck
Let’s print the
combined_kernel to see what happened:
((0.8 * <Velo[softmax_scale=3.8, mode=deterministic, seed=44219]>) + (0.2 * <Conn[dnorm=True]>))
There we go, we took the two computed transition matrices stored in the kernel object and combined them using a weighted mean, with weights given by the factors we provided. We will use the
combined_kernel in the
estimators section below.
Before moving on to the
estimators, let’s demonstrate how to set up a
PalantirKernel. For this, we need a pseudotemporal ordering of the cells. Any pseudotime method can be used here. Note that this won’t exactly reproduce the original palantir implementation becasue it uses a specific representation of the data and a specific pseudotime. We will simply use DPT here:
root_idx = np.where(adata.obs['clusters'] == 'Ngn3 low EP') adata.uns['iroot'] = root_idx sc.tl.dpt(adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
Note that we did not use the above
VelocityKernel to infer the initial state here as we assume that in a situation where you want to apply Palantir, you probably don’t have acess to velocities!
The last step is to initialize the
PalantirKernel based on the
adata object and the pre-computed diffusion pseudotime. If you want to use another pseudotime, use the
from cellrank.tl.kernels import PalantirKernel pk = PalantirKernel(adata, time_key='dpt_pseudotime').compute_transition_matrix() print(pk)
Computing transition matrix based on Palantir-like kernel Finish (0:00:01) <Pala[k=3, dnorm=True, n_neighs=30]>
Estimators take a
kernel object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow-time scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in
CFLARE: Clustering and Filtering Left And Right Eigenvectors. Heuristic method based on the spectrum of the transition matrix.
GPCCA: Generalized Perron Cluster Cluster Analysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, see Reuter et al. (2018).
For more information on the estimators, have a look at the API. We will demonstrate the
GPCCA estimator here, however, the
CFLARE estimator has a similar set of methods (which do different things internally). Let’s start by initializing a
GPCCA object based on the
combined_kernel we constructed above:
from cellrank.tl.estimators import GPCCA g = GPCCA(combined_kernel) print(g)
WARNING: Computing transition matrix using the default parameters GPCCA[n=2531, kernel=((0.8 * <Velo[softmax_scale=3.8, mode=deterministic, seed=44219]>) + (0.2 * <Conn[dnorm=True]>))]
Additionaly to the information about the kernel it is based on, this prints out the number of states in the underlying Markov chain. GPCCA needs a real sorted Schur decomposition to work with, so let’s start by computing this and visualizing eigenvalues in complex plane:
Computing Schur decomposition When computing macrostates, choose a number of states NOT in `[6, 10, 13, 16, 19]` Adding `.eigendecomposition` `adata.uns['eig_fwd']` `.schur` `.schur_matrix` Finish (0:00:00)
To compute the Schur decomposition, there are two methods implemented
scipy.linalg.schurto compute a full real Schur decomposition and sort it using a python implementation of Brandts (2002). Note that
scipy.linalg.schuronly supports dense matrices, so consider using this for small cell numbers (<10k).
method='krylov': use an interative, krylov-subspace based algorightm provided in SLEPc to directly compute a partial, sorted, real Schur decomposition. This works with sparse matrices and will scale to extremly large cell numbers.
The real Schur decomposition for transition matrix
T is given by
Q U Q**(-1), where
Q is orthogonal and
U is quasi-upper triangular, which means it’s upper triangular except for 2x2 blocks on the diagonal. 1x1 blocks on the diagonal represent real eigenvalues, 2x2 blocks on the diagonal represent complex eigenvalues. Above, we plotted the top 20 eigenvalues of the matrix
T to see whether there is an apparent eigengap. In the present case, there seems to be such a gap after
the first 3 eigenvalues. We can visualize the corresponding Schur vectors in the embedding:
These vectors will span an invariant subspace, let’s call it
X (Schur vectors in the columns). The next step in
GPCCA is to find a linear combination of these vectors such that the Markov chain defined on the subset of states has large selt-transition probability. We do this by calling the following method:
g.compute_macrostates(n_states=3, cluster_key='clusters') g.plot_macrostates()
Computing `3` macrostates INFO: Using pre-computed schur decomposition Adding `.macrostates_memberships` `.macrostates` `.schur` `.coarse_T` `.coarse_stationary_distribution` Finish (0:00:00)
We can look at individual states by passing
To see the most likely cells for each of the states, set
discrete=True. This uses top
n cells from each lineage, where
n can be set in
compute_macrostates (defaults to 30 cells).
So far, we haven’t written anything to
adata yet. This happens with the following call:
Adding `adata.obs['terminal_states_probs']` `adata.obs['terminal_states']` `adata.obsm['macrostates_fwd']` `.terminal_states_probabilities` `.terminal_states`
.set_terminal_states_from_macrostates is essentially a restriction of the set of macrostates to a set of main states, which could be initial or terminal states, depending on the direction of the process.
.set_terminal_states_from_macrostates has a parameter
names which allows you to pass a list of names like
['Alpha', 'Beta', 'Epsilon'], which is handy in a situations where you have computed several macrostates, but only a subset of them represents your terminal states. Using the
names, you can also combine several macrostates, see the API.
After setting our terminal states, we can compute absorption probabilities towards them as:
Computing absorption probabilities Adding `adata.obsm['to_terminal_states']` `adata.obs['to_terminal_states_dp']` `.absorption_probabilities` `.diff_potential` Finish (0:00:00)
Once we have set the
absorption_probabilities, we can correlate them against all genes to find potential lineage drivers. Below, we show how to do this for just one state. Note that results are written to the
.var attribute of
alpha_drivers = g.compute_lineage_drivers(lineages='Alpha', return_drivers=True) alpha_drivers.sort_values(by="Alpha", ascending=False)
Computing correlations for lineages `['Alpha']` restricted to clusters `None` in layer `X` with `use_raw=False` Adding `.lineage_drivers` `adata.var['to Alpha']` Finish (0:00:00)
2000 rows × 1 columns
We can look at some of the identified genes:
To find genes wich may be involved in fate choice early on, we could restrict the correlation to a subset of our clusters using the
To demonstrate yet another plotting function, we show how to visualize the distribution over fate choices using violin plots below:
cr.pl.cluster_fates(adata, cluster_key='clusters', mode='violin')
We can see from this plot that at E15.5 in endocrine development, predominantly Beta cell are being produced. We can also see a bimodality in the Alpha cluster, which suggests that within that cluster, there is a subpopulation of cells which isn’t yet commited.
initial states: to compute initial states, simply repeat the above analysis passing
backward=Trueto the kernel object when initializing it. The estimator object can be initialized as above with no changes needed.
Palantir kernel: we could repeat the same analysis using the
PalantirKernelwe initialized by simply passing it to the
g = cr.tl.estimators.GPCCA(pk). CellRank is build in a modular fashion - you can combine any
CFLARE estimator: you can initialize a
c = cr.tl.estimators.CFLARE(kenel). Most methods will have the same name, the two big differences are that CFLARE works with the eigendecomposition rather than the Schur decomposition and the way in which invariant subspaces are used to compute macrostates.