Kernels and estimators
This tutorial shows how to work with CellRank using the low-level mode. We will interact directly with CellRank’s two main modules, kernels and estimators. Don’t worry - this really isn’t any more complicated than using scikit-learn.
We’ll demonstrate this using a dataset of mouse pancreatic development at E15.5 [Bastidas-Ponce et al., 2019]. We’ll use scVelo a bit to work with RNA velocity, if you’re unfamiliar with that, see the documentation or read [Bergen et al., 2020].
This tutorial notebook can be downloaded using the following link.
Concept of kernels and estimators
The idea behind CellRank is to have a flexible and extensible scanpy/AnnData compatible modeling framework for single-cell data that resolves around the concept of Markov chains. Each state in the Markov chain is given by one observed cell and transition probabilities among cells only depend on the current cell and not on the history. To make CellRank as flexible as possible, we modularized it into kernels which compute a cell-cell transition matrix and estimators which use the transition matrix formulate hypotheses about the underlying dynamics, see the above figure.
Kernels read data from an AnnData object like RNA velocity vectors, transcriptomic similarity, spatial proximity, pseudotime, etc. and use it to compute a sparse transition matrix (see figure above). We set up the kernel structure so that it’s easy for you to contribute your own kernel via CellRank’s external module, check out our create your own kernel tutorial.
Estimators read a cell-cell transition matrix from a kernel object and compute aggregate dynamics based on it. Each estimator works with every kernel - the core idea here is that it does not matter how you computed your transition matrix to run an estimator on top, you can even compute a transition matrix outside of CellRank and import it via a PrecomputedKernel - you will still be able to use every estimator on top. Our main estimator is called Generalized Perron Cluster Cluster Analysis (GPCCA) and we interface to the great pyGPCCA library for it, see [Reuter et al., 2018] and [Reuter et al., 2019].
Besides kernels and estimators, there are downstream plotting functions that use fate probabilities to do fun stuff, i.e. giving you a directed version of PAGA, plotting gene expression trends along lineages or embedding cells in a circle based on where they’re likely to go - to name only a few. Check out CellRank’s API and our gallery for a full list.
This tutorial will focus on getting you familiar with the structure of kernels and estimators, how they can be used to set up and analyze a single-cell Markov chain and how you can extract real biological knowledge from them. If you would like to learn something about the math behind them, or see more biological applications, check out the CellRank manuscript [Lange et al., 2022]. Enjoy the tutorial!
Import packages & data
Easiest way to start is to download Miniconda3 along with the environment file found here. To create the environment, run
conda create -f environment.yml.
import sys if "google.colab" in sys.modules: !pip install -q git+https://github.com/theislab/cellrank@dev
import scvelo as scv import scanpy as sc import cellrank as cr import numpy as np scv.settings.verbosity = 3 scv.settings.set_figure_params("scvelo") cr.settings.verbosity = 2
We’ll not go trough all the pre-processing steps described in the basic tutorial again - instead, we will download an AnnData object that contains pre-processed (filtered, total counts normalized & log-transformed) data where RNA velocity has already been pre-computed for us. The following commands will download the pre-processed
adata object and save it under
- this object is about 150 MB large.
adata = cr.datasets.pancreas_preprocessed() scv.utils.show_proportions(adata) adata
Abundance of ['spliced', 'unspliced']: [0.81 0.19]
100%|██████████| 140M/140M [00:03<00:00, 38.7MB/s]
AnnData object with n_obs × n_vars = 2531 × 2000 obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'dpt_pseudotime' var: 'highly_variable_genes', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'fit_r2', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'velocity_genes' uns: 'clusters_colors', 'clusters_fine_colors', 'diffmap_evals', 'iroot', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params' obsm: 'X_diffmap', 'X_pca', 'X_umap', 'velocity_umap' varm: 'PCs', 'loss' layers: 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'spliced', 'unspliced', 'velocity', 'velocity_u' obsp: 'connectivities', 'distances'
Initialize a kernel
To construct a transition matrix, CellRank offers a number of kernel classes in
cellrank.tl.kernels. To demonstrate the concept, we’ll restrict our attention to the following two kernels here:
VelocityKernel: compute transition matrix based on RNA velocity.
ConnectivityKernel: compute symmetric transition matrix based on transcriptomic similarity - essentially a diffusion-pseudotime (DPT) kernel [Haghverdi et al., 2016].
For a full list of kernels, check out our API. To initialize a kernel object, simply run the following:
from cellrank.tl.kernels import VelocityKernel vk = VelocityKernel(adata)
Note that kernels need an AnnData object to read data from it - CellRank is part of the Scanpy/AnnData ecosystem. The only exception to this is the PrecomputedKernel which directly accepts a transition matrix, thus making it possible to interface to CellRank from outside the scanpy/AnnData world.
To learn more about our kernel object, we can print it:
There isn’t very much here yet!
Compute a transition matrix
To compute a transition matrix, every kernel object has a corresponding method:
Computing transition matrix based on logits using `'deterministic'` mode Estimating `softmax_scale` using `'deterministic'` mode Setting `softmax_scale=3.9741` Finish (0:00:03)
100%|██████████| 2531/2531 [00:01<00:00, 1400.25cell/s] 100%|██████████| 2531/2531 [00:01<00:00, 1695.07cell/s]
Which algorithm is run under the hood to get the cell-cell transition probabilities depends entirely on the kernel - in this case, we compared RNA velocity vectors to transcriptomic displacements in local neighborhoods to estimate where each single cell is likely to go. To see how exactly this transition matrix was computed, we can print the kernel again:
<VelocityKernel[softmax_scale=3.97, mode=deterministic, seed=24756, scheme=<CorrelationScheme>]>
There’s a lot more info now! To find out what all of these mean, check the docstring of
.compute_transition_matrix. The most important bits of information here are:
mode='deterministic: by default, the computation is deterministic, but we can also sample from the velocity distribution (
mode='sampling'), get a 2nd order estimate (
mode='stochastic') or a Monte Carlo estimate (
backward=False: run the process in the forward direction. To change this, set
backward=Truewhen initializing the
softmax_scale: scaling factor used in the softmax to transform cosine similarities into probabilities. The larger this value, the more centered the distribution will be around the most likely cell. If
None, use velocity variances to scale the softmax, i.e. an automatic way to tune it in terms of local variance in velocities. This requires one additional run (always in ‘deterministic’ mode, to quickly estimate the scale).
Combining two kernels
RNA velocity vectors can sometimes be very noisy - to regularize them, let’s combine the
VelocityKernel with a
ConnectivityKernel, computed on the basis of transcriptomic similarity. Let’s start by initializing such a kernel and computing a second transition matrix:
from cellrank.tl.kernels import ConnectivityKernel ck = ConnectivityKernel(adata).compute_transition_matrix()
Computing transition matrix based on `adata.obsp['connectivities']` Finish (0:00:00)
ConnectivityKernel has a parameter
conn_key which is used to read the KNN graph from
adata.obsp. By default, this is set to
connectivities, meaning that a transcriptomic similarity-based KNN graph is used. If you have spatial data and you’ve processed in using squidpy [Palla et al., 2021], you can read a spatial-proximity based KNN graph by varying the
conn_key to whatever was used to save the graph in squidpy, by default,
Note how it’s possible to call the
.compute_transition_matrix method direcly when initializing the kernel - this works for all kernel classes. Given these two kernels now, we can combine them with some relative weights:
combined_kernel = 0.8 * vk + 0.2 * ck
Let’s print the
combined_kernel to see what happened:
((0.8 * <VelocityKernel[softmax_scale=3.97, mode=deterministic, seed=24756, scheme=<CorrelationScheme>]>) + (0.2 * <ConnectivityKernel[dnorm=True, key=connectivities]>))
There we go, we took the two computed transition matrices stored in the kernel object and combined them using a weighted mean, with weights given by the factors we provided. We will use the
combined_kernel in the
estimators section below.
Initialize an estimator
Estimators take a
kernel object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow-time scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in
CFLARE: Clustering and Filtering Left And Right Eigenvectors. Heuristic method based on the spectrum of the transition matrix.
GPCCA: Generalized Perron Cluster Cluster Analysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, see [Reuter et al., 2018] and [Reuter et al., 2019].
For more information on the estimators, have a look at the API. We will demonstrate the
GPCCA estimator here, however, the
CFLARE estimator has a similar set of methods (which do different things internally). Let’s start by initializing a
GPCCA object based on the
combined_kernel we constructed above:
from cellrank.tl.estimators import GPCCA g = GPCCA(combined_kernel) print(g)
GPCCA[n=2531, kernel=((0.8 * <VelocityKernel[softmax_scale=3.97, mode=deterministic, seed=24756, scheme=<CorrelationScheme>]>) + (0.2 * <ConnectivityKernel[dnorm=True, key=connectivities]>))]
Additionally to the information about the kernel it is based on, this prints out the number of states in the underlying Markov chain.
Compute a matrix decomposition
GPCCA needs a real sorted Schur decomposition to work with, so let’s start by computing this and visualizing eigenvalues in complex plane:
WARNING: Unable to import `petsc4py` or `slepc4py`. Using `method='brandts'` WARNING: For `method='brandts'`, dense matrix is required. Densifying Computing Schur decomposition WARNING: Using `20` components would split a block of complex conjugate eigenvalues. Using `n_components=21` When computing macrostates, choose a number of states NOT in `[5, 10, 16, 18, 20]` Adding `adata.uns['eigendecomposition_fwd']` `.schur_vectors` `.schur_matrix` `.eigendecomposition` Finish (0:00:12)
To compute the Schur decomposition, there are two methods implemented:
scipy.linalg.schurto compute a full real Schur decomposition and sort it [Brandts, 2002]. Note that
scipy.linalg.schuronly supports dense matrices, so consider using this for small cell numbers (<10k).
method='krylov': use an iterative, krylov-subspace based algorithm provided in SLEPc to directly compute a partial, sorted, real Schur decomposition. This works with sparse matrices and will scale to extremely large cell numbers.
The real Schur decomposition for transition matrix
T is given by
Q U Q**(-1), where
Q is orthogonal and
U is quasi-upper triangular, which means it’s upper triangular except for 2x2 blocks on the diagonal. 1x1 blocks on the diagonal represent real eigenvalues, 2x2 blocks on the diagonal represent complex eigenvalues. Above, we plotted the top 20 eigenvalues of the matrix
T to see whether there is an apparent eigengap. In the present case, there seems to be such a gap after
the first 3 eigenvalues.
Schur vectors will span an invariant subspace, let’s call it
X (Schur vectors in the columns). Note that the Schur decomposition is not unique - however, the subspace spanned by the top Schur vectors is unique so will be our initial & terminal states as well as fate probabilities.
Infer terminal states
The next step in
GPCCA is to find a linear combination of these vectors such that the Markov chain defined on the subset of states has large self-transition probability. We do this by calling the following method:
g.compute_macrostates(n_states=3, cluster_key="clusters") g.plot_macrostates()
Computing `3` macrostates Adding `.macrostates` `.macrostates_memberships` `.coarse_T` `.coarse_initial_distribution `.coarse_stationary_distribution` `.schur_vectors` `.schur_matrix` `.eigendecomposition` Finish (0:00:00)
We can look at individual states by passing
To see the most likely cells for each of the states, set
discrete=True. This uses top
n cells from each lineage, where
n can be set in
compute_macrostates (defaults to 30 cells).
So far, we have just computed macrostates. To restrict these to the set of terminal states, there are three possibilities:
estimator.compute_terminal_states(): automatically selects the terminal states from the set of macrostates via a stability criterion.
estimator.set_terminal_states_from_macrostates(): manually restrict the macrostates by passing a list of macrostate-names that you know are terminal in your data.
estimator.set_terminal_states: manually set the terminal states, without computing macrostates, i.e. entirely manually. This is only recommended in situations where you’re very certain where your terminal states are and you would just like to use the estimator to compute fate probabilities towards them.
We’ll use the automatic way below (it doesn’t matter in this case since we have only computed three macrostates, and all of them are terminal):
Adding `adata.obs['terminal_states']` `adata.obs['terminal_states_probs']` `.terminal_states` `.terminal_states_probabilities` `.terminal_states_memberships Finish`
Estimate fate probabilities
Once we have computed the set of terminal states, we can compute fate probabilities towards them. We do this using absorption probabilities, an efficient way to compute where a cell is likely to end up if its transitions are described by a Markov chain.
Computing absorption probabilities WARNING: Unable to import petsc4py. For installation, please refer to: https://petsc4py.readthedocs.io/en/stable/install.html. Defaulting to `'gmres'` solver. Adding `adata.obsm['to_terminal_states']` `.absorption_probabilities` Finish (0:00:00)
100%|██████████| 3/3 [00:00<00:00, 34.96/s]
If you have a large dataset, make sure you:
use the PETSc libary by setting
parallelize your computations via the
n_jobsparameter if you’re looking at many temrinal states (>5)
use the iterative GMRES solver via
CellRank’s implementation of absorption probabilities is extremely fast and memory efficient - in our own experiments, it takes us about 2min to compute absorption probabilities for 100k cells towards 3 terminal states (30 cells each) using an Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz with 32 cores with a peak memory usage of 35 about GiB. These experiments were run in CellRank 1.3 - version 1.4 includes an additional speedup by a factor 5.
Find driver genes
Once we have computed the
absorption_probabilities, we can correlate them against all genes to find potential lineage drivers. Below, we show how to do this for just one terminal state. Note that results are written to the
.var attribute of
alpha_drivers = g.compute_lineage_drivers(lineages="Alpha", return_drivers=True) alpha_drivers.sort_values(by="Alpha_corr", ascending=False)
Adding `adata.varm['terminal_lineage_drivers']` `.lineage_drivers` Finish (0:00:00)
2000 rows × 5 columns
To find genes which may be involved in fate choice early on, we could restrict the correlation to a subset of our clusters using the
clusters parameters. Let’s take a look at some of the identified putative Alpha lineage drivers:
Use a downstream plotting function
That’s it for kernels and estimators already! In the last part, we’ll demonstrate how a downstream plotting function can be used to visualize fate probabilities. Note that we won’t call this though the estimator interface but though CellRank’s plotting API. This works because estimators write to AnnData and downstream plotting functions read from AnnData.
The function we’re using here computes a circular projection of our dataset by arranging the terminal states around the unit circle and placing cell inside, depending on their fate probabilities. This is inspired by STEMNET [Velten et al., 2017].
cr.pl.circular_projection(adata, keys="clusters", legend_loc="right")
Congrats! You made it though our kernels & estimators tutorial and you’re now familiar with the structure of CellRank. I hope we were able to show you how you can use kernels to compute transition probabilities and estimators to analyze them. If you want to learn more about how to:
compute initial states by simply repeating the above analysis passing
backward=Trueto the kernel object when initializing it. The estimator object can be initialized as above with no changes needed.
experiment with some of our kernels and estimators to build your own transition matrix and analyze it. Check the API for a full list of kernels and estimators and check the gallery to see them in action.
learn how to use CellRank when you don’t have RNA velocity information by following our CellRank beyond RNA velocity tutorial.
open an issue if something is unclear or does not work for you.
contribute your own kernel, check our contributions guide and our Creating a new kernel tutorial.