Getting Started with CellRank¶
Preliminaries¶
CellRank is a software toolkit to study dynamical biological processes, like development, regeneration, cancer or reprogramming, based on multi-view single-cell genomics data [Lange et al., 2022], including RNA velocity [Bergen et al., 2020, La Manno et al., 2018], developmental potentials [Gulati et al., 2020], pseudotemporal orderings [Bendall et al., 2014, Haghverdi et al., 2016, Setty et al., 2019, Trapnell et al., 2014], (spatial) time course studies [Klein et al., 2023, Schiebinger et al., 2019], lineage-traced data [Alemany et al., 2018, Forrow and Schiebinger, 2021, Lange et al., 2023, Raj et al., 2018, Spanjaard et al., 2018], metabolic labels [Battich et al., 2020, Erhard et al., 2019, Qiu et al., 2020, Qiu et al., 2022], and more.
Oftentimes, the type of downstream analysis does not depend on the modality that was used to estimate cellular transitions - that’s why we modularized CellRank into kernels
, which compute cell-cell transition matrices, and estimators
, which analyze the transition matrices.
In this tutorial, you will learn how to:
use CellRank
kernels
to compute a transition matrix of cellular dynamics.use CellRank
estimators
to analyze the transition matrix, including the computation offate probabilities
,driver genes
, andgene expression trends
.read and write CellRank
kernels
.
This tutorial notebook can be downloaded using the following link.
Note
If you want to run this on your own data, you will need:
a scRNA-seq dataset for which you can compute a cell-cell transition matrix using any CellRank
kernel
.
Note
If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.
Import packages & data¶
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank
import cellrank as cr
import scanpy as sc
sc.settings.set_figure_params(frameon=False, dpi=100)
cr.settings.verbosity = 2
Global seed set to 0
import warnings
warnings.simplefilter("ignore", category=UserWarning)
To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of human bone marrow, which can be conveniently acessed through bone_marrow
[Setty et al., 2019].
adata = cr.datasets.bone_marrow()
adata
AnnData object with n_obs × n_vars = 5780 × 27876
obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
var: 'palantir'
uns: 'clusters_colors', 'palantir_branch_probs_cell_types'
obsm: 'MAGIC_imputed_data', 'X_tsne', 'palantir_branch_probs'
layers: 'spliced', 'unspliced'
Move MAGIC imputed data to layers
[van Dijk et al., 2018].
adata = adata[:, adata.var["palantir"]].copy()
adata.layers["MAGIC_imputed_data"] = adata.obsm["MAGIC_imputed_data"].copy()
Preprocess the data¶
Filter and normalize.
sc.pp.filter_genes(adata, min_cells=5)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
Compute PCA and k-nearest neighbor (k-NN) graph.
sc.tl.pca(adata, random_state=0)
sc.pp.neighbors(adata, random_state=0)
Visualize this data, using the t-SNE embedding, cluster labels and Palantir pseudotime provided with the original study [Setty et al., 2019].
Working with kernels¶
Set up a kernel¶
To construct a transition matrix, CellRank offers a number of kernel classes in kernels
. To demonstrate the concept, we’ll use the PseudotimeKernel
which biases k-NN graph edges to point into the direction of increasing pseudotime, inspired by Palantir [Setty et al., 2019].
For a full list of kernels, check out our API
. To initialize a kernel object, simply run the following:
from cellrank.kernels import PseudotimeKernel
pk = PseudotimeKernel(adata, time_key="palantir_pseudotime")
Note that kernels need an anndata.AnnData
object to read data from it - CellRank is part of the Scanpy/AnnData ecosystem. The only exception to this is the cellrank.kernels.PrecomputedKernel
which directly accepts a transition matrix, thus making it possible to interface to CellRank from outside the scanpy/AnnData world.
To learn more about our kernel object, we can print it.
pk
PseudotimeKernel[n=5780]
There isn’t very much here yet! Let’s use this kernel to compute a cell-cell transition_matrix
.
pk.compute_transition_matrix()
Computing transition matrix based on pseudotime`
Finish (0:00:01)
PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]
If we print the kernel again, we can inspect the parameters that were used to compute this transition matrix.
pk
PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]
To get a first impression of the cellular dynamics in this dataset, we can simulate random walks
on the Markov chain implied by the transition matrix, starting from hematopoietic stem cells (HSCs), and visualize these in the t-SNE embedding.
pk.plot_random_walks(
seed=0,
n_sims=100,
start_ixs={"clusters": "HSC_1"},
basis="tsne",
legend_loc="right",
dpi=150,
)
Black and yellow dots indicate random walk start and terminal cells, respectively. Overall, this reflects the known differentiation hierachy in human hematopoiesis.
See also
To learn more about the PseudotimeKernel
, see CellRank Meets Pseudotime.
Kernel overview¶
There exist CellRank kernels for many different data modalities. Each modality and biological system comes with its own strenghts and limitations, so it’s important to choose the kernel carefully. We provide some guidance in the figure below. However, please check the kernel API
for a complete and up-to-date list, as new kernels will come.
If you already have a cell-cell transition matrix for your data, you can pass that directly using the cellrank.kernels.PrecomputedKernel
, CellRank’s interface with external methods.
Combining different kernels¶
Any two kernels can be globally combined via a weighted mean. To demonstrate this, let’s set up an additional cellrank.kernels.ConnectivityKernel
, based on gene expression similarity.
from cellrank.kernels import ConnectivityKernel
ck = ConnectivityKernel(adata).compute_transition_matrix()
Computing transition matrix based on `adata.obsp['connectivities']`
Finish (0:00:00)
Print the kernel to see some properties
ck
ConnectivityKernel[n=5780, dnorm=True, key='connectivities']
Combine with the PseudotimeKernel
PseudotimeKernel from above.
combined_kernel = 0.8 * pk + 0.2 * ck
combined_kernel
(0.8 * PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3] + 0.2 * ConnectivityKernel[n=5780, dnorm=True, key='connectivities'])
This works for any combination of kernels, and generalizes to more than two kernels. Kernel combinations allow you to use distinct sources of information to describe cellular dynamics.
Writing and reading a kernel¶
Sometimes, we might want to write the kernel to disk, to either continue with the analysis later on, or to pass it to someone else. This can be easily done; simply write the kernel to adata, and write adata to file:
pk.write_to_adata()
adata.write("hematopoiesis.h5ad")
To continue with the analysis, we read the anndata.AnnData
object from disk, and initialize a new kernel from the AnnData object.
adata = sc.read("hematopoiesis.h5ad")
pk_new = cr.kernels.PseudotimeKernel.from_adata(adata, key="T_fwd")
pk_new
PseudotimeKernel[n=5780, dnorm=False, frac_to_keep=0.3, scheme='hard']
Working with estimators¶
Set up an estimator¶
Estimators take a kernel
object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow-time scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in cellrank.estimators
:
CFLARE
: Clustering and Filtering Left And Right Eigenvectors. Heuristic method based on the spectrum of the transition matrix.GPCCA
: Generalized Perron Cluster Cluster Analysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, [Reuter et al., 2019, Reuter et al., 2018].
We recommend the CFLARE
estimator; let’s start by initializing it.
from cellrank.estimators import GPCCA
g = GPCCA(pk_new)
print(g)
GPCCA[kernel=PseudotimeKernel[n=5780], initial_states=None, terminal_states=None]
Identify initial & terminal states¶
Fit the estimator to compute macrostates of cellular dynamics; these may include initial
, terminal
and intermediate states.
g.fit(n_states=10, cluster_key="clusters")
g.plot_macrostates(which="all")
Computing Schur decomposition
Adding `adata.uns['eigendecomposition_fwd']`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:00)
Computing `10` macrostates
WARNING: Color sequence contains non-unique elements
Adding `.macrostates`
`.macrostates_memberships`
`.coarse_T`
`.coarse_initial_distribution
`.coarse_stationary_distribution`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:08)
For each macrostate, the algorithm computes an associated stability value. Let’s use the most stable macrostates as terminal
states [Lange et al., 2022, Reuter et al., 2019, Reuter et al., 2018].
g.predict_terminal_states(method="top_n", n_states=6)
g.plot_macrostates(which="terminal")
This correctly identified all major terminal states
in this dataset. CellRank can also identify initial states
- in this dataset, that does not make too much sense though, as the initial state was manually passed to Palantir to root the pseudotime computation.
See also
To learn more about initial
and terminal
state identification, see Computing Initial and Terminal States.
Compute fate probabilities and driver genes¶
We can now compute how likely each cell is to reach each terminal state using compute_fate_probabilities()
.
g.compute_fate_probabilities()
g.plot_fate_probabilities(legend_loc="right")
The plot above combines fate probabilities towards all terminal states, each cell is colored according to its most likely fate; color intensity reflects the degree of lineage priming. We could equally plot fate probabilities separately for each terminal state, or we can visualize them jointly in a circular projection
[Lange et al., 2022, Velten et al., 2017].
Each dot represents a cell, colored by cluster labels. Cells are arranged inside the circle according to their fate probabilities; fate biased cells are placed next to their corresponding corner while naive cells are placed in the middle.
To infer putative driver genes for any of these trajectories, we correlate expression values with fate probabilities.
mono_drivers = g.compute_lineage_drivers(lineages="Mono_1_1")
mono_drivers.head(10)
Adding `adata.varm['terminal_lineage_drivers']`
`.lineage_drivers`
Finish (0:00:00)
Mono_1_1_corr | Mono_1_1_pval | Mono_1_1_qval | Mono_1_1_ci_low | Mono_1_1_ci_high | |
---|---|---|---|---|---|
index | |||||
AZU1 | 0.774242 | 0.0 | 0.0 | 0.763706 | 0.784367 |
MPO | 0.741949 | 0.0 | 0.0 | 0.730134 | 0.753321 |
ELANE | 0.736333 | 0.0 | 0.0 | 0.724302 | 0.747916 |
CTSG | 0.722017 | 0.0 | 0.0 | 0.709442 | 0.734133 |
PRTN3 | 0.721471 | 0.0 | 0.0 | 0.708875 | 0.733606 |
CFD | 0.614439 | 0.0 | 0.0 | 0.598133 | 0.630237 |
RNASE2 | 0.595055 | 0.0 | 0.0 | 0.578143 | 0.611456 |
MS4A3 | 0.581471 | 0.0 | 0.0 | 0.564147 | 0.598283 |
SRGN | 0.570484 | 0.0 | 0.0 | 0.552833 | 0.587622 |
CSTA | 0.559455 | 0.0 | 0.0 | 0.541484 | 0.576915 |
See also
To learn more about fate probabilities and driver genes, see Estimating Fate Probabilities and Driver Genes.
Visualize expression trends¶
Given fate probabilities and a pseudotime, we can plot trajectory-specific gene expression trends. Specifically, we fit Generalized Additive Models (GAMs), weighthing each cells contribution to each trajectory according to its vector of fate probabilities. We start by initializing a model
.
model = cr.models.GAMR(adata)
Note
The GAMR
uses the R package mgcv for parameter fitting; this requires you to have rpy2 installed. To avoid this and remain entirely in python, you can use cellrank.models.GAM
, which uses pyGAM for parameter fitting.
With the model initialized, we can visualize gene dynamics along specific trajectories.
cr.pl.gene_trends(
adata,
model=model,
data_key="MAGIC_imputed_data",
genes=["GATA1", "CD34", "IRF8", "MPO"],
same_plot=True,
ncols=2,
time_key="palantir_pseudotime",
hide_cells=True,
)
Note
We use MAGIC-imputed data for gene-trend visualization [van Dijk et al., 2018]. We don’t use imputed data for any other task.
Above, we grouped expression trends by gene, and visualized several trajectories per panel. We can also do it the other way round - group expression trends by trajectory, visualize several genes per panel. While this is possible using the line plots from above (set transpose=True
), we’ll demonstrate it using heatmaps.
cr.pl.heatmap(
adata,
model=model,
data_key="MAGIC_imputed_data",
genes=["GATA1", "CD34", "IRF8", "MPO"],
lineages=["Mono_1_1", "Mega", "CLP"],
time_key="palantir_pseudotime",
cbar=False,
show_all_genes=True,
)
See also
To learn more about gene trend visualization and clustering, see Visualizing and Clustering Gene Expression Trends.
Closing matters¶
Creating a new kernel - contributing to CellRank¶
If you have you own method in mind for computing a cell-cell transition matrix, potentially based on some new data modality, consider including it as a kernel in CellRank. That way, you benefit from CellRank’s established interfaces with packages like scanpy
or AnnData
, can take advantage of existing downstream analysis capabilities, and immediately reach a large audience of existing CellRank users.
See also
To learn more about creating your own kernel, see TODO.
What’s next?¶
In this tutorial, you learned the basics of CellRank. For the next steps, we recommend to:
explore the various CellRank kernels (REF) and corresponding tutorials.
take a look at the API to learn about parameter values you can use to adapt these computations to your data.
dive deeper into initial and terminal states, fate probabilities, and gene expression trends.
Package versions¶
cr.logging.print_versions()
cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2