CellRank Meets RNA Velocity¶
Preliminaries¶
In this tutorial, you will learn how to:
use
scvelo
to compute RNA velocity [Bergen et al., 2020, La Manno et al., 2018].set up CellRank’s
VelocityKernel
and compute a transition matrix based on RNA velocity.combine the
VelocityKernel
with theConnectivityKernel
to emphasize gene expression similarity.visualize the transition matrix in a low-dimensional embedding.
This tutorial notebook can be downloaded using the following link.
Note
If you want to run this on your own data, you will need:
a scRNA-seq dataset which has been preprocessed to contain unspliced & spliced counts using a software like the Velocyto command line tool.
Note
If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.
Import packages & data¶
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank
import numpy as np
import cellrank as cr
import scanpy as sc
import scvelo as scv
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
Global seed set to 0
import warnings
warnings.simplefilter("ignore", category=UserWarning)
To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of mouse pancreas development at embryonic day (E) 15.5, which can be conveniently acessed through pancreas
[Bastidas-Ponce et al., 2019, Bergen et al., 2020].
We visualize the fraction of spliced/unspliced reads; these are required to estimate RNA velocity.
adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
Preprocess the data¶
Filter out genes which don’t have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation.
scv.pp.filter_and_normalize(
adata, min_shared_counts=20, n_top_genes=2000, subset_highly_variable=False
)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30, random_state=0)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 22024 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
Run scVelo¶
We will use scVelo’s dynamical model
to estimate model parameters.
Note
Please make sure to have at least version 0.2.3
of scVelo installed to make use parallelisation in scv.tl.recover_dynamics
. On my MacBook, using 8 cores, the below cell takes about 1 min to execute.
scv.tl.recover_dynamics(adata, n_jobs=8)
recovering dynamics (using 8/8 cores)
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
finished (0:01:11) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
Once we have the parameters, we can use these to compute the actual velocities.
scv.tl.velocity(adata, mode="dynamical")
computing velocities
finished (0:00:01) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
We’ll use CellRank to visualize the velocities in an embedding further down below.
Combine RNA velocity with expression similarity in high dimensions¶
Set up the VelocityKernel¶
Set up the VelocityKernel
from the AnnData
object containing the scVelo-computed velocities.
vk = cr.kernels.VelocityKernel(adata)
Use the VelocityKernel
to compute a transition matrix by correlating each cell’s velocity vector with the displacement vectors towards nearest neighbors, directly in high-dimensional gene expression space.
We do this using the compute_transition_matrix()
method.
vk.compute_transition_matrix()
Computing transition matrix using `'deterministic'` model
Using `softmax_scale=4.0132`
Finish (0:00:01)
VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]
By default, we use the deterministic
mode to compute the transiton matrix. If you want to propagate uncertainty in the velocity vectors, check out the stochastic
and monte_carlo
modes. The stochastic
mode estimates a distribution over velocity vectors using the KNN graph and propagates this distribution into the transition matrix using an analytical approximation.
Note
Please check out our “CellRank for directed single-cell fate mapping” paper to learn more on uncertainty propagation and how it makes computations more robust to noise [Lange et al., 2022].
Combine with gene expression similarity¶
RNA velocity can be a very noise quantity; to make our computations more robust, we combine the VelocityKernel
with the similarity-based ConnectivityKernel
.
ck = cr.kernels.ConnectivityKernel(adata)
ck.compute_transition_matrix()
combined_kernel = 0.8 * vk + 0.2 * ck
Computing transition matrix based on `adata.obsp['connectivities']`
Finish (0:00:00)
Above, we combined the two kernels with custom weights given to each. The same syntax can be used to combine any number of CellRank kernels, see the getting started tutorial.
Let’s print the combined kernel object to see what it contains:
print(combined_kernel)
(0.8 * VelocityKernel[n=2531] + 0.2 * ConnectivityKernel[n=2531])
This tells us about the kernels we combined, the weights we used to combine them, and the parameters that went into transition matrix computation for each kernel. Check out the API
to learn more about the parameters. Next, let’s explore ways to visualize the computed transition matrix.
Visualize the transition matrix¶
Similar to scvelo
[Bergen et al., 2020] and velocyto [La Manno et al., 2018], CellRank visualizes the transition matrix in any low dimensional embedding (UMAP, t-SNE, PCA, Diffmap, etc.) via arrows or streamlines.
vk.plot_projection()
As shown before in the scVelo publication [Bergen et al., 2020], the projected velocity vectors capture the overall trend in this system: Neurogenin 3 low endocrine progenitor cells (Ngn3 low EP) gradually transition via indermediate stages towards terminal, hormone producing Alpha, Beta, Epsilon and Delta cells. Another way to visualize this is via random walks.
vk.plot_random_walks(start_ixs={"clusters": "Ngn3 low EP"}, max_iter=200, seed=0)
We simulated random walks starting in the Ngn3 low EP
cluster. Black and yellow dots denote the start and end points of a random walk, respectively. In agreement with known biology, the Beta cluster is dominant at E15.5 [Bastidas-Ponce et al., 2019].
Note
Take these low-dimensional representations with a grain of salt; streamlines in a low-dimensional embeddings are very biased by the topology of that embedding. They often oversmooth that actual dynamics, or obscure them entirely. Random walks are a better alternative; these are sampled directly from the Markov chain and don’t depend on the low dimensional embedding. Additionally, we recommend analyzing high-dimensional dynamics using estimators
.
Closing matters¶
Write results to file¶
We will use the kernel computed here in other tutorials to demonstrate downstream applications. For this reason, we write the kernel attributes to the underlying anndata.AnnData
object here and save to disk. In other tutorials, we can then reconstruct the kernel from the saved object. The dataset written here is available via cellrank.datasets.pancreas
by specifying kind="preprocessed-kernel"
.
vk.write_to_adata()
adata.write(
"datasets/endocrinogenesis_day15.5_velocity_kernel.h5ad", compression="gzip"
)
What’s next?¶
In this tutorial, you learned how to use CellRank to compute a transition matrix using RNA velocity and gene expression similarity and how it can be visualized in low dimensions [Bergen et al., 2020, La Manno et al., 2018, Lange et al., 2022]. The real power of CellRank comes in when you use estimators to analyze the transition matrix directly, rather than projecting it. For the next steps, we recommend…
going through the initial and terminal states tutorial to learn how to use the transition matrix to automatically identify initial and terminal states.
reading our “CellRank for directed single-cell fate mapping” paper to learn more about the methods we used here [Lange et al., 2022].
taking a look at the
API
to learn about parameter values you can use to adapt these computations to your data.
Package versions¶
We used the following package versions to generate this tutorial:
cr.logging.print_versions()
cellrank==1.5.1+gedbc651e scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2