# CellRank Meets RNA Velocity#

## Preliminaries#

In this tutorial, you will learn how to:

use

`scvelo`

to compute RNA velocity [Bergen*et al.*, 2020, La Manno*et al.*, 2018].set up CellRank’s

`VelocityKernel`

and compute a transition matrix based on RNA velocity.combine the

`VelocityKernel`

with the`ConnectivityKernel`

to emphasize gene expression similarity.visualize the transition matrix in a low-dimensional embedding.

This tutorial notebook can be downloaded using the following link.

Note

If you want to run this on your own data, you will need:

a scRNA-seq dataset which has been preprocessed to contain unspliced & spliced counts using a software like the Velocyto command line tool.

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

### Import packages & data#

```
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank
```

```
import numpy as np
import cellrank as cr
import scanpy as sc
import scvelo as scv
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
```

```
Global seed set to 0
```

```
import warnings
warnings.simplefilter("ignore", category=UserWarning)
```

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of mouse pancreas development at embryonic day (E) 15.5, which can be conveniently acessed through `pancreas`

[Bastidas-Ponce *et al.*, 2019, Bergen *et al.*, 2020].

We visualize the fraction of spliced/unspliced reads; these are required to estimate RNA velocity.

```
adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata
```

```
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
```

### Preprocess the data#

Filter out genes which don’t have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation.

```
scv.pp.filter_and_normalize(
adata, min_shared_counts=20, n_top_genes=2000, subset_highly_variable=False
)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30, random_state=0)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
```

```
Filtered out 22024 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
```

### Run scVelo#

We will use scVelo’s `dynamical model`

to estimate model parameters.

Note

Please make sure to have at least version `0.2.3`

of scVelo installed to make use **parallelisation** in `scv.tl.recover_dynamics`

. On my MacBook, using 8 cores, the below cell takes about 1 min to execute.

```
scv.tl.recover_dynamics(adata, n_jobs=8)
```

```
recovering dynamics (using 8/8 cores)
```

```
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
```

```
finished (0:01:11) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
```

Once we have the parameters, we can use these to compute the actual velocities.

```
scv.tl.velocity(adata, mode="dynamical")
```

```
computing velocities
finished (0:00:01) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
```

We’ll use CellRank to visualize the velocities in an embedding further down below.

## Combine RNA velocity with expression similarity in high dimensions#

### Set up the VelocityKernel#

Set up the `VelocityKernel`

from the `AnnData`

object containing the scVelo-computed velocities.

```
vk = cr.kernels.VelocityKernel(adata)
```

Use the `VelocityKernel`

to compute a transition matrix by correlating each cell’s velocity vector with the displacement vectors towards nearest neighbors, directly in high-dimensional gene expression space.

We do this using the `compute_transition_matrix()`

method.

```
vk.compute_transition_matrix()
```

```
Computing transition matrix using `'deterministic'` model
```

```
Using `softmax_scale=4.0132`
```

```
Finish (0:00:01)
```

```
VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]
```

By default, we use the `deterministic`

mode to compute the transiton matrix. If you want to propagate uncertainty in the velocity vectors, check out the `stochastic`

and `monte_carlo`

modes. The `stochastic`

mode estimates a distribution over velocity vectors using the KNN graph and propagates this distribution into the transition matrix using an analytical approximation.

Note

Please check out our “CellRank for directed single-cell fate mapping” paper to learn more on uncertainty propagation and how it makes computations more robust to noise [Lange *et al.*, 2022].

### Combine with gene expression similarity#

RNA velocity can be a very noise quantity; to make our computations more robust, we combine the `VelocityKernel`

with the similarity-based `ConnectivityKernel`

.

```
ck = cr.kernels.ConnectivityKernel(adata)
ck.compute_transition_matrix()
combined_kernel = 0.8 * vk + 0.2 * ck
```

```
Computing transition matrix based on `adata.obsp['connectivities']`
Finish (0:00:00)
```

Above, we combined the two kernels with custom weights given to each. The same syntax can be used to combine any number of CellRank kernels, see the getting started tutorial.

Let’s print the combined kernel object to see what it contains:

```
print(combined_kernel)
```

```
(0.8 * VelocityKernel[n=2531] + 0.2 * ConnectivityKernel[n=2531])
```

This tells us about the kernels we combined, the weights we used to combine them, and the parameters that went into transition matrix computation for each kernel. Check out the `API`

to learn more about the parameters. Next, let’s explore ways to visualize the computed transition matrix.

### Visualize the transition matrix#

Similar to `scvelo`

[Bergen *et al.*, 2020] and velocyto [La Manno *et al.*, 2018], CellRank visualizes the transition matrix in any low dimensional embedding (UMAP, t-SNE, PCA, Diffmap, etc.) via arrows or streamlines.

```
vk.plot_projection()
```

```
Projecting transition matrix onto `umap`
Adding `adata.obsm['T_fwd_umap']`
Finish (0:00:00)
```

As shown before in the scVelo publication [Bergen *et al.*, 2020], the projected velocity vectors capture the overall trend in this system: *Neurogenin 3* low endocrine progenitor cells (Ngn3 low EP) gradually transition via indermediate stages towards terminal, hormone producing Alpha, Beta, Epsilon and Delta cells. Another way to visualize this is via random walks.

```
vk.plot_random_walks(start_ixs={"clusters": "Ngn3 low EP"}, max_iter=200, seed=0)
```

```
Simulating `100` random walks of maximum length `200`
```

```
Finish (0:00:01)
Plotting random walks
```

We simulated random walks starting in the `Ngn3 low EP`

cluster. Black and yellow dots denote the start and end points of a random walk, respectively. In agreement with known biology, the Beta cluster is dominant at E15.5 [Bastidas-Ponce *et al.*, 2019].

Note

Take these low-dimensional representations with a grain of salt; streamlines in a low-dimensional embeddings are very biased by the topology of that embedding. They often oversmooth that actual dynamics, or obscure them entirely. Random walks are a better alternative; these are sampled directly from the Markov chain and don’t depend on the low dimensional embedding. Additionally, we recommend analyzing high-dimensional dynamics using `estimators`

.

## Closing matters#

### Write results to file#

We will use the kernel computed here in other tutorials to demonstrate downstream applications. For this reason, we write the kernel attributes to the underlying `anndata.AnnData`

object here and save to disk. In other tutorials, we can then reconstruct the kernel from the saved object. The dataset written here is available via `cellrank.datasets.pancreas`

by specifying `kind="preprocessed-kernel"`

.

```
vk.write_to_adata()
adata.write(
"datasets/endocrinogenesis_day15.5_velocity_kernel.h5ad", compression="gzip"
)
```

### What’s next?#

In this tutorial, you learned how to use CellRank to compute a transition matrix using RNA velocity and gene expression similarity and how it can be visualized in low dimensions [Bergen *et al.*, 2020, La Manno *et al.*, 2018, Lange *et al.*, 2022]. The real power of CellRank comes in when you use estimators to analyze the transition matrix directly, rather than projecting it. For the next steps, we recommend…

going through the initial and terminal states tutorial to learn how to use the transition matrix to automatically identify initial and terminal states.

reading our “CellRank for directed single-cell fate mapping” paper to learn more about the methods we used here [Lange

*et al.*, 2022].taking a look at the

`API`

to learn about parameter values you can use to adapt these computations to your data.

### Package versions#

We used the following package versions to generate this tutorial:

```
cr.logging.print_versions()
```

```
cellrank==1.5.1+gedbc651e scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2
```