# Getting Started with CellRank#

## Preliminaries#

CellRank is a software toolkit to study dynamical biological processes, like development, regeneration, cancer or reprogramming, based on multi-view single-cell genomics data [Lange *et al.*, 2022], including RNA velocity [Bergen *et al.*, 2020, La Manno *et al.*, 2018], developmental potentials [Gulati *et al.*, 2020], pseudotemporal orderings [Bendall *et al.*, 2014, Haghverdi *et al.*, 2016, Setty *et al.*, 2019, Trapnell *et al.*, 2014], (spatial) time course studies [Klein *et al.*, 2023, Schiebinger *et al.*, 2019], lineage-traced data [Alemany *et al.*, 2018, Forrow and Schiebinger, 2021, Lange *et al.*, 2023, Raj *et al.*, 2018, Spanjaard *et al.*, 2018], metabolic labels [Battich *et al.*, 2020, Erhard *et al.*, 2019, Qiu *et al.*, 2020, Qiu *et al.*, 2022], and more.

Oftentimes, the type of downstream analysis does not depend on the modality that was used to estimate cellular transitions - that’s why we modularized CellRank into `kernels`

, which compute cell-cell transition matrices, and `estimators`

, which analyze the transition matrices.

In this tutorial, you will learn how to:

use CellRank

`kernels`

to compute a transition matrix of cellular dynamics.use CellRank

`estimators`

to analyze the transition matrix, including the computation of`fate probabilities`

,`driver genes`

, and`gene expression trends`

.read and write CellRank

`kernels`

.

This tutorial notebook can be downloaded using the following link.

Note

If you want to run this on your own data, you will need:

a scRNA-seq dataset for which you can compute a cell-cell transition matrix using any CellRank

`kernel`

.

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

### Import packages & data#

```
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank
```

```
import cellrank as cr
import scanpy as sc
sc.settings.set_figure_params(frameon=False, dpi=100)
cr.settings.verbosity = 2
```

```
Global seed set to 0
```

```
import warnings
warnings.simplefilter("ignore", category=UserWarning)
```

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of human bone marrow, which can be conveniently acessed through `bone_marrow`

[Setty *et al.*, 2019].

```
adata = cr.datasets.bone_marrow()
adata
```

```
AnnData object with n_obs × n_vars = 5780 × 27876
obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'
var: 'palantir'
uns: 'clusters_colors', 'palantir_branch_probs_cell_types'
obsm: 'MAGIC_imputed_data', 'X_tsne', 'palantir_branch_probs'
layers: 'spliced', 'unspliced'
```

Move MAGIC imputed data to `layers`

[van Dijk *et al.*, 2018].

```
adata = adata[:, adata.var["palantir"]].copy()
adata.layers["MAGIC_imputed_data"] = adata.obsm["MAGIC_imputed_data"].copy()
```

### Preprocess the data#

Filter and normalize.

```
sc.pp.filter_genes(adata, min_cells=5)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
```

Compute PCA and k-nearest neighbor (k-NN) graph.

```
sc.tl.pca(adata, random_state=0)
sc.pp.neighbors(adata, random_state=0)
```

Visualize this data, using the t-SNE embedding, cluster labels and Palantir pseudotime provided with the original study [Setty *et al.*, 2019].

```
sc.pl.embedding(adata, basis="tsne", color=["clusters", "palantir_pseudotime"])
```

## Working with kernels#

### Set up a kernel#

To construct a transition matrix, CellRank offers a number of kernel classes in `kernels`

. To demonstrate the concept, we’ll use the `PseudotimeKernel`

which biases k-NN graph edges to point into the direction of increasing pseudotime, inspired by Palantir [Setty *et al.*, 2019].

For a full list of kernels, check out our `API`

. To initialize a kernel object, simply run the following:

```
from cellrank.kernels import PseudotimeKernel
pk = PseudotimeKernel(adata, time_key="palantir_pseudotime")
```

Note that kernels need an `anndata.AnnData`

object to read data from it - CellRank is part of the Scanpy/AnnData ecosystem. The only exception to this is the `cellrank.kernels.PrecomputedKernel`

which directly accepts a transition matrix, thus making it possible to interface to CellRank from outside the scanpy/AnnData world.

To learn more about our kernel object, we can print it.

```
pk
```

```
PseudotimeKernel[n=5780]
```

There isn’t very much here yet! Let’s use this kernel to compute a cell-cell `transition_matrix`

.

```
pk.compute_transition_matrix()
```

```
Computing transition matrix based on pseudotime`
```

```
Finish (0:00:01)
```

```
PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]
```

If we print the kernel again, we can inspect the parameters that were used to compute this transition matrix.

```
pk
```

```
PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3]
```

To get a first impression of the cellular dynamics in this dataset, we can simulate `random walks`

on the Markov chain implied by the transition matrix, starting from hematopoietic stem cells (HSCs), and visualize these in the t-SNE embedding.

```
pk.plot_random_walks(
seed=0,
n_sims=100,
start_ixs={"clusters": "HSC_1"},
basis="tsne",
legend_loc="right",
dpi=150,
)
```

```
Simulating `100` random walks of maximum length `1445`
```

```
Finish (0:00:12)
Plotting random walks
```

Black and yellow dots indicate random walk start and terminal cells, respectively. Overall, this reflects the known differentiation hierachy in human hematopoiesis.

See also

To learn more about the `PseudotimeKernel`

, see CellRank Meets Pseudotime.

### Kernel overview#

There exist CellRank kernels for many different data modalities. Each modality and biological system comes with its own strenghts and limitations, so it’s important to choose the kernel carefully. We provide some guidance in the figure below. However, please check the `kernel API`

for a complete and up-to-date list, as new kernels will come.

If you already have a cell-cell transition matrix for your data, you can pass that directly using the `cellrank.kernels.PrecomputedKernel`

, CellRank’s interface with external methods.

### Combining different kernels#

Any two kernels can be globally combined via a weighted mean. To demonstrate this, let’s set up an additional `cellrank.kernels.ConnectivityKernel`

, based on gene expression similarity.

```
from cellrank.kernels import ConnectivityKernel
ck = ConnectivityKernel(adata).compute_transition_matrix()
```

```
Computing transition matrix based on `adata.obsp['connectivities']`
Finish (0:00:00)
```

Print the kernel to see some properties

```
ck
```

```
ConnectivityKernel[n=5780, dnorm=True, key='connectivities']
```

Combine with the `PseudotimeKernel`

PseudotimeKernel from above.

```
combined_kernel = 0.8 * pk + 0.2 * ck
combined_kernel
```

```
(0.8 * PseudotimeKernel[n=5780, dnorm=False, scheme='hard', frac_to_keep=0.3] + 0.2 * ConnectivityKernel[n=5780, dnorm=True, key='connectivities'])
```

This works for any combination of kernels, and generalizes to more than two kernels. Kernel combinations allow you to use distinct sources of information to describe cellular dynamics.

### Writing and reading a kernel#

Sometimes, we might want to write the kernel to disk, to either continue with the analysis later on, or to pass it to someone else. This can be easily done; simply write the kernel to adata, and write adata to file:

```
pk.write_to_adata()
adata.write("hematopoiesis.h5ad")
```

To continue with the analysis, we read the `anndata.AnnData`

object from disk, and initialize a new kernel from the AnnData object.

```
adata = sc.read("hematopoiesis.h5ad")
pk_new = cr.kernels.PseudotimeKernel.from_adata(adata, key="T_fwd")
pk_new
```

```
PseudotimeKernel[n=5780, dnorm=False, frac_to_keep=0.3, scheme='hard']
```

## Working with estimators#

### Set up an estimator#

Estimators take a `kernel`

object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow-time scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in `cellrank.estimators`

:

`CFLARE`

:**C**lustering and**F**iltering**L**eft**A**nd**R**ight**E**igenvectors. Heuristic method based on the spectrum of the transition matrix.`GPCCA`

:**G**eneralized**P**erron**C**luster**C**luster**A**nalysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, [Reuter*et al.*, 2019, Reuter*et al.*, 2018].

We recommend the `CFLARE`

estimator; let’s start by initializing it.

```
from cellrank.estimators import GPCCA
g = GPCCA(pk_new)
print(g)
```

```
GPCCA[kernel=PseudotimeKernel[n=5780], initial_states=None, terminal_states=None]
```

### Identify initial & terminal states#

Fit the estimator to compute macrostates of cellular dynamics; these may include `initial`

, `terminal`

and intermediate states.

```
g.fit(n_states=10, cluster_key="clusters")
g.plot_macrostates(which="all")
```

```
Computing Schur decomposition
Adding `adata.uns['eigendecomposition_fwd']`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:00)
Computing `10` macrostates
WARNING: Color sequence contains non-unique elements
Adding `.macrostates`
`.macrostates_memberships`
`.coarse_T`
`.coarse_initial_distribution
`.coarse_stationary_distribution`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:08)
```

For each macrostate, the algorithm computes an associated stability value. Let’s use the most stable macrostates as `terminal`

states [Lange *et al.*, 2022, Reuter *et al.*, 2019, Reuter *et al.*, 2018].

```
g.predict_terminal_states(method="top_n", n_states=6)
g.plot_macrostates(which="terminal")
```

```
Adding `adata.obs['term_states_fwd']`
`adata.obs['term_states_fwd_probs']`
`.terminal_states`
`.terminal_states_probabilities`
`.terminal_states_memberships
Finish`
```

This correctly identified all major `terminal states`

in this dataset. CellRank can also identify `initial states`

- in this dataset, that does not make too much sense though, as the initial state was manually passed to Palantir to root the pseudotime computation.

See also

To learn more about `initial`

and `terminal`

state identification, see Computing Initial and Terminal States.

### Compute fate probabilities and driver genes#

We can now compute how likely each cell is to reach each terminal state using `compute_fate_probabilities()`

.

```
g.compute_fate_probabilities()
g.plot_fate_probabilities(legend_loc="right")
```

```
Computing fate probabilities
```

```
Adding `adata.obsm['lineages_fwd']`
`.fate_probabilities`
Finish (0:00:00)
```

The plot above combines fate probabilities towards all terminal states, each cell is colored according to its most likely fate; color intensity reflects the degree of lineage priming. We could equally plot fate probabilities separately for each terminal state, or we can visualize them jointly in a `circular projection`

[Lange *et al.*, 2022, Velten *et al.*, 2017].

```
cr.pl.circular_projection(adata, keys="clusters", legend_loc="right")
```

```
Solving TSP for `6` states
```

Each dot represents a cell, colored by cluster labels. Cells are arranged inside the circle according to their fate probabilities; fate biased cells are placed next to their corresponding corner while naive cells are placed in the middle.

To infer putative driver genes for any of these trajectories, we correlate expression values with fate probabilities.

```
mono_drivers = g.compute_lineage_drivers(lineages="Mono_1_1")
mono_drivers.head(10)
```

```
Adding `adata.varm['terminal_lineage_drivers']`
`.lineage_drivers`
Finish (0:00:00)
```

Mono_1_1_corr | Mono_1_1_pval | Mono_1_1_qval | Mono_1_1_ci_low | Mono_1_1_ci_high | |
---|---|---|---|---|---|

index | |||||

AZU1 | 0.774242 | 0.0 | 0.0 | 0.763706 | 0.784367 |

MPO | 0.741949 | 0.0 | 0.0 | 0.730134 | 0.753321 |

ELANE | 0.736333 | 0.0 | 0.0 | 0.724302 | 0.747916 |

CTSG | 0.722017 | 0.0 | 0.0 | 0.709442 | 0.734133 |

PRTN3 | 0.721471 | 0.0 | 0.0 | 0.708875 | 0.733606 |

CFD | 0.614439 | 0.0 | 0.0 | 0.598133 | 0.630237 |

RNASE2 | 0.595055 | 0.0 | 0.0 | 0.578143 | 0.611456 |

MS4A3 | 0.581471 | 0.0 | 0.0 | 0.564147 | 0.598283 |

SRGN | 0.570484 | 0.0 | 0.0 | 0.552833 | 0.587622 |

CSTA | 0.559455 | 0.0 | 0.0 | 0.541484 | 0.576915 |

See also

To learn more about fate probabilities and driver genes, see Estimating Fate Probabilities and Driver Genes.

### Visualize expression trends#

Given fate probabilities and a pseudotime, we can plot trajectory-specific gene expression trends. Specifically, we fit Generalized Additive Models (GAMs), weighthing each cells contribution to each trajectory according to its vector of fate probabilities. We start by initializing a `model`

.

```
model = cr.models.GAMR(adata)
```

Note

The `GAMR`

uses the R package mgcv for parameter fitting; this requires you to have rpy2 installed. To avoid this and remain entirely in python, you can use `cellrank.models.GAM`

, which uses pyGAM for parameter fitting.

With the model initialized, we can visualize gene dynamics along specific trajectories.

```
cr.pl.gene_trends(
adata,
model=model,
data_key="MAGIC_imputed_data",
genes=["GATA1", "CD34", "IRF8", "MPO"],
same_plot=True,
ncols=2,
time_key="palantir_pseudotime",
hide_cells=True,
)
```

```
Computing trends using `1` core(s)
```

```
Finish (0:00:00)
Plotting trends
```

Note

We use MAGIC-imputed data for gene-trend visualization [van Dijk *et al.*, 2018]. We don’t use imputed data for any other task.

Above, we grouped expression trends by gene, and visualized several trajectories per panel. We can also do it the other way round - group expression trends by trajectory, visualize several genes per panel. While this is possible using the line plots from above (set `transpose=True`

), we’ll demonstrate it using heatmaps.

```
cr.pl.heatmap(
adata,
model=model,
data_key="MAGIC_imputed_data",
genes=["GATA1", "CD34", "IRF8", "MPO"],
lineages=["Mono_1_1", "Mega", "CLP"],
time_key="palantir_pseudotime",
cbar=False,
show_all_genes=True,
)
```

```
Computing trends using `1` core(s)
```

```
Finish (0:00:00)
```

See also

To learn more about gene trend visualization and clustering, see Visualizing and Clustering Gene Expression Trends.

## Closing matters#

### Creating a new kernel - contributing to CellRank#

If you have you own method in mind for computing a cell-cell transition matrix, potentially based on some new data modality, consider including it as a kernel in CellRank. That way, you benefit from CellRank’s established interfaces with packages like `scanpy`

or `AnnData`

, can take advantage of existing downstream analysis capabilities, and immediately reach a large audience of existing CellRank users.

See also

To learn more about creating your own kernel, see TODO.

### What’s next?#

In this tutorial, you learned the basics of CellRank. For the next steps, we recommend to:

explore the various CellRank kernels (REF) and corresponding tutorials.

take a look at the API to learn about parameter values you can use to adapt these computations to your data.

dive deeper into initial and terminal states, fate probabilities, and gene expression trends.

### Package versions#

```
cr.logging.print_versions()
```

```
cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2
```