# CellRank meets CytoTRACE#

In this tutorial, you will learn how to…

compute a developmental potential using our adapted CytoTRACE implementation [Gulati

*et al.*, 2020].set up CellRank’s CytoTRACEKernel to compute a transition matrix based on the CytoTRACE score.

visualize the transition matrix in a low-dimensional embedding.

Here’s a conceptual overview of how we compute the CytoTRACE score:

**Fig** | Following the original CytoTRACE suggestion, we use the number of genes expressed per cell as a raw signal of differentiation status and apply several post-processing steps to increase robustness. The basic assumption behind this method is that naive cells on average express more genes compared to mature cells because they regulate their chromatin less tightly; we’ve observed this to work well for many early developmental datasets.

Important

A bottleneck in the original implementation was scalabilty of the post-processing steps; we therefore adapted the original CytoTRACE implementation to be much faster and memory efficient:

**Fig** | While the original implementation breaks down at 100k cells on our servers, CellRank easily scales to 1.3M cells in less than 100 sec.

On a few datasets encompassing different single-cell technologies and biological systems, we validated that the two implementations give the same results:

**Fig** | CellRank faithfully implements the CytoTRACE score.

**a.** Low-dimensional representation of 22,370 embryonic *C.elegans* muscle and mesoderm cells [Packer *et al.*, 2019] colored according to the CytoTRACE pseudotime (left: CellRank implementation, middle: original CytoTRACE implementation [Gulati *et al.*, 2020]) and ground truth embryo time from the original publication.

**b.** Scatter plot comparing both implementations across 6 datasets; x-axis (y-axis) represents Spearman’s rank correlation between CellRank (original) CytoTRACE pseudotime and embryo time, averaged per embryo time-point. In the legend, the number behind each dataset denotes Person correlation between the two implementations. CellRank’s accelerated implementation correlates highly with the original implemetation and agrees with grount-truth embryo times as well as the original implementation.

**Method**: To get from the CytoTRACE pseudotime to a directed transition matrix, we use the general mechanism described in the pseudotime tutorial. To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of zebrafish embryogenesis assayed using drop-seq, restricted to the axial mesoderm lineage [Farrell *et al.*, 2018].

Important

If you want to run this on your own data, you will need…

a scRNA-seq dataset which satisfies the CytoTRACE assumptions, i.e., naive cell states should on avarage express more genes. This is often the case for early develpmental stages.

This tutorial notebook can be downloaded using the following link.

## Import packages & data#

```
[1]:
```

```
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank@dev
```

```
[2]:
```

```
import scvelo as scv
import scanpy as sc
import cellrank as cr
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
```

```
[3]:
```

```
adata = cr.datasets.zebrafish()
adata
```

```
[3]:
```

```
AnnData object with n_obs × n_vars = 2434 × 23974
obs: 'Stage', 'gt_terminal_states', 'lineages'
uns: 'Stage_colors', 'gt_terminal_states_colors', 'lineages_colors'
obsm: 'X_force_directed'
```

This zebrafish dataset contains 12 time-points spanning 3.3 - 12 hours past fertilization. We can use the original force-directed layout to plot cells, colored by developmental stage:

```
[4]:
```

```
scv.pl.scatter(adata, basis="force_directed", c="Stage", legend_loc="right")
```

We can further look into the lineage assignments, computed in the original study using URD [Farrell *et al.*, 2018]:

```
[5]:
```

```
scv.pl.scatter(adata, basis="force_directed", c="lineages", legend_loc="right")
```

## Pre-process the data#

Before we can start applying the CytoTRACE kernel, we’ll have to do some basic pre-processing of the data.

```
[6]:
```

```
# filter, normalize total counts and log-transform
sc.pp.filter_genes(adata, min_cells=10)
scv.pp.normalize_per_cell(adata)
sc.pp.log1p(adata)
# hvg annotation
sc.pp.highly_variable_genes(adata)
print(f"This detected {adata.var['highly_variable'].sum()} highly variable genes. ")
# use scVelo's `moments` function for imputation - note that hack we're using here:
# we're copying our `.X` matrix into the layers because that's where `scv.tl.moments`
# expects to find counts for imputation
adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
```

```
Normalized count data: X.
This detected 2392 highly variable genes.
computing neighbors
finished (0:00:02) --> added
'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
finished (0:00:01) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
```

## Initialize the CytoTRACE kernel#

Import the kernel, initialize it using the pre-processed AnnData object and compute the CytoTRACE score:

```
[7]:
```

```
from cellrank.kernels import CytoTRACEKernel
ctk = CytoTRACEKernel(adata).compute_cytotrace()
```

```
Computing CytoTRACE score with `13690` genes
Adding `adata.obs['ct_score']`
`adata.obs['ct_pseudotime']`
`adata.obs['ct_num_exp_genes']`
`adata.var['ct_gene_corr']`
`adata.var['ct_correlates']`
`adata.uns['ct_params']`
Finish (0:00:00)
```

We can visually check that our computed CytoTRACE pseudotime correlates well with ground-truth real-time annotations:

```
[8]:
```

```
scv.pl.scatter(
adata,
c=["ct_pseudotime", "Stage"],
basis="force_directed",
legend_loc="right",
color_map="gnuplot2",
)
```

To make this a bit more quantitative, we can look at the distribution of our pseudotime over real-time-points and show this via violin plots:

```
[9]:
```

```
sc.pl.violin(adata, keys=["ct_pseudotime"], groupby="Stage", rotation=90)
```

This looks good! The CytoTRACE pseudotime can thus be used to bias graph edges into the direction of increasing differentiation status, similar to how we did this for other pseudotimes in the PseudotimeKernel tutorial.

## Compute & visualize a transition matrix#

Compute the transition matrix:

```
[10]:
```

```
ctk.compute_transition_matrix(threshold_scheme="soft", nu=0.5)
```

```
Computing transition matrix based on pseudotime`
```

```
```

```
Finish (0:00:01)
```

```
[10]:
```

```
CytoTRACEKernel
```

To get some intuition for this transition matrix, we can project it into an embedding and draw the same sort of arrows that we all got used to from RNA velocity - the following plot will look **exactly** like the plots you’re used to from scVelo for visualizing RNA velocity [Bergen *et al.*, 2020, La Manno *et al.*, 2018].

Important

There’s **no RNA velocity** here, we’re visualizing the directed transition matrix we computed using the kNN graph as well as the CytoTRACE pseudotime.

```
[11]:
```

```
ctk.plot_projection(basis="force_directed", color="Stage", legend_loc='right')
```

```
Projecting transition matrix onto `force_directed`
Adding `adata.obsm['T_fwd_force_directed']`
Finish (0:00:00)
```

The process of differentiation seems to be captured well by this transition matrix. We can visually confirm this by looking at terminal-state annotations from the original study:

```
[12]:
```

```
scv.pl.scatter(
adata, basis="force_directed", c="gt_terminal_states", legend_loc="right"
)
```

Another way to gain intuition for the transition matrix is by drawing some cells from the early stage and to use these as starting cells to simulate random walks on the transition matrix:

```
[13]:
```

```
ctk.plot_random_walks(
n_sims=15,
start_ixs={"Stage": "03.3-HIGH"},
basis="force_directed",
color="Stage",
legend_loc="right",
seed=1,
)
```

```
Simulating `15` random walks of maximum length `609`
```

```
Finish (0:00:01)
Plotting random walks
```

```
```

Black dots denote sampled starting cells for random walks, yellow dots denote end cells. We terminated each random walk after a predefined number of steps. Random walks are colored according to how long they’ve been running for - i.e. the initial segments are more black/blue whereas the late segments are more orange/yellow. We can see that most random walks terminate in one of the two terminal states, as expected.

Important

The visualization techniques demonstrated here work for **every kernel**, no matter whether it’s a VelocityKernel, PseudotimeKernel, CytoTRACEKernel or some other kernel.

## What’s next?#

In this tutorial, you learned how to use CellRank to compute a transition matrix based on the CytoTRACE pseudotime and how it can be visualized in low dimensions. The real power of CellRank comes in when you use estimators to analyze the transition matrix directly, rather than projecting it. For the next steps, we recommend…

going through the initial & terminal states tutorial to learn how to use the transition matrix to automatically identify initial and terminal states.

taking a look at the full API to learn about parameter values you can use to adapt these computations to your data.

reading the original CytoTRACE publication and considering limitations of this appraoch [Gulati

*et al.*, 2020].

Important

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse.

## Package versions#

```
[14]:
```

```
cr.logging.print_versions()
```

```
cellrank==1.5.1+g4bab618c.d20220621 scanpy==1.7.2 anndata==0.8.0 numpy==1.21.4 numba==0.51.2 scipy==1.5.3 pandas==1.3.3 pygpcca==1.0.2 scikit-learn==0.24.0 statsmodels==0.12.1 python-igraph==0.8.3 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.3.3 seaborn==0.11.0
```