CellRank Meets CytoTRACE#

Preliminaries#

In this tutorial, you will learn how to:

  • compute a developmental potential using our adapted CytoTRACE implementation [Gulati et al., 2020].

  • set up CellRank’s CytoTRACEKernel to compute a transition matrix based on the CytoTRACE score.

  • visualize the transition matrix in a low-dimensional embedding.

This tutorial notebook can be downloaded using the following link.

CytoTRACE uses the number of genes expressed per cell to estimate a developmental potential.

CytoTRACE estimates differentiation potential: Following the original CytoTRACE suggestion, we use the number of genes expressed per cell as a raw signal of differentiation status and apply several post-processing steps to increase robustness. The basic assumption behind this method is that naive cells on average express more genes compared to mature cells because they regulate their chromatin less tightly; we’ve observed this to work well for many early developmental datasets.#

Note

A bottleneck in the original implementation was scalabilty of the post-processing steps [Gulati et al., 2020]; we therefore adapted the original CytoTRACE implementation to be much faster and memory efficient. We confirmed across several examples that our reimplementation yields biologically equivalent results (CITE). To get from the CytoTRACE pseudotime to a directed transition matrix, we use the general mechanism described in the pseudotime tutorial.

Note

If you want to run this on your own data, you will need:

  • a scRNA-seq dataset which satisfies the CytoTRACE assumptions, i.e., naive cell states should on avarage express more genes. This is often the case for early develpmental stages.

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

Import packages & data#

import sys

if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank
import cellrank as cr
import scanpy as sc
import scvelo as scv

cr.settings.verbosity = 2
sc.settings.set_figure_params(frameon=False, dpi=100)
Global seed set to 0
import warnings

warnings.simplefilter("ignore", category=UserWarning)

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of zebrafish embryogenesis assayed using drop-seq, restricted to the axial mesoderm lineage [Farrell et al., 2018]. This dataset can be acessed conveniently using zebrafish.

adata = cr.datasets.zebrafish()
adata
AnnData object with n_obs × n_vars = 2434 × 23974
    obs: 'Stage', 'gt_terminal_states', 'lineages'
    uns: 'Stage_colors', 'gt_terminal_states_colors', 'lineages_colors'
    obsm: 'X_force_directed'

Explore the data#

This zebrafish dataset contains 12 time-points spanning 3.3 - 12 hours past fertilization. We can use the original force-directed layout to plot cells, colored by developmental stage:

sc.pl.embedding(adata, basis="force_directed", color="Stage")
../../../_images/26ae19c5ccd24b2f2284ceb7f932cc0f2c63251bdcd6a34f9cbdf8ef9394c194.png

We can further look into the lineage assignments, computed in the original study using URD [Farrell et al., 2018]:

sc.pl.embedding(adata, basis="force_directed", color="lineages")
../../../_images/8e576a3caf94c4b90e3a07d0fda17e8aeae57ec61b366acf2d1419af0956d379.png

Preprocess#

Before we can start applying the CytoTRACE kernel, we’ll have to do some basic preprocessing of the data.

# filter, normalize total counts and log-transform
sc.pp.filter_genes(adata, min_cells=10)
scv.pp.normalize_per_cell(adata)
sc.pp.log1p(adata)

# hvg annotation
sc.pp.highly_variable_genes(adata)
print(f"This detected {adata.var['highly_variable'].sum()} highly variable genes. ")
Normalized count data: X.
This detected 2392 highly variable genes. 

CytoTRACE uses imputed gene expression to calculate the score of differentiation potential. Here, we use use scvelo.pp.moments() for imputation. We’re copying our X matrix into the layers because that’s where moments() expects to find counts for imputation.

adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
computing neighbors
    finished (0:00:02) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:01) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

Use developmental potential to direct differentiation#

Compute the CytoTRACE score#

Import the CytoTRACEKernel, initialize it using the preprocessed AnnData object and compute the CytoTRACE score:

from cellrank.kernels import CytoTRACEKernel

ctk = CytoTRACEKernel(adata).compute_cytotrace()
Computing CytoTRACE score with `13690` genes
Adding `adata.obs['ct_score']`
       `adata.obs['ct_pseudotime']`
       `adata.obs['ct_num_exp_genes']`
       `adata.var['ct_gene_corr']`
       `adata.var['ct_correlates']`
       `adata.uns['ct_params']`
    Finish (0:00:00)

We can visually check that our computed CytoTRACE pseudotime correlates well with ground-truth real-time annotations:

sc.pl.embedding(
    adata,
    color=["ct_pseudotime", "Stage"],
    basis="force_directed",
    color_map="gnuplot2",
)
/Users/marius/miniforge3/envs/py39_arm_cr/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
../../../_images/78c8e89d2c80ee622d43815ace8240b632ea7968253747251ddbb20cb322b61b.png

To make this a bit more quantitative, we can look at the distribution of our CytoTRACE pseudotime over real time points and show this via violin plots:

sc.pl.violin(adata, keys=["ct_pseudotime"], groupby="Stage", rotation=90)
../../../_images/8a3b1f9196fa35226655340d0c8379caa37f43bc1f752a68c8a98a79ad384188.png

As expected, the CytoTRACE pseudotime increases for later time points. Thus, we can use this score to bias graph edges into the direction of increasing differentiation status, similar to how we do this for other pseudotimes with the PseudotimeKernel.

Compute & visualize a transition matrix#

Compute the transition matrix using compute_transition_matrix().

ctk.compute_transition_matrix(threshold_scheme="soft", nu=0.5)
Computing transition matrix based on pseudotime`
    Finish (0:00:00)
CytoTRACEKernel[n=2434, dnorm=False, scheme='soft', b=10.0, nu=0.5]

To get some intuition for this transition matrix, we can project it into an embedding and draw streamlines - the following plot will look like the plots you’re used to from scVelo and velocyto for visualizing RNA velocity [Bergen et al., 2020, La Manno et al., 2018].

Note

There’s no RNA velocity here, we’re visualizing the directed transition matrix we computed using the kNN graph as well as the CytoTRACE pseudotime.

ctk.plot_projection(basis="force_directed", color="Stage", legend_loc="right")
Projecting transition matrix onto `force_directed`
Adding `adata.obsm['T_fwd_force_directed']`
    Finish (0:00:00)
../../../_images/c4ead582eb274d8b8885ef3ca5cea3799986934ccc79ee396aa286e232e5f82a.png

Visually, in the low-dimensional space, the process of differentiation seems to be captured well by this transition matrix. We can visually “confirm” this by looking at terminal-state annotations from the original study:

sc.pl.embedding(adata, basis="force_directed", color="gt_terminal_states")
/Users/marius/miniforge3/envs/py39_arm_cr/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
../../../_images/e9879fea70da809dd575a2c82c4d150447c1dfc0379c2ccf8ea363727fefbf8d.png

Another way to gain intuition for the transition matrix is by drawing some cells from the early stage and to use these as starting cells to simulate random walks on the transition matrix. This visualization technique is less biased by the low-dimensional embedding as the random walks are computed directly based on the transition matrix.

ctk.plot_random_walks(
    n_sims=100,
    start_ixs={"Stage": "03.3-HIGH"},
    basis="force_directed",
    color="Stage",
    legend_loc="right",
    seed=1,
)
Simulating `100` random walks of maximum length `609`
    Finish (0:00:03)
Plotting random walks
../../../_images/c2c1d358815285d46620f2e7dcb614bf325695ef17dc2a3600fae1a54247c5ab.png

Black dots denote sampled starting cells for random walks, yellow dots denote end cells. We terminated each random walk after a predefined number of steps. Random walks are colored according to how long they’ve been running for - i.e. the initial segments are more black/blue whereas the late segments are more orange/yellow. We can see that many random walks terminate in one of the two terminal states, as expected. Quite a few of them seem to get stuck early on, this could be further investigated.

Note

The visualization techniques demonstrated here work for every kernel, no matter whether it’s a VelocityKernel, PseudotimeKernel, CytoTRACEKernel or some other CellRank kernel.

Closing matters#

What’s next?#

In this tutorial, you learned how to use CellRank to compute a transition matrix based on the CytoTRACE pseudotime and how it can be visualized in low dimensions. The real power of CellRank comes in when you use estimators to analyze the transition matrix directly, rather than projecting it. For the next steps, we recommend to:

  • go through the initial and terminal states tutorial to learn how to use the transition matrix to automatically identify initial and terminal states.

  • take a look at the API to learn about parameter values you can use to adapt these computations to your data.

  • read the original CytoTRACE publication and considering limitations of this appraoch [Gulati et al., 2020].

Package versions#

cr.logging.print_versions()
cellrank==1.5.1+gedbc651e scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2