
CellRank basics

This tutorial introduces you to CellRank’s high-level API for computing initial & terminal states and fate probabilities. Once we have the fate probabilities, this tutorial shows you how to use them to plot a directed PAGA graph [Wolf et al., 2019], to compute putative lineage drivers and to visualize smooth gene expression trends. If you want more control over how initial & terminal states and fate probabilities are computed, check out CellRank’s low-level API, composed of kernels and estimators. It is no more complicated than using scikit-learn, so please do take a look at the Kernels and estimators tutorial.

In this tutorial, we will use RNA velocity and transcriptomic similarity to estimate cell-cell transition probabilities. Using kernels and estimators, you can apply CellRank even without RNA velocity information; see our CellRank beyond RNA velocity tutorial. CellRank generalizes beyond RNA velocity and is a widely applicable framework to model single-cell data based on the powerful concept of Markov chains.

The first part of this tutorial is very similar to scVelo’s tutorial on pancreatic endocrinogenesis. The data we use here comes from [Bastidas-Ponce et al., 2019]. For more info on scVelo, see the documentation or take a look at [Bergen et al., 2020].

This tutorial notebook can be downloaded using the following link.

Import packages & data

The easiest way to start is to download Miniconda3 along with the environment file found here. To create the environment, run conda env create -f environment.yml.

[1]:
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np

scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
[2]:
import warnings

warnings.simplefilter("ignore", category=UserWarning)
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=DeprecationWarning)

First, we need to get the data. The following commands will download the adata object and save it under datasets/endocrinogenesis_day15.5.h5ad. We’ll also show the fraction of spliced/unspliced reads, which we need to estimate RNA velocity.

[3]:
adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata
100%|██████████| 33.5M/33.5M [00:01<00:00, 19.8MB/s]
_images/cellrank_basics_7_1.png
[3]:
AnnData object with n_obs × n_vars = 2531 × 27998
    obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
    var: 'highly_variable_genes'
    uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'
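As a rough illustration of what the proportions plot summarizes, the sketch below computes the overall fraction of spliced and unspliced counts from two small count matrices. The matrices and the helper `count_proportions` are made up for this example; they only mirror the idea behind the 'spliced' and 'unspliced' layers listed above and are not scVelo’s implementation.

```python
def count_proportions(spliced, unspliced):
    """Return the overall fraction of spliced and unspliced counts."""
    total_spliced = sum(sum(row) for row in spliced)
    total_unspliced = sum(sum(row) for row in unspliced)
    total = total_spliced + total_unspliced
    return {
        "spliced": total_spliced / total,
        "unspliced": total_unspliced / total,
    }


# Toy data: 3 cells x 2 genes (hypothetical counts, not the pancreas data).
toy_spliced = [[8, 4], [6, 6], [10, 2]]
toy_unspliced = [[2, 1], [1, 2], [2, 1]]

proportions = count_proportions(toy_spliced, toy_unspliced)
print(proportions)
```

A dataset with very few unspliced counts would give little signal for velocity estimation, which is why it is worth checking this fraction first.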

Pre-process the data

Filter out genes which don’t have enough spliced/unspliced counts, normalize and log-transform the data, and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scvelo functions; for more information about them, see the scVelo API.

[4]:
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 22024 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
computing moments based on connectivities
    finished (0:00:00) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
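To give some intuition for the moments computed above, the following sketch smooths one gene’s counts by averaging each cell with its nearest neighbors. This is a simplified assumption: it uses an unweighted mean and a hand-written neighbor list, whereas scVelo works on the neighbor graph stored in adata. The function name `first_order_moments` and the toy data are hypothetical.

```python
def first_order_moments(values, neighbors):
    """Average each cell's value over itself and its listed neighbors."""
    moments = []
    for cell, neighbor_ids in enumerate(neighbors):
        members = [cell] + list(neighbor_ids)
        moments.append(sum(values[i] for i in members) / len(members))
    return moments


# Toy data: spliced counts of one gene in 4 cells and a hypothetical
# neighbor list (indices of each cell's nearest neighbors).
toy_counts = [0.0, 4.0, 2.0, 6.0]
toy_neighbors = [[1, 2], [0, 2], [0, 1], [1, 2]]

smoothed = first_order_moments(toy_counts, toy_neighbors)
print(smoothed)
```

Smoothing of this kind reduces the effect of dropout and sampling noise in individual cells before the splicing dynamics are fitted.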

Run scVelo

We will use the dynamical model from scVelo to estimate the velocities. Please make sure to have at least version 0.2.3 of scVelo installed to make use of parallelisation in scv.tl.recover_dynamics. On my laptop, using 8 cores, the cell below takes about 1:30 min to execute.

[5]:
scv.tl.recover_dynamics(adata, n_jobs=8)
recovering dynamics (using 2/2 cores)
WARNING: Unable to create progress bar. Consider installing `tqdm` as `pip install tqdm` and `ipywidgets` as `pip install ipywidgets`,
or disable the progress bar using `show_progress_bar=False`.
    finished (0:03:06) --> added
    'fit_pars', fitted parameters for splicing dynamics (adata.var)

Once we have the parameters, we can use these to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely one cell is to transition into another, given their velocity vectors and relative positions.

[6]:
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)
computing velocities
    finished (0:00:02) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
    finished (0:00:03) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
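The idea behind the velocity graph can be sketched in a few lines: for one cell, we compare its velocity vector with the displacement towards each neighbor and score the agreement by the cosine of the angle between them. This is a minimal sketch under simplifying assumptions (two dimensions, plain cosine similarity, made-up coordinates); the helper names are hypothetical and scVelo’s actual computation includes further steps.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def velocity_graph_row(position, velocity, neighbor_positions):
    """Score each neighbor by how well the velocity points towards it."""
    scores = []
    for neighbor in neighbor_positions:
        displacement = [n - p for n, p in zip(neighbor, position)]
        scores.append(cosine_similarity(velocity, displacement))
    return scores


# Toy example: one cell at the origin moving to the right, with three
# hypothetical neighbors ahead of it, orthogonal to it, and behind it.
cell_position = [0.0, 0.0]
cell_velocity = [1.0, 0.0]
toy_neighbor_positions = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]

scores = velocity_graph_row(cell_position, cell_velocity, toy_neighbor_positions)
print(scores)
```

Neighbors lying in the direction of the velocity vector get scores close to 1, neighbors in the opposite direction get scores close to -1.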
[7]:
scv.pl.velocity_embedding_stream(
    adata, basis="umap", legend_fontsize=12, title="", smooth=0.8, min_mass=4
)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/cellrank_basics_16_1.png

Run CellRank

CellRank offers various ways to infuse directionality into single-cell data. Here, the directional information comes from RNA velocity, and we use this information to compute initial & terminal states as well as fate probabilities for the dynamical process of pancreatic development.

Identify terminal states

Terminal states can be computed by running the following command:

[8]:
cr.tl.terminal_states(adata, cluster_key="clusters", weight_connectivities=0.2)
Computing transition matrix based on logits using `'deterministic'` mode
Estimating `softmax_scale` using `'deterministic'` mode
Setting `softmax_scale=3.7951`
    Finish (0:00:10)
Using a connectivity kernel with weight `0.2`
Computing transition matrix based on `adata.obsp['connectivities']`
    Finish (0:00:00)
Computing eigendecomposition of the transition matrix
Adding `adata.uns['eig_fwd']`
       `.eigendecomposition`
    Finish (0:00:00)
WARNING: Unable to import `petsc4py` or `slepc4py`. Using `method='brandts'`
WARNING: For `method='brandts'`, dense matrix is required. Densifying
Computing Schur decomposition
Adding `adata.uns['eig_fwd']`
       `.eigendecomposition`
       `.schur`
       `.schur_matrix`
    Finish (0:00:09)
Computing `3` macrostates
Adding `.macrostates_memberships`
       `.macrostates`
       `.schur`
       `.coarse_T`
       `.coarse_stationary_distribution`
    Finish (0:00:00)
Adding `adata.obs['terminal_states_probs']`
       `adata.obs['terminal_states']`
       `.terminal_states_probabilities`
       `.terminal_states`
    Finish
100%|██████████| 2531/2531 [00:08<00:00, 302.93cell/s]
100%|██████████| 2531/2531 [00:01<00:00, 1536.59cell/s]

The most important parameters in the above function are:

  • estimator: this determines what’s going on behind the scenes to compute the terminal states. Options are cr.tl.estimators.CFLARE (“Clustering and Filtering of Left and Right Eigenvectors”) or cr.tl.estimators.GPCCA (“Generalized Perron Cluster Cluster Analysis”, [Reuter et al., 2018] and [Reuter et al., 2019], see also our pyGPCCA implementation). The latter is the default; it computes terminal states by coarse-graining the velocity-derived Markov chain into a set of macrostates that represent the slow-time scale dynamics of the process, i.e. it finds the states that you are unlikely to leave again, once you have entered them.

  • cluster_key: takes a key from adata.obs to retrieve pre-computed cluster labels, e.g. ‘clusters’ or ‘louvain’. These labels are then mapped onto the set of terminal states, to associate a name and a color with each state.

  • n_states: number of expected terminal states. This parameter is optional - if it’s not provided, this number is estimated from the so-called ‘eigengap heuristic’ of the spectrum of the transition matrix.

  • method: This is only relevant for the estimator GPCCA. It determines the way in which we compute and sort the real Schur decomposition. The default, krylov, is an iterative procedure that works with sparse matrices, which allows the method to scale to very large cell numbers. It relies on the libraries SLEPc and PETSc, which you will have to install separately, see our installation instructions. If your dataset is small (<5k cells), and you don’t want to install these at the moment, use method='brandts' [Brandts, 2002]. The results will be the same; the difference is that brandts works with dense matrices and won’t scale to very large cell numbers.

  • weight_connectivities: weight given to cell-cell similarities to account for noise in velocity vectors.
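The eigengap heuristic mentioned for n_states can be illustrated as follows: sort the eigenvalues of the transition matrix by magnitude and place the cut at the largest drop between consecutive values. This is only a sketch of the general idea with a made-up spectrum; the function `eigengap_n_states` is hypothetical, and CellRank’s own rule may differ in details.

```python
def eigengap_n_states(eigenvalues):
    """Pick the number of states from the largest gap in the sorted spectrum."""
    # Sort eigenvalue magnitudes from largest to smallest.
    magnitudes = sorted((abs(ev) for ev in eigenvalues), reverse=True)
    # Gap between each eigenvalue and the next smaller one.
    gaps = [magnitudes[i] - magnitudes[i + 1] for i in range(len(magnitudes) - 1)]
    largest_gap_index = max(range(len(gaps)), key=gaps.__getitem__)
    # Keep all eigenvalues that sit before the largest gap.
    return largest_gap_index + 1


# Toy spectrum (made up): three eigenvalues close to 1, then a clear drop.
toy_spectrum = [1.0, 0.98, 0.95, 0.60, 0.55, 0.50]

n_states = eigengap_n_states(toy_spectrum)
print(n_states)
```

Eigenvalues close to 1 correspond to slowly decaying dynamics, so the number of eigenvalues before the gap is a natural guess for the number of long-lived states.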

When running the above command, CellRank adds a key terminal_states to adata.obs and the result can be plotted as:

[9]:
cr.pl.terminal_states(adata)
_images/cellrank_basics_23_0.png
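To see what a setting such as weight_connectivities=0.2 means, consider a single row of the transition matrix. In the sketch below, velocity-based scores are turned into probabilities with a scaled softmax, similarity-based weights are row-normalized, and the two rows are combined as a weighted sum. The scores, weights and helper names are hypothetical, and the weighted-sum form is our reading of the log above rather than a copy of CellRank’s code.

```python
import math


def softmax(scores, scale):
    """Turn scores into probabilities; a larger scale sharpens the result."""
    exps = [math.exp(scale * s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def row_normalize(row):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(row)
    return [x / total for x in row]


def mix_rows(velocity_row, connectivity_row, weight_connectivities):
    """Weighted sum of a velocity-based and a similarity-based row."""
    w = weight_connectivities
    return [(1 - w) * v + w * c for v, c in zip(velocity_row, connectivity_row)]


# Toy inputs for one cell with three neighbors (made-up numbers).
toy_cosines = [1.0, 0.0, -1.0]
toy_connectivities = [0.5, 1.0, 0.5]

velocity_row = softmax(toy_cosines, scale=3.7951)
connectivity_row = row_normalize(toy_connectivities)
combined_row = mix_rows(velocity_row, connectivity_row, weight_connectivities=0.2)
print(combined_row, sum(combined_row))
```

Because both input rows sum to one, the combined row is again a valid probability distribution. The connectivity part keeps a small amount of probability on neighbors that the velocity vector points away from, which dampens the effect of noisy velocities.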

Identify initial states

The same sort of analysis can now be repeated for the initial states, only that we use the function cr.tl.initial_states this time:

[10]:
cr.tl.initial_states(adata, cluster_key="clusters")
cr.pl.initial_states(adata, discrete=True)
Computing transition matrix based on logits using `'deterministic'` mode
Estimating `softmax_scale` using `'deterministic'` mode
Setting `softmax_scale=3.7951`
    Finish (0:00:08)
Using a connectivity kernel with weight `0.2`
Computing transition matrix based on `adata.obsp['connectivities']`
    Finish (0:00:00)
Computing eigendecomposition of the transition matrix
Adding `adata.uns['eig_bwd']`
       `.eigendecomposition`
    Finish (0:00:00)
WARNING: For 1 macrostate, stationary distribution is computed
Adding `.macrostates_memberships`
        `.macrostates`
    Finish (0:00:00)
Adding `adata.obs['initial_states_probs']`
       `adata.obs['initial_states']`
       `.terminal_states_probabilities`
       `.terminal_states`
    Finish
100%|██████████| 2531/2531 [00:05<00:00, 476.52cell/s]
100%|██████████| 2531/2531 [00:02<00:00, 886.87cell/s]
_images/cellrank_basics_26_2.png

We found one initial state, located in the Ngn3 low EP cluster.

Compute fate maps

Once we know the terminal states, we can compute the associated fate maps: for each cell, we ask how likely the cell is to develop towards each of the identified terminal states.

[11]:
cr.tl.lineages(adata)
cr.pl.lineages(adata, same_plot=False)
Computing lineage probabilities towards terminal states
Computing absorption probabilities
WARNING: Unable to import petsc4py. For installation, please refer to: https://petsc4py.readthedocs.io/en/stable/install.html.
Defaulting to `'gmres'` solver.
Adding `adata.obsm['to_terminal_states']`
       `.absorption_probabilities`
    Finish (0:00:00)
Adding lineages to `adata.obsm['to_terminal_states']`
    Finish (0:00:00)
100%|██████████| 3/3 [00:00<00:00, 35.97/s]
_images/cellrank_basics_30_2.png

We can aggregate the above into a single, global fate map where we associate each terminal state with a color and use the intensity of that color to show the fate of each individual cell:

[12]:
cr.pl.lineages(adata, same_plot=True)
_images/cellrank_basics_32_0.png

This shows that the dominant terminal state at E15.5 is Beta, consistent with known biology, see e.g. [Bastidas-Ponce et al., 2019].
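To make the notion of a fate probability concrete, here is a small sketch of absorption probabilities on a toy Markov chain. Three transient cells sit on a line between two terminal states, labelled Alpha and Beta here purely for illustration, and each cell steps left or right with probability 0.5. The function name and the chain are hypothetical, and the simple fixed-point iteration stands in for the linear solver mentioned in the log above.

```python
def absorption_probabilities(q, r, n_iter=2000):
    """Iterate A <- Q A + R, which converges to the absorption probabilities.

    q[i][j]: probability of moving from transient cell i to transient cell j.
    r[i][k]: probability of moving from transient cell i to terminal state k.
    """
    n_transient = len(q)
    n_terminal = len(r[0])
    probs = [[0.0] * n_terminal for _ in range(n_transient)]
    for _ in range(n_iter):
        probs = [
            [
                r[i][k] + sum(q[i][j] * probs[j][k] for j in range(n_transient))
                for k in range(n_terminal)
            ]
            for i in range(n_transient)
        ]
    return probs


# Toy chain: Alpha - cell0 - cell1 - cell2 - Beta, with unbiased steps.
toy_q = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
toy_r = [
    [0.5, 0.0],  # cell0 can step into Alpha
    [0.0, 0.0],  # cell1 only reaches other transient cells
    [0.0, 0.5],  # cell2 can step into Beta
]

fate_probabilities = absorption_probabilities(toy_q, toy_r)
for cell, (to_alpha, to_beta) in enumerate(fate_probabilities):
    print(f"cell{cell}: Alpha={to_alpha:.2f}, Beta={to_beta:.2f}")
```

Each row sums to one, and cells closer to a terminal state are more likely to end up there, which is the pattern the fate maps above display on the UMAP.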

Directed PAGA

We can further aggregate the individual fate maps into a cluster-level fate map using an adapted version of PAGA [Wolf et al., 2019] with directed edges. We first compute scVelo’s latent time with CellRank-identified root_key and end_key, which are the probabilities of being an initial or a terminal state, respectively.

[13]:
scv.tl.recover_latent_time(
    adata, root_key="initial_states_probs", end_key="terminal_states_probs"
)
computing latent time using initial_states_probs, terminal_states_probs as prior
    finished (0:00:00) --> added
    'latent_time', shared time (adata.obs)

Next, we can use the inferred pseudotime along with the initial and terminal states probabilities to compute the directed PAGA.

[14]:
scv.tl.paga(
    adata,
    groups="clusters",
    root_key="initial_states_probs",
    end_key="terminal_states_probs",
    use_time_prior="velocity_pseudotime",
)
running PAGA using priors: ['velocity_pseudotime', 'initial_states_probs', 'terminal_states_probs']
    finished (0:00:00) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)
    'paga/transitions_confidence', velocity transitions (adata.uns)
[15]:
cr.pl.cluster_fates(
    adata,
    mode="paga_pie",
    cluster_key="clusters",
    basis="umap",
    legend_kwargs={"loc": "top right out"},
    legend_loc="top left out",
    node_size_scale=5,
    edge_width_scale=1,
    max_edge_width=4,
    title="directed PAGA",
)
_images/cellrank_basics_39_0.png

We use pie charts to show cell fates averaged per cluster. Edges between clusters are given by transcriptomic similarity between the clusters, just as in normal PAGA.
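The per-cluster averaging behind the pie charts can be sketched as below: for every cluster, we take the mean of the fate probabilities of its cells. The cluster labels, probabilities and the helper `mean_fate_per_cluster` are made up for illustration and do not come from the pancreas data.

```python
def mean_fate_per_cluster(cluster_labels, fate_probabilities):
    """Average the fate probability vectors of all cells in each cluster."""
    sums = {}
    counts = {}
    for label, probs in zip(cluster_labels, fate_probabilities):
        if label not in sums:
            sums[label] = [0.0] * len(probs)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], probs)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }


# Toy data: four cells in two clusters; columns are (Alpha, Beta) fates.
toy_clusters = ["Ngn3 low EP", "Ngn3 low EP", "Beta", "Beta"]
toy_fates = [[0.2, 0.8], [0.4, 0.6], [0.0, 1.0], [0.1, 0.9]]

cluster_means = mean_fate_per_cluster(toy_clusters, toy_fates)
for label, means in cluster_means.items():
    print(label, [round(m, 2) for m in means])
```

Each averaged vector still sums to one, so it can be drawn directly as the slices of one pie.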

Compute lineage drivers

We can compute the driver genes for all or just a subset of lineages. We can also restrict this to some subset of clusters by specifying clusters=... (not shown below). In the resulting dataframe, we also see the p-value, the corrected p-value (q-value) and the 95% confidence interval for the correlation statistic.

[16]:
cr.tl.lineage_drivers(adata)
Computing correlations for lineages `['Epsilon' 'Alpha' 'Beta']` restricted to clusters `None` in layer `X` with `use_raw=False`
Adding `.lineage_drivers`
       `adata.var['to Epsilon corr']`
       `adata.var['to Alpha corr']`
       `adata.var['to Beta corr']`
    Finish (0:00:00)
[16]:
index | Epsilon corr | Epsilon pval | Epsilon qval | Epsilon ci low | Epsilon ci high | Alpha corr | Alpha pval | Alpha qval | Alpha ci low | Alpha ci high | Beta corr | Beta pval | Beta qval | Beta ci low | Beta ci high
Ghrl | 0.802445 | 0.000000e+00 | 0.000000e+00 | 0.788123 | 0.815898 | -0.102534 | 2.297270e-07 | 1.708007e-06 | -0.140933 | -0.063827 | -0.409512 | 4.724530e-106 | 6.749328e-104 | -0.441431 | -0.376559
Anpep | 0.456913 | 7.362822e-136 | 7.362822e-133 | 0.425527 | 0.487202 | -0.063882 | 1.298457e-03 | 4.477438e-03 | -0.102589 | -0.024982 | -0.228948 | 1.018133e-31 | 2.867979e-30 | -0.265541 | -0.191696
Gm11837 | 0.449617 | 6.363249e-131 | 4.242166e-128 | 0.417977 | 0.480167 | -0.045986 | 2.068029e-02 | 4.854527e-02 | -0.084796 | -0.007037 | -0.238269 | 2.593342e-34 | 7.979514e-33 | -0.274681 | -0.201175
Irx2 | 0.399584 | 1.900065e-100 | 9.500323e-98 | 0.366326 | 0.431823 | 0.517187 | 3.359799e-182 | 3.359799e-179 | 0.488060 | 0.545163 | -0.640866 | 0.000000e+00 | 0.000000e+00 | -0.663266 | -0.617318
Ccnd2 | 0.384302 | 3.215022e-92 | 1.286009e-89 | 0.350590 | 0.417020 | 0.152544 | 1.074186e-14 | 1.732558e-13 | 0.114262 | 0.190375 | -0.351177 | 6.078971e-76 | 4.676132e-74 | -0.384873 | -0.316546
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Dlk1 | -0.323105 | 1.065377e-63 | 1.253385e-61 | -0.357566 | -0.287766 | -0.168021 | 1.477933e-17 | 2.869772e-16 | -0.205637 | -0.129910 | 0.325835 | 7.866644e-65 | 4.495225e-63 | 0.290562 | 0.360224
Gng12 | -0.342605 | 4.652743e-72 | 7.158066e-70 | -0.376540 | -0.307751 | -0.330585 | 7.893870e-67 | 9.867338e-65 | -0.364847 | -0.295428 | 0.462703 | 7.138958e-140 | 2.379653e-137 | 0.431521 | 0.492781
Nkx6-1 | -0.349302 | 4.412839e-75 | 7.354732e-73 | -0.383050 | -0.314622 | -0.318417 | 8.753764e-62 | 8.753764e-60 | -0.352999 | -0.282965 | 0.457422 | 3.293328e-136 | 9.409509e-134 | 0.426054 | 0.487693
Nnat | -0.357367 | 7.904070e-79 | 1.580814e-76 | -0.390886 | -0.322901 | -0.324287 | 3.462256e-64 | 4.073242e-62 | -0.358716 | -0.288976 | 0.466844 | 8.508497e-143 | 3.403399e-140 | 0.435809 | 0.496770
Pdx1 | -0.370467 | 3.608847e-85 | 8.019659e-83 | -0.403603 | -0.336360 | -0.332613 | 1.076877e-67 | 1.435836e-65 | -0.366821 | -0.297507 | 0.481220 | 2.652282e-153 | 1.768188e-150 | 0.450709 | 0.510608

2000 rows × 15 columns

Afterwards, we can plot the top 5 driver genes (based on the correlation), e.g. for the Alpha lineage:

[17]:
cr.pl.lineage_drivers(adata, lineage="Alpha", n_genes=5)
_images/cellrank_basics_45_0.png
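The statistics in the driver table can be illustrated with a short sketch: a Pearson correlation between a gene’s expression and the fate probabilities towards one lineage, plus an approximate 95% confidence interval from the Fisher z-transformation. The use of the Fisher transformation is an assumption on our side, and the helper names and toy vectors are hypothetical. As a plausibility check, the sketch applies the interval formula to the Epsilon correlation of Ghrl from the table above with the 2531 cells of this dataset.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fisher_confidence_interval(corr, n_cells, z_critical=1.959964):
    """Approximate 95% confidence interval via the Fisher z-transformation."""
    z = math.atanh(corr)
    half_width = z_critical / math.sqrt(n_cells - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Toy data: expression of one gene and fate probabilities in four cells.
toy_expression = [0.0, 1.0, 2.0, 3.0]
toy_fate_probs = [0.1, 0.3, 0.5, 0.7]
toy_corr = pearson_correlation(toy_expression, toy_fate_probs)
print(f"toy correlation: {toy_corr:.3f}")

# Plausibility check with the Ghrl / Epsilon correlation from the table.
ci_low, ci_high = fisher_confidence_interval(0.802445, n_cells=2531)
print(f"Ghrl 95% CI: [{ci_low:.6f}, {ci_high:.6f}]")
```

For the Ghrl row, the interval computed this way agrees with the reported ci low and ci high columns up to rounding, which supports reading those columns as Fisher-type intervals.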

What’s next?

Congratulations! You have successfully gone through some first computations with CellRank. If you want to learn more, you can check out:

  • our low-level API, unlocking the full potential of CellRank through kernels and estimators, flexible classes that compute cell-cell transition probabilities (kernels) and aggregate these to formulate hypotheses about the underlying dynamics (estimators). See the Kernels and estimators tutorial.

  • how CellRank can be used without RNA velocity. See the CellRank beyond RNA velocity tutorial.

  • CellRank external, our interface to third-party libraries, giving you even more possibilities to model single-cell data based on Markov chains, conveniently through the CellRank interface.