Computing Initial and Terminal States¶

Preliminaries¶

In this tutorial, you will learn how to use the GPCCA to do the following:

compute macrostates of cellular dynamics.
visualize a coarse-grained transition matrix among macrostates.
classify macrostates as initial, terminal or intermediate.

The GPCCA may be combined with any CellRank kernel used to compute a cell-cell transition matrix, including the VelocityKernel, the PseudotimeKernel and the CytoTRACEKernel. We explain the kernel/estimator paradigm in more detail in our Getting Started with CellRank tutorial.

This tutorial notebook can be downloaded using the following link.

The GPCCA estimator computes initial and terminal states based on Generalized Perron Cluster Cluster Analysis. — **CellRank decomposes cellular dynamics into macrostates:** Based on Generalized Perron Cluster Cluster analysis (GPCCA), CellRank identifies `macrostates` of cellular dynamics, which include initial, intermediate, and terminal states [Lange *et al.*, 2022, Reuter *et al.*, 2019, Reuter *et al.*, 2018].¶

Note

If you want to run this on your own data, you will need:

a scRNA-seq dataset for which you computed a cell-cell transition matrix using any CellRank kernel. For each kernel, there exist individual tutorials that teach you how to use them. For example, we have tutorials about using RNA velocity, experimental time points, and pseudotime, among others, to derive cell-cell transition probabilities.

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

Import packages & data¶

import sys

if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank

import numpy as np
import scipy.stats as st

import cellrank as cr
import scanpy as sc

cr.settings.verbosity = 2
sc.settings.set_figure_params(
    frameon=False,
    dpi=100,
)

Global seed set to 0

import warnings

warnings.simplefilter("ignore", category=UserWarning)

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of pancreas development at embyonic day E15.5 [Bastidas-Ponce et al., 2019]. We used this dataset in our CellRank meets RNA velocity tutorial to compute a cell-cell transition matrix using the VelocityKernel; this tutorial builds on the transition matrix computed there. The dataset including the transition matrix can be conveniently acessed through cellrank.datasets.pancreas [Bastidas-Ponce et al., 2019, Bergen et al., 2020].

adata = cr.datasets.pancreas(kind="preprocessed-kernel")

vk = cr.kernels.VelocityKernel.from_adata(adata, key="T_fwd")
vk

VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]

Note

The method from_adata() allows you to reconstruct a CellRank kernel from an AnnData object. When you use a kernel to compute a transition matrix, this matrix, as well as a few informations about the computation, are written to the underlying anndata.AnnData object upon calling write_to_adata(). This makes it easy to write and read kernels from disk via the associated AnnData object.

As a reminder, below is a UMAP with cluster labels of this dataset [Bastidas-Ponce et al., 2019, Becht et al., 2019, McInnes et al., 2018]. In this tutorial, we will automatically identify the initial and terminal states.

sc.pl.embedding(adata, basis="umap", color="clusters")

../../../_images/c178de4442769b9c68c2ebfee7a4ac9b647fabe55e1d5208652501a3049d1c24.png

Identify initial and terminal states¶

Initialize an estimator¶

The estimators allow you to analyze cell-state dynamics in CellRank. We initialize the GPCCA estimator here by passing the VelocityKernel [Reuter et al., 2019, Reuter et al., 2018]. However, the GPCCA estimator works with every CellRank kernel.

g = cr.estimators.GPCCA(vk)
print(g)

GPCCA[kernel=VelocityKernel[n=2531], initial_states=None, terminal_states=None]

We use the GPCCA estimator below to compute initial & terminal states; it may further be used to compute fate probabilities as we explain in the fate probabilities & driver genes tutorial.

Note

You can interact with the GPCCA estimator in two ways:

basic usage: via the fit() and predict() methods; this is very convenient but offers less control.
advanced usage: by computing the Schur decomposition, the macrostates, etc. While there are a few more method calls involved here, it offers you more control.

We demonstrate both modes in this tutorial.

Basic usage¶

We fit the estimator; this computes a Schur decompsition and uses the GPCCA algorithm as implemented in pyGPCCA to compute macrostates by maximizing for metastability [Reuter et al., 2019, Reuter et al., 2022, Reuter et al., 2018]. macrostates may contain initial, terminal and intermediate states of cellular dynamics.

g.fit(cluster_key="clusters", n_states=[4, 12])
g.plot_macrostates(which="all", discrete=True, legend_loc="right", s=100)

Computing Schur decomposition
When computing macrostates, choose a number of states NOT in `[5, 10]`
Adding `adata.uns['eigendecomposition_fwd']`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:00)
Calculating minChi criterion in interval `[4, 12]`
Computing `4` macrostates
Adding `.macrostates`
       `.macrostates_memberships`
       `.coarse_T`
       `.coarse_initial_distribution
       `.coarse_stationary_distribution`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:00)

../../../_images/670d4b6e25daecca5baaefe64418793e64d21974422024dda9242471ef31cc26.png

Using n_states=[a, b], we told the algorithm to scan this interval for the optimal number of macrostates. It identified 4 macrostates which we show above in the UMAP. For each macrostate, we display the n_cells cells most strongly associated with it. Each macrostate was automatically assigned a label by associating it with the cluster labels stored in adata.obs['clusters']; this is controled using cluster_key.

You can check out the pyGPCCA documentation for more information on the macrostate computation.

Note

Make sure to have the scientific computing libraries PETSc and SLEPc installed for maximum performance, using e.g. conda install -c conda-forge petsc4py slepc4py. On most machines, this is strongly recommended if your datasets consists of over 10k cells. See our installation instructions.

In the next step, we use the predict_terminal_states() to identify terminal macrostates.

g.predict_terminal_states()
g.plot_macrostates(which="terminal", legend_loc="right", s=100)

Adding `adata.obs['term_states_fwd']`
       `adata.obs['term_states_fwd_probs']`
       `.terminal_states`
       `.terminal_states_probabilities`
       `.terminal_states_memberships
    Finish`

../../../_images/56e89ad3e60e4a2d8fe174e404edd30a7ff351489f24732691c841d4a31b19a9.png

We automatically identified the Alpha, Beta and Epsilon states as terminal. While the plot above just shows the cells most confidently assigned to each terminal state, we can also plot the underlying continuous distribution for each macrostate.

g.plot_macrostates(which="terminal", discrete=False)

../../../_images/bcdc5f1dae65e4987db1fbd2d9a4f3aa7b29e0c160a7bec56783ea99d55b63ce.png

Each cell is colored according to the terminal state it most likely belongs to; higher color intensity reflects greater confidence in the assignment.

Note

These values should not be confused with fate probabilities, they reflect our confidence in assigning a cell to a terminal state, but they don’t tell us how likely a cell is to reach this terminal state. We demonstrate the computation of fate probabilities in the fate probabilities & driver genes tutorial.

Analogously, we identify initial states using the predict_initial_states() method.

g.predict_initial_states()
g.plot_macrostates(which="initial", legend_loc="right", s=100)

Adding `adata.obs['init_states_fwd']`
       `adata.obs['init_states_fwd_probs']`
       `.initial_states`
       `.initial_states_probabilities`
       `.initial_states_memberships
    Finish`

../../../_images/06e696fda1a3d4db36d4090bce8da04e95a2c05ef0c562d3b15038b60b96ceae.png

This automatically identified the Ngn3 low EP state as initial, which is correct. Macrostates, and their classification as initial or terminal, are written to the GPCCA estimator and can be inspected conveniently via initial_states and terminal_states. To get an overview, we can print the estimator.

GPCCA[kernel=VelocityKernel[n=2531], initial_states=['Ngn3 low EP'], terminal_states=['Alpha', 'Beta', 'Epsilon']]

We can visually confirm our terminal state identification by comparing with well-known mature cell-type markers Ins1 and Ins2 for Beta cells, Gcg for Alpha cells and Ghrl for epsilon cells. To confirm our initial state identification, we visualize Ductal cell markers Sox9, Anxa2 and Bicc1.

sc.pl.embedding(
    adata,
    basis="umap",
    color=["Ins1", "Ins2", "Gcg", "Ghrl", "Sox9", "Anxa2", "Bicc1"],
    size=50,
)

../../../_images/f4bb65fe95f39e46bc13201a94cbcc150340fa5106966482878e193744a25b14.png

Visually, these marker genes agree well with automatically identified initial and terminal states. To make this a bit more quantitative, let’s check whether our terminal Beta cells express Ins1 at higher levels compared to other (non-terminal) Beta cells.

# subset to just Beta cells
bdata = adata[adata.obs["clusters"] == "Beta"].copy()

# create an annotation for terminal vs. not-terminal
bdata.obs["maturation_state"] = np.where(
    bdata.obs["term_states_fwd"] == "Beta", "terminal", "not terminal"
)

# show distribution in violin plot
sc.pl.violin(bdata, keys=["Ins1"], groupby="maturation_state")

# use a simple t-test to quantify how different the two distributions are
a = bdata[bdata.obs["maturation_state"] == "terminal", "Ins1"].X.data
b = bdata[bdata.obs["maturation_state"] == "not terminal", "Ins1"].X.data
st.ttest_ind(a, b, equal_var=False)

../../../_images/8832da36a7d1a31d34777be39b93a555cc7f113d4136edd5bc63e948ad7ff8f6.png

Ttest_indResult(statistic=12.758279519129177, pvalue=1.7212735169400435e-19)

This looks promising, the cells we automatically identified as terminal Beta cells express the maturation marker Ins1 at significantly higher levels compared to other Beta cells. This can be a helpful technique to confirm automatic initial and terminal state identification in settings where prior knowledge is available.

The classification of macrostates as initial or terminal is based on the coarse grained transition matrix, as shown below.

g.plot_coarse_T()

../../../_images/432a81e699aae61929d783fdfa862951a934e3629cf09e25881fe7fee84c03d3.png

This transition matrix aggregates the single-cell Markov chain to a macrostate-level Markov chain. Entries in this matrix tell us how likely macrostates are to transition into one-another. We identify initial and terminal states based on the following criteria:

terminal_states are very stable (large value on the diagonal). They can have incoming probability mass, but almost no outgoing probability mass.
initial_states are less stable (smaller values on the diagonal and in the coarse-grained stationary distribution). They can have outgoing, but almost no incoming probability mass.
intermediate states are just all remaining macrostates which we classified neither as terminal nor as initial.

Note that the automatic identification does not always work perfectly. For example, we did not automatically identify the terminal Delta cell population in our analysis above. For this reason, we’ll next explore the low-level mode to interact with the GPCCA estimator which offers a bit more control and supervision to bring in any prior knowledge that you might have about the biological system.

Advanced usage¶

Let’s start with a new estimator object.

g2 = cr.estimators.GPCCA(vk)
print(g2)

GPCCA[kernel=VelocityKernel[n=2531], initial_states=None, terminal_states=None]

The computation of macrostates is based on the real Schur decomposition - essentially a generalization of the eigendecomposition. For non-symmetric transition matrices, as in our case, the eigendecomposition in general yields complex eigenvalues and eigenvectors, which are difficult to interpret. Thus, we revert to the real Schur decomposition [Reuter et al., 2019, Reuter et al., 2018].

Below, we also plot the real part of the top eigenvalues, sorted by their real part. Blue and orange denote real and complex eigenvalues, respectively. For real matrices, complex eigenvalues always appear in pairs of complex conjugates.

g2.compute_schur()
g2.plot_spectrum(real_only=True)

Computing Schur decomposition
When computing macrostates, choose a number of states NOT in `[5, 10, 15, 18]`
Adding `adata.uns['eigendecomposition_fwd']`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:00)

../../../_images/8e94b8ac35279fd340eeb47f5ea98f66d9de3d970af4c9aa6b578a1b97c3eaa8.png

Using the real Schur decomposition, we compute macrostates. In the plot above, the estimator automatically suggested a number of states to use (dashed vertical line); however, we will ignore that and compute a few more states to have a chance at identifying the Delta cell population.

g2.compute_macrostates(n_states=11, cluster_key="clusters")
g2.plot_macrostates(which="all", legend_loc="right", s=100)

Computing `11` macrostates
Adding `.macrostates`
       `.macrostates_memberships`
       `.coarse_T`
       `.coarse_initial_distribution
       `.coarse_stationary_distribution`
       `.schur_vectors`
       `.schur_matrix`
       `.eigendecomposition`
    Finish (0:00:39)

../../../_images/a8eb3860ccabf999fe36b58b0652cff25189cf5226c7527fc1a9569326a9bce0.png

We now have macrostates for Alpha, Beta, Epsilon and Delta populations, besides a few progenitor populations. We assign one label per macrostate based on the underlying 'clusters' annotation. However, that does not imply that all cells within one macrostate are from the same underlying cluster as we use majority voting. We can visualize the relationship between clusters and macrostates. We show below the distribution over cluster membership for each of the cells most confidently assigned to a certain macrostate.

g2.plot_macrostate_composition(key="clusters", figsize=(7, 4))

../../../_images/ccbd1d416b502bb4ca41d28a9ebb68fe877c8f24670bf33839851314a6ce58a8.png

With some exceptions, most macrostates recruit their top-cells from a single underlying cluster. This plotting function works for any observation-level covariate and can be useful when we have experimental time labels as we expect more mature states to come from later time points. See the RealTimeKernel and the corresponding tutorial to learn how CellRank makes use of experimental time points.

To get an idea of how these macrostates are related, we plot the coarse-grained transition matrix.

g2.plot_coarse_T(annotate=False)

../../../_images/05ec6470a0c72c896f596d9763dd0366b678dd92949712a34efa9b9c5b7bd206.png

By default, macrostates are ordered according to their diagonal value, increasing from left to right. The diagonal value is a proxy for a states’ metastability, i.e. the probability of leaving the state once entered. While the Epsilon, Alpha and Beta states have high diagonal values, Delta cells have a low value because they are predicted to transition into Beta cells (check the corresponding matrix element).

Let’s try automatic terminal state identification.

g2.predict_terminal_states()
g2.plot_macrostates(which="terminal", legend_loc="right", s=100)

Adding `adata.obs['term_states_fwd']`
       `adata.obs['term_states_fwd_probs']`
       `.terminal_states`
       `.terminal_states_probabilities`
       `.terminal_states_memberships
    Finish`

../../../_images/048b24f686ae0be6c1716820e2251259104b354edc04b0b6f8294aa871700df0.png

Besides many more macrostates being present now, this consistently identified the same three terminal states as before. However, we still did not catch the Delta cells! If we want to compute fate probabilities towards them later on, we need to annotate them as a terminal state. Luckily, this can be done semi-automatically as well:

g2.set_terminal_states(states=["Alpha", "Beta", "Epsilon", "Delta"])
g2.plot_macrostates(which="terminal", legend_loc="right", s=100)

Adding `adata.obs['term_states_fwd']`
       `adata.obs['term_states_fwd_probs']`
       `.terminal_states`
       `.terminal_states_probabilities`
       `.terminal_states_memberships
    Finish`

../../../_images/04cd50389ba65b3049ddcf3e6c6c96198a2349584e78fb4a10088321f585cc0e.png

We call this semi-automatic terminal state identification as we manually pass the macrostates we would like to select, however, the actual cells belonging to each macrostate are assigned automatically. For initial states, everything works exactly the same, you can identify them fully automatically, or semi-automatically. Let’s compute the initial states fully automatically here:

g2.predict_initial_states()
g2.plot_macrostates(which="initial", s=100)

Adding `adata.obs['init_states_fwd']`
       `adata.obs['init_states_fwd_probs']`
       `.initial_states`
       `.initial_states_probabilities`
       `.initial_states_memberships
    Finish`

../../../_images/ffdd1e14ccffed2404222271b3b0220f90a71138359949982e8130ec6f4f2fb9.png

How can we check whether this is really the correct initial state? In this case, we have prior knowledge that we can use: we know that the initial state should be high for a number of Ductal cell markers. So let’s use these markers to compute a score that we can visualize across macrostates to confirm that we have the correct one.

# compute a score in scanpy by aggregating across a few ductal markers
sc.tl.score_genes(
    adata, gene_list=["Bicc1", "Sox9", "Anxa2"], score_name="initial_score"
)

# write macrostates to AnnData
adata.obs["macrostates"] = g2.macrostates
adata.uns["macrostates_colors"] = g2.macrostates_memberships.colors

# visualize via heatmaps
sc.pl.violin(adata, keys="initial_score", groupby="macrostates", rotation=90)

../../../_images/fcec101d2b45b9cc6f1385d03a05e8d4c648a9fa4eda54f8c0cbc0f5d964d3ac.png

In fact, the 'Ngn3 low EP_1' state scores highest, confirming the identification of this state as as initial state. The same strategy can be used to confirm terminal state identification, or to guide semi-automatic identification in the first place.

Closing matters¶

What’s next?¶

In this tutorial, you learned how to use CellRank’s GPCCA estimator to compute initial and terminal states of cellular dynamics. For the next steps, we recommend to:

go through the fate probabilities & driver genes tutorial to learn how to quantify cellular fate commitment.
take a look at the API to learn about parameter values you can use to adapt these computations to your data.
read the original GPCCA publications to get familiar with the maths and ideas behind these computations [Reuter et al., 2019, Reuter et al., 2018].

Package versions¶

cr.logging.print_versions()

cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2