Computing Initial and Terminal States¶
Preliminaries¶
In this tutorial, you will learn how to use the GPCCA
to do the following:
compute macrostates of cellular dynamics.
visualize a coarse-grained transition matrix among macrostates.
classify macrostates as initial, terminal or intermediate.
The GPCCA
may be combined with any CellRank kernel
used to compute a cell-cell transition matrix, including the VelocityKernel
, the PseudotimeKernel
and the CytoTRACEKernel
. We explain the kernel/estimator paradigm in more detail in our Getting Started with CellRank tutorial.
This tutorial notebook can be downloaded using the following link.
Note
If you want to run this on your own data, you will need:
a scRNA-seq dataset for which you computed a cell-cell transition matrix using any CellRank
kernel
. For each kernel, there exist individual tutorials that teach you how to use them. For example, we have tutorials about using RNA velocity, experimental time points, and pseudotime, among others, to derive cell-cell transition probabilities.
Note
If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.
Import packages & data¶
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank
import numpy as np
import scipy.stats as st
import cellrank as cr
import scanpy as sc
cr.settings.verbosity = 2
sc.settings.set_figure_params(
frameon=False,
dpi=100,
)
Global seed set to 0
import warnings
warnings.simplefilter("ignore", category=UserWarning)
To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of pancreas development at embyonic day E15.5 [Bastidas-Ponce et al., 2019]. We used this dataset in our CellRank meets RNA velocity tutorial to compute a cell-cell transition matrix using the VelocityKernel
; this tutorial builds on the transition matrix computed there. The dataset including the transition matrix can be conveniently acessed through cellrank.datasets.pancreas
[Bastidas-Ponce et al., 2019, Bergen et al., 2020].
adata = cr.datasets.pancreas(kind="preprocessed-kernel")
vk = cr.kernels.VelocityKernel.from_adata(adata, key="T_fwd")
vk
VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]
Note
The method from_adata()
allows you to reconstruct a CellRank kernel from an AnnData
object. When you use a kernel to compute a transition matrix, this matrix, as well as a few informations about the computation, are written to the underlying anndata.AnnData
object upon calling write_to_adata()
. This makes it easy to write and read kernels from disk via the associated AnnData
object.
As a reminder, below is a UMAP with cluster labels of this dataset [Bastidas-Ponce et al., 2019, Becht et al., 2019, McInnes et al., 2018]. In this tutorial, we will automatically identify the initial and terminal states.
Identify initial and terminal states¶
Initialize an estimator¶
The estimators
allow you to analyze cell-state dynamics in CellRank. We initialize the GPCCA estimator
here by passing the VelocityKernel
[Reuter et al., 2019, Reuter et al., 2018]. However, the GPCCA estimator
works with every CellRank kernel
.
g = cr.estimators.GPCCA(vk)
print(g)
GPCCA[kernel=VelocityKernel[n=2531], initial_states=None, terminal_states=None]
We use the GPCCA estimator
below to compute initial & terminal states; it may further be used to compute fate probabilities as we explain in the fate probabilities & driver genes tutorial.
Note
You can interact with the GPCCA estimator
in two ways:
basic usage: via the
fit()
andpredict()
methods; this is very convenient but offers less control.advanced usage: by computing the
Schur decomposition
, themacrostates
, etc. While there are a few more method calls involved here, it offers you more control.
We demonstrate both modes in this tutorial.
Basic usage¶
We fit the estimator; this computes a Schur decompsition and uses the GPCCA algorithm as implemented in pyGPCCA to compute macrostates
by maximizing for metastability [Reuter et al., 2019, Reuter et al., 2022, Reuter et al., 2018]. macrostates
may contain initial
, terminal
and intermediate states of cellular dynamics.
g.fit(cluster_key="clusters", n_states=[4, 12])
g.plot_macrostates(which="all", discrete=True, legend_loc="right", s=100)
Computing Schur decomposition
When computing macrostates, choose a number of states NOT in `[5, 10]`
Adding `adata.uns['eigendecomposition_fwd']`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:00)
Calculating minChi criterion in interval `[4, 12]`
Computing `4` macrostates
Adding `.macrostates`
`.macrostates_memberships`
`.coarse_T`
`.coarse_initial_distribution
`.coarse_stationary_distribution`
`.schur_vectors`
`.schur_matrix`
`.eigendecomposition`
Finish (0:00:00)
Using n_states=[a, b]
, we told the algorithm to scan this interval for the optimal number of macrostates. It identified 4 macrostates which we show above in the UMAP. For each macrostate, we display the n_cells
cells most strongly associated with it. Each macrostate was automatically assigned a label by associating it with the cluster labels stored in adata.obs['clusters']
; this is controled using cluster_key
.
You can check out the pyGPCCA documentation for more information on the macrostate computation.
Note
Make sure to have the scientific computing libraries PETSc and SLEPc installed for maximum performance, using e.g. conda install -c conda-forge petsc4py slepc4py
. On most machines, this is strongly recommended if your datasets consists of over 10k cells. See our installation instructions.
In the next step, we use the predict_terminal_states()
to identify terminal macrostates.
g.predict_terminal_states()
g.plot_macrostates(which="terminal", legend_loc="right", s=100)
We automatically identified the Alpha, Beta and Epsilon states as terminal. While the plot above just shows the cells most confidently assigned to each terminal state, we can also plot the underlying continuous distribution for each macrostate.
Each cell is colored according to the terminal state it most likely belongs to; higher color intensity reflects greater confidence in the assignment.
Note
These values should not be confused with fate probabilities
, they reflect our confidence in assigning a cell to a terminal state, but they don’t tell us how likely a cell is to reach this terminal state. We demonstrate the computation of fate probabilities in the fate probabilities & driver genes tutorial.
Analogously, we identify initial states using the predict_initial_states()
method.
g.predict_initial_states()
g.plot_macrostates(which="initial", legend_loc="right", s=100)
This automatically identified the Ngn3 low EP state as initial, which is correct. Macrostates, and their classification as initial or terminal, are written to the GPCCA estimator
and can be inspected conveniently via initial_states
and terminal_states
. To get an overview, we can print the estimator.
g
GPCCA[kernel=VelocityKernel[n=2531], initial_states=['Ngn3 low EP'], terminal_states=['Alpha', 'Beta', 'Epsilon']]
We can visually confirm our terminal state identification by comparing with well-known mature cell-type markers Ins1 and Ins2 for Beta cells, Gcg for Alpha cells and Ghrl for epsilon cells. To confirm our initial state identification, we visualize Ductal cell markers Sox9, Anxa2 and Bicc1.
sc.pl.embedding(
adata,
basis="umap",
color=["Ins1", "Ins2", "Gcg", "Ghrl", "Sox9", "Anxa2", "Bicc1"],
size=50,
)
Visually, these marker genes agree well with automatically identified initial and terminal states. To make this a bit more quantitative, let’s check whether our terminal Beta cells express Ins1 at higher levels compared to other (non-terminal) Beta cells.
# subset to just Beta cells
bdata = adata[adata.obs["clusters"] == "Beta"].copy()
# create an annotation for terminal vs. not-terminal
bdata.obs["maturation_state"] = np.where(
bdata.obs["term_states_fwd"] == "Beta", "terminal", "not terminal"
)
# show distribution in violin plot
sc.pl.violin(bdata, keys=["Ins1"], groupby="maturation_state")
# use a simple t-test to quantify how different the two distributions are
a = bdata[bdata.obs["maturation_state"] == "terminal", "Ins1"].X.data
b = bdata[bdata.obs["maturation_state"] == "not terminal", "Ins1"].X.data
st.ttest_ind(a, b, equal_var=False)
This looks promising, the cells we automatically identified as terminal Beta cells express the maturation marker Ins1 at significantly higher levels compared to other Beta cells. This can be a helpful technique to confirm automatic initial and terminal state identification in settings where prior knowledge is available.
The classification of macrostates
as initial or terminal is based on the coarse grained transition matrix
, as shown below.
This transition matrix aggregates the single-cell Markov chain to a macrostate-level Markov chain. Entries in this matrix tell us how likely macrostates are to transition into one-another. We identify initial and terminal states based on the following criteria:
terminal_states
are very stable (large value on the diagonal). They can have incoming probability mass, but almost no outgoing probability mass.initial_states
are less stable (smaller values on the diagonal and in the coarse-grained stationary distribution). They can have outgoing, but almost no incoming probability mass.intermediate states are just all remaining
macrostates
which we classified neither as terminal nor as initial.
Note that the automatic identification does not always work perfectly. For example, we did not automatically identify the terminal Delta cell population in our analysis above. For this reason, we’ll next explore the low-level mode to interact with the GPCCA estimator
which offers a bit more control and supervision to bring in any prior knowledge that you might have about the biological system.
Advanced usage¶
Let’s start with a new estimator object
.
g2 = cr.estimators.GPCCA(vk)
print(g2)
GPCCA[kernel=VelocityKernel[n=2531], initial_states=None, terminal_states=None]
The computation of macrostates is based on the real Schur decomposition - essentially a generalization of the eigendecomposition. For non-symmetric transition matrices, as in our case, the eigendecomposition in general yields complex eigenvalues and eigenvectors, which are difficult to interpret. Thus, we revert to the real Schur decomposition [Reuter et al., 2019, Reuter et al., 2018].
Below, we also plot the real part of the top eigenvalues, sorted by their real part. Blue and orange denote real and complex eigenvalues, respectively. For real matrices, complex eigenvalues always appear in pairs of complex conjugates.
g2.compute_schur()
g2.plot_spectrum(real_only=True)
Using the real Schur decomposition
, we compute macrostates. In the plot above, the estimator automatically suggested a number of states to use (dashed vertical line); however, we will ignore that and compute a few more states to have a chance at identifying the Delta cell population.
g2.compute_macrostates(n_states=11, cluster_key="clusters")
g2.plot_macrostates(which="all", legend_loc="right", s=100)
We now have macrostates for Alpha, Beta, Epsilon and Delta populations, besides a few progenitor populations. We assign one label per macrostate based on the underlying 'clusters'
annotation. However, that does not imply that all cells within one macrostate are from the same underlying cluster as we use majority voting. We can visualize the relationship between clusters and macrostates. We show below the distribution over cluster membership for each of the cells most confidently assigned to a certain macrostate.
With some exceptions, most macrostates recruit their top-cells from a single underlying cluster. This plotting function works for any observation-level covariate and can be useful when we have experimental time labels as we expect more mature states to come from later time points. See the RealTimeKernel
and the corresponding tutorial to learn how CellRank makes use of experimental time points.
To get an idea of how these macrostates are related, we plot the coarse-grained transition matrix
.
By default, macrostates are ordered according to their diagonal value, increasing from left to right. The diagonal value is a proxy for a states’ metastability, i.e. the probability of leaving the state once entered. While the Epsilon, Alpha and Beta states have high diagonal values, Delta cells have a low value because they are predicted to transition into Beta cells (check the corresponding matrix element).
Let’s try automatic terminal state identification.
g2.predict_terminal_states()
g2.plot_macrostates(which="terminal", legend_loc="right", s=100)
Besides many more macrostates being present now, this consistently identified the same three terminal states as before. However, we still did not catch the Delta cells! If we want to compute fate probabilities
towards them later on, we need to annotate them as a terminal state. Luckily, this can be done semi-automatically as well:
g2.set_terminal_states(states=["Alpha", "Beta", "Epsilon", "Delta"])
g2.plot_macrostates(which="terminal", legend_loc="right", s=100)
We call this semi-automatic terminal state identification as we manually pass the macrostates we would like to select, however, the actual cells belonging to each macrostate are assigned automatically. For initial states, everything works exactly the same, you can identify them fully automatically, or semi-automatically. Let’s compute the initial states fully automatically here:
g2.predict_initial_states()
g2.plot_macrostates(which="initial", s=100)
How can we check whether this is really the correct initial state? In this case, we have prior knowledge that we can use: we know that the initial state should be high for a number of Ductal cell markers. So let’s use these markers to compute a score that we can visualize across macrostates to confirm that we have the correct one.
# compute a score in scanpy by aggregating across a few ductal markers
sc.tl.score_genes(
adata, gene_list=["Bicc1", "Sox9", "Anxa2"], score_name="initial_score"
)
# write macrostates to AnnData
adata.obs["macrostates"] = g2.macrostates
adata.uns["macrostates_colors"] = g2.macrostates_memberships.colors
# visualize via heatmaps
sc.pl.violin(adata, keys="initial_score", groupby="macrostates", rotation=90)
In fact, the 'Ngn3 low EP_1'
state scores highest, confirming the identification of this state as as initial state. The same strategy can be used to confirm terminal state identification, or to guide semi-automatic identification in the first place.
Closing matters¶
What’s next?¶
In this tutorial, you learned how to use CellRank’s GPCCA estimator
to compute initial and terminal states of cellular dynamics. For the next steps, we recommend to:
go through the fate probabilities & driver genes tutorial to learn how to quantify cellular fate commitment.
take a look at the
API
to learn about parameter values you can use to adapt these computations to your data.read the original GPCCA publications to get familiar with the maths and ideas behind these computations [Reuter et al., 2019, Reuter et al., 2018].
Package versions¶
cr.logging.print_versions()
cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2