Estimating Fate Probabilities and Driver Genes#

Preliminaries#

In this tutorial, you will learn how to:

compute and visualize fate probabilities towards terminal states.
aggregate fate probabilities over groups of cells to uncover fate-biased populations.
correlate fate probabilities with gene expression to detect putative driver genes of fate decisions.

This tutorial notebook can be downloaded using the following link.

CellRank computes fate probabilities via absorption probabilities on the Markov chain. — **Quantifying cellular fate bias with absorption probabilities**: For each cell that is not assigned to a terminal state, we estimate its fate probabilities of reaching any terminal state. To compute fate probabilities, we consider random walks, starting from the query cell, and count how often these end up in each terminal state. We repeat this many times and use the arrival frequencies in terminal states as fate probabilities. Luckily, we do not actually have to simulate random walks: we use absorption probabilities on the Markov chain, which quantify these arrival frequencies for an infinite number of random walks [Lange *et al.*, 2022].#

Note

If you want to run this on your own data, you will need:

a scRNA-seq dataset for which you computed a cell-cell transition matrix using any CellRank kernel. For each kernel, there exist individual tutorials that teach you how to use them. For example, we have tutorials about using RNA velocity, experimental time points, and pseudotime, among others, to derive cell-cell transition probabilities.

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

Import packages & data#

import sys

if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank

import cellrank as cr
import scanpy as sc

cr.settings.verbosity = 2
sc.settings.set_figure_params(frameon=False, dpi=100)

Global seed set to 0

import warnings

warnings.simplefilter("ignore", category=UserWarning)

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of pancreas development at embyonic day E15.5 [Bastidas-Ponce et al., 2019]. We used this dataset in our CellRank meets RNA velocity tutorial to compute a cell-cell transition matrix using the VelocityKernel; this tutorial builds on the transition matrix computed there. The dataset including the transition matrix can be conveniently acessed through cellrank.datasets.pancreas [Bastidas-Ponce et al., 2019, Bergen et al., 2020].

adata = cr.datasets.pancreas(kind="preprocessed-kernel")

vk = cr.kernels.VelocityKernel.from_adata(adata, key="T_fwd")
vk

VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]

Note

The method from_adata() allows you to reconstruct a CellRank kernel from an AnnData object. When you use a kernel to compute a transition matrix, this matrix, as well as a few informations about the computation, are written to the underlying anndata.AnnData object upon calling write_to_adata(). This makes it easy to write and read kernels from disk via the associated AnnData object.

Compute terminal states#

We need to infer or assign terminal states before we can compute fate probabilities towards them.

Estimate cellular fate bias#

Compute fate probabilities#

We compute fate_probabilities by aggregating over all random walks that start in a given cell and end in some terminal population. There exists an analytical solution to this problem called absorption probabilities, their computation is 30x faster in CellRank 2 compared to version 1 and scales to millions of cells.

g.compute_fate_probabilities()
g.plot_fate_probabilities(same_plot=False)

Computing fate probabilities

Adding `adata.obsm['lineages_fwd']`
       `.fate_probabilities`
    Finish (0:00:00)

../../../_images/7da67ab458a08fe77b94283af0b52806b59f2c705a7b127745840a31854f6b37.png

Most cells appear to be fate-biased towards Beta cells, in agreement with known biology for mouse pancreas development at E15.5 [Bastidas-Ponce et al., 2019]. We can visualize these probabilities in compact form by coloring cells according to their most-likely fate, with color intensity reflecting the degree of lineage-bias by using plot_fate_probabilities().

g.plot_fate_probabilities(same_plot=True)

../../../_images/0276bb7347ca476b1797e2ee39d2298f18ec8227116d39e59708542504e42c5a.png

Another compact way to simultanesosly visualize fate probabilities towards several terminal states is via cirular projections [Velten et al., 2017]. In such a plot, we arrange all terminal states around a unit circle and place cells inside this circle, with a position determined by their fate probabilities.

cr.pl.circular_projection(adata, keys=["clusters", "clusters_fine"], legend_loc="right")

Solving TSP for `4` states

../../../_images/72c1eae9cc9ee0b1ea9b4ab0c2101a98fc588e544961964567fa22c8cafe2d18.png

Left plot: As expected, the algorithm placed more commited cells (Alpha, Beta, etc. ) closer to the edges and less commited cells (Ngn3 low EP, Fev+, etc. ) closer to the middle of the circle.

Right plot: Here, we zoom into the Fev+ progenitor population using a subclustering from the original publication [Bastidas-Ponce et al., 2019]. As expected, each Fev+ subpopulation is placed closer to its corresponding terminal state.

Aggregate fate probabilities#

Often, we are interested in aggregated fate probabilities over a group of cells, like a cell- type or state. As an example, let’s focus on the Fev+ progenitor populations, and their fate commitment towards Delta cells. Let’s start with an overview of these subpopulations.

fev_states = ["Fev+ Beta", "Fev+ Alpha", "Fev+ Pyy", "Fev+ Delta", "Fev+ Epsilon"]
sc.pl.embedding(
    adata, basis="umap", color="clusters_fine", groups=fev_states, legend_loc="right"
)

/Users/marius/miniforge3/envs/py39_arm_cr/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))

../../../_images/eaf6b6bd4177bb72e143d3f2758e756193253ec9cadad0658cd22541da00ea7b.png

Using violin plots, we visualize how fate-commited each of these populations is towards Delta cells.

cr.pl.aggregate_fate_probabilities(
    adata,
    mode="violin",
    lineages=["Delta"],
    cluster_key="clusters_fine",
    clusters=fev_states,
)

../../../_images/89544ef88922e1357658df7dd82a224b99d6e89325443f0b3c1f8a0847f7a71d.png

As expected, Fev+ Delta cells are most commited towards Delta cells. CellRank supports many more ways to visualize aggregated fate probabilities, for example via heatmaps, bar plots, or paga-pie charts, see the API.

Uncover driver genes#

Correlate fate probabilities with gene expression#

We uncover putative driver genes by correlating fate probabilities with gene expression using the compute_lineage_drivers() method. In other words, if a gene is systematically higher or lower expressed in cells that are more or less likely to differentiate towards a given terminal states, respectively, then we call this gene a putative driver gene.

As an example, let’s focus on Delta cell generation again, and let’s restrict the correlation-computation to the relevant clusters.

driver_clusters = ["Delta", "Fev+", "Ngn3 high EP", "Ngn3 low EP"]

delta_df = g.compute_lineage_drivers(
    lineages=["Delta"], cluster_key="clusters", clusters=driver_clusters
)
delta_df.head(10)

Adding `adata.varm['terminal_lineage_drivers']`
       `.lineage_drivers`
    Finish (0:00:00)

	Delta_corr	Delta_pval	Delta_qval	Delta_ci_low	Delta_ci_high
index
Sst	0.837618	0.000000e+00	0.000000e+00	0.820782	0.852999
Iapp	0.734253	1.101269e-254	3.285085e-251	0.708401	0.758137
Arg1	0.585548	2.595327e-131	5.161241e-128	0.548993	0.619867
Frzb	0.548492	4.214793e-111	6.286364e-108	0.509679	0.585077
Pyy	0.530892	1.761453e-102	2.101766e-99	0.491063	0.568506
Rbp4	0.509664	7.962654e-93	7.917532e-90	0.468659	0.548478
Hhex	0.498949	3.079775e-88	2.624849e-85	0.457370	0.538352
1500009L16Rik	0.490320	1.125324e-84	8.392101e-82	0.448288	0.530189
Fam159b	0.482760	1.202195e-81	7.969214e-79	0.440338	0.523030
Ptprz1	0.459699	6.441979e-73	3.843284e-70	0.416131	0.501161

This computed correlation values, as well as p-values, multiple-testing corrected q-values, and confidence intervals.

Visualize putative driver genes#

We can visually explore a few of these genes identified above.

adata.obs["fate_probabilities_delta"] = g.fate_probabilities["Delta"].X.flatten()

sc.pl.embedding(
    adata,
    basis="umap",
    color=["fate_probabilities_delta"] + list(delta_df.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)

/Users/marius/miniforge3/envs/py39_arm_cr/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))

../../../_images/208ff4bb1945706840dd706e9ce4028ff38e9aa710ed0e8d3ab3839a2320bbba.png

We show fate probabilities towards the Delta population and some of the top-correlating genes. Among them are Somatostatin (Sst; the hormone produced by Delta cells) and the homeodomain transcription factor Hhex (hematopoietically expressed homeobox), a known driver gene for Delta cell generation [Zhang et al., 2014].

Beyond Delta cells, let’s compute driver genes for all trajectories, and let’s visualize some of the top-correlating genes for the Alpha and Beta lineages using plot_lineage_drivers_correlation().

# compute driver genes
driver_df = g.compute_lineage_drivers()

# define set of genes to annotate
alpha_genes = ["Arx", "Gcg", "Irx2", "Etv1", "Pou6f2", "Meis2"]
beta_genes = ["Ins2", "Gng12", "Ins1", "Nkx6-1", "Pdx1", "Sytl4"]

genes_oi = {
    "Alpha": alpha_genes,
    "Beta": beta_genes,
}

# make sure all of these exist in AnnData
assert [
    gene in adata.var_names for genes in genes_oi.values() for gene in genes
], "Did not find all genes"

# compute mean gene expression across all cells
adata.var["mean expression"] = adata.X.A.mean(axis=0)

# visualize in a scatter plot
g.plot_lineage_drivers_correlation(
    lineage_x="Beta",
    lineage_y="Alpha",
    adjust_text=True,
    gene_sets=genes_oi,
    color="mean expression",
    legend_loc="none",
    figsize=(5, 5),
    dpi=150,
    fontsize=9,
    size=50,
)

Adding `adata.varm['terminal_lineage_drivers']`
       `.lineage_drivers`
    Finish (0:00:00)

Adjusting text position
    Finish (0:00:05)

../../../_images/f45444941913501d60cb3b3fab6b5e07d95c42ddea9b79dd21403719e9759b5e.png

In the plot above, each dot represents a gene, colored by mean expression across all cells, with Beta- and Alpha correlations on the x- and y-axis, respectively. As expected, these two are anti-correlated; genes that correlate positively with the Alpha fate correlate negatively with the Beta fate and vice versa. We annotate a few transciption factors (TFs; according to https://www.ncbi.nlm.nih.gov/) with known functions in Beta-or Alpha trajectories, all of which are among the top-30 highest correlated genes:

Gene Name	Reference	Trajectory
Arx	PMID: 32149367, PMID: 23785486, PMID: 24236044	Alpha
Irx2	PMID: 26977395	Alpha
Etv1	PMID: 32439909	Alpha
Gng12	PMID: 30254276	Beta
Nkx6-1	PMID: 33121533, PMID: 34349281	Beta
Pdx1	PMID: 31401713, PMID: 33526589, PMID: 28580283	Beta
Sytl4	PMID: 27032672, PMID: 12058058	Beta

Further, we annotate Gcg, the gene encoding the Alpha cell hormone Glucagon, as well as Ins1 and Ins2, the genes encoding the Beta cell hormone Insulin. We also annotate Pou6f2 and Meis2, two TFs which are highly correlated with the Alpha fate probabilities, with no known function in Alpha cell formation or maintenance, to the best of our knowledge. These serve as examples of how fate probabilities can be used to predict putative new driver genes.

Closing matters#

What’s next?#

In this tutorial, you learned how to compute, aggregate and visualize fate probabilities and predict driver genes. For the next steps, we recommend to:

go through the gene trend plotting tutorial to learn how to visulize gene dynamics in pseudotime.
take a look at the API to learn about parameters you can use to adapt these computations to your data.
try this out on your own data.

Package versions#

cr.logging.print_versions()

cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2