Visualizing and Clustering Gene Expression Trends¶

Preliminaries¶

In this tutorial, you will learn how to:

compute and visualize gene expression trends along specific differentiation trajectories.
cluster expression trends to detect genes with similar dynamical patterns.

This tutorial notebook can be downloaded using the following link.

../../../_images/800_expression_trends.jpg — CellRank fits Generalized Additive Models to recover continuous gene expression dynamics along specific trajectories.¶

Note

If you want to run this on your own data, you will need:

a scRNA-seq dataset for which you computed a cell-cell transition matrix using any CellRank kernel. For each kernel, there exist individual tutorials that teach you how to use them. For example, we have tutorials about using RNA velocity, experimental time points, and pseudotime, among others, to derive cell-cell transition probabilities.
any pseudotime computed on your data. We’ll use the Palantir pseudotime in our example [Setty et al., 2019].
imputed gene expression data. We’ll use MAGIC here, but you can use any imputation strategy that works for your data [van Dijk et al., 2018].

Note

If you encounter any bugs in the code, our if you have suggestions for new features, please open an issue. If you have a general question or something you would like to discuss with us, please post on the scverse discourse. You can also contact us using info@cellrank.org.

Import packages & data¶

import sys

if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank

import numpy as np
import pandas as pd

import cellrank as cr
import scanpy as sc
import scvelo as scv

scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
sc.settings.set_figure_params(frameon=False, dpi=100)

Global seed set to 0

import warnings

warnings.simplefilter("ignore", category=UserWarning)

To demonstrate the appproach in this tutorial, we will use a scRNA-seq dataset of pancreas development at embyonic day E15.5 [Bastidas-Ponce et al., 2019]. We used this dataset in our CellRank meets RNA velocity tutorial to compute a cell-cell transition matrix using the VelocityKernel; this tutorial builds on the transition matrix computed there. The dataset including the transition matrix can be conveniently acessed through cellrank.datasets.pancreas [Bastidas-Ponce et al., 2019, Bergen et al., 2020].

adata = cr.datasets.pancreas(kind="preprocessed-kernel")

Additional MAGIC imputed expression for this data example is available on Figshare. See the MAGIC documentation or CellRank’s reproducibility repo for more info.

magic_imputed_data = pd.read_csv(
    "../datasets/pancreas_magic_imputed_data.csv", index_col=0
)

# filter to those genes for which we have imputed expression values
mask = np.in1d(adata.var_names, magic_imputed_data.columns)
adata = adata[:, mask].copy()

# add imputed data to an extra layer in AnnData
adata.layers["magic_imputed_data"] = magic_imputed_data[adata.var_names].loc[
    adata.obs_names
]

Reconstruct a VelocityKernel from the AnnData object.

vk = cr.kernels.VelocityKernel.from_adata(adata, key="T_fwd")
vk

VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=4.013]

Note

The method from_adata() allows you to reconstruct a CellRank kernel from an AnnData object. When you use a kernel to compute a transition matrix, this matrix, as well as a few informations about the computation, are written to the underlying anndata.AnnData object upon calling write_to_adata(). This makes it easy to write and read kernels from disk via the associated AnnData object.

Compute terminal states and fate probabilities¶

We need to infer or assign terminal states and compute fate probabilities before we can visualize gene expression trends.

Visualize gene expression trends¶

Prepare for trend plotting¶

To visualize gene expression trends, we fit Generalized Additive Models (GAMs) to MAGIC-imputed gene expression data, ordering cells in pseudotime, and weighting each cell’s contribution to each trajectory according to fate probabilities. While we will be using the Palantir pseudotime here, you can use whatever pseudotime works best for your data, including Diffusion Pseudotime (DPT). Let’s start by visulizing the Palantir pseudotime.

sc.pl.embedding(
    adata,
    basis="umap",
    color=["clusters", "palantir_pseudotime"],
    color_map="gnuplot",
    legend_loc="on data",
)

../../../_images/a001497eeed337f3a95cd10f210fadba5ef08d0566eab35200db86a061ea3b4d.png

This looks promising, the pseudotime increases continiously from early states (Ngn3 low EP) until late states (Alpha, Beta, Delta, Epsilon). We can make this more quantitative, without relying on the 2D UMAP, by showing the distribution of pseudotime values per cluster.

sc.pl.violin(adata, keys=["palantir_pseudotime"], groupby="clusters", rotation=90)

../../../_images/900204a260b46385ed07b9d7b7cc0ffcdad235fd35b18eaa507e4e5670273b51.png

This confirms what we saw in the UMAP.

Visualize expression trends via line plots¶

The first step in gene trend plotting is to initialite a model for GAM fitting.

model = cr.models.GAMR(adata, n_knots=6, smoothing_penalty=10.0)

Note

CellRank supports two packages for GAM-fitting: pyGAM (in Python) and mgcv (in R). For us, mgcv usually yields better results, but is a bit harder to get to include in a Python-based workflow as you have to use rpy2.

Next, we specify the pseudotime via the time_key, the data layer via the data_key, the gene set via genes, and a few other parameters for plotting. See the API for a full description.

cr.pl.gene_trends(
    adata,
    model=model,
    data_key="magic_imputed_data",
    genes=["Arx", "Pdx1", "Hhex", "Enpp1"],
    same_plot=True,
    ncols=2,
    time_key="palantir_pseudotime",
    hide_cells=True,
    weight_threshold=(1e-3, 1e-3),
)

Computing trends using `1` core(s)

    Finish (0:00:01)
Plotting trends

../../../_images/386eddcaee9916fab1f434f6d2efd0b265194ee6f0bd823fe549b25f82412082.png

Above, we highlight the trajectory-specific dynamics of a few TFs, some of which have known functions in pancreatic fate specification. For example, Hhex and Arx are involved in Delta and Alpha cell generation, respectively [Wilcox et al., 2013, Zhang et al., 2014].

Visualize expression cascades via heatmaps¶

In this section, we focus on just one trajectory and visualize the temporal activation of several genes along this trajectory. As an example, let’s take the Beta trajectory. We computed putative driver genes for this trajectory in the Estimating Fate Probabilities and Driver Genes tutorial, here, we explore their temporal activation cascade. To do this, as before, we fit the expression of putative Beta-driver genes along the pseudotime using GAMs, weighting each cell’s contribution according to Beta fate_probabilities.

# compute putative drivers for the Beta trajectory
beta_drivers = g.compute_lineage_drivers(lineages="Beta")

# plot heatmap
cr.pl.heatmap(
    adata,
    model=model,  # use the model from before
    lineages="Beta",
    cluster_key="clusters",
    show_fate_probabilities=True,
    data_key="magic_imputed_data",
    genes=beta_drivers.head(40).index,
    time_key="palantir_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)

Adding `adata.varm['terminal_lineage_drivers']`
       `.lineage_drivers`
    Finish (0:00:00)
Computing trends using `1` core(s)

    Finish (0:00:02)

../../../_images/0813153820cc33775593dd8e578757138b96650229d5af49abed0c7154541299.png

Each row in this heatmap corresponds to one gene, sorted according to their expression peak in pseudotime. We show cluster annotations, and the fate probabilities used to determine each cells’s contribution to the Beta trajectory, at the top.

Our analysis suggests the following temporal activation of TFs with known function in Beta-cell fate choice: TPT1 (also called TCTP) [Diraison et al., 2011, Tsai et al., 2014] peaks early, followed by Pax4 [Napolitano et al., 2015, Sosa-Pineda et al., 1997] and Mnx1 [Dalgin and Prince, 2012, Pan et al., 2015]. As expected, Ins1 and Ins2 peak late, and so do Gng12 [Byrnes et al., 2018] and Pdx1 [Bastidas-Ponce et al., 2017, Moin and Butler, 2019, Zhu et al., 2021].

Beyond genes with a previously documented role in Beta cell fate choice, this analysis also pinpoints putative new drivers, including the early-peaking TFs Ptma and Srsf3.

Cluster gene expression trends¶

Compute clustering¶

To identify genes with similar expression patterns, CellRank can cluster expression trends. Importantly, we do this again in a trajectory-specific fashion, i.e. we only consider GAM-fitted expression trends towards Beta cells and cluster these. We only consider highly-variable genes here to speed up the computation.

cr.pl.cluster_trends(
    adata,
    model=model,  # use the model from before
    lineage="Beta",
    data_key="magic_imputed_data",
    genes=adata[:, adata.var["highly_variable"]].var_names,
    time_key="palantir_pseudotime",
    weight_threshold=(1e-3, 1e-3),
    n_jobs=8,
    random_state=0,
    clustering_kwargs={"resolution": 0.2, "random_state": 0},
    neighbors_kwargs={"random_state": 0},
)

Computing trends using `8` core(s)

Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0
Global seed set to 0

    Finish (0:02:32)
Saving data to `adata.uns['lineage_Beta_trend']`

../../../_images/5a264fe84588c3d2d5c38840a83050957fb192f3ee91671b09d19e3d03c642c7.png

The results are written to uns, conveniently as an AnnData object, where genes now take the role of the observations.

gdata = adata.uns["lineage_Beta_trend"].copy()
gdata

AnnData object with n_obs × n_vars = 1996 × 200
    obs: 'clusters'
    var: 'x_test'
    uns: 'pca', 'neighbors', 'leiden'
    obsm: 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Let’s merge some gene annotations from the original AnnData object.

cols = ["means", "dispersions"]
gdata.obs = gdata.obs.merge(
    right=adata.var[cols], how="left", left_index=True, right_index=True
)
gdata

AnnData object with n_obs × n_vars = 1996 × 200
    obs: 'clusters', 'means', 'dispersions'
    var: 'x_test'
    uns: 'pca', 'neighbors', 'leiden'
    obsm: 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Analzye gene clusters¶

We have a gene-level PCA and k-NN graph - let’s use that to compute and plot a gene-level UMAP embedding.

sc.tl.umap(gdata, random_state=0)

sc.pl.embedding(gdata, basis="umap", color=["clusters", "means"], vmax="p95")

../../../_images/e360376bc3d577296bf59dfffb03c6b0d091fe6295c90324ca9dba0c776250a1.png

Each dot represent a gene, colored by expression trend clusters along the Beta trajectory (left) or the mean expression across cells (right). Let’s highlight a few driver genes we discussed above.

genes = ["Gng12", "Ins1", "Ins2", "Pdx1", "Mnx1", "Ptma", "Srsf3", "Pax4"]

# make sure the genes were included in the cluster analysis
mask = np.array([gene in gdata.obs_names for gene in genes])

if not mask.all():
    print(f"Could not find {np.array(genes)[~mask]}")

# subset to the genes that are present
genes = list(np.array(genes)[mask])

# retrieve their indices for plotting
indices = [np.where(np.in1d(gdata.obs_names, gene))[0][0] for gene in genes]
title = ["clusters"] + genes

scv.pl.scatter(
    gdata, c=["clusters"] + indices, title=title, ncols=3, legend_loc="right"
)

../../../_images/173e740dcd21123d98c2753c34011072ea63277b4b259b320953c6caf318d7f8.png

Many of the late-peaking TFs we discussed when plotting the heatmap, including Gng12, Ins1, Ins2 and and Pdx1, are assigned to cluster 0, which is indeed a late-peaking cluster. We can confirm this by checking in the gdata object.

gdata[genes].obs

	clusters	means	dispersions
Gng12	0	0.878893	1.414954
Ins1	0	2.744454	5.726741
Ins2	0	2.016665	4.514492
Pdx1	0	0.990023	1.193838
Mnx1	1	0.295992	0.935342
Ptma	4	2.462189	1.153156
Srsf3	4	1.438808	0.470357
Pax4	5	0.794812	1.326806

Pax4 is assigned to the intermediate-peak cluster 5, we can now query for genes that have similar expression dynamics along the Beta trajectory by looking at genes from the same cluster.

# obtain some genes from the same cluster, sort by mean expression
mid_peak_genes = (
    gdata[gdata.obs["clusters"] == "5"]
    .obs.sort_values("means", ascending=False)
    .head(8)
    .index
)

# plot
cr.pl.gene_trends(
    adata,
    model=model,
    lineages="Beta",
    cell_color="clusters",
    data_key="magic_imputed_data",
    genes=["Pax4"] + list(mid_peak_genes),
    same_plot=True,
    ncols=3,
    time_key="palantir_pseudotime",
    hide_cells=False,
    weight_threshold=(1e-3, 1e-3),
)

Computing trends using `1` core(s)

    Finish (0:00:00)
Plotting trends

../../../_images/d599d110eae09941dfbe5739066caab322d09a63a98d52690d8723b3f7712d28.png

To showcase the flexibility of this plotting function, we’re only drawing the trends towards Beta, and we visualize cells, colored by their cluster assignment. This serves as an example of how gene trend clustering can be used to identify genes that show similar expression dynamics to a known driver gene of a given lineage.

Closing matters¶

What’s next?¶

In this tutorial, you learned how to visualize trajectory-specific gene expression trends via line plots or heatmaps, and how to cluster them. For the next steps, we recommend to:

explore this with cell-transitions derived from another kernel. The advantage of CellRank is that you can run the exact same computations from above, no matter what data modality you used to obtain your cell-cell transition matrix.
take a look at the API to learn about parameters you can use to adapt these computations to your data.
try this out on your own data.

Package versions¶

cr.logging.print_versions()

cellrank==1.5.2.dev206+ga2748bea scanpy==1.9.3 anndata==0.8.0 numpy==1.24.2 numba==0.57.0rc1 scipy==1.10.1 pandas==1.5.3 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.4 scvelo==0.3.0 pygam==0.8.0 matplotlib==3.7.0 seaborn==0.12.2