# Pancreas Basics¶

This tutorial shows how to apply CellRank in order to infer initial or terminal states of a developmental process and how to compute probabilistic fate mappings. The first part of this tutorial is very similar to scVelo’s tutorial on pancreatic endocrinogenesis. The data we use here comes from Bastidas-Ponce et al. (2018). For more info on scVelo, see the documentation or read the article.

This is the **high level mode** to interact with CellRank, which is quick and simple but does not offer as many options as the lower level mode, in which you interact directly with our kernels and estimators. If you would like to get to know this more advanced way of interacting with CellRank, see the pancreas
advanced tutorial.

This tutorial notebook can be downloaded using the following link.

## Import packages & data¶

Easiest way to start is to download Miniconda3 along with the environment file found here. To create the environment, run `conda create -f environment.yml`

.

```
[1]:
```

```
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo')
cr.settings.verbosity = 2
```

First, we need to get the data. The following commands will download the `adata`

object and save it under `datasets/endocrinogenesis_day15.5.h5ad`

.

```
[2]:
```

```
adata = cr.datasets.pancreas()
scv.utils.show_proportions(adata)
adata
```

```
Abundance of ['spliced', 'unspliced']: [0.81 0.19]
```

```
[2]:
```

```
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
```

## Pre-process the data¶

Filter out genes which don’t have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scvelo functions, for more information about them, see the scVelo API.

```
[3]:
```

```
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
```

```
Filtered out 22024 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Exctracted 2000 highly variable genes.
Logarithmized X.
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
```

## Run scVelo¶

We will use the dynamical model from scVelo to estimate the velocities. The first step, estimating the parameters of the dynamical model, may take a while (~10min). To make sure we only have to run this once, we developed a caching extension called scachepy. scachepy does not only work for `recover_dynamics`

, but it can cache the output of almost any scanpy or scvelo function. To install it,
simply run

```
pip install git+https://github.com/theislab/scachepy
```

If you don’t want to install scachepy now, don’t worry, the below cell will run without it as well and this is the only place in this tutorial where we’re using it.

```
[4]:
```

```
try:
import scachepy
c = scachepy.Cache('../../cached_files/basic_tutorial/')
c.tl.recover_dynamics(adata, force=False)
except ModuleNotFoundError:
print("You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). ")
scv.tl.recover_dynamics(adata)
```

```
You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min).
recovering dynamics
finished (0:11:50) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
```

Once we have the parameters, we can use these to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely two cells are to transition into another, given their velocity vectors and relative positions.

```
[5]:
```

```
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```

```
computing velocities
finished (0:00:02) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
finished (0:00:05) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
```

```
[6]:
```

```
scv.pl.velocity_embedding_stream(adata, basis='umap', legend_fontsize=12, title='', smooth=.8, min_mass=4)
```

```
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
```

## Run CellRank¶

CellRank builds on the velocities computed by scVelo to compute initial and terminal states of the dynamical process. It further uses the velocities to calculate how likely each cell is to develop from each initial state or towards each terminal state.

## Identify terminal states¶

Terminal states can be computed by running the following command:

```
[7]:
```

```
cr.tl.terminal_states(adata, cluster_key='clusters', weight_connectivities=0.2)
```

```
Computing transition matrix based on logits using `'deterministic'` mode
Estimating `softmax_scale` using `'deterministic'` mode
```

```
Setting `softmax_scale=3.9741`
```

```
Finish (0:00:14)
Using a connectivity kernel with weight `0.2`
Computing transition matrix based on connectivities
Finish (0:00:00)
Computing eigendecomposition of the transition matrix
Adding `.eigendecomposition`
`adata.uns['eig_fwd']`
Finish (0:00:00)
Computing Schur decomposition
Adding `.eigendecomposition`
`adata.uns['eig_fwd']`
`.schur`
`.schur_matrix`
Finish (0:00:00)
Computing `3` macrostates
INFO: Using pre-computed schur decomposition
Adding `.macrostates_memberships`
`.macrostates`
`.schur`
`.coarse_T`
`.coarse_stationary_distribution`
Finish (0:00:00)
Adding `adata.obs['terminal_states_probs']`
`adata.obs['terminal_states']`
`adata.obsm['macrostates_fwd']`
`.terminal_states_probabilities`
`.terminal_states`
```

The most important parameters in the above function are:

`estimator`

: this determines what’s going to behind the scenes to compute the terminal states. Options are`cr.tl.estimators.CFLARE`

(“Clustering and Filtering of Left and Right Eigenvectors”) or`cr.tl.estimators.GPCCA`

(“Generalized Perron Cluster Cluster Analysis”). The latter is the default, it computes terminal states by coarse graining the velocity-derived Markov chain into a set of macrostates that represent the slow-time scale dynamics of the process, i.e. it finds the states that you are unlikely to leave again, once you have entered them.`cluster_key`

: takes a key from`adata.obs`

to retrieve pre-computed cluster labels, i.e. ‘clusters’ or ‘louvain’. These labels are then mapped onto the set of terminal states, to associate a name and a color with each state.`n_states`

: number of expected terminal states. This parameter is optional - if it’s not provided, this number is estimated from the so-called ‘eigengap heuristic’ of the spectrum of the transition matrix.`method`

: This is only relevant for the estimator`GPCCA`

. It determines the way in which we compute and sort the real Schur decomposition. The default,`krylov`

, is an iterative procedure that works with sparse matrices which allows the method to scale to very large cell numbers. It relies on the libraries SLEPSc and PETSc, which you will have to install separately, see our installation instructions. If your dataset is small (<5k cells), and you don’t want to install these at the moment, use`method='brandts'`

. The results will be the same, the difference is that`brandts`

works with dense matrices and won’t scale to very large cells numbers.`weight_connectivities`

: additionally to the velocity-based transition probabilities, we use a transition matrix computed on the basis of transcriptomic similarities to make the algorithm more robust. Essentially, we are taking a weighted mean of these two sources of inforamtion, where the weight for transcriptomic similarities is defined by`weight_connectivities`

.

When running the above command, CellRank adds a key `terminal_states`

to adata.obs and the result can be plotted as:

```
[8]:
```

```
cr.pl.terminal_states(adata)
```

## Identify initial states¶

The same sort of analysis can now be repeated for the initial states, only that we use the function `cr.tl.initial_states`

this time:

```
[9]:
```

```
cr.tl.initial_states(adata, cluster_key='clusters')
cr.pl.initial_states(adata, discrete=True)
```

```
Computing transition matrix based on logits using `'deterministic'` mode
Estimating `softmax_scale` using `'deterministic'` mode
```

```
Setting `softmax_scale=3.9741`
```

```
Finish (0:00:11)
Using a connectivity kernel with weight `0.2`
Computing transition matrix based on connectivities
Finish (0:00:00)
Computing eigendecomposition of the transition matrix
Adding `.eigendecomposition`
`adata.uns['eig_bwd']`
Finish (0:00:00)
WARNING: For 1 macrostate, stationary distribution is computed
Adding `.macrostates_memberships`
`.macrostates`
Finish (0:00:00)
Adding `adata.obs['initial_states_probs']`
`adata.obs['initial_states']`
`adata.obsm['macrostates_bwd']`
`.terminal_states_probabilities`
`.terminal_states`
```

We found one initial state, located in the Ngn3 low EP cluster.

## Compute fate maps¶

Once we know the terminal states, we can compute associated fate maps - for each cell, we ask how likely is the cell to develop towards each of the identified terminal states.

```
[10]:
```

```
cr.tl.lineages(adata)
cr.pl.lineages(adata, same_plot=False)
```

```
Computing lineage probabilities towards terminal states
Computing absorption probabilities
Adding `adata.obsm['to_terminal_states']`
`adata.obs['to_terminal_states_dp']`
`.absorption_probabilities`
`.diff_potential`
Finish (0:00:01)
Adding lineages to `adata.obsm['to_terminal_states']`
Finish (0:00:01)
```

We can aggregate the above into a single, global fate map where we associate each terminal state with color and use the intensity of that color to show the fate of each individual cell:

```
[11]:
```

```
cr.pl.lineages(adata, same_plot=True)
```

This shows that the dominant terminal state at E15.5 is `Beta`

, consistent with known biology, see e.g. Bastidas-Ponce et al. (2018).

## Directed PAGA¶

We can further aggragate the individual fate maps into a cluster-level fate map using an adapted version of PAGA with directed edges. We first compute scVelo’s latent time with CellRank identified `root_key`

and `end_key`

, which are the probabilities of being an initial or a terminal state, respectively.

```
[12]:
```

```
scv.tl.recover_latent_time(adata, root_key='initial_states_probs', end_key='terminal_states_probs')
```

```
computing latent time using initial_states_probs, terminal_states_probs as prior
finished (0:00:01) --> added
'latent_time', shared time (adata.obs)
```

Next, we can use the inferred pseudotime along with the initial and terminal states probabilities to compute the directed PAGA.

```
[13]:
```

```
scv.tl.paga(adata, groups='clusters', root_key='initial_states_probs', end_key='terminal_states_probs',
use_time_prior='velocity_pseudotime')
```

```
running PAGA using priors: ['velocity_pseudotime', 'initial_states_probs', 'terminal_states_probs']
finished (0:00:01) --> added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns)
'paga/transitions_confidence', velocity transitions (adata.uns)
```

```
[14]:
```

```
cr.pl.cluster_fates(adata, mode="paga_pie", cluster_key="clusters", basis='umap',
legend_kwargs={'loc': 'top right out'}, legend_loc='top left out',
node_size_scale=5, edge_width_scale=1, max_edge_width=4, title='directed PAGA')
```

We use pie charts to show cell fates averaged per cluster. Edges between clusters are given by transcriptomic similarity between the clusters, just as in normal PAGA.

Given the fate maps/probabilistic trajectories, we can ask interesting questions like:

How does expression of a given gene vary along a specified lineage/trajectory (cellrank.pl.gene_trends and cellrank.pl.heatmap?

How plastic are different cellular subpopulations (cellrank.pl.cluster_fates(adata, …, mode=’violin’)?

To find out more, check out the CellRank API.

## Compute lineage drivers¶

We can compute the driver genes for all or just the subset of lineages. We can also restric this to some subset of clusters by specifying `clusters=...`

(not shown below).

In the resulting dataframe, we also see the p-value, the corrected p-value (q-value) and the 95% confidence interval for the correlation statistic.

```
[15]:
```

```
cr.tl.lineage_drivers(adata)
```

```
Computing correlations for lineages `['Epsilon' 'Alpha' 'Beta']` restricted to clusters `None` in layer `X` with `use_raw=False`
Adding `.lineage_drivers`
`adata.var['to Epsilon corr']`
`adata.var['to Alpha corr']`
`adata.var['to Beta corr']`
Finish (0:00:00)
```

```
[15]:
```

Epsilon corr | Epsilon pval | Epsilon qval | Epsilon ci low | Epsilon ci high | Alpha corr | Alpha pval | Alpha qval | Alpha ci low | Alpha ci high | Beta corr | Beta pval | Beta qval | Beta ci low | Beta ci high | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

index | |||||||||||||||

Ghrl | 0.804862 | 0.000000e+00 | 0.000000e+00 | 0.790695 | 0.818167 | -0.108862 | 3.900840e-08 | 3.133205e-07 | -0.147200 | -0.070198 | -0.402898 | 2.699204e-102 | 3.856006e-100 | -0.435031 | -0.369740 |

Anpep | 0.471119 | 7.319804e-146 | 7.319804e-143 | 0.440238 | 0.500886 | -0.076036 | 1.279278e-04 | 5.443738e-04 | -0.114658 | -0.037184 | -0.225912 | 6.723208e-31 | 1.792856e-29 | -0.262562 | -0.188610 |

Gm11837 | 0.465009 | 1.702133e-141 | 1.134755e-138 | 0.433909 | 0.495003 | -0.052619 | 8.095108e-03 | 2.187867e-02 | -0.091393 | -0.013685 | -0.241054 | 4.128495e-35 | 1.270306e-33 | -0.277410 | -0.204008 |

Maged2 | 0.377645 | 8.827210e-89 | 4.413605e-86 | 0.343741 | 0.410566 | -0.036364 | 6.737225e-02 | 1.273577e-01 | -0.075220 | 0.002601 | -0.200897 | 1.309633e-24 | 2.619267e-23 | -0.237996 | -0.163212 |

Ccnd2 | 0.377266 | 1.377857e-88 | 5.511429e-86 | 0.343351 | 0.410198 | 0.151691 | 1.515620e-14 | 2.424993e-13 | 0.113399 | 0.189532 | -0.352176 | 2.103505e-76 | 1.682804e-74 | -0.385843 | -0.317571 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Dlk1 | -0.321305 | 5.844865e-63 | 7.793154e-61 | -0.355813 | -0.285923 | -0.163880 | 9.195224e-17 | 1.718734e-15 | -0.201555 | -0.125721 | 0.327886 | 1.090629e-65 | 6.415466e-64 | 0.292663 | 0.362220 |

Nkx6-1 | -0.331912 | 2.148291e-67 | 3.305064e-65 | -0.366139 | -0.296788 | -0.313250 | 1.023518e-59 | 9.304705e-58 | -0.347965 | -0.277677 | 0.454694 | 2.411776e-134 | 6.890787e-132 | 0.423230 | 0.485063 |

Gng12 | -0.333200 | 6.033913e-68 | 1.005652e-65 | -0.367392 | -0.298108 | -0.324200 | 3.760203e-64 | 4.423769e-62 | -0.358632 | -0.288887 | 0.464301 | 5.382152e-141 | 1.794051e-138 | 0.433175 | 0.494320 |

Nnat | -0.346478 | 8.493802e-74 | 1.698760e-71 | -0.380306 | -0.311724 | -0.318837 | 5.915315e-62 | 6.226648e-60 | -0.353409 | -0.283396 | 0.468074 | 1.130457e-143 | 4.521828e-141 | 0.437083 | 0.497954 |

Pdx1 | -0.354978 | 1.046614e-77 | 2.325809e-75 | -0.388566 | -0.320448 | -0.330513 | 8.468224e-67 | 1.058528e-64 | -0.364778 | -0.295355 | 0.482662 | 2.177751e-154 | 1.451834e-151 | 0.452205 | 0.511996 |

2000 rows × 15 columns

Afterwards, we can plot the top 5 driver genes (based on the correlation), e.g. for the `Alpha`

lineage:

```
[16]:
```

```
cr.pl.lineage_drivers(adata, lineage="Alpha", n_genes=5)
```

## Plot smooth gene expression trends along trajectories¶

The functions demonstrated above are the main functions of CellRank: computing initial and terminal states and probabilistic fate maps. We can use the computed probabilites now to e.g. smooth gene expression trends along lineages.

Let’s start with a temporal ordering for the cells. To get this, we can compute scVelo’s latent time, as computed previously, or alternatively, we can just use CellRank’s initial states to compute a Diffusion Pseudotime.

```
[17]:
```

```
# compue dpt, starting from CellRank defined root cell
root_idx = np.where(adata.obs['initial_states'] == 'Ngn3 low EP')[0][0]
adata.uns['iroot'] = root_idx
sc.tl.dpt(adata)
scv.pl.scatter(adata, color=['clusters', root_idx, 'latent_time', 'dpt_pseudotime'], fontsize=16,
cmap='viridis', perc=[2, 98], colorbar=True, rescale_color=[0, 1],
title=['clusters', 'root cell', 'latent time', 'dpt pseudotime'])
```

```
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
```

We can plot dynamics of genes in pseudotime along individual trajectories, defined via the fate maps we computed above.

```
[18]:
```

```
model = cr.ul.models.GAM(adata)
cr.pl.gene_trends(adata, model=model, data_key='X',
genes=['Pak3','Neurog3', 'Ghrl'], ncols=3,
time_key='latent_time', same_plot=True, hide_cells=True,
figsize=(15, 4), n_test_points=200)
```

```
Computing trends using `1` core(s)
```

```
Finish (0:00:00)
Plotting trends
```

We can also visualize the lineage driver computed above in a heatmap:

```
[19]:
```

```
cr.pl.heatmap(adata, model, genes=adata.var['to Alpha corr'].sort_values(ascending=False).index[:100],
show_absorption_probabilities=True,
lineages="Alpha", n_jobs=1, backend='loky')
```

```
Computing trends using `1` core(s)
```

```
did not converge
did not converge
Finish (0:00:11)
```