Note
Click here to download the full example code
Fit models and plot gene trends¶
This example shows how to prepare, fit and plot various models to gene expression trends.
We will focus mostly on Generalized Additive Models (GAMs)
and show how to do this for sklearn
estimators towards the end. GAMs are flexible models that are well suited to
model non-linear gene trends as they often appear in single-cell data. Further, they have the advantage that it is
relatively straightforward to derive a confidence interval around the main trend.
from sklearn.svm import SVR
import cellrank as cr
adata = cr.datasets.pancreas_preprocessed("../example.h5ad")
adata
Out:
AnnData object with n_obs × n_vars = 2531 × 2000
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'dpt_pseudotime'
var: 'highly_variable_genes', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'fit_r2', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'velocity_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'diffmap_evals', 'iroot', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
obsm: 'X_diffmap', 'X_pca', 'X_umap', 'velocity_umap'
varm: 'PCs', 'loss'
layers: 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'spliced', 'unspliced', 'velocity', 'velocity_u'
obsp: 'connectivities', 'distances'
First, we need to compute a set of terminal states and we have to estimate fate probabilities towards these. The fate probabilities will be used as weights when fitting the model - they determine how much each cell contributes to each lineage.
cr.tl.terminal_states(
adata,
cluster_key="clusters",
weight_connectivities=0.2,
softmax_scale=4,
show_progress_bar=False,
)
cr.tl.lineages(adata)
Out:
INFO: Using pre-computed schur decomposition
Models in cellrank.ul.models
follow similar patterns as sklearn
models.
We begin by initializing and preparing the model for fitting. cellrank.ul.models.BaseModel.prepare()
requires
only the gene name and the lineage name and must be called before cellrank.ul.models.BaseModel.fit()
. It also
includes various useful parameters, such as time_range
or weight_threshold
, which determine the start and
end pseudotime and the minimum required threshold for lineage probabilities, respectively.
model = cr.ul.models.GAM(adata)
model.prepare(
gene="Pak3",
lineage="Alpha",
time_key="dpt_pseudotime",
data_key="Ms",
n_test_points=100,
)
Out:
<GAM[gene='Pak3', lineage='Alpha', model=GammaGAM(callbacks=['deviance', 'diffs'], fit_intercept=True, max_iter=2000, scale=None, terms=s(0), tol=0.0001, verbose=False)]>
CellRank allows any pseudotime from adata.obs
to be passed via the time_key
, e.g. Diffusion Pseudotime (DPT)
[Haghverdi16], scvelo
’s latent time [Bergen20], Palantir’s pseudotime [Setty19], etc.
Further, CellRank accepts imputed gene expression values stored in adata.layers
via the data_key
, i.e. you
can pass MAGIC [MAGIC18] imputed data, scvelo.pp.moments()
[Bergen20] (used below) or any other form of
imputation.
Once the model has been prepared, it is ready for fitting and prediction.
y_hat = model.fit().predict()
y_hat
Out:
array([0.04188798, 0.04126436, 0.04068079, 0.0401372 , 0.03963354,
0.03916981, 0.03874606, 0.03836239, 0.038019 , 0.03771614,
0.03745415, 0.03723348, 0.03705466, 0.03691835, 0.03682534,
0.03677654, 0.03677303, 0.03681602, 0.03690694, 0.0370474 ,
0.03723921, 0.03748443, 0.03778539, 0.03814469, 0.03856524,
0.03905032, 0.03960357, 0.04022907, 0.04093137, 0.04171554,
0.04258723, 0.04355275, 0.04461914, 0.04579423, 0.04708616,
0.04850137, 0.05004637, 0.05172833, 0.05355516, 0.05553548,
0.05767874, 0.05999519, 0.06249599, 0.06519324, 0.06809998,
0.07123035, 0.07459959, 0.07822409, 0.0821215 , 0.08631077,
0.09081224, 0.09564777, 0.10084067, 0.1064159 , 0.11240011,
0.11882179, 0.12571123, 0.13310066, 0.14102435, 0.1495185 ,
0.15862171, 0.16837463, 0.17882016, 0.19000345, 0.2019722 ,
0.21477619, 0.2284676 , 0.24310021, 0.25872743, 0.27540293,
0.29318136, 0.31211773, 0.3322675 , 0.3536854 , 0.3764255 ,
0.40054053, 0.42608052, 0.4530935 , 0.48162335, 0.5117095 ,
0.54338586, 0.5766802 , 0.61161226, 0.6481933 , 0.6864243 ,
0.7262959 , 0.76778543, 0.810857 , 0.8554596 , 0.9015258 ,
0.9489717 , 0.9976943 , 1.0475708 , 1.0984584 , 1.1501942 ,
1.2025931 , 1.2554485 , 1.308532 , 1.3615954 , 1.4143687 ],
dtype=float32)
Optionally, we can also get the confidence interval. Models which don’t have a method to compute it, such as
cellrank.ul.models.SKLearnModel
wrapper for some sklearn
estimators, can use the default the
confidence interval cellrank.ul.models.BaseModel.default_confidence_interval()
.
conf_int = model.confidence_interval()
conf_int[:5]
Out:
array([[0.03875637, 0.04527263],
[0.03821021, 0.04456262],
[0.03770048, 0.0438967 ],
[0.03722708, 0.04327481],
[0.03678997, 0.0426969 ]], dtype=float32)
After the prediction and optionally the confidence interval calculation, we can plot the smoothed gene expression.
Cells in this plot have been colored by their fate probability of reaching the Alpha terminal state. We include these probabilities as weights in the loss function when fitting the model. This allows us to weight each cell by its relative contribution to the lineage, without needing to subset cells for each lineage.
model.plot(conf_int=True)

Lastly, wrapping sklearn
estimators is fairly simple, we just pass the instance
to cellrank.ul.models.SKLearnModel
.
svr = SVR()
model = cr.ul.models.SKLearnModel(adata, model=svr)
model
Out:
<SKLearnModel[gene=None, lineage=None, model=SVR()]>
Total running time of the script: ( 0 minutes 39.544 seconds)
Estimated memory usage: 536 MB