Batch effect correction¶
Runtime: about 3 min after loading the dataset.
The batch effect correction is illustrated on the POISED dataset, which contains 7 batches.
Note
Scyan performs batch effect correction at the same time as training for cell-type annotation. Since these two tasks are linked, we advise you to already have good annotations before trying batch effect correction: the more precise the annotations, the better the batch effect correction (see the quick check sketched after the data loading below).
import scyan
Global seed set to 0
adata, table = scyan.data.load("poised") # Automatic data loading
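As a quick sanity check before enabling batch correction, you might first train a plain annotation model and look at the predicted populations. This is only a minimal sketch: it assumes the usual scyan.Scyan / fit / predict workflow, and that predictions are stored under the default "scyan_pop" key of adata.obs.
# Optional sketch: verify that annotations look reasonable before batch correction
model_check = scyan.Scyan(adata, table)  # no batch_key: annotation only
model_check.fit()
model_check.predict()  # assumed to write populations into adata.obs["scyan_pop"]
print(adata.obs["scyan_pop"].value_counts())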
Amplify batch effect (skip this step on your data)¶
To illustrate batch effect correction on a more complex dataset, we amplify the batch effect. Skip this step on your own data, since you don't want to add extra noise.
import numpy as np

noise = np.random.exponential(
    scale=1, size=(7, adata.n_vars, adata.n_vars)
)

adata.X = scyan.preprocess.unscale(adata)
adata.X = scyan.preprocess.inverse_transform(adata)

for i, batch in enumerate(adata.obs.batch.cat.categories):
    spillover = np.eye(adata.n_vars) + 0.01 * noise[i]
    adata.X[adata.obs.batch == batch] = (
        adata.X[adata.obs.batch == batch] @ spillover
    )

scyan.preprocess.asinh_transform(adata)
scyan.preprocess.scale(adata)
[INFO] (scyan.preprocess) Performing inverse asinh transform
[INFO] (scyan.preprocess) Data will be standardised, and translated so that 0 goes to -1. This is advised only when using CyTOF data (if this is not your case, consider running 'auto_logicle_transform' instead of 'asinh_transform').
scyan.tools.umap(adata, markers=table.columns)
[INFO] (scyan.tools.representation) Fitting UMAP...
UMAP(min_dist=0.5, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
scyan.plot.umap(adata, color="batch", title="UMAP before batch correction")
Correct batch effect¶
To correct the batch effect, you must have a column in adata.obs dedicated to the batch. Then, provide a batch_key to scyan.Scyan, i.e. the name of this batch column. You can also choose a batch_ref, i.e. the batch on which all the other batches will be aligned, or let Scyan choose it for you (by default, the batch with the highest number of cells).
The batch information is automatically added as one of the model categorical_covariates. You can add more categorical or continuous covariates if needed (see the scyan.Scyan API); a short sketch follows this paragraph.
If you use your own dataset, start with the Scyan default parameters.
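For illustration, adding extra covariates could look like the sketch below. This is only an assumption-laden example: 'donor' and 'total_counts' are hypothetical columns of adata.obs (they are not part of this dataset), and the categorical_covariates / continuous_covariates argument names should be checked against the scyan.Scyan API.
# Hypothetical sketch (not run in this tutorial): extra covariates besides the batch
model_with_covariates = scyan.Scyan(
    adata,
    table,
    batch_key="batch",                       # enables batch effect correction
    categorical_covariates=["donor"],        # hypothetical categorical covariate
    continuous_covariates=["total_counts"],  # hypothetical continuous covariate
)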
# Only 'batch_key' is required for batch correction
model = scyan.Scyan(adata, table, batch_key="batch", prior_std=0.35)
# For better batch effect correction, you could increase the default 'patience' and decrease 'min_delta' arguments of model.fit()
# NB: if you have less than 1 million cells, you can also increase 'max_epochs'
model.fit(patience=10, min_delta=0.1)
[INFO] (scyan.model) Initialized Scyan model with N=4178320 cells, P=26 populations and M=19 markers.
   ├── Covariates: batch
   ├── No continuum-marker provided
   └── Batch correction mode: True
[INFO] (scyan.model) Training scyan with the following hyperparameters:
   "batch_key": batch
   "batch_size": 8192
   "hidden_size": 16
   "lr": 0.0005
   "max_samples": 200000
   "modulo_temp": 3
   "n_hidden_layers": 6
   "n_layers": 7
   "prior_std": 0.35
   "temperature": 0.5
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/quentinblampey/Library/Caches/pypoetry/virtualenvs/scyan-5lsXrWE1-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/setup.py:201: UserWarning: MPS available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='mps', devices=1)`.
  rank_zero_warn(
  | Name   | Type        | Params
---------------------------------------
0 | module | ScyanModule | 33.4 K
---------------------------------------
33.4 K    Trainable params
0         Non-trainable params
33.4 K    Total params
0.134     Total estimated model params size (MB)
`Trainer.fit` stopped: `max_epochs=100` reached.
[INFO] (scyan.model) Successfully ended traning.
Scyan model with N=4178320 cells, P=26 populations and M=19 markers.
   ├── Covariates: batch
   ├── No continuum-marker provided
   └── Batch correction mode: True
After model training, we can align the distributions on a reference batch (here, B10). Note that the corrected expressions are still preprocessed and scaled. You can reverse these transformations with scyan.preprocess.unscale and scyan.preprocess.inverse_transform.
# Get corrected marker expressions (warning: the values are still preprocessed and standardised)
adata.obsm["scyan_corrected"] = model.batch_effect_correction().numpy(force=True)
Refine fit / handling NA markers¶
Sometimes, you will have a panel that includes some markers for which you have no prior knowledge at all. To handle batch effect correction for such markers, you can run model.refine_fit() (after having run model.fit()).
Note
Such markers have to be added to the table (if not done yet): just add a column of NAs for each of these markers.
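For instance, since the knowledge table is a DataFrame whose columns are markers, adding such a marker could look like the sketch below ('MyUnknownMarker' is a placeholder name used for illustration, not part of the POISED panel).
import numpy as np

# Sketch: add an all-NA column for a marker with no prior knowledge
# "MyUnknownMarker" is a hypothetical marker name
table["MyUnknownMarker"] = np.nan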
model.predict()
model.refine_fit()
Then, run the batch effect correction and the UMAP again:
adata.obsm["scyan_corrected"] = model.batch_effect_correction(batch_ref="B10").numpy(force=True)
scyan.tools.umap(adata, obsm="scyan_corrected");
scyan.plot.umap(adata, color="batch", title="UMAP after refined batch correction")
[INFO] (scyan.tools.representation) Fitting UMAP...
Use batch corrected expressions for further analysis¶
The batch-effect-corrected expressions are stored in adata.obsm["scyan_corrected"], and the corresponding marker names are in model.var_names. You can use them for further analysis.
print(adata.obsm["scyan_corrected"])
print(f"Marker names: {', '.join(model.var_names)}")
[[-0.9539349  -0.93171173 -0.92845386 ... -0.9675956  -0.951983   -0.81466496]
 [ 3.1832964   1.8580393  -0.49163747 ... -0.48776484 -0.5667361   0.6440381 ]
 [ 3.1670866   1.3268812  -0.4432921  ... -0.5298085  -0.65013283  0.57265604]
 ...
 [ 3.005826    3.4372778   1.9195399  ...  1.6127968   5.003887    5.5465055 ]
 [ 1.5146042   2.7854807   0.25790054 ...  0.6421641   1.5063114   2.5468485 ]
 [ 1.5101717   2.7549195   0.4220283  ...  0.5319461   1.8654416   2.3529358 ]]
Marker names: CD19, CD20, CD3, CD4, CD8, TCRgd, CD16, CD56, CD25, CD127, CD45RA, CCR7, HLA.DR, CD14, CD11c, CD123, CD27, CD69, CD40L
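If you prefer to keep the corrected values in a dedicated object for downstream analysis, one option is to build a new AnnData from this matrix. This is a minimal sketch, assuming the anndata package (the one AnnData objects come from) is available; remember that the values are still preprocessed and standardised.
import anndata

# Sketch: standalone AnnData holding only the batch-corrected markers
adata_corrected = anndata.AnnData(
    X=adata.obsm["scyan_corrected"],
    obs=adata.obs[["batch"]].copy(),
)
adata_corrected.var_names = list(model.var_names)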
You can also update adata.X and work on the unscaled data. Note that this only updates the markers present in the table, so consider including all your markers in the table if needed (see the note above).
adata[:, model.var_names].X = adata.obsm["scyan_corrected"]
adata.X = scyan.preprocess.unscale(adata)
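As a final illustration of downstream usage, the sketch below (assuming only pandas, already installed alongside scyan) compares the per-batch mean expression of the corrected markers, which should now be similar across batches.
import pandas as pd

# Sketch: per-batch mean expression of the corrected (and now unscaled) markers
df = pd.DataFrame(
    adata[:, model.var_names].X,
    columns=list(model.var_names),
    index=adata.obs_names,
)
df["batch"] = adata.obs["batch"].values
print(df.groupby("batch").mean())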