Batch effect correction¶
Runtime: about 3 min after loading the dataset.
The batch effect correction is illustrated on the POISED dataset, which contains 7 batches.
Note
Scyan performs batch effect correction at the same time as training for cell-type annotation. Since these two tasks are linked, we advise you to already have good annotations before trying batch effect correction: the more precise the annotations, the better the batch effect correction (see the quick check sketched after the data loading below).
import scyan
Global seed set to 0
adata, table = scyan.data.load("poised") # Automatic data loading
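As a quick sanity check before enabling batch correction, you might first train a plain annotation model and look at the predicted populations. This is only a minimal sketch: it assumes the usual scyan.Scyan / fit / predict workflow, and that predictions are stored under the default "scyan_pop" key of adata.obs.
# Optional sketch: verify that annotations look reasonable before batch correction
model_check = scyan.Scyan(adata, table)  # no batch_key: annotation only
model_check.fit()
model_check.predict()  # assumed to write populations into adata.obs["scyan_pop"]
print(adata.obs["scyan_pop"].value_counts())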
Amplify batch effect (skip this step on your data)¶
To illustrate batch effect correction on a more complex dataset, we amplify the batch effect. Skip this step on your own data, since you don't want to add extra noise.
import numpy as np

noise = np.random.exponential(
    scale=1, size=(7, adata.n_vars, adata.n_vars)
)

adata.X = scyan.preprocess.unscale(adata)
adata.X = scyan.preprocess.inverse_transform(adata)

for i, batch in enumerate(adata.obs.batch.cat.categories):
    spillover = np.eye(adata.n_vars) + 0.01 * noise[i]
    adata.X[adata.obs.batch == batch] = (
        adata.X[adata.obs.batch == batch] @ spillover
    )

scyan.preprocess.asinh_transform(adata)
scyan.preprocess.scale(adata)
[INFO] (scyan.preprocess) Performing inverse asinh transform
[INFO] (scyan.preprocess) Data will be standardised, and translated so that 0 goes to -1. This is advised only when using CyTOF data (if this is not your case, consider running 'auto_logicle_transform' instead of 'asinh_transform').
scyan.tools.umap(adata, markers=table.columns)
[INFO] (scyan.tools.representation) Fitting UMAP...
UMAP(min_dist=0.5, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
scyan.plot.umap(adata, color="batch", title="UMAP before batch correction")
Correct batch effect¶
To correct the batch effect, you must have a column in adata.obs dedicated to the batch. Then, provide a batch_key to scyan.Scyan, i.e. the name of this batch column. You can also choose a batch_ref, i.e. the batch on which all the other batches will be aligned, or let Scyan choose it for you (by default, the batch with the highest number of cells).
The batch information is automatically added as one of the model categorical_covariates. You can add more categorical or continuous covariates if needed (see the scyan.Scyan API); a short sketch follows this paragraph.
If you use your own dataset, start with the Scyan default parameters.
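For illustration, adding extra covariates could look like the sketch below. This is only an assumption-laden example: 'donor' and 'total_counts' are hypothetical columns of adata.obs (they are not part of this dataset), and the categorical_covariates / continuous_covariates argument names should be checked against the scyan.Scyan API.
# Hypothetical sketch (not run in this tutorial): extra covariates besides the batch
model_with_covariates = scyan.Scyan(
    adata,
    table,
    batch_key="batch",                       # enables batch effect correction
    categorical_covariates=["donor"],        # hypothetical categorical covariate
    continuous_covariates=["total_counts"],  # hypothetical continuous covariate
)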
# Only 'batch_key' is required for batch correction
model = scyan.Scyan(adata, table, batch_key="batch", prior_std=0.35)
# For better batch effect correction, you could increase the default 'patience' and decrease 'min_delta' arguments of model.fit()
# NB: if you have less than 1 million cells, you can also increase 'max_epochs'
model.fit(patience=10, min_delta=0.1)
[INFO] (scyan.model) Initialized Scyan model with N=4178320 cells, P=26 populations and M=19 markers.
   ├── Covariates: batch
   ├── No continuum-marker provided
   └── Batch correction mode: True
[INFO] (scyan.model) Training scyan with the following hyperparameters:
   "batch_key": batch
   "batch_size": 8192
   "hidden_size": 16
   "lr": 0.0005
   "max_samples": 200000
   "modulo_temp": 3
   "n_hidden_layers": 6
   "n_layers": 7
   "prior_std": 0.35
   "temperature": 0.5
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/quentinblampey/Library/Caches/pypoetry/virtualenvs/scyan-5lsXrWE1-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/setup.py:201: UserWarning: MPS available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='mps', devices=1)`.
  rank_zero_warn(
  | Name   | Type        | Params
---------------------------------------
0 | module | ScyanModule | 33.4 K
---------------------------------------
33.4 K    Trainable params
0         Non-trainable params
33.4 K    Total params
0.134     Total estimated model params size (MB)
`Trainer.fit` stopped: `max_epochs=100` reached.
[INFO] (scyan.model) Successfully ended traning.
Scyan model with N=4178320 cells, P=26 populations and M=19 markers.
   ├── Covariates: batch
   ├── No continuum-marker provided
   └── Batch correction mode: True
After model training, we can align the distributions on a reference batch (here, B10). Note that the corrected expressions are still preprocessed and scaled. You can reverse these transformations with scyan.preprocess.unscale and scyan.preprocess.inverse_transform.
# Get corrected marker expressions (warning: the values are still preprocessed and standardised)
adata.obsm["scyan_corrected"] = model.batch_effect_correction().numpy(force=True)
Refine fit / handling NA markers¶
Sometimes, you will have a panel that includes some markers for which you have no prior knowledge at all. To handle batch effect correction for such markers, you can run model.refine_fit() (after having run model.fit()).
Note
Such markers have to be added to the table (if not done yet): just add a column of NAs for each of these markers.
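For instance, since the knowledge table is a DataFrame whose columns are markers, adding such a marker could look like the sketch below ('MyUnknownMarker' is a placeholder name used for illustration, not part of the POISED panel).
import numpy as np

# Sketch: add an all-NA column for a marker with no prior knowledge
# "MyUnknownMarker" is a hypothetical marker name
table["MyUnknownMarker"] = np.nan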
model.predict()
model.refine_fit()
Then, run the batch effect correction and the UMAP again:
adata.obsm["scyan_corrected"] = model.batch_effect_correction(batch_ref="B10").numpy(force=True)
scyan.tools.umap(adata, obsm="scyan_corrected");
scyan.plot.umap(adata, color="batch", title="UMAP after refined batch correction")
[INFO] (scyan.tools.representation) Fitting UMAP...
Use batch corrected expressions for further analysis¶
The batch-effect-corrected expressions are stored in adata.obsm["scyan_corrected"], and the corresponding marker names are in model.var_names. You can use them for further analysis.
print(adata.obsm["scyan_corrected"])
print(f"Marker names: {', '.join(model.var_names)}")
[[-0.9539349  -0.93171173 -0.92845386 ... -0.9675956  -0.951983   -0.81466496]
 [ 3.1832964   1.8580393  -0.49163747 ... -0.48776484 -0.5667361   0.6440381 ]
 [ 3.1670866   1.3268812  -0.4432921  ... -0.5298085  -0.65013283  0.57265604]
 ...
 [ 3.005826    3.4372778   1.9195399  ...  1.6127968   5.003887    5.5465055 ]
 [ 1.5146042   2.7854807   0.25790054 ...  0.6421641   1.5063114   2.5468485 ]
 [ 1.5101717   2.7549195   0.4220283  ...  0.5319461   1.8654416   2.3529358 ]]
Marker names: CD19, CD20, CD3, CD4, CD8, TCRgd, CD16, CD56, CD25, CD127, CD45RA, CCR7, HLA.DR, CD14, CD11c, CD123, CD27, CD69, CD40L
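If you prefer to keep the corrected values in a dedicated object for downstream analysis, one option is to build a new AnnData from this matrix. This is a minimal sketch, assuming the anndata package (the one AnnData objects come from) is available; remember that the values are still preprocessed and standardised.
import anndata

# Sketch: standalone AnnData holding only the batch-corrected markers
adata_corrected = anndata.AnnData(
    X=adata.obsm["scyan_corrected"],
    obs=adata.obs[["batch"]].copy(),
)
adata_corrected.var_names = list(model.var_names)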
You can also update adata.X and work on the unscaled data. Note that this only updates the markers present in the table, so consider including all your markers in the table if needed (see the note above).
adata[:, model.var_names].X = adata.obsm["scyan_corrected"]
adata.X = scyan.preprocess.unscale(adata)
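As a final illustration of downstream usage, the sketch below (assuming only pandas, already installed alongside scyan) compares the per-batch mean expression of the corrected markers, which should now be similar across batches.
import pandas as pd

# Sketch: per-batch mean expression of the corrected (and now unscaled) markers
df = pd.DataFrame(
    adata[:, model.var_names].X,
    columns=list(model.var_names),
    index=adata.obs_names,
)
df["batch"] = adata.obs["batch"].values
print(df.groupby("batch").mean())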