Preprocessing

This tutorial helps you preprocess your raw data so that you can run Scyan afterwards.

You'll learn how to:

Create an adata object based on an FCS (or CSV) file and preprocess it.
Create the knowledge table required for the annotation.
(Optional) Compute a UMAP and save your dataset for later use.

Before continuing, make sure you have already installed scyan.

In [1]:

import scyan

Global seed set to 0

1. Creation of an `AnnData` object for your cytometry data

Consider reading the anndata documentation if you have never heard about anndata before (it's a nice library for handling single-cell data).

Note

Make sure you only keep the population of interest. E.g., if you are interested in immune cells, consider providing only the live cells that are CD45+. If this is not possible, continue the tutorial, but consider running Scyan to filter these cells before annotating the populations.

a) Loading a `FCS` or `CSV` file

You probably have .fcs or .csv files that you want to load. For this, you can use scyan.read_fcs or scyan.read_csv.

In [7]:

# If you have a FCS file
adata = scyan.read_fcs("<path-to-fcs>.fcs")

# If you have a CSV file
adata = scyan.read_csv("<path-to-csv>.csv")

print(f"Created anndata object with {adata.n_obs} cells and {adata.n_vars} markers.\n\n-> The markers names are: {', '.join(adata.var_names)}\n-> The non-marker names are: {', '.join(adata.obs.columns)}")

Created anndata object with 216331 cells and 42 markers.

-> The markers names are: epcam, CD4, CD38, CD1a, CD24, CD123, CD47, CD39, CD31, CD169, CCR7, CD44, CD141, CD1c, CD9, HLADQ, CD11b, CD103, CD3/16/9/20, CD366, PD1, CD21, CD127, GP38, CD14, CD45, CD206, CTLA4, CD207, CD223, PDL1, CD69, CD25, Siglec10, HLADR, FOLR2, CADM1, CD45RA, CD5, Via dye, CD88, CD8
-> The non-marker names are: Time, SSC-H, SSC-A, FSC-H, FSC-A, SSC-B-H, SSC-B-A, AF-A

b) Sanity check

Make sure that the listed markers (i.e., adata.var_names) contain only protein markers, and that every other variable is inside adata.obs. If this is not the case, consider reading the documentation of scyan.read_fcs or scyan.read_csv for more advanced usage (e.g., you can update marker_regex="^cd|^hla|epcam|^ccr" to target all your markers).
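To get a feel for what such a pattern selects, the sketch below applies the regex quoted above to a few of the names from the example output, using Python's standard `re` module. It only illustrates the pattern itself: whether Scyan matches names case-insensitively is an assumption made here, and the list of names is a shortened, hand-picked subset.

```python
import re

# A few names taken from the example output above (subset chosen for illustration).
column_names = ["epcam", "CD4", "CCR7", "HLADR", "PD1", "Via dye", "Time", "FSC-A"]

# The pattern quoted in the text; case-insensitive matching is an assumption of this sketch.
marker_regex = re.compile("^cd|^hla|epcam|^ccr", flags=re.IGNORECASE)

matched = [name for name in column_names if marker_regex.search(name)]
unmatched = [name for name in column_names if not marker_regex.search(name)]

print("Matched as markers:", matched)
print("Not matched:", unmatched)
```

With this pattern, names such as PD1 are not matched, so a panel containing them would need an extra alternative in the regex (for instance `|^pd`).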

c) Concatenate your data (optional)

If you have multiple FCS files, consider concatenating your data. We advise adding an observation column such as "batch" or "patient_id" to keep the information about the batch / patient ID.

This short script concatenates all the FCS files inside a specific folder, and saves each file name into adata.obs["file"] so that we don't lose information. You can add additional information, e.g. in adata.obs["batch"] if you have different batches.

import anndata
from pathlib import Path

folder_path = Path(".") # Replace "." by the path to your folder containing FCS files
fcs_paths = [path for path in folder_path.iterdir() if path.suffix == ".fcs"]

def read_one(path):
    adata = scyan.read_fcs(path)
    adata.obs["file"] = path.stem
    adata.obs["batch"] = "NA" # If you have batches, add here the batch of the corresponding path
    return adata

adata = anndata.concat([read_one(p) for p in fcs_paths], index_unique="-")
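If the batch can be read off the file name, the "NA" placeholder in the script above could be replaced by a small helper. The sketch below assumes a purely hypothetical naming convention, `<batch>_<sample>.fcs`; the file names and the `batch_of` helper are illustrative and would need adapting to your own files.

```python
from pathlib import Path

# Hypothetical file names following an assumed "<batch>_<sample>.fcs" convention.
fcs_paths = [
    Path("batch1_patientA.fcs"),
    Path("batch1_patientB.fcs"),
    Path("batch2_patientC.fcs"),
]

def batch_of(path: Path) -> str:
    """Return the part of the file name located before the first underscore."""
    return path.stem.split("_")[0]

# Map each file (without extension) to its batch.
batches = {path.stem: batch_of(path) for path in fcs_paths}
print(batches)
```

Inside a function like read_one, the value returned by batch_of(path) could then be stored in adata.obs["batch"].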

d) Preprocessing

Choose either the asinh or logicle transformation below, and scale your data.

In [4]:

is_cytof = True

if is_cytof: # we recommend asinh for CyTOF data
    scyan.preprocess.asinh_transform(adata)
else: # we recommend auto_logicle for flow or spectral flow
    scyan.preprocess.auto_logicle_transform(adata)

scyan.preprocess.scale(adata)
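For intuition about what an asinh transformation followed by scaling does to the values, here is a minimal NumPy sketch on made-up intensities. It is not Scyan's implementation: the cofactor of 5 is an assumed value (often used for CyTOF data), and the standardisation shown here may differ in its details from what scyan.preprocess.scale does.

```python
import numpy as np

# Toy raw intensities: 4 cells x 2 markers (made-up values).
raw = np.array([[0.0, 10.0], [5.0, 200.0], [50.0, 1000.0], [500.0, 5.0]])

# Assumption: a cofactor of 5; not necessarily the value used by Scyan.
cofactor = 5.0
transformed = np.arcsinh(raw / cofactor)

# Per-marker standardisation (zero mean, unit standard deviation).
scaled = (transformed - transformed.mean(axis=0)) / transformed.std(axis=0)

print(scaled.round(2))
```

The asinh behaves almost linearly near zero and logarithmically for large values, which compresses the long right tail of the raw intensities before scaling.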

2. Creation of the knowledge table

Note

Some existing tables can be found here. They may help you build your own table.

The knowledge table contains well-known marker expressions per population. For instance, if you want Scyan to annotate CD4 T cells, you have to tell it which markers CD4 T cells are supposed to express or not. Depending on your panel, it may be CD4+, CD8-, CD45+, CD3+, etc. Values inside the table can be:

-1 for negative expressions.
1 for positive expressions.
NA when you don't know or if it is not applicable (if you use a CSV, you can also leave the field empty; it will be read as NaN by pandas).
Some float values such as 0 or -0.5 for mid and low expressions respectively (use them only when necessary).

Each row corresponds to one population, and each column corresponds to one marker (i.e., one of adata.var_names).

You can either create a CSV directly, or use Excel and export the table as CSV. You can then import the CSV as a pandas DataFrame.
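As a small illustration of the CSV layout described above, the sketch below reads a tiny table from an in-memory string instead of a file. The populations, markers, and values are hypothetical; the point is only to show one row per population, one column per marker, and how both "NA" and empty fields end up as NaN.

```python
import io

import pandas as pd

# Hypothetical table: populations as rows, markers as columns.
# "NA" and empty fields are both read as NaN by pandas.
csv_text = """Populations,CD3,CD4,CD8,CD45
CD4 T cells,1,1,-1,1
CD8 T cells,1,-1,1,1
NK cells,-1,NA,,1
"""

table = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(table)
```

Columns that contain a NaN are stored as floats, which is why some columns are displayed as -1.0 and 1.0 rather than -1 and 1.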

Example

In [5]:

import pandas as pd

In [6]:

table = pd.read_csv("<path-to-csv>.csv", index_col=0)

In [7]:

table.head() # Display the first 5 rows of the table

Out[7]:

	CD19	CD4	CD8	CD34	CD20	CD45	CD123	CD11c	CD7	CD16	CD38	CD3	HLA-DR	CD64
Populations
Basophils	-1	NaN	-1.0	-1	-1.0	NaN	1	-1	-1.0	-1.0	NaN	-1	-1.0	-1.0
CD4 T cells	-1	1.0	-1.0	-1	-1.0	NaN	-1	-1	NaN	-1.0	NaN	1	-1.0	-1.0
CD8 T cells	-1	-1.0	1.0	-1	-1.0	NaN	-1	-1	1.0	-1.0	NaN	1	-1.0	-1.0
CD16- NK cells	-1	NaN	NaN	-1	-1.0	NaN	-1	-1	1.0	-1.0	NaN	-1	-1.0	-1.0
CD16+ NK cells	-1	NaN	NaN	-1	NaN	NaN	-1	-1	1.0	1.0	NaN	-1	-1.0	-1.0

You can read our advice on creating this table.

Sanity check

Make sure table.index contains population names, and that table.columns contains existing marker names (i.e., included in adata.var_names).
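This check can be automated with a set difference on the column index. The sketch below uses stand-in objects with hypothetical values; with real data, var_names would be adata.var_names and table your own knowledge table.

```python
import pandas as pd

# Stand-ins for adata.var_names and the knowledge table (hypothetical values).
var_names = pd.Index(["CD3", "CD4", "CD8", "CD45"])
table = pd.DataFrame(
    [[1, 1, -1], [1, -1, 1]],
    index=["CD4 T cells", "CD8 T cells"],
    columns=["CD3", "CD4", "HLA-DR"],
)

# Markers listed in the table but absent from the dataset.
missing = table.columns.difference(var_names)
print("Table markers missing from the data:", list(missing))
```

An empty result means every column of the table is a known marker; here "HLA-DR" is reported, which usually points to a naming mismatch (e.g. "HLADR" versus "HLA-DR").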

NB: the table index can be a MultiIndex to list hierarchical populations, and the first level should correspond to the most precise populations (see how to work with hierarchical populations).
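As a rough sketch of what such a MultiIndex can look like, the example below reads a CSV whose first two columns form the index, with the most precise populations in the first level. The level names, populations, and values are hypothetical and only meant to show the layout.

```python
import io

import pandas as pd

# Hypothetical hierarchical table: the first index column holds the most precise
# populations, the second one a coarser grouping (names are assumptions).
csv_text = """Populations,Group name,CD3,CD4,CD8
CD4 T cells,T cells,1,1,-1
CD8 T cells,T cells,1,-1,1
NK cells,NK cells,-1,,
"""

# index_col=[0, 1] makes pandas build a two-level MultiIndex.
table = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

print(table.index.nlevels)
print(table.index.get_level_values(0).tolist())
```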

3. (Optional) Compute a UMAP

You can compute the UMAP coordinates using scyan.tools.umap. Its API documentation describes the available options: in particular, you can compute the UMAP on a specific set of markers, or on a subset of cells to speed up the computation.

Note that it only computes the coordinates; you then have to use scyan.plot.umap to display them.

In [ ]:

# Option 1: Use all markers to compute the UMAP
scyan.tools.umap(adata)

# Option 2: Use only the cell-type markers (recommended), or choose your own list of markers
scyan.tools.umap(adata, markers=table.columns)

4. (Optional) Save your data for later use

You can use scyan.data.add to save your data.

In [9]:

scyan.data.add("your-project-name", adata, table)

INFO:scyan.data.datasets:Creating new dataset folder at /.../your_project_name
INFO:scyan.data.datasets:Created file /.../your_project_name/default.h5ad
INFO:scyan.data.datasets:Created file /.../your_project_name/default.csv

From now on, you can simply load your processed data with scyan.data.load:

In [10]:

adata, table = scyan.data.load("your-project-name")

Next steps

Congratulations! You can now follow our tutorial on model training and visualization.