Preprocessing¶
This tutorial helps you preprocess your raw data so that you can run Scyan afterwards.
You'll learn how to:
- Create an adata object based on an FCS (or CSV) file and preprocess it.
- Create the knowledge table required for the annotation.
- (Optional) Compute a UMAP and save your dataset for later use.
Before continuing, make sure you have already installed scyan.
import scyan
Global seed set to 0
1. Creation of an AnnData object for your cytometry data¶
Consider reading the anndata documentation if you have never heard about anndata
before (it's a nice library for handling single-cell data).
Note
Make sure you only keep the population of interest. E.g., if you are interested in immune cells, consider providing only the live cells that are CD45+. If this is not possible, continue the tutorial, but consider running Scyan to filter these cells before annotating the populations.
a) Loading an FCS or CSV file¶
You probably have .fcs or .csv files that you want to load. For this, you can use scyan.read_fcs or scyan.read_csv.
# If you have a FCS file
adata = scyan.read_fcs("<path-to-fcs>.fcs")
# If you have a CSV file
adata = scyan.read_csv("<path-to-csv>.csv")
print(f"Created anndata object with {adata.n_obs} cells and {adata.n_vars} markers.\n\n-> The markers names are: {', '.join(adata.var_names)}\n-> The non-marker names are: {', '.join(adata.obs.columns)}")
Created anndata object with 216331 cells and 42 markers.

-> The markers names are: epcam, CD4, CD38, CD1a, CD24, CD123, CD47, CD39, CD31, CD169, CCR7, CD44, CD141, CD1c, CD9, HLADQ, CD11b, CD103, CD3/16/9/20, CD366, PD1, CD21, CD127, GP38, CD14, CD45, CD206, CTLA4, CD207, CD223, PDL1, CD69, CD25, Siglec10, HLADR, FOLR2, CADM1, CD45RA, CD5, Via dye, CD88, CD8
-> The non-marker names are: Time, SSC-H, SSC-A, FSC-H, FSC-A, SSC-B-H, SSC-B-A, AF-A
b) Sanity check¶
Make sure that the listed markers (i.e., adata.var_names) contain only protein markers, and that every other variable is inside adata.obs. If this is not the case, consider reading the documentation of scyan.read_fcs or scyan.read_csv for more advanced usage (e.g., you can update marker_regex="^cd|^hla|epcam|^ccr" to target all your markers).
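To see which names a given pattern would pick up, you can try it on your column names outside Scyan first. The sketch below is a standalone illustration based on Python's re module, not Scyan's own code. It assumes that the pattern is searched case-insensitively in each name (check the scyan.read_fcs documentation for the exact behaviour), and split_columns is a hypothetical helper introduced only for this example.

```python
import re


def split_columns(names, marker_regex):
    """Split column names into (markers, others) using a case-insensitive regex.

    Hypothetical helper for illustration only; it is not part of Scyan.
    """
    pattern = re.compile(marker_regex, flags=re.IGNORECASE)
    markers = [name for name in names if pattern.search(name)]
    others = [name for name in names if not pattern.search(name)]
    return markers, others


# A few column names taken from the output shown earlier in this tutorial
names = ["epcam", "CD4", "HLADQ", "CCR7", "PD1", "GP38", "Time", "FSC-A"]

markers, others = split_columns(names, "^cd|^hla|epcam|^ccr")
print(markers)  # ['epcam', 'CD4', 'HLADQ', 'CCR7']
print(others)   # ['PD1', 'GP38', 'Time', 'FSC-A']

# Extending the pattern so that names starting with "PD" or "GP" are also targeted
extended_markers, _ = split_columns(names, "^cd|^hla|epcam|^ccr|^pd|^gp")
print(extended_markers)  # ['epcam', 'CD4', 'HLADQ', 'CCR7', 'PD1', 'GP38']
```

In this toy case, PD1 and GP38 are protein markers that the first pattern misses, which is the kind of situation where the pattern needs to be extended.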
c) Concatenate your data (optional)¶
If you have multiple FCS files, consider concatenating your data. We advise adding an observation column such as "batch" or "patient_id" to keep track of the batch / patient ID.
This short script concatenates all the FCS files inside a specific folder, and saves each file name into adata.obs["file"] so that we don't lose information. You can add additional information, e.g. in adata.obs["batch"] if you have different batches.
import anndata
from pathlib import Path
folder_path = Path(".") # Replace "." by the path to your folder containing FCS files
fcs_paths = [path for path in folder_path.iterdir() if path.suffix == ".fcs"]
def read_one(path):
    adata = scyan.read_fcs(path)
    adata.obs["file"] = path.stem
    adata.obs["batch"] = "NA" # If you have batches, add here the batch of the corresponding path
    return adata
adata = anndata.concat([read_one(p) for p in fcs_paths], index_unique="-")
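If the batch is encoded in your file names, you could derive it from the path instead of writing "NA". The sketch below assumes a purely hypothetical naming convention, "<batch>_<sample>.fcs"; both the convention and the batch_from_path helper are illustrative assumptions, so adapt them to your own files.

```python
from pathlib import Path


def batch_from_path(path: Path) -> str:
    """Return the text before the first underscore of the file name, or "NA".

    Hypothetical helper: assumes files are named "<batch>_<sample>.fcs".
    """
    stem = path.stem
    return stem.split("_", 1)[0] if "_" in stem else "NA"


print(batch_from_path(Path("batch1_patientA.fcs")))  # batch1
print(batch_from_path(Path("sample.fcs")))           # NA
```

Inside read_one, the "NA" value could then be replaced by batch_from_path(path).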
d) Preprocessing¶
Choose either the asinh or logicle transformation below, and scale your data.
is_cytof = True
if is_cytof: # we recommend asinh for CyTOF data
    scyan.preprocess.asinh_transform(adata)
else: # we recommend auto_logicle for flow or spectral flow
    scyan.preprocess.auto_logicle_transform(adata)
scyan.preprocess.scale(adata)
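For intuition about what these two steps do to one marker, here is a generic sketch in pure Python. It is not Scyan's implementation: the cofactor of 5 is a value commonly used for CyTOF data, the scaling shown is a plain standardisation (zero mean, unit standard deviation), and asinh_sketch and standardise_sketch are hypothetical names. Scyan's own functions may use different defaults or formulas.

```python
import math
import statistics


def asinh_sketch(values, cofactor=5.0):
    """Apply asinh(x / cofactor): roughly linear near 0, logarithmic for large x."""
    return [math.asinh(v / cofactor) for v in values]


def standardise_sketch(values):
    """Center the values and divide them by their (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Toy raw intensities for a single marker, spanning several orders of magnitude
raw = [0.0, 2.0, 10.0, 100.0, 1000.0]

transformed = asinh_sketch(raw)
scaled = standardise_sketch(transformed)

print([round(v, 2) for v in transformed])
print([round(v, 2) for v in scaled])
```

The transformation compresses the large values while preserving the ordering of the cells, and the scaling puts all markers on a comparable range.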
2. Creation of the knowledge table¶
Note
Some existing tables can be found here. They can help you make your own table.
The knowledge table contains well-known marker expressions per population. For instance, if you want Scyan to annotate CD4 T cells, you have to tell it which markers CD4 T cells are supposed to express or not. Depending on your panel, it may be CD4+, CD8-, CD45+, CD3+, etc. Values inside the table can be:
- -1 for negative expressions.
- 1 for positive expressions.
- NA when you don't know or if it is not applicable (if you use a CSV, you can also leave the field empty; it will be read as NaN by pandas).
- Some float values such as 0 or -0.5 for mid and low expressions respectively (use them only when necessary).
Each row corresponds to one population, and each column corresponds to one marker (i.e., one of adata.var_names).
You can either create a csv directly, or use Excel and export the table as csv. You can then import the csv to make a pandas DataFrame.
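As a quick illustration of the CSV layout, the sketch below parses a small in-memory CSV with pandas. The two populations and their values are a subset of the example table shown in this tutorial; the in-memory string only stands in for your own file, and demo_table is a name chosen for this example.

```python
import io

import pandas as pd

# First column: population names. Other columns: one marker each.
# The CD45 fields are left empty on purpose: pandas reads them as NaN.
csv_text = """Populations,CD4,CD8,CD3,CD45
CD4 T cells,1,-1,1,
CD8 T cells,-1,1,1,
"""

demo_table = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo_table)

# Empty fields are missing values, i.e. "unknown or not applicable"
print(demo_table["CD45"].isna().all())  # True
```

Using index_col=0 turns the first column into the index, so that each row is labelled by its population name.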
Example¶
import pandas as pd
table = pd.read_csv("<path-to-csv>.csv", index_col=0)
table.head() # Display the first 5 rows of the table
Populations | CD19 | CD4 | CD8 | CD34 | CD20 | CD45 | CD123 | CD11c | CD7 | CD16 | CD38 | CD3 | HLA-DR | CD64 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Basophils | -1 | NaN | -1.0 | -1 | -1.0 | NaN | 1 | -1 | -1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
CD4 T cells | -1 | 1.0 | -1.0 | -1 | -1.0 | NaN | -1 | -1 | NaN | -1.0 | NaN | 1 | -1.0 | -1.0 |
CD8 T cells | -1 | -1.0 | 1.0 | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | 1 | -1.0 | -1.0 |
CD16- NK cells | -1 | NaN | NaN | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
CD16+ NK cells | -1 | NaN | NaN | -1 | NaN | NaN | -1 | -1 | 1.0 | 1.0 | NaN | -1 | -1.0 | -1.0 |
See also our advice for creating this table.
Sanity check¶
Make sure table.index contains population names, and that table.columns contains existing marker names (i.e., included in adata.var_names).
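This check can also be done programmatically. The sketch below uses a toy table and a toy list of marker names; missing_markers is a hypothetical helper written for this example (it is not part of Scyan), and in practice you would pass your own table and adata.var_names.

```python
import pandas as pd


def missing_markers(table: pd.DataFrame, var_names) -> list:
    """Return the table columns that are absent from var_names (hypothetical helper)."""
    known = set(var_names)
    return [marker for marker in table.columns if marker not in known]


# Toy table with a deliberately misspelled marker name ("CD3e" instead of "CD3")
demo_table = pd.DataFrame(
    {"CD4": [1, -1], "CD8": [-1, 1], "CD3e": [1, 1]},
    index=["CD4 T cells", "CD8 T cells"],
)

print(missing_markers(demo_table, ["CD4", "CD8", "CD3"]))  # ['CD3e']
```

An empty list means that every column of the table matches a known marker name.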
NB: the table index can be a MultiIndex to list hierarchical populations, and the first level should correspond to the most precise populations (see how to work with hierarchical populations).
3. (Optional) Compute a UMAP¶
You can compute the UMAP coordinates using scyan.tools.umap. The API documentation describes how to use this tool; in particular, you can choose to compute the UMAP on a specific set of markers, or on a subset of cells (to speed up the computation).
Note that it only computes the coordinates; you'll then have to use scyan.plot.umap to display them.
# Option 1: Use all markers to compute the UMAP
scyan.tools.umap(adata)
# Option 2: Use only the cell-type markers (recommended), or choose your own list of markers
scyan.tools.umap(adata, markers=table.columns)
4. (Optional) Save your data for later use¶
You can use scyan.data.add to save your data.
scyan.data.add("your-project-name", adata, table)
INFO:scyan.data.datasets:Creating new dataset folder at /.../your_project_name
INFO:scyan.data.datasets:Created file /.../your_project_name/default.h5ad
INFO:scyan.data.datasets:Created file /.../your_project_name/default.csv
From now on, you can simply load your processed data with scyan.data.load:
adata, table = scyan.data.load("your-project-name")
Next steps¶
Congratulations! You can now follow our tutorial on model training and visualization.