Main usage of Novae¶
This tutorial shows how to load a Novae model, apply it to your spatial data, and plot results. We show Novae's usage for spatial data with single-cell resolution (e.g., Xenium, MERSCOPE); a tutorial for spot-resolution data (e.g., Visium) will follow soon.
Make sure you have installed Novae, e.g. using pip install novae, as detailed in the installation guide.
import novae
Create and process your AnnData object(s)¶
Novae's input is one or multiple AnnData objects. Having multiple AnnData objects can be useful when you have multiple gene panels, or when you don't want to concatenate your slides. See here for more details on all four possible input modes for Novae.
For this tutorial, we first show an example using one AnnData object representing a colon slide. We load it with the load_dataset function.
adata = novae.utils.load_dataset(tissue="colon", species="human", pattern=".*P2.*")[0]
adata
[INFO] (novae.utils._data) Found 1 h5ad file(s) matching the filters.
AnnData object with n_obs × n_vars = 340837 × 422
    obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region', 'slide_id', 'technology'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'log1p', 'neighbors', 'spatial_neighbors', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    layers: 'counts'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
Preprocessing is not mandatory, as Novae can handle it automatically. Here is some extra information:
- You can have either "log1p data" or "raw counts" in adata.X. In the latter case, Novae will preprocess it and save the counts in adata.obsm.
- We automatically select the genes that Novae should use. We use the highly variable genes (or possibly other genes, if there are not enough HVGs).
NB: You can disable the automatic preprocessing with
novae.settings.auto_preprocessing = False
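To make the difference between "raw counts" and "log1p data" concrete, here is a minimal sketch in plain Python. It assumes the common convention of scaling each cell to a fixed total before applying log(1 + x); the helper names and the `target_sum` value are hypothetical, and this is not necessarily the exact preprocessing that Novae applies.

```python
import math


def looks_like_raw_counts(matrix):
    """Heuristic (an assumption, not Novae's check): raw counts are non-negative integers."""
    return all(value >= 0 and float(value).is_integer() for row in matrix for value in row)


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    processed = []
    for cell in counts:
        total = sum(cell)
        # Cells without any count stay at zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        processed.append([math.log1p(value * scale) for value in cell])
    return processed


# Toy raw-count matrix: 3 cells (rows) x 4 genes (columns).
raw_counts = [
    [0, 2, 8, 0],
    [5, 5, 0, 10],
    [0, 0, 0, 0],
]

log1p_data = normalize_log1p(raw_counts)
print(looks_like_raw_counts(raw_counts), looks_like_raw_counts(log1p_data))
```

After the transformation, the values are no longer integers, which is one simple way to tell the two kinds of input apart.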
One-slide inference¶
Compute the cells neighbors¶
Novae runs on graphs of cells connected by physical proximity. To build this graph, we use novae.utils.spatial_neighbors, which connects the cells based on their locations (in adata.obsm["spatial"]).
Here, we use radius=80 to drop edges longer than 80 microns (optional).
# if you have multiple samples in the same adata object, specify `slide_key`
novae.utils.spatial_neighbors(adata, radius=80)
[INFO] (novae.utils._build) Computing graph on 340,837 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
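The log above reports delaunay=True and radius=[0.0, 80.0], i.e., candidate edges are pruned by length. The sketch below illustrates only the pruning step on a toy example, in plain Python. The function name and the hard-coded candidate edges are hypothetical; in practice the candidates would come from the triangulation, and this is not Novae's implementation.

```python
import math


def prune_long_edges(coords, edges, radius):
    """Keep only the edges whose Euclidean length is at most `radius`."""
    kept = []
    for i, j in edges:
        if math.dist(coords[i], coords[j]) <= radius:
            kept.append((i, j))
    return kept


# Toy cell locations (in microns) and candidate edges between cell indices.
coords = [(0.0, 0.0), (30.0, 40.0), (60.0, 0.0), (300.0, 0.0)]
candidate_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]

edges = prune_long_edges(coords, candidate_edges, radius=80)
print(edges)
```

Here, the edge between cells 2 and 3 is 240 microns long, so it is dropped, and cell 3 ends up without any neighbor.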
We can also show the graph of connectivities, which is a good quality control. Nodes in red are cells that are connected to very few other cells (ngh_threshold=2, by default). In particular:
- You should have a relatively low number of "red" cells. If not, increase the radius parameter.
- Regions that are far apart should not be connected. If they are, decrease the radius parameter.
If the graph does not look right, check the docs of novae.utils.spatial_neighbors to adapt it.
novae.plot.connectivities(adata)
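The same quality control can be approximated numerically by counting the neighbors of each cell and flagging the poorly connected ones. The sketch below is a plain-Python illustration; it assumes that "very few" means at most ngh_threshold neighbors (the exact comparison used by Novae may differ), and the function names are hypothetical.

```python
from collections import Counter


def count_neighbors(n_cells, edges):
    """Return the number of neighbors (graph degree) of each cell."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return [degree[i] for i in range(n_cells)]


def poorly_connected(n_cells, edges, ngh_threshold=2):
    """Indices of cells with at most `ngh_threshold` neighbors (assumed rule)."""
    degrees = count_neighbors(n_cells, edges)
    return [i for i, degree in enumerate(degrees) if degree <= ngh_threshold]


# Toy graph: cells 0 to 3 are densely connected, cell 4 hangs on a single edge.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]

flagged = poorly_connected(5, edges)
fraction_flagged = len(flagged) / 5
print(flagged, fraction_flagged)
```

A high fraction of flagged cells is the numerical counterpart of seeing many red nodes on the plot.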
Use a pretrained model¶
Now, we need to load a pretrained Novae model.
Since we have a human colon slide, we load the "MICS-Lab/novae-human-0" model, which can be used on any human tissue for spatial data with single-cell resolution. Other model names can be found here (spot-resolution models, e.g., for Visium data, will come soon).
model = novae.Novae.from_pretrained("MICS-Lab/novae-human-0")
model
Novae model
   ├── Known genes: 60697
   ├── Parameters: 32.0M
   └── Model name: MICS-Lab/novae-human-0
Now, we can compute the spatial representations for each cell. In the first option below, we pass the argument zero_shot=True to run only inference (i.e., the model is not re-trained).
If you want to fine-tune the model instead, use the fine_tune method, and then call compute_representations without the zero_shot argument (see "option 2").
# Option 1: zero-shot
model.compute_representations(adata, zero_shot=True)
# Option 2: fine-tuning
model.fine_tune(adata)
model.compute_representations(adata)
To assign domains, you can use the assign_domains method, as below. By default, it creates 7 domains, but you can choose the number of domains you want with the level argument.
The function will save the domains in adata.obs, and return the name of the column in which they were saved (in this case, adata.obs["novae_domains_7"]).
model.assign_domains(adata)
'novae_domains_7'
Then, to show the results, you can use novae.plot.domains.
You can also use scanpy.pl.spatial. In fact, novae.plot.domains uses Scanpy internally but adds extra functionalities related to Novae.
If you run model.assign_domains multiple times, you can also choose the resolution you want to show, by passing the obs_key argument to novae.plot.domains.
novae.plot.domains(adata)
[INFO] (novae.utils._validate) Using obs_key='novae_domains_7' as default.
Novae has a hierarchical organization of the spatial domains. That is, if you run assign_domains multiple times with different level parameters, the domains at different resolutions will be nested inside each other.
To plot the hierarchy of the domains, you can use the plot_domains_hierarchy method of Novae, as below:
model.plot_domains_hierarchy()
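To see why domains at different resolutions end up nested, consider a toy hierarchy in which fine-grained groups are merged step by step. The plain-Python sketch below is only an illustration of this nesting property: the merge list, the function names, and the way the tree is cut are all hypothetical, and they do not describe how Novae builds its hierarchy.

```python
def cut_hierarchy(n_leaves, merges, level):
    """Return one domain label per leaf after merging down to `level` domains."""
    if not 1 <= level <= n_leaves:
        raise ValueError("level must be between 1 and n_leaves")
    parent = list(range(n_leaves))  # each leaf starts as its own domain

    def find(i):
        # Follow parent links up to the representative of the domain.
        while parent[i] != i:
            i = parent[i]
        return i

    # Each merge removes one domain, so we apply the first (n_leaves - level) merges.
    for a, b in merges[: n_leaves - level]:
        parent[find(b)] = find(a)
    return [find(i) for i in range(n_leaves)]


def is_nested(fine, coarse):
    """True if every fine domain falls inside exactly one coarse domain."""
    mapping = {}
    for fine_label, coarse_label in zip(fine, coarse):
        if mapping.setdefault(fine_label, coarse_label) != coarse_label:
            return False
    return True


# Toy hierarchy over 6 leaves, merged in this (hypothetical) order.
merges = [(0, 1), (2, 3), (0, 2), (4, 5), (0, 4)]

domains_4 = cut_hierarchy(6, merges, level=4)
domains_2 = cut_hierarchy(6, merges, level=2)
print(domains_4, domains_2, is_nested(domains_4, domains_2))
```

Because both partitions come from the same sequence of merges, each of the 4 domains is fully contained in one of the 2 coarser domains.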
Multi-slide or multi-panel¶
You can use Novae on multiple slides or multiple panels. In that case, the input changes slightly: you can, for instance, have one AnnData object with a column that indicates the slide ID (if the slides share the same gene panel) or have a list of AnnData objects (one for each slide, or for each panel). For more details, refer to this tutorial.
Here, we load 6 mouse brain slides. adatas is therefore a list of 6 AnnData objects.
adatas = novae.utils.load_dataset(tissue="brain", species="mouse")
[INFO] (novae.utils._data) Found 6 h5ad file(s) matching the filters.
# adata object of the first slide
adatas[0]
AnnData object with n_obs × n_vars = 53913 × 346
    obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region', 'slide_id', 'technology'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'log1p', 'neighbors', 'novae_tissue', 'spatial_neighbors', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    layers: 'counts'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
Multi-slide neighbors¶
With multiple slides, computing the cell neighbors with novae.utils.spatial_neighbors is slightly different. The options are listed below and explained in more detail in this tutorial.
For this specific example, we use option 1, i.e., a list of AnnData objects, each corresponding to one slide.
# Option 1: Multiple AnnData objects (one per slide)
novae.utils.spatial_neighbors(adatas, radius=80)
# Option 2: One AnnData object with multiple slides
# In that case, you need to specify the name of the column in adata.obs
# that contains the slide identifiers using `slide_key`.
novae.utils.spatial_neighbors(adata, radius=80, slide_key="my-slide-column")
# Option 3: Multiple AnnData objects, each containing multiple slides
# Similarly, you can pass a list of AnnData objects, and specify the slide_key
novae.utils.spatial_neighbors(adatas, radius=80, slide_key="my-slide-column")
[INFO] (novae.utils._build) Computing graph on 53,913 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,682 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 62,268 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,685 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,231 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 59,935 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
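A slide identifier matters because different slides often use overlapping coordinate systems, so cells from two slides could otherwise be connected by mistake. The plain-Python sketch below illustrates the idea of building one graph per slide. It uses a brute-force radius search instead of a Delaunay triangulation to stay short, and the function name is hypothetical; it is not Novae's implementation.

```python
import math
from collections import defaultdict
from itertools import combinations


def radius_edges_per_slide(coords, slide_ids, radius):
    """Connect cells closer than `radius`, but only within the same slide."""
    groups = defaultdict(list)
    for index, slide in enumerate(slide_ids):
        groups[slide].append(index)

    edges = []
    for members in groups.values():
        # Brute-force search over pairs of cells of the same slide (toy data only).
        for i, j in combinations(members, 2):
            if math.dist(coords[i], coords[j]) <= radius:
                edges.append((i, j))
    return edges


# Two toy slides whose coordinates overlap: cells 0 and 2 are only 10 microns apart,
# but they belong to different slides and must not be connected.
coords = [(0.0, 0.0), (50.0, 0.0), (10.0, 0.0), (60.0, 0.0)]
slide_ids = ["slide_1", "slide_1", "slide_2", "slide_2"]

edges = radius_edges_per_slide(coords, slide_ids, radius=80)
print(edges)
```

Only the two within-slide edges are kept, even though every pair of cells is closer than the radius.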
Again, we can show the cell connectivities. See the above explanations to understand how to perform quality controls based on this plot.
novae.plot.connectivities(adatas)
Multi-slide inference¶
Since we are now working on mouse brain data, we'll load the brain model, as shown below.
Reminder: for human tissues, you can use the "MICS-Lab/novae-human-0" model. Again, other model names can be found here.
model = novae.Novae.from_pretrained("MICS-Lab/novae-brain-0")
model
Novae model
   ├── Known genes: 60697
   ├── Parameters: 32.0M
   └── Model name: MICS-Lab/novae-brain-0
Then, the usage of Novae is similar to above, and you can again use zero-shot or fine-tuning (see the first section of this tutorial for more details).
# Option 1: zero-shot
model.compute_representations(adatas, zero_shot=True)
# Option 2: fine-tuning
model.fine_tune(adatas)
model.compute_representations(adatas)
Computing representations: 0%| | 0/106 [00:00<?, ?it/s]
Computing representations: 0%| | 0/115 [00:00<?, ?it/s]
Computing representations: 0%| | 0/122 [00:00<?, ?it/s]
Computing representations: 0%| | 0/115 [00:00<?, ?it/s]
Computing representations: 0%| | 0/114 [00:00<?, ?it/s]
Computing representations: 0%| | 0/117 [00:00<?, ?it/s]
Again, the call to assign domains is the same. Here, we assign 15 domains.
model.assign_domains(adatas, level=15)
'novae_domains_15'
And, again, we can show the domains:
Here, we added the slide_name_key argument, which is optional and used to display the name of each slide (slide_name_key should be a column name of adata.obs).
novae.plot.domains(adatas, slide_name_key="slide_id", cell_size=20)
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
Batch-effect correction of the spatial representations¶
The spatial representations of each cell are stored in adata.obsm["novae_latent"], and they are not batch-effect corrected by default. However, the (categorical) spatial domains are corrected. Therefore, we can use the categorical spatial domains to correct the representations, using the batch_effect_correction method, as below:
model.batch_effect_correction(adatas)
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
This saves the corrected representations inside adata.obsm["novae_latent_corrected"]. You can use these corrected representations for further analysis.
Note that this representation is a spatial domain representation of each cell, not a cell expression representation. Indeed, it contains information on the local neighborhoods of the cells.
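To give an intuition of how categorical domains can guide such a correction, here is a plain-Python sketch of a simple centroid alignment: within each domain, the cells of every slide are shifted so that their mean matches the mean of the domain over all slides. This is an assumed, simplified scheme with hypothetical function names; the batch_effect_correction method may work differently.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def correct_by_domain(latent, slides, domains):
    """Shift each (slide, domain) group so that its centroid matches the domain centroid."""
    by_domain = defaultdict(list)
    by_group = defaultdict(list)
    for vector, slide, domain in zip(latent, slides, domains):
        by_domain[domain].append(vector)
        by_group[(slide, domain)].append(vector)

    domain_center = {domain: mean_vector(vectors) for domain, vectors in by_domain.items()}
    group_center = {group: mean_vector(vectors) for group, vectors in by_group.items()}

    corrected = []
    for vector, slide, domain in zip(latent, slides, domains):
        source = group_center[(slide, domain)]
        target = domain_center[domain]
        corrected.append([x - s + t for x, s, t in zip(vector, source, target)])
    return corrected


# Toy latent space: the same domain is offset by 10 units on the second slide.
latent = [[1.0, 0.0], [3.0, 0.0], [11.0, 0.0], [13.0, 0.0]]
slides = ["s1", "s1", "s2", "s2"]
domains = ["D1", "D1", "D1", "D1"]

corrected = correct_by_domain(latent, slides, domains)
print(corrected)
```

After the shift, both slides share the same centroid for domain D1, while the relative positions of the cells inside each slide are preserved.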
Downstream analysis¶
Novae can also perform downstream analysis. We illustrate it below.
Domains proportion per slide¶
A first simple thing to do is to look at the proportion of each domain for each slide. For instance, you may be interested in finding domains that are more or less present under certain conditions or diseases.
Again, the slide_name_key is optional and is only used to show the slide names.
Here, on our mouse brain slides, we see homogeneous domain proportions (which is expected in this very specific case).
novae.plot.domains_proportions(adatas, obs_key="novae_domains_15")
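If you want the underlying numbers rather than the plot, the proportions are simply the normalized counts of each domain within each slide. The sketch below computes them in plain Python from two parallel lists, standing in for a slide column and a domain column of adata.obs; the function name and the toy values are hypothetical.

```python
from collections import Counter, defaultdict


def domain_proportions(slide_ids, domains):
    """Return, for each slide, the fraction of cells assigned to each domain."""
    counts = defaultdict(Counter)
    for slide, domain in zip(slide_ids, domains):
        counts[slide][domain] += 1
    return {
        slide: {domain: n / sum(counter.values()) for domain, n in counter.items()}
        for slide, counter in counts.items()
    }


# Toy example: one entry per cell.
slide_ids = ["A", "A", "A", "A", "B", "B"]
domains = ["D1", "D1", "D2", "D3", "D1", "D2"]

proportions = domain_proportions(slide_ids, domains)
print(proportions)
```

Comparing these per-slide dictionaries across conditions is one way to spot domains that are over- or under-represented.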
Slide architecture¶
We run trajectory inference (PAGA) on the spatial domains to extract a graph representing the "architecture" of a slide, or the "spatial domains organization".
Currently, this function only supports one slide per call.
novae.plot.paga(adatas[0])
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
Spatially Variable Genes (SVG)¶
To extract SVGs, we run a differential gene expression analysis on the categorical spatial domains. The function below shows the 3 most variable genes.
novae.plot.spatially_variable_genes(adatas[0], top_k=3, vmax="p95", cell_size=20)
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
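The idea of ranking genes by how specific they are to a domain can be sketched with a one-vs-rest comparison. The plain-Python example below ranks genes by the difference between their mean expression inside a domain and outside of it. This is a deliberately simplified stand-in for a proper statistical test, with a hypothetical function name and toy values; it is not what novae.plot.spatially_variable_genes computes.

```python
def top_genes_for_domain(expression, gene_names, domains, target, top_k=3):
    """Rank genes by mean expression in `target` minus mean expression elsewhere."""
    inside = [row for row, domain in zip(expression, domains) if domain == target]
    outside = [row for row, domain in zip(expression, domains) if domain != target]
    if not inside or not outside:
        raise ValueError("the target domain must contain some, but not all, cells")

    scores = []
    for column, name in enumerate(gene_names):
        mean_in = sum(row[column] for row in inside) / len(inside)
        mean_out = sum(row[column] for row in outside) / len(outside)
        scores.append((mean_in - mean_out, name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_k]]


# Toy log-expression matrix: 4 cells (rows) x 4 genes (columns).
gene_names = ["Gfap", "Snap25", "Mbp", "Actb"]
expression = [
    [3.0, 0.0, 0.5, 2.0],
    [2.0, 0.5, 0.0, 2.0],
    [0.0, 3.0, 0.0, 2.0],
    [0.5, 2.5, 0.5, 2.0],
]
domains = ["D1", "D1", "D2", "D2"]

print(top_genes_for_domain(expression, gene_names, domains, target="D1", top_k=1))
print(top_genes_for_domain(expression, gene_names, domains, target="D2", top_k=1))
```

A gene such as Actb, which is expressed equally everywhere, gets a score of zero for every domain and is therefore never ranked first.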
Spatial pathway analysis¶
Scores per domain¶
We can score pathways for each domain using scanpy.tl.score_genes.
The pathways input should be one of the following:
- a JSON file downloaded from the GSEA website.
- a dict whose keys are pathway names, and values are lists of genes (case insensitive).
Currently, this function only supports one slide per call.
novae.plot.pathway_scores(adatas[0], pathways="mouse_hallmarks.json", figsize=(10, 7))
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
[INFO] (novae.plot._heatmap) Loaded 50 pathway(s)
[INFO] (novae.plot._heatmap) Plot mode: 24 pathways scores per domain
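The dict form of the pathways input, and the case-insensitive gene matching, can be illustrated with a small plain-Python sketch. The score used here is a simplification: the mean expression of the pathway genes minus the mean over all genes, averaged per domain. As far as we know, scanpy.tl.score_genes instead subtracts the mean of a reference gene set sampled by expression level, so the numbers would differ. The pathway name, the gene lists, and the function names are hypothetical.

```python
def match_genes(pathway_genes, var_names):
    """Map pathway genes to the panel's gene names, ignoring case."""
    lookup = {name.lower(): name for name in var_names}
    return [lookup[gene.lower()] for gene in pathway_genes if gene.lower() in lookup]


def pathway_score_per_domain(expression, var_names, domains, pathway_genes):
    """Average, per domain, a simplified pathway score computed for each cell."""
    matched = match_genes(pathway_genes, var_names)
    if not matched:
        raise ValueError("no pathway gene was found in the gene panel")
    columns = [var_names.index(gene) for gene in matched]

    totals, counts = {}, {}
    for row, domain in zip(expression, domains):
        # Simplified score: pathway-gene mean minus the mean over all genes.
        score = sum(row[c] for c in columns) / len(columns) - sum(row) / len(row)
        totals[domain] = totals.get(domain, 0.0) + score
        counts[domain] = counts.get(domain, 0) + 1
    return {domain: totals[domain] / counts[domain] for domain in totals}


# Pathways as a dict: keys are pathway names, values are lists of genes.
pathways = {"ASTROCYTE_TOY": ["GFAP", "AQP4", "Missing1"]}

var_names = ["Gfap", "Aqp4", "Snap25", "Mbp"]
expression = [
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 2.0],
]
domains = ["D1", "D2"]

scores = pathway_score_per_domain(expression, var_names, domains, pathways["ASTROCYTE_TOY"])
print(match_genes(pathways["ASTROCYTE_TOY"], var_names), scores)
```

Genes that are absent from the panel (here, "Missing1") are simply ignored, which is why a pathway can still be scored on a targeted gene panel.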
Scores per slide per domain¶
We can also show the score of one pathway, for each slide and each domain, as in the article.
Here, we need to concatenate our AnnData objects into a single one. If you already have one AnnData object with multiple slides, you can skip this step (just ensure you provided slide_key in novae.utils.spatial_neighbors).
import anndata
adata_conc = anndata.concat(adatas)
Since the JSON file below contains only one pathway, the function will detect that it must plot this pathway score per domain and slide. You can also force this behavior by providing the pathway_name argument.
novae.plot.pathway_scores(
    adata_conc, pathways="LEE_AGING_CEREBELLUM_UP.json", figsize=(4, 8), slide_name_key="slide_id"
)
[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
[INFO] (novae.plot._heatmap) Loaded 1 pathway(s)
[INFO] (novae.plot._heatmap) Plot mode: LEE_AGING_CEREBELLUM_UP score per domain per slide