Main usage of Novae¶

This tutorial shows how to load a Novae model, apply it to your spatial data, and plot the results. We show Novae's usage for spatial data with single-cell resolution (e.g., Xenium, MERSCOPE); a tutorial for spot-resolution data (e.g., Visium) will follow soon.

Make sure you have installed Novae, e.g. using pip install novae, as detailed in the installation guide.

In [65]:

import novae

Create and process your AnnData object(s)¶

Novae's input is one or multiple AnnData object(s). Having multiple AnnData objects can be useful when you have multiple gene panels, or when you don't want to concatenate your slides. See here for more details on all four possible input modes for Novae.

For this tutorial, we first show an example using one AnnData object representing a colon slide. We load it with the load_dataset function.

In [2]:

adata = novae.utils.load_dataset(tissue="colon", species="human", pattern=".*P2.*")[0]
adata

[INFO] (novae.utils._data) Found 1 h5ad file(s) matching the filters.

Out[2]:

AnnData object with n_obs × n_vars = 340837 × 422
    obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region', 'slide_id', 'technology'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'log1p', 'neighbors', 'spatial_neighbors', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    layers: 'counts'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'

Then, preprocessing is not mandatory, as Novae can handle it automatically. Here is some extra information:

You can have either "log1p data" or "raw counts" in adata.X. In the latter case, Novae will preprocess it and save the counts in adata.layers.
Novae automatically selects the genes it should use: the highly variable genes (HVGs), complemented by other genes if there are not enough HVGs.

NB: You can disable the automatic preprocessing with novae.settings.auto_preprocessing = False
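To make the "raw counts" versus "log1p data" distinction concrete, here is a minimal pure-Python sketch of a common recipe: library-size normalization followed by log(1 + x). Whether Novae applies exactly these steps is an assumption, and the helper names `looks_like_raw_counts` and `normalize_and_log1p` are hypothetical (they are not part of the Novae API).

```python
import math
import statistics


def looks_like_raw_counts(matrix):
    """Heuristic: raw counts are non-negative integers."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def normalize_and_log1p(counts, target_sum=None):
    """Scale each cell (row) to the same total, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        # Assumption: use the median total of the non-empty cells as target
        target_sum = statistics.median(t for t in totals if t > 0)
    normalized = []
    for row, total in zip(counts, totals):
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized


# Toy matrix: 3 cells x 3 genes (made-up values)
raw = [[2, 0, 6], [1, 1, 2], [0, 0, 0]]
log_data = normalize_and_log1p(raw)

print(looks_like_raw_counts(raw))       # True
print(looks_like_raw_counts(log_data))  # False
```

A check of this kind is only a heuristic; if you disable the automatic preprocessing, you are responsible for providing data in the expected form.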

One-slide inference¶

Compute the cells neighbors¶

Novae runs on graphs of cells connected by physical proximity. To do so, we use novae.spatial_neighbors, which will connect the cells based on their locations (in adata.obsm["spatial"]).

Here, we use radius=80 to drop edges longer than 80 microns (optional).

In [5]:

# if you have multiple samples in the same adata object, specify `slide_key`
novae.spatial_neighbors(adata, radius=80)

[INFO] (novae.utils._build) Computing graph on 340,837 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
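As an illustration of what the radius does, the sketch below drops every candidate edge longer than a given distance. It is a simplified stand-in, not Novae's implementation: the coordinates and candidate edges are made up, and `drop_long_edges` is a hypothetical helper.

```python
import math


def drop_long_edges(coords, edges, radius):
    """Keep only the edges whose Euclidean length is at most `radius`."""
    return [(i, j) for i, j in edges if math.dist(coords[i], coords[j]) <= radius]


# Three toy cells (coordinates in microns) and three candidate edges
coords = [(0.0, 0.0), (30.0, 40.0), (200.0, 0.0)]
candidate_edges = [(0, 1), (1, 2), (0, 2)]

kept = drop_long_edges(coords, candidate_edges, radius=80)
print(kept)  # [(0, 1)]
```

Only the 50-micron edge between cells 0 and 1 survives; the two edges reaching the distant cell 2 are removed, which leaves that cell isolated.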

We can also show the graph of connectivities, which is a good quality control. Nodes in red are cells that are connected to very few other cells (ngh_threshold=2, by default). In particular:

You should have a relatively low number of "red" cells. If there are too many, increase the radius parameter.
Regions that are far apart should not be connected. If they are, decrease the radius parameter.

If the graph is not looking right, check the docs of novae.spatial_neighbors to adapt it.
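The first quality check can also be expressed numerically. The sketch below counts the neighbors of each cell in an edge list and flags the cells with few neighbors. The helper `poorly_connected` is hypothetical, and treating a cell as "red" when its degree is at most `ngh_threshold` (rather than strictly below it) is an assumption.

```python
from collections import Counter


def poorly_connected(n_cells, edges, ngh_threshold=2):
    """Return the indices of cells with at most `ngh_threshold` neighbors."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return [cell for cell in range(n_cells) if degree[cell] <= ngh_threshold]


# Toy graph: cells 0-3 are densely connected, cell 4 hangs on a single edge
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
flagged = poorly_connected(5, edges)
fraction = len(flagged) / 5

print(flagged)   # [4]
print(fraction)  # 0.2
```

A large flagged fraction suggests that the radius is too small for the cell density of the slide.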

In [6]:

novae.plot.connectivities(adata)


Use a pretrained model¶

Now, we need to load a pretrained Novae model.

Since we have a human colon slide, we load the "MICS-Lab/novae-human-0" model, which can be used on any human tissue for spatial data with single-cell resolution. Other model names can be found here (spot-resolution models, e.g., for Visium data, will come soon).

In [7]:

model = novae.Novae.from_pretrained("MICS-Lab/novae-human-0")
model

Out[7]:

Novae model
   ├── Known genes: 60697
   ├── Parameters: 32.0M
   └── Model name: MICS-Lab/novae-human-0

Now, we can compute the spatial representations for each cell. In the first option below, we pass the argument zero_shot=True to run only inference (i.e., the model is not re-trained).

Instead of zero-shot, if you want to fine-tune the model, you can use the fine_tune method, and then call compute_representations without the zero_shot argument (see "option 2").

In [ ]:

# Option 1: zero-shot
model.compute_representations(adata, zero_shot=True)

# Option 2: fine-tuning
model.fine_tune(adata)
model.compute_representations(adata)

To assign domains, you can use the assign_domains method, as below. By default, it creates 7 domains, but you can choose the number of domains you want with the level argument.

The function saves the domains in adata.obs and returns the name of the column in which they were saved (in this case, adata.obs["novae_domains_7"]).

In [9]:

model.assign_domains(adata)

Out[9]:

'novae_domains_7'

Then, to show the results, you can use novae.plot.domains.

You can also use scanpy.pl.spatial. In fact, novae.plot.domains uses Scanpy internally but adds extra functionality related to Novae.

If you run model.assign_domains multiple times, you can also decide the resolution you want to show, by passing the obs_key argument to novae.plot.domains.

In [10]:

novae.plot.domains(adata)

[INFO] (novae.utils._validate) Using obs_key='novae_domains_7' as default.

Novae has a hierarchical organization of the spatial domains. That is, if you run assign_domains multiple times with different level parameters, the domains at different resolutions will be nested inside each other.
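To make "nested" precise: every fine-resolution domain must fall entirely inside a single coarse-resolution domain. The sketch below checks this property on two label lists. The labels are made up for illustration (they are not real Novae domain names), and `is_nested` is a hypothetical helper.

```python
def is_nested(fine_labels, coarse_labels):
    """Return True if each fine label maps to exactly one coarse label."""
    parent = {}
    for fine, coarse in zip(fine_labels, coarse_labels):
        # Remember the first coarse label seen for this fine label
        if parent.setdefault(fine, coarse) != coarse:
            return False
    return True


# One label per cell, at two resolutions (toy values)
coarse = ["D1", "D1", "D2", "D2", "D2"]
fine = ["D1a", "D1b", "D2a", "D2a", "D2b"]
crossing = ["X", "Y", "Y", "Z", "Z"]  # "Y" spans both D1 and D2

print(is_nested(fine, coarse))      # True
print(is_nested(crossing, coarse))  # False
```

With real data, the two lists would be the adata.obs columns returned by two assign_domains calls with different level values.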

To plot the hierarchy of the domains, you can use the plot_domains_hierarchy method of Novae, as below:

In [11]:

model.plot_domains_hierarchy()

Multi-slide or multi-panel¶

You can use Novae on multiple slides or multiple panels. In that case, the input slightly changes: you can, for instance, have one AnnData object with a column that indicates the slide ID (if they share the same gene panel) or have a list of AnnData objects (one for each slide, or for each panel). For more details, refer to this tutorial.

Here, we load 6 mouse brain slides. adatas is therefore a list of 6 AnnData objects.

In [66]:

adatas = novae.utils.load_dataset(tissue="brain", species="mouse")

[INFO] (novae.utils._data) Found 6 h5ad file(s) matching the filters.

In [67]:

# adata object of the first slide
adatas[0]

Out[67]:

AnnData object with n_obs × n_vars = 53913 × 346
    obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region', 'slide_id', 'technology'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'log1p', 'neighbors', 'novae_tissue', 'spatial_neighbors', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    layers: 'counts'
    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'

Multi-slide neighbors¶

With multiple slides, computing the cell neighbors with novae.spatial_neighbors works slightly differently. The available options are listed below and explained in more detail in this tutorial.

For this specific example, we use option 1, i.e., a list of AnnData objects, each corresponding to one slide.

In [69]:

# Option 1: Multiple AnnData objects (one per slide)
novae.spatial_neighbors(adatas, radius=80)

# Option 2: One AnnData object with multiple slides
# In that case, you need to specify the name of the column in adata.obs
# that contains the slide identifiers using `slide_key`.
novae.spatial_neighbors(adata, radius=80, slide_key="my-slide-column")

# Option 3: Multiple AnnData objects, each containing multiple slides
# Similarly, you can pass a list of AnnData objects, and specify the slide_key
novae.spatial_neighbors(adatas, radius=80, slide_key="my-slide-column")

[INFO] (novae.utils._build) Computing graph on 53,913 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,682 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 62,268 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,685 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 58,231 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)
[INFO] (novae.utils._build) Computing graph on 59,935 cells (coord_type=generic, delaunay=True, radius=[0.0, 80.0], n_neighs=None)

Again, we can show the cell connectivities. See the explanations above to understand how to perform quality control based on this plot.

In [70]:

novae.plot.connectivities(adatas)

Multi-slide inference¶

Since we are now working on mouse brain data, we'll load the brain model, as shown below.

Reminder: for human tissues, you can use the "MICS-Lab/novae-human-0" model. Again, other model names can be found here.

In [71]:

model = novae.Novae.from_pretrained("MICS-Lab/novae-brain-0")
model

Out[71]:

Novae model
   ├── Known genes: 60697
   ├── Parameters: 32.0M
   └── Model name: MICS-Lab/novae-brain-0

Then, the usage of Novae is similar to above, and you can again use zero-shot or fine-tuning (see the first section of this tutorial for more details).

In [72]:

# Option 1: zero-shot
model.compute_representations(adatas, zero_shot=True)

# Option 2: fine-tuning
model.fine_tune(adatas)
model.compute_representations(adatas)

Computing representations:   0%|          | 0/106 [00:00<?, ?it/s]

Computing representations:   0%|          | 0/115 [00:00<?, ?it/s]

Computing representations:   0%|          | 0/122 [00:00<?, ?it/s]

Computing representations:   0%|          | 0/115 [00:00<?, ?it/s]

Computing representations:   0%|          | 0/114 [00:00<?, ?it/s]

Computing representations:   0%|          | 0/117 [00:00<?, ?it/s]

Again, the command to assign domains is the same. Here, we assign 15 domains.

In [73]:

model.assign_domains(adatas, level=15)

Out[73]:

'novae_domains_15'

And, again, we can show the domains:

Here, we added the slide_name_key argument, which is optional and used to display the name of each slide (slide_name_key should be a column name of adata.obs).

In [75]:

novae.plot.domains(adatas, slide_name_key="slide_id", cell_size=20)

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.

Batch-effect correction of the spatial representations¶

The spatial representations of each cell are stored in adata.obsm["novae_latent"], and they are not batch-effect corrected by default. Yet, the (categorical) spatial domains are corrected. Therefore, we can use the categorical spatial domains to correct the representations, using the batch_effect_correction method, as below:

In [76]:

model.batch_effect_correction(adatas)

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.

This saved the corrected representations inside adata.obsm["novae_latent_corrected"]. You can use these corrected representations for further analysis.

Note that this representation is a spatial domain representation of each cell, not a cell expression representation. Indeed, it contains information on the local neighborhoods of the cells.

Downstream analysis¶

Novae can also perform downstream analysis. We illustrate it below.

Domains proportion per slide¶

A first simple thing to do is to look at the proportion of each domain for each slide. For instance, you may be interested in finding domains that are more or less present under certain conditions or diseases.

Again, the slide_name_key argument is optional and is only used to show the slide names.

Here, on our mouse brain slides, we see homogeneous domain proportions (which is expected in this very specific case).

In [77]:

novae.plot.domains_proportions(adatas, obs_key="novae_domains_15")
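If you need the numbers behind such a plot, the proportions are simple to compute by hand. The following is a minimal sketch on made-up slide and domain labels; `domain_proportions` is a hypothetical helper and not the function that Novae uses internally.

```python
from collections import Counter


def domain_proportions(slide_ids, domains):
    """Return, for each slide, the fraction of its cells in each domain."""
    counts = {}
    for slide, domain in zip(slide_ids, domains):
        counts.setdefault(slide, Counter())[domain] += 1
    return {
        slide: {domain: n / sum(counter.values()) for domain, n in counter.items()}
        for slide, counter in counts.items()
    }


# One entry per cell (toy values)
slide_ids = ["s1"] * 4 + ["s2"] * 4
domains = ["D1", "D1", "D2", "D3", "D1", "D2", "D2", "D3"]

props = domain_proportions(slide_ids, domains)
print(props["s1"])  # {'D1': 0.5, 'D2': 0.25, 'D3': 0.25}
print(props["s2"])  # {'D1': 0.25, 'D2': 0.5, 'D3': 0.25}
```

With real data, the two inputs would be the slide ID column and the domain column of adata.obs.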

Slide architecture¶

We run trajectory inference (PAGA) on the spatial domains to extract a graph representing the "architecture" of a slide, or the "spatial domains organization".

Currently, this function only supports one slide per call.

In [78]:

novae.plot.paga(adatas[0])

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.

Spatially Variable Genes (SVG)¶

To extract SVGs, we run a differential gene expression analysis on the categorical spatial domains. The function below shows the 3 most variable genes.

In [79]:

novae.plot.spatially_variable_genes(adatas[0], top_k=3, vmax="p95", cell_size=20)

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.

Spatial pathway analysis¶

Scores per domain¶

We can score pathways for each domain using scanpy.tl.score_genes.

The pathways input should be one of the following:

a JSON file downloaded from the GSEA website.
a dict whose keys are pathway names, and values are lists of genes (case insensitive).
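As a sketch of the dict form, the example below builds a small pathways dictionary and matches its genes to a gene panel while ignoring case. The pathway names and the panel are made up, `match_genes` is a hypothetical helper, and how Novae performs the case-insensitive matching internally is an assumption.

```python
# Keys are pathway names, values are lists of gene names (toy pathways)
pathways = {
    "toy_pathway_a": ["Gfap", "Aqp4", "Slc1a3"],
    "toy_pathway_b": ["Snap25", "Syt1"],
}


def match_genes(pathway_genes, panel_genes):
    """Return the panel's spelling of each pathway gene found in the panel."""
    lookup = {gene.lower(): gene for gene in panel_genes}
    return [lookup[g.lower()] for g in pathway_genes if g.lower() in lookup]


# Gene names as they might appear in adata.var_names (toy panel)
panel = ["GFAP", "SNAP25", "Olig2", "aqp4"]
matched = {name: match_genes(genes, panel) for name, genes in pathways.items()}

print(matched)
# {'toy_pathway_a': ['GFAP', 'aqp4'], 'toy_pathway_b': ['SNAP25']}
```

Genes that are absent from the panel (here Slc1a3 and Syt1) are simply skipped, so a pathway with little overlap with your panel will be scored on very few genes.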

Currently, this function only supports one slide per call.

In [80]:

novae.plot.pathway_scores(adatas[0], pathways="mouse_hallmarks.json", figsize=(10, 7))

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
[INFO] (novae.plot._heatmap) Loaded 50 pathway(s)
[INFO] (novae.plot._heatmap) Plot mode: 24 pathways scores per domain

Scores per slide per domain¶

We can also show the score of one pathway, for each slide and each domain, as in the article.

Here, we need to concatenate our AnnData objects into a single one. If you already have one AnnData object with multiple slides, you can skip this step (just ensure you provided slide_key in novae.spatial_neighbors).

In [81]:

import anndata

adata_conc = anndata.concat(adatas)

Since the JSON file below contains only one pathway, it will detect that it must plot this pathway score per domain and slide. You can also force this behavior by providing the pathway_name argument.

In [87]:

novae.plot.pathway_scores(
    adata_conc, pathways="LEE_AGING_CEREBELLUM_UP.json", figsize=(4, 8), slide_name_key="slide_id"
)

[INFO] (novae.utils._validate) Using obs_key='novae_domains_15' as default.
[INFO] (novae.plot._heatmap) Loaded 1 pathway(s)
[INFO] (novae.plot._heatmap) Plot mode: LEE_AGING_CEREBELLUM_UP score per domain per slide