Benchmarking on a Custom Dataset

You can run thunder tasks on a custom dataset. All you need is a .yaml file specifying the dataset config and a .json file defining the data splits.

Note

A few examples of .yaml files describing custom datasets can be found in the examples folder of the repository. For examples of .json data split files, check the ones that thunder generates automatically for supported datasets.

Note

Supported formats of custom datasets:

* Format 1: One file per patch (e.g. .png files). Your data split .json file will then contain the paths to these files along with their associated labels. This is the most common format among the already supported datasets.
* Format 2: Following the format of patch_camelyon, for each split (train, val, test), one .h5 file with an x key mapping to an array of input images and another .h5 file with a y key mapping to an array of associated labels. Your data split .json file will then contain the paths to these .h5 files (images and labels).

For example, we can create a custom version of the bracs dataset with 3 classes instead of 7. The .yaml file would be as follows:

dataset_name: bracs_3classes
nb_classes: 3
compatible_tasks: ["adversarial_attack", "alignment_scoring", "image_retrieval", "knn", "linear_probing", "pre_computing_embeddings", "simple_shot", "zero_shot_vlm"]
nb_train_samples: 3657
nb_val_samples: 312
nb_test_samples: 570
mpp: 0.25
cancer_type: breast
div_patches: False
data_splits: /path/to/data/splits/file.json
classes: ["benign", "atypical", "malignant"]
class_to_id:
  benign: 0
  atypical: 1
  malignant: 2
id_to_class:
  0: benign
  1: atypical
  2: malignant
id_to_classname:
  0: normal breast tissue
  1: breast atypical lesions
  2: breast malignant tumor

This was an example of a custom dataset where patches are saved as images (Format 1). If you instead want to use a custom dataset with the same format as patch_camelyon (Format 2), add the entry h5_format: True to your .yaml file.
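
Since nb_classes, classes, class_to_id, id_to_class and id_to_classname all describe the same label set, it can help to check that they agree before launching a benchmark. The sketch below is not part of thunder: check_dataset_config is a hypothetical helper, and the checks it runs are our own assumptions about what a consistent config looks like. It works on a plain dictionary, such as the one you would get by loading the .yaml file with a YAML parser.

```python
def check_dataset_config(cfg: dict) -> list[str]:
    """Return a list of human-readable inconsistencies found in cfg.

    Hypothetical helper for illustration; thunder may validate differently.
    """
    problems = []
    classes = cfg["classes"]

    # nb_classes should match the number of class names.
    if cfg["nb_classes"] != len(classes):
        problems.append(
            f"nb_classes is {cfg['nb_classes']} but {len(classes)} classes are listed"
        )

    # class_to_id should have exactly one entry per class name.
    if set(cfg["class_to_id"]) != set(classes):
        problems.append("class_to_id keys differ from classes")

    # id_to_class should be the inverse mapping of class_to_id.
    for name, idx in cfg["class_to_id"].items():
        if cfg["id_to_class"].get(idx) != name:
            problems.append(f"id_to_class[{idx}] does not map back to {name!r}")

    # Every class id should also have a descriptive name.
    if set(cfg["id_to_classname"]) != set(cfg["id_to_class"]):
        problems.append("id_to_classname and id_to_class have different ids")

    return problems


# The label-related fields of the bracs_3classes example above.
config = {
    "dataset_name": "bracs_3classes",
    "nb_classes": 3,
    "classes": ["benign", "atypical", "malignant"],
    "class_to_id": {"benign": 0, "atypical": 1, "malignant": 2},
    "id_to_class": {0: "benign", 1: "atypical", 2: "malignant"},
    "id_to_classname": {
        0: "normal breast tissue",
        1: "breast atypical lesions",
        2: "breast malignant tumor",
    },
}

problems = check_dataset_config(config)
print(problems or "config looks consistent")
```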

The .json file will have the following structure if data follows Format 1:

{
    "train": {
        "images": ["/path/to/im0", ..., "/path/to/imN"],
        "labels": [label_im0, ..., label_imN]
    },
    "val": {
        "images": [...],
        "labels": [...]
    },
    "test": {
        "images": [...],
        "labels": [...]
    },
    "train_few_shot": {
        ...
    }
}

If data follows Format 2, it will be:

{
    "train": {
        "images": "/path/to/train/images/file.h5",
        "labels": "/path/to/train/labels/file.h5"
    },
    "val": {
        "images": "/path/to/val/images/file.h5",
        "labels": "/path/to/val/labels/file.h5"
    },
    "test": {
        "images": "/path/to/test/images/file.h5",
        "labels": "/path/to/test/labels/file.h5"
    },
    "train_few_shot": {
        ...
    }
}

In both cases, labels will be a list of integers (class ids) for a classification dataset, and a list of tuples (/path/to/the/gt/semantic/mask/, patch_i_min_coord, patch_i_max_coord, patch_j_min_coord, patch_j_max_coord) for a segmentation dataset.
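
For a classification dataset stored in Format 1, the split file can be generated with a short script. The sketch below assumes a hypothetical folder layout root/<split>/<class_name>/*.png, which thunder does not require; build_splits and save_splits are illustrative names, not thunder functions. It only fills the train, val and test entries and leaves out train_few_shot.

```python
import json
from pathlib import Path


def build_splits(root, class_to_id, splits=("train", "val", "test")):
    """Collect image paths and integer labels for each split.

    Assumes (hypothetically) that images live in root/<split>/<class_name>/*.png.
    """
    root = Path(root)
    data = {}
    for split in splits:
        images, labels = [], []
        # Visit classes in increasing id order so the output is deterministic.
        for class_name, class_id in sorted(class_to_id.items(), key=lambda kv: kv[1]):
            for path in sorted((root / split / class_name).glob("*.png")):
                images.append(str(path))
                labels.append(class_id)
        data[split] = {"images": images, "labels": labels}
    return data


def save_splits(data, out_file):
    """Write the split dictionary as a .json file."""
    with open(out_file, "w") as f:
        json.dump(data, f, indent=4)
```

A call such as save_splits(build_splits("/path/to/root", {"benign": 0, "atypical": 1, "malignant": 2}), "/path/to/data/splits/file.json") would then produce a file with the Format 1 structure shown above. The stored paths are absolute only if root is given as an absolute path.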

The train_few_shot entry maps to a dictionary whose keys are the sizes of the support sets ("1", "2", "4", "8", "16") and whose values are dictionaries with the entries images and labels, mapping respectively to the indices of the images in the support set and to their labels. For more details, please check the function where we automatically create such support sets for supported datasets.
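
As an illustration of this structure, the sketch below draws one support set per size from the training labels. It rests on assumptions that the text above does not confirm: that a size k means k samples per class, and that the indices refer to positions in the train images list. build_few_shot is a hypothetical name; the function in thunder that creates support sets for supported datasets remains the reference.

```python
import random
from collections import defaultdict


def build_few_shot(train_labels, shots=(1, 2, 4, 8, 16), seed=0):
    """Sample support sets of k images per class (assumed meaning of the size)."""
    rng = random.Random(seed)

    # Group the indices of the training samples by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(train_labels):
        by_class[label].append(index)

    few_shot = {}
    for k in shots:
        indices = []
        for label in sorted(by_class):
            candidates = by_class[label]
            if len(candidates) < k:
                raise ValueError(f"class {label} has fewer than {k} training samples")
            # Sample k distinct indices for this class.
            indices.extend(rng.sample(candidates, k))
        few_shot[str(k)] = {
            "images": indices,
            "labels": [train_labels[i] for i in indices],
        }
    return few_shot
```

The returned dictionary can be stored under the train_few_shot key of the split file. Fixing the seed keeps the support sets identical across runs.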

Note

Importantly, you have two ways to specify paths to images (and labels in case of h5 format):

* You do not provide a base_data_folder entry in your .yaml file, and the paths specified in the .json file are absolute paths.
* You provide a base_data_folder entry in your .yaml file. This should be the path to a folder containing a sub-folder named dataset_name (e.g. here named bracs_3classes). The paths in your .json file are then specified relative to this base_data_folder/dataset_name folder (as done for supported datasets).
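
The two options can be summarized by the following sketch, which only restates the rule above and is not thunder's own code. resolve_path is a hypothetical helper, and it uses POSIX-style paths as in the examples on this page.

```python
from pathlib import PurePosixPath


def resolve_path(entry, dataset_name, base_data_folder=None):
    """Return the full path for one path entry of the .json split file."""
    if base_data_folder is None:
        # Option 1: no base_data_folder, so the entry must already be absolute.
        path = PurePosixPath(entry)
        if not path.is_absolute():
            raise ValueError(f"{entry!r} must be absolute without base_data_folder")
        return path
    # Option 2: the entry is relative to base_data_folder/dataset_name.
    return PurePosixPath(base_data_folder) / dataset_name / entry


# With a base folder, entries are relative to /data/bracs_3classes.
print(resolve_path("train/im0.png", "bracs_3classes", base_data_folder="/data"))
# Without one, entries are used as given.
print(resolve_path("/abs/path/im0.png", "bracs_3classes"))
```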

With your .yaml file ready (e.g. bracs_3classes.yaml) along with the corresponding .json file, you can run any benchmark task on any model using the following command:

thunder benchmark model_name custom:/path/to/bracs_3classes.yaml task_name

or through the API:

from thunder import benchmark

if __name__ == "__main__":
    benchmark("uni2h", dataset="custom:/path/to/bracs_3classes.yaml", task="linear_probing")