The TACO specification

The file format built on top of Tortilla.

The terms “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document follow the definitions from RFC 2119.

The term “data user” refers to any individual or system accessing the data; “data provider” refers to the entity responsible for offering the data.

Overview

TACO is a specification that extends the common structure of Tortilla files. It is inspired by the STAC Collection and includes additional fields to enable the automatic generation of Croissant metadata, DataCite metadata, and Datacards.

TACO is designed around the FAIR principles:

  • Findability: EO data is not structured for easy discovery. To improve dataset accessibility on platforms like Google Dataset Search, DataCite Search, and OpenAIRE Explore, data providers are typically required to understand and implement multiple specifications, such as STAC Collection, Croissant, and DataCite. The TACO Toolbox simplifies this process by offering a well-defined Pydantic schema that can be converted automatically into any one of these specifications, or into all three at once.

Figure 01: Data providers enter data and metadata in a Pydantic DataModel. The taco-toolbox then generates a TACO-compliant dataset. Data users can subsequently use the “taco-readers” to extract the metadata and convert it into Markdown, JSON, or JSON-LD files, ensuring compatibility with various metadata specifications.

  • Accessibility: TACO organizes datasets as collections of samples with similar data and metadata stored in Tortilla files. The dataset structure has two levels: (1) Collection and (2) Sample. A Collection describes metadata at the dataset level; it helps search engines find datasets on the web and helps data users get a general idea of what the dataset is about. Data users can obtain the Collection metadata in one or multiple specifications, as discussed under Findability. The Sample represents the fundamental unit of a dataset, encapsulating a single data instance.

Figure 02: TACO toolbox adopts a structure similar to STAC, organizing the data into collections, and samples. Users can use different fields to split the dataset and create “Catalogs” on-the-fly.

  • Interoperability: TACO uses Tortilla in the backend, which provides fast partial and parallel reading capabilities. See the Tortilla specification for more details.

  • Reusability: Since each Tortilla is represented as a DataFrame at the top level, TACO-compliant datasets can be integrated through a simple concatenation operation.

Figure 03: This example demonstrates the integration of all methane emission datasets. TACO facilitates the development of universal data loaders. The blue line illustrates the connection between data users and providers through client libraries (taco-reader).

Goals

The primary goals of the TACO specification are:

  1. Standardization: Establish a common structure and metadata format for EO datasets.

  2. FAIR Compliance: Ensure all datasets align with the FAIR principles to improve data usability and facilitate broader data sharing.

  3. Support for Responsible AI: Incorporate fields that align with Responsible AI principles, ensuring that datasets are not only scientifically sound but also ethically compliant.

  4. User-Friendly Structure: Organize datasets into a clear hierarchical structure (Collection and Sample) to facilitate easy navigation and retrieval of data.

  5. Efficient Data Handling: Utilize the capabilities of Tortilla files for efficient partial reading and parallel processing to improve data access and analysis.

Format

This is version 0.5.0 of the TACO specification. Future versions MUST remain backward compatible with this one. The mandatory fields are:

Figure 04: This diagram visually represents the internal structure of a TACO file, highlighting how data and metadata are organized within the file format.

TACO extends the Tortilla specification by adding three fields (see below). With these changes, the available free space is reduced to 158 bytes, compared to 174 bytes in Tortilla.

The Magic Number (MB)

A 2-byte identifier at the start of each file (#y in hex) ensures that the file is recognized as a valid TACO format.

int.from_bytes(b'WX', byteorder='little')  # 22615 (0x5857)

The Collection offset (CO)

An 8-byte field indicating the starting position of the collection metadata within the file.

The Collection length (CL)

An 8-byte field specifying the size of the collection metadata in bytes.
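As a rough illustration of how the magic number, the collection offset, and the collection length could work together, the sketch below builds a toy in-memory file and reads the collection back from it. The byte positions `CO_POS` and `CL_POS`, the 200-byte header size, the little-endian unsigned encoding, and the placement of the collection directly after the header are all assumptions made for this example; the Tortilla specification defines the actual layout.

```python
import json
import struct

MAGIC = b"WX"        # magic bytes as used in the snippet above
HEADER_SIZE = 200    # ASSUMPTION: total header size in bytes
CO_POS = 26          # ASSUMPTION: hypothetical position of the collection offset
CL_POS = 34          # ASSUMPTION: hypothetical position of the collection length


def write_toy_taco(collection: dict) -> bytes:
    """Build a toy file: a zero-filled header followed by the collection JSON."""
    payload = json.dumps(collection).encode("utf-8")
    header = bytearray(HEADER_SIZE)
    header[0:2] = MAGIC
    # "<Q" is a little-endian unsigned 64-bit integer (8 bytes).
    struct.pack_into("<Q", header, CO_POS, HEADER_SIZE)
    struct.pack_into("<Q", header, CL_POS, len(payload))
    return bytes(header) + payload


def read_collection(blob: bytes) -> dict:
    """Check the magic number, then slice out and decode the collection."""
    if blob[0:2] != MAGIC:
        raise ValueError("magic number mismatch")
    (offset,) = struct.unpack_from("<Q", blob, CO_POS)
    (length,) = struct.unpack_from("<Q", blob, CL_POS)
    return json.loads(blob[offset:offset + length].decode("utf-8"))


toy = write_toy_taco({"id": "demo", "taco_version": "0.5.0"})
collection = read_collection(toy)
```

Because the offset and length are stored in the header, a reader can fetch the collection metadata with a single ranged read instead of scanning the whole file.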

The COLLECTION

A TACO Collection is simply a JSON object in UTF-8 encoding. The TACO Collection specification defines a set of common fields to describe a dataset. These fields contain information about the data structure, the tasks, the providers, etc. Some fields can be generated automatically by the TACO-toolbox, while others must be configured manually by the data provider. A TACO Collection can be converted to a STAC Collection, Croissant, or DataCite metadata using the TACO-readers. The mandatory and optional fields to be filled in by the data provider are:

| Field | Type | Required | Details |
|---|---|---|---|
| id | String | Yes | A unique identifier for the dataset. |
| taco_version | String | Yes | The version of the TACO specification. |
| dataset_version | String | Yes | Version of the dataset. |
| description | String | Yes | Description of the dataset. |
| licenses | List of strings | Yes | License(s) of the dataset. SPDX License identifiers are RECOMMENDED. More than one license can be listed. |
| extent | Extent Object | Yes | Spatial and temporal extents. |
| providers | List of Person objects | Yes | A list of persons who participated in the creation of the dataset. |
| title | String | No | Title of the dataset. Maximum length is 250 characters. |
| keywords | List of strings | No | List of keywords describing the dataset. |
| curators | List of Person objects | No | A list of persons responsible for converting the dataset to TACO compliance. |
| task | Task Object | No | The TACO specification defines a set of explicit tasks. This field refers to the task most relevant to the dataset. |
| split_strategy | Split Strategy Object | No | The split strategy is a string chosen from an explicit list of method names. |
| discuss_link | HyperLink Object | No | RECOMMENDED. A link to a discussion forum or other resource where users can discuss the dataset. |
| raw_link | HyperLink Object | No | Provides a link to the raw dataset if it was not created in the TACO-native format. |
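To make the field list more concrete, here is a minimal sketch of a Collection that fills only the mandatory fields and serializes it as UTF-8 JSON. Every value (identifier, license, names, extents) is made up for illustration, and `missing_required` is a hypothetical helper, not a TACO Toolbox function.

```python
import json

REQUIRED_FIELDS = (
    "id", "taco_version", "dataset_version", "description",
    "licenses", "extent", "providers",
)

# Illustrative values only; none of them describe a real dataset.
collection = {
    "id": "demo-dataset",
    "taco_version": "0.5.0",
    "dataset_version": "1.0.0",
    "description": "Toy dataset used to illustrate the Collection fields.",
    "licenses": ["CC-BY-4.0"],  # SPDX identifier
    "extent": {
        "spatial": [[-180.0, -90.0, 180.0, 90.0]],
        "temporal": [[1672567200000, 1704067199999]],
    },
    "providers": [{"name": "Jane Doe", "organization": "Example Lab"}],
}


def missing_required(obj: dict) -> list:
    """Return the mandatory field names that are absent from obj."""
    return [name for name in REQUIRED_FIELDS if name not in obj]


# The Collection is stored as a JSON object in UTF-8 encoding.
encoded = json.dumps(collection).encode("utf-8")
```

A check like `missing_required` only covers presence; in practice the Pydantic schema mentioned in the Overview would also validate types and nested objects.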

Figure 05: TACO Collection Schema, highlighting core metadata fields such as licensing, spatial and temporal extents, dataset provider, and versioning information for both the dataset and TACO. Other fields shown are RECOMMENDED but not mandatory. The Responsible AI (RAI*) fields are highly RECOMMENDED for documenting potential biases and ethical considerations related to the dataset.

TACO Sample

In TACO, a sample consists of a single data point. The binary data is located within the DATA Pile section, while the metadata is found in the FOOTER section. Refer to the Tortilla specification for more details.

TACO Extensions

Extent Object

Describes the spatial and temporal coverage of the entire dataset. Both spatial and temporal extents are required.

| Field | Type | Required | Description |
|---|---|---|---|
| spatial | List of lists of numbers | Yes | Represents the spatial extent of the dataset. It contains a list of lists, where each inner list defines a bounding box as [xmin, ymin, xmax, ymax] in the EPSG:4326 coordinate system. |
| temporal | List of lists of integers | Yes | Represents the temporal extent of the dataset as a nested list. Each inner list contains two integers: the start and end dates, expressed in milliseconds since the Unix Epoch (January 1, 1970, 00:00:00 UTC). Example: [[1672567200000, 1704067199999]] for a dataset spanning from January 1, 2023, 10:00:00.000 AM UTC to December 31, 2023, 11:59:59.999 PM UTC. Multiple inner lists indicate multiple distinct temporal extents within the dataset. |
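The temporal example in the table can be reproduced with the standard library. This is a small sketch; `to_epoch_ms` is a hypothetical helper name, and integer timedelta division is used to avoid floating-point rounding on the millisecond part.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Milliseconds since the Unix Epoch, computed with integer arithmetic."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


start = datetime(2023, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
end = datetime(2023, 12, 31, 23, 59, 59, 999000, tzinfo=timezone.utc)

temporal = [[to_epoch_ms(start), to_epoch_ms(end)]]
print(temporal)  # [[1672567200000, 1704067199999]]
```

Using timezone-aware datetimes matters here: a naive datetime would be interpreted in local time and could shift the extent by several hours.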

Contact Extent Object

The Contact Object is based on the STAC Extension proposed by Matthias Mohr. It identifies and provides communication details for a person or organization responsible for a resource.

| Field | Type | Required | Description |
|---|---|---|---|
| name | String | If organization is missing | Name of the responsible person. |
| organization | String | If name is missing | Organization or affiliation of the contact. |
| identifier | String | No | Unique identifier for the contact. |
| position | String | No | Role or job title within the organization. |
| emails | List of Info Objects | No | Email addresses for contacting the responsible party. |
| contactInstructions | String | No | Additional instructions for contacting the responsible party. |
| roles | List of strings | No | Duties, functions, or permissions associated with this contact. |

Info Object

The Info Object provides contact information and its associated roles. It is used for phone numbers and email addresses.

| Field | Type | Required | Description |
|---|---|---|---|
| value | String | Yes | The actual contact information (e.g., phone number or email address). |
| roles | List of strings | No | The type of contact information, such as home, work, or fax. |
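The conditional requirement on name and organization is easy to miss in a table, so the following sketch expresses it in code. It uses standard-library dataclasses to stay dependency-free, whereas the TACO Toolbox itself is described as Pydantic-based; the class names `Contact` and `Info` are chosen here for illustration and may not match the toolbox.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Info:
    value: str                                      # e.g. an email address
    roles: List[str] = field(default_factory=list)  # e.g. ["work"]


@dataclass
class Contact:
    name: Optional[str] = None
    organization: Optional[str] = None
    identifier: Optional[str] = None
    position: Optional[str] = None
    emails: List[Info] = field(default_factory=list)
    contactInstructions: Optional[str] = None
    roles: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # name is required if organization is missing, and vice versa.
        if not self.name and not self.organization:
            raise ValueError("either name or organization is required")


contact = Contact(name="Jane Doe", emails=[Info("jane@example.org", ["work"])])
```

Constructing `Contact()` with neither field raises a `ValueError`, while supplying only an organization is accepted.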

HyperLink Object

The HyperLink Object defines a URL and its associated description. The URL MUST be a valid URI according to RFC 3986.

| Field | Type | Required | Description |
|---|---|---|---|
| href | String | Yes | The URL of the linked resource. MUST be a valid URI according to RFC 3986. |
| description | String | No | A brief explanation or context for the hyperlink. |
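A loose pre-check of the href value can be sketched with the standard library. Note the limits: `urllib.parse.urlsplit` splits a string into URI components but does not fully validate it against RFC 3986, so this is only a first filter, and `looks_like_absolute_uri` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def looks_like_absolute_uri(href: str) -> bool:
    """Loose check that a scheme and a host are present.

    This is NOT a complete RFC 3986 validator.
    """
    try:
        parts = urlsplit(href)
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. a broken IPv6 host.
        return False
    return bool(parts.scheme) and bool(parts.netloc)


link = {"href": "https://example.org/discussion", "description": "Dataset forum"}
is_valid = looks_like_absolute_uri(link["href"])
```

A stricter implementation would rely on a dedicated URI-validation library or on the URL types provided by the schema framework in use.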

Task Extent Object

The Task Extent Object defines the machine learning task associated with a dataset. The task field is a string selected from a predefined list, where each task specifies distinct input and target tensor characteristics. This task classification is intended to be consistent with mlm:tasks.

| Field | Type | Required | Description |
|---|---|---|---|
| task | String | Yes | The type of machine learning task. MUST be one of the predefined tasks listed below. |

Supported Tasks:
  • Regression: Generic regression that estimates a numeric and continuous value.

  • Classification: Generic classification task that assigns class labels to an output.

  • Scene Classification: Specific classification task where the model assigns a single class label to an entire scene/area.

  • Detection: Generic detection of the “presence” of objects or entities, with or without positions.

  • Object Detection: Task corresponding to the identification of positions as bounding boxes of objects detected in the scene.

  • Segmentation: Generic task that groups all types of segmentation tasks, which consist of applying labels to pixels.

  • Semantic Segmentation: Specific segmentation task where all pixels are attributed labels, without consideration for segments as unique objects.

  • Instance Segmentation: Specific segmentation task that assigns distinct labels for groups of pixels corresponding to object instances.

  • Panoptic Segmentation: Specific segmentation task that combines instance segmentation of objects and semantic labels for non-objects.

  • Similarity Search: Generic task to identify whether a query input corresponds to another reference within a corpus.

  • Generative: Generic task that encompasses all synthetic data generation techniques.

  • Image Captioning: Specific task of describing the content of an image in words.

  • Super Resolution: Specific task that increases the quality and resolution of an image by increasing its high-frequency details.

  • Denoising: Task that removes noise from an image or signal.

  • Inpainting: Task that reconstructs missing or damaged portions of an image.

  • Colorization: Task that adds color to a grayscale image.

  • Style Transfer: Task that applies the visual style of one image to another image while preserving its content.

  • Deblurring: Task that sharpens an image by removing blur.

  • Dehazing: Task that removes haze or fog from an image to improve visibility.

  • General: A broad category encompassing various tasks not specifically listed above.

Split strategy Extent

The Split Strategy Extent Object defines how the dataset is divided into training, validation, and testing subsets. The split_strategy field is a string representing the chosen splitting approach.

| Field | Type | Required | Description |
|---|---|---|---|
| split_strategy | String | Yes | The dataset splitting strategy. MUST be one of the predefined split strategies listed below. |

Supported Split Strategies:
  • random: Randomly splits the dataset into training, validation, and testing subsets.
  • stratified: Splits the dataset based on a specific property, such as temporal or spatial characteristics (e.g., season or location).
  • other: Splits the dataset using a custom-defined pattern.
  • none: The dataset does not have a split strategy.
  • unknown: The method used to split the dataset is not known or specified.
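Since the field is restricted to a closed list, validation reduces to a membership test. The sketch below assumes the strategy names are compared case-insensitively and stored in lowercase, which is a choice made for this example rather than something the specification states; `validate_split_strategy` is a hypothetical helper.

```python
SPLIT_STRATEGIES = {"random", "stratified", "other", "none", "unknown"}


def validate_split_strategy(value: str) -> str:
    """Return the normalized strategy name, or raise if it is not in the list."""
    normalized = value.strip().lower()  # ASSUMPTION: case-insensitive matching
    if normalized not in SPLIT_STRATEGIES:
        allowed = ", ".join(sorted(SPLIT_STRATEGIES))
        raise ValueError(f"unsupported split strategy {value!r}; expected one of: {allowed}")
    return normalized


strategy = validate_split_strategy("stratified")
```

The same pattern applies to the task field, with the supported tasks as the allowed set.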

Optical Data Extent

The Optical Data Extent provides key information about the sensor and spectral bands of optical remote sensing data. You can specify the sensor (e.g., landsat8oli, sentinel2msi) and optionally subset the bands (e.g., landsat8oli[B01,B02]). If a recognized sensor is provided, the TACO Toolbox will automatically populate the corresponding bands.

| Field | Type | Required | Description |
|---|---|---|---|
| sensor | String | No | The sensor that acquired the data. A subset of bands can be specified in brackets after the sensor name: landsat8oli[B01,B02]. |
| bands | List of SpectralBand | No | A list of spectral bands. If not provided directly, it will be automatically populated based on the sensor field if the sensor is recognized. |

Supported sensors:
- landsat1mss
- landsat2mss
- landsat3mss
- landsat4mss
- landsat5mss
- landsat4tm
- landsat5tm
- landsat7etm
- landsat8oli
- landsat9oli
- sentinel2msi
- eo1ali
- aster
- modis

Spectral Band Object

The spectral band object describes the characteristics of individual spectral bands for a given sensor.

| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | The name of the band (e.g., “B02”, “red”). |
| index | Integer | No | The index of the band. |
| common_name | String | No | The common name for the band (e.g., “blue”, “green”). |
| description | String | No | A description of the band. |
| unit | String | No | The unit of measurement for the band’s values. |
| center_wavelength | Float | No | The central wavelength of the band. |
| full_width_half_max | Float | No | The full width at half maximum (FWHM) of the band, a measure of its spectral resolution. |
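The bracket syntax for the sensor field (for example landsat8oli[B01,B02]) can be parsed with a short regular expression. This is a sketch of one possible parser, not the TACO Toolbox implementation: `parse_sensor` and the pattern are hypothetical, and the resulting band entries carry only the mandatory name field.

```python
import re
from typing import List, Optional, Tuple

SUPPORTED_SENSORS = {
    "landsat1mss", "landsat2mss", "landsat3mss", "landsat4mss", "landsat5mss",
    "landsat4tm", "landsat5tm", "landsat7etm", "landsat8oli", "landsat9oli",
    "sentinel2msi", "eo1ali", "aster", "modis",
}

# Sensor name, optionally followed by a comma-separated band list in brackets.
_SENSOR_PATTERN = re.compile(r"^(?P<sensor>[a-z0-9]+)(?:\[(?P<bands>[^\[\]]+)\])?$")


def parse_sensor(spec: str) -> Tuple[str, Optional[List[str]]]:
    """Split 'sensor[B01,B02]' into the sensor name and the band subset."""
    match = _SENSOR_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"malformed sensor specification: {spec!r}")
    bands = match.group("bands")
    if bands is None:
        return match.group("sensor"), None  # no subset requested
    return match.group("sensor"), [band.strip() for band in bands.split(",")]


sensor, bands = parse_sensor("landsat8oli[B01,B02]")
recognized = sensor in SUPPORTED_SENSORS
# Minimal Spectral Band objects: only the mandatory "name" field is filled.
spectral_bands = [{"name": name} for name in (bands or [])]
```

For a recognized sensor, a real implementation would look up the remaining band attributes (common name, center wavelength, FWHM) from a per-sensor table instead of leaving them empty.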

The Label Extension

The Label Extension offers a standardized framework for defining and describing the labels used in a dataset. It does this through the Labels object, which contains a collection of LabelClass objects. Each LabelClass represents a distinct category or class within the dataset.

Labels Object

| Field | Type | Required | Description |
|---|---|---|---|
| label_classes | List of LabelClass objects | Yes | A list of LabelClass objects defining the label classes. |
| label_description | String | No | An optional overall description of the labels. |

LabelClass Object

| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | The unique name of the label class (e.g., “car”, “building”). |
| category | String or integer | Yes | The category the label class belongs to. |
| description | String | No | An optional description providing more information about the label class. |
| bbox | Bool | No | The bbox field consists of four numerical values that define the spatial boundaries of an object within an image. The first value represents the x-coordinate of the bounding box center, normalized by the width of the image (ranging from 0 to 1). The second value indicates the y-coordinate of the bounding box center, normalized by the height of the image (ranging from 0 to 1). The third value specifies the width of the bounding box, normalized by the image width, while the fourth value indicates the height of the bounding box, normalized by the image height. |
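The normalized center format described for bbox can be derived from pixel corner coordinates as follows. This is a worked sketch; `normalize_bbox` and the pixel values are illustrative and not part of the specification.

```python
from typing import List


def normalize_bbox(
    xmin: float, ymin: float, xmax: float, ymax: float,
    image_width: float, image_height: float,
) -> List[float]:
    """Convert pixel corners to [x_center, y_center, width, height] in 0..1."""
    x_center = (xmin + xmax) / 2 / image_width
    y_center = (ymin + ymax) / 2 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return [x_center, y_center, width, height]


# A 200 x 200 pixel box with its top-left corner at (100, 50)
# inside a 400 x 500 pixel image.
bbox = normalize_bbox(100, 50, 300, 250, image_width=400, image_height=500)
print(bbox)  # [0.5, 0.3, 0.5, 0.4]
```

Because all four values are divided by the image dimensions, the same annotation remains valid if the image is resized.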

The Scientific Extension

This extension standardizes how datasets link to related scientific publications, providing metadata for proper citation and referencing. The Scientific Extension is based on the STAC Extension proposed by Matthias Mohr.

Scientific Object

| Field | Type | Required | Description |
|---|---|---|---|
| doi | String | No | The Digital Object Identifier (DOI) of the dataset. |
| citation | String | No | The full citation of the dataset in BibTeX format. |
| summary | String | Yes | A brief summary of the dataset. |
| publications | List of Publication | Yes | A list of Publication objects representing related scientific works. |

Publication Object

| Field | Type | Required | Description |
|---|---|---|---|
| doi | String | Yes | The Digital Object Identifier (DOI) of the publication. |
| citation | String | Yes | The full citation of the publication in BibTeX format. |
| summary | String | Yes | A brief summary or abstract of the publication. |

The RAI Extension

This extension, based on the Croissant RAI specification (http://mlcommons.org/croissant/RAI/1.0), provides essential metadata fields for documenting Responsible AI (RAI) considerations associated with a dataset. It focuses on data lifecycle, potential biases, limitations, and intended use cases.

RAI Object

| Field | Type | Required | Description |
|---|---|---|---|
| rai:dataCollection | String | No | Describes the data collection process. |
| rai:dataCollectionType | String | No | Specifies the method of data collection (e.g., “Surveys”, “Web Scraping”, “Focus groups”, “Self-reporting”). |
| rai:dataLimitations | String | No | Describes known limitations of the dataset, including potential issues related to generalization, data quality, or data sources. |
| rai:dataBiases | String | No | Describes any known or potential biases in the dataset. |
| rai:useCases | String | No | Describes the intended use cases for the dataset (e.g., “Training”, “Validation”, “Testing”, “Fine Tuning”). |
| rai:personalSensitiveInformation | String | No | Indicates the presence of personal or sensitive information within the dataset and its type (e.g., “Gender”, “Age”, “Socio-economic status”, “Location”). |
| rai:dataPreprocessingProtocol | String | No | Describes the steps taken to preprocess the data, such as cleaning, normalization, or feature engineering. |
| rai:dataAnnotationProtocol | String | No | Describes the process used to annotate the data, if applicable (e.g., platform used, instructions provided to annotators, number of annotators per item). |
| rai:annotationPlatform | String | No | Platform, tool, or library used to collect annotations by human annotators (e.g., “Amazon Mechanical Turk”, “Scale AI”, “Labelbox”). |
| rai:dataReleaseMaintenancePlan | String | No | Describes the plans for maintaining and updating the dataset, including versioning and deprecation policies. |
| rai:dataCollectionTimeframe | String | No | Specifies the time period during which the data was collected (e.g., “2022-01-01 to 2022-12-31”, “January 2023”). |