Skip to content

Latest commit

 

History

History
160 lines (124 loc) · 10.5 KB

v3_0.md

File metadata and controls

160 lines (124 loc) · 10.5 KB

Overview

An inputs group should be present at the root of the HDF5 file, containing all information relating to the data inputs. This includes the definition of the input files as well as summary statistics like the number of cells and features. The group itself contains the parameters and results subgroups.

In this section, a single "input dataset" should contain counts for a set of cells across a consistent set of features. That is, each cell in an input dataset has a non-missing count for the same set of features, possibly across multiple modalities. An input dataset will usually contain per-feature annotations, and may also contain per-cell annotations. It may be represented by a single file (e.g., HDF5-based formats) or multiple files (e.g., MatrixMarket). Note that, if multiple input datasets are supplied, they need not have the same set of features.

The "loaded dataset" refers to the in-memory representation of the input dataset(s).

  • The identities of the rows (features) of the loaded dataset may be a permutation or subset of the rows in the input dataset(s). Indeed for multiple inputs, the loaded dataset only contains the intersection of common features. However, a permutation or subset may also be applied to a single input.
  • The identities of the columns (cells) in the loaded dataset is determined from the column order of the input dataset(s). For a single input, the column order in the loaded dataset is the same as that of the input. For multiple inputs, the column order in the loaded dataset is defined as a concatenation of the input datasets in the order supplied.
  • If multiple inputs are supplied, each input dataset is assumed to contain cells in a single "block" (e.g., batch or sample). If only single input dataset is present, multiple blocks can be defined by a blocking factor in the per-cell annotation.

Definitions:

  • embedded: whether the files for the input dataset(s) were embedded in the *.kana file. If false, the input files are assumed to be linked instead.

Parameters

parameters should contain:

  • datasets: a group containing information about the input datasets. Each child of this group is itself a group that corresponds to a single input dataset; each subgroup is named after an index from 0 to N - 1, where N is the total number of datasets. (N should be a positive integer.) The datasets/<dataset_index> subgroup should itself contain:

    • name: scalar string containing the name of the input dataset. This is primarily used for display purposes, and should be unique across all children of datasets.

    • format: scalar string specifying the file format for a single matrix. This is usually one of (but not limited to) "MatrixMarket", "10X" or "H5AD". Other input dataset types may specify other values for format according to the application.

    • files: a group containing the file information for this input dataset. Each child of this group is itself a group that corresponds to a single file; each subgrouop is named after an index from 0 to F - 1, where F is the total number of files for this input dataset. Each files/<file_index> subgroup should contain:

      • type: a scalar string specifying the type of the file. The requirements here depend on the input dataset's format - the most common examples are listed below:

        • If format = "MatrixMarket", there should be exactly one type = "matrix" corresponding to the (possibly Gzipped) *.mtx file. There may be zero or one type = "features", containing a (possibly Gzipped) TSV file with the Ensembl and gene symbols for each row. There may be zero or one type = "cells", containing a (possibly Gzipped) TSV file with the annotations for each column.
        • If format = "10X" or "H5AD", there should be exactly one type = "h5".
        • If format = "SummarizedExperiment", there should be exactly one type = "rds".
      • name: a scalar string specifying the file name. This can be any arbitrary string and is typically used to assist disambiguation.

      If embedded = true, we additionally expect each files/<file_index> subgroup to contain:

      • offset: a scalar integer specifying where the file starts as an offset from the start of the remaining bytes section.
      • size: a non-negative scalar integer specifying the number of bytes in the file. Each file is represented by the byte range from [offset, offset + size) in the remaining bytes section.

      Byte ranges of files should be contiguous in the remaining bytes section, following the order in which they encountered (in datasets first, and then within files per input dataset).

      Alternatively, if embedded = false, we instead expect each files/<file_index> subgroup to contain:

      • id: a scalar string containing some unique identifier for this file. The interpretation of id is application-specific but usually refers to some cache or database.

    Optionally, the datasets/<dataset_index> subgroup might contain:

    • options: a scalar string in JSON format. This encodes a JSON object containing arbitrary number of additional options for loading this input dataset.

Optionally, parameters may contain a subset group. This may in turn contain a cells group, specifying the subset of cells to be used in downstream steps. The subset/cells group should contain one of:

  • indices, a 1-dimensional integer dataset specifying the indices of the cells to retain. Indices refer to columns of the loaded dataset, and should be unique and sorted. The length of this dataset should be equal to the number of cells in results/num_cells.
  • field, a string scalar specifying the field of the column annotations to use for filtering. This field can either be a continuous or categorical variable.
    • If categorical, the group should also contain values, a 1-dimensional string dataset specifying the allowable values for this field. The subset is defined as all cells with a field value in values.
    • If continuous, the group should also contain ranges, a 2-dimensional float dataset specifying the ranges of allowable values. Each row defines a range [a, b] where a is the value in the first column and b is the value in the second column. Ranges should be non-overlapping (excepting the boundaries) and sorted in order of increasing a. The subset is defined as all cells that fall in any of the specified ranges.

For single input datasets, parameters may also contain:

  • block_factor: a string scalar specifying the field in the per-cell annotation that contains the sample blocking factor. If present, it is assumed that the matrix contains data for multiple samples.

Results

results should contain:

  • num_cells: an integer scalar specifying the number of cells in the loaded dataset. This should count all cells if multiple input datasets are provided. Value should be positive. If subset/indices is present, its length should be equal to num_cells.

  • num_blocks: an integer scalar specifying the number of blocks in the loaded dataset.

    • If multiple input datasets are present, the number of input datasets should be equal to num_blocks.
    • If a single input dataset is present and block_factor is not provided, num_blocks should be equal to 1.
    • If a single input dataset is present and block_factor is provided, num_blocks may be any positive value.
  • feature_identities: a group containing 1-dimensional integer datasets, each named after a modality:

    • "RNA" for RNA data, if available.
    • "ADT" for ADT data, if available.
    • "CRISPR" for CRISPR data, if available.

    Each dataset is of length equal to the number of features for its corresponding modality, and contains the identities of the features of that modality in the loaded dataset.

    • For a single input dataset, each feature identity is defined as a row index of the input dataset, for some definition of "row" that depends on the format of input dataset. The interpretations of the row indices for the most common format types are listed here:

      • For format = "MatrixMarket", "H5AD" and "10X", each row is that of the stored count matrix. Thus, feature identities can be directly used to index into the feature annotation of these formats.
      • For format = "SummarizedExperiment", each row index refers to the index of the main/alternative experiment corresponding to that modality. Thus, feature identities can only be used to index rowData() of the appropriate main/alternative experiment. The mapping between modalities and experiments should be specified in the options of the relevant parameters/datasets.
    • For multiple input datasets, each feature identity is defined as a row index of the first input dataset. Note that not all rows of the first input may be present in the loaded dataset after taking the intersection.

    The use of row indices here ensures that features can be disambiguated in the absence of a unique key in the feature annotations. Feature identities are parallel to the per-feature results in subsequent analysis steps, e.g., feature_selection, marker_detection.

Optionally, results may contain:

  • feature_names: a group containing 1-dimensional string datasets. Each dataset is named after a modality as in identities, is of length equal to the corresponding entry of identities, and contains human-readable names for the features of its modality in the loaded dataset. Order of feature identifiers should be consistent with that in identities. Note that this field is only provided for ease of manual interpretation, and not all modalities may be represented. As human-readable names may be ambiguous or unavailable, applications should prefer to use identities instead.

History

Updated in version 3.0, with the following changes from the previous version:

  • Standardized information for single/multiple inputs into a consistent format. This is done by turning each input dataset into its own group, and nesting the file information inside each input's group.
  • Renamed sample_factor to block_factor, to avoid confusion between different concepts of samples. Similarly renamed num_samples to num_blocks.
  • Renamed identities to feature_identities for clarity.
  • Removed num_features as this is made redundant by the lengths of feature_identities.
  • Added feature_names to make the results easier to interpret during manual inspection, so that users do not have to map the feature_identities for each input format.