Skip to content

Latest commit

 

History

History
59 lines (45 loc) · 3.37 KB

v1_0.md

File metadata and controls

59 lines (45 loc) · 3.37 KB

Overview

An inputs group should be present at the root of the HDF5 file, containing all information relating to the data inputs. This includes the definition of the input files as well as summary statistics like the number of cells and features. The group itself contains the parameters and results subgroups.

In this section, a "matrix" refers to one or more files describing a single (count) matrix. This should be exactly one file for HDF5-based formats, or multiple files for MatrixMarket formats, e.g., to include feature information - see below for details.

Definitions:

  • embedded: whether the input files were embedded in the *.kana file. If false, the input files are assumed to be linked instead.

Parameters

parameters should contain:

  • format: a scalar string specifying the file format for a single matrix. This is usually either "MatrixMarket", for a MatrixMarket file with possible feature/barcode annotations; "10X", for the 10X Genomics HDF5 matrix format; or "H5AD", for the H5AD format. Other values are allowed but their interpretation is implementation-defined (e.g., for custom resources). For multiple matrices, format should instead be a 1-dimensional string dataset of length equal to the number of uploads. Each element of the dataset is usually one of "MatrixMarket", "10X" or "H5AD"; different values can be present for mixed input formats.}

  • files: a group of groups representing an array of input file information. Each inner group is named by their positional index in the array and contains information about a file in an upload. Each inner group should contain:

    • type: a scalar string specifying the type of the file.
      • If format = "MatrixMarket", there should be exactly one type = "mtx" corresponding to the (possibly Gzipped) *.mtx file. There may be zero or one type = "genes", containing a (possibly Gzipped) TSV file with the Ensembl and gene symbols for each row. There may be zero or one type = "annotations", containing a (possibly Gzipped) TSV file with the annotations for each column.
      • If format = "10X" or "H5AD", there should be exactly one type = "h5".
      • For other formats, any type can be used, typically for custom resources.
    • name: a scalar string specifying the file name as it was provided to kana.

    If embedded = true, we additionally expect:

    • offset: a scalar integer specifying where the file starts as an offset from the start of the remaining bytes section. The offset for the first file should be zero, and entries in files should be ordered by increasing offset.
    • size: a non-negative scalar integer specifying the number of bytes in the file. The offset of each file should be equal to the sum of size and offset for the preceding file.

    If embedded = false, we expect:

    • id: a scalar string containing some unique identifier for this file. The interpretation of id is application-specific but usually refers to some cache or database.

Results

results should contain:

  • dimensions: an integer dataset of length 2, containing the number of features and the number of cells in the loaded dataset.
  • permutation: an integer dataset of length equal to the number of cells, describing the permutation to be applied to the per-gene results to recover the original row order.

History

Added in version 1.0.