This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

Spec Proposal #2

Closed
wants to merge 3 commits into from
81 changes: 81 additions & 0 deletions draft.md
@@ -0,0 +1,81 @@
# Draft IPLD UnixFS Spec

## Basic Structure

- Some sort of header that indicates that this is a directory and includes a version number. The header could also have fields that give additional information on the meaning of the extended attributes.

- CBOR Map
- Key: CBOR Byte or Text String: File Name
- Value: CBOR Array of:


I am really concerned about attributes being represented as an array:

  • It hampers extensibility: we are certainly forgetting something today, which almost certainly means yet another optional map in the future, and then which one is which?
  • It makes parsing difficult when represented as JSON.

Why not just a map? With all keys being pre-agreed upon multicodec-style: i.e. it must exist in one of the centralized spec tables in order to be recognized by anyone

We can still declare some of the keys as mandatory, and it is at the discretion of gateways/nodes/etc to decide what to do with "obviously malformed" blocks. We already have this with protobuf/unixfs: if one uploads a link-block with only "type 2" fields, and no "data" - everything rejects it.


CBOR defines sorting logic for canonicalizing maps, but not for arrays, and a canonical representation for a unixfs directory should be a must.

Contributor Author

> Why not just a map?

I covered the reasons below in the notes section.

@ehmry I don't see how your comment applies.

Contributor Author

I am not tied to this idea. The amount of space saved is something that can be calculated once we determine what the keys will be if we use a map.


@kevina I assumed you meant an array of [key, value] tuples. CBOR is supposed to be schema-less and ordered values seem like schema.

Contributor Author

Even if we go with a map for directory entries, certain keys will be required in order for the directory entry to be well defined, so that is also a schema in a way.


Well, assigning integer keys to the spec attributes would order them in the same way and just be one byte of overhead for each attribute in a map.

- Type: CBOR Unsigned Int
- Link or Data: CBOR Type varies
- Optional file size: CBOR Unsigned Int
- Optional Standard Attributes: CBOR Map
- Optional Extended Attributes: CBOR Map

The file size is only defined for regular files and is the size of the file contents.

All maps should be ordered by the binary values of their keys;
duplicate keys are not allowed.
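
As a rough illustration of the layout above, the following sketch encodes one directory entry as a CBOR array and, for comparison, as a map with small integer keys. It uses a minimal hand-written encoder that covers only the CBOR types this draft mentions. Several things here are assumptions rather than part of the draft: the integer key numbers (0, 1, 2), the placeholder link bytes (not a real CID), and the reading of "ordered by the binary values of their keys" as bytewise ordering of the encoded keys.

```python
def _head(major: int, n: int) -> bytes:
    """Encode a CBOR initial byte plus the shortest possible argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < (1 << (8 * size)):
            return bytes([(major << 5) | extra]) + n.to_bytes(size, "big")
    raise ValueError("integer too large for CBOR")


def encode(obj) -> bytes:
    """Encode unsigned ints, byte strings, text strings, arrays and maps."""
    if isinstance(obj, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(obj, int):
        if obj < 0:
            raise ValueError("only unsigned integers are handled here")
        return _head(0, obj)
    if isinstance(obj, bytes):
        return _head(2, len(obj)) + obj
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(obj, list):
        return _head(4, len(obj)) + b"".join(encode(x) for x in obj)
    if isinstance(obj, dict):
        # Assumption: order map entries by the bytes of the encoded key.
        # A Python dict cannot contain duplicate keys in the first place.
        items = sorted((encode(k), encode(v)) for k, v in obj.items())
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"unsupported type: {type(obj).__name__}")


link = b"\x01\x55\x00\x00"  # placeholder bytes standing in for a link

# Regular file (type 0) with a size of 1234 bytes, as an array ...
array_form = encode([0, link, 1234])
# ... and as a map with hypothetical integer keys 0=type, 1=link, 2=size.
map_form = encode({0: 0, 1: link, 2: 1234})

# A directory is a map from file name to entry; keys come out sorted.
directory = encode({"b.txt": [0, link, 1234], "a.txt": [0, link, 1234]})
```

With these inputs the array form is 10 bytes and the integer-keyed map form is 13 bytes, that is, one extra byte per attribute.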

### Notes

* An array makes sense to me as it is more compact and the values of
the fields are unambiguous. It also allows for a separation of
standard and extended attributes.

* The key type can be either a byte string or a text string, as POSIX
does not require file names to be UTF-8 and it is important that any
file name can be faithfully represented. If the string is valid UTF-8
then the type will be Text.
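
The rule in the last note can be sketched as follows. The function name `name_to_key` is made up for this example; it takes a raw file name as bytes (for instance as returned by `os.listdir` when given a bytes path) and returns a `str` when the name is valid UTF-8, which would be encoded as CBOR Text, or the untouched `bytes` otherwise, which would be encoded as a CBOR Byte String.

```python
def name_to_key(raw_name: bytes):
    """Pick the map key type for a file name.

    Returns str for valid UTF-8 names and the original bytes otherwise,
    so that every name round-trips without loss.
    """
    try:
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return raw_name


utf8_key = name_to_key(b"caf\xc3\xa9.txt")   # valid UTF-8, becomes str
latin1_key = name_to_key(b"caf\xe9.txt")     # not valid UTF-8, stays bytes
```

Note that this only decides the key type. It does not check whether the name is safe to display or to write to a local filesystem.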


This makes an implicit decision regarding the question I posed at the end of ipfs/kubo#4292 (comment): we shift the onus of "check that the name is safe to use/display" to the consumers. Are we ready to do that?

Contributor Author

@kevina kevina Oct 28, 2017

I think it is important to be able to represent any valid file in a POSIX system. I am against restricting the range in the spec. Yes, I think it is the consumer's job to make sure that the filename is safe to display.


## Types

The type field should be limited to a set of well-defined values, so it
makes sense that this is an integer rather than a text string. The
value is the ASCII value of a letter. When converting to JSON the
integer can be represented as a single-character string.

Possible values are as follows:

* 0, '', `file`: regular file
* `e`, `exe`: executable file
* `d`, `dir`: directory entry
* `s`, `special`: special file type (fifo, device, etc). The second field is a CBOR Map with at least one field to describe the type.
* `l`, `symlink`: symbolic link. The second field is the contents of the link.
* `o`, `other`: link to another IPLD object; links are followed for GC and related operations
* `u`, `unknown`: link to an unknown object; links are not followed
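
A small sketch of the mapping listed above, assuming the long names (`file`, `exe`, ...) are only labels and the stored value is the integer. The helper names are invented for this example. The last function reports how many bytes an unsigned integer takes in CBOR: values below 24 fit in one byte, while the ASCII letters used here (all 24 or above) take two.

```python
# Type codes as listed in the draft: 0 for regular files, otherwise the
# ASCII value of a letter.
TYPE_CODES = {
    "file": 0,
    "exe": ord("e"),
    "dir": ord("d"),
    "special": ord("s"),
    "symlink": ord("l"),
    "other": ord("o"),
    "unknown": ord("u"),
}


def type_to_json(code: int) -> str:
    """JSON form: a single-character string, or '' for type 0."""
    return "" if code == 0 else chr(code)


def type_from_json(text: str) -> int:
    """Inverse of type_to_json."""
    if text == "":
        return 0
    if len(text) != 1:
        raise ValueError("type must be empty or a single character")
    return ord(text)


def cbor_uint_size(n: int) -> int:
    """Number of bytes CBOR needs for an unsigned integer."""
    if n < 0:
        raise ValueError("not an unsigned integer")
    if n < 24:
        return 1
    if n < 2**8:
        return 2
    if n < 2**16:
        return 3
    if n < 2**32:
        return 5
    if n < 2**64:
        return 9
    raise ValueError("integer too large for CBOR")
```

For example, `type_to_json(TYPE_CODES["dir"])` gives `"d"`, and `cbor_uint_size(TYPE_CODES["dir"])` is 2 whereas `cbor_uint_size(TYPE_CODES["file"])` is 1.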


If an integer enumeration was used rather than ascii characters, the canonical CBOR representation would be packed to one byte rather than two. Given that the CID will be in raw representation, I don't think clarity would suffer by an enumeration.

Contributor Author

I am not against this.

### Notes

* Rather than have a special attribute for an executable bit it is more compact if we just make this a different type
@ehmry ehmry Oct 28, 2017

If file types are enumerated then the high bit in a one-byte packed CBOR integer (0b10000) could be an informative bit that would make regular files (type 0) into executable files (type 16).

Contributor Author

That could work.
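
The flag idea in this thread could look roughly like the sketch below. The names `EXEC_FLAG`, `make_type` and `split_type` are hypothetical, and the only type value taken from the discussion is 0 for regular files. One constraint follows from CBOR itself: only the values 0 to 23 fit in a single byte, so base types must stay below 8 for the flagged value (16 to 23) to remain a one-byte integer.

```python
EXEC_FLAG = 0b10000  # 16, the bit proposed in the comment above
MAX_BASE = 8         # base | EXEC_FLAG must stay below 24 to fit one byte


def make_type(base: int, executable: bool = False) -> int:
    """Combine an enumerated base type with the executable flag."""
    if not 0 <= base < MAX_BASE:
        raise ValueError("base type must be in the range 0..7")
    return base | EXEC_FLAG if executable else base


def split_type(value: int):
    """Return (base type, executable flag) for a combined value."""
    if not 0 <= value < 24:
        raise ValueError("value does not fit a one-byte CBOR integer")
    return value & ~EXEC_FLAG, bool(value & EXEC_FLAG)


regular = make_type(0)
executable = make_type(0, executable=True)
```

Here `regular` is 0 and `executable` is 16, matching the "type 0" and "type 16" values mentioned in the comment.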

* It is very useful to be able to determine whether a link is a directory or an ordinary file, so I made it a separate type. Also, there can be multiple ways to define a file size for a directory, so it is best to just leave it out, as it is of limited usefulness.


> there can be multiple ways to define a file size for a directory

Actually there are only 2 ways - either you only count the logical bytes ( what the Windows properties UI does ), or you take into account the allocation overhead of the filesystem - the blocks taken by both the directories and the files themselves rounded up ( what the unixish du does ).

Given that in the context of IPFS the DAG is completely decoupled from the storage ( it may be files, it may be badger, etc ), the only sensible way to define a file size for a directory is to count the logical bytes, which I've done in my prototypes.

I would be sad if I can't express these cumulative values as part of every link within an FS tree.

Contributor Author

Unix has its own (basically useless) way of defining the size of a directory.

I am not totally against including the cumulative size of a directory if we can agree on how to define it.
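
For reference, the "logical bytes" definition discussed in this thread can be sketched as a recursive sum. The in-memory tree below is purely illustrative (a file is an `int` holding its content size, a directory is a `dict` of children) and says nothing about how such a value would be stored in an entry.

```python
def logical_size(node) -> int:
    """Cumulative logical size: file content bytes only, no storage overhead."""
    if isinstance(node, int):
        return node  # regular file: size of its contents
    return sum(logical_size(child) for child in node.values())


tree = {
    "readme.md": 120,
    "src": {"main.c": 2048, "util.c": 512},
    "empty": {},
}
total = logical_size(tree)
```

Under this definition an empty directory has size 0, and a subtree that is linked from several places would be counted once per link.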


## Standard Attributes

The standard set of attributes should be limited to a small set of meaningful values.
Stripping this field SHOULD NOT change the meaning of the directory entry.
Clients SHOULD be able to understand these attributes when reading a directory entry.

Possible entries:

* `mtime`
* `ro`: Boolean, set if the file or directory should be readonly when copied to the filesystem

## Extended Attributes

The extended attributes set is not well defined and can be used for vendor extensions and POSIX attributes that don't make sense on non-Unix systems.
Stripping this field MUST NOT change the meaning of the directory entry.
These attributes SHOULD be passed along but do not have to be understood.
The directory header MAY include information on the meaning of the attributes;
for example it could indicate that this is a copy of a unix filesystem and to expect a standard set of corresponding attributes.

Possible entries:

Would "Extended Attributes" be a good place to optionally store explicit media type for problematic data types, as noted in #11 ?


* `user`: unix user name
* `uid`: unix numeric uid
* `group`: unix group name
* `gid`: unix numeric gid
* `perm`: full unix permissions
* extended POSIX attributes
* Windows-specific attributes
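
To make the stripping rules concrete, here is a minimal sketch using a hypothetical in-memory form of a directory entry (a Python `dict`, not the wire format). The field names `standard` and `extended`, the integer `mtime`, and the sample attribute values are all assumptions made for this example. The point it illustrates is that dropping the extended attributes leaves the type, link and size untouched.

```python
def strip_extended(entry: dict) -> dict:
    """Return a copy of the entry without its extended attributes."""
    return {k: v for k, v in entry.items() if k != "extended"}


def meaning(entry: dict):
    """The part of an entry that stripping attributes must not change."""
    return entry["type"], entry["link"], entry.get("size")


entry = {
    "type": 0,                       # regular file
    "link": b"\x01\x55\x00\x00",     # placeholder bytes, not a real CID
    "size": 1234,
    "standard": {"mtime": 1509148800, "ro": True},
    "extended": {"user": "alice", "uid": 1000, "perm": 0o644},
}

stripped = strip_extended(entry)
```

After the call, `stripped` still carries the standard attributes, and `meaning(stripped)` equals `meaning(entry)`.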