Let's fix the fish history file format, and maybe extend it to configuration #6493

Closed
wants to merge 8 commits

Conversation

@ridiculousfish (Member) commented Jan 12, 2020

Let's figure out what the fish history file format should be. There's some discussion in #3341 and also here.

The goal is to settle on a fish history format, and then eventually re-use the format to persist configuration data, like universal variables and abbreviations. The user is never expected to look at or edit these files manually. But they may choose to.

State of the world

fish history is currently a broken pseudo-YAML. Mea culpa: I added this about 8 years ago, without thinking hard enough. It is ad-hoc, underspecified and buggy (#6032). This complicates adding new fields while maintaining backwards compatibility, and it prevents the format from being successfully parsed by any other libraries.

The goal here is to settle on a widely-recognized, fully-specified format.

Desirable Properties for fish

These are my opinions only and are totally fair game for discussion.

  • Textual. Shells are built on text, which rules out binary formats like SQLite or protobuf. It should be obvious to the user how the history file is stored.

  • Self Describing. fish history should be a sequence of key-value pairs, not a positional list, so that new fields may be added in the future. This rules out formats like CSV or zsh's extended history.

  • Not Ad-Hoc. The new history format should be a real and commonly understood format, not something that we just make up.

  • Trivial Appending. The file must not require "closing." fish's normal history-saving is O_APPEND writes with best-effort file locking (see the sketch after this list). This rules out formats like XML (</xml>) or JSON (}) that require explicit closing.

  • Streamable Searching. fish should be able to search the history without decoding the entire history file into memory. A SAX-style parser would satisfy this; so would easily identifiable record boundaries.
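
As a concrete sketch of the last two properties (shell-level illustration only; fish_history.json here is a hypothetical file in the JSON Lines format discussed below):

echo '{"cmd":"ls","when":1457678810}' >> fish_history.json  # appending needs no closing token to rewrite
grep fallback fish_history.json                             # each line is a whole record, so searches can stream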

Candidates

I know of two plausible options: JSON Lines and TOML. I don't have a strong opinion between them - I keep going back and forth, and am open to alternatives. IMO YAML is out because of the complexity of the spec, and to avoid ambiguity with the existing format.

JSON Lines

JSON Lines, also known as JSON Object Streams, is simply newline-delimited JSON objects. Of course JSON is ubiquitous, but newline-delimited JSON is itself a fairly common serialization format - it is understood by jq, the JSON Swiss Army knife. jq can slice and dice, filter, query, etc.

fish history in JSON lines might look like:

{"cmd":"git branch -D tmp_fchown","when":1457678810}
{"cmd":"git checkout pr/2809","when":1457678814}
{"cmd":"cat ~/file.txt","when":1457678820}
{"cmd":"git diff src/fallback.h","paths":["src/fallback.h"],"when":1460230565}

Advantages

  • Ubiquitous, including support in jq.
  • Good library support for both reading and writing. This PR uses SimpleJSON.
  • Trivial record separators (just newlines) allow efficient streaming reads.

Disadvantages

  • Ugly.
  • Bad for human editing. If we used this to persist configuration data, like universal variables and abbreviations, it would be harder to edit by hand.

TOML

TOML is a configuration file format aimed at human-editability. It supports appending via the "list of tables" syntax. Here we might use its support for real dates:

[[item]]
cmd = "git branch -D tmp_fchown"
when = 2019-05-27T05:32:00

[[item]]
cmd = "git checkout pr/2809"
when = 2017-06-02T11:16:05

[[item]]
cmd = "git diff src/fallback.h"
when = 2020-05-15T01:02:04
paths = ["src/fallback.h"]

Advantages

  • Visually nice.
  • Excellent for configuration due to its focus on human editability.

Disadvantages

  • Existing libraries are not ideal for fish's use case:
    • We'll need some hacks to detect record boundaries (see the example below).
    • Generating TOML is not a priority for most libraries; we might have to write our own.
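
For example (illustrative only), a reader streaming the TOML form has to scan for the table headers to locate record boundaries:

grep -n '^\[\[item\]\]' fish_history.toml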

What Other Shells Do

AFAICT fish would be the only shell doing this:

  • PowerShell stores commands as naive newline-delimited text, with no metadata.

  • ksh stores the commands with two NUL bytes as delimiters. There is no metadata.

  • zsh has a fixed list of positional fields.

  • elvish uses a binary database called bolt.

Compatibility

The new format will be incompatible with the old format. For that reason, the filename should be different: it will be (for example) fish_history.json instead of merely fish_history. This avoids the problem where you test the new fish, then revert back, and now your history is mangled.

Ideally this is the last time we wholesale replace the fish history file format. Future changes will be adding new key/value pairs, which will be ignored by previous fish shells.

Commits

  • Encapsulate the process of reading from history, in preparation for history file format changes.
  • This adds the single-header json.hpp file from SimpleJSON to fish. It is not yet built or included. (Commit 8dd3e9b84a0b532c46f22f402bafb2c4cf8621eb from https://github.com/nbsdx/SimpleJSON)
  • This adds support to fish for reading history files in "JSON Lines" format. Each entry is a separate JSON object.
  • Teach SimpleJSON to emit "compact" one-line JSON records. This will allow compatibility with JSON Lines.
  • This switches the output format from busted YAML to real JSON Lines. To avoid breaking prior versions, we now append the "json" suffix.
  • Ensure that we can successfully import old fish history.
@faho (Member) commented Jan 12, 2020

SimpleJSON is WTFPL (i.e. literally "do what the fuck you want"), so it should be GPL-compatible.

The last commit was in 2016, and it's a single-file lib, so I'm assuming the intention is to vendor it, and that's okay.


I think JSON-lines is probably the correct choice, even though I don't think it's ideal for manual adjustment because:

  • The line thing makes for long, unwieldy lines
  • No comments

But I don't expect manual changes. If anything, users will use tools to change or query the history, and for that jq exists for JSON, but I know of no equivalent for TOML or YAML.


Being incompatible is fine; it's necessary and not too odious.

But it's a bit of a risk, which is why I'd like to hold off on this for 3.1.

@ridiculousfish (Member, Author)

Yes this would definitely be for 3.2.

@zx8 commented Jan 14, 2020

I prefer JSON Lines to TOML as well. For me, being able to use jq is a huge advantage over TOML, especially in the cases where I need more advanced search capabilities, or to manually manipulate history beyond what the history builtin offers.

Quick note, jsonlines.org suggests .jsonl as the extension rather than .json – not sure if we want to stick to that (suggested) convention or just go with .json?

@floam (Member) commented Jan 20, 2020

Highly prefer JSON Lines.

It'd be cool to add a number of fields at the same time with shell state for history items. rash, for example, stores:

  • Current directory ($PWD)
  • Exit code ($status)
  • Exit codes of pipe stages ($pipestatus)
  • The times the command started and terminated

@travankor commented Feb 14, 2020

SimpleJSON looks like it is unmaintained and might have unresolved security issues (including an out-of-bounds read). Some alternatives to consider: picojson and jsonx.

If the new history format is reused for configuration data, it would be nice to have support for comments. There are some libraries that have comment support; otherwise maybe add comment support to the vendored library.

Also, it would be nice to have compression for the history file, especially if a lot more fields get added to the format. gzip should be fairly ubiquitous across installs.


Personally, I find it a bit strange that there is a hard requirement for a textual format, while the proposed solution is, for all intents and purposes, not human-parseable without external tooling. Rather than extolling the virtues of jq, the fish POV should be that the built-in history tools have excellent UX. Recommending external tools like jq, autojump, or fzf goes against fish's discoverability philosophy. Furthermore, jq isn't a very UNIXy tool, other than that you can use it with pipes; to cut to the point, jq even has SQL-style operators.

So I'm not sure why binary formats are 100% disqualified when JSONL is not too different. Some advantages over JSONL would include better reliability (users can't accidentally edit and corrupt their history) and performance in the long run.

One problem with the current history implementation is that when your system clock is reset or wrong, fish can't make sense of the history. So it would be nice to have good reliability with the new format right out of the box.

@faho (Member) commented Feb 14, 2020

Also, it would be nice to have compression for the history file, especially if a lot more fields get added to the format. gzip should be fairly ubiquitous across installs.

This seems to be overkill. We deduplicate history and keep a maximum of 256k items, which is enough for everyone. My history has 33k items and is 2.3M, meaning a "full" history would be about 20M.

I don't see us adding enough fields to reach even 100M. So the complexity and additional dependency of gzip just doesn't appear to be worth it.

@floam (Member) commented Feb 14, 2020

I agree with the binary format thing - I'm not sure it'd really be so bad just to use sqlite3 or protobufs, if the history builtin can work on it well enough.

@floam (Member) commented Feb 14, 2020

(Something a user can do currently if they don't love the size of their history file is just tell their filesystem to compress it behind the scenes, which macOS can do with HFS or APFS. I don't think ext3 can do this but fancier Linux filesystems support transparent compression.)

@ridiculousfish (Member, Author)

Absolutely fish should have excellent tools for manipulating history and jq is not intended to be a substitute for that.

Text still has major advantages. fish can easily import bash history, which would have been harder if bash had used bolt or sqlite. Users might try crazy stuff like putting their history in a git repo. Or maybe your fish installation is borked: a textual history file is easier to find and recover. The only potential advantage of binary formats is performance, and I don't anticipate fish history outgrowing what is possible with text.

Regarding the implementation, I started with SimpleJSON because of its size, but the OOB read that @travankor found does look like a serious issue and I agree we should not use it. Of the alternatives, both picojson and jsonx seem viable. Good finds.

The time issue is orthogonal but well spotted, I hadn't thought of that. Filed #6606.

@mqudsi (Contributor) commented Feb 15, 2020

I'm a huge toml fan (rust converted me) but I don't think it's right for the job. I think toml should be used where humans are primarily editing the file by hand but not when you only need human-readability for inspection purposes: it's too verbose and has less tooling available.

When I saw the title of the issue, before even opening it, my thoughts were SQLite (which I honestly consider to be very much both human-readable and human-editable, dependent on the schema) for robustness and maximum flexibility, LMDB (which I previously forked fish to use for other reasons) for size and speed if human readability is not a concern, and NDJSON (specifically) otherwise.

If you're willing to reconsider, SQLite makes a ton of sense for a history file with the requirements we have. It gives you "free" indexing and full-text search, "free" multi-thread/multi-process access, "free" paged loading of history items, etc. etc.

Personally, I prefer SQLite over NDJSON particularly because it enforces a schema (even though it's not a strongly typed database, but we can't have everything in life).

With regards to JSON libraries, I adopted https://github.com/nlohmann/json for a few projects and haven't had any regrets. It's not the fastest but it's very developer friendly. I would prefer to take advantage of a native C++ JSON library rather than an untyped C library, from past experience.

@krobelus (Member) commented Mar 7, 2020

A failing history search needs to traverse the entire history. With a large history, this can be expensive, albeit rarely noticeable since autosuggestions don't block. I don't know whether SQLite would be more efficient at this kind of full text search. Of course, SQLite is much nicer to use when doing more complex queries.

@mqudsi (Contributor) commented Mar 8, 2020

You would need to use the fts extension (which ships with the upstream SQLite amalgamation) to be able to accelerate the search and avoid a full table scan. It’s quite a capable module.

The default WHERE foo LIKE '%com%' would normally result in a full table scan without an fts index, but WHERE foo LIKE 'com%' with a regular btree index should be pretty fast with prefix lookup.
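
A rough sketch of the two query styles via the sqlite3 CLI (illustrative only; fish does not use SQLite, and the table/column names here are made up):

sqlite3 hist.db "CREATE VIRTUAL TABLE hist_fts USING fts5(cmd);"
sqlite3 hist.db "SELECT cmd FROM hist_fts WHERE hist_fts MATCH 'fallback';"  # word match via the fts index
sqlite3 hist.db "SELECT cmd FROM hist WHERE cmd LIKE 'git%';"  # prefix scan, can use a btree index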

@krobelus (Member) commented Mar 10, 2020

I realize there was a case where a failing history search would block the shell: when typing some characters that don't have an autosuggestion and then pressing up, the shell would block because of all those queued autosuggestions, but this has neatly been fixed by e334bec. It seems that linear searches are fine with the current model.

The fts extension looks nice; it seems to be really damn fast at finding all the occurrences of some word: match foo only finds entries containing the word foo, based on natural-language-like word splitting, so it doesn't match echo foo/bar.
Our default should probably stay simple substring matching; fuzzy matching and regexes are also nice sometimes. For those I think a linear search is still the best option (?). The btree prefix scan would be nice for autosuggestions.

Browsers are all using SQLite for history nowadays, but of course the UI is different here. All in all, I think JSONL works fine.

EDIT: messed up the example, of course match foo does find echo foo/bar with the default tokenizer, but match fo doesn't.

@mqudsi (Contributor) commented Mar 10, 2020

Note that what you are describing is probably with the default set of stop words and tokenization. You can customize how words are tokenized and what constitutes a word even further, likely able to further tweak the results to match our needs.
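
For example (sketch only, with made-up names), FTS5's unicode61 tokenizer accepts a tokenchars option, so characters like / can be made part of tokens and paths like foo/bar stay whole:

sqlite3 hist.db "CREATE VIRTUAL TABLE hist_fts USING fts5(cmd, tokenize = \"unicode61 tokenchars '/.'\");"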

@travankor

Is there any interest in a conversion tool that goes from the existing format to the new format? Obviously, this would not delete the old history, so you would end up with two copies of it.

@BEFH commented Mar 11, 2020

This might not be a 3.2 thing, but it would be very helpful for the metadata to include the working directory in which the command was executed. I have a lot of pipelines that use identically named scripts, so this feature would be pretty useful.

@mqudsi (Contributor) commented Mar 11, 2020

Actually, I would prefer reducing the data leaked via all these side channels where possible. Would a hash of the directory suffice (so you can match against it, but not fuzzy-search against it)?
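
A sketch of what that might look like (hypothetical; sha256sum is GNU coreutils, use shasum -a 256 on macOS):

echo -n $PWD | sha256sum | cut -d' ' -f1  # store this digest in, say, a 'pwd_hash' field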

@krobelus (Member)

Is there any interest for a conversion tool that goes from existing format -> new format? Obviously, this would not delete the old history, but you would end up with two copies of the old history.

fish can always read old formats and will save the history in the new format; would this do the job?

@travankor

fish can always read old formats and will save the history in the new format; would this do the job?

Yes, that was what I was thinking of. But fish already does this? It should only do this if it can't find the new history format.

@BEFH commented Mar 14, 2020

Actually, I would prefer reducing the data leaked via all these side channels where possible. Would a hash of the directory suffice (so you can match against it, but not fuzzy-search against it)?

It would be more useful to me to have whole paths, but I do see potential for insecurity in some applications. Would it be possible for this to be configurable? The new system is extensible by design. The default could be a hash.

@ridiculousfish (Member, Author)

Storing the PWD hash is a clever idea, but I'm not sure why the PWD would be more sensitive than the commands which were run.

@faho (Member) commented Sep 2, 2020

Okay, this has bit-rotted enough that it would have to be polished up again, so there's no real use in keeping it open.

Closing!

@faho closed this Sep 2, 2020
@ridiculousfish (Member, Author)

Yeah sorry! I'll pick it up again!

@aca (Contributor) commented Sep 6, 2020

I have a huge fish_history. It makes history lookup extremely slow with fzf.
If fish history were formatted in a correct way, I could easily do some automatic cleanup of the history, like:

  • remove all "ls" commands
  • remove duplicated commands

Hope to see this get merged in 3.2.0.
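
With a JSON Lines history, that kind of cleanup could be a jq one-liner. A sketch, assuming the format proposed at the top of this PR (note that unique_by also sorts the entries):

jq -c 'select(.cmd != "ls")' fish_history.json > tmp && mv tmp fish_history.json    # remove all "ls" commands
jq -cs 'unique_by(.cmd) | .[]' fish_history.json > tmp && mv tmp fish_history.json  # remove duplicated commands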

@faho (Member) commented Sep 6, 2020

@aca

remove duplicated commands

Fish already deduplicates commands and only keeps the last 256K.

remove all "ls" commands

history delete --prefix ls does that.

@ridiculousfish (Member, Author)

@aca how large is your history?

@brettviren

This weekend I am switching to fish. So far so great. My thanks to everyone who has contributed to fish!

To drive learning the fish language, I have been rewriting a Bash package which stores command history in an SQLite DB file. Searching for various things brought me to this PR.

FWIW, I agree with the arguments against using sqlite for this new fish history file format (JSON lines sounds good to me), so I'm not here suggesting that. Rather, I want to add a link to the schema I'm using, in hopes that the new fish history file schema might support saving some (or all?) of the same elements.

https://github.com/brettviren/fishql/blob/master/functions/fishql-initdb.fish#L22

(I'm still learning fish so please forgive any (non) "fishy" looking code)

This is the same schema used by "advanced shell history", so it is a quasi-standard.

The motivation for enlarging the new fish history file schema is to support richer queries, and to support sync/dump methods between the new format and this sqlite DB, which may be made less lossy.

@mqudsi (Contributor) commented Sep 15, 2020

I know the problem @aca is referring to; it's part of the reason for my recent patches that brought a 2-3x improvement to history output (but didn't address the actual problem).

@ridiculousfish the problem is orthogonal to history file length; the real issue is that fish builtins buffer their output when piped to another process, while fzf is designed to work with streaming input: i.e. it's a question of latency, not throughput.

@krobelus (Member)

Builtins avoid buffering when outputting to non-redirected stdout/stderr since bcfc54f (Do not buffer builtin output if avoidable).
Enabling the same for files and pipes could improve latency.

@mqudsi (Contributor) commented Sep 15, 2020

fish would have to ascertain what is reading out of the pipe (or that something is reading out of the pipe) so as to avoid a deadlock (presumably this is why the buffering exists in the first place). Since the parser is currently still single-threaded, the builtin needs a big enough buffer on the output; otherwise the pipe will saturate if the reading end is not running simultaneously but is instead queued to be run internally by fish after the builtin finishes execution (i.e. it is a block, function, or another builtin), right?

A few years ago I had a branch that side-stepped the issue by keeping the current behavior but also starting a thread that would try reading from the buffer and writing to the pipe. The codebase was not as amenable to such an approach at the time and I ran into some fundamental issues, but I think it's more doable today.

@zanchey removed this from the fish 3.2.0 milestone Oct 6, 2020
@github-actions bot locked as resolved and limited conversation to collaborators Apr 4, 2021