research Dolt and DoltHub #39

Open
aspiers opened this issue Mar 26, 2023 · 3 comments

Comments

@aspiers
Owner

aspiers commented Mar 26, 2023

I have discovered Dolt (https://github.com/dolthub/dolt) and DoltHub, which look like very promising technologies. Perhaps they could be used as an alternative way of storing the data contained here. Certainly a full-blown SQL database would bring a ton of power and flexibility to this project, and when combined with a GitHub-like collaboration model with pull requests, it could be perfect.

@aspiers aspiers mentioned this issue Mar 26, 2023
6 tasks
@aspiers
Owner Author

aspiers commented Mar 26, 2023

See also my "git sourcing" idea: githubocto/flat#64

@strk
Contributor

strk commented Jul 14, 2024

I'm not convinced a new format is needed. The biggest value of a project like this is the stability of the format specification, which allows multiple projects to depend on it for both reading and writing. That is, unless I have misunderstood what Dolt is about (it isn't clear to me).

Note that PostgreSQL supports querying CSV files as if they were tables in the database via Foreign Data Wrappers:
https://www.postgresql.org/docs/current/file-fdw.html
See for example https://gist.github.com/NikolayS/a819f139c37e0d54ad4a4ca70764f225
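
For illustration only, the same idea (CSV stays the stored format, SQL is used just for querying) can be sketched without PostgreSQL. The snippet below does not use file_fdw; it swaps in Python's built-in csv and sqlite3 modules and copies the rows into an in-memory table. The column names and sample rows are hypothetical, not this repo's actual schema.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the real CSV columns in this repo may differ.
SAMPLE_CSV = """term,page
anchor,12
anchor,48
ballast,7
"""


def load_csv_into_sqlite(csv_text, table="entries"):
    """Copy CSV rows into an in-memory SQLite table (a stand-in for file_fdw)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (term TEXT, page INTEGER)")
    # Named placeholders are filled from each row dict.
    conn.executemany(
        f"INSERT INTO {table} (term, page) VALUES (:term, :page)", rows
    )
    return conn


conn = load_csv_into_sqlite(SAMPLE_CSV)
# Count how many page references each term has.
pages_per_term = conn.execute(
    "SELECT term, COUNT(*) FROM entries GROUP BY term ORDER BY term"
).fetchall()
print(pages_per_term)
```

Unlike a foreign data wrapper, this copies the data rather than reading the file in place, so it is only meant to show that the CSV files themselves would not need to change.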

@aspiers
Owner Author

aspiers commented Jul 15, 2024

The stability of the schema is largely orthogonal to whether we use CSV or an RDBMS like Dolt. E.g. we could use CSV but create undesirable instability by regularly reordering/renaming columns etc. Or we could use an RDBMS and keep it very stable by never changing the schema.

The main attractions of Dolt stem from an RDBMS being much more flexible than CSV, e.g.

  • stricter data types and hence validation (e.g. forcing page number fields to be numbers)
  • enforcement of mandatory vs. optional fields
  • multiple tables with foreign keys defining relationships between them (which helps with things like #46, "Embed book identification information in index header")
  • easier to address #12, "extend schema"
  • sophisticated query engine for free
  • easy export to CSV would allow backwards compatibility with this repo
  • bidirectional syncing is feasible too
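
A few of the points above (typed page numbers, mandatory fields, foreign keys, CSV export) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in rather than Dolt itself, and the `books` / `entries` tables and their columns are hypothetical, not a proposed schema.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE books (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE entries (
    book_id INTEGER NOT NULL REFERENCES books(id),
    term    TEXT NOT NULL,
    -- SQLite is loosely typed, so the CHECK makes the integer rule explicit.
    page    INTEGER NOT NULL CHECK (typeof(page) = 'integer')
);
""")
conn.execute("INSERT INTO books (id, title) VALUES (1, 'Example Book')")
conn.execute("INSERT INTO entries VALUES (1, 'anchor', 12)")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO entries VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "non-numeric page": try_insert((1, "ballast", "xii")),
    "missing term": try_insert((1, None, 7)),
    "unknown book": try_insert((99, "keel", 3)),
    "valid row": try_insert((1, "keel", 3)),
}


def export_csv():
    """Dump the entries table back to CSV text for backwards compatibility."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["book_id", "term", "page"])
    writer.writerows(
        conn.execute("SELECT book_id, term, page FROM entries ORDER BY term")
    )
    return out.getvalue()


print(results)
print(export_csv())
```

Only the valid row gets through; the other three are rejected by the database itself rather than by separate validation code.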

Also DoltHub gives us a very nice frontend for free which is specifically designed for decentralized collaboration on data sets, unlike GitHub.

But I admit it would be an increase in complexity too. Another option is introducing CI which does validation on the existing data.
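
As a rough idea of what the CI option might look like, here is a minimal sketch of a validation function that a CI job could run over each CSV file. The required column names (`term`, `page`) and the rule that pages are plain digits are assumptions for illustration; the real rules would need to follow this repo's format specification.

```python
import csv
import io

REQUIRED = ("term", "page")  # hypothetical mandatory columns


def validate_csv(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        line_no = reader.line_num  # source line of the row just read
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                errors.append(f"line {line_no}: missing {field}")
        page = (row.get("page") or "").strip()
        if page and not page.isdigit():
            errors.append(f"line {line_no}: page is not a number: {page!r}")
    return errors


sample = "term,page\nanchor,12\n,7\nballast,xii\n"
problems = validate_csv(sample)
for problem in problems:
    print(problem)
```

A CI job could call this for every data file and exit with a non-zero status when the returned list is non-empty, which gives some of the validation benefits without changing the storage format.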
