
Redesign CLI around the new JSON file format #59

Open
emk opened this issue May 4, 2024 · 5 comments

emk (Owner) commented May 4, 2024

We want a new JSON-based file format that can represent subtitles or books, and that contains all versions of the work, plus any metadata needed to support card creation.

emk (Owner) commented May 4, 2024

@aaron-meyers re: #12

Since you are currently turning substudy output into ebooks (which is a fantastic idea), I wanted to show you my current draft of the new file format.

Example book (with tags, notes, and a paragraph implemented as a nested alignment):
{
  "creators": [
    "Miguel de Cervantes Saavedra"
  ],
  "title": "El ingenioso hidalgo don Quijote de la Mancha",
  "year": 1605,
  "tracks": {
    "es": {
      "type": "html",
      "origin": "original",
      "lang": "es"
    },
    "en": {
      "type": "html",
      "origin": "ai_generated",
      "generated_by": "gpt-4",
      "derived_from_track_id": "es",
      "lang": "en"
    },
    "notes": {
      "type": "notes"
    }
  },
  "tags": [
    "classic"
  ],
  "base_track_id": "es",
  "alignments": [
    {
      "id": "2acdeaf4-7b0c-4f78-abf2-dc299ab362e9",
      "heading": 1,
      "tracks": {
        "es": {
          "html": "El ingenioso hidalgo don Quijote de la Mancha"
        },
        "en": {
          "html": "The Ingenious Gentleman Don Quijote of La Mancha"
        }
      }
    },
    {
      "id": "f4b3b3b4-4b3b-4b3b-4b3b-4b3b4b3b4b3b",
      "heading": 2,
      "tracks": {
        "es": {
          "html": "Capítulo I. Que trata de la condición y ejercicio del famoso hidalgo don Quijote de la Mancha"
        },
        "en": {
          "html": "Chapter I. Which treats of the condition and exercise of the famous gentleman don Quijote of La Mancha"
        }
      }
    },
    {
      "alignments": [
        {
          "id": "f5fb686f-b0ab-486c-9e7d-40c4abd51bc7",
          "tracks": {
            "es": {
              "html": "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor."
            },
            "en": {
              "html": "In a place of La Mancha, whose name I do not wish to recall, not long ago there lived a gentleman of the type with a lance in the rack, an ancient shield, a skinny steed, and a racing greyhound."
            },
            "notes": {
              "html": "<ul><li>\"acordarme\" is a reflexive verb that means \"to remember\"; the reflexive pronoun \"me\" is used to indicate that the action is being done to oneself.</li></ul>"
            }
          },
          "tags": [
            "star"
          ]
        },
        {
          "id": "f24a1744-45f9-4b00-98b1-c7a4c27a5a12",
          "tracks": {
            "es": {
              "html": "Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lantejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda."
            },
            "en": {
              "html": "A pot of stew more beef than mutton, minced meat most nights, grievous discomforts on Saturdays, lentils on Fridays, and an occasional pigeon as a treat on Sundays, consumed three parts of his estate."
            }
          }
        }
      ]
    }
  ]
}
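For consumers of the format: group nodes (like the paragraph above) can be told apart from leaf alignments by the presence of a nested "alignments" key. A rough Python sketch of that traversal, using made-up ids and abbreviated tracks rather than the real schema:

```python
import json

def flatten_alignments(alignments):
    """Yield leaf alignments, descending into group nodes (nested paragraphs)."""
    for node in alignments:
        if "alignments" in node:
            yield from flatten_alignments(node["alignments"])
        else:
            yield node

# Abbreviated stand-in for the book example above (made-up ids).
doc = json.loads("""
{
  "alignments": [
    {"id": "a1", "heading": 1, "tracks": {"es": {"html": "Titulo"}}},
    {"alignments": [
      {"id": "a2", "tracks": {"es": {"html": "Frase 1."}}},
      {"id": "a3", "tracks": {"es": {"html": "Frase 2."}}}
    ]}
  ]
}
""")

print([a["id"] for a in flatten_alignments(doc["alignments"])])
# → ['a1', 'a2', 'a3']
```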
Example episode
{
  "series": {
    "series_title": "Les aventures de Jean & Luc",
    "index_in_series": 1
  },
  "title": "Episode 01.01",
  "tracks": {
    "base": {
      "origin": "original",
      "type": "media",
      "lang": "fr",
      "file": "files/episode1.mp4"
    },
    "subs.fr": {
      "origin": "ai_generated",
      "generated_by": "whisper-1",
      "derived_from_track_id": "base",
      "type": "html",
      "lang": "fr"
    },
    "subs.en": {
      "origin": "ai_generated",
      "generated_by": "gpt-3.5-turbo",
      "derived_from_track_id": "subs.fr",
      "type": "html",
      "lang": "en"
    }
  },
  "base_track_id": "base",
  "alignments": [
    {
      "id": "56523fb0-b4c5-40d4-bb08-4c59fb027dbb",
      "time_span": [
        10,
        15.5
      ],
      "tracks": {
        "subs.fr": {
          "html": "<i>Jean &amp; Luc:</i> On y va !"
        },
        "subs.en": {
          "html": "<i>Jean &amp; Luc:</i> Let's go!"
        }
      }
    }
  ]
}
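Assuming the time span is expressed in seconds (which the 10 / 15.5 values above suggest), round-tripping an alignment back to an SRT cue is straightforward. A hypothetical helper, not part of the proposed format:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

start, end = 10, 15.5  # the alignment's time span from the example above
print(f"{srt_timestamp(start)} --> {srt_timestamp(end)}")
# → 00:00:10,000 --> 00:00:15,500
```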

There's a commented Rust "schema" of the file format here.

Re: line breaks. Preserving line breaks by using <br> in the "html": fields is certainly one option. I could also consider adding something analogous to the CSS white-space property to control default whitespace handling.
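For example, a track entry might gain an optional field like this (the name and values are just a strawman here, mirroring CSS):

```json
{
  "type": "html",
  "lang": "fr",
  "white_space": "pre-line"
}
```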

My basic goals here are to:

  1. Provide some way to have a proper ebook reader with subheadings and paragraphs.
  2. Provide some way to have a video with a bilingual subtitle list. Basically like Language Reactor, except we had our own prototypes long ago. :-)
  3. Provide some way to better support extensive watching & reading with only occasional card extraction. This is where notes, tags, and UUIDs all come in handy.

If a format like this existed, would you be interested in either:

  • Exporting directly from this JSON format?
  • Or having substudy first export to another format?

aaron-meyers commented

Yes, this is fantastic - I was thinking about this but didn't explicitly mention it. If you looked at my script, you probably saw that it currently parses the HTML output from substudy export review, but a JSON format like this would be much cleaner.

Overall, your JSON format shares many goals with the YAML-based formats I've been working on in my tandoku project. The core file format in my project, 'content' files, is basically an attempt at producing an aligned-media file format that can be used across a variety of media input and output types. I've specifically considered video+subtitles, ebooks (text, graphic novels, picture books), and video game scripts as sources, all of which could be imported into a common aligned-media content format and then exported to output formats like EPUB or HTML slides.

The script I referenced earlier doesn't actually use my aligned-media file format yet - it was a quick-and-dirty direct conversion from your HTML output to an EPUB. I did recently build some tooling to import Anki decks (with an image and native/reference text) into my file format, and then some tools to export that to EPUB or HTML slides. I used it to import a video game script deck from Anki and output it as an offline HTML site that I can use on Steam Deck.

It should be trivial for me to import your JSON format into my format, so I'm looking forward to this when you're ready!

emk (Owner) commented May 6, 2024

The core file format in my project, 'content' files, are basically an attempt at producing an aligned-media file format that can be used across a variety of media input and output types. I've specifically considered video+subtitles, ebooks (text, graphic novels, picture books), and video game scripts as sources which could all be imported into a common aligned media content format and then exported to output formats like EPUB or HTML slides.

Sounds fantastic!

I am definitely interested in collaborating on formats for aligned media. Many years ago, I made a brief attempt to come up with a shared format for aligned media, but it didn't go anywhere. And in retrospect, that format has no way to represent headings or paragraphs, so it turned out to be pretty painful for ebooks.

I am imagining a workflow something like:

$ substudy import media episode_1.mov subtitles.es.srt --out=episode_1.substudy
$ ls episode_1.substudy/
metadata.json
$ substudy list tracks episode_1.substudy
base
subs_es
$ substudy add translation episode_1.substudy subs_es --to-lang=en
$ substudy add images episode_1.substudy
$ substudy list tracks episode_1.substudy
base
subs_es
subs_en
images
$ substudy export anki episode_1.substudy ...

But I want metadata.json to be stable enough that you can use things like jq to grab data and then process it however you want:

cat episode_1.substudy/metadata.json | \
    jq '.alignments[] | .. | select(.tracks?) | [.time_span, .tracks.subs_es.html, .tracks.subs_en.html, .tracks.image.file]'
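Or, the same sort of extraction in plain Python, for anyone who prefers a script to jq (the field names here follow the jq filter above and are still subject to change):

```python
import json

def walk(alignments):
    """Recursively visit alignments, yielding every node that has tracks."""
    for node in alignments:
        if "tracks" in node:
            yield node
        yield from walk(node.get("alignments", []))

# Tiny stand-in for metadata.json (same field names as the jq filter).
metadata = json.loads("""
{"alignments": [
  {"id": "x", "time_span": [10, 15.5],
   "tracks": {"subs_es": {"html": "¡Vamos!"}, "subs_en": {"html": "Let's go!"}}}
]}
""")

for node in walk(metadata["alignments"]):
    t = node["tracks"]
    print([node.get("time_span"),
           t.get("subs_es", {}).get("html"),
           t.get("subs_en", {}).get("html")])
# → [[10, 15.5], '¡Vamos!', "Let's go!"]
```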

And yes, other formats are interesting!

  • I absolutely do want the file format to support graphic novels, although I'd probably need to use something like this tool to prepare them. Interestingly, GPT-4 can actually OCR individual comic panels surprisingly well, though it's not cheap.
  • For full-scale novels, I think it's probably best to use an external tool like Bertalign to do the actual alignment. Trying to get these Python+GPU models to run locally is way beyond the technical abilities of most language learners, so I don't want to include them directly in substudy.

Anyway, I am definitely interested in feedback! Do you think it might be worthwhile setting up a Discord (or something similar) for discussing content file formats?

aaron-meyers commented May 16, 2024

Sorry for the delay - work has been really busy the past couple weeks and then I've been on a trip. Setting up a Discord sounds like a good idea!

Your proposed workflow is very similar to some of the things I've implemented - take a look at tandoku/scripts for examples.

I tend to adopt terminology from others when discussing a topic, but I should call out that the 'aligned' aspect of the file formats I've been working on is technically optional. The core goal of my content format was to provide a standard way to represent media from a variety of sources so that I could build common tools and workflows rather than a bunch of media-specific ones.

I wanted to be able to do things like extract word statistics, keep track of known words and estimate the % of known words in some target media, as well as aligning media and building ebooks, or even an app for consuming media with built-in dictionary lookup and word tracking. There are some Japanese-specific things as well, like dealing with kanji and adding readings to unknown kanji.

Most of these are just ideas; I've only managed to implement a few specific flows over the last several years. I should make some time to write down my goals somewhere in my repo (most of my notes are currently in a private OneNote notebook).

Anyway, happy to chat - seems like you've been doing quite a bit in this repo recently!

I'm not sure how much GPT-4 costs for OCR, but I've used Azure and Google cloud OCR and they do a very good job with pretty reasonable pricing ($1.50 for 1000 pages) - although I have $150 of Azure credit from Microsoft each month, so I haven't actually paid anything 😉 Recently I've actually been using the Panels app on my iPad, which supports Apple's Live Text (on-device OCR) - with a dictionary app as a "slide over" from the side, it works pretty well.

I would still like to at least align pages of graphic novels (e.g. interleave Japanese and English pages in a single CBZ file) - this technically doesn't require OCR, but at least some info on panels / text regions could help with automatic alignment. And using OCR to generate my content file format, so I can run word statistics and so on, is something I'd still like to do at some point.

emk (Owner) commented Jun 5, 2024

I have been a little busy with other things, but with any luck I'll be back to this project before long!
