regression: Missing HTML content #219

Open
rgaudin opened this issue Mar 5, 2024 · 22 comments

@rgaudin
Member

rgaudin commented Mar 5, 2024

In gutenberg_en_all_2024-02, out of the 10 books listed (all of which claim to offer an HTML version), only three actually have one.

Screenshot 2024-03-05 at 10 06 08

Either the HTML version is not present in the ZIM or the link is incorrect (it's the same link in the listing and on the preview page). This is not limited to those 10 entries, but it makes this 75GB ZIM look like garbage.

Initially reported by an Offspot user.

@rgaudin rgaudin added the bug label Mar 5, 2024
@benoit74 benoit74 self-assigned this Mar 5, 2024
@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Who ...

If you search for Mary Wollstonecraft Shelley in the author search box, you will realize there are 3 versions of the Frankenstein book (book ids 84, 41445, 42324)

Same for Moby Dick, 2 versions. Probably the same for others as well.

I looked at the ePubs (which are OK) and the content is slightly different, so there is clearly a difference between these books.

I looked at https://aleph.pglaf.org/cache/epub/84/ and we see there is an HTML version of book 84, with illustrations.

I will have to reproduce locally, but as usual it will take some time to rebuild the local database from the rsync results.

@Popolechien

Popolechien commented Mar 5, 2024

Ok I’ve tried the first two pages and about 2/3 of the books are missing. It gets better as one goes deeper, but it is the first impression that matters.

I’ve tested a half dozen other languages, no problem there but there weren't many (or any) that had several versions of the same book.

I have put the recipe on hold on Zimfarm

@benoit74 can you please delete
https://download.kiwix.org/zim/gutenberg/gutenberg_mul_all_2024-02.zim
https://download.kiwix.org/zim/gutenberg/gutenberg_mul_all_2024-01.zim
https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2024-01.zim
and https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2024-02.zim

@rgaudin
Member Author

rgaudin commented Mar 5, 2024

@Popolechien please make a removal request on zim-requests with the appropriate flag, otherwise we'll lose track of it and you'll open a ticket in 6 months asking where the gutenberg files are 😉

@kelson42
Contributor

kelson42 commented Mar 5, 2024

I'm concerned about a general problem here which might lead to pausing all recipes... or do we have a chance to know exactly which ZIM are impacted?

@kelson42 kelson42 added this to the 2.2.0 milestone Mar 5, 2024
@benoit74
Collaborator

benoit74 commented Mar 5, 2024

> I'm concerned about a general problem here which might lead to pausing all recipes... or do we have a chance to know exactly which ZIM are impacted?

You mean, all gutenberg recipes, right? (there is only one btw)

@Popolechien's tests have shown that only en (and hence mul) seems to be impacted, as far as we can tell (probably wrong, but occurrences are at least way less visible in other languages).

@Popolechien: you wanna remove both ZIMs we have because you tested both and they both have the issue?

Are we sure we want to do this (not provide the en + mul ZIM anymore) given the fact that ePub / PDF is still available?

Should we run a new (temporary) recipe for en + mul with only PDF + ePub as requested formats, so that at least the ZIM does not contain invalid links? We could name it the nohtml flavor. It is just a matter of configuration normally.

@kelson42
Contributor

kelson42 commented Mar 5, 2024

This seems a pretty serious issue IMHO. If I get everything right, the wise thing would be to deactivate the gutenberg recipe until this issue is closed and a new release is done.

@Popolechien

Popolechien commented Mar 5, 2024

@rgaudin yeah my bad for some reason I thought the ticket was there already.
@benoit74 Yes both en and mul. If 2/3 of the content seems missing on the first page (and I stress that it seems missing) then that's very sub-par UX. Considering the size of the zim and the time/data costs invested to download it, I'd rather not impose this on users.

I was not aware of the possibility of running the recipe with PDF and ePub only, but that seems acceptable, yes.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

You tested both 2024_01 and 2024_02 ZIMs, both have the issue?

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Recipe for only epub+pdf is here: https://farm.openzim.org/recipes/gutenberg_mul_epub-pdf

Can you confirm we want this and that you did not spot anything stupid? I've activated the "multiple ZIM" mode: should we discover that we have the issue in other languages as well, we will be happy to have ZIMs in all languages. It should take about 1 day to produce, if I trust the last run duration.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

> This seems a pretty serious issue IMHO. If I get everything right, the wise thing would be to deactivate the gutenberg recipe until this issue is closed and a new release is done.

This was done more than an hour ago.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

OK, so regarding the "real issue", I have debugged the scraper logic for book 84.

Foreword: this scraper logic is a nightmare, I won't dive into details

As you've probably already guessed, there are basically two issues:

  • the scraper does not care that HTML version has not been found when it renders the UI
  • the scraper fails to find the HTML version of book 84 (even though it exists in many formats)

scraper does not care that HTML version has not been found when it renders the UI

I suspected that the first part could be a regression introduced by #163, but I don't think so. At least, it seems that the situation was improved by this PR but not fully fixed: before this PR, buttons were always displayed when the book was supposed to have a given format available according to the RDF; with the PR (now), the buttons are hidden if a given format is not requested; we should go further and also hide the button if we fail to download the requested format.
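
The rule proposed above (show a button only when the format is declared in the RDF, requested, and actually downloaded) could be sketched as follows. This is a minimal illustration, not the scraper's actual code: the function name and its three inputs are hypothetical.

```python
def visible_format_buttons(declared, requested, downloaded):
    """Return the formats whose button should be rendered for one book.

    Hypothetical helper, not taken from the scraper:
    - declared:   formats the RDF metadata says the book offers
    - requested:  formats asked for in the recipe
    - downloaded: formats that were actually fetched for this book
    """
    return [fmt for fmt in declared if fmt in requested and fmt in downloaded]


# Book 84 in the broken ZIM: HTML is declared and requested but was never found.
buttons = visible_format_buttons(
    declared=["html", "epub", "pdf"],
    requested={"html", "epub", "pdf"},
    downloaded={"epub", "pdf"},
)
print(buttons)  # ['epub', 'pdf']
```

With only the first two checks (the current behavior as described), the HTML button would still be rendered and lead nowhere.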

scraper fails to find the HTML version of book

For book 84, the HTML versions available at https://www.gutenberg.org/files/84/84-h/84-h.htm or at https://www.gutenberg.org/cache/epub/84/pg84-images.html (which is also where the "magic logic" URL from @eshellman, https://www.gutenberg.org/ebooks/84.html.images, redirects) are not among the 20 potential URLs considered by the scraper (see list below).

"http://aleph.pglaf.org/8/84/84-h.htm"
"http://aleph.pglaf.org/8/84/84-h.html"
"http://aleph.pglaf.org/8/84/84-h.zip"
"http://aleph.pglaf.org/cache/epub/84/pg84.html.utf8"
"http://aleph.pglaf.org/etext00/84-h.htm"
"http://aleph.pglaf.org/etext01/84-h.htm"
"http://aleph.pglaf.org/etext02/84-h.htm"
"http://aleph.pglaf.org/etext03/84-h.htm"
"http://aleph.pglaf.org/etext04/84-h.htm"
"http://aleph.pglaf.org/etext05/84-h.htm"
"http://aleph.pglaf.org/etext90/84-h.htm"
"http://aleph.pglaf.org/etext91/84-h.htm"
"http://aleph.pglaf.org/etext92/84-h.htm"
"http://aleph.pglaf.org/etext93/84-h.htm"
"http://aleph.pglaf.org/etext94/84-h.htm"
"http://aleph.pglaf.org/etext95/84-h.htm"
"http://aleph.pglaf.org/etext96/84-h.htm"
"http://aleph.pglaf.org/etext97/84-h.htm"
"http://aleph.pglaf.org/etext98/84-h.htm"
"http://aleph.pglaf.org/etext99/84-h.htm"
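
For reference, the list above follows a regular pattern that can be reproduced with a short sketch. This is reconstructed purely from the URLs listed in this comment, not copied from the scraper; the function names are hypothetical and the single-digit branch is an assumption that is not exercised here.

```python
MIRROR = "http://aleph.pglaf.org"

# Legacy "etextNN" folders seen in the list: etext00..etext05, then etext90..etext99.
ETEXT_DIRS = [f"etext{n:02d}" for n in (*range(0, 6), *range(90, 100))]


def numbered_dir(book_id: int) -> str:
    """Folder in the "1/2/3/4/5" tree: every digit but the last, then the id."""
    digits = str(book_id)
    if len(digits) == 1:
        return f"0/{digits}"  # assumption, not shown in this thread
    return "/".join([*digits[:-1], digits])


def candidate_html_urls(book_id: int, mirror: str = MIRROR) -> list[str]:
    """Rebuild the 20 candidate URLs in the order shown in the list above."""
    base = f"{mirror}/{numbered_dir(book_id)}/{book_id}"
    urls = [
        f"{base}-h.htm",
        f"{base}-h.html",
        f"{base}-h.zip",
        f"{mirror}/cache/epub/{book_id}/pg{book_id}.html.utf8",
    ]
    urls += [f"{mirror}/{folder}/{book_id}-h.htm" for folder in ETEXT_DIRS]
    return urls


for url in candidate_html_urls(84):
    print(url)
```

Neither `8/84/84-h/84-h.htm` nor `cache/epub/84/pg84-images.html` can ever come out of this pattern, which is why book 84 ends up without HTML.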

For book 41445 (which works), the HTML version is found at http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip

Full list of potential URLs for 41445 below:

"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.htm"
"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.html"
"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip" <= found in RSYNC results, present on server
"http://aleph.pglaf.org/cache/epub/41445/pg41445.html.utf8"
"http://aleph.pglaf.org/etext00/41445-h.htm"
"http://aleph.pglaf.org/etext01/41445-h.htm"
"http://aleph.pglaf.org/etext02/41445-h.htm"
"http://aleph.pglaf.org/etext03/41445-h.htm"
"http://aleph.pglaf.org/etext04/41445-h.htm"
"http://aleph.pglaf.org/etext05/41445-h.htm"
"http://aleph.pglaf.org/etext90/41445-h.htm"
"http://aleph.pglaf.org/etext91/41445-h.htm"
"http://aleph.pglaf.org/etext92/41445-h.htm"
"http://aleph.pglaf.org/etext93/41445-h.htm"
"http://aleph.pglaf.org/etext94/41445-h.htm"
"http://aleph.pglaf.org/etext95/41445-h.htm"
"http://aleph.pglaf.org/etext96/41445-h.htm"
"http://aleph.pglaf.org/etext97/41445-h.htm"
"http://aleph.pglaf.org/etext98/41445-h.htm"
"http://aleph.pglaf.org/etext99/41445-h.htm"
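
The difference between the two books then comes down to a membership check. A minimal sketch, assuming the scraper keeps the first candidate that appears in the rsync results: the selection rule, the function name, and the contents of `rsync_listing` are illustrative assumptions, and the candidate lists are shortened to their first entries.

```python
def first_available(candidates, known_files):
    """Return the first candidate URL present in the listing, else None."""
    return next((url for url in candidates if url in known_files), None)


# Illustrative stand-in for the rsync results (not the real listing).
rsync_listing = {
    "http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip",
    "http://aleph.pglaf.org/8/84/84-h/84-h.htm",
}

candidates_84 = [
    "http://aleph.pglaf.org/8/84/84-h.htm",
    "http://aleph.pglaf.org/8/84/84-h.html",
    "http://aleph.pglaf.org/8/84/84-h.zip",
    "http://aleph.pglaf.org/cache/epub/84/pg84.html.utf8",
]
candidates_41445 = [
    "http://aleph.pglaf.org/4/1/4/4/41445/41445-h.htm",
    "http://aleph.pglaf.org/4/1/4/4/41445/41445-h.html",
    "http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip",
    "http://aleph.pglaf.org/cache/epub/41445/pg41445.html.utf8",
]

print(first_available(candidates_41445, rsync_listing))
print(first_available(candidates_84, rsync_listing))
```

Book 41445 matches on its `-h.zip` entry, while book 84 matches nothing even though an HTML file for it exists under a path the candidates never cover.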

what next

We cannot add the https://aleph.pglaf.org/8/84/84-h/84-h.htm pattern to the list generated above because the 8/84/84-h folder is normally reserved for the "extracted ZIP" version, i.e. in this folder we find not only the HTML but also all the images. In such a situation we do not want to grab only the HTML, since we also need all the images for proper rendering.
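
To illustrate why the HTML file alone is not enough, here is a minimal sketch that separates the HTML entries of a `-h.zip` style archive from the assets that must travel with them. The function name is hypothetical and the archive contents in the usage example are made up for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath

HTML_SUFFIXES = {".htm", ".html"}


def split_archive(zip_bytes: bytes) -> tuple[list[str], list[str]]:
    """Return (html_entries, asset_entries) found in a ZIP archive."""
    html, assets = [], []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            if PurePosixPath(name).suffix.lower() in HTML_SUFFIXES:
                html.append(name)
            else:
                assets.append(name)
    return html, assets


# Build a tiny in-memory archive with made-up entry names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("84-h/84-h.htm", "<html><img src='images/cover.jpg'></html>")
    archive.writestr("84-h/images/cover.jpg", b"\xff\xd8")

html_entries, asset_entries = split_archive(buffer.getvalue())
print(html_entries, asset_entries)
```

Fetching only the entry in `html_entries` would leave every `<img>` in the page broken, which is the situation described above.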

I'm not very inclined to fix only the fact that the scraper does not care that the HTML version has not been found when it renders the UI because, as far as I've understood, the HTML version is very important for our users (see comments on #161). Fixing only this could help as an interim solution to "at least build a relevant ZIM without buttons leading nowhere", but I do not recommend this approach, which is only putting lipstick on a pig.

I think that at this point we need to invest time in seriously simplifying the scraper code to get rid of all the "fallback" mechanisms we have, which are only biting us now.

In other words, finally implement what was imagined and more or less prepared in #97 (I just renamed it; we won't move to the OPDS catalog according to the latest discussions in the issue):

  • get rid of RSYNC which is consuming lots of time / disk IOs
  • get rid of fallback URLs / names which are present everywhere
  • for every situation where something unexpected happens (there should not be many), raise an issue with Project Gutenberg, which is ready to help, so that the problem is fixed upstream
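
A minimal sketch of what "no fallback" could look like: one deterministic URL per book, and a loud failure instead of a silent miss. It assumes the `pg{id}-images.html` name seen for book 84 under `cache/epub` holds for other books too, which this thread does not confirm; `html_url`, `fetch_html`, `MissingFormatError` and the injected `fetch` callable are all hypothetical names.

```python
class MissingFormatError(Exception):
    """Raised when the single expected file is absent (to be reported upstream)."""


def html_url(book_id: int, base: str = "https://www.gutenberg.org") -> str:
    # Assumption: same naming as the book 84 URL quoted earlier in this comment.
    return f"{base}/cache/epub/{book_id}/pg{book_id}-images.html"


def fetch_html(book_id: int, fetch) -> bytes:
    """Fetch the HTML for one book, without trying any alternative location.

    `fetch` is any callable taking a URL and returning bytes, or None if missing.
    """
    url = html_url(book_id)
    content = fetch(url)
    if content is None:
        raise MissingFormatError(f"book {book_id}: no HTML at {url}")
    return content


print(html_url(84))  # https://www.gutenberg.org/cache/epub/84/pg84-images.html
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library and makes the missing-file path easy to exercise.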

WDYT?

@Popolechien

LGTM, thanks a lot.

Regarding the interim recipe, I've disabled the multiple languages output (we would have duplicate files with almost the same name and very limited added value; I find this confusing rather than helpful) - let's see this as an English problem and an English fix. I have changed the language settings (and recipe name) accordingly, please double check before launching the recipe.

I have also disabled the bookshelves feature since, according to #184, the feature is apparently not maintained by the Gutenberg folks.

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Interim recipe started.

Be aware that doing it only for English also means we will not provide the mul big ZIM anymore in the interim.

I don't get what the problem is with the mostly similar names; we already have this situation for Wikipedia with its flavors. Mostly the same name, same title, same description; only the size differs. It is only a UI issue.

@Popolechien

Popolechien commented Mar 5, 2024

> Be aware that doing it only for English also means we will not provide the mul big ZIM anymore in the interim.

I'm fine with that, its use case always seemed dubious to me in the first place.

Regarding the Wikipedia example, that's exactly the problem I had in mind (the question comes regularly as to why these three and what the difference is, despite all the FAQ, message, etc.)

@eshellman
Collaborator

Over the past 2-3 years, a lot of effort has been put into upgrading all 70,000 PG books to validated html5 and epub3. There are two trees in the file system, the "1/2/3/4/5" tree and the "cache/epub" tree. The generated epub3 and html5 files are in the "cache/epub" tree. Both of these are in the aleph mirror. I don't remember how we were handling epub, but the generated HTML5 was not yet available when this was last implemented.

As you might expect, the generated html5 is much more uniform in quality compared to the source files, which come in all sorts of htm and txt flavors!

@benoit74
Collaborator

benoit74 commented Mar 7, 2024

https://farm.openzim.org/recipes/gutenberg_en_epub-pdf did not produce the expected outcome. I forgot again that the HTML format is mandatory (see #161); we can only request not to put epub or pdf in the ZIM ...

I've disabled the recipe (we can probably delete it, it is only misleading) and the ZIM (still suffering the same HTML issue).

@kelson42 do you consider this a fast-track issue which needs to be fixed asap (i.e. with more priority than the other projects I have)?

@benoit74
Collaborator

benoit74 commented Mar 7, 2024

> Over the past 2-3 years, a lot of effort has been put into upgrading all 70,000 PG books to validated html5 and epub3. There are two trees in the file system, the "1/2/3/4/5" tree and the "cache/epub" tree. The generated epub3 and html5 files are in the "cache/epub" tree. Both of these are in the aleph mirror. I don't remember how we were handling epub, but the generated HTML5 was not yet available when this was last implemented.
>
> As you might expect, the generated html5 is much more uniform in quality compared to the source files, which come in all sorts of htm and txt flavors!

I now really consider it mandatory to make the necessary changes to fix #97 and have a scraper which is faster, easier to maintain, and produces a ZIM of more uniform quality.

@kelson42
Contributor

kelson42 commented Mar 7, 2024

@benoit74 How much work do you estimate it would take to bring things back to normal in good and sustainable conditions?

@benoit74
Collaborator

benoit74 commented Mar 7, 2024

@kelson42 In man-days, probably 5 to 10 days (including PoC, reviews, ...). In elapse ...

@eshellman
Collaborator

eshellman commented Mar 8, 2024 via email

@eshellman
Collaborator

I've added an update to #97 that I hope will help

@benoit74
Collaborator

Thank you!
