Consider a better support of ZIM files without books in HTML #95

kelson42 · 2019-11-12T13:50:30Z

I think we should consider better support for ZIM files without HTML. The reasons are:

This would save disk space (I estimate a 30% reduction of the total size)
It would make it easier to integrate third-party ebook sources which don't (easily) provide an HTML version of their content

Currently I see two big reasons to keep the HTML versions:
1 - The full-text search engine applies to HTML only
2 - The ability to see the content directly

These two things might be fixed by:
1 - Supporting full-text indexing of EPUBs (relatively easy), see openzim/libzim#289
2 - Providing readers for multiple platforms within the ZIM... maybe even a pure web EPUB reader?

kelson42 · 2019-11-12T13:50:48Z

@eshellman This ticket might be of interest for you

Popolechien · 2019-11-12T13:52:29Z

Sure, sounds good but what would the final output look like compared to what we have now?

kelson42 · 2019-11-12T13:54:19Z

> Sure, sounds good but what would the final output look like compared to what we have now?

@Popolechien The same, but without the HTML book directly usable from the browser. In its place we would have an info page explaining how to read the EPUB file in browsers, on mobile, on a computer, etc.

Popolechien · 2019-11-12T14:09:21Z

@kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

kelson42 · 2019-11-12T14:14:04Z

> @kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

@Popolechien This could be done, and can in fact already be done, but it is not the point of this ticket, which is about providing a better UX without HTML. But maybe you just want to say "I don't think we need that: we should provide one with HTML and one with EPUB and people can only have one or the other and live with that."

Popolechien · 2019-11-12T14:53:47Z

> we should provide one with HTML and one with EPUB and people can only have one or the other and live with that.

Yes.

kelson42 · 2019-11-14T15:59:59Z

@Popolechien To me this would be a fallback solution. But I believe we might be able to solve the problem properly.

We might be able to solve (2) even better by using a pure JavaScript EPUB reader, so that for the end user the experience would be similar to having the HTML in the ZIM file. We could for example use https://github.com/futurepress/epub.js/

eshellman · 2019-11-14T18:35:46Z

The only tricky thing in deploying epub.js is overcoming same-origin JavaScript issues, but you are probably experienced with that.

stale · 2020-08-22T18:11:00Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 · 2022-12-20T16:47:25Z

Once #136 is implemented, we should be able to implement this ticket. The scraper would download the EPUB and parse it to extract the keywords for the search engine. epub.js should be able to make the EPUB directly readable in the ZIM (to be tested).
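
To make the parsing step concrete, here is a minimal sketch of how a scraper could pull plain text out of an EPUB for indexing, using only the Python standard library. It assumes a Python scraper and a well-formed, unencrypted EPUB; `epub_to_text` and `_TextExtractor` are hypothetical names, not existing scraper code, and percent-encoded hrefs are not handled.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from html.parser import HTMLParser

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def epub_to_text(epub_file):
    """Return the text of an EPUB, one line per spine item, in reading order."""
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml points to the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(archive.read(opf_path))
        # The manifest maps item ids to file paths relative to the OPF.
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        parts = []
        # The spine lists the items in reading order.
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = manifest.get(itemref.get("idref"))
            if href is None:
                continue
            raw = archive.read(posixpath.join(base, href))
            extractor = _TextExtractor()
            extractor.feed(raw.decode("utf-8", "replace"))
            extractor.close()
            parts.append(" ".join(extractor.chunks))
        return "\n".join(parts)
```

The returned text could then be reduced to keywords or handed to the indexer as-is; which of the two is preferable is left open here.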

rgaudin · 2023-01-26T10:55:10Z

The most difficult part here is the one that's not been mentioned: the UI. With our generic UI that

What do entries look like? An HTML shell that displays epub.js at 100% size? Should it include a link/button to download/open the EPUB in case you have an external EPUB reader?

I believe the search topic deserves its own ticket. openzim/libzim#289 seems like the wrong solution to the problem. We don't want libzim to index EPUBs. If libzim does it, then search results would point to the .epub entry and not to our epub.js shell… If we want to index the shell, then we need libzim NOT to index the .epub entries, otherwise we'll double the index size.

We'd need a scraper-level EPUB parser (and HTML, and PDF). Actually we could already (when also including HTML) build the index data on the cover article and disable libzim's own indexing of the HTML book, so that search points to the cover and not to the HTML itself.
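
As an illustration of that idea, the sketch below plans the entries for one book so that only the cover carries index data while the book files carry none. It is plain Python and does not call libzim; `IndexData`, `PlannedEntry`, `plan_book_entries` and the entry paths are all hypothetical, and how such a plan maps onto the libzim writer is an open assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexData:
    title: str
    content: str
    wordcount: int


@dataclass
class PlannedEntry:
    path: str
    # None means "do not index this entry at all".
    indexdata: Optional[IndexData]


def plan_book_entries(
    book_id: str, title: str, book_text: str, formats: List[str]
) -> List[PlannedEntry]:
    """Attach the book's full text to its cover entry only."""
    cover = PlannedEntry(
        path=f"{book_id}_cover",  # hypothetical path scheme
        indexdata=IndexData(title, book_text, len(book_text.split())),
    )
    # The book files themselves stay out of the index, so search results
    # point to the cover and the text is indexed only once.
    files = [PlannedEntry(path=f"{book_id}.{fmt}", indexdata=None) for fmt in formats]
    return [cover, *files]
```

With this layout a search hit always lands on the cover, which is exactly the trade-off questioned in the next paragraph.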

Now one issue would be that books are very long and EPUB (and PDF) are paginated. If you're searching for an expression, is it acceptable to just link to the book cover? A WP article is a single page, so despite being cumbersome, you can easily ^F and find that text again.

In epub.js there is no search-in-book feature (yet??) so if you were not looking for a book but for an extract, it's gonna be useless… and I believe finding books is not what fulltext index is about (home page search does it probably better)

Jaifroid · 2023-04-20T07:52:56Z

I risk sounding like a broken record, but please remember users with older browsers and OS's, as well as those with restrictive CSPs. HTML is a universal way to access content that is supported everywhere (at least, static HTML). While it's fine if we can include a system in the ZIM to convert EPUB or PDF content to accessible (and searchable) HTML, we would need to be sure that such readers run under old browsers and restrictive CSPs. Otherwise you risk making ZIMs even more inaccessible than they already are. Even a modern Chrome extension can't access the current dynamic UI due to its use of inline JS (#145), and that is only going to get worse with the stricter CSPs in manifest v3 extensions also: kiwix/kiwix-js#755.

So, I agree with the caution expressed by @rgaudin, but for slightly different reasons.

Jaifroid · 2023-04-20T08:08:21Z

I've just checked, and epub.js doesn't work in IE11. Yes, IE11 is now history, but it's still a good proxy for old browser support...

benoit74 · 2023-08-18T12:37:05Z

For those who do not yet know about it, integrating an EPUB and a PDF reader has already been done for the Kolibri scraper.

There is even a download button for those who prefer to use another reader.

Other questions regarding resulting UI and the creation of multiple ZIMs (all, epub_only, html_only, pdf_only) are still relevant

kelson42 added enhancement question labels Nov 12, 2019

kelson42 assigned kelson42, rgaudin, dattaz and Popolechien Nov 12, 2019

kelson42 mentioned this issue Dec 8, 2019

Allow writer to parse EPUB openzim/libzim#289

Open

stale bot added the stale label Aug 22, 2020

kelson42 added this to the 1.2.0 milestone Dec 17, 2022

stale bot removed the stale label Dec 17, 2022

kelson42 removed the question label Dec 20, 2022

kelson42 unassigned rgaudin, kelson42, dattaz and Popolechien Dec 20, 2022

kelson42 changed the title ~~Consider a better support of ZIM files without HTML~~ Consider a better support of ZIM files without books in HTML Jan 7, 2023

This was referenced Jan 19, 2023

UI does not limit the buttons displayed to only requested formats #159

Closed

Do not force the presence of HTML format for all books #161

Open

kelson42 pinned this issue Jan 25, 2023

kelson42 mentioned this issue Jan 25, 2023

Remove « .html » extension #166

Open

kelson42 modified the milestones: 2.0.0, 3.0.0, 2.2.0 Feb 26, 2023

kelson42 mentioned this issue Aug 18, 2023

Remove HTML parsing from our source repository openzim/libzim#377

Open

Jaifroid mentioned this issue Mar 26, 2024

Missing images in EPUBs that are present in HTML books #222

Open
