RCE vulnerability in libwebp dependency #1903

Closed
lfcnassif opened this issue Sep 28, 2023 · 20 comments · Fixed by #1914
Labels: bug, dependencies

Comments

@lfcnassif
Member

libwebp is used by tesseract and imagemagick, so we should upgrade it to version 1.3.2, as described here:
https://nvd.nist.gov/vuln/detail/CVE-2023-4863

@tc-wleite, as you already compiled tesseract from source, would it be easy to compile it again with libwebp-1.3.2?

About imagemagick, I already reported the dependency issue to them. If they are not fast, we may think about compiling it from source...

For users, to mitigate the problem within IPED for now, it is enough to disable OCR and set enableExternalConv = false in conf/ImageThumbsConfig.txt
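
If the setting has to be changed on several installations, the edit can be scripted. The following is only a rough sketch in Python (not part of IPED): it assumes the file uses plain `key = value` lines, and `set_option` is a hypothetical helper name. Only the key `enableExternalConv` and the file `conf/ImageThumbsConfig.txt` come from the comment above; the option that disables OCR is not covered here.

```python
import re


def set_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with `key = value` set.

    Assumes simple `key = value` lines (an assumption about the file format).
    An existing uncommented line for the key is replaced; otherwise a new
    line is appended at the end.
    """
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(config_text):
        # Replace only the first matching line, leaving the rest untouched.
        return pattern.sub(lambda _m: new_line, config_text, count=1)
    separator = "" if not config_text or config_text.endswith("\n") else "\n"
    return f"{config_text}{separator}{new_line}\n"
```

To apply it, read `conf/ImageThumbsConfig.txt`, pass the content through `set_option(content, "enableExternalConv", "false")`, and write the result back (keeping a backup of the original file is a good idea).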

@lfcnassif added the bug and dependencies labels on Sep 28, 2023
@lfcnassif changed the title from "RCE vulnerability in libwebp depency" to "RCE vulnerability in libwebp dependency" on Sep 28, 2023
@wladimirleite
Member

wladimirleite commented Sep 28, 2023

> @tc-wleite, as you already compiled tesseract from source, would it be easy to compile it again with libwebp-1.3.2?

A few months ago I compiled it again (tesseract version 5.3.0) to check whether there was any improvement compared to the version we are using (5.0.0), but that wasn't the case.
I can try to build it again, using libwebp-1.3.2.

@lfcnassif
Member Author

> I can try to build it again, using libwebp-1.3.2.

Great, thank you!

@wladimirleite
Member

Just built Tesseract 5.3.2-24-g3922 (latest version), but it uses libwebp-1.3.1.
I will try to manually change to 1.3.2.

@lfcnassif
Member Author

> Just built Tesseract 5.3.2-24-g3922 (latest version), but it uses libwebp-1.3.1.
> I will try to manually change to 1.3.2.

There is no urgent need, @tc-wleite! I just got an answer from the ImageMagick project:

> Version 7.1.1-17 of ImageMagick uses libwebp-1.3.2

So we could redirect webp to imagemagick before tesseract, as we do for other non-standard formats.

@wladimirleite
Member

I finally managed to build tesseract 5.3.2 with libwebp-1.3.2.
It was kind of painful, as I am using a procedure that relies on "sw", which has not been updated to libwebp-1.3.2 (currently https://software-network.org/org.sw.demo.webmproject.webp shows 1.3.1 as the most recent version).

Tests with a few samples are looking good.
I will run a larger test overnight, to check the performance and recognition results.
From what I remember from 5.3.0 (compared to 5.0.0), I expect only very minor differences.

@lfcnassif
Member Author

Thank you very much @tc-wleite! But as I said, don't hurry, we can use imagemagick as a workaround.

@wladimirleite
Member

Tesseract 5.3.2 compiled for Windows with libwebp 1.3.2: tesseract.zip

tesseract 5.3.2-24-g3922
 leptonica-1.83.1 (Sep 29 2023, 19:05:06) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 9e : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0
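
For anyone who wants to check another build the same way, here is a minimal sketch (Python, not part of IPED) that pulls the libwebp version out of version text like the output above and compares it with 1.3.2, the version this issue asks to upgrade to. The function names are hypothetical, and the sample string is simply the output pasted above.

```python
import re

# Version this issue asks to upgrade to (see the CVE-2023-4863 link above).
FIXED_LIBWEBP = (1, 3, 2)


def libwebp_version(version_output: str):
    """Return the libwebp version found in the text as a tuple, or None."""
    match = re.search(r"libwebp\s+(\d+)\.(\d+)\.(\d+)", version_output)
    return tuple(int(part) for part in match.groups()) if match else None


def is_patched(version_output: str) -> bool:
    """True if the reported libwebp version is at least FIXED_LIBWEBP."""
    version = libwebp_version(version_output)
    return version is not None and version >= FIXED_LIBWEBP


sample = """tesseract 5.3.2-24-g3922
 leptonica-1.83.1 (Sep 29 2023, 19:05:06) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 9e : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0
"""
print(is_patched(sample))
```

A build that does not list libwebp at all makes `is_patched` return False, so a negative result means "not confirmed" rather than "vulnerable".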

I processed a large set of images and PDFs (around 20K files in total) with the new version and with the one we currently use (5.0.0).
Performance was slightly better with the newer version (ParsingTask total time was reduced by ~7%).
Results (extracted text) are similar, but there are small (and in a few cases not so small) differences.
I wrote a quick program to compare the OCR results of each item, calculating the Levenshtein distance (simplified to deal with longer strings). Then I visually inspected some of the images/PDFs with the highest distances. In most of them, the recognized text is similar, but the way the layout was handled (e.g. two columns instead of one) changed. In general, the newer version seems slightly better.
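
The comparison program itself was not posted, so the following is only a sketch of that kind of check, written in Python. It computes a plain Levenshtein distance per item and lists the items whose OCR text changed the most. The truncation to `max_len` characters is a hypothetical stand-in for the "simplified to deal with longer strings" part, and the names `levenshtein` and `rank_by_distance` are made up for this example.

```python
def levenshtein(a: str, b: str, max_len: int = 2000) -> int:
    """Edit distance between a and b.

    Both inputs are truncated to max_len characters to bound the cost on
    long texts (an assumed simplification, not the original program's).
    """
    a, b = a[:max_len], b[:max_len]
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def rank_by_distance(old: dict, new: dict, top: int = 10):
    """Return (distance, item_id) pairs for the items that changed the most.

    old and new map an item id (e.g. a hash) to its extracted OCR text;
    only ids present in both are compared.
    """
    scored = [(levenshtein(old[k], new[k]), k) for k in old.keys() & new.keys()]
    return sorted(scored, reverse=True)[:top]
```

The items at the top of the ranking are the ones worth opening side by side, which matches the manual inspection described above.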

@lfcnassif
Member Author

Awesome! Thank you @tc-wleite! I'll update tesseract and imagemagick, cherry-pick other important fixes (like #1879) and try to release 4.1.5 early next week.

@lfcnassif
Member Author

lfcnassif commented Oct 2, 2023

Just started an ImageMagick regression test on 300K samples of non-standard image formats collected from 220 different cases. I'll probably post the results tomorrow.

@lfcnassif
Member Author

lfcnassif commented Oct 3, 2023

Images with thumbnails generated by the current ImageMagick version:
[image]

Images with thumbnails generated by ImageMagick version 7.1.1-18:
[image]

So the upgrade resulted in more rendered EMF, TIFF & XBM images. I'll proceed with the upgrade.

PS: I didn't compare the quality or correctness of the rendered images, only whether a thumbnail was generated or not.

@lfcnassif
Member Author

Hi @tc-wleite, I'm thinking of using a dynamically linked ImageMagick build instead of the statically linked one (maybe it runs faster). What do you think?

@lfcnassif
Member Author

> Hi @tc-wleite, I'm thinking of using a dynamically linked ImageMagick build instead of the statically linked one (maybe it runs faster). What do you think?

I started a performance test. Unless there is an important difference, I'll keep the statically linked version, since all official IM portable versions are statically linked.

@wladimirleite
Member

I usually prefer statically linked libraries. My intuition is that performance should be very similar in the case of ImageMagick, but it is better to test!

By the way, if you want to compare the thumbnails generated in your test by the newer IM version and by the one currently used: not sure if you remember, but I wrote a small program that points out the hashes of the N images with the most different thumbnails. So you can filter by those hashes in both cases (if you still have the cases) and visually compare just a small subset, not thousands of images.

@lfcnassif
Member Author

lfcnassif commented Oct 3, 2023

Yes I remember, that would be great! What's the input, the IPED cases or the thumbs databases?

@wladimirleite
Member

Thumbs database.

@wladimirleite
Member

wladimirleite commented Oct 3, 2023

It needs the SQLite JDBC jar.
The case paths and the number of top differences to be printed are hard-coded in main().
The comparison between two images is very simple (just the difference between RGB values).
It also lists hashes that are missing from one case, but only in one direction (present in the first case but not in the second), so it is better to use the case with more thumbs as the first one, or to run the comparison twice, inverting the order.

EDIT: Code was too long, it is better to attach it: CompareThumbs.zip
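
For readers without the attachment, the idea can be sketched as follows. This is not the attached CompareThumbs code: it is a small Python illustration that assumes the thumbnails have already been read from the thumbs databases and decoded into equally sized sequences of (R, G, B) tuples, keyed by hash. The function names and the mean-absolute-difference score are assumptions made for this example.

```python
def rgb_difference(a, b) -> float:
    """Mean absolute per-channel difference between two decoded thumbnails.

    a and b are sequences of (r, g, b) tuples and must have the same,
    non-zero length (assumed here; real thumbnails may need resizing first).
    """
    if not a or len(a) != len(b):
        raise ValueError("thumbnails must be non-empty and equally sized")
    total = sum(abs(ca - cb) for pa, pb in zip(a, b) for ca, cb in zip(pa, pb))
    return total / (3 * len(a))


def compare_cases(first: dict, second: dict, top_n: int = 5):
    """Compare two {hash: pixels} mappings.

    Returns (most_different, missing): the top_n hashes with the largest
    RGB difference, and the hashes present in `first` but not in `second`.
    Hashes present only in `second` are not reported, so the check is
    one-directional, as described above.
    """
    missing = sorted(first.keys() - second.keys())
    scored = [(rgb_difference(first[h], second[h]), h)
              for h in first.keys() & second.keys()]
    most_different = [h for _, h in sorted(scored, reverse=True)[:top_n]]
    return most_different, missing
```

Calling `compare_cases` a second time with the arguments swapped covers the hashes that exist only in the second case.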

@lfcnassif
Member Author

> It needs SQLite JDBC jar. Cases path and number of top differences to be printed are hard coded in main(). The comparison between two image is very simple (just difference between RGB values). And it shows hashes present in the first case but not in the second only, so it is better to use the case with more thumbs as the first, or run the comparison twice inverting the order.
>
> EDIT: Code was too long, it is better to attach it: CompareThumbs.zip

Thank you @tc-wleite! I just did the comparison and the differences are very minor: only one JP2 was rendered with different colors/brightness, which I think is fine. As for the difference in the EMF counts, it is due to timeouts; the old ImageMagick is also able to render them in ImageViewer.

So I think we are fine, and I will proceed with both upgrades.

@lfcnassif
Member Author

@tc-wleite, just realized mplayer may link to libwebp too... Do you know if it does? At least, we don't process webp using mplayer, just animated heic, heif, gif & png, right?

@wladimirleite
Member

wladimirleite commented Oct 5, 2023

> @tc-wleite, just realized mplayer may link to libwebp too... Do you know if it does? At least, we don't process webp using mplayer, just animated heic, heif, gif & png, right?

I believe that FFmpeg (used by MPlayer) uses libwebp, but only to encode, not to decode.
Decoding would be useful for us to process animated WEBPs, but currently there is no support (https://trac.ffmpeg.org/ticket/4907).
So we definitely do not use anything related to WEBPs in MPlayer.

From time to time, I check new MPlayer versions for Windows.
Unfortunately, one of the websites that used to publish these Windows builds has not been updated since 2019, and the other was usually updated very often (like once per month or more), but the last update was in December, 2022.
So, I am not sure if it would be easy to find a recent build (that includes the most recent version of libwebp).

@lfcnassif
Member Author

Thank you @tc-wleite for your research! Since the issue is triggered by decoding a malicious webp and FFmpeg/MPlayer don't support that, I think we are safe.
