In #330 (comment) @untitaker mentioned that HTML parsing could be improved:

> HTML parsing is slow. I know that using html5ever's Tokenizer more than doubled hyperlink's runtime, which is why I didn't manage to move hyperlink away from quick-xml for a long time. I see that you're constructing an entire tree in lychee instead of iterating over tokens. Shamelessly plugging my own HTML parser at this point, but swapping out html5ever for literally anything else, or replacing the document parser with just the Tokenizer, will give you some boost. Quickly skimming over what you do with the element tree, you don't really need anything but tokens.
We could do so by implementing our own `TokenSink`, as shown in this example: https://github.com/servo/html5ever/blob/master/html5ever/examples/tokenize.rs

`TokenSink` could be an iterator or even a stream of tokens; we don't have to create an entire DOM for our use case.

Update `extractor.rs` to use a custom `TokenSink`.
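To make the token-based idea concrete, here is a minimal, std-only sketch of extracting links from a stream of tokens without building a tree. It is not html5ever's API and not lychee's code: `Token`, `Sink`, `LinkCollector`, `tokenize` and `parse_tag` are hypothetical names invented for illustration, and the toy tokenizer only understands double-quoted attributes, so it is nowhere near spec-compliant. A real change would presumably implement html5ever's own `TokenSink` trait along the lines of the linked `tokenize.rs` example.

```rust
/// Simplified token (hypothetical); html5ever's real token type is richer.
#[allow(dead_code)] // `name` is kept for illustration but not read below.
#[derive(Debug, PartialEq)]
enum Token {
    StartTag { name: String, attrs: Vec<(String, String)> },
    Other,
}

/// Hypothetical sink trait, loosely mirroring the idea behind `TokenSink`:
/// the tokenizer pushes tokens in, and the sink decides what to keep.
trait Sink {
    fn process_token(&mut self, token: Token);
}

/// Collects `href` and `src` values as tokens arrive; no DOM is built.
#[derive(Default)]
struct LinkCollector {
    links: Vec<String>,
}

impl Sink for LinkCollector {
    fn process_token(&mut self, token: Token) {
        if let Token::StartTag { attrs, .. } = token {
            for (key, value) in attrs {
                if key == "href" || key == "src" {
                    self.links.push(value);
                }
            }
        }
    }
}

/// Toy tokenizer (assumption: well-formed input, no `>` inside attributes).
/// It feeds one token per `<...>` span to the sink and skips text nodes.
fn tokenize<S: Sink>(html: &str, sink: &mut S) {
    let mut rest = html;
    while let Some(open) = rest.find('<') {
        let after = &rest[open + 1..];
        match after.find('>') {
            Some(close) => {
                sink.process_token(parse_tag(&after[..close]));
                rest = &after[close + 1..];
            }
            None => break, // unterminated tag: stop
        }
    }
}

/// Parses the inside of a tag. Only `key="value"` attributes are recognised.
fn parse_tag(tag: &str) -> Token {
    // End tags, comments and doctypes carry no links.
    if tag.starts_with('/') || tag.starts_with('!') {
        return Token::Other;
    }
    let tag = tag.trim_end_matches('/').trim();
    let (name, mut rest) = match tag.split_once(char::is_whitespace) {
        Some((name, rest)) => (name, rest),
        None => (tag, ""),
    };
    let mut attrs = Vec::new();
    while let Some((key, after_eq)) = rest.split_once("=\"") {
        match after_eq.split_once('"') {
            Some((value, tail)) => {
                // Skip any bare attributes (e.g. `download`) before the key.
                if let Some(key) = key.split_whitespace().last() {
                    attrs.push((key.to_ascii_lowercase(), value.to_string()));
                }
                rest = tail;
            }
            None => break, // unterminated quote: stop
        }
    }
    Token::StartTag { name: name.to_ascii_lowercase(), attrs }
}
```

Calling `tokenize(html, &mut collector)` with a `LinkCollector::default()` leaves the extracted URLs in `collector.links`. The point of the design is that memory use stays proportional to the number of links rather than to the size of the document tree.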