
ExtractorChrome: reduce request duplication between browser and frontier #416

Merged

ato merged 3 commits into master from extractor-chrome-replay-responses on Jul 27, 2021

Conversation

ato (Collaborator) commented on Jul 23, 2021

  1. By intercepting the browser's request and fulfilling it with the response previously recorded by FetchHTTP, we avoid the browser sending a duplicate request for the main CrawlURI to the web server (see the sketch after the note below).

  2. By running the link extractors on subresources captured by the browser, we can now also tell the frontier not to bother scheduling them for classic fetching.

Note: The browser will still make duplicate requests for previously downloaded subresources. Solving this will require implementing resource caching or some way to read back previously written WARC records.
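For a concrete picture of point 1, the Chrome DevTools Protocol flow looks roughly like the sketch below: requests are intercepted via Fetch.requestPaused, the main CrawlURI is answered with Fetch.fulfillRequest, and everything else is released with Fetch.continueRequest. The `devtools.call` helper, `curi`, and `recordedResponse` objects are illustrative placeholders, not the actual ExtractorChrome code; only the CDP method names and parameters are real.

```java
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical handler for the CDP "Fetch.requestPaused" event.
// "devtools.call", "curi" and "recordedResponse" stand in for whatever the
// real extractor uses.
void onRequestPaused(Map<String, Object> params) {
    String requestId = (String) params.get("requestId");
    Map<String, Object> request = (Map<String, Object>) params.get("request");
    String url = (String) request.get("url");

    if (url.equals(curi.getURI()) && recordedResponse != null) {
        // Replay the body FetchHTTP already recorded instead of letting the
        // browser ask the origin server for the same URL a second time.
        devtools.call("Fetch.fulfillRequest", Map.of(
                "requestId", requestId,
                "responseCode", recordedResponse.statusCode(),
                "responseHeaders", List.of(Map.of(
                        "name", "Content-Type",
                        "value", recordedResponse.contentType())),
                "body", Base64.getEncoder()
                        .encodeToString(recordedResponse.body())));
    } else {
        // Subresources and anything else still go out over the network.
        devtools.call("Fetch.continueRequest", Map.of("requestId", requestId));
    }
}
```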

ato added 3 commits July 23, 2021 17:38
By intercepting the browser's request and fulfilling it using the
response previously recorded by FetchHTTP we avoid sending duplicate
requests for the CrawlURI to the web server.

A size limit (maxReplayLength) is applied as a safety measure since the
browser's Fetch.fulfillRequest API requires us to load the entire
response body into memory.

Note: This only applies to the main CrawlURI. The browser can still
make duplicate requests when loading sub-resources. Solving this for
sub-resources will require implementing the ability to read back
previously written WARC records.
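A hedged sketch of that size guard, assuming a configurable maxReplayLength setting and the same placeholder objects as in the earlier sketch:

```java
// Sketch only: since Fetch.fulfillRequest needs the whole body in memory
// (as a single base64 string), skip replay for oversized responses and let
// the browser fetch the URL from the network instead.
long bodyLength = recordedResponse.bodyLength();   // placeholder accessor
if (bodyLength > maxReplayLength) {
    devtools.call("Fetch.continueRequest", Map.of("requestId", requestId));
    return;
}
byte[] body = recordedResponse.body();   // fully buffered; bounded by maxReplayLength
// ... proceed with Fetch.fulfillRequest as in the earlier sketch
```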
This ensures we discover links in subresources even if the browser doesn't happen to load them. For example, a CSS file might link to images that the browser won't load because they're gated behind media queries.

Since we now run extractors on subresources, there's no reason to schedule and refetch them.

Note that duplicate fetches can still occur if the URI was already
scheduled or if the browser itself refetches the resource.
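Taken together, the subresource handling in these two commits might be sketched as follows. All helper names here (onSubresourceCaptured, linkExtractors, frontier.markAlreadyFetched) are hypothetical stand-ins for the real Heritrix/ExtractorChrome classes, used only to show the shape of the change:

```java
// Hypothetical sketch of the subresource path described in the commits above.
void onSubresourceCaptured(String url, String contentType, byte[] body) {
    // Run the normal link extractors over the captured body so outlinks are
    // discovered even when the browser never loads them (e.g. images that a
    // CSS file references only behind media queries).
    for (LinkExtractor extractor : linkExtractors) {
        extractor.extract(url, contentType, body, discoveredLinks);
    }
    // Tell the frontier the subresource has already been fetched by the
    // browser, so it isn't scheduled for a second, classic fetch. Duplicates
    // remain possible if the URI was scheduled earlier or the browser itself
    // refetches it.
    frontier.markAlreadyFetched(url);
}
```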
ato changed the title from "ExtractorChrome: replay the recorded CrawlURI response to the browser" to "ExtractorChrome: reduce request duplication between browser and frontier-based crawling" on Jul 24, 2021
ato changed the title from "ExtractorChrome: reduce request duplication between browser and frontier-based crawling" to "ExtractorChrome: reduce request duplication between browser and frontier" on Jul 24, 2021
ato merged commit 466c10b into master on Jul 27, 2021
ato deleted the extractor-chrome-replay-responses branch on July 27, 2021 at 11:26