
ExtractorChrome: reduce request duplication between browser and frontier #416

Merged

ato merged 3 commits into master from extractor-chrome-replay-responses on Jul 27, 2021

Conversation

ato (Collaborator) commented on Jul 23, 2021

  1. By intercepting the browser's request and fulfilling it with the response previously recorded by FetchHTTP, we avoid the browser sending a duplicate request for the main CrawlURI to the web server (see the sketch after the note below).

  2. By running the link extractors on subresources captured by the browser, we can now also tell the frontier not to bother scheduling them for classic fetching.

Note: The browser will still make duplicate requests for previously downloaded subresources. Solving this will require implementing resource caching or some way to read back previously written WARC records.
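For a concrete picture of point 1, the Chrome DevTools Protocol flow looks roughly like the sketch below: requests are intercepted via Fetch.requestPaused, the main CrawlURI is answered with Fetch.fulfillRequest, and everything else is released with Fetch.continueRequest. The `devtools.call` helper, `curi`, and `recordedResponse` objects are illustrative placeholders, not the actual ExtractorChrome code; only the CDP method names and parameters are real.

```java
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical handler for the CDP "Fetch.requestPaused" event.
// "devtools.call", "curi" and "recordedResponse" stand in for whatever the
// real extractor uses.
void onRequestPaused(Map<String, Object> params) {
    String requestId = (String) params.get("requestId");
    Map<String, Object> request = (Map<String, Object>) params.get("request");
    String url = (String) request.get("url");

    if (url.equals(curi.getURI()) && recordedResponse != null) {
        // Replay the body FetchHTTP already recorded instead of letting the
        // browser ask the origin server for the same URL a second time.
        devtools.call("Fetch.fulfillRequest", Map.of(
                "requestId", requestId,
                "responseCode", recordedResponse.statusCode(),
                "responseHeaders", List.of(Map.of(
                        "name", "Content-Type",
                        "value", recordedResponse.contentType())),
                "body", Base64.getEncoder()
                        .encodeToString(recordedResponse.body())));
    } else {
        // Subresources and anything else still go out over the network.
        devtools.call("Fetch.continueRequest", Map.of("requestId", requestId));
    }
}
```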

ato added 3 commits July 23, 2021 17:38
By intercepting the browser's request and fulfilling it using the
response previously recorded by FetchHTTP we avoid sending duplicate
requests for the CrawlURI to the web server.

A size limit (maxReplayLength) is applied as a safety measure since the
browser's Fetch.fulfillRequest API requires us to load the entire
response body into memory.

Note: This only applies to the main CrawlURI. The browser can still
make duplicate requests when loading sub-resources. Solving this for
sub-resources will require implementing the ability to read back
previously written WARC records.
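A hedged sketch of that size guard, assuming a configurable maxReplayLength setting and the same placeholder objects as in the earlier sketch:

```java
// Sketch only: since Fetch.fulfillRequest needs the whole body in memory
// (as a single base64 string), skip replay for oversized responses and let
// the browser fetch the URL from the network instead.
long bodyLength = recordedResponse.bodyLength();   // placeholder accessor
if (bodyLength > maxReplayLength) {
    devtools.call("Fetch.continueRequest", Map.of("requestId", requestId));
    return;
}
byte[] body = recordedResponse.body();   // fully buffered; bounded by maxReplayLength
// ... proceed with Fetch.fulfillRequest as in the earlier sketch
```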
This ensures we discover links in subresources even if the browser doesn't happen to load them. For example, a CSS file might link to images that the browser won't load because they're gated behind media queries.

Since we now run extractors on subresources, there's no reason to schedule and refetch them.

Note that duplicate fetches can still occur if the URI was already
scheduled or if the browser itself refetches the resource.
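Taken together, the subresource handling in these two commits might be sketched as follows. All helper names here (onSubresourceCaptured, linkExtractors, frontier.markAlreadyFetched) are hypothetical stand-ins for the real Heritrix/ExtractorChrome classes, used only to show the shape of the change:

```java
// Hypothetical sketch of the subresource path described in the commits above.
void onSubresourceCaptured(String url, String contentType, byte[] body) {
    // Run the normal link extractors over the captured body so outlinks are
    // discovered even when the browser never loads them (e.g. images that a
    // CSS file references only behind media queries).
    for (LinkExtractor extractor : linkExtractors) {
        extractor.extract(url, contentType, body, discoveredLinks);
    }
    // Tell the frontier the subresource has already been fetched by the
    // browser, so it isn't scheduled for a second, classic fetch. Duplicates
    // remain possible if the URI was scheduled earlier or the browser itself
    // refetches it.
    frontier.markAlreadyFetched(url);
}
```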
ato changed the title from "ExtractorChrome: replay the recorded CrawlURI response to the browser" to "ExtractorChrome: reduce request duplication between browser and frontier-based crawling" on Jul 24, 2021
ato changed the title from "ExtractorChrome: reduce request duplication between browser and frontier-based crawling" to "ExtractorChrome: reduce request duplication between browser and frontier" on Jul 24, 2021
ato merged commit 466c10b into master on Jul 27, 2021
ato deleted the extractor-chrome-replay-responses branch on July 27, 2021 at 11:26