
WIP: ExtractorYoutubeDL #257

Merged: 9 commits into internetarchive:master, May 22, 2019

Conversation

@nlevitt (Contributor) commented Apr 30, 2019

No description provided.

There is already a clause for this in logging.properties, but it's using
log4j. It was dumping stack traces every time the client was dubious of
heritrix's self-signed certificate.
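A sketch of the mismatch that commit message describes, with a made-up logger name (`org.example.NoisyTlsClient` is illustrative): a log4j-style line in a `logging.properties` file read by java.util.logging is silently ignored, so the level has to be set in j.u.l. syntax instead.

```properties
# log4j syntax: silently ignored by java.util.logging
# log4j.logger.org.example.NoisyTlsClient=ERROR

# java.util.logging syntax: what logging.properties actually understands
org.example.NoisyTlsClient.level = SEVERE
```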

Why do we have so many identical log4j.xml files? 🤷‍♂️

Without this change (or other measures), we sometimes get nulls in the ExtractorYoutubeDL log for containing page information. We'll run this on QA for a while and see if it causes any problems.

nlevitt [1:59 PM]
https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/CrawlURI.java#L878
drops some stuff from `CrawlURI.data` after processing a URI, even if it needs to be processed again
there is a list of keys that shouldn't be dropped (`persistentKeys`), but it is final and private
so if you're writing your own heritrix module and you want to keep some information in `CrawlURI.data`, it usually works, except when the URL is processed more than once (like when it needs a prereq such as robots.txt the first time)
in practice it seems that most data is persisted; that is, the most commonly used keys are in `persistentKeys`
in a crawl with a pretty standard configuration I'm mostly seeing `prerequisite-uri` dropped, and occasionally `fetch-completed-time` and `fetch-began-time`
I'm highly skeptical of the value of dropping keys at all, and I'm tempted to get rid of this entirely; in other words, make all the keys persistent
soliciting feedback (edited)
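For readers without the code open, here is a minimal self-contained sketch of the behavior being described. The names `data`, `persistentKeys`, and `processingCleanup()` come from `CrawlURI` as discussed above; the class, keys, and values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Stand-in for the CrawlURI behavior under discussion (not the real class):
// after each processing pass, every entry in `data` whose key is not in the
// private, final persistentKeys set is dropped, even when the URI is about
// to be reprocessed (e.g. after fetching a robots.txt prerequisite).
class CrawlUriSketch {
    private final Set<String> persistentKeys = Set.of("some-builtin-key");

    final Map<String, Object> data = new HashMap<>();

    void processingCleanup() {
        // the key-dropping step: a module's own key vanishes here unless
        // it happens to be on the whitelist
        data.keySet().retainAll(persistentKeys);
    }

    public static void main(String[] args) {
        CrawlUriSketch curi = new CrawlUriSketch();
        curi.data.put("some-builtin-key", "kept");
        curi.data.put("my-module-key", "dropped");
        curi.processingCleanup();
        System.out.println(curi.data); // prints {some-builtin-key=kept}
    }
}
```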

anjackson [2:39 PM]
My immediate reaction is HARD AGREE. It looks like Really Old Code though (https://github.com/internetarchive/heritrix3/blame/7d3eff5269142c77fa4b988396153f4c29d16caa/modules/src/main/java/org/archive/modules/CrawlURI.java#L878)
so the reasons for doing so may have been lost in time.
Hm, looking at usage: https://github.com/internetarchive/heritrix3/blob/a60b2ef3875ad47f57b0c6b3c0b19f86c40a12f7/engine/src/main/java/org/archive/crawler/frontier/WorkQueueFrontier.java#L954-L955
engine/src/main/java/org/archive/crawler/frontier/WorkQueueFrontier.java:954-955

                curi.processingCleanup(); // lose state that shouldn't burden
                                          // retry

I guess the concern is that there may be state in there, set during a fetch, that could cause problems if the same CrawlURI is deferred?
But I'm not aware of anything in the fetch chain that behaves like that.

nlevitt [3:02 PM]
Oh, I missed `CrawlURI.addDataPersistentMember(String)` et al. Still...
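For module authors following along, a hedged sketch of how that API could be used to keep a custom entry across reprocessing. `getData()` and `addDataPersistentMember(String)` are the `CrawlURI` methods named in the thread; the class and key are invented.

```java
import org.archive.modules.CrawlURI;

// Hypothetical module code; only CrawlURI and its two methods are real names.
public class ContainingPageRecorderSketch {
    private static final String MY_KEY = "myContainingPageInfo"; // invented key

    void record(CrawlURI curi, Object info) {
        curi.getData().put(MY_KEY, info);
        // register the key so it should survive processingCleanup() when the
        // URI is deferred (e.g. for a robots.txt prerequisite) and retried
        curi.addDataPersistentMember(MY_KEY);
    }
}
```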
@adam-miller merged commit de0c19f into internetarchive:master on May 22, 2019