referrer_title is empty, if the web site is not available #88

Closed
gitressa opened this issue Jun 2, 2019 · 7 comments · Fixed by #89
Labels
bug (Something isn't working), good first issue (Good for newcomers)

Comments

@gitressa
Contributor

gitressa commented Jun 2, 2019

It looks like Fink doesn't include the referrer_title if the web site is no longer online, even though it would be quite useful there.

Here is an example of a missing page where everything works as expected, since the web site is available. A status: 404 is returned, and the referrer_title is included:

[
  {
    "distance": 3,
    "exception": null,
    "referrer": "https://example.org/gang",
    "referrer_title": "Gangs and Security Threat Group Awareness:",
    "referrer_xpath": "/html/body/div/div/div/section/div[2]/section[2]/div/div/div/div[1]/span[2]/div/p[17]/a",
    "request_time": 557563,
    "status": 404,
    "url": "http://www.dc.state.fl.us/pub/gangs/index.html",
    "timestamp": "2019-05-24T01:20:04+02:00"
  }
]

Here are a few examples where the web site is no longer online, and the referrer_title is empty in the result:

[
  {
    "distance": 64,
    "exception": "Resolving the specified domain failed: 'www.eucalb.com'",
    "referrer": "https://example.org/pages?page=62",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://www.eucalb.com",
    "timestamp": "2019-05-24T01:39:19+02:00"
  },
  {
    "distance": 177,
    "exception": "Connection to 'byerly.org:80' failed",
    "referrer": "https://example.org/pages?page=175",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://byerly.org/bt.htm",
    "timestamp": "2019-05-24T01:43:10+02:00"
  },
  {
    "distance": 180,
    "exception": "Resolving the specified domain failed: 'awalls.org'",
    "referrer": "https://example.org/pages?page=178",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://awalls.org",
    "timestamp": "2019-05-24T01:45:00+02:00"
  },
  {
    "distance": 50,
    "exception": "Resolving the specified domain failed: 'curia.eu.int'",
    "referrer": "https://example.org/pages?page=48",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://curia.eu.int/da/index.htm",
    "timestamp": "2019-05-24T01:28:04+02:00"
  }
]

Other scenarios with an empty referrer_title, such as a mistyped URL or a timeout:

[
  {
    "distance": 202,
    "exception": "Request must specify a valid HTTP URI",
    "referrer": "https://example.org.dk/pages?page=200",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://www.danah.org.",
    "timestamp": "2019-05-24T01:54:50+02:00"
  },
  {
    "distance": 50,
    "exception": "Allowed transfer timeout exceeded: 15000 ms",
    "referrer": "https://example.org/pages?page=48",
    "referrer_title": "",
    "referrer_xpath": "",
    "request_time": 0,
    "status": null,
    "url": "http://nobelprize.org/nobel_prizes/literature/laureates/1930/index.html",
    "timestamp": "2019-05-24T01:35:24+02:00"
  }
]
dantleech added the bug and good first issue labels Jun 2, 2019
dantleech added a commit that referenced this issue Jun 2, 2019
Ensure that URL is recorded if client throws exception when requestin…
@dantleech
Owner

Should be fixed in #89

@gitressa
Contributor Author

gitressa commented Jun 2, 2019

Thanks @dantleech, it works perfectly, as can be seen in this freshly crawled example, where referrer_title is now included:

[
  {
    "distance": 48,
    "exception": "Resolving the specified domain failed: 'curia.eu.int'",
    "referrer": "https://example.org/pages?page=47",
    "referrer_title": "CURIA",
    "referrer_xpath": "/html/body/div/div/div/section/div[2]/div/div/div/div[3]/div[2]/div/a[1]",
    "request_time": 0,
    "status": null,
    "url": "http://curia.eu.int/da/index.htm",
    "timestamp": "2019-06-02T20:36:09+02:00"
  }
]

@gitressa
Contributor Author

gitressa commented Jun 2, 2019

Would it be possible to somehow include in the output that the web site gave no response? Currently, I can't use status: null as a reliable filter, since many URLs with that status actually do work, but are 301s. For a few examples, see here.

EDIT: I can probably check for Connection to 'www.example.com:80' failed or Resolving the specified domain failed in the exception field to achieve this. That would probably work ...
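
For anyone post-processing the JSON outside of Fink, here is a minimal Python sketch of the check described in the EDIT above. It is not part of Fink; `is_unreachable` is a hypothetical helper name, and it assumes the output has been loaded as a list of dicts with the fields shown in this thread and that the exception messages keep the wording quoted here.

```python
def is_unreachable(record):
    """Return True if the exception text suggests the host gave no response.

    Assumption: the messages look like the ones quoted in this issue,
    e.g. "Connection to 'host:80' failed" or
    "Resolving the specified domain failed: 'host'".
    """
    exception = record.get("exception") or ""  # exception may be null
    if "Resolving the specified domain failed" in exception:
        return True
    return exception.startswith("Connection to") and exception.endswith("failed")


# Records trimmed down from the examples earlier in this issue.
sample = [
    {"url": "http://www.dc.state.fl.us/pub/gangs/index.html",
     "status": 404, "exception": None},
    {"url": "http://www.eucalb.com",
     "status": None,
     "exception": "Resolving the specified domain failed: 'www.eucalb.com'"},
    {"url": "http://byerly.org/bt.htm",
     "status": None,
     "exception": "Connection to 'byerly.org:80' failed"},
    {"url": "http://nobelprize.org/nobel_prizes/literature/laureates/1930/index.html",
     "status": None,
     "exception": "Allowed transfer timeout exceeded: 15000 ms"},
]

unreachable_urls = [r["url"] for r in sample if is_unreachable(r)]
print(unreachable_urls)
# ['http://www.eucalb.com', 'http://byerly.org/bt.htm']
```

Note that the timeout record is deliberately not counted as unreachable here, since a slow site may still be online.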

@gitressa
Contributor Author

gitressa commented Jun 2, 2019

With referrer_title available also for status: null, my dead link harvest can be increased by 10-15%, so is it worth considering a fresh release to include this new feature? If it is too much work for too little improvement, I can respect that.

@dantleech
Owner

dantleech commented Jun 3, 2019 via email

@gitressa
Contributor Author

gitressa commented Jun 3, 2019

That sounds great, thanks. Let me know if you would like me to test any new features.

@gitressa
Contributor Author

gitressa commented Jun 9, 2019

Thanks for releasing version 0.9.0, which adds referrer_title for links with status: null.

I can now use status!=200 as a filter, which is much simpler than this 😄

select(([.status] | inside([400, 403, 404, 500, 502, 503])) or contains({exception: "Connection"}) or contains({exception: "Resolving"}))
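
For comparison, a rough Python rendering of the two filters, in case the JSON is processed outside jq. This is only a sketch: `is_dead_link` and `is_not_ok` are hypothetical names, and the records are assumed to be dicts with the `status` and `exception` fields shown in this thread.

```python
DEAD_STATUSES = {400, 403, 404, 500, 502, 503}


def is_dead_link(record):
    """Mirror of the longer jq filter: error status or a connection/DNS exception."""
    exception = record.get("exception") or ""  # exception may be null
    return (
        record.get("status") in DEAD_STATUSES
        or "Connection" in exception
        or "Resolving" in exception
    )


def is_not_ok(record):
    """Mirror of the simpler filter: anything whose status is not 200."""
    return record.get("status") != 200


records = [
    {"status": 200, "exception": None},
    {"status": 404, "exception": None},
    {"status": None, "exception": "Connection to 'byerly.org:80' failed"},
    {"status": None, "exception": "Allowed transfer timeout exceeded: 15000 ms"},
]

print([is_dead_link(r) for r in records])  # [False, True, True, False]
print([is_not_ok(r) for r in records])     # [False, True, True, True]
```

The two filters are not identical: the simpler one also keeps records such as timeouts, which have a null status but no connection or DNS exception.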

2 participants