NIHHealthyEating issues with a page without content #393

Closed
3 of 5 tasks
micahcochran opened this issue Jun 20, 2021 · 4 comments

@micahcochran
Contributor

Pre-filing checks

  • I have searched for open issues that report the same problem
  • I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

This is intentionally a bad URL being sent to the scrape_me() function:

The version of Python you're using

3.6.9

The operating system of your environment

Ubuntu Linux 18.04

The results you expect to see

I expect a mostly consistent experience among different scrapers. See the Notes section below.

The results (including any Python error messages) that you are seeing

Raises AttributeError for all methods except title().

Can you write Python and would you like to help fix the scraper yourself? We'd be glad for your assistance! We can provide you with guidance and code review in return. If so, tick any of the relevant boxes below:

  • I'd like to try fixing this scraper myself
  • I'd like guidance to help me develop a fix
  • I'd prefer if the recipe-scrapers team try to fix this

Notes

I think I have a fix, but I want to get some guidance before submitting it.

I also have code to implement the image() method.

For example, scraping https://www.allrecipes.com/ itself (presumably index.html), whose scraper is for the most part a schema scraper, gives the following:

Setup:

from recipe_scrapers import scrape_me
url = "https://www.allrecipes.com/"
r = scrape_me(url)

Results

>>> r.yields()
'0 serving(s)'

>>> r.ingredients()
[]

>>> r.instructions()
''

# title() raises a TypeError
>>> r.title()

I have written a website crawler, which crawls pages on the same domain. The crawler sends those pages to recipe_scrapers, and some of them will not have recipe content on them. What behavior should be expected when there isn't a recipe?

I can submit a PR for this if it helps discussion.

@bfcarpio
Collaborator

I just did a quick peek and I think you've discovered at least 2 bugs.

  1. No error handling for network requests (now tracked in "Add error handling for network requests", #394)
  2. We made this big thing about now throwing exceptions by default in v13, but didn't update the schema scraper entirely (facepalm). We need something like this in yields, for example, so that it doesn't default to returning a value.
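
As a rough sketch of the second point, a yields-style helper could raise instead of falling back to a default such as '0 serving(s)'. The exception class `ElementNotFound` and the function `yields_or_raise` are stand-ins defined here for illustration only; they are not the library's actual code, and the use of the schema.org `recipeYield` key is an assumption about what the schema data looks like.

```python
class ElementNotFound(Exception):
    """Hypothetical stand-in for a 'nothing to parse' error."""


def yields_or_raise(schema_data: dict) -> str:
    """Return the yield from schema-like data, or raise if it is absent."""
    raw = schema_data.get("recipeYield")
    # schema.org allows a list here; use the first entry if there is one.
    if isinstance(raw, list):
        raw = raw[0] if raw else None
    # Raise rather than inventing a default value.
    if raw is None or str(raw).strip() == "":
        raise ElementNotFound("recipeYield is missing from the schema data")
    return str(raw).strip()
```

With this shape, a page with no recipe data produces an exception the caller can handle, rather than a plausible-looking but meaningless value.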

I think our general direction is that we'd just have all the individual methods throw errors when they can't parse anything. We don't have any global functionality in the factory that errors out because of no recipe on the page. For your scraper, you could argue that if there are no ingredients, instructions, or title that you should skip the page etc.

I'm quite interested in why title() raises a TypeError, though I think that's also easy pickings for a type check.

Feel free to submit a PR where you feel comfortable and link your crawler in #9!

@micahcochran
Contributor Author

Thanks @bfcarpio for your insight.

For your scraper, you could argue that if there are no ingredients, instructions, or title that you should skip the page etc.

Yes, that's exactly what the crawler does. I haven't released the next version with the recipe_scrapers-related code because I ran into this issue. I would prefer that my crawler does not have to rely upon an empty try-except block (a very bad hack).
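
A minimal sketch of such a page filter, assuming the scraper methods raise a known exception type when content is missing, so the crawler can catch specific errors instead of using an empty try-except. `NoRecipeContent`, `FakeScraper`, and `has_recipe` are hypothetical names used only for illustration. A real crawler would pass in the object returned by scrape_me() and list whatever exceptions the library actually raises.

```python
class NoRecipeContent(Exception):
    """Hypothetical error raised by a scraper method when content is missing."""


class FakeScraper:
    """Stand-in for a scraper object, so the sketch runs without network access."""

    def __init__(self, title=None, ingredients=None, instructions=None):
        self._title = title
        self._ingredients = ingredients
        self._instructions = instructions

    def _get(self, value, name):
        # Mimic a scraper that raises when nothing was parsed.
        if value is None:
            raise NoRecipeContent(f"{name} not found")
        return value

    def title(self):
        return self._get(self._title, "title")

    def ingredients(self):
        return self._get(self._ingredients, "ingredients")

    def instructions(self):
        return self._get(self._instructions, "instructions")


def has_recipe(scraper, errors=(NoRecipeContent, AttributeError, TypeError)):
    """Return True only if title, ingredients, and instructions are all non-empty."""
    for method_name in ("title", "ingredients", "instructions"):
        try:
            value = getattr(scraper, method_name)()
        except errors:
            # Only the listed error types mean "no recipe here".
            return False
        if not value:
            # Empty string or empty list also counts as missing.
            return False
    return True
```

Naming the exception types in one place means an unexpected error (for example, a network failure) still surfaces instead of being silently swallowed.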

So ElementNotFoundInHtml is currently only raised by the functions get_yields() and get_minutes().

The ingredients, instructions, and so on will be in every recipe. Should ElementNotFoundInHtml be raised for ingredients, instructions, and so on when the expected content is unavailable, presumably because there isn't recipe content on that webpage? An image is only present in some of the recipes, so when it is not present, should image() just return an empty string, or should there be some other behavior?

@bfcarpio
Collaborator

Some sort of error should be raised if the element is unavailable. We have a schema variant and a non-schema variant, I believe. These errors would include things like "ingredients section doesn't exist" as well as "I tried to get the title, but it was an integer".
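
To illustrate the two kinds of error described above, here is a hedged sketch using a title lookup on schema-like data. `ElementMissingError`, `UnexpectedTypeError`, and `title_from_schema` are hypothetical names invented for this example, and reading the title from a `name` key is an assumption. None of this is the library's actual implementation.

```python
class ElementMissingError(Exception):
    """Hypothetical: the expected section does not exist on the page."""


class UnexpectedTypeError(Exception):
    """Hypothetical: the section exists but holds the wrong kind of value."""


def title_from_schema(schema_data: dict) -> str:
    """Return the recipe title, raising a descriptive error when it is unusable."""
    if "name" not in schema_data:
        raise ElementMissingError("title section doesn't exist")
    name = schema_data["name"]
    # Type check: fail clearly instead of letting a TypeError surface later.
    if not isinstance(name, str):
        raise UnexpectedTypeError(
            f"tried to get the title, but it was {type(name).__name__}"
        )
    return name.strip()
```

A caller can then tell "no recipe on this page" apart from "recipe data is malformed" by which exception it catches.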

ref

@micahcochran
Contributor Author

These are improved. There are still a few issues with the parse, but those are my own doing. I'll fix them in due time. It would be nice to have 2-3 test-case webpages.


2 participants