NIHHealthyEating issues with a page without content #393

micahcochran · 2021-06-20T16:32:03Z

Pre-filing checks

I have searched for open issues that report the same problem
I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

This is intentionally a bad URL being sent to the scrape_me() function:

https://healthyeating.nhlbi.nih.gov/

The version of Python you're using

3.6.9

The operating system of your environment

Ubuntu Linux 18.04

The results you expect to see

I expect a mostly consistent experience among different scrapers. See the Notes section below.

The results (including any Python error messages) that you are seeing

Every method except title() raises AttributeError.

Can you write Python and would you like to help fix the scraper yourself? We'd be glad for your assistance! We can provide you with guidance and code review in return. If so, tick any of the relevant boxes below:

I'd like to try fixing this scraper myself
I'd like guidance to help me develop a fix
I'd prefer if the recipe-scrapers team try to fix this

Notes

I think I have a fix, but I want to get some guidance before submitting it.

I also have code to implement the image() method.

For example scraping https://www.allrecipes.com/ itself (presumably index.html), which is for the most part a schema scraper, I get the following:

Setup:

from recipe_scrapers import scrape_me
url = "https://www.allrecipes.com/"
r = scrape_me(url)

Results

>>> r.yields()
'0 serving(s)'

>>> r.ingredients()
[]

>>> r.instructions()
''

# title() raises a TypeError
>>> r.title()

I have written a website crawler that crawls pages on the same domain. The crawler sends each page to recipe_scrapers, and some of those pages will have no recipe content on them. What behavior should be expected when there isn't a recipe?

I can submit a PR for this if it helps discussion.

bfcarpio · 2021-06-20T22:58:43Z

I just did a quick peek and I think you've discovered at least 2 bugs.

1. No error handling for network requests (now tracked in "Add error handling for network requests", #394).
2. We made this big thing about now throwing exceptions by default in v13, but didn't update the schema scraper entirely (facepalm). We need something like this in yields, for example, so that it doesn't default to returning a value.

I think our general direction is that we'd just have all the individual methods throw errors when they can't parse anything. We don't have any global functionality in the factory that errors out because of no recipe on the page. For your scraper, you could argue that if there are no ingredients, instructions, or title that you should skip the page etc.
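A minimal sketch of that crawler-side check, under some assumptions: `looks_like_recipe`, `field_or_none`, and the two stand-in page classes are hypothetical names made up for illustration, and the set of exceptions treated as "field missing" is only a guess based on the AttributeError and TypeError reported in this issue. It catches specific exception types rather than using a bare try-except.

```python
from typing import Any, Callable, Tuple

# Exception types treated as "field missing". Which ones a real scraper raises
# is an assumption here, taken from the errors reported in this issue.
MISSING_FIELD_ERRORS: Tuple[type, ...] = (AttributeError, TypeError)


def field_or_none(getter: Callable[[], Any]) -> Any:
    """Call a scraper method; return None if it raises or returns an empty value."""
    try:
        value = getter()
    except MISSING_FIELD_ERRORS:
        return None
    return value or None


def looks_like_recipe(scraper: Any) -> bool:
    """Hypothetical crawler check: require title, ingredients, and instructions."""
    getters = (scraper.title, scraper.ingredients, scraper.instructions)
    return all(field_or_none(getter) is not None for getter in getters)


class EmptyPage:
    """Stand-in for a scraper built from a page with no recipe on it."""

    def title(self) -> str:
        raise TypeError("title is not a string")

    def ingredients(self) -> list:
        return []

    def instructions(self) -> str:
        return ""


class RecipePage:
    """Stand-in for a scraper built from a page that does contain a recipe."""

    def title(self) -> str:
        return "Bean Soup"

    def ingredients(self) -> list:
        return ["1 cup beans", "2 cups water"]

    def instructions(self) -> str:
        return "Simmer until tender."
```

With these stand-ins, `looks_like_recipe(RecipePage())` is true and `looks_like_recipe(EmptyPage())` is false, so a crawler could skip the second page without swallowing unrelated errors.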

I'm quite interested in why title() raises a TypeError; though, I think that's also easy pickings for doing a type check.

Feel free to submit a PR where you feel comfortable and link your crawler in #9!

micahcochran · 2021-06-21T15:36:03Z

Thanks @bfcarpio for your insight.

For your scraper, you could argue that if there are no ingredients, instructions, or title that you should skip the page etc.

Yes, that's exactly what the crawler does. I haven't released the next version with the recipe_scrapers-related code because I ran into this issue. I would prefer that my crawler does not have to rely upon an empty try-except block (a very bad hack).

So ElementNotFoundInHtml is currently only raised by the functions get_yields() and get_minutes().

The ingredients, instructions, and so on will be in every recipe. Should ElementNotFoundInHtml be raised for ingredients, instructions, and so on when the expected content is unavailable, presumably because there isn't recipe content on that webpage? An image is only present in some of the recipes, so when it is not present, should image() just return an empty string, or should it do something else?

bfcarpio · 2021-06-22T05:57:05Z

Some sort of error should be raised if the element is unavailable. We have a schema variant and a non-schema variant, I believe. These errors would include things like "ingredients section doesn't exist" as well as "I tried to get the title, but it was an integer".

ref
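The two kinds of error described in this comment could be sketched roughly as follows. This is an illustration only, not the library's code: `ElementNotFound` is a local stand-in for the library's ElementNotFoundInHtml (which is not imported here), and `require_text` is a hypothetical helper name.

```python
from typing import Any


class ElementNotFound(Exception):
    """Local stand-in for the library's ElementNotFoundInHtml (assumption)."""


def require_text(value: Any, field: str) -> str:
    """Return a non-empty, stripped string, or raise a descriptive error.

    Hypothetical helper: a missing or empty value raises ElementNotFound,
    and a value of the wrong type raises TypeError.
    """
    if value is None:
        raise ElementNotFound(f"{field} section doesn't exist")
    if not isinstance(value, str):
        raise TypeError(
            f"expected {field} to be a str, got {type(value).__name__}"
        )
    text = value.strip()
    if not text:
        raise ElementNotFound(f"{field} is empty")
    return text
```

For example, `require_text(None, "ingredients")` raises `ElementNotFound`, while `require_text(3, "title")` raises `TypeError`, which keeps "nothing on the page" distinct from "something unexpected on the page".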

micahcochran · 2021-08-09T21:17:12Z

These are improved. There are still a few issues with the parse, but those are my own doing. I'll fix them in due time. It would be nice to have 2-3 test-case webpages.

micahcochran added the bug label Jun 20, 2021

micahcochran mentioned this issue Jun 22, 2021

NIHHealthyEating: Improvements #395

Closed

micahcochran closed this as completed Aug 9, 2021
