NIHHealthyEating issues with a page without content #393
Comments
I just did a quick peek and I think you've discovered at least 2 bugs.
I think our general direction is that we'd just have all the individual methods throw errors when they can't parse anything. We don't have any global functionality in the factory that errors out when there is no recipe on the page. For your scraper, you could argue that if there are no ingredients, instructions, or title, you should skip the page.
I'm quite interested in why
Feel free to submit a PR where you feel comfortable and link your crawler in #9!
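The "skip the page" check suggested above could be sketched roughly as follows. This is only an illustration, not code from recipe-scrapers: `safe_call`, `looks_like_recipe`, and both stub classes are hypothetical names, and the list of caught exception types is an assumption based on the `AttributeError` reported in this issue. The stubs stand in for real scraper objects so the sketch runs without network access.

```python
from typing import Callable, Optional


def safe_call(method: Callable[[], object]) -> Optional[object]:
    """Call one scraper method; return None if it cannot parse the page."""
    try:
        return method()
    except (AttributeError, KeyError, TypeError, NotImplementedError):
        # Assumption: these are the error types a scraper method may raise
        # on a page without recipe content (this issue reports AttributeError).
        return None


def looks_like_recipe(scraper) -> bool:
    """Return True only if title, ingredients, and instructions are all non-empty."""
    fields = (scraper.title, scraper.ingredients, scraper.instructions)
    return all(safe_call(field) for field in fields)


class EmptyPageStub:
    """Hypothetical stand-in for a scraper built from a page with no recipe."""

    def title(self):
        return "Home page"

    def ingredients(self):
        raise AttributeError("'NoneType' object has no attribute 'find_all'")

    def instructions(self):
        raise AttributeError("'NoneType' object has no attribute 'find_all'")


class RecipePageStub:
    """Hypothetical stand-in for a scraper built from a real recipe page."""

    def title(self):
        return "Tomato soup"

    def ingredients(self):
        return ["4 tomatoes", "1 onion"]

    def instructions(self):
        return "Chop, simmer, blend."


if __name__ == "__main__":
    for page in (EmptyPageStub(), RecipePageStub()):
        verdict = "keep" if looks_like_recipe(page) else "skip"
        print(type(page).__name__, verdict)
```

The exception handling lives in one small helper and names the exception types it expects, so the crawler itself does not need a bare try-except around every call.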
Thanks @bfcarpio for your insight.
Yes, that's exactly what the crawler does. I haven't released the next version with the recipe_scraper-related code because I ran into this issue. I would prefer that my crawler not have to rely on an empty try-except block (a very bad hack). So the
The ingredients, instructions, and so on will be in every recipe. Should
Some sort of error should be raised if the element is unavailable. We have a schema variant and a non-schema variant, I believe. These errors would include things like "ingredients section doesn't exist" as well as "I tried to get the title, but it was an integer".
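A minimal sketch of what such errors might look like, assuming a small exception hierarchy. None of these names (`RecipeElementError`, `ElementMissingError`, `ElementTypeError`, `require_text`) are taken from recipe-scrapers; they are hypothetical and only illustrate the two failure modes described above.

```python
class RecipeElementError(Exception):
    """Hypothetical base class for errors raised by individual scraper methods."""


class ElementMissingError(RecipeElementError):
    """The requested element is not present on the page."""

    def __init__(self, element: str):
        super().__init__(f"{element} section doesn't exist")
        self.element = element


class ElementTypeError(RecipeElementError):
    """The element was found but has an unexpected type."""

    def __init__(self, element: str, value: object):
        found = type(value).__name__
        super().__init__(f"I tried to get the {element}, but it was {found}")
        self.element = element
        self.value = value


def require_text(element: str, value: object) -> str:
    """Validate a raw parsed value and return it as stripped text."""
    if value is None:
        raise ElementMissingError(element)
    if not isinstance(value, str):
        raise ElementTypeError(element, value)
    return value.strip()


if __name__ == "__main__":
    for name, raw in (("title", " Soup "), ("ingredients", None), ("title", 5)):
        try:
            print(repr(require_text(name, raw)))
        except RecipeElementError as err:
            # One shared base class lets a caller skip the page explicitly.
            print("skipping page:", err)
```

With a shared base class, a crawler could catch `RecipeElementError` alone and skip the page, rather than wrapping calls in an empty try-except.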
These are improved. There are still a few issues with the parse, but those are my own doing. I'll fix them in due time. It would be nice to have 2-3 test-case webpages.
Pre-filing checks
The URL of the recipe(s) that are not being scraped correctly
This is intentionally a bad URL being sent to the scrape_me() function:
The version of Python you're using
3.6.9
The operating system of your environment
Ubuntu Linux 18.04
The results you expect to see
I expect a mostly consistent experience among different scrapers. See the Notes section below.
The results (including any Python error messages) that you are seeing
raises AttributeError for all methods except title().
Can you write Python and would you like to help fix the scraper yourself?
We'd be glad for your assistance! We can provide you with guidance and code review in return. If so, tick any of the relevant boxes below:
recipe-scrapers team try to fix this
Notes
I think I have a fix, but I want to get some guidance before submitting it.
I also have code to implement image() method.
For example, scraping https://www.allrecipes.com/ itself (presumably index.html), which is for the most part a schema scraper, I get the following:
Setup:
Results
I have written a website crawler, which crawls pages on the same domain. The crawler sends those pages to recipe_scraper, and some of them will not have recipe content on them. What behavior should be expected when there isn't a recipe?
I can submit a PR for this if it helps discussion.