Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception with 6.0.7: "This should be implemented". #148

Closed
synergiator opened this issue May 4, 2020 · 6 comments
Closed

Exception with 6.0.7: "This should be implemented". #148

synergiator opened this issue May 4, 2020 · 6 comments

Comments

@synergiator
Copy link

Installed the packege 6.0.7 and tried with Python 3.6 following the README file:


from recipe_scrapers import scrape_me
scraper = scrape_me('http://allrecipes.com/Recipe/Apple-Cake-Iv/Detail.aspx')

scraper.title()
scraper.total_time()
scraper.yields()
scraper.ingredients()
scraper.instructions()
scraper.image()
scraper.links()

Error:


Traceback (most recent call last):
  File "sandbox.py", line 10, in <module>
    scraper.total_time()
  File "/home/user/.local/lib/python3.6/site-packages/recipe_scrapers/_abstract.py", line 44, in schema_org_priority_wrapper
    return value or decorated(self, *args, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/recipe_scrapers/_abstract.py", line 85, in total_time
    raise NotImplementedError("This should be implemented.")
NotImplementedError: This should be implemented.

@jayaddison
Copy link
Collaborator

Thanks for the report @synergiator; the allrecipe scraper should be picking up the necessary data from schema.org metadata in the page and returning it to us, but it looks like that isn't happening here.

Do you have any time to dig into why the code falls into this path for the example recipe?

@synergiator
Copy link
Author

Hi there @jayaddison, I can now share some facts what I could find so far.

  • All scrapers base on the abstract class AbstractScraper. So if one of its methods is not overriden, like total_time, it raises an exception.
  • The error I've observed with the allrecipes scraper seems therefore so to say to be "correct" as it indeed does not implement an overriding method - in fact, it implements just host() method.
  • So I thought maybe with another scraper which does implement this method things should work better, for example the Epicurious scraper. But, it produces same error which I can't understand.

The above test was in a fresh Python virtual environment with just the package and its requirements installed. Then, I've tried same test from the path where all other tests are, in a corresponding recipe-scrapers development virtual environment. No error. This behavior gives me an impression there is some wrongly coded dynamic loader / dependency injection or similar (just a guess), but for time being I can't tell more and better pass the word to the authors of this code. :-)

@jayaddison
Copy link
Collaborator

Thanks for the analysis @synergiator - this leads me to believe we have a regression somewhere between 6.0.7 and the 5.x.x branch (latest release 5.18.0).

The largest change between the two is the introduction of schema.org handlers taking priority.

Essentially the code will look for a schema.org (or JSON-LD) metadata property on the target page, and only if it does not find it will it revert to using the def total_time(...) (or other extraction field) method.

It seems that the example Apple Cake recipe doesn't contain such metadata, and so the code tries to invoke the AllRecipes extraction methods -- but then finds that they don't exist either, so it finally raises the NotImplementedError.

I've no sense for how many AllRecipes pages contain vs. do not contain schema.org properties, unfortunately, so I don't know how many scrapes this would affect.

There are a few options here:

  1. If you wish, you could try the 5.18.0 release and you should find that it works as advertised
  2. We may fix the readme link and provide an example URL that works, 'hiding' the problem
  3. We may restore the ability of the AllRecipes scraper (and others?) to handle pages-without-metadata -- this is the most 'complete' fix

It could be a while before we get to item 3. Let me know if this helps you out short-term :)

@synergiator
Copy link
Author

Hi, there, @jayaddison, thanks for the explanation! In the meantime I've found out there is a quite large dataset from MIT with about 1M recipes so I don't need currently to scrape by myself.

Regarding the issue, it's strange enough same site uses different templates to render recipes.

@hhursev
Copy link
Owner

hhursev commented May 10, 2020

@synergiator version 6.1.0 is updated and will handle "gracefully" the exception (defaulting to 0 if there is no information for preparation time found). Thanks for pointing out the issue!

The dataset you've linked is awesome! You can take a look at this issue for another set of recipes w/o needing to scrape a thing ;)

@hhursev hhursev closed this as completed May 10, 2020
@synergiator
Copy link
Author

hi there @hhursev! thank you for the fix, and the link to the other data set!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants