
import collections #1012

Open
snickers2k opened this issue Mar 1, 2024 · 7 comments

@snickers2k

Took this request from tandoor, as they're using recipe-scrapers for this.

Is your feature request related to a problem? Please describe.

It would be great to have the ability to import whole collections (from supported sites) into Tandoor via recipe-scrapers.

https://www.chefkoch.de/rezeptsammlung/896538/Auflaeufe.html
for example

Describe the solution you'd like

"one-click" import for collections

Describe alternatives you've considered

importing hundreds of favorite recipes by "hand" (tandoor bookmarklet)

thanks

@jayaddison
Collaborator

Hi @snickers2k - thanks for the request: let's see what we can figure out.

Currently, as you may have realized already, recipe-scrapers is designed to extract a single recipe from a public HTML recipe webpage by URL -- and the codebase and test suite are modeled around that.

We do have a scraper implemented for individual chefkoch.de recipes - so those become available for import in Tandoor -- but not the collections you mention.

@hhursev @jknndy @strangetom @brett do you have any thoughts on whether we could/should support retrieval of collections of recipes? I lean towards 'no', and could explain more about that, but my perspective could be different from others.

Linking to TandoorRecipes/recipes#3017 to keep a relationship between the two issue reports.

@jknndy
Collaborator

jknndy commented Mar 1, 2024

I think this could be a cool function to add but it definitely feels outside the scope of the core usage.

A possible implementation would be to specify some format or header / something in the HTML that switches over to returning scraper.links(), which could then be filtered based on specified URL patterns, before prompting the user to select from a couple of different next steps. I have a few ideas that I'll look into, and maybe I'll open a PR if it turns into anything.
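To make the links-filtering idea concrete, here's a hedged sketch. It assumes scraper.links() returns a list of anchor-attribute dicts (each with an "href" key); the chefkoch.de URL pattern and sample data below are illustrative, not verified against the real site:

```python
import re

def filter_recipe_links(links, pattern):
    """Keep only hrefs that match a site-specific recipe URL pattern.

    `links` is assumed to look like the output of scraper.links():
    a list of anchor-attribute dicts, e.g. [{"href": "..."}].
    """
    matcher = re.compile(pattern)
    return [
        link["href"]
        for link in links
        if "href" in link and matcher.search(link["href"])
    ]

# Illustrative data: one recipe link and one unrelated magazine link.
links = [
    {"href": "https://www.chefkoch.de/rezepte/1706411278853567/Auflauf.html"},
    {"href": "https://www.chefkoch.de/magazin/artikel/123/tipps.html"},
]
recipe_links = filter_recipe_links(links, r"/rezepte/\d+/")
# → ["https://www.chefkoch.de/rezepte/1706411278853567/Auflauf.html"]
```

From there, the "next steps" prompt could present the filtered list for the user to confirm before any per-recipe scraping happens.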

@strangetom
Collaborator

I agree that it feels a bit out of scope, but I think it's worth investigating. There might also be an opportunity to combine this with handling for web pages that contain more than one recipe, which has come up a few times before.

I can imagine something along the lines of

def scrape_collection(url: str, **options: Any) -> List[Scraper]:
    # Check that the URL is for a site that we support scraping
    # collections for, and raise an exception if not.
    ...

    # Do some magic to scrape the collection.
    ...

To determine if a scraper supports scraping collections, we could add a class method to AbstractScraper, e.g. .supports_collections(), which would default to False and scrapers would override it as necessary.
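A minimal sketch of that opt-in mechanism, using a stripped-down stand-in for AbstractScraper (the class names and the ChefkochCollections scraper below are hypothetical, not the library's real implementations):

```python
from typing import List

class AbstractScraper:
    # Stand-in for the real base class, for illustration only.
    @classmethod
    def supports_collections(cls) -> bool:
        # Default: scrapers do not support collection pages.
        return False

class ChefkochCollections(AbstractScraper):
    # Hypothetical scraper that opts in to collection scraping.
    @classmethod
    def supports_collections(cls) -> bool:
        return True

def scrape_collection(url: str, scraper_cls) -> List[AbstractScraper]:
    # Raise if the target site has not opted in.
    if not scraper_cls.supports_collections():
        raise NotImplementedError(
            f"{scraper_cls.__name__} does not support collections"
        )
    # Collection-scraping "magic" would go here; the idea is to
    # return one scraper per recipe found on the collection page.
    return []
```

The classmethod default keeps existing scrapers untouched: only sites that explicitly override supports_collections() become reachable through scrape_collection().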

@jayaddison
Collaborator

I think we should be cautious about this. In fact, @jknndy - I'd suggest pausing until we check a few more details.

In particular, I have a memory about copyright law as it relates to recipes: basically, that individual recipes are generally not copyrightable, but that the process of selecting and making available a number of recipes (as in cookbooks) could be.

With that in mind, I want to be careful so that we don't provide functionality that could get either us or our users into murky legal territory.

This page from 2002 is what I'm reading currently, although there may be more recent rulings/precedents: https://www.everything2.com/title/US+Copyright+for+Recipes

@jayaddison
Collaborator

I don't think I'm enough of an expert in this to provide definitive answers. I'll seek some expert/professional advice that I can share here about where the acceptable boundaries are; from there, we can figure out whether it's sensible to apply any restrictions (annoying though that could be). That could get complicated if any of the guidance is jurisdiction-specific, but I'll also try to find common, universally-acceptable baselines.

@jayaddison
Collaborator

I haven't yet found that advice, but I still intend to. While preparing for it, I'll write up some guidelines about how I think about the valid, intended, and potentially more dubious/controversial use cases that could become possible in future, and how those affect the way I evaluate feature suggestions and review code. These will be my opinions, but perhaps having a single place to view and adjust them will help build gradual consensus and give us something to refer back to when questions come up.

@jayaddison
Collaborator

Here's what I've put together so far. One side-effect of this is: no, I don't think we should support collections of recipes. However, this is all just my opinion and perspective on the library.

It needs some references/citations to be added if it is to become anything like a documented policy.

Proposed Scraper development guidelines

About me / introduction

  • Software developer; some interest in legal matters but no legal training/education.
  • A maintainer of recipe-scrapers after getting involved ~5 years ago.
  • Creator and business owner of RecipeRadar, a UK-based search engine that uses recipe-scrapers.
  • Resident in the UK, although I've spent some time in the USA; those two countries are, I think, the largest influences on my outlook.
  • Not a negotiator; I tend to back down in the face of non-technical interpersonal conflicts, and often admit that to the other party. I believe it's important for people to stand up for their rights, but for me individually, I often find the stress of doing that difficult (especially synchronously/in-person), and also often consider that my time would be better spent elsewhere.

Goals of these guidelines

Primarily: give recipe-scrapers the best chance of continuing to develop and thrive as a project.

To do that, provide reasoning and guidelines for:

  • The content (code and resources) that exists within the git repository.
  • The functionality that the library provides to developers and end-users.
  • The acceptance criteria for modifications to the library.

...in order to give us the best chance of maintaining a healthy relationship with content creators, part of which I believe involves being transparent and justifying how the library works and is developed.

The library has been successful so far and I expect that our contributors each have their own unstated and slightly-varying ideas about these areas already. Writing the guidelines down should allow us to debate them, refer to them during code review / development / discussion, and adjust them as times change or when problems are encountered.

Concerns / risks

I believe that the main risks for the project are that we may infringe copyright unfairly ourselves, or that we may create circumstances that make it unreasonably straightforward for other entities to infringe copyright unfairly.

Those circumstances would negatively affect recipe authors, and so I believe that our guidelines should reassure recipe authors about the way that we handle their recipe pages, while also providing a framework that contributors and maintainers can refer to (for example, when trying to decide whether a recipe website should be supported, or during decisions about how/whether to support specific fields on a scraper).

To a certain extent, food recipes -- lists of ingredients and a description of how to prepare them, especially when they do not include unpublished trade secrets -- are generally not protectable using copyright law in the legal systems I'm aware of.

However, some web recipe authors do earn income from their websites, and so they have a reason to want to protect their recipes, and could reasonably be concerned about code that can provide access to them.

Advice can be found on the web about ways to make recipes more likely to be copyrightable - some of it explains, for example, that photography and imagery are copyrightable, and some explains that it is possible to add distinct written elements in or around the core of the recipe to improve the chance that legal proceedings would consider it copyrightable.

In addition, there is historical precedent that although a typical individual recipe may not be copyrightable, cookbooks -- that is, multiple recipes that have been collected, curated and published together -- can be.

Geographic considerations

Implementation of copyright law and exceptions to it vary by location (jurisdiction), and we have contributors and downstream library consumers who we can reasonably expect could be almost anywhere in the world (with perhaps a small number of exceptions, based on our source repository and packaging distribution host providers).

Proposed Guidelines

Content

  • We only store a small number -- often only one -- HTML page from each recipe website that we support. We store this as test data, to ensure that the code for the website can accurately represent recipe information from that site, and to follow good software engineering practice (namely: test coverage, to prevent accidental bugs and regressions when the code changes).

  • With rare exceptions, code for each supported recipe website is placed into its own individual code file (module) - this makes it straightforward for our contributors to add, enhance, fix and remove individual recipe websites.

  • Recipe websites are identified (keyed) by their brand's DNS domain name(s) - that is, the part of the web address after the http(s):// and before the first '/' character (so, for example, 'example.org' -- if it were a recipe website -- would be a valid recipe scraper key). We list these entries up-front in the documentation of our code repository and also the published packaged library, to make it straightforward for users, contributors and content authors to check what recipe websites are supported.

  • All of the code in our repository is written as plaintext Python, and where possible we write that code as clearly and with as little obfuscation as possible. We use automated software tools to help us adhere to effective code style practices.

  • Published packages of the library are released with version numbers, and these version numbers correspond to tagged versions of the source code.
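The domain-name keying described above can be sketched with the standard library. This is a simplified illustration assuming a plain "strip the www. prefix" normalization; the library's own host-name handling may differ in detail:

```python
from urllib.parse import urlparse

def host_key(url: str) -> str:
    """Derive a scraper lookup key from a recipe URL: the part after
    http(s):// and before the first '/', minus any leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[len("www."):] if host.startswith("www.") else host

host_key("https://www.chefkoch.de/rezepte/123/Auflauf.html")  # → "chefkoch.de"
host_key("https://example.org/some-recipe")                   # → "example.org"
```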

Functionality

  • The library's interface accepts one recipe URL as input, and returns an object that provides access to the recipe's metadata as output.

  • Where possible we use industry-standard methods to extract recipe metadata from recipe webpages - in particular, schema.org's definition of Recipe metadata is widely-used on the web and in our code.

  • The library provides minimal, accessible text output that omits many of the surrounding HTML presentation elements. We do our best to ensure that each scraper's implementation produces an accurate representation of the recipe fields as they would appear in an HTML browser; discrepancies can be considered bugs.

  • We place a priority on ensuring that scraped recipes are attributable back to their origin website, canonical URL, and author.

  • Although in a small number of cases we provide scraper code that can extract information from HTML that users have retrieved from non-public pages (for example, recipes that are only visible after the user has logged in to a website where they have an account), we do not provide mechanisms to download non-public website content in this library.

  • We do not provide mechanisms to publish recipe information; our focus is on extracting information from HTML that a user has available on their system.

  • We recommend that our users follow good netiquette when retrieving recipe webpages.
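The schema.org approach mentioned above can be illustrated with a stdlib-only sketch that pulls Recipe metadata out of a page's JSON-LD block. The HTML below is invented for the example, and the library's real extraction is considerably more robust than this:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._buffer = []
            self._in_jsonld = False

# Invented example page carrying schema.org Recipe metadata.
html = """<html><head><script type="application/ld+json">
{"@type": "Recipe", "name": "Kartoffelauflauf", "author": {"name": "Example Author"}}
</script></head><body>...</body></html>"""

parser = JsonLdExtractor()
parser.feed(html)
recipes = [json.loads(block) for block in parser.blocks]
recipes[0]["name"]  # → "Kartoffelauflauf"
```

Because so many recipe sites publish this structured metadata, a scraper can often read title, author, and ingredients without touching the page's presentation markup at all.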

Acceptance Criteria

  • All of the recipe-related fields that the library can support -- not all of which are available for every recipe website -- are documented, and during consideration of proposed scraper implementations (code review), we are unlikely to accept code that extracts other unrelated information from a web page.

  • We consider some recipe information fields to be mandatory -- title, author, and ingredients, for example -- and without good reason we are very unlikely to accept proposed scraper implementations that lack demonstrated support for those fields.

  • With some rare exceptions (such as HTTP AJAX requests that would be made equivalently by any JavaScript-capable browser when viewing a public recipe webpage), we do not accept any scraper techniques that could be considered as circumvention or avoidance of bot-detection or authentication measures used by websites.

  • In our scraper 'feature request' template, when requesting sample URLs that contributors may consider as test recipes, we ask submitters to check that the recipe website they're requesting support for makes their recipes available to the public without requiring account login.
