Verify a set of wrongly parsed URLs with a version from 2017 #155

Closed
synergiator opened this issue May 14, 2020 · 1 comment

synergiator commented May 14, 2020

It seems that, at least as of 2017, one of the scrapers (epicurious) did not discard certain URLs that are not actual recipes. This could be either an accepted limitation by design or a missing feature.

Actual problem: some of the parsed Epicurious recipes do not contain the "ingredients" element. It is simply not there. Possible causes:

  • A URL formally qualifies as a recipe by its template, but actually is not one (extreme examples: the "untitled" pages, or /recipes/food/views/reserve-this-recipe-id-for-future-use-51234840).
  • Some other reason.
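
To illustrate the first point, here is a minimal sketch. The regular expression is only my guess at the URL template, inferred from the paths in this issue; it is not taken from the scraper's code. It shows that a real recipe and the placeholder pages are indistinguishable by URL shape alone:

```python
import re

# Hypothetical template for Epicurious recipe paths, inferred only from the
# examples in this issue; the scraper's real URL handling may differ.
RECIPE_PATH = re.compile(r"^/recipes/food/views/[a-z0-9-]+-\d+$")

paths = [
    "/recipes/food/views/risotto-357170",
    "/recipes/food/views/untitled-105056",
    "/recipes/food/views/reserve-this-recipe-id-for-future-use-51234840",
]


def matches_template(path: str) -> bool:
    """Return True if the path has the shape of a recipe URL."""
    return RECIPE_PATH.match(path) is not None


for path in paths:
    # Every path passes the shape check, including the placeholder pages.
    print(path, matches_template(path))
```

So a check on the URL alone cannot tell these pages apart; the page content has to be inspected.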

I understand that this is a problem with data outliers rather than with the scrapers, so maybe it needs to be clarified how much scraping intelligence and data model sensitivity is required at this level and, if none, how best to implement or integrate it, since this seems to be a generally relevant use case in this context (i.e. one outlier is a marginal problem, but across many large datasets this can add up to a bigger data quality issue).

Of course, the issue needs to be validated against an up-to-date version.

URLs of recipes that produce recipe data without ingredients:


epicurious-recipes.json  /recipes/food/views/opening-a-fresh-coconut-104353
epicurious-recipes.json  /recipes/food/views/cracking-and-grating-coconut-103091
epicurious-recipes.json  /recipes/food/views/to-carve-a-rib-roast-15825
epicurious-recipes.json  /recipes/food/views/to-quick-roast-and-peel-chilies-or-peppers-14149
epicurious-recipes.json  /recipes/food/views/roast-smoked-loin-of-pork-20033
epicurious-recipes.json  /recipes/food/views/crown-roast-of-pork-with-sauerkraut-20035
epicurious-recipes.json  /recipes/food/views/to-toast-and-skin-hazelnuts-14281
epicurious-recipes.json  /recipes/food/views/brazilian-black-beans-20146
epicurious-recipes.json  /recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
epicurious-recipes.json  /recipes/food/views/boiled-carrots-with-prepared-horseradish-51154600
epicurious-recipes.json  /recipes/food/views/tagliatelle-em-flat-egg-noodles-em-51221240
epicurious-recipes.json  /recipes/food/views/how-to-toast-nuts-51220040
epicurious-recipes.json  /recipes/food/views/to-form-water-caltrop-wontons-51147800
epicurious-recipes.json  /recipes/food/views/whale-steaks-51197430
epicurious-recipes.json  /recipes/food/views/strozzapreti-and-pici-51221410
epicurious-recipes.json  /recipes/food/views/herb-basting-brush-51103400
epicurious-recipes.json  /recipes/food/views/chocolate-dipped-wild-spearmint-leaves-51105400
epicurious-recipes.json  /recipes/food/views/vanilla-sugar-357809
epicurious-recipes.json  /recipes/food/views/yaki-onigiri-365589
epicurious-recipes.json  /recipes/food/views/to-wash-greens-and-chopped-sliced-leeks-241804
epicurious-recipes.json  /recipes/food/views/quick-fresh-fruit-sauces-for-yogurt-pancakes-and-waffles-358284
epicurious-recipes.json  /recipes/food/views/risotto-357170
epicurious-recipes.json  /simple-syrup-368889-recipe
epicurious-recipes.json  /recipes/food/views/smoked-salmon-with-egg-salad-and-green-beans-350313
epicurious-recipes.json  /recipes/food/views/fish-stock-350281
epicurious-recipes.json  /recipes/food/views/building-blocks-for-self-recipes-355489
epicurious-recipes.json  /recipes/food/views/grandma-reggies-chopped-liver-350254
epicurious-recipes.json  /recipes/food/views/maple-and-black-pepper-bacon-350874
epicurious-recipes.json  /recipes/food/views/chipotle-mayonnaise-1222196
epicurious-recipes.json  /recipes/food/views/to-roast-and-peel-large-chiles-235803
epicurious-recipes.json  /recipes/food/views/technique-of-preparing-nopales-243027
epicurious-recipes.json  /recipes/food/views/gourmet-magazine-grilling-procedures-242904
epicurious-recipes.json  /recipes/food/views/poaching-lobster-231560
epicurious-recipes.json  /recipes/food/views/sun-dried-tomato-and-anchovy-dip-109565
epicurious-recipes.json  /recipes/food/views/cafe-porto-200947
epicurious-recipes.json  /recipes/food/views/passover-honey-nut-cake-in-soaking-syrup-109412
epicurious-recipes.json  /recipes/food/views/coconut-cream-pie-with-chocolate-cookie-crust-109182
epicurious-recipes.json  /recipes/food/views/red-pepper-cumin-dip-109563
epicurious-recipes.json  /recipes/food/views/sugar-syrup-200774
epicurious-recipes.json  /recipes/food/views/procedure-for-shorter-time-processing-230703
epicurious-recipes.json  /recipes/food/views/victors-parmesan-and-olive-oil-crostino-231338
epicurious-recipes.json  /recipes/food/views/to-tender-roast-bell-peppers-234691
epicurious-recipes.json  /recipes/food/views/grilling-procedure-234697
epicurious-recipes.json  /recipes/food/views/asian-lamb-stir-fry-in-radicchio-wraps-109696
epicurious-recipes.json  /recipes/food/views/aunt-berthas-strawberry-schaumtorte-109665
epicurious-recipes.json  /recipes/food/views/spicy-mexican-salsa-109766
epicurious-recipes.json  /recipes/food/views/rum-stinger-200227
epicurious-recipes.json  /recipes/food/views/untitled-105056
epicurious-recipes.json  /recipes/food/views/boiling-water-bath-for-jams-chutneys-pickles-and-salsas-105936
epicurious-recipes.json  /recipes/food/views/to-zest-citrus-fruits-106139
epicurious-recipes.json  /recipes/food/views/crisp-calamari-on-a-href-cooking-how-to-food-dictionary-entry-id-3529-mizuna-a-with-lime-vinaigrette-105204
epicurious-recipes.json  /recipes/food/views/chicken-vedova-108148
epicurious-recipes.json  /recipes/food/views/spinach-bacon-and-cashew-stuffing-107374
epicurious-recipes.json  /recipes/food/views/to-seal-process-and-store-filled-jars-108424
epicurious-recipes.json  /recipes/food/views/to-sterilize-jars-108425
epicurious-recipes.json  /recipes/food/views/braised-veal-shanks-102964
epicurious-recipes.json  /recipes/food/views/to-peel-plantains-105639
epicurious-recipes.json  /recipes/food/views/barley-and-mushroom-pilaf-107296
epicurious-recipes.json  /recipes/food/views/sterilizing-jars-105256
epicurious-recipes.json  /recipes/food/views/sealing-processing-and-storing-filled-jars-105255
epicurious-recipes.json  /recipes/food/views/to-prepare-a-water-bath-for-baking-105616
epicurious-recipes.json  /recipes/food/views/to-toast-spices-nuts-or-seeds-105622
epicurious-recipes.json  /recipes/food/views/almond-buttercrunch-108697
epicurious-recipes.json  /recipes/food/views/chocolate-mousse-and-raspberry-cream-dacquoise-102614
epicurious-recipes.json  /recipes/food/views/to-finely-grate-parmigiano-reggiano-107726
epicurious-recipes.json  /recipes/food/views/how-to-clean-and-steam-mussels-106390
epicurious-recipes.json  /recipes/food/views/untitled-100548
epicurious-recipes.json  /recipes/food/views/to-sterilize-jars-for-pickling-100574
epicurious-recipes.json  /recipes/food/views/tutti-frutti-102087
epicurious-recipes.json  /recipes/food/views/to-sterilize-jars-and-lids-for-preserving-102234
epicurious-recipes.json  /recipes/food/views/poached-salmon-102305
epicurious-recipes.json  /recipes/food/views/untitled-102429
epicurious-recipes.json  /recipes/food/views/white-chocolate-leaves-101132
epicurious-recipes.json  /recipes/food/views/pastry-dough-102554
epicurious-recipes.json  /recipes/food/views/leftover-lamb-casserole-101431
epicurious-recipes.json  /recipes/food/views/chicken-stock-101136
epicurious-recipes.json  /recipes/food/views/orange-scented-hot-chocolate-102810
epicurious-recipes.json  /recipes/food/views/baked-acorn-squash-101446
epicurious-recipes.json  /recipes/food/views/mango-puree-101134
epicurious-recipes.json  /recipes/food/views/caramel-shards-101135
epicurious-recipes.json  /recipes/food/views/to-sterilize-jars-and-lids-101506
epicurious-recipes.json  /recipes/food/views/cranberry-kir-102815
epicurious-recipes.json  /recipes/food/views/kebabs-101717
epicurious-recipes.json  /recipes/food/views/to-roast-and-peel-bell-peppers-or-poblano-chiles-15157
epicurious-recipes.json  /recipes/food/views/baby-chickens-on-the-spit-101736
epicurious-recipes.json  /recipes/food/views/grilled-italian-sausages-101730
epicurious-recipes.json  /recipes/food/views/spitted-duckling-101745
epicurious-recipes.json  /recipes/food/views/broiled-whole-lobster-101747
epicurious-recipes.json  /recipes/food/views/broiled-whole-fish-101746
epicurious-recipes.json  /recipes/food/views/frankfurters-101729
epicurious-recipes.json  /recipes/food/views/chicken-tarragon-101732
epicurious-recipes.json  /recipes/food/views/broiled-duckling-101743
epicurious-recipes.json  /recipes/food/views/spitted-roast-chicken-101731
epicurious-recipes.json  /recipes/food/views/to-quick-roast-and-peel-peppers-14967
epicurious-recipes.json  /recipes/food/views/to-peel-tomatoes-15503
epicurious-recipes.json  /recipes/food/views/to-warm-tortillas-14142
epicurious-recipes.json  /suzanne-goin-s-corned-beef-and-cabbage-with-parsley-mustard-sauce-56389323-recipe
epicurious-recipes.json  /recipes/seared-scallops-with-tomato-water-lime-and-mint-51242060-recipe
epicurious-recipes.json  /recipes/food/views/reserve-this-recipe-id-for-future-use-51234840
epicurious-recipes.json  /recipes/food/views/chocolate-plum-cake56390135
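
A list like the one above could be produced roughly as follows. This is only a sketch: it assumes the scraped data is a mapping from URL path to a recipe record (a dict), which may not match the real layout of epicurious-recipes.json. The sample paths and records are made up for illustration.

```python
from typing import Any


def urls_without_ingredients(recipes: dict[str, dict[str, Any]]) -> list[str]:
    """Return the URLs whose record has no usable "ingredients" entry.

    A missing key and an empty list are treated the same way.
    """
    return [url for url, record in recipes.items() if not record.get("ingredients")]


# Made-up sample data; the real file layout is an assumption.
sample = {
    "/recipes/food/views/example-recipe-1": {"title": "Example", "ingredients": ["1 cup rice"]},
    "/recipes/food/views/example-howto-2": {"title": "How-to"},
    "/recipes/food/views/example-empty-3": {"title": "Empty", "ingredients": []},
}

for url in urls_without_ingredients(sample):
    print("epicurious-recipes.json ", url)
```

With a real file, the mapping would come from `json.load` on the opened file, provided the file actually has this structure.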

hhursev (Owner) commented Jun 3, 2020

Indeed, the given recipes do not have ingredients listed on the site. The scrapers are functioning as intended, so I'll close the issue. Feel free to reopen if I have missed your point 🙂

This package is intended to be a super simple tool that handles parsing the HTML. If no data is found in the HTML, the scrapers won't assume anything. They will return default values and that's it.
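
As a rough illustration of that behaviour, here is a stand-in parser built on the standard library. It is not the package's actual implementation, and the `<li class="ingredient">` markup is a hypothetical example rather than Epicurious's real HTML. The point is only that a parser which finds nothing returns an empty default instead of guessing:

```python
from html.parser import HTMLParser


class IngredientParser(HTMLParser):
    """Collect the text of <li class="ingredient"> elements (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self.ingredients: list[str] = []
        self._in_ingredient = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "ingredient") in attrs:
            self._in_ingredient = True
            self.ingredients.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_ingredient = False

    def handle_data(self, data):
        if self._in_ingredient:
            self.ingredients[-1] += data


def parse_ingredients(html: str) -> list[str]:
    """Return the ingredients found in the HTML, or an empty list if none."""
    parser = IngredientParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.ingredients]


print(parse_ingredients('<ul><li class="ingredient">2 eggs</li></ul>'))
print(parse_ingredients("<p>To toast nuts, spread them on a baking sheet.</p>"))
```

The second call returns an empty list because the page has no ingredient markup; nothing is inferred from the surrounding text.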

Depending on your use case and aim, you can:

  • omit saving/analyzing the recipes with missing data
  • try to implement a clever mechanism that fills in the missing information based on what's at hand
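
On the caller's side, the first option could look roughly like this. It is a minimal sketch: the choice of required fields and the record layout are assumptions for illustration, not something the package prescribes. Incomplete records are set aside rather than dropped, so the second option can still be applied to them later.

```python
# Hypothetical choice of fields a record needs before it is saved or analyzed.
REQUIRED_FIELDS = ("title", "ingredients", "instructions")


def is_complete(record: dict) -> bool:
    """Return True if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def split_by_completeness(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (complete, incomplete), preserving order."""
    complete, incomplete = [], []
    for record in records:
        (complete if is_complete(record) else incomplete).append(record)
    return complete, incomplete


# Made-up records for illustration.
records = [
    {"title": "Example soup", "ingredients": ["water", "salt"], "instructions": "Boil."},
    {"title": "Example how-to", "ingredients": [], "instructions": "Zest the fruit."},
]

complete, incomplete = split_by_completeness(records)
print(len(complete), "complete,", len(incomplete), "incomplete")
```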

However, that decision is beyond the package's responsibilities.

One can ask for advice on how to normalize recipe data, speed up scraping, elude bot protection mechanisms, and whatever else comes up when building a scraping-related data project, but these things are not the job of recipe-scrapers.

hhursev closed this as completed Jun 3, 2020