[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

dirkf · 2021-03-05T23:11:02Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

The BBCCoUkIPlayerPlaylistIE extractor was failing, whereas BBCCoUkPlaylistIE still worked for the same pids: eg, the _TESTS.

In the iPlayer series pages, the playlist metadata is now sent in a JSON expression within a <script> element, and this patch extracts the title and description from it, and modifies the _VIDEO_ID_TEMPLATE to match the episode links within the entities subobject of the JSON.

Also, the regex used to get the next page link for BBCCoUkPlaylistBaseIE didn't allow for a newline that is now sent between tags.

For multi-series iPlayer playlists, each series page is also processed.

For multi-page iPlayer playlists, the pagination regex is extended to match the "Next Page" link on these pages.

remitamine · 2021-03-13T09:38:49Z

do mix multiple changes in a single PR.

Allow white space between tags as now sent by BBC.

remitamine · 2021-03-22T17:51:08Z

does not handle programmes/series with multiple seasons.
does not handle seasons/groups with multiple pages.

dirkf · 2021-03-23T21:07:01Z

The current version should deal with your comments, but the download tests aren't yet validated: CI is failing because Certificate did not match expected hostname: files.pythonhosted.org when installing nose with pip install nose in the Jython configurations.

remitamine · 2021-03-24T09:36:43Z

as i was looking at this, i think there is a better way to make an optimized version that would reduce the number of requests and still support all the expected features in addition to extracting more metadata for --flat-playlist option, i will try to implement it.

dirkf · 2021-03-24T12:38:41Z

Feel free to use the test URLs from this PR for that. See also PR #23438 for a previous solution and set of test cases.

Generally, even though the BBC site support is split into different extractors, the BBCIE extractor in particular implements multiple extraction tactics and it's difficult to know which of these are still valid -- perhaps some telemetry would help, as well as refactoring the over-sized _real_extract() method.

It seems that two types of tactics are currently valid:

find some JSON in JS that is part of the page; extract links and metadata from the JSON;
find certain links, normally with a regex search, but mostly trying to replicate a DOM search similar to JS document.querySelectorAll() without the overhead of DOM.

Maybe generic methods can be created to support these tactics?

I have further updates to fix some of the failing test cases but will hold off if you're reimplementing.

The playlist metadata is now sent in a JSON expression within a <script> element. Save the initial playlist_id

remitamine · 2021-03-24T14:31:04Z

Generally, even though the BBC site support is split into different extractors, the BBCIE extractor in particular implements multiple extraction tactics and it's difficult to know which of these are still valid -- perhaps some telemetry would help, as well as refactoring the over-sized _real_extract() method.

It seems that two types of tactics are currently valid:

find some JSON in JS that is part of the page; extract links and metadata from the JSON;

find certain links, normally with a regex search, but mostly trying to replicate a DOM search similar to JS document.querySelectorAll() without the overhead of DOM.

Maybe generic methods can be created to support these tactics?

the main BBC Extractor has to be split in the future, this can be done by slowly extracting code that is related to a specific URL pattern into it's own extractor, it would certainly make the code more manageable.

I have further updates to fix some of the failing test cases but will hold off if you're reimplementing.

it always depend on whether there is a better way of doing thing, in the case of BBC Reel code, even thought i would preferred to introduce the code in a different way, but as the code can be reused in the future, i did accept it with only small changes, but in this case, the method that i added is quite different from the code here, so there aren't much code that can be reused(except for the tests).

dirkf · 2021-03-24T18:04:52Z

The changes I mentioned aren't specific to the iPlayer extractor (fixes to tests, morph playlists, support simorgh playlists).

Regarding the new code, please see #15710 (comment) for a case where what started out as an iPlayer URL redirected to a programmes page in which _call_api() returned None and caused a crash.

* https://github.com/ytdl-org/youtube-dl: [ard] improve clip id extraction(ytdl-org#22724)(closes ytdl-org#28528) release 2021.03.25 [ChangeLog] Actualize [ci skip] [zoom] Add new extractor(closes ytdl-org#16597, closes ytdl-org#27002, closes ytdl-org#28531) [extractor] escape forgotten dot for hostnames in regular expression (ytdl-org#28530) [bbc] fix BBC IPlayer Episodes/Group extraction(closes ytdl-org#28360) [youtube] Fix default value for youtube_include_dash_manifest (closes ytdl-org#28523)

dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch from c2c0fd4 to 4863404 Compare March 16, 2021 12:45

Fix matching next page tag (BBCCoUkPlaylistBaseIE)

7c7d42e

Allow white space between tags as now sent by BBC.

dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch from 4863404 to dc43160 Compare March 16, 2021 13:08

dirkf changed the title ~~Fix BBCCoUkIPlayerPlaylistIE~~ [bbc] Fix BBCCoUkIPlayerPlaylistIE Mar 22, 2021

dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch 2 times, most recently from 6920932 to a5b6df4 Compare March 24, 2021 13:15

dirkf added 4 commits March 24, 2021 13:15

Fix BBCCoUkIPlayerPlaylistIE

ef00fa7

The playlist metadata is now sent in a JSON expression within a <script> element. Save the initial playlist_id

Fix expected Description for Bohemian Icons test

ef7a5e1

Support multi-series(-season) iPlayer playlist

d5fb08a

Support multi-page iPlayer playlists

a5b6df4

remitamine closed this in a400024 Mar 24, 2021

leshasmlesha pushed a commit to leshasmlesha/youtube-dl that referenced this pull request Apr 3, 2021

[bbc] fix BBC IPlayer Episodes/Group extraction(closes ytdl-org#28360)

1d98b85

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

dirkf commented Mar 5, 2021 •

edited

Loading

remitamine commented Mar 13, 2021

remitamine commented Mar 22, 2021

dirkf commented Mar 23, 2021

remitamine commented Mar 24, 2021

dirkf commented Mar 24, 2021

remitamine commented Mar 24, 2021

dirkf commented Mar 24, 2021

[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

Conversation

dirkf commented Mar 5, 2021 • edited Loading

Please follow the guide below

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

Description of your pull request and other information

remitamine commented Mar 13, 2021

remitamine commented Mar 22, 2021

dirkf commented Mar 23, 2021

remitamine commented Mar 24, 2021

dirkf commented Mar 24, 2021

remitamine commented Mar 24, 2021

dirkf commented Mar 24, 2021

dirkf commented Mar 5, 2021 •

edited

Loading