la1ere: french overseas (9 regions) public and free #32713

Open
ecodix opened this issue Feb 1, 2024 · 13 comments · May be fixed by #32884
Labels
site-support-request Add extractor(s) for a new domain

Comments

@ecodix

ecodix commented Feb 1, 2024

CONTEXT

French overseas public channels are available without authentication from the portal: https://la1ere.francetvinfo.fr/

Apparently, all formats/manifests are the same as those from france.tv (extractor francetv.py); indeed, a few videos are even shared by both sites.
For the extractor's developers, it should be straightforward to derive a new extractor la1ere.py from francetv.py (only two URL templates/pages embed videos). I really hope my impression is right, and I would gladly help if I had some knowledge of video streams and Python.

The video catalogue is free, very large, and covers documentaries, news and fiction from many cultures. It would be a nice gift for anyone interested in them, since internet speeds overseas can be too slow for streaming :)

9 regions are covered (below, REGION refers to one of):

  • guadeloupe
  • guyane
  • martinique
  • mayotte
  • nouvellecaledonie
  • polynesie
  • reunion
  • saintpierremiquelon
  • wallisfutuna

Here is the mosaic of videos available for the chosen region:
https://la1ere.francetvinfo.fr/REGION/programme-video/
(as said before, a few of them are also available at: https://www.france.tv/la1ere/REGION/toutes-les-videos/)

TEMPLATES OF EMBEDDING URLs

In the end, final pages embedding videos may have 2 paths only:

The latter is used for series, news ...
The extractor could stay flexible and not whitelist the region list at all, leaving the region free-form (as a precaution against future changes), since the substring "/diffusion/" already identifies the templates.

EXAMPLES:

Documentary:
https://la1ere.francetvinfo.fr/martinique/programme-video/diffusion/4774522-origine-kongo.html

News:
https://la1ere.francetvinfo.fr/guadeloupe/programme-video/la1ere_guadeloupe_le-13h-en-guadeloupe/diffusion/5643549-emission-du-lundi-29-janvier-2024.html
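
Both example URLs could be matched by a single pattern. The following is only a minimal sketch, not the pattern of any existing extractor: the group names `region`, `id` and `display_id` are illustrative assumptions, and the region is deliberately left free-form as suggested above.

```python
import re

# Hypothetical pattern: the region is not whitelisted, and the optional
# path segment before /diffusion/ covers programme pages (series, news).
VALID_URL = r'''(?x)
    https?://la1ere\.francetvinfo\.fr/
    (?P<region>[^/?#]+)/programme-video/
    (?:[^/?#]+/)?
    diffusion/
    (?P<id>\d+)-(?P<display_id>[^/?#]+)\.html
'''


def parse_url(url):
    """Return (region, id, display_id), or None if the URL does not match."""
    m = re.match(VALID_URL, url)
    return m.group('region', 'id', 'display_id') if m else None
```

Matching on "/diffusion/" alone means the mosaic pages (which have no such segment) are rejected without needing a list of regions.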

@ecodix ecodix added the site-support-request Add extractor(s) for a new domain label Feb 1, 2024
@aiur-adept
Contributor

I used a VPN to set my location as France so I could access.

What I saw is that the video is streamed by the player a chunk at a time using URLs like this:

(first chunk)

https://cloudingest.ftven.fr/ZXhwPTE3MjIyODYxNDN+YWNsPSUyZip+aG1hYz1mMjE3ZWE2NzRhNmRhZjhkZmUwMjA0MmFlNWEyMmM0ZDM5NmY3MmJmZjFhNTE2MWNhMWU2MWZkYjdkMmMxMmRk/462436a625346/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/dash/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA-video=2000000.dash

(second chunk)

https://cloudingest.ftven.fr/ZXhwPTE3MjIyODYxNDN+YWNsPSUyZip+aG1hYz1mMjE3ZWE2NzRhNmRhZjhkZmUwMjA0MmFlNWEyMmM0ZDM5NmY3MmJmZjFhNTE2MWNhMWU2MWZkYjdkMmMxMmRk/462436a625346/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/dash/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA-audio_fre=96000.dash

I don't know if youtube-dl can download a video without a link to a single resource. In particular, I'm not sure whether we can work out from the source page how to query these .dash files.

@dirkf
Contributor

dirkf commented Jul 29, 2024

yt-dl knows how to deal with fragment manifests, whether HLS (m3u8) or DASH (mpd). If, as with On:ORF, eg, a site may serve a single show as multiple parts, the site extractor has to return a multi_video playlist, which has to be concatenated manually for now.

For this site we have to reverse-engineer the site's behaviour because there are no media links in the non-JS page that yt-dl receives. In the browser tools network tab, look for a request by XHR for a DASH manifest, before the video starts to stream. If that URL can be reconstructed from information available in the page URL or the page itself, then it should be possible to create an extractor, provided that the video is not DRM-encrypted (does it play in a browser with DRM aka EME disabled?).

My initial impression from the page source is that the site geo-blocks outside FR and its remnant empire. The .dash URLs are giving me 403 but may just have expired.

@aiur-adept
Contributor

aiur-adept commented Jul 29, 2024

@dirkf ah, ok that's good to know. I'm definitely new to the project so I'm glad you're here with such info.

SO,

There is a request to a .mpd file, but the issue is it has some pretty cooked query params and it's not evident how to derive them from the page. I tried looking at the initiator of the XHR but it's all minified garbage JS.

For the page https://la1ere.francetvinfo.fr/martinique/programme-video/diffusion/4774522-origine-kongo.html, the .mpd request is:

https://cloudingest.ftven.fr/ZXhwPTE3MjIyOTI1MjR+YWNsPSUyZip+aG1hYz1jYzM5Mzg0YzRmNjU5YjNkMDYzM2NhN2Q4YzJjMDQwYjc2NTI4MTNmOWMyNzk5MzE4MDkzNmU4N2I3OWJkNGRm/462436a625346/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/manifest.mpd?hdnea=exp=1722271524~acl=%2f*~hmac=c294fe8aaae7faaa23735f371a367deb6a6365d6638b6fd8201c9bd83a2cb414

Let's break this up into parts:

  • https://cloudingest.ftven.fr - this is fine

  • ZXhwPTE3MjIyOTI1MjR+YWNsPSUyZip+aG1hYz1jYzM5Mzg0YzRmNjU5YjNkMDYzM2NhN2Q4YzJjMDQwYjc2NTI4MTNmOWMyNzk5MzE4MDkzNmU4N2I3OWJkNGRm - does not appear in the webpage or the script bundle - must be generated

  • 462436a625346 does not appear in the webpage or the script bundle - must be generated

  • 97790f3a-b23a-4004-bd9c-b89fde6f95bf appears in the webpage as the data-id attribute of the #mainContent element.

  • _france-domtom_TA.ism/manifest.mpd - this is fine

  • ?hdnea=exp=1722271524~acl=%2f*~hmac=c294fe8aaae7faaa23735f371a367deb6a6365d6638b6fd8201c9bd83a2cb414 - these query params include an HMAC which is a bit scary as we'd have to figure out how this is generated... and it seems to be happening inside the magnetoscope player code.

If you request the .mpd file without these query params you get a 400.
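
As a side observation, the long path segment looks like plain base64. The sketch below decodes it; reading the result as an expiry/ACL/HMAC token is an assumption based only on the decoded field names, and `decode_token` is a hypothetical helper name.

```python
import base64

# Path segment copied from the .mpd URL above.
TOKEN_SEGMENT = (
    'ZXhwPTE3MjIyOTI1MjR+YWNsPSUyZip+aG1hYz1jYzM5Mzg0YzRmNjU5YjNkMDYz'
    'M2NhN2Q4YzJjMDQwYjc2NTI4MTNmOWMyNzk5MzE4MDkzNmU4N2I3OWJkNGRm'
)


def decode_token(segment):
    """Base64-decode a path segment and split it into key/value fields."""
    # Pad defensively in case a segment length is not a multiple of 4.
    padded = segment + '=' * (-len(segment) % 4)
    text = base64.b64decode(padded).decode('ascii')
    # Fields are separated by '~', keys and values by the first '='.
    return dict(field.split('=', 1) for field in text.split('~'))


fields = decode_token(TOKEN_SEGMENT)
print(sorted(fields))
```

Being able to read the token does not help to forge one: the HMAC field would still need a secret that is not in the page.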


tl;dr: requesting the .mpd file is complicated and involves some crypto that seems unlikely to be cracked.

@aiur-adept
Contributor

Hmm, I just found something else.

(reference URL is https://la1ere.francetvinfo.fr/martinique/programme-video/diffusion/4774522-origine-kongo.html)

The entire .mpd URL including the HMAC query params can be retrieved by

https://hdfauth.ftven.fr/esi/TA?format=json&url=https%3A%2F%2Fcloudingest.ftven.fr%2F462436a625346%2F97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism%2Fmanifest.mpd

This returns:

{
    "url": "https://cloudingest.ftven.fr/ZXhwPTE3MjIyOTI1MjR+YWNsPSUyZip+aG1hYz1jYzM5Mzg0YzRmNjU5YjNkMDYzM2NhN2Q4YzJjMDQwYjc2NTI4MTNmOWMyNzk5MzE4MDkzNmU4N2I3OWJkNGRm/462436a625346/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/manifest.mpd?hdnea=exp=1722271524~acl=%2f*~hmac=c294fe8aaae7faaa23735f371a367deb6a6365d6638b6fd8201c9bd83a2cb414"
}

Exactly what we need.

So, if we can chef up this URL we can get the .mpd.

We need therefore to be able to find/generate:

  • 462436a625346 (not in webpage or bundle)
  • 97790f3a-b23a-4004-bd9c-b89fde6f95bf (in webpage)
  • 1681205990 (not in webpage or bundle)
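
The one piece that is in the webpage could be pulled out with a regular expression. This is a rough sketch: the sample HTML fragment is invented for illustration, and only the `data-id` attribute on `#mainContent` comes from the observation above.

```python
import re

# Hypothetical page fragment; the real markup may differ.
SAMPLE_HTML = (
    '<div id="mainContent" '
    'data-id="97790f3a-b23a-4004-bd9c-b89fde6f95bf"></div>'
)

# Lowercase hex UUID, as seen in the URLs above.
UUID_RE = r'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}'


def find_video_uuid(webpage):
    """Return the first UUID-shaped data-id value, or None if absent."""
    m = re.search(r'data-id=["\'](%s)["\']' % UUID_RE, webpage)
    return m.group(1) if m else None
```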

@aiur-adept
Contributor

OK, so we need

https://hdfauth.ftven.fr/esi/TA?format=json&url=https%3A%2F%2Fcloudingest.ftven.fr%2F462436a625346%2F97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism%2Fmanifest.mpd

To request the URL for the .mpd. Let's call this the key request.

We can generate this URL by parsing the response from

https://k7.ftven.fr/videos/97790f3a-b23a-4004-bd9c-b89fde6f95bf?country_code=FR&w=937&h=527&screen_w=1920&screen_h=1200&player_version=5.116.1&domain=la1ere.francetvinfo.fr&device_type=desktop&browser=chrome&browser_version=126&os=linux&diffusion_mode=tunnel&gmt=-0400&capabilities=drm

Which only requires 97790f3a-b23a-4004-bd9c-b89fde6f95bf, which we can get from the page.

Querying this URL will return a JSON object where .video.url contains:

"https://cloudingest.ftven.fr/462436a625346/97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/manifest.mpd"

From which we can construct the key request.
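
Constructing the key request from that manifest URL is just a matter of URL-encoding it into the `url` parameter. A minimal Python 3 sketch, where `build_key_request` is a hypothetical helper name:

```python
from urllib.parse import urlencode

HDFAUTH_URL = 'https://hdfauth.ftven.fr/esi/TA'


def build_key_request(manifest_url):
    """Build the hdfauth URL that returns the signed manifest URL."""
    # urlencode percent-encodes ':' and '/' in the nested URL.
    return HDFAUTH_URL + '?' + urlencode(
        {'format': 'json', 'url': manifest_url})


print(build_key_request(
    'https://cloudingest.ftven.fr/462436a625346/'
    '97790f3a-b23a-4004-bd9c-b89fde6f95bf-1681205990_france-domtom_TA.ism/'
    'manifest.mpd'))
```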

Now, I've tried querying without all the various params, and it seems to need them. Troubling is the player_version=5.116.1, as if this changes it might break.
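
Keeping the parameters in one place would make a later change of `player_version` a one-line fix. A sketch under that assumption (Python 3; `build_k7_url` and `K7_PARAMS` are hypothetical names, and the values are simply the ones observed in the request above):

```python
from urllib.parse import urlencode

# Query parameters exactly as observed in the browser request.
K7_PARAMS = [
    ('country_code', 'FR'), ('w', '937'), ('h', '527'),
    ('screen_w', '1920'), ('screen_h', '1200'),
    ('player_version', '5.116.1'),
    ('domain', 'la1ere.francetvinfo.fr'),
    ('device_type', 'desktop'), ('browser', 'chrome'),
    ('browser_version', '126'), ('os', 'linux'),
    ('diffusion_mode', 'tunnel'), ('gmt', '-0400'),
    ('capabilities', 'drm'),
]


def build_k7_url(video_uuid, player_version=None):
    """Build the k7 metadata URL, optionally overriding player_version."""
    params = [
        (key, player_version if key == 'player_version' and player_version
         else value)
        for key, value in K7_PARAMS
    ]
    return 'https://k7.ftven.fr/videos/%s?%s' % (video_uuid, urlencode(params))
```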

@aiur-adept
Contributor

aiur-adept commented Jul 29, 2024

@dirkf even if I did write an extractor for this, the tests would always fail without a France VPN. Does that alone kill this extractor? Or could we publish it without tests - but then, how would we know when it breaks...

@dirkf
Contributor

dirkf commented Jul 29, 2024

Normally someone in the relevant region, or who has a VPN so as to appear so, is both sufficiently interested and sufficiently skilled to work on the extractor. If any extractor tests can't be run outside the region, the author should post a transcript showing the tests working and then disable them.

OP suggested that this site is like FranceTV, so it may be possible to use the yt-dlp FranceTV extractor (I suspect that ours is broken) as a basis, or just to add a new extractor class in the FranceTV extractor.

The player_version thing may just be instrumentation, but otherwise might be found in one of the JS modules.

The JSON seems to have a good set of metadata. Maybe some specific headers are needed to get the DASH manifest without Akamai "blocage", or maybe just an acceptable IP address.

@aiur-adept
Contributor

@dirkf ah, i thought the tests were run on every extractor before every release - i guess they're just run when merging the pull request? I've got a free france VPN set up for this so I can try to work on it

@dirkf
Contributor

dirkf commented Jul 29, 2024

There's no point scheduling a test that will always fail so the test-case will typically include 'skip': 'only available in FR', or similar. A maintainer who is working on the extractor would comment those out and then restore all those that are found still to be applicable.

@aiur-adept aiur-adept linked a pull request Jul 29, 2024 that will close this issue
11 tasks
@aiur-adept
Contributor

I wrote an extractor for both URL formats in the original request. Tested on my end using a france VPN and both work to download the video.

@dirkf could you confirm that nothing is needed to download an mpd/m3u8 resource other than returning "formats" in the dict returned by _real_extract? This seemed suspiciously simple.

@dirkf
Contributor

dirkf commented Jul 30, 2024

In yt-dl, use the appropriate _extract_xxx_formats[_and_subtitles]() IE method to turn a xxx manifest into formats; call _sort_formats() before returning the info-dict. It may be necessary to populate the http_headers dict for each format, eg with a Referer header: you have to check what the browser does if an extracted format URL fails.

Check what happens in other extractors, and especially in yt-dlp's FranceTV extractor.
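
Putting the steps from this thread together, the flow might be sketched as below. This is a hypothetical extractor written for illustration, not the code from the linked pull request: the class name, the `data-id` regex, the title fallback and the assumption that the k7 request needs exactly these parameters are all unverified.

```python
import re

try:
    from youtube_dl.extractor.common import InfoExtractor
except ImportError:
    # Fallback so the sketch can be loaded without yt-dl installed.
    InfoExtractor = object

# Parameters observed in the browser request earlier in the thread.
K7_QUERY = {
    'country_code': 'FR', 'w': '937', 'h': '527',
    'screen_w': '1920', 'screen_h': '1200', 'player_version': '5.116.1',
    'domain': 'la1ere.francetvinfo.fr', 'device_type': 'desktop',
    'browser': 'chrome', 'browser_version': '126', 'os': 'linux',
    'diffusion_mode': 'tunnel', 'gmt': '-0400', 'capabilities': 'drm',
}


class La1ereSketchIE(InfoExtractor):
    _VALID_URL = (r'https?://la1ere\.francetvinfo\.fr/[^/?#]+/programme-video/'
                  r'(?:[^/?#]+/)?diffusion/(?P<id>\d+)-[^/?#]+\.html')

    def _real_extract(self, url):
        display_id = self._match_id(url)
        webpage = self._download_webpage(url, display_id)
        # Assumed location of the video UUID (data-id on #mainContent).
        video_id = self._search_regex(
            r'data-id=["\']([0-9a-f-]{36})', webpage, 'video id')
        meta = self._download_json(
            'https://k7.ftven.fr/videos/' + video_id, video_id,
            query=K7_QUERY)
        # The "key request": exchange the bare manifest URL for a signed one.
        signed = self._download_json(
            'https://hdfauth.ftven.fr/esi/TA', video_id,
            query={'format': 'json', 'url': meta['video']['url']})
        formats = self._extract_mpd_formats(
            signed['url'], video_id, mpd_id='dash', fatal=False)
        self._sort_formats(formats)
        return {
            'id': video_id,
            'display_id': display_id,
            # Placeholder title; real metadata would come from the k7 JSON.
            'title': self._og_search_title(webpage, default=None) or display_id,
            'formats': formats,
        }
```

Nothing here sets `http_headers`; that would only be added if the extracted format URLs fail where the browser succeeds.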

@aiur-adept
Contributor

@dirkf could you expand on the http_headers part of that? i don't see any example of that in the francetv extractor, it just extracts and sorts the formats. Thanks in advance, I know i've been blowing u up lately

@dirkf
Contributor

dirkf commented Jul 30, 2024

It may not be relevant here. If requests for the URLs in the formats list fail, but the same URLs succeed when playing in the browser, you have to diagnose what the browser is doing, using the devtools Network tab.

Supposing requests need to be passed with a Referer to the original page URL url:

for f in formats:
    f.setdefault('http_headers', {})['Referer'] = url

A content search of *.py for 'http_headers' in the extractor directory can be illuminating, but if plagiarising (recommended) be aware that more recent extractors may have better (safer, using newer helper functions) style.


3 participants