
Target Bitrate As Reported by Chrome is slow to adjust when ConfigureTWCCSender is Being Used #2830

Closed
aankur opened this issue Jul 24, 2024 · 33 comments · Fixed by #2833

@aankur

aankur commented Jul 24, 2024

Your environment.

  • Go: 1.22.2
  • Version: Pion Webrtc v3.2.49/v4.0.0-beta.24
  • Browser: Chrome 126.0.6478.126
  • Other Information - Ubuntu 22.04

What did you do?

Used the broadcast example and created a new RegisterDefaultInterceptors without webrtc.ConfigureTWCCSender:

func RegisterDefaultInterceptors(mediaEngine *webrtc.MediaEngine, interceptorRegistry *interceptor.Registry) error {
	if err := webrtc.ConfigureNack(mediaEngine, interceptorRegistry); err != nil {
		return err
	}
	if err := webrtc.ConfigureRTCPReports(interceptorRegistry); err != nil {
		return err
	}
	if err := webrtc.ConfigureSimulcastExtensionHeaders(mediaEngine); err != nil {
		return err
	}
	return nil
}

This replaced the default registration:

// Use the default set of Interceptors
	if err := webrtc.RegisterDefaultInterceptors(m, i); err != nil {
		panic(err)
	}
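
For reference, a minimal sketch of how this custom registration plugs into the broadcast example (m and i mirror the example's MediaEngine and interceptor.Registry; the surrounding boilerplate is assumed, not taken from the report):

	m := &webrtc.MediaEngine{}
	if err := m.RegisterDefaultCodecs(); err != nil {
		panic(err)
	}

	i := &interceptor.Registry{}
	// The custom registration above: NACK, RTCP reports and simulcast headers,
	// deliberately omitting webrtc.ConfigureTWCCSender.
	if err := RegisterDefaultInterceptors(m, i); err != nil {
		panic(err)
	}

	api := webrtc.NewAPI(webrtc.WithMediaEngine(m), webrtc.WithInterceptorRegistry(i))
	peerConnection, err := api.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		panic(err)
	}
	_ = peerConnection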

What did you expect?

Bandwidth ramp-up should be fast with webrtc.ConfigureTWCCSender.

What happened?

This looks like it was introduced in v3.2.25.
When using webrtc.ConfigureTWCCSender, the ramp-up was slow, followed a seesaw pattern, and got stuck at 1 Mbps.
When not using it, the ramp-up was still slow but went up to 2.5 Mbps. Please see the Target Bitrate as reported on the graphs below.

With ConfigureTWCCSender: slow, seesaw pattern, and stuck at 1 Mbps (screenshot)

Without ConfigureTWCCSender: slow but goes up to 2.5 Mbps (screenshot: 2024-07-24 10-30-41)

With ConfigureTWCCSender in v3.2.24 (screenshot: 2024-07-24 15-31-34)

@kcaffrey

Given that the only commits between v3.2.24 and v3.2.25 are RTX related (v3.2.24...v3.2.25), I'm guessing it has something to do with RTX. This would make sense, as when RTX is negotiated, libwebrtc will attempt to probe with RTX packets instead of padding packets.

I think what might be going on is that if RTX gets negotiated by default (I'm not sure whether this is the case now or not), and somehow RTX packets are not consumed, then RTX packets will stop being fed through the interceptor and libwebrtc won't get TWCC feedback for RTX probes. In 5da7278 repairStreamChannel is an unbuffered channel, meaning that if it is not read from in readRTX then the goroutine in receiveForRTX will be blocked. I'm not that familiar with the RTX code though, so perhaps @adriancable or @cnderrauber could chime in and correct anything I got wrong here.

If the general shape of my theory is correct, that would definitely explain the behavior you are seeing with libwebrtc's bandwidth estimation.
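
For readers less familiar with Go channel semantics, the blocking behavior hypothesized above is easy to demonstrate: a send on an unbuffered channel parks the sending goroutine until a receiver is ready. A standalone sketch (illustrative only, not Pion code):

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		repairStreamChannel := make(chan []byte) // unbuffered, as in 5da7278

		go func() {
			fmt.Println("producer: handing off RTX packet")
			repairStreamChannel <- []byte{0x80} // blocks here until somebody reads
			fmt.Println("producer: unblocked")
		}()

		time.Sleep(time.Second) // nobody is reading, so the producer stays parked
		fmt.Println("consumer: reading now")
		<-repairStreamChannel
		time.Sleep(10 * time.Millisecond) // let the producer print before exit
	}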

@adriancable
Contributor

@kcaffrey - what you write sounds correct, although I'm not sure the issue is RTX related. Note I have also seen similar things to @aankur (slow BWE on the Chrome side) but I saw this a while back before I added the RTX code. From memory, I have seen it sometimes be slow and sometimes not with no obvious cause.

@aankur - just to exclude this, can you try it without RTX? That is, instead of registering the default codecs, add just H264 (for example) without adding RTX, as sketched below, then see if the issue still occurs. You may need to run the experiment around 10 times to rule out coincidence.
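
A minimal sketch of that registration (the payload type and fmtp line are illustrative assumptions, not values from this issue):

	m := &webrtc.MediaEngine{}
	// Register only H264; no corresponding video/rtx entry is added,
	// so RTX will not be negotiated.
	if err := m.RegisterCodec(webrtc.RTPCodecParameters{
		RTPCodecCapability: webrtc.RTPCodecCapability{
			MimeType:    webrtc.MimeTypeH264,
			ClockRate:   90000,
			SDPFmtpLine: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42001f",
		},
		PayloadType: 102,
	}, webrtc.RTPCodecTypeVideo); err != nil {
		panic(err)
	}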

@kcaffrey

One other cause I've seen (which you may have run into, @adriancable) is that if you are using simulcast, the first packet is dropped by Pion (see #2777). If libwebrtc sees that the first packet it sends is lost, it enters a mode where the bandwidth estimate tends not to grow fast (for reasons I'm not 100% sure about). In our own internal testing, we saw that much of the time libwebrtc happened to send an audio packet first (meaning the issue was not hit), but if it happened to send a simulcast video packet first, the problem would occur.

So to further rule out possible causes, you can try making sure simulcast is not enabled (if it was on the client).

@adriancable
Contributor

@kcaffrey - also to add, it isn't really possible 'not to consume' RTX packets. In the implementation that landed, track.Read() returns both RTX and non-RTX packets (since most consumers don't need to know the difference), with the attributes being set to indicate if it's RTX (if the consumer does need to know).
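
A consumer-side sketch of that contract. Note the attribute key below is a placeholder assumption; the exact exported constant depends on the Pion version, so check the rtpreceiver code for the real name:

	buf := make([]byte, 1500)
	for {
		n, attrs, readErr := track.Read(buf)
		if readErr != nil {
			break
		}
		// Hypothetical key: the implementation sets an attribute on packets
		// that arrived via the repair (RTX) stream.
		if attrs.Get("rtx_payload_type") != nil {
			// this packet was recovered from the RTX stream
		}
		_ = buf[:n] // payload is already de-RTX'd, same shape as a media packet
	}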

@adriancable
Contributor

@aankur - are you able to get a WebRTC packet trace and post it here for us to look at? If you go to chrome://webrtc-internals there's now an option to 'create diagnostic packet recordings'. After turning this on you'll need to restart the stream (so we can see everything from the start) and then locate the right log file and attach it here.

@adriancable
Contributor

@kcaffrey / @aankur - there's one other thing to bear in mind if we are looking at differences with RTX and without. A recent change in libwebrtc (which may or may not have landed in Chrome M126) starts sending BWE probes before the tracks start, on SSRC 0, if RTX is enabled. Previously Pion didn't handle these, so it looked to the sender like these packets were lost, which would make the initial BWE hit the floor until it caught up via subsequent probes. This would have a similar effect to what @aankur reported, but it was fixed in #2816. (I did test this change, and it seems to do the right thing, but it's possible there's some edge case it's not handling right.) I'm pointing this out because if @aankur finds that the issue happens only with RTX enabled, that does not necessarily mean the issue is with Pion's RTX handling, even if the issue is on the Pion side.

I'll be able to see whether this is what's happening from the Chrome WebRTC packet trace.

@aankur
Author

aankur commented Jul 24, 2024

Hi,
I am not using simulcast, and I have run the tests as requested. I also debugged through the code to see whether SSRC 0 was being handled; the version of Chrome I am using does not send those probes yet.

  1. v4.0.0-beta.24 without RTX
    Bandwidth ramps up faster than with RTX, but slower than v3.2.22, which has the same steps but a narrower slope:
    it first jumps quickly to 1.5 Mbps, takes some time, jumps to 2 Mbps, takes some time, jumps to 2.5 Mbps; it does not seesaw.

  2. Without RTX SDP answer (screenshot)

Without RTX bandwidth ramp-up (screenshot)

Without RTX packet dump:
event_log_no_rtx_.log20240725_0007_68_4.log

  3. With RTX and TWCC (screenshot)

With RTX and TWCC packet dump:
event_log_with_rtx_twcc_.log20240725_0011_68_5.log

@adriancable
Contributor

@aankur - in your original post, for Pion v3, you mentioned v3.2.49, then said it happens in v3.2.25 but not v3.2.24, and above you mention v3.2.22. That's a lot of versions! Can you just tell us two: the last version that works and the first version that doesn't?

It does look like the 'with RTX' case is the one that's problematic. In that case, the difference between v3.2.24 and v3.2.25 is explained by the fact that v3.2.24 does not support RTX. It isn't clear to me that this is a Pion issue rather than a difference in how Chrome does BWE in the RTX vs. non-RTX cases (they are quite different: in the non-RTX case you get padding probes on the main track; in the RTX case you don't get padding probes, but instead periodic retransmissions on the RTX track, which also serve as BWE probes). It is quite possible that the different number (and nature) of the probes you get from Chrome in the two cases makes Chrome's BWE converge more quickly or slowly.

Can you clarify the topology here? You are sending video from the Chrome side and receiving from the Pion side, correct?

In this case, I would look at what happens between Chrome and Chrome in the RTX and non-RTX cases. So take Pion out of the equation and just see if it's a 'natural' RTX vs non-RTX effect.

@aankur
Author

aankur commented Jul 24, 2024

Hi,
Sorry for the confusion. Do you want me to run tests on v3.2.25? I have run all tests on v3.2.49.
I then ran tests on all released v3.2 versions prior to that and narrowed the issue down to the release that first displayed it: v3.2.25. I extensively verified that v3.2.24 does not have this issue. I was most recently testing MediaMTX, which uses v3.2.22, and its Target Bitrate was excellent; v3.2.22 and v3.2.24 behave similarly. All the reports you see here were generated with v3.2.49, including in the parent issue.

I am sending data from Chrome to Pion (video only, no simulcast).

(screenshot: 2024-07-25 00-12-48)
This particular behaviour exists on all Pion versions from v3.2.25 onward.
All these tests were done with https://github.com/pion/webrtc/blob/master/examples/broadcast/main.go as the publisher.

@adriancable
Contributor

@aankur - please try sending video from Chrome to Chrome in the RTX vs non-RTX case, and then look at the targetBitrate graph for the sender Chrome in both cases. My guess is that what you are seeing is the difference in how Chrome (actually libwebrtc) does bandwidth estimation in the RTX vs non-RTX case, and that it isn't a Pion issue.

v3.2.24 and v3.2.25 are different only because the former doesn't support RTX. There are no other changes (as I understand it), so this is consistent with this being a Chrome-side difference in how BWE is done in the RTX vs non-RTX case.

@aankur
Author

aankur commented Jul 24, 2024

Sample HTML demonstrating Chrome -> Chrome (video only):

  1. With RTX
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div id="container">
    <video id="localVideo" playsinline autoplay muted></video>
    <video id="remoteVideo" playsinline autoplay></video>

    <div class="box">
        <button id="startButton">Start</button>
        <button id="callButton">Call</button>
        <button id="hangupButton">Hang Up</button>
    </div>
</div>
<script>
    'use strict';

const startButton = document.getElementById('startButton');
const callButton = document.getElementById('callButton');
const hangupButton = document.getElementById('hangupButton');
callButton.disabled = true;
hangupButton.disabled = true;
startButton.addEventListener('click', start);
callButton.addEventListener('click', call);
hangupButton.addEventListener('click', hangup);

let startTime;
const localVideo = document.getElementById('localVideo');
const remoteVideo = document.getElementById('remoteVideo');

localVideo.addEventListener('loadedmetadata', function() {
  console.log(`Local video videoWidth: ${this.videoWidth}px,  videoHeight: ${this.videoHeight}px`);
});

remoteVideo.addEventListener('loadedmetadata', function() {
  console.log(`Remote video videoWidth: ${this.videoWidth}px,  videoHeight: ${this.videoHeight}px`);
});

remoteVideo.addEventListener('resize', () => {
  console.log(`Remote video size changed to ${remoteVideo.videoWidth}x${remoteVideo.videoHeight} - Time since pageload ${performance.now().toFixed(0)}ms`);
  // We'll use the first onsize callback as an indication that video has started
  // playing out.
  if (startTime) {
    const elapsedTime = window.performance.now() - startTime;
    console.log('Setup time: ' + elapsedTime.toFixed(3) + 'ms');
    startTime = null;
  }
});

let localStream;
let pc1;
let pc2;

function getName(pc) {
  return (pc === pc1) ? 'pc1' : 'pc2';
}

function getOtherPc(pc) {
  return (pc === pc1) ? pc2 : pc1;
}

async function start() {
  console.log('Requesting local stream');
  startButton.disabled = true;
  try {
    const stream = await navigator.mediaDevices.getUserMedia({video: {width: 1280, height: 720}});
    console.log('Received local stream');
    localVideo.srcObject = stream;
    localStream = stream;
    callButton.disabled = false;
  } catch (e) {
    alert(`getUserMedia() error: ${e.name}`);
  }
}

async function call() {
  callButton.disabled = true;
  hangupButton.disabled = false;
  console.log('Starting call');
  startTime = window.performance.now();
  const videoTracks = localStream.getVideoTracks();
  if (videoTracks.length > 0) {
    console.log(`Using video device: ${videoTracks[0].label}`);
  }
  const configuration = {};
  console.log('RTCPeerConnection configuration:', configuration);
  pc1 = new RTCPeerConnection(configuration);
  console.log('Created local peer connection object pc1');
  pc1.addEventListener('icecandidate', e => onIceCandidate(pc1, e));
  pc2 = new RTCPeerConnection(configuration);
  console.log('Created remote peer connection object pc2');
  pc2.addEventListener('icecandidate', e => onIceCandidate(pc2, e));
  pc1.addEventListener('iceconnectionstatechange', e => onIceStateChange(pc1, e));
  pc2.addEventListener('iceconnectionstatechange', e => onIceStateChange(pc2, e));
  pc2.addEventListener('track', gotRemoteStream);


  localStream.getTracks().forEach(track => {
    pc1.addTransceiver(track, {direction: 'sendonly'})
  });
  console.log('Added local stream to pc1');

  try {
    console.log('pc1 createOffer start');
    const offer = await pc1.createOffer();
    await onCreateOfferSuccess(offer);
  } catch (e) {
    onCreateSessionDescriptionError(e);
  }
}

function onCreateSessionDescriptionError(error) {
  console.log(`Failed to create session description: ${error.toString()}`);
}

async function onCreateOfferSuccess(desc) {
  console.log(`Offer from pc1\n${desc.sdp}`);
  console.log('pc1 setLocalDescription start');
  try {
    await pc1.setLocalDescription(desc);
    onSetLocalSuccess(pc1);
  } catch (e) {
    onSetSessionDescriptionError();
  }

  console.log('pc2 setRemoteDescription start');
  try {
    await pc2.setRemoteDescription(desc);
    onSetRemoteSuccess(pc2);
  } catch (e) {
    onSetSessionDescriptionError();
  }

  console.log('pc2 createAnswer start');
  // Since the 'remote' side has no media stream we need
  // to pass in the right constraints in order for it to
  // accept the incoming offer of audio and video.
  try {
    const answer = await pc2.createAnswer();
    await onCreateAnswerSuccess(answer);
  } catch (e) {
    onCreateSessionDescriptionError(e);
  }
}

function onSetLocalSuccess(pc) {
  console.log(`${getName(pc)} setLocalDescription complete`);
}

function onSetRemoteSuccess(pc) {
  console.log(`${getName(pc)} setRemoteDescription complete`);
}

function onSetSessionDescriptionError(error) {
  console.log(`Failed to set session description: ${error.toString()}`);
}

function gotRemoteStream(e) {
   remoteVideo.srcObject = new MediaStream([e.track]);
   console.log('pc2 received remote stream');
}

async function onCreateAnswerSuccess(desc) {
  console.log(`Answer from pc2:\n${desc.sdp}`);
  console.log('pc2 setLocalDescription start');
  try {
    await pc2.setLocalDescription(desc);
    onSetLocalSuccess(pc2);
  } catch (e) {
    onSetSessionDescriptionError(e);
  }
  console.log('pc1 setRemoteDescription start');
  try {
    await pc1.setRemoteDescription(desc);
    onSetRemoteSuccess(pc1);
  } catch (e) {
    onSetSessionDescriptionError(e);
  }
}

async function onIceCandidate(pc, event) {
  try {
    await (getOtherPc(pc).addIceCandidate(event.candidate));
    onAddIceCandidateSuccess(pc);
  } catch (e) {
    onAddIceCandidateError(pc, e);
  }
  console.log(`${getName(pc)} ICE candidate:\n${event.candidate ? event.candidate.candidate : '(null)'}`);
}

function onAddIceCandidateSuccess(pc) {
  console.log(`${getName(pc)} addIceCandidate success`);
}

function onAddIceCandidateError(pc, error) {
  console.log(`${getName(pc)} failed to add ICE Candidate: ${error.toString()}`);
}

function onIceStateChange(pc, event) {
  if (pc) {
    console.log(`${getName(pc)} ICE state: ${pc.iceConnectionState}`);
    console.log('ICE state change event: ', event);
  }
}

function hangup() {
  console.log('Ending call');
  pc1.close();
  pc2.close();
  pc1 = null;
  pc2 = null;
  hangupButton.disabled = true;
  callButton.disabled = false;
}

</script>

</body>
</html>
  2. Without RTX
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div id="container">
    <video id="localVideo" playsinline autoplay muted></video>
    <video id="remoteVideo" playsinline autoplay></video>

    <div class="box">
        <button id="startButton">Start</button>
        <button id="callButton">Call</button>
        <button id="hangupButton">Hang Up</button>
    </div>
</div>
<script>
    'use strict';

const startButton = document.getElementById('startButton');
const callButton = document.getElementById('callButton');
const hangupButton = document.getElementById('hangupButton');
callButton.disabled = true;
hangupButton.disabled = true;
startButton.addEventListener('click', start);
callButton.addEventListener('click', call);
hangupButton.addEventListener('click', hangup);

let startTime;
const localVideo = document.getElementById('localVideo');
const remoteVideo = document.getElementById('remoteVideo');

localVideo.addEventListener('loadedmetadata', function() {
  console.log(`Local video videoWidth: ${this.videoWidth}px,  videoHeight: ${this.videoHeight}px`);
});

remoteVideo.addEventListener('loadedmetadata', function() {
  console.log(`Remote video videoWidth: ${this.videoWidth}px,  videoHeight: ${this.videoHeight}px`);
});

remoteVideo.addEventListener('resize', () => {
  console.log(`Remote video size changed to ${remoteVideo.videoWidth}x${remoteVideo.videoHeight} - Time since pageload ${performance.now().toFixed(0)}ms`);
  // We'll use the first onsize callback as an indication that video has started
  // playing out.
  if (startTime) {
    const elapsedTime = window.performance.now() - startTime;
    console.log('Setup time: ' + elapsedTime.toFixed(3) + 'ms');
    startTime = null;
  }
});

let localStream;
let pc1;
let pc2;

function getName(pc) {
  return (pc === pc1) ? 'pc1' : 'pc2';
}

function getOtherPc(pc) {
  return (pc === pc1) ? pc2 : pc1;
}

async function start() {
  console.log('Requesting local stream');
  startButton.disabled = true;
  try {
    const stream = await navigator.mediaDevices.getUserMedia({video: {width: 1280, height: 720}});
    console.log('Received local stream');
    localVideo.srcObject = stream;
    localStream = stream;
    callButton.disabled = false;
  } catch (e) {
    alert(`getUserMedia() error: ${e.name}`);
  }
}

async function call() {
  callButton.disabled = true;
  hangupButton.disabled = false;
  console.log('Starting call');
  startTime = window.performance.now();
  const videoTracks = localStream.getVideoTracks();
  if (videoTracks.length > 0) {
    console.log(`Using video device: ${videoTracks[0].label}`);
  }
  const configuration = {};
  console.log('RTCPeerConnection configuration:', configuration);
  pc1 = new RTCPeerConnection(configuration);
  console.log('Created local peer connection object pc1');
  pc1.addEventListener('icecandidate', e => onIceCandidate(pc1, e));
  pc2 = new RTCPeerConnection(configuration);
  console.log('Created remote peer connection object pc2');
  pc2.addEventListener('icecandidate', e => onIceCandidate(pc2, e));
  pc1.addEventListener('iceconnectionstatechange', e => onIceStateChange(pc1, e));
  pc2.addEventListener('iceconnectionstatechange', e => onIceStateChange(pc2, e));
  pc2.addEventListener('track', gotRemoteStream);


  localStream.getTracks().forEach(track => {
    const t = pc1.addTransceiver(track, { direction: 'sendonly' })
    t.setCodecPreferences([{
      clockRate: 90000,
      mimeType: "video/VP8"
    }]);
  });
  console.log('Added local stream to pc1');

  try {
    console.log('pc1 createOffer start');
    const offer = await pc1.createOffer();
    await onCreateOfferSuccess(offer);
  } catch (e) {
    onCreateSessionDescriptionError(e);
  }
}

function onCreateSessionDescriptionError(error) {
  console.log(`Failed to create session description: ${error.toString()}`);
}

async function onCreateOfferSuccess(desc) {
  console.log(`Offer from pc1\n${desc.sdp}`);
  console.log('pc1 setLocalDescription start');
  try {
    await pc1.setLocalDescription(desc);
    onSetLocalSuccess(pc1);
  } catch (e) {
    onSetSessionDescriptionError();
  }

  console.log('pc2 setRemoteDescription start');
  try {
    await pc2.setRemoteDescription(desc);
    onSetRemoteSuccess(pc2);
  } catch (e) {
    onSetSessionDescriptionError();
  }

  console.log('pc2 createAnswer start');
  // Since the 'remote' side has no media stream we need
  // to pass in the right constraints in order for it to
  // accept the incoming offer of audio and video.
  try {
    const answer = await pc2.createAnswer();
    await onCreateAnswerSuccess(answer);
  } catch (e) {
    onCreateSessionDescriptionError(e);
  }
}

function onSetLocalSuccess(pc) {
  console.log(`${getName(pc)} setLocalDescription complete`);
}

function onSetRemoteSuccess(pc) {
  console.log(`${getName(pc)} setRemoteDescription complete`);
}

function onSetSessionDescriptionError(error) {
  console.log(`Failed to set session description: ${error.toString()}`);
}

function gotRemoteStream(e) {
  remoteVideo.srcObject = new MediaStream([e.track]);
  console.log('pc2 received remote stream');
}

async function onCreateAnswerSuccess(desc) {
  console.log(`Answer from pc2:\n${desc.sdp}`);
  console.log('pc2 setLocalDescription start');
  try {
    await pc2.setLocalDescription(desc);
    onSetLocalSuccess(pc2);
  } catch (e) {
    onSetSessionDescriptionError(e);
  }
  console.log('pc1 setRemoteDescription start');
  try {
    await pc1.setRemoteDescription(desc);
    onSetRemoteSuccess(pc1);
  } catch (e) {
    onSetSessionDescriptionError(e);
  }
}

async function onIceCandidate(pc, event) {
  try {
    await (getOtherPc(pc).addIceCandidate(event.candidate));
    onAddIceCandidateSuccess(pc);
  } catch (e) {
    onAddIceCandidateError(pc, e);
  }
  console.log(`${getName(pc)} ICE candidate:\n${event.candidate ? event.candidate.candidate : '(null)'}`);
}

function onAddIceCandidateSuccess(pc) {
  console.log(`${getName(pc)} addIceCandidate success`);
}

function onAddIceCandidateError(pc, error) {
  console.log(`${getName(pc)} failed to add ICE Candidate: ${error.toString()}`);
}

function onIceStateChange(pc, event) {
  if (pc) {
    console.log(`${getName(pc)} ICE state: ${pc.iceConnectionState}`);
    console.log('ICE state change event: ', event);
  }
}

function hangup() {
  console.log('Ending call');
  pc1.close();
  pc2.close();
  pc1 = null;
  pc2 = null;
  hangupButton.disabled = true;
  callButton.disabled = false;
}

</script>

</body>
</html>

Both of these have bandwidth ramp-up similar to what v3.2.24 has.
(screenshot: Chrome -> Chrome, RTX and no RTX)

@adriancable
Contributor

adriancable commented Jul 24, 2024

OK, so it looks like: when RTX is enabled, you get a slow ramp on the sender side if the receiver is Pion, but not if the receiver is Chrome. When RTX is disabled, everything works similarly whether the receiver is Pion or Chrome.

(The Pion version is kind of a red herring: 3.2.24 does not support RTX, so whether the sender offers RTX or not doesn't matter; you are always operating with RTX disabled, which is why you always get the fast ramp with 3.2.24.)

Sorry if this duplicates what you sent previously a little, but can you send 4 Chrome WebRTC packet logs (all from the sender side, all capturing from the very beginning of the stream):

  1. Chrome sender (RTX disabled), Chrome receiver
  2. Chrome sender (RTX disabled), Pion receiver
  3. Chrome sender (RTX enabled), Chrome receiver
  4. Chrome sender (RTX enabled), Pion receiver

... along with the targetBitrate graph screenshots for each of the 4 cases. I'll then see if I can spot any difference between #1 vs. #2, and #3 vs. #4. Thank you so much!

@adriancable
Contributor

@aankur - when you save the Chrome WebRTC logs you will need to save them in V1 format so my tools can digest them properly. To do that you will need to run Chrome with this command line parameter:

--force-fieldtrials=WebRTC-RtcEventLogNewFormat/Disabled/

@Sean-Der
Member

Nice sleuthing! I bet @cnderrauber and @boks1971 will know the answer quickly.

Tell me if there is anything I can do to help. I hope this isn't because we are handling NACKs across different goroutines (timing issues).

If it is, this is still all fixable. We should update pion/srtp to return both SSRCs as one stream. Ready to jump in and help wherever I can.

@cnderrauber
Member

cnderrauber commented Jul 25, 2024

I can reproduce this with v3 master: the targetBitrate/availableOutgoingBitrate takes 10 seconds to reach 1.5 Mbps at 480p with the default codecs (which have RTX enabled). If I comment out the RTX codec in the codec registration, it reaches the target almost immediately.

@cnderrauber
Member

It shows no difference in the LiveKit SFU (RTX vs. no RTX). I wonder if it relates to how the TWCC interceptor processes the packets, since the LiveKit SFU doesn't use the interceptor (it uses Pion's TWCC recorder directly).

@adriancable
Contributor

adriancable commented Jul 25, 2024

@cnderrauber - it must be due to some difference between Pion and Chrome in how TWCC packets for the RTX case are generated. I had a quick look previously at the TWCC packets generated for RTX and, eyeballing them, they looked OK, but I haven't compared Chrome with Pion.

Are you able to get two Chrome WebRTC logs from the video sender side (with RTX enabled), one where the receiver is Pion and one where the receiver is Chrome, running Chrome with --force-fieldtrials=WebRTC-RtcEventLogNewFormat/Disabled/, and post them here? I am OOO and can't easily generate the stream logs, but I can take a look at them if you can generate them.

Also, if you can, please post the targetBitrate graph screenshot from Chrome in both cases. Thanks!

@aankur
Author

aankur commented Jul 25, 2024

Hi,
I have collected all the logs.

Single video, no simulcast -> Chrome to Chrome / Chrome to Pion.
All Pion tests were done on v3.2.49.
All Chrome-only tests have two peer connections: one send-only, the other recv-only.
  1. Chrome to Chrome RTX Enabled
    Screenshot from 2024-07-25 08-48-57
    Screenshot from 2024-07-25 08-49-03
    event_log_chrome_chrome_rtx_enabled_20240725_0848_35_2.log
    event_log_chrome_chrome_rtx_enabled_20240725_0848_35_1.log

  2. Chrome to Chrome RTX Disabled
    Screenshot from 2024-07-25 08-50-20
    Screenshot from 2024-07-25 08-50-59
    event_log_chrome_chrome_rtx_disabled_20240725_0850_40_2.log
    event_log_chrome_chrome_rtx_disabled_20240725_0850_40_1.log

  3. Chrome to Pion RTX Enabled
    Screenshot from 2024-07-25 08-55-38
    Screenshot from 2024-07-25 08-56-38
    event_log_chrome_pion_rtx_enabled_20240725_0854_50_1.log

  4. Chrome to Pion RTX Disabled
    Screenshot from 2024-07-25 08-58-42
    Screenshot from 2024-07-25 08-59-09
    event_log_chrome_pion_rtx_disabled_20240725_0857_50_2.log

@adriancable
Contributor

adriancable commented Jul 25, 2024

@cnderrauber - I have noticed something odd. Here is a log of some TWCC packets being sent from a Pion receiver. SSRC 2859612492 is the main video track, SSRC 631513261 is the RTX track. So in this log we are getting 2 main track TWCCs, followed by 3 RTX TWCCs, followed by 2 more main track TWCCs.

Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 13}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 2859612492
Jul 24 20:40:07 	Base Sequence Number 4713
Jul 24 20:40:07 	Status Count 37
Jul 24 20:40:07 	Reference Time 606
Jul 24 20:40:07 	Feedback Packet Count 34
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 SymbolSize:0 SymbolList:[1 1 0 0 0 1 1 1 1 0 0 0 1 0]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:0 SymbolList:[1 1 0 0 1 1 1 1 1 1 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:9}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:34500} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:500} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 5}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 2859612492
Jul 24 20:40:07 	Base Sequence Number 4750
Jul 24 20:40:07 	Status Count 1
Jul 24 20:40:07 	Reference Time 607
Jul 24 20:40:07 	Feedback Packet Count 35
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:1}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:38500}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 13}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 631513261
Jul 24 20:40:07 	Base Sequence Number 4751
Jul 24 20:40:07 	Status Count 29
Jul 24 20:40:07 	Reference Time 608
Jul 24 20:40:07 	Feedback Packet Count 36
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:14} &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[0 0 2 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:8}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:41000} &{Type:1 Delta:0} &{Type:1 Delta:11750} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:2750} &{Type:1 Delta:750} &{Type:1 Delta:1250} &{Type:1 Delta:1250} &{Type:1 Delta:250} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:2 Delta:-2250} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:750} &{Type:1 Delta:250} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 13}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 631513261
Jul 24 20:40:07 	Base Sequence Number 4765
Jul 24 20:40:07 	Status Count 32
Jul 24 20:40:07 	Reference Time 609
Jul 24 20:40:07 	Feedback Packet Count 37
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[1 1 2 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:0 SymbolList:[1 1 1 1 1 1 1 1 1 1 0 0 0 0]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[0 0 0 2 1 1 1]} &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:4}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:52750} &{Type:1 Delta:250} &{Type:2 Delta:-59500} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:750} &{Type:1 Delta:250} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:57750} &{Type:1 Delta:6250} &{Type:2 Delta:-7000} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:6000}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 13}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 631513261
Jul 24 20:40:07 	Base Sequence Number 4782
Jul 24 20:40:07 	Status Count 25
Jul 24 20:40:07 	Reference Time 611
Jul 24 20:40:07 	Feedback Packet Count 38
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:7} &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[2 1 1 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[1 2 1 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:4}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:6500} &{Type:1 Delta:2250} &{Type:1 Delta:4500} &{Type:1 Delta:250} &{Type:1 Delta:1500} &{Type:1 Delta:5750} &{Type:1 Delta:5750} &{Type:2 Delta:-102000} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:6000} &{Type:2 Delta:75000} &{Type:1 Delta:2000} &{Type:1 Delta:5000} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:1250} &{Type:1 Delta:5750} &{Type:1 Delta:5750} &{Type:1 Delta:250}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 6}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 2859612492
Jul 24 20:40:07 	Base Sequence Number 4807
Jul 24 20:40:07 	Status Count 4
Jul 24 20:40:07 	Reference Time 612
Jul 24 20:40:07 	Feedback Packet Count 39
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:4}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:50500} &{Type:1 Delta:2250} &{Type:1 Delta:3000} &{Type:1 Delta:0}
Jul 24 20:40:07 ]
Jul 24 20:40:07 [TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 8}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 2859612492
Jul 24 20:40:07 	Base Sequence Number 4811
Jul 24 20:40:07 	Status Count 12
Jul 24 20:40:07 	Reference Time 613
Jul 24 20:40:07 	Feedback Packet Count 40
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:12}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:14750} &{Type:1 Delta:2250} &{Type:1 Delta:0} &{Type:1 Delta:3000} &{Type:1 Delta:24750} &{Type:1 Delta:0} &{Type:1 Delta:2000} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:3250} &{Type:1 Delta:0} &{Type:1 Delta:3250}
Jul 24 20:40:07 ]

I thought the sanity check (assuming the packets are received in order, which they are) for TWCC is that the base sequence number for each TWCC must equal the base sequence number for the previous TWCC + the status count for the previous TWCC. This fails between adjacent TWCC RTX packets.

I am not surprised this messes up BWE. I have no idea why it's happening, though.

@Sean-Der
Member

@cnderrauber Would you be ok putting RTX behind a flag until this is fixed? If you have time to fix it that would be even better, but I understand if you don't.

The default experience of Pion is broken for users currently. I should have time soon to address this. Releasing new version of DTLS today

@kcaffrey

@adriancable I looked at the TWCC logs you posted, and it all looks rather normal to me. What it indicates is that the TWCC interceptor is "losing" some packets (which I suspect are the RTX packets not making their way through the TWCC interceptor, somehow).

we are getting 2 main track TWCCs, followed by 3 RTX TWCCs, followed by 2 more main track TWCCs

One note is that TWCC is transport-wide, so there is one contiguous sequence of TWCCs across all SSRCs. With that in mind, the sequence you pasted has the following base sequence numbers and status counts:

  • BSN=4713, Count=37
  • BSN=4750, Count=1
  • BSN=4751, Count=29
  • BSN=4765, Count=32 *
  • BSN=4782, Count=25 *
  • BSN=4807, Count=4
  • BSN=4811, Count=12

With the exception of the two entries that I added an asterisk to, each BSN equals the prior plus the status count (4750=4713+37, 4751=4750+1, etc).

The explanation for the two noted entries is packets that were marked as lost in the prior feedback but arrived late. For example, in the third TWCC feedback (BSN 4751, count 29), there is a run of 14 "receive small deltas", followed by 2 missing packets, then a received large delta and 4 small deltas. The first missing packet would have been SN=4765. In the following feedback, both 4765 and 4766 are denoted as "receive small delta" (therefore being late arrivals). This behavior, where the base sequence number is smaller than the prior BSN + count, is expected and desirable, as seen in libwebrtc's unit tests (also, there is no other way to report that a previously missing packet was received).

Note, however, that not all the missing packets were marked as late arrivals. In the first feedback, for example, 4715, 4716, and 4717 were all marked as missing and never marked as arrived in a later feedback. There are many other such permanently missing packets, which I highly suspect were RTX. Since Chrome sees so many missing packets in TWCC feedback, the bandwidth estimate grows slowly due to the loss-based estimator being used.
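
The continuity rule and its late-arrival exception can be expressed as a small check over parsed feedback, which is roughly what was done by hand above. A sketch using pion/rtcp types, ignoring uint16 sequence-number wraparound for brevity:

	package twcccheck

	import "github.com/pion/rtcp"

	// classifyFeedback compares a TWCC feedback against its predecessor:
	// contiguous, overlapping (late arrivals re-reported), or a gap.
	func classifyFeedback(prev, cur *rtcp.TransportLayerCC) string {
		expected := prev.BaseSequenceNumber + prev.PacketStatusCount
		switch {
		case cur.BaseSequenceNumber == expected:
			return "contiguous" // e.g. 4750 = 4713 + 37
		case cur.BaseSequenceNumber < expected:
			// Overlaps the previous window: packets previously reported lost
			// are being re-reported as late arrivals (the asterisked entries).
			return "overlap (late arrivals)"
		default:
			return "gap" // sequence numbers that were never reported at all
		}
	}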

@cnderrauber
Member

I had some leads this afternoon but was occupied by other things. It should relate to how RTX is read in the RTPReceiver; I will have a PR tomorrow.

@cnderrauber
Member

@adriancable - the RTX read loop can be blocked by TrackRemote.Read, since the repair stream uses an unbuffered channel, and the interceptor uses the timestamp at which the packet is read from the buffer; that explains the out-of-order feedback @kcaffrey saw.

@adriancable
Contributor

adriancable commented Jul 25, 2024

@kcaffrey - the log would be consistent if there were lots of out-of-order packets. But that isn't the case. For example:

[TransportLayerCC:
Jul 24 20:40:07 	Header {true 15 TSFB 13}
Jul 24 20:40:07 TransportLayerCC:
Jul 24 20:40:07 	Sender Ssrc 3311308308
Jul 24 20:40:07 	Media Ssrc 631513261
Jul 24 20:40:07 	Base Sequence Number 4765
Jul 24 20:40:07 	Status Count 32
Jul 24 20:40:07 	Reference Time 609
Jul 24 20:40:07 	Feedback Packet Count 37
Jul 24 20:40:07 	PacketChunks &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[1 1 2 1 1 1 1]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:0 SymbolList:[1 1 1 1 1 1 1 1 1 1 0 0 0 0]} &{PacketStatusChunk:<nil> Type:0 SymbolSize:1 SymbolList:[0 0 0 2 1 1 1]} &{PacketStatusChunk:<nil> Type:0 PacketStatusSymbol:1 RunLength:4}
Jul 24 20:40:07 	RecvDeltas &{Type:1 Delta:52750} &{Type:1 Delta:250} &{Type:2 Delta:-59500} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:750} &{Type:1 Delta:250} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:57750} &{Type:1 Delta:6250} &{Type:2 Delta:-7000} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:0} &{Type:1 Delta:0} &{Type:1 Delta:250} &{Type:1 Delta:6000}
Jul 24 20:40:07 ]

This looks like 4767 was received (a long time) before 4766. But looking at the tcpdump on the receiver side, this isn't the case: 4767 was actually received a few ms after 4766.

@cnderrauber - I think you are on to something. The current logic in trackRemote.Read() returns an RTX packet if one is waiting in the repairStreamChannel; otherwise it blocks on reading the non-repair track. So the failure mode in this logic is that if there's no RTX packet waiting, we block on the non-repair track, and if some RTX packets then come in, we won't process them until the next non-repair packet arrives (which unblocks the loop). This could definitely mess up the timestamps and hence the BWE. I don't think it's an issue that the repairStreamChannel is unbuffered, but in trackRemote.Read() we probably want instead to block the loop until either an RTX or a non-repair packet comes in. If you have a strategy in mind for a PR, that would be amazing.

@adriancable
Contributor

@aankur - are you able to test this modification?

In rtpreceiver.go:

func (r *RTPReceiver) readRTP(b []byte, reader *TrackRemote) (n int, a interceptor.Attributes, err error) {
	if t := r.streamsForTrack(reader); t != nil {
		select {
		case <-r.received: // the receiver has started; do a (possibly blocking) media read
			return t.rtpInterceptor.Read(b, a)
		case rtxPacketReceived := <-t.repairStreamChannel: // an RTX packet handed over by the repair-stream goroutine
			copy(b, rtxPacketReceived.pkt)
			return len(rtxPacketReceived.pkt), rtxPacketReceived.attributes, err
		}
	}

	return 0, nil, fmt.Errorf("%w: %d", errRTPReceiverWithSSRCTrackStreamNotFound, reader.SSRC())
}

In track_remote.go:

func (t *TrackRemote) Read(b []byte) (n int, attributes interceptor.Attributes, err error) {
	t.mu.RLock()
	r := t.receiver
	peeked := t.peeked != nil
	t.mu.RUnlock()

	if peeked {
		t.mu.Lock()
		data := t.peeked
		attributes = t.peekedAttributes

		t.peeked = nil
		t.peekedAttributes = nil
		t.mu.Unlock()
		// someone else may have stolen our packet when we
		// released the lock.  Deal with it.
		if data != nil {
			n = copy(b, data)
			err = t.checkAndUpdateTrack(b)
			return
		}
	}

	n, attributes, err = r.readRTP(b, t)
	if err != nil {
		return
	}
	err = t.checkAndUpdateTrack(b)
	return n, attributes, err
}

I'm currently OOO so this is tested by my eyeballs only.

@aankur
Author

aankur commented Jul 25, 2024

Hi,
Please find the test report:
Screenshot from 2024-07-26 00-15-37
Screenshot from 2024-07-26 00-15-45
Screenshot from 2024-07-26 00-18-42
chrome_pion_modification_1_20240726_0016_11_3.log

@adriancable
Contributor

adriancable commented Jul 25, 2024

@aankur - that’s disappointing. Maybe there’s something not right with my fix, or maybe the cause of your issue is elsewhere.

@cnderrauber - I'll wait for your tests/PR, since I may not have understood your concern right and fixed the wrong issue (although I think what I fixed in my code above should still be fixed, probably in a less haphazard way).

@adriancable
Contributor

@aankur - one other thing you could try is flipping around the order of the case statements in readRTP, to give priority to the RTX track, i.e.:

func (r *RTPReceiver) readRTP(b []byte, reader *TrackRemote) (n int, a interceptor.Attributes, err error) {
	if t := r.streamsForTrack(reader); t != nil {
		select {
		case rtxPacketReceived := <-t.repairStreamChannel:
			copy(b, rtxPacketReceived.pkt)
			return len(rtxPacketReceived.pkt), rtxPacketReceived.attributes, err
		case <-r.received:
			return t.rtpInterceptor.Read(b, a)
		}
	}

	return 0, nil, fmt.Errorf("%w: %d", errRTPReceiverWithSSRCTrackStreamNotFound, reader.SSRC())
}

I'd be interested to see if this makes a difference.

cnderrauber added a commit that referenced this issue Jul 26, 2024
Fix #2830. TrackRemote.Read could block in readRTP if the buffer is empty; RTX packets arriving before the next media RTP packet would then be read after that media packet, causing out-of-order feedback and messing up the remote peer's bandwidth estimation.
@cnderrauber
Member

cnderrauber commented Jul 26, 2024

@adriancable - the problem is that the read is blocked on an empty buffer waiting for the next video RTP packet; RTX packets that arrive before it are then read after it, which causes out-of-order feedback. #2833 works in my test; @aankur, can you confirm it works for you too?

(screenshot)

@aankur
Author

aankur commented Jul 26, 2024

Hi,
I have tested #2833 and it works as expected.
Screenshot from 2024-07-26 09-12-32
Screenshot from 2024-07-26 09-12-17
chrome_pion_pr_2833_1_20240726_0911_64_1.log

Sean-Der pushed a commit that referenced this issue Jul 26, 2024
Fix #2830. TrackRemote.Read could block in readRTP if the buffer is empty; RTX packets arriving before the next media RTP packet would then be read after that media packet, causing out-of-order feedback and messing up the remote peer's bandwidth estimation.
@adriancable
Contributor

@cnderrauber - this is great - thank you so much!

However, I still think the RTX code needs modifying along the lines of #2830 (comment)

As it currently stands, if track.Read() is called when there is no RTX packet waiting, it will block on reading from the non-repair track. At this point, if RTX packets come in, they will just queue up until the next non-repair RTP packet comes in, which then unblocks.

This may not be terrible in practice because there shouldn't be long gaps between RTP packets on the non-repair track, so the RTX packets would not be blocked for long. But I think we should fix it because the point of RTX is to get missing packets to the receiver as quickly as possible. I think my modification achieves that, but it would be great to have some extra eyeballs before I make the PR.

@cnderrauber
Member

cnderrauber commented Jul 26, 2024

@adriancable - waiting on r.received doesn't mean there are RTP packets; it just means receiver.startReceive has been called before the read. You would need another signal to unblock the read, or write the RTX packet into the primary buffer directly, like what we have done in LiveKit (see the sketch below). I think this is not a problem, since RTP packets are sent continuously in practice, and if there is no video packet for a long time, getting retransmitted packets quickly won't help the subscriber experience anyway.
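
A sketch of the "write into the primary buffer" shape being described (illustrative only; neither LiveKit's nor Pion's actual code): both the media read loop and the repair-stream goroutine push into one queue, so a single blocking read serves both streams and RTX packets never wait behind a blocked media read.

	import "sync"

	// packetBuffer is a toy single-queue buffer shared by media and RTX producers.
	type packetBuffer struct {
		mu    sync.Mutex
		cond  *sync.Cond
		queue [][]byte
	}

	func newPacketBuffer() *packetBuffer {
		b := &packetBuffer{}
		b.cond = sync.NewCond(&b.mu)
		return b
	}

	// push is called by both the media read loop and the repair-stream goroutine.
	func (b *packetBuffer) push(pkt []byte) {
		b.mu.Lock()
		b.queue = append(b.queue, append([]byte(nil), pkt...))
		b.mu.Unlock()
		b.cond.Signal()
	}

	// pop blocks until a packet (media or RTX) is available.
	func (b *packetBuffer) pop() []byte {
		b.mu.Lock()
		defer b.mu.Unlock()
		for len(b.queue) == 0 {
			b.cond.Wait()
		}
		pkt := b.queue[0]
		b.queue = b.queue[1:]
		return pkt
	}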

@adriancable
Contributor

@cnderrauber - I agree. I think we are good to close this!
