&yet is a highly distributed team, with yetis all over North America and Europe. To keep in touch, we use Talky on a daily basis for impromptu discussions among our teammates. Unfortunately, Talky doesn't quite work for our all-hands meetings because "full-mesh" media sharing only functions well when the number of people in the conference is small. To change that, we are working on an improved version, internally known as Talky 2, that we've tested with up to 20 people.
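To put rough numbers on why full mesh stops working, here is a back-of-the-envelope sketch. It assumes every participant sends one stream directly to every other participant in the mesh case, and that a central relay (which is the role a videobridge plays) receives one stream per participant and forwards it to the others. The function names are made up for illustration and the model ignores bandwidth, simulcast and audio-only participants.

```python
def mesh_load(participants: int) -> dict:
    """Stream counts when everybody sends directly to everybody else."""
    n = participants
    return {
        "uploads_per_participant": n - 1,
        "downloads_per_participant": n - 1,
        # One peer connection per unordered pair of participants.
        "connections_in_conference": n * (n - 1) // 2,
    }


def relay_load(participants: int) -> dict:
    """Stream counts when a central relay forwards media (simplified assumption)."""
    n = participants
    return {
        "uploads_per_participant": 1,
        "downloads_per_participant": n - 1,
        # One connection from each participant to the relay.
        "connections_in_conference": n,
    }


if __name__ == "__main__":
    for size in (3, 5, 20):
        print(size, mesh_load(size), relay_load(size))
```

The thing to look at is the upload column: in a mesh it grows with every person who joins, while with a relay it stays at one stream no matter how big the meeting gets.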
Last Friday, during our weekly update meeting, Talky 2 would not work: all the videos remained black. Switching from Chrome to Opera resolved the problem for a while, but at some point every participant in the conference had their browser crash. We had to switch back to Hangouts. Which was embarrassing.
What happened?
Crashing every browser (including Opera) was something I had been expecting. We had seen it before, and the WebRTC team is currently investigating it. It is quite a serious issue, and getting them more data helps them figure out what is going on. Showing black videos, on the other hand, was a behavior I had not seen before, so I went to investigate. I could not reproduce this behavior on my Linux laptop, and it seemed to happen only on the Macintosh computer we use in our meeting room.
The first thing I noticed was that it was running Chrome 41. That is not the version I was testing with; Chrome 41 is currently in beta. The way Google rolls out updates means that a small percentage of users get those updates even on the stable channel, so any major problems can be noticed at that point and still be fixed before they affect the majority of users.
So... when I'm debugging a WebRTC problem, the first thing I do is check the chrome://webrtc-internals page, which shows all the API calls and statistics information. If something breaks, there is quite a good chance that you can see what might be wrong on that page.
Was it Talky or Jitsi Videobridge? Or something else?
Indeed, the page showed that Chrome had failed to establish a connection to the Jitsi Videobridge (the component that lets us scale to a much higher number of participants than the current Talky). It was apparent that Chrome only sent a few bytes and then stopped, even though the basic connection seemed to be up. It looked as if Chrome, despite being able to establish a basic ICE connection, had failed to set up the DTLS encryption. To verify this hunch I dumped the low-level traffic using Wireshark, which showed a DTLS Alert packet that it described as "Level: Fatal, Description: Illegal Parameter".
Sounds scary, eh?
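If you have never looked at one of these alerts on the wire, the following sketch shows roughly what Wireshark is decoding. It assumes an unencrypted alert (which is what you get when the handshake itself fails), relies on the standard DTLS record layout of a 13-byte header followed by two alert bytes, and only knows a handful of alert codes. The sample bytes are hand-built for illustration, not taken from our capture, and the function name is my own.

```python
import struct

CONTENT_TYPE_ALERT = 21  # TLS/DTLS record content type for alerts
ALERT_LEVELS = {1: "Warning", 2: "Fatal"}
# Only a few of the registered alert descriptions, enough for this example.
ALERT_DESCRIPTIONS = {
    0: "Close Notify",
    10: "Unexpected Message",
    20: "Bad Record MAC",
    40: "Handshake Failure",
    47: "Illegal Parameter",
    50: "Decode Error",
    70: "Protocol Version",
    80: "Internal Error",
}


def parse_dtls_alert(record: bytes) -> tuple:
    """Decode a plaintext DTLS alert record into (level, description)."""
    if len(record) < 15:
        raise ValueError("too short for a DTLS alert record")
    # Header: type (1), version (2), epoch (2), sequence number (6), length (2).
    content_type, _version, _epoch, _seq, length = struct.unpack(
        "!BHH6sH", record[:13]
    )
    if content_type != CONTENT_TYPE_ALERT:
        raise ValueError("not an alert record")
    if length != 2:
        # Alerts sent after the handshake are encrypted and longer than 2 bytes.
        raise ValueError("alert is not plaintext, cannot decode")
    level, description = record[13], record[14]
    return (
        ALERT_LEVELS.get(level, f"Unknown ({level})"),
        ALERT_DESCRIPTIONS.get(description, f"Unknown ({description})"),
    )


# Hand-built sample: alert, DTLS 1.0 (0xfeff), epoch 0, sequence 5, fatal, code 47.
SAMPLE = bytes.fromhex("15 feff 0000 000000000005 0002 02 2f")

if __name__ == "__main__":
    level, description = parse_dtls_alert(SAMPLE)
    print(f"Level: {level}, Description: {description}")
```

A fatal alert means the sender is giving up on the connection, which matches what I saw: a few bytes sent, then nothing.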
At this point I started to wonder whether this was related to just Talky or whether it would also affect other applications like Jitsi Meet. Turns out it did affect them, so I pinged the Jitsi team and got a response very quickly – they had started to notice this a few hours earlier as well.
So a Chrome update was about to break both our WebRTC applications... I have been afraid of that happening ever since I started working on WebRTC about two and a half years ago. I've had my share of WebRTC-related crashes, in almost every release since Chrome 29, but those had never affected all users.
The Chrome 41 WebRTC release notes had been published earlier last week. I tend to review them very carefully, and they did not contain any hint of changes related to DTLS. George Politis from the Jitsi team filed an issue in the WebRTC bug tracker and I immediately poked Vikas Marwaha, the awesome Google engineer who writes the release notes and triages all reported issues. He was able to reproduce it and assigned it to the person responsible for that area in less than 25 minutes.
That's quite a good response time, especially considering it was 3pm on a Friday. The issue got reassigned to the BoringSSL team that same day and was fixed on Monday morning. The fix will be included in the release version of Chrome 41.
By a whisker we caught it early in the Chrome Beta phase.
Why did we not notice this earlier?
We do some automated testing that should have caught this. However, since we run the testing tool on Linux servers and Linux was not affected, our own testing process did not catch the change when it was introduced. Since the problem only happens with the Jitsi Videobridge and the BouncyCastle TLS library it uses, the Chrome-Firefox interop testing that Google does didn't show the problem either.
As for Google: while I do not want to experience such a situation again, the way this was handled makes me confident that they take such issues very seriously. The WebRTC team seems to have been surprised by the change in BoringSSL as well.
Effectively, "what if a Chrome upgrade kills your WebRTC applications" is FUD. The Chrome release process makes it possible to detect such changes relatively early, so if, as a developer, you are not working with the Canary version, you should start doing that today. And get automated testing done. On all platforms. Oh, and also across other browsers like Firefox and Opera, just to increase the size of the testing matrix.
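To see how quickly that matrix grows, and how much of it a Linux-only setup like ours leaves untested, here is a small sketch. The platform, browser and channel lists are assumptions chosen for illustration (not every channel ships on every platform), and the function names are hypothetical rather than part of any testing tool we use.

```python
from itertools import product

# Illustrative lists only; adjust to what actually ships on each platform.
PLATFORMS = ["linux", "mac", "windows"]
BROWSERS = {
    "chrome": ["stable", "beta", "canary"],
    "firefox": ["stable", "beta", "nightly"],
    "opera": ["stable"],
}


def build_matrix(platforms, browsers):
    """Return every (platform, browser, channel) combination."""
    return [
        (platform, browser, channel)
        for platform, (browser, channels) in product(platforms, browsers.items())
        for channel in channels
    ]


def untested(matrix, covered):
    """Return the combinations that no test run covers."""
    covered = set(covered)
    return [combo for combo in matrix if combo not in covered]


if __name__ == "__main__":
    matrix = build_matrix(PLATFORMS, BROWSERS)
    # Roughly our situation: automated tests ran on Linux servers only.
    linux_only = [combo for combo in matrix if combo[0] == "linux"]
    gaps = untested(matrix, linux_only)
    print(f"{len(matrix)} combinations, {len(gaps)} untested")
    for combo in gaps:
        print("missing:", combo)
```

With these lists, a Linux-only run leaves the Mac and Chrome beta combination in the untested set, and that is the one that bit us in the meeting room.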