A while ago I raved about Skype and made a few uneducated statements about Speex. Jean-Marc Valin, the lead author of Speex and a genuine expert on audio coding, was nice enough to send me an email explaining why I was wrong, along with a number of other interesting details about audio codecs. In my previous post
I said that open source voice codecs are so much worse than commercial codecs [...]. [Speex's] own codec comparison page seems to indicate that the proprietary algorithms are better.
Looking at this page again, I'm not sure why I said that. The iLBC codec from Global IP Sound, the codec used by Skype, has basically the same coding delay as Speex. As Jean-Marc pointed out, the 30 ms coding delay is irrelevant when there are hundreds of milliseconds of delay due to the network, jitter buffers and sound card buffers. I also have not tested any VoIP software that uses Speex, so I can't make any quality comparisons between Speex and Skype.
As I have written before, I believe that the "open" VoIP projects, such as Speex, Asterisk, OpenH323, and GnomeMeeting, are very important. If VoIP technology gets locked into proprietary protocols, nothing changes. If the technology remains open, then innovative services like Free World Dialup and fwdOut are possible. These things really have the potential to change how the industry works.
Jean-Marc and I also discussed some other details about voice over IP in general. With his permission, I've included parts of our correspondence here. The indented parts are written by Jean-Marc.
However, if you listened to all these codecs, I can bet that AT EQUAL BIT-RATE, you would find that Speex is close, but not quite as good quality as AMR and AMR-WB, but much better quality than iLBC. You can actually listen to samples in narrowband (8 kHz) and wideband (16 kHz).
You are absolutely correct, quality is very subjective, and basically impossible to measure in a quantitative fashion. I will also freely admit to not being very knowledgeable about sound codecs. All that I can say is that Skype has much better sound quality than the previous VoIP software I have tried (Netmeeting, Yahoo Chat, OpenH323, NetFone). Unfortunately, there is no cross-platform VoIP software that uses Speex, so I can't try it to see how well it works.
Actually, it's possible to use Speex with Netmeeting and talk to people using gnomemeeting. On the other hand, Skype isn't available on Linux (which is why I never actually tried it, though I have some idea of how it works).
Hmm... Interesting! I didn't realize that there are downloadable ACMs that work with Netmeeting. I may have to fight with this enough to give it a try. The biggest problem for me is that the people I talk with over the Internet are, for the most part, not technical people (my parents, brothers, girlfriend, etc.), so I have to set it up on both ends.
Skype is available on Linux, and has been for a long time. I spend about a third of my time in Linux, and I've used Skype without any problems. You need the Qt libraries installed.
As for the delay numbers, let me say that a difference of 5 or 10 ms is completely meaningless for VoIP, where you usually get total delays between 200 ms and 1 second when you count network delay and jitter, audio buffer and all.
Well, if there is much more than 500 ms of one-way delay, I may just hang up and pick up a regular telephone. Trying to chat over a delayed audio connection is frustrating and annoying. For me, one of the things that I instantly noticed about Skype is that it has less delay than the software I have used before. I even had one Skype conversation while we both held phones to our ears and the delay difference between the two was nearly imperceptible. I assumed that the difference is because of their choice of codec, although to be more specific, might it be because iLBC is good at handling dropped or delayed packets?
The latency of Skype has nothing to do with the codec. Also, while it is claimed that iLBC is more robust to packet losses, it has been shown that one can achieve even better robustness by just using a lower bit-rate codec (iLBC has a high bit-rate for the quality it provides) and just adding redundancy.
Interesting! That makes it sound like the research into having audio codecs that "degrade" seamlessly with packet loss has been a failure. Is that accurate? Or is it just that iLBC does not do a great job of it?
That's not what I meant. What happens is that Speex and other similar (*CELP) codecs encode the current frame (packet) using the previous frames (i.e. it has a memory). When a packet is lost, the frame needs to be reconstructed, so we basically make up something plausible. However, when the next frame arrives, the memory of the decoder is out of sync with the one at the encoder because we only "guessed" the last packet. That means that even this frame we just got won't be perfectly reconstructed (how far from the original depends on how far our guess was).
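To make the memory problem concrete, here is a toy sketch in Python. It is not Speex or any real CELP codec: it is a made-up first-order predictive coder with no quantization, and the names `LEAK`, `encode`, and `decode` are invented for this illustration. It only shows how one guessed frame leaves the decoder's memory out of sync for the frames that follow.

```python
from typing import List, Optional

LEAK = 0.5  # hypothetical predictor coefficient, chosen only for illustration


def encode(samples: List[float]) -> List[float]:
    """Send each value as the difference from a prediction based on the previous value."""
    residuals, memory = [], 0.0
    for x in samples:
        residuals.append(x - LEAK * memory)
        memory = x  # lossless toy: the encoder's memory is the original value
    return residuals


def decode(packets: List[Optional[float]]) -> List[float]:
    """Rebuild values; a lost packet (None) is replaced by a guessed residual of 0."""
    out, memory = [], 0.0
    for r in packets:
        guess = 0.0 if r is None else r
        memory = LEAK * memory + guess  # decoder memory depends on all earlier frames
        out.append(memory)
    return out


if __name__ == "__main__":
    original = [8.0, 8.0, 8.0, 8.0, 8.0]
    packets = encode(original)
    packets[1] = None  # simulate one lost packet
    decoded = decode(packets)
    print([o - d for o, d in zip(original, decoded)])
```

With the second packet dropped, this prints `[0.0, 4.0, 2.0, 1.0, 0.5]`: the frames that arrive after the loss are still wrong, and in this toy the error halves with each frame as the decoder's memory drifts back toward the encoder's. How quickly a real codec recovers depends on its design.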
Now, the idea behind iLBC is to encode each frame independently of past frames. This means that they still have to guess the missing packets, but when a new packet arrives, it can be perfectly reconstructed right away. That seems like the best solution, except that by making each frame independent, they also make the codec less efficient for the normal case. For example, iLBC at 15 kbps has the same quality as G.729 at 8 kbps (for Speex, the equivalent would be between 8 and 11 kbps), almost twice the bandwidth for the same quality. With all that bandwidth, what you could do is simply send each frame twice! So in a case where your network has 20% packet loss, you could reduce that to 4% by sending the packets twice. I doubt iLBC performs better with 20% packet loss than G.729 with only 4%. The guys of the speech coding group at my university (the ones that came up with 729 and all) made the experiment and it clearly shows that adding redundancy is more efficient than encoding frames independently. This is similar for Speex.
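The redundancy arithmetic above can be checked with a few lines of Python. This is a rough sketch that assumes packet losses are independent of each other (real networks often lose packets in bursts, so the actual gain may be smaller) and ignores packet header overhead. The function names are made up for illustration.

```python
def residual_loss(loss_rate: float, copies: int) -> float:
    """Probability that every copy of a frame is lost, assuming independent losses."""
    return loss_rate ** copies


def total_bitrate_kbps(codec_kbps: float, copies: int) -> float:
    """Payload bit-rate when each frame is sent `copies` times (headers ignored)."""
    return codec_kbps * copies


if __name__ == "__main__":
    # Figures quoted in the text: 20% loss, G.729 at 8 kbps, iLBC at 15 kbps.
    print(f"loss with 2 copies: {residual_loss(0.20, 2):.0%}")
    print(f"G.729 sent twice: {total_bitrate_kbps(8, 2):g} kbps vs. iLBC at 15 kbps")
```

Under these assumptions, sending every G.729 frame twice costs about 16 kbps of payload, roughly what iLBC uses at 15 kbps, while the chance of losing both copies of a frame drops from 20% to 4%.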
So in summary, I'm not claiming that Speex is the best codec out there, but its weaknesses have nothing to do with what you're mentioning. I would also say that I'm pretty sure Skype could have had better quality if it used Speex instead of iLBC.
Interesting! So why didn't they choose to use Speex? The Skype developers just didn't do their homework, and immediately ran to GIPS instead of searching on Google? It seems to me that a company would be very happy to use something that is open and patent free, rather than having to licence technology from another company. If the voice quality would also be better, then it would be a clear win. Is there some other advantage of iLBC?
My guess is that iLBC and Speex were probably the only options for them because licensing another codec would have been too expensive.
Right, that's what I figure. However, I noticed in the iLBC licence there is lots of discussion about how the licence applies only to the software, and not to any patents that may be held by GIPS or other parties. I'm assuming that GIPS likely has some sort of patents on the codec algorithm(s) that need to be licenced for a fee, if you really want to use iLBC in a commercial product?
As to why they chose iLBC I can only guess. Either it's because they felt like they wanted to work with a company, or maybe GIPS made them an offer they couldn't refuse (e.g. develop stuff for them for free so they could get some visibility).
At any rate, Skype is not successful because of any technical innovations. It reuses codecs developed by someone else and it uses the firewall tunneling pioneered by Kazaa and many other P2P systems. The reason that it is successful is that the whole package just works. I find it depressing that no "open" systems have been developed that work as well. However, if any are developed, you can bet they will use Speex.
I think the main reason Skype is popular is because it works through NATs (using STUN/P2P and all).
This is definitely one of the reasons. In fact, the reason I started to use it was because I was behind a firewall, and so was the other person I was trying to talk with. I could not get Netmeeting to work, but Skype worked flawlessly.
Also, so many VoIP apps are written really badly when it comes to delay that suddenly Skype sounds so good with respect to that.
This is the reason that Skype is popular: They made no other mistakes. The UI is simple and usable, it works on multiple platforms, and apparently it does good echo cancellation as well. Basically, they put out a solid, bug-free program that "just works." That is really, really important for successful technology.
Last thing, I think it can use wideband, which isn't common yet.
It definitely does use wideband. A technical report about the Skype protocol states "Skype uses wideband codecs which allows it to maintain reasonable call quality at an available bandwidth of 32 kb/s."