Speech recognition between American and Chinese companies

I’ve already written here that I started trying out speech input. I’ve tried several systems, for both Chinese and English, namely Apple’s, Sogou’s, and IFlyTek’s. Sogou is a relatively well-known company, at least in China, that had sizable search market share quite a while ago; it’s also famous for its Chinese input method, the one I use by default. IFlyTek is a little-known company in Hefei, Anhui, where it can tap smart graduates of the nearby University of Science and Technology of China (中国科技大学), arguably the nerdiest school in China. I was rather disappointed, and my impression was that these Chinese companies have a long way to go in AI compared to the top American ones.

Curious to see a more objective comparison, I ran a test: I recorded some impromptu remarks on the matter.

Yes, it sounds very hesitant and stumbly, because it was entirely improvised, but it’s good enough. What did Apple, Sogou, and IFlyTek generate, respectively, when fed this audio file?


我想做一下作业,然后试试中国公司和美国公司的语音识别比较一下。我排客之后对这些中国公司感到非常的失望,就不用说,谷歌苹果很可能都比最好的中国公司多苹果。的强项不是我觉得中国的科技公司这两年好多都是媒体可以的。他们斗地主还是继续在美国所找到的人也都是下个二流的同时,很可能在中国的顶级的开发人,还知道那就是说最好的开发员可能在美国还比中国多多得多。(here, many characters were actually omitted as opposed to misrecognized)





It turned out that Sogou and IFlyTek are actually a bit better than Apple at speech recognition, to my surprise, which just goes to show how flawed subjective impressions can be. Of course, all of them made numerous major errors, enough that I can see why speech input still isn’t widely used (as far as I know). Even for English, Apple makes some errors. I told my friend this, and he said, “strange, it’s usually pretty reliable for me, maybe your voice isn’t clear enough.” Though he was using Google’s on an Android, and we all know that Google is the world leader in AI, almost certainly quite a ways ahead of the other top companies. So I tried out Google’s as well, via this, and the result was:


It’s comparable in accuracy to IFlyTek, maybe a bit worse.
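Had I wanted a number rather than an impression, the standard way to score these transcripts is character error rate (for Chinese; word error rate for English): the edit distance between the recognizer’s output and a reference transcript, normalized by the reference length. A minimal sketch in Python (the sample strings are invented for illustration, not my actual transcripts):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here, strings of
    Chinese characters, so this counts character-level errors)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cer(ref, hyp):
    """Character error rate: edits needed, normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Toy example: one substitution out of four characters -> 0.25
print(cer("语音识别", "预音识别"))  # 0.25
```

A lower rate means a better transcript; running each system’s output against the same hand-made reference would turn the comparison above into actual numbers.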

Of course, I’m sure Google and Apple have invested relatively little in Chinese speech recognition, just as Sogou and IFlyTek have invested little in English (or maybe they trained on English spoken with Chinese accents), because their English speech recognition basically felt like complete garbage.

In any case, we can still see that speech recognition, and AI in general, has a long way to go. After all, your AI is only as good as the data you train it on. It will never handle cases outside its training set that aren’t programmatically hard-coded, unless there is a major paradigm shift in how state-of-the-art AI is done (something even better than neural nets).

Whoever reads this is welcome to run a similar experiment comparing Google Translate with Baidu Translate. I did, but I didn’t record the results, so it doesn’t really count as a completed experiment.

Trying out speech input

I wrote my previous blog article lying in bed at night, very tired, trying out speech recognition input, using the one provided by Sogou. It turned out that even after many manual corrections, several errors remained that I didn’t catch. You can check the complexity and level of ambiguity of the writing itself (of course, you’ll have to read Chinese). You also don’t know how clearly I spoke. Yes, it can be a problem when you speak quickly without a certain level of enunciation, especially when your combination of words isn’t all that frequent. There are also exceptional cases that a human with contextual knowledge would easily recognize but the machine cannot, like when I say Sogou: such a human would not hear it as “so go.” Of course, this is expected; AI is only as good as your training data.

I tried Google’s speech recognition too, here, and initially it seemed to work much better, until it started to make errors as well. Next, I tried IFlyTek, that company in Hefei which supposedly hires a ton of USTC (中科大) grads. Still not much better. It’s much easier to type Chinese and only very occasionally have to select something other than the default. It turns out the statistical NLP techniques for Chinese input work well enough, especially given the corpus that Sogou, whose input method I use, has accumulated over time. I had read that a while back, Sogou even accused Google of using some of its data for Google’s Pinyin input method, and Google actually conceded that it had. It’s expected that companies in China would have easier access to such data. Even so, Google Translate still works visibly better than Baidu Translate, even for Chinese.

From an HCI perspective, it’s much easier to input English on a phone than Chinese. Why? Because spelling correction (Pinyin correction in the case of Chinese), necessary on a phone touch-screen keyboard, works much better for English than for Chinese. Sure, Sogou provides a 9-key input method, as shown below (as opposed to the traditional 26-key),


where, once one is sufficiently practiced, the key-press error rate goes down significantly, but the tradeoff is more ambiguity, which means more inference errors to correct manually. In the example below, 需要 (xu’yao) and 语言 (yu’yan) are equivalent under the key-based equivalence relation (where the equivalence classes of letters are ABC, DEF, PQRS, etc.). Unfortunately, I meant 语言 (yu’yan), but the system detected 需要 (xu’yao).
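The clash between 需要 (xu’yao) and 语言 (yu’yan) can be checked mechanically: map each pinyin letter to its key on the standard phone keypad and compare digit sequences. A quick sketch (the helper names are mine):

```python
# Standard phone 9-key layout: each letter shares a key with 2-3 others,
# which is exactly the equivalence relation described above.
KEYPAD = {2: "abc", 3: "def", 4: "ghi", 5: "jkl",
          6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz"}
LETTER_TO_KEY = {ch: str(k) for k, letters in KEYPAD.items() for ch in letters}

def key_sequence(pinyin):
    """Digit sequence a pinyin string produces on a 9-key pad;
    two pinyin strings are ambiguous iff their sequences match."""
    return "".join(LETTER_TO_KEY[ch] for ch in pinyin)

print(key_sequence("xuyao"))  # 98926
print(key_sequence("yuyan"))  # 98926 -- identical, hence the ambiguity
```

Both syllable pairs collapse to the digit string 98926, so the input method can only guess from frequency and context which one was meant.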


You can kind of guess that I wanted to say “Chinese, this language, is shit.” The monosyllabic nature of spoken Chinese, in contrast to the polysyllabic languages of the Middle East for which the alphabet was first developed, obstructed the creation of an alphabet. Because each distinct syllable in Chinese maps to so many distinct characters with different meanings, there would be much ambiguity without characters. For an extreme example, the Chinese linguistic genius Yuen Ren Chao (赵元任) composed an actually meaningful passage of 92 characters all pronounced shi, titled Lion-Eating Poet in the Stone Den.

I remember how in 8th grade history class, an American kid in some discussion said that our (Western) languages are so much better than their (Chinese-based) languages, and the teacher responded: I wouldn’t say better, I would say different. Honestly, that kid had a point. Don’t get me wrong. I much appreciate the aesthetic beauty of the Chinese language; I’m the complete opposite of all those idiot ABCs who refuse to learn it. But no one can really deny that the lack of an alphabet made progress significantly harder in many ways for Chinese civilization, and not just for literacy. Printing was much harder to develop, though that is now a solved problem, thanks in large part to this guy. There is also the fact that Sogou’s Chinese OCR, which I just tried, basically doesn’t work. Of course, nobody really worries about this now, unlike in the older days: in the early 20th century, prominent Chinese intellectuals like Qian Xuantong (钱玄同) advocated abolishing Chinese characters, and early in the computer era, people worried that Chinese characters would be a problem for it.

In any case, unless I am presented with something substantially better, I can only conclude that any claim, such as this one, that computers now rival humans at speech recognition is bullshit. I was telling a guy yesterday that AI is only as good as your training data; it cannot really learn autonomously. In less restricted domains (unlike chess and go, where there are very precisely defined rules), such as computer vision and speech recognition, there will be edge cases obvious to a human that fool the computer, until the appropriate training data is added for each edge case and its associates. Analogously, there has been a near-perpetual war between CAPTCHA generators and bots over the past few decades, with more sophisticated arsenals developed by both sides over time. Technical, mathematically literate people, so long as they take a little time to learn the most commonly used AI models and algorithms, all know this. Of course, there will always be AI bureaucrats and salesmen conning the public and investors to get more funding and attention for their field.

Don’t get me wrong. I still find the results in AI so far very impressive. Google search uses AI algorithms to find you the most relevant content, and now deep learning is being applied to extract information directly from image content itself, vastly improving image search. I can imagine that in a decade we’ll have the same working relatively well for video. To illustrate: China now has face recognition deployed on a wide scale. This could potentially be used to search for all the videos a specific person appears in, by computationally scanning through every video in the index and recording correspondences between specific people and times in specific videos. Of course, much of the progress has been driven by advances in hardware (GPUs in particular) that enable 100x+ speedups in training time. AI is mostly an engineering problem. The math behind it is not all that hard, and in fact relatively trivial compared to much of what serious mathematicians do. Backpropagation, the idea and mathematical model behind deep learning, was conceived in academic papers in the 70s and 80s but was far too computationally costly at the time to run on real-world data; it is pretty straightforward, nowhere near the difficulty of many models in theoretical physics developed long ago. What’s beautiful about AI is that simple models often work well enough for certain types of problems, so long as the training data and computational power are there.