Rony Gao - Conference Interpreter, Translator, Trainer & Communications Consultant

Simultaneous Interpreting: The Race between Human Brain and Artificial Intelligence (Part 2)

Rony Gao is a member of Mensa Canada, a practicing conference interpreter and cross-cultural consultant based in Toronto. As a Chinese-English interpreter, Rony has worked for a wide array of political and business leaders.

(Updated in April 2022) Below is Part 2 of a two-part essay adapted from a talk I gave to members of Mensa Canada in 2018.

Mensa is the largest and oldest high IQ society in the world, of which I have been a member since 2015. Many members of Mensa are interested in the topic of human intelligence, and I was invited to weigh in on this topic through the lens of my own profession—first giving a talk to the Toronto chapter of Mensa back in 2018, which was later converted into this two-part essay published in 2019.

Three years and a pandemic later, it seems that many of my predictions are becoming reality at a speed we did not anticipate.


In Part 1 of this article, I briefly described the advent of simultaneous interpretation and the techniques that enable interpreters like myself to perform this task. Next, I would like to answer an obvious question confronting our industry today: Is AI already good enough to replace human interpreters? If not, will AI ever be?

90% × 90% × 100% = 81%?

To frame this discussion, let us first seek to understand how artificial intelligence might perform the task of a simultaneous interpreter. There are three steps (sketched in code after the list below):

  • Speech Recognition. This is the same technology that enables you to hold a conversation with Apple’s Siri, Amazon Echo, or Google Home. Thanks to machine learning and massive amounts of user-generated data, the accuracy of speech recognition programs has improved by leaps and bounds in just the past few years. In China, for example, internet companies like Baidu, iFlytek and Sogou claim that their input methods can convert voice into text at accuracy rates above 90%, some as high as 97%.

  • Text-to-Text Translation. Using technology in the translation of text is nothing new. Over the past few decades, Computer Assisted Translation (CAT) has evolved enormously: from translation memory (TM) to Statistical Machine Translation (SMT) to Neural Machine Translation (NMT). The essence of this progress is that when a large amount of source text is fed into the machine’s algorithm, together with high-quality human-produced translation in the target language, the machine recognizes patterns and applies them to any sentence that follows the same pattern. As the name “machine learning” suggests, the machine literally “learns” how to process the input data without relying on human-produced patterns or grammar rules. While the accuracy of text-to-text translation is hard to quantify, most users agree that the output is usually highly usable, enabling us to understand foreign-language webpages and other documents almost instantaneously and for free.

  • Speech Synthesis, also known as Text-to-Speech technology. Thanks to deep learning (another buzzword that is often used interchangeably with Machine Learning or Artificial Intelligence), computer programs are able to synthesize lifelike voices to read out any text. To get a sense of where the technology stands today, go to this website and click the “listen” button. Paid tools perform even better.
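
Conceptually, these three steps chain into a single pipeline, with the output of each stage feeding the next. Below is a minimal Python sketch of that structure; the three functions are placeholders for whatever speech-recognition, machine-translation and text-to-speech engines one might plug in, and do not refer to any particular product.

    # A structural sketch of the three-step pipeline described above. The three
    # stage functions are placeholders, not real product APIs; in practice each
    # would wrap a speech-recognition, machine-translation or text-to-speech service.

    def recognize_speech(audio_chunk: bytes) -> str:
        """Step 1: turn a chunk of source-language audio into text."""
        raise NotImplementedError("plug in a speech recognition engine here")

    def translate_text(source_text: str, src: str = "zh", tgt: str = "en") -> str:
        """Step 2: translate the transcribed text into the target language."""
        raise NotImplementedError("plug in a machine translation engine here")

    def synthesize_speech(target_text: str) -> bytes:
        """Step 3: read the translation out loud (or skip this and show subtitles)."""
        raise NotImplementedError("plug in a text-to-speech engine here")

    def interpret(audio_chunk: bytes, speak: bool = False):
        """Chain the three steps; note that errors made early on flow downstream."""
        transcript = recognize_speech(audio_chunk)
        translation = translate_text(transcript)
        return synthesize_speech(translation) if speak else translation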

Now, to answer the question of how good AI already is, we can give technology a report card at each step, and multiply the three numbers to get an overall result.

  • Some of the best speech recognition programs are claiming accuracy rates between 90% and 97%. That sounds jaw-dropping, and yet it is very much true. Here, we will use 90% as an estimated score.

  • Assessing quality at the translation stage is the most controversial part. Some say that machine-powered translation is still far from satisfactory, and never will be. Others marvel at the quality of automated translation tools and say that the Tower of Babel is already a thing of the past. For example, if you use Facebook, you have probably noticed the “translate” button next to your friends’ posts in other languages. Click it, and you will get a decent idea of what they said most of the time. The “Translate This Page” button on Chrome, likewise, enables us to navigate foreign-language websites with ease. If I had to put a number on it, I would give these tools a 90 out of 100. They are far from perfect, but let us agree that they are already highly usable, and are only getting better by the day! My own experience as a translator also suggests that the usefulness of machine-powered translation depends on the type of text. If the language contains highly repetitive (and yet specialized) patterns and vocabulary, machines usually prove to be very reliable. Legal contracts and patent applications, for example, fall into this category. Interestingly, these areas also tend to be the weak links in a human interpreter’s skill mix. This phenomenon lays the groundwork for a collaborative approach between the human and the machine.

  • The third stage in this process, speech synthesis, is much less relevant to this discussion than the first two. Theoretically speaking, the quality at this stage is always 100%, as the machine can easily read out transcribed (and translated) text with perfect accuracy. Plus, many will find this part of the service unnecessary. Think of the last time you watched a foreign-language movie. Given the choice between a dubbed version and the original version with subtitles, I bet that many of you preferred the original version with subtitles in your language. In a few years, multilingual conferences will likely be streamed on mobile devices with real-time subtitles available in different languages. In other words, in the future technology may be serving the needs of viewers rather than listeners.

The three steps and estimated accuracy involved in the process of AI-powered simultaneous interpretation.

So, if we do the math and multiply these scores, we get somewhere around 81%. But what exactly does this number mean? Is the job of human interpreters 81% doomed, or are we still safe? I would like to discuss two additional considerations.
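
The reasoning behind that figure is simply that errors compound from stage to stage: under the simplifying (and admittedly rough) assumption that the three stages fail independently, the end-to-end accuracy is the product of the per-stage accuracies.

    # Back-of-the-envelope estimate: if each stage passes its errors on to the next,
    # end-to-end accuracy is roughly the product of the per-stage accuracies
    # (assuming, simplistically, that the stages fail independently).
    stage_accuracy = {
        "speech recognition": 0.90,
        "machine translation": 0.90,
        "speech synthesis": 1.00,
    }

    overall = 1.0
    for accuracy in stage_accuracy.values():
        overall *= accuracy

    print(f"Estimated end-to-end accuracy: {overall:.0%}")  # roughly 81%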

Machines and human interpreters are good at different, and usually complementary, parts of the job

First, machines and human interpreters are good at different, and usually complementary, parts of the job. For instance, human interpreters often feel stressed out by the numbers, figures, and proper nouns that come up in the speeches being interpreted. If the following sentence comes up in a speech out of the blue, most human interpreters will have a hard time keeping up simultaneously, especially if the interpreter is unfamiliar with the subject matter.

MAHMOUD MOHIELDIN, Senior Vice-President of the 2030 Development Agenda, United Nations Relations and Partnerships, World Bank Group, describing the messages that emerged from the international financial institution’s spring meetings, said global growth has lost momentum, dropping from 3.3 per cent in the first quarter of 2018 to below 2.7 per cent in the fourth quarter.

Machines, however, are amazingly accurate and fast when it comes to transcribing and translating proper nouns and numbers. If the transcribed and translated message were displayed on a screen in front of the human interpreter in the “booth” (the soundproof space in which interpreters work), it would greatly enhance the interpreter’s confidence and the overall quality of the output.
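
As a thought experiment, such a booth aid would not even need full machine translation: simply pulling the numbers and proper nouns out of a live transcript and pinning them on a screen would already relieve much of the pressure. Here is a minimal sketch of that idea, assuming an English transcript and the open-source spaCy library; the choice of library and model is purely illustrative.

    # A minimal sketch of a "booth aid": scan a live transcript and surface the
    # numbers and proper nouns that a human interpreter is most likely to fumble.
    # Assumes spaCy and its small English model are installed
    # (pip install spacy, then: python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def booth_cues(transcript: str) -> dict:
        doc = nlp(transcript)
        return {
            "named_entities": [(ent.text, ent.label_) for ent in doc.ents],
            "numbers": [token.text for token in doc if token.like_num],
        }

    sentence = ("Global growth has lost momentum, dropping from 3.3 per cent in the "
                "first quarter of 2018 to below 2.7 per cent in the fourth quarter.")
    print(booth_cues(sentence))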

Thinking and analyzing is AI’s blind spot

What about AI’s weak spot? The answer is that AI is, to this day, still very weak at thinking and analyzing. Human interpreters take into account the social context when they interpret. Machines are utterly unable to do so. For instance, when the American investment guru Ray Dalio was invited to give a talk in China last year, the event organizer in Beijing was daring enough to use a service provided by Sogou, offering a combination of Step 1 (speech recognition) and Step 2 (text-to-text translation), with both the transcription and the translation displayed on a big screen.

When the host of the event, a Chinese professor and friend of Mr. Ray Dalio, introduced his guest, he said “Ray是一个做梦的人” (Ray shi yi ge zuo meng de ren). The correct English translation would have been something like “Ray is a dreamer” or “Ray is a man with dreams.” To everyone’s bewilderment, what came out on the big screen was “瑞士一个,做梦的人。” (Rui shi yi ge zuo meng de ren), accompanied by the English translation “One in Switzerland. A dreamer.” Phonetically, that was exactly what the speaker said, but apparently the machine mistakenly understood the syllables Ray and shi (the Chinese character for “is”) to mean Rui-shi (瑞士 / Switzerland).

Picture Credit: Jonathan Rechtman https://www.linkedin.com/pulse/ray-dalio-speaks-china-machine-translation-fails-jonathan-rechtman/

This is a telling case, because the root cause of this blunder is not a lack of data or computing power, but the inability to think and analyze. Had a human interpreter heard the same combination of sounds (Ray-shi), (s)he would very likely have understood it as “Ray is” rather than “Switzerland”, especially given that Mr. Ray Dalio is American.
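
To see why context matters here, note that the two readings are built on the very same syllables, and only knowledge about the situation, such as who is being introduced, tells them apart. The toy sketch below is purely illustrative (real systems score candidate readings with statistical language models rather than a hand-written rule), but it captures the kind of contextual cue a human interpreter uses without thinking.

    # Toy illustration (not a real ASR algorithm): the syllables "Ray/rui + shi"
    # support two readings, and only situational context picks the right one.
    CANDIDATES = {
        ("ray/rui", "shi"): [
            "瑞士 ...",    # "Switzerland ..." -- the reading the machine chose
            "Ray 是 ...",  # "Ray is ..."      -- the reading the context called for
        ],
    }

    def choose_reading(syllables, known_names):
        """Prefer a reading that starts with a name we know is in the room."""
        readings = CANDIDATES.get(tuple(syllables), [])
        for reading in readings:
            if any(reading.startswith(name) for name in known_names):
                return reading
        return readings[0] if readings else None

    # Without context the first (most "probable") reading wins, as it did on stage;
    # knowing the guest's name flips the choice.
    print(choose_reading(("ray/rui", "shi"), known_names=[]))       # 瑞士 ...
    print(choose_reading(("ray/rui", "shi"), known_names=["Ray"]))  # Ray 是 ...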

Second, it is also worth pointing out that technical maturity is one thing, but user adoption is quite another. Just because some technology is “pretty much there” doesn’t mean it can take over a market any time soon. One major barrier to overcome is the perceived risk and lack of trust among stakeholders. After all, the real decision makers about using simultaneous interpretation services are not people browsing a Wikipedia page and casually clicking the “translate” button on Chrome, but big institutional clients (think of the United Nations and the Government of Canada) that tend to be conservative and slow in adopting new technology. At the slightest risk that things might go wrong, these large institutions usually hold back and stick to the safe option that they’ve been using for decades.

This dilemma between innovation and risk was best illustrated by the public’s reaction to the first fatal accident caused by Uber’s self-driving car. In light of this accident, which we knew was going to happen sooner or later, Uber immediately suspended its tests. In a way, this is unfair to the Ubers of our world: statistically speaking, human drivers kill hundreds of pedestrians every day, but very few of those accidents make headlines.

I suspect that the same story will play out for AI-enabled simultaneous interpretation. As soon as news of the first “major accident” breaks (hopefully, it won’t cost anyone their life), people will blame it on the imperfect technology and hold back from it for a while. Perhaps this means that the wide adoption of breakthrough technologies in facilitating international meetings, where the stakes are usually high, won’t be an easy path. Will it happen one day? Maybe, but much later than a technologist might expect.


So, where are we headed?

To conclude, I would like to offer three predictions about where this “race” between the human brain and artificial intelligence is headed.

  1. In the next 5–10 years, technology will continue to re-shape conference interpreting, improve its user experience, and lower its cost.

  2. In 10–20 years, most interpretation work will involve AI-powered tools as assistants.

  3. The human interpreter’s job will be re-defined, but human interpreters will never be replaced.

In short, AI should, and will, have a role to play in shaping the future of simultaneous interpretation, just as it will for many other professions. However, AI should be not a replacement for human interpreters, but rather an assistant to them. If used properly, it will greatly augment our ability to translate spoken language simultaneously, accurately, and elegantly.


Read the original blog post on the Mensa Canada blog: https://mensa.ca/2019/04/18/simultaneous-interpretation-race-between-human-brain-and-artificial-intelligence-part-2/
