Baidu: Using Machine Learning for Voice Cloning to Get Closer to Consumers…All In Just 3.7 Seconds!

Baidu is using machine learning to clone human voices and develop the capability of imitating thousands of accents in order to personalize machine-based interactions with consumers.

Baidu is to the Chinese market what Google is to the Western world. In recent years, the Chinese search giant has been making an aggressive push into innovative technologies such as artificial intelligence and autonomous vehicles. One of these efforts is the development of computer speech synthesis software that aims to personalize human-machine interactions in the services it offers. Baidu hopes this technology will allow it to offer more genuine interactions between computer-based applications and the end customer, in turn attracting companies that want to use this new marketing tool to get closer to their customers.

Voice-based natural language processing (“NLP”) systems are not exclusive to Baidu and have been around for a while: companies like Apple and Amazon feature voice-based NLP in Siri and Alexa, respectively. In simple terms, voice-based NLP listens to speech and uses mathematical models to determine what was said, which it then transcribes into text. The system then breaks down every word in this text to determine what part of speech it corresponds to (noun, verb, adjective). Finally, through a series of algorithms and coded grammar rules, it determines the context of what was said (1).

Simplified diagram illustrating key stages of voice-based natural language processing systems (2)
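
The stages described above (text from speech, part-of-speech tagging, rule-based interpretation) can be illustrated with a toy sketch. This is not how Baidu, Apple, or Amazon implement their systems: the lexicon, the function names, and the "first verb, last noun" rule are all hypothetical simplifications, and the speech-to-text stage is skipped by starting from a sentence that is assumed to be already transcribed.

```python
import re

# Hypothetical hand-written lexicon mapping words to parts of speech.
# A production system would learn these tags from large amounts of data.
LEXICON = {
    "play": "verb", "find": "verb", "order": "verb",
    "music": "noun", "pizza": "noun", "restaurants": "noun",
    "some": "determiner", "a": "determiner", "the": "determiner",
    "nearby": "adjective", "large": "adjective", "relaxing": "adjective",
}


def tokenize(transcript):
    """Split an already-transcribed sentence into lowercase words."""
    return re.findall(r"[a-z']+", transcript.lower())


def tag(tokens):
    """Attach a part of speech to every word ('unknown' if not in the lexicon)."""
    return [(word, LEXICON.get(word, "unknown")) for word in tokens]


def interpret(tagged):
    """Toy grammar rule: the first verb is the action, the last noun its object."""
    action = next((word for word, pos in tagged if pos == "verb"), None)
    nouns = [word for word, pos in tagged if pos == "noun"]
    return {"action": action, "object": nouns[-1] if nouns else None}


def understand(transcript):
    """Run the tokenize -> tag -> interpret stages in order."""
    return interpret(tag(tokenize(transcript)))


if __name__ == "__main__":
    print(tag(tokenize("Play some relaxing music")))
    print(understand("Order a large pizza"))
```

Even this toy version shows why the final stage is the hard one: a fixed rule breaks as soon as a sentence departs from the expected word order, which is why real assistants rely on learned models rather than a handful of coded rules.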

The Chinese company has taken speech technology a step further. In 2017 the company introduced Deep Voice, a system that uses deep learning to convert text to speech and can produce short sentences that sound indistinguishable from a real person (3). This system required many hours of training data and was only able to learn one voice at a time.

Baidu continued to invest in this technology, and earlier this year the company released the third and latest version of its marquee software, Deep Voice, claiming that the system could clone a human’s voice with only 3.7 seconds of training data (4). Not only can the system replicate a speaker’s voice in record time, it can also manipulate that voice, changing it from male to female or from a British accent to an American one, essentially allowing a person to hear how they would sound with a British accent. The system is so effective that it was able to fool voice recognition software 95% of the time (5).

At the core of Baidu’s push into AI is the hope to personalize human-machine interactions and bring the consumer closer to the company. This software will certainly help improve its voice-search applications in the short term, but Baidu has greater ambitions for the technology. One major area where Baidu is looking to apply it is voice marketing. A recent poll by consulting firm Capgemini found that one third of respondents had purchased a consumer product using a voice assistant (6). In recent years digital assistants have become increasingly involved in marketing campaigns, and there seems to be a clear trend toward more AI in marketing (7). Using Baidu’s Deep Voice, marketers can minimize the hassles of customer service and feedback management. Assistants could identify where a speaker is from based on how they speak and then respond in the voice of someone from a similar background, which would help build trust with the customer. The technology would also allow companies to be present at every step of the customer’s purchasing journey, helping brands get closer to consumers and create more authentic interactions.

This speech synthesis and recognition technology does not come without risks. Several critics argue that if the technology is made freely available to the public, identity and voice theft could become a real threat. Other potential risks include an increase in phone fraud and fake recordings of politicians and other public figures.

As Tom Harwood, co-founder of a voice security solutions company, points out:

“[This technology] raises serious concerns about voice biometric security systems. Soon, criminals will need just a few seconds of someone’s voice to cheat a voice recognition security system – voice biometric authentication will be rendered useless.”(4)

As Baidu continues its transformation from a desktop website to a mobile-based search app, it will need to consider whether Deep Voice is the right tool to attract new customers, namely Chinese advertising companies, and bring them on board in the voice-marketing space. The company will need to find ways to mitigate several risks in the technology, mainly security concerns around identity theft and fake recordings. However, the key underlying question remains: will Baidu succeed in seamlessly replacing human interaction with human-like speaking machines? And how will consumers react once they learn they have been unknowingly interacting solely with machines?



[1] Mills, T. (2018). What Is Natural Language Processing And What Is It Used For? [online] Forbes. [Accessed 12 Nov. 2018].

[2] Shewan, D. (2018). 10 Companies Using Machine Learning in Cool Ways. [online] Wordstream. [Accessed 12 Nov. 2018].

[3] Popper, B. (2018). Baidu’s new system can learn to imitate every accent. [online] The Verge. [Accessed 12 Nov. 2018].

[4] Allen, T. (2018). Deep Voice can clone a human voice in 3.7 seconds. March 8, 2018. Via LexisNexis Academic. [Accessed Nov. 2018].

[5] Upgraded Deep Voice can mimic any voice in mere seconds. Tech Xplore. March 6, 2018. Via LexisNexis Academic. [Accessed Nov. 2018].

[6] Bernard, J. (2018). Meet The Voice Marketer: Who Will Tell The Story Of Marketing’s Next Chapter? Forbes. [Accessed 13 Nov. 2018].

[7] Baidu Unveils Deep Voice 2: A Multi-Speaker Neural Text-to-Speech Technology. ICT Monitor Worldwide. May 30, 2017. Via LexisNexis Academic. [Accessed Nov. 2018].



Student comments on Baidu: Using Machine Learning for Voice Cloning to Get Closer to Consumers…All In Just 3.7 Seconds!

  1. Interesting read. Baidu’s performance is lagging compared to Tencent and Alibaba on the advertising front. Given that ads are more important to Baidu than to Alibaba and Tencent, it is not surprising that they want to open up new avenues to increase the revenue stream. And here comes voice (search). My main question would be this: since Baidu is already struggling to gain consumer trust after prior healthcare-related scandals caused by a lack of control, the additional security risks around this voice technology would probably further damage Baidu’s positioning if things go south again. Is it all worth it from the company’s point of view?

  2. Wow! This article poses an extremely interesting (and slightly eerie!) question when it comes to transforming the consumer experience with AI-driven voice technologies. The business case for building out these types of technologies is clear: reducing labor. But I wonder whether commercializing these technologies is really as close as Baidu would like. When I think of the chat bots that even the most sophisticated companies have today, I am still extremely skeptical. The classic case is of course Microsoft’s Twitter bot, which, when unleashed on the open web, quickly turned into any marketer’s worst nightmare. I wonder what Baidu can do to ensure that, as this technology is deployed, it truly does build trust with the consumer. One offensive interaction could be fatal.
