Vocaloid: The Music of the Future
Overview
A research paper written for the Fundamentals of Music Technology class about the singing voice synthesizer Vocaloid and how it has changed the music industry.
Date: May 6th, 2020
Type: Research Paper
Position: Author
Note: Written for the Fundamentals of Music Technology class
Vocaloid: The Music of the Future
Have you ever imagined that one day our music will be produced entirely by robots? In the 2019 Japanese anime Carole & Tuesday, the story is set in a future world where people’s lives depend heavily on computers and artificial intelligence writes most of the music (Motamayor, 2019). Although computers replacing human composers may seem far-fetched, musicians can already compose with synthesized singing voices using software known as Vocaloid. The invention of Vocaloid is no doubt a breakthrough in music technology: it benefits both composers and producers, and it has shaken the whole music industry significantly.
The idea of producing artificial human voices is much older than most people imagine. The earliest artificial voice production started with speech synthesis, in which a text-to-speech system converts text into a synthesized human voice built from recorded voices stored in a database. Before electronic signal processing was invented, scientists were already building mechanical talkers: in 1779, Christian Kratzenstein constructed acoustic resonators that could produce the five basic vowel sounds. Later, in 1791, Wolfgang von Kempelen invented the “acoustic-mechanical speech machine,” which used models of the tongue and lips to produce consonants as well as vowels, and this speech machine was refined by many later scientists. In the 1930s, Bell Labs developed the VOCODER, a machine that analyzed speech and resynthesized it, and in 1939 Homer Dudley refined it into the VODER, a keyboard-operated speech synthesizer. In 1950, Dr. Franklin S. Cooper invented the Pattern Playback, which converted pictures of acoustic patterns into sounds (McGill CS, 2007).
Since the 1950s, computer technology has advanced to the point where speech synthesizers can be built in software rather than as dedicated machines. Scientists developed the first electronic speech synthesis systems in the 1950s, and in 1968 the first text-to-speech system was invented. Around the same time at Bell Labs, John Larry Kelly, Jr. used an IBM 704 computer to produce synthesized speech and recreate the song “Daisy Bell,” with Max Mathews providing the accompaniment; this was the first time speech synthesis was used in a musical context. In the movie 2001: A Space Odyssey, the HAL 9000 computer sings the same song during the film’s climactic scene. Scientists are still seeking ways to make speech synthesis sound more natural and clear (McGill CS, 2007).
However, speech synthesis produces artificial voices mainly for speech; although Bell Labs famously made a computer sing, these systems were not designed for making music. As the technology developed, researchers tried to make computers sing through various techniques, such as cross-synthesis (which analyzes a vocal and a non-vocal instrument to produce a synthesized singing voice), frequency modulation, formant wave functions, and formant synthesizers. Many singing synthesis projects of the late twentieth century, such as Pabon’s singing synthesizer, worked through formant control. In 1984, Xavier Rodet and Yves Potard used the CHANT synthesizer, arguably the most successful singing synthesizer of the twentieth century, to synthesize the Queen of the Night aria from the opera The Magic Flute. Karaoke systems also used real-time vocoder processing to add harmonies to a live singing voice (Cook, 1996). However, these techniques were still not mature, since they mainly depended on modeling the human vocal tract and still worked on the vocoder principle; further work was needed to create a more natural, clear synthesized singing voice.
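To make the formant approach concrete, the sketch below shows the classic source-filter idea: an impulse train standing in for glottal pulses is passed through a cascade of two-pole resonators tuned to the formants of a vowel. The formant values for /a/ and the Python implementation are my own illustrative assumptions, a minimal sketch rather than any of the historical systems mentioned above.

```python
# Minimal source-filter formant synthesis sketch (assumed formant values
# for the vowel /a/; not any specific historical system).
import numpy as np
from scipy.signal import lfilter

SR = 16000          # sample rate in Hz
F0 = 220.0          # fundamental (pitch) of the sung note
DUR = 1.0           # duration in seconds
FORMANTS = [(730, 90), (1090, 110), (2440, 170)]  # (center Hz, bandwidth Hz) for /a/

# Source: an impulse train approximating glottal pulses at the pitch period.
n = int(SR * DUR)
source = np.zeros(n)
period = int(SR / F0)
source[::period] = 1.0

# Filter: a cascade of two-pole resonators, one per formant.
signal = source
for fc, bw in FORMANTS:
    r = np.exp(-np.pi * bw / SR)                # pole radius from bandwidth
    theta = 2 * np.pi * fc / SR                 # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]    # resonator denominator coefficients
    b = [1.0 - r]                               # rough gain normalization
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))                # normalize to [-1, 1]
```

Writing the result to a WAV file and varying F0 or the formant table makes the vowel “sing” different pitches and colors, which is essentially what formant synthesizers like CHANT controlled at a much finer level.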
More and more singing synthesizers emerged over the following years, and the most prominent and successful among them is Vocaloid, which is popular among computer music producers today. If you are familiar with Japanese popular culture, you have probably seen an anime girl who wears her long blue hair in pigtails. Her name is Hatsune Miku, but she is not merely an anime or video game character: she is a digital instrument built on Yamaha’s voice synthesizer engine, and music producers can tune her voice to create singing that sounds like a human (Boxall, 2016). Hatsune Miku was a huge success, but she was not Yamaha’s first attempt.
In 1997, Yamaha contacted Jordi Bonada, a senior researcher at the Music Technology Group of Pompeu Fabra University in Barcelona, about a research project on voice transformation. The project, named “Elvis,” aimed to help people sing better at karaoke. However, the project lasted only two years and never came out as a product: it depended on spectral morphing techniques and on a professional singer’s performance of each song, and there were far too many songs in Japanese karaoke systems. Although Elvis failed, it gave the developers new ideas and a different perspective on building voice synthesizers. “We realized that it might be a better idea to record not just a song from a particular singer,” said Bonada, “but a set of vocal exercises with a great phonetics range and build a model capable of singing any song” (St. Michel, 2014).
Hideki Kenmochi, later known as the “father of Vocaloid,” initially worked on active-noise-canceling hardware when he first joined Yamaha. In 2000, he became interested in Bonada’s singing synthesizer project. While Bonada and his team did the research in Barcelona, Yamaha focused on product design and development. The team took the Elvis project as its starting point, but the main challenge was processing and transforming singers’ recordings into natural, fluent singing. The team therefore developed a new voice model that could transform vocal timbres naturally while preserving their detail (St. Michel, 2014).
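Conceptually, this sample-based approach resembles concatenative synthesis: short recorded units covering a language’s phoneme transitions are looked up from the singer’s recordings, pitch-shifted to match the melody, and joined smoothly. The sketch below is a deliberately simplified illustration of that idea; the unit table, the naive resampling pitch shift, and the linear crossfade are my own assumptions, not Yamaha’s actual algorithm.

```python
# Highly simplified concatenative singing sketch: look up recorded diphone
# units, pitch-shift them to the target notes, and crossfade them together.
# The unit library, pitch-shift method, and crossfade are illustrative
# assumptions, not Vocaloid's real processing.
import numpy as np

def pitch_shift(unit: np.ndarray, src_f0: float, dst_f0: float) -> np.ndarray:
    """Naive resampling pitch shift (it also changes duration; real systems
    use spectral models to shift pitch independently of timing)."""
    idx = np.arange(0, len(unit) - 1, dst_f0 / src_f0)
    return np.interp(idx, np.arange(len(unit)), unit)

def concatenate(units, fade=256):
    """Join units with a short linear crossfade to hide the seams
    (assumes every unit is longer than the crossfade)."""
    out = units[0]
    for u in units[1:]:
        ramp = np.linspace(0.0, 1.0, fade)
        overlap = out[-fade:] * (1 - ramp) + u[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, u[fade:]])
    return out

# library: diphone name -> (recorded samples, f0 at recording time)
# score:   sequence of (diphone, target f0) pairs from lyrics + melody
def sing(library, score):
    shifted = [pitch_shift(library[d][0], library[d][1], f0) for d, f0 in score]
    return concatenate(shifted)
```

A real system would also stretch the units to the note durations and smooth the spectral envelope across the joins, which is exactly the natural-timbre-transformation problem Bonada’s team had to solve.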
As a result, the prototype of Vocaloid, named Daisy, came out in 2002. This earliest version of the program was already very similar to the one widely used today: it allowed users to type in lyrics and adjust the synthesized singing voice generated by the computer. Yamaha originally wanted to sell the program by itself; however, that would have limited users to the voice libraries made by Yamaha. Therefore, Yamaha decided to license the technology to other companies (St. Michel, 2014).
In 2003, the team displayed Daisy at Musikmesse, a German music trade show, and after the event the program was renamed “Vocaloid.” One year later, the British company Zero-G released the first version of Vocaloid with two voice libraries, LEON and LOLA, a male and a female English voice. Later that year, the Japanese company Crypton Future Media released the first Japanese Vocaloid voice library, MEIKO ("Timeline," n.d.).
Vocaloid 1 was not very successful overall, but Crypton’s MEIKO sold better than Zero-G’s LEON and LOLA. The reason was that Crypton designed a character for MEIKO and put her on the product’s package. The second version, Vocaloid 2, was easier to use, produced a smoother voice, and sampled real human voices. To record the new voice library, Crypton recruited Saki Fujita, a Japanese voice actress, and the illustrator KEI designed a new character for the product (St. Michel, 2014). This new character was named Hatsune Miku, meaning “first sound of the future” in Japanese, and she sold very well and became a huge success ("The First," 2012).
After the massive success of Hatsune Miku, more and more companies obtained licenses from Yamaha to build Vocaloid voice libraries in different languages, and many more characters have been created since then. For example, Crypton’s Megurine Luka has both Japanese and English voice libraries, which made her the first multilingual Vocaloid. In 2011, the Korean company SBS A&T released SeeU, the first Korean Vocaloid, and in 2012 the first Chinese Vocaloid, Luo TianYi, was released by the Chinese company Shanghai HENIAN. There are also English Vocaloids, such as PowerFX Systems AB’s Sweet Ann and OLIVER and Zero-G’s Prima. Besides licensing new Vocaloids, Yamaha also keeps updating the technology to generate more natural singing voices. For example, later versions of Hatsune Miku, known as Hatsune Miku Append, include new voicebanks called Soft, Sweet, Dark, Vivid, Solid, and Light, which let users choose different voice textures to express emotions and fit the music’s genre ("Timeline," n.d.).
Vocaloid’s success is phenomenal because, for the first time, people can create convincing singing voices entirely on a computer, so music producers no longer have to find a vocalist for their compositions. The famous Vocaloid producer Hachioji-P recalls that he “had been making my music for a while, club music, but it had all been instrumental,” and that he purchased Hatsune Miku because he “didn’t know anyone” he “could ask to sing over my music.” He went on to play his music at major Vocaloid events and began connecting with other producers who shared his musical talents and dreams (St. Michel, 2014).
More importantly, Vocaloid opens doors for anyone with musical talent. Coralmines, a Vocaloid producer, describes himself as an “average bedroom musician with average music skills and absolutely no experience nor knowledge of the music industry,” yet Vocaloid gives him the chance to share his music with everyone in the world (Boxall, 2016). Crypton released Hatsune Miku around the same time the Japanese video website Nico Nico Douga became popular. Many amateur musicians started uploading songs sung by Hatsune Miku to the website and gradually established a community of Vocaloid producers. Many of the earliest Vocaloid producers, such as Supercell and Livetune, later entered the mainstream music industry thanks to the success of their Vocaloid songs on Nico Nico Douga (St. Michel, 2014).
Beyond individual composers and producers, Vocaloid has also dramatically affected the music industry itself as it has grown into a big business. Hachioji-P says that the Vocaloid scene was “more like a playground” in the beginning; later, once producers had achieved enough on Nico Nico, they started to care about what audiences wanted to hear, and the whole business gradually became more and more commercialized. As a result, many Japanese music retailers now have a section devoted to Vocaloid songs, and Japanese karaoke systems carry thousands of Vocaloid songs in their libraries (St. Michel, 2014). There are even Vocaloid concerts where these virtual stars perform on stage as holograms; one of the most famous is Magical Mirai, the annual tour of Hatsune Miku and the other Crypton Vocaloids. These concerts initially toured only within Japan, but in 2014 Miku Expo was held in New York and Los Angeles, taking the shows international and expanding the market from Japan to the world (Min, 2020). People in the music industry treat Vocaloids as a new business and earn huge profits from these virtual singers; Vocaloid thus benefits the music industry in unconventional ways.
Although Vocaloid has already achieved worldwide success, Yamaha keeps improving the technology every year. In 2012, Yamaha started developing a Vocaloid keyboard that lets producers make Vocaloids sing in real time by playing the keys (Byford, 2012). In 2018, the latest version of the engine, Vocaloid 5, was released with many new functions: attack and release effects that add emotion to the singing voice, an emotion tool for controlling vocal strength while viewing the waveform, and new control parameters for adjusting the voices. With these features, the synthesized voices now sound closer to real human singing. Vocaloid 5 also includes a vast phrase library that helps people who have trouble writing lyrics and melodies, and it works directly inside other DAWs as a VST plugin, making the software more straightforward to use than before ("VOCALOID 5," n.d.). In 2019, Yamaha announced that its new VOCALOID:AI had successfully reproduced the voice of Hibari Misora, a legendary Japanese singer, by learning from recordings of her singing and speech, a sign that Vocaloid is evolving toward artificial intelligence (Yamaha Corporation, 2019).
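In its simplest form, an attack/release control is an amplitude envelope over the start and end of a note; the sketch below illustrates that basic idea. Vocaloid 5’s attack and release effects reshape the timbre of note onsets and endings as well, so this is only a conceptual analogy, not the product’s actual processing.

```python
# Minimal illustration of attack/release shaping: a linear amplitude
# envelope applied to a synthesized note. This is a conceptual analogy
# only; Vocaloid 5's effects also modify timbre, not just loudness.
import numpy as np

SR = 16000

def apply_attack_release(note: np.ndarray, attack_s: float, release_s: float) -> np.ndarray:
    env = np.ones(len(note))
    a = int(attack_s * SR)
    r = int(release_s * SR)
    env[:a] = np.linspace(0.0, 1.0, a)      # fade in over the attack time
    env[-r:] = np.linspace(1.0, 0.0, r)     # fade out over the release time
    return note * env

# A 440 Hz sine "note" with a soft attack and a long release:
t = np.arange(int(SR * 1.0)) / SR
note = apply_attack_release(np.sin(2 * np.pi * 440 * t), 0.05, 0.3)
```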
From the early mechanical speech machines to Vocaloid, humans have come a long way in creating synthesized voices and can now make computers sing. Vocaloid’s invention and success give people new ways of composing music and new chances to show their musical talent, opening an unconventional market in the music industry. Vocaloids are no doubt the music of the future. As Yamaha keeps developing the technology, it is exciting to see how far Vocaloids will go and how they will keep changing the world.
References
Boxall, A. (2016, March 13). 'Vocaloids' aren't characters, they're instruments changing the way music is made. Retrieved May 6, 2020, from https://www.digitaltrends.com/music/hatsune-miku-creative-revolution-musicians/
Byford, S. (2012, March 21). Yamaha developing Vocaloid keyboard for real-time live performances. Retrieved May 8, 2020, from https://www.theverge.com/2012/3/21/2889395/yamaha-vocaloid-keyboard-real-time-live-performances
Cook, P. R. (1996). Singing Voice Synthesis: History, Current Work, and Future Directions. Computer Music Journal, 20(3), 38-46. https://doi.org/10.2307/3680822
The first sound of the future: Hatsune Miku. (2012, March 20). Retrieved May 7, 2020, from https://www.digitalmeetsculture.net/article/the-first-sound-of-the-future-hatsune-miku/
McGill CS. (2007). Speech Synthesis. Retrieved May 6, 2020, from https://www.cs.mcgill.ca/~rwest/wikispeedia/wpcd/wp/s/Speech_synthesis.htm
Min, L. (2020, April 21). What Live Music can Learn from the World's Biggest Virtual Pop Star. Retrieved May 8, 2020, from https://www.nylon.com/entertainment/hatsune-miku-future-of-live-music
Motamayor, R. (2019, September 13). 'Carole & Tuesday' is a Science Fiction Anime About the Power of Music From the Creator of 'Cowboy Bebop'. Retrieved May 6, 2020, from https://www.slashfilm.com/carole-and-tuesday/
St. Michel, P. (2014, November 11). The Making of Vocaloid. Retrieved May 6, 2020, from https://daily.redbullmusicacademy.com/2014/11/vocaloid-feature
Timeline. (n.d.). Retrieved May 7, 2020, from https://vocaloid.fandom.com/wiki/Timeline
VOCALOID 5. (n.d.). Retrieved May 8, 2020, from https://www.vocaloid.com/en/
Watanabe, S. (Producer). (2019, April 10). Carole & Tuesday [Television program]. Retrieved from https://www.netflix.com/title/80992137
Yamaha Corporation. (2019, October 8). Yamaha VOCALOID:AI™ Faithfully Reproduces Singing of Legendary Japanese Vocalist Hibari Misora. Retrieved May 8, 2020, from http://www.vocaloid.com/en/news/vocaloid_ai_news_release