Convert Text to Speech Using Google API

If you go to Google Translate, you can not only translate text from one language to another, but you can also listen to the translation. For example, this English to Chinese translation allows you to listen to the pronunciation of the Chinese text.

Google’s Cloud Text-to-Speech API allows you to programmatically generate mp3s of any text. Below are steps to do it on Windows using PHP.

1. Follow These Instructions

https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries#client-libraries-install-php

2. Set Environment Variable

When you follow the steps above, you will download a JSON file containing your credentials. You need to set an environment variable by opening a command prompt and entering

set GOOGLE_APPLICATION_CREDENTIALS=path-to-json-file

You can then verify it is set by typing “set”.

That environment variable is temporary and will persist for the duration of the terminal session. To set the environment variable permanently, follow these steps.

3. Install PHP Composer

https://getcomposer.org/download/

Composer will need a php.ini file. If one doesn’t exist, it will create one.

4. Update php.ini

To ensure your SSL certificates are up-to-date, download the latest cacert.pem from https://curl.haxx.se/ca/cacert.pem. Then, edit php.ini as follows:

curl.cainfo=”/path/to/downloaded/cacert.pem”

5. Create PHP Script

Copy and paste the example code from the instructions in step 1. This is a PHP script so wrap the code in <?php … ?>. Save it as test-text-to-speech.php somewhere.

6. Run PHP Script

At the command prompt, verify the Google environment variable is set and then run the PHP script. If PHP is in your path, you can run, for example,

This will output an audio file (output.mp3) in the same folder. By default, the text is “Hello, world!” and the language code is en-US. You can change the text to Chinese, for example: 这是一个测试 and change the language code accordingly to cmn-CN. Then, you’ll get the same speech as what you hear in Google Translate.

See list of text-to-speech voices (language codes)

There are two types of voice synthesis models: standard (parametric) and Wavenet. Wavenet is more expensive but sounds more natural.

Google Text-to-Speech pricing is based on the number of characters processed.

For additional customization of text to speech, use Speech Synthesis Markup Language (SSML).