IBM Cloud Docs
How to modify speech synthesis

How to modify speech synthesis

The expressive neural voices that are available with the IBM Watson® Text to Speech service offer some additional features that are not available with other types of voices: using speaking styles, emphasizing interjections, and emphasizing words. These features are available for both the HTTP and WebSocket interfaces.

The features involve the use of elements of the Speech Synthesis Markup Language (SSML). The descriptions of the features provide information about how they interact with related SSML elements and attributes.

Using speaking styles

The expressive neural voices determine the sentiment of the text from the context of its words and phrases. The speech that they produce, in addition to having a very conversational style, reflects the mood of the text. The expressive voices naturally express gratitude, thankfulness, happiness, empathy, confusion, and other sentiments by default, with no explicit additional tagging.

However, you can embellish the voices' natural tendencies by using the <express-as> element with the required style attribute to indicate that all or some of the text is to emphasize specific characteristics. These characteristics are referred to as speaking styles:

  • cheerful - Expresses happiness and good news. The style is upbeat, welcoming, and conveys a positive message.
  • empathetic - Expresses empathy and compassion. The style has sympathetic undertones, but it is not excessively sorrowful.
  • neutral - Expresses objectivity and evenness. The style strives for less emotion, and instead conveys a more even and instructional tone.
  • uncertain - Expresses uncertainty and confusion. The style conveys the feeling of being unsure or in doubt.

In many cases, the effect of the styles is very subtle. In such cases, the differences might not be readily discernible from the default expressive tone. But there are instances where the style can audibly enhance the voices' inherent ability to detect and express emotion. The differences are often detectable only on certain words and phrases.

Expressing speaking styles with SSML

The following HTTP request uses the <express-as> element with a style of cheerful to modify the speaking style of the entire input text. The example uses the en-US_EmmaExpressive voice.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output full-cheerful-style.wav \
--data "{\"text\":\"<express-as style='cheerful'>Oh, that&apos;s good news! I am very happy for you!</express-as>\"}" \
"{url}/v1/synthesize?voice=en-US_EmmaExpressive"

IBM Cloud Pak for Data

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output full-cheerful-style.wav \
--data "{\"text\":\"<express-as style='cheerful'>Oh, that&apos;s good news! I am very happy for you!</express-as>\"}" \
"{url}/v1/synthesize?voice=en-US_EmmaExpressive"

The following example HTTP request uses the cheerful style for the first sentence of the request only. The second sentence is spoken with the default voice. The example again uses the en-US_EmmaExpressive voice.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output partial-cheerful-style.wav \
--data "{\"text\":\"<express-as style='cheerful'>Oh, that&apos;s good news!</express-as> Do you need any further help?\"}" \
"{url}/v1/synthesize?voice=en-US_EmmaExpressive"

IBM Cloud Pak for Data

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output partial-cheerful-style.wav \
--data "{\"text\":\"<express-as style='cheerful'>Oh, that&apos;s good news!</express-as> Do you need any further help?\"}" \
"{url}/v1/synthesize?voice=en-US_EmmaExpressive"

Emphasizing interjections

When you use expressive neural voices, the service automatically detects a number of common interjections based on context. In the resulting audio, it gives them the natural emphasis that a human would use in normal conversation. Expressive voices support a different set of interjections based on language. Please refer to the language specific pages for more details on which interjections are supported.

Enabling or disabling interjections with SSML

The service always interprets and pronounces most of the interjections as expected. However, there are some interjections that could be used in a different context. In such cases, the default behaviour of the service is to disable such interjections by default. You can use SSML to enable or disable such interjections. The interpret-as attribute of the SSML <say-as> element accepts an additional value, interjection. When you use the interpret-as attribute with the interjection value, you include the additional enabled attribute with a value of true or false to indicate whether the word is to be pronounced as an interjection. Every language has a different subset of interjections that can be enabled/disabled through SSML. Please refer to the language specific pages for more details on which interjections can be enabled/disabled using SSML.

The following example HTTP request directs the service not to pronounce oh and aha as interjections. The example uses the en-US_AllisonExpressive voice.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output disabled-interjections.wav \
--data "{\"text\":\"<say-as interpret-as='interjection' enabled='false'>Oh</say-as>, in addition, the <say-as interpret-as='interjection' enabled='false'>aha</say-as> wasp is endemic to Australia.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonExpesssive"

IBM Cloud Pak for Data

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output disabled-interjections.wav \
--data "{\"text\":\"<say-as interpret-as='interjection' enabled='false'>Oh</say-as>, in addition, the <say-as interpret-as='interjection' enabled='false'>aha</say-as> wasp is endemic to Australia.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonExpesssive"

Emphasizing words

All the expressive voices use a conversational style that naturally applies the correct intonation from context. But you can use the SSML <emphasis> element to indicate that one or more words are to be given more or less emphasis in the synthesized audio. The change in stress can be indicated by an increase or decrease in pitch, timing, volume, or other acoustic attributes.

The <emphasis> element supports an optional level attribute that accepts one of the following values:

  • none - Prevents the service from emphasizing text that might otherwise be emphasized.
  • moderate - Provides a noticeable amount of emphasis to the text. This value is the default if you omit the level attribute.
  • strong - Provides a more significant amount of emphasis to the text than the moderate level provides.
  • reduced - De-emphasizes the text by tending to reduce its significance in the audio. This level is the opposite of stressing the text.

Table 3 provides examples of each available level for the emphasis element. The samples use the en-US_MichaelExpressive voice. A request that uses the emphasis element fails if a specified level is not one of the supported values.

Table 3. Emphasizing words
Level of emphasis Example sentence Audio sample
No emphasis "I am going to give her the book."
none "I am going to give her the <emphasis level='none'>book</emphasis>."
moderate "I am going to <emphasis level='moderate'>give</emphasis> her the book."
strong "I am going to <emphasis level='strong'>give</emphasis> her the book."
reduced "I <emphasis level='reduced'>am going to give</emphasis> her the book."

Emphasizing a word with SSML

The following example HTTP request uses the <emphasis> element with a level of moderate (the default) to emphasize the word give in the sentence. The example uses the en-US_MichaelExpressive voice.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output strong-emphasis.wav \
--data "{\"text\":\"I am going to <emphasis level='moderate'>give</emphasis> her the book.\"}" \
"{url}/v1/synthesize?voice=en-US_MichaelExpesssive"

IBM Cloud Pak for Data

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output strong-emphasis.wav \
--data "{\"text\":\"I am going to <emphasis level='moderate'>give</emphasis> her the book.\"}" \
"{url}/v1/synthesize?voice=en-US_MichaelExpesssive"