Voxygen Cloud
API Documentation

Neural or Standard Text-to-Speech

NTTS (Neural Text-to-Speech) voices are a type of synthetic voice generated using machine learning algorithms, specifically neural networks. These voices are designed to sound more natural and human-like than standard TTS (text-to-speech) voices, which are typically based on concatenating pre-recorded snippets of speech. NTTS voices are created by training a neural network on a large dataset of recorded speech, and are able to generate speech in a more continuous and natural-sounding manner.

API Overview

The Voxygen Cloud API is a "REST-like" API: each request, made from any client, contains all of the information necessary to service it, and the server does not establish sessions with clients. Authentication is nevertheless required for each request.

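Three resources are available: the Text-to-Speech API, the Account information API and the Reporting API. Each is described in its own section below.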

Request format

The service is invoked with an HTTPS GET or POST request carrying a list of parameters as [parameter, value] pairs.
The MIME type of encoded data is application/x-www-form-urlencoded.
When sent in an HTTPS GET request, data should be included in the query component of the request URI.
When sent in an HTTPS POST, the data should be placed in the body of the message.
In the latter case the Content-Type header is not mandatory, but if it is present, it must be equal to application/x-www-form-urlencoded.
Parameter names and values must be UTF-8 encoded.
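
As a minimal sketch of the two encodings described above, the following Python helper builds either a GET URL or a POST body from the same parameter dictionary. The function name encode_request is hypothetical, and the hmac computation is deliberately left out here (it is covered in the Authentication section).

```python
from urllib.parse import urlencode

def encode_request(base_url, params, method="GET"):
    """Return (url, body, headers) for an application/x-www-form-urlencoded request."""
    # urlencode percent-encodes names and values; UTF-8 is requested explicitly
    encoded = urlencode(params, encoding="utf-8")
    if method == "GET":
        # GET: the data goes in the query component of the request URI
        return base_url + "?" + encoded, None, {}
    if method == "POST":
        # POST: the data goes in the body. The Content-Type header is optional,
        # but if present it must be application/x-www-form-urlencoded.
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        return base_url, encoded.encode("ascii"), headers
    raise ValueError("method must be GET or POST")
```

A POST is probably the safer choice for long texts, since the response codes section mentions a limit on header length.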

Response codes

List of possible HTTP status codes:

  • 200 OK - the request succeeded; the response contains the appropriate content together with a Content-Type header: application/json (requests /info and /tts2) or audio/xxx (request /tts1);
  • 404 Not Found - when the requested resource does not exist or the user parameter is missing;
  • 401 Unauthorized - when the IP address of the client is not allowed for the user or the HMAC is wrong (or missing);
  • 403 Forbidden - when the account has expired;
  • 413 Request Entity Too Large - when the text parameter is too long;
  • 429 Too Many Requests - when the account has sent too many requests in a given amount of time;
  • 400 Bad Request - when an unknown parameter is present, an incorrect value is assigned to a parameter, the input cannot be correctly parsed by the engine or the header is too long (approx. >4000 characters). The response body then contains more information;
  • 500 Internal Server Error;
  • 503 Service Not Available.
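
A client will usually want to turn these codes into something actionable. The sketch below is one possible way to do it in Python; the names STATUS_HINTS, RETRYABLE and classify are hypothetical, and the choice of which codes to retry is an assumption, not something this documentation specifies.

```python
# Status codes listed above, each with a short client-side hint.
STATUS_HINTS = {
    200: "success",
    400: "bad request: read the response body for details",
    401: "unauthorized: check the client IP address and the hmac",
    403: "forbidden: the account has expired",
    404: "not found: unknown resource or missing user parameter",
    413: "text parameter too long",
    429: "too many requests: slow down",
    500: "internal server error",
    503: "service not available",
}

# Assumption: only these codes are worth retrying after a delay.
RETRYABLE = {429, 500, 503}

def classify(status):
    """Return (hint, retryable) for an HTTP status code."""
    hint = STATUS_HINTS.get(status, "unexpected status")
    return hint, status in RETRYABLE
```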

Mark-up of the text

Marking up the text to vocalize lets you dynamically influence the behaviour of the speech synthesis system: change the voice, control the flow, change the volume, and so on. Two mark-up formats are available:

  1. The standardized SSML format, described in the document VOX31_SSML_reference_manual
  2. The Baratinoo in-house format, described in the document VOX32_Baratinoo_tags_reference_manual
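
To give an idea of what marked-up text can look like, here is a small sketch that builds an SSML string in Python. It uses only the speak and break elements from the W3C SSML standard; which elements and attributes Voxygen actually accepts is defined by VOX31_SSML_reference_manual, not by this example, and the helper name ssml_with_pause is hypothetical.

```python
from xml.sax.saxutils import escape

def ssml_with_pause(before, after, pause_ms=500):
    """Wrap two text fragments in a <speak> element, separated by a <break>."""
    # escape() protects &, < and > so that user text cannot break the markup
    return '<speak>{}<break time="{}ms"/>{}</speak>'.format(
        escape(before), int(pause_ms), escape(after))
```

The resulting string would be sent as the value of the text parameter.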

Text-to-Speech API

These requests convert the value of the text parameter to an audio file. The format of the audio file depends on the two parameters header and coding. The one-stage request directly returns the audio file in the response body, whereas the two-stage request returns a JSON document that may contain the URL of the audio file to be fetched in a second stage.

one-stage TTS (tts1)

tts1 requests should be used when low latency is needed, because the audio signal is streamed as the text is processed (the response is an audio signal sent in chunks over HTTP/1.1 and above). See the streaming usage notes below.

  • Standard voices: https://ws.voxygen.fr/ws/tts1
  • Neural voices: https://ws.voxygen.fr/ntts/tts1

Here are the parameters for one-stage requests.
If optional parameters are omitted, the server assigns a default value that depends on the configuration of the account.

Parameter Required Value
user yes User id. See authentication section
hmac yes hmac computed value. See authentication section
text yes Text to be synthesized, UTF-8 encoded. Mark-up formats ssml and tags are available (max = 2000 characters)
voice no Voice name to use for synthesis. Must be one of the voices available for the account
frequency no An integer value, the sampling frequency in Hertz, between 6000 Hz and 48000 Hz
header no
coding no
The generated audio file can be a WAV, MP3 or OGG file, among others. The format of the audio file depends on the two parameters header and coding. The Content-Type of the response for each combination is:

application/octet-stream:
  • header=headerless
  • coding=lin or A or mu or LIN
audio/x-wav:
  • header=wav-header or header=wav-stream-header (see note)
  • coding=lin or A or mu or LIN
audio/au:
  • header=au-header or header=au-stream-header (see note)
  • coding=lin or A or mu or LIN
audio/mpeg:
  • header=headerless
  • coding=mp3:<bitrate>-<quality> where <bitrate> is one of {16, 32, 64, 96, 128, 160} and <quality> is an integer between 0 and 9 (0 = best quality)
audio/ogg:
  • header=headerless
  • coding=ogg:<quality> where <quality> is a float between 0.0 and 1.0 (1.0 = best quality)
volume no Set current volume to <volume>. Accepted values for <volume> are:
  • silent: -∞dB
  • x-soft: -12dB relative to default
  • soft: -6dB relative to default
  • medium: +0dB relative to default
  • default: initial volume for current voice
  • loud: +6dB relative to default
  • x-loud: +12dB relative to default
  • <number>: absolute value in interval [0;100]
  • +/-<number>dB: relative change in dB
  • +/-<number>%: relative percentage on linear scale
  • +/-<number>: relative change on linear scale
articulation-rate no Set speech rate to <articulation-rate>. Accepted values for <articulation-rate> are:
  • x-slow: 50% of default
  • slow: 75% of default
  • medium: 100% of default
  • default: initial rate for current voice
  • fast: 125% of default
  • x-fast: 150% of default
  • +/-<number>%: relative percentage
  • +/-<number>: relative change
  • <number> or <number>%: multiplier (positive, not zero) of default
pause-rate no Set pause rate to <pause-rate>. The rate is applied to pauses originating from the TTS engine (\break values are not affected). Accepted values for <pause-rate> are:
  • x-slow: 50% of default
  • slow: 75% of default
  • medium: 100% of default
  • default: initial rate for current voice
  • fast: 125% of default
  • x-fast: 150% of default
  • +/-<number>%: relative percentage
  • +/-<number>: relative change
  • <number> or <number>%: multiplier (positive, not zero) of default
timbre no Set timbre coefficient to <timbre>. Accepted values for <timbre> are:
  • +/-<number>%: relative percentage
  • <number>: multiplier (positive, not zero) of initial value
pitch-height no Set pitch baseline to <pitch-height>. More than 99% of pitch values are in the interval [<baseline>;<baseline>+<range>]. <baseline> is the lower bound of the pitch in Hertz (limited to [30;300]), <range> is the degree of additional pitch in Hertz (limited to [0;300]). Accepted values for <pitch-height> are:
  • x-low: 50% of default
  • low: 75% of default
  • medium: 100% of default
  • default: initial baseline/range for current voice
  • high: 133% of default
  • x-high: 200% of default
  • +/-<number>%: relative percentage
  • +/-<number>st: relative change in semitones
  • +/-<number>Hz: relative change in Hertz
  • <number>Hz: absolute value in Hertz
pitch-range no Set pitch range to <pitch-range>. More than 99% of pitch values are in the interval [<baseline>;<baseline>+<range>]. <baseline> is the lower bound of the pitch in Hertz (limited to [30;300]), <range> is the degree of additional pitch in Hertz (limited to [0;300]). Accepted values for <pitch-range> are:
  • x-low: 50% of default
  • low: 75% of default
  • medium: 100% of default
  • default: initial baseline/range for current voice
  • high: 133% of default
  • x-high: 200% of default
  • +/-<number>%: relative percentage
  • +/-<number>st: relative change in semitones
  • +/-<number>Hz: relative change in Hertz
  • <number>Hz: absolute value in Hertz
lexicon no The name of a user lexicon to be used. User lexicons are stored in the user's cloud account.
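
The header and coding combinations listed in the table can be checked on the client before a request is sent. The following Python sketch maps a (header, coding) pair to the Content-Type given in the table and rejects pairs that the table does not list. The function name expected_content_type is hypothetical, and treating unlisted combinations as errors is an assumption about server behaviour.

```python
import re

PCM_CODINGS = {"lin", "A", "mu", "LIN"}
MP3_BITRATES = {16, 32, 64, 96, 128, 160}

def expected_content_type(header, coding):
    """Return the MIME type the table associates with (header, coding)."""
    if coding in PCM_CODINGS:
        if header == "headerless":
            return "application/octet-stream"
        if header in ("wav-header", "wav-stream-header"):
            return "audio/x-wav"
        if header in ("au-header", "au-stream-header"):
            return "audio/au"
        raise ValueError("unknown header: %r" % header)
    # mp3:<bitrate>-<quality>, quality being a single digit 0..9
    mp3 = re.fullmatch(r"mp3:(\d+)-(\d)", coding)
    # ogg:<quality>, quality being a float in [0.0, 1.0]
    ogg = re.fullmatch(r"ogg:(\d+(?:\.\d+)?)", coding)
    if not (mp3 or ogg):
        raise ValueError("unknown coding: %r" % coding)
    if header != "headerless":
        raise ValueError("mp3 and ogg codings are only listed with header=headerless")
    if mp3:
        if int(mp3.group(1)) not in MP3_BITRATES:
            raise ValueError("unsupported MP3 bitrate: %s" % mp3.group(1))
        return "audio/mpeg"
    if not 0.0 <= float(ogg.group(1)) <= 1.0:
        raise ValueError("OGG quality must be between 0.0 and 1.0")
    return "audio/ogg"
```
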
Important notes for streaming usage:
  • You must select an audio format compatible with streaming: header parameter must be equal to headerless, wav-stream-header or au-stream-header.
    For an audio file generated with the wav-header or au-header header, the response is delayed until the full text is processed, so that the true signal length can be set in the header. To avoid this delay, use the header value wav-stream-header or au-stream-header: the speech signal is then sent as the text is processed, but the length of the signal in the header does not reflect the true signal length. It is set to the arbitrary value 0xFFFFFFF.
  • Your HTTP client must be able to return each chunk of signal as soon as it is generated. Otherwise you lose the benefit of streaming.
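
As an illustration of the second note, here is a minimal Python sketch of a chunk reader that hands audio to the caller as soon as it arrives. It works on any file-like response object, such as the one returned by urllib.request.urlopen; the function name iter_audio_chunks is hypothetical, and the in-memory stream at the end only stands in for a real HTTP response.

```python
import io

def iter_audio_chunks(response, chunk_size=4096):
    """Yield audio chunks from a file-like HTTP response as they become available."""
    # read1() returns as soon as some data is available, whereas read(n) may
    # wait until n bytes have arrived, which would defeat streaming
    read = getattr(response, "read1", response.read)
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break  # end of stream
        yield chunk

# Demo with an in-memory stand-in for an HTTP response
demo_chunks = list(iter_audio_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

With a real response, each chunk would be passed to the audio player or written to the output device inside the loop, rather than collected in a list.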

two-stage TTS (tts2)

  • Standard voices: https://ws.voxygen.fr/ws/tts2
  • Neural voices: https://ws.voxygen.fr/ntts/tts2

The result of the request contains only meta-information about the generated audio file (duration, possible synchronization information, location of the file...). When generated, the audio file is accessible at the place indicated in the response for five minutes.

The parameters for the two-stage TTS request are the same as for the one-stage TTS request above.

An additional parameter, event, is available with the two-stage TTS request. It adds synchronization events to the JSON response.

Parameter Required Value
event no An integer value, from 1 to 3, selecting the events returned:
  • 1: Marker
  • 2: Events of level 1 + Edge of a sentence, Silence, Voice change, Word separator, Punctuation
  • 3: Events of level 2 + Syllable, Viseme

The response of a tts2 request is an application/json object that contains information related to the generated audio file, with the following structure:

                                    {
                                      "url":    "https://ws.voxygen.fr/ws/audio/abcdefg",  // where to fetch the audio file
                                      "signal": "https://ws.voxygen.fr/ws/audio/abcdefg",  // This field is deprecated (same value as 'url'). 
                                      "warnings": [],   // may contain strings (warning messages)
                                      "events": []      // may contain a list of events, according to the requested level (see 'event' parameter above)
                                    }  
                                

An event in the JSON response has the following structure:

                                    [
                                      float,      // timestamp from the beginning of the audio signal, in milliseconds
                                      keyword,    // identifies the type of event
                                      value       // depends on the type of event
                                    ]
                                

Here are keywords and values (type for values is always string):

Type of event Keyword Value
Marker MARK The name of the marker
Edge of a sentence PHR None (empty string)
Word WORD Text of a word
Silence SIL None (empty string)
Voice change VOICE Voice name
Word separator SEPR Separator string
Punctuation PUNCT Punctuation symbol
Syllable SYL None (empty string)
Viseme VISEME Viseme type
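
The response and event structures above can be turned into typed values with a few lines of Python. This is only a sketch: the names Event and parse_tts2_response are hypothetical, and falling back to the deprecated signal field when url is absent is an assumption.

```python
import json
from collections import namedtuple

# One synchronization event: [timestamp in ms, keyword, value]
Event = namedtuple("Event", "time_ms keyword value")

def parse_tts2_response(body):
    """Return (audio_url, warnings, events) from a tts2 JSON response body."""
    doc = json.loads(body)
    # 'signal' is deprecated (same value as 'url'); use it only as a fallback
    audio_url = doc.get("url") or doc.get("signal")
    warnings = doc.get("warnings", [])
    events = [Event(float(t), keyword, value)
              for t, keyword, value in doc.get("events", [])]
    return audio_url, warnings, events
```

The audio file is then fetched from audio_url in a second request, within the five minutes during which it remains available.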

Account information API

  • Standard voices: https://ws.voxygen.fr/ws/info
  • Neural voices: https://ws.voxygen.fr/ntts/info

This request gives details about the account.

Parameter Required Value
user yes User id. See authentication section
hmac yes hmac computed value. See authentication section

The response is an application/json object with the following structure: (with example values)

                                    {
                                        // List of voices
                                        "voices": [
                                            {   // each voice is described in an object
                                                "name": "Jenny",         // the voice name, which is used as a value for parameter voice
                                                "display_name": "Jenny", // voice name which can be used for display
                                                "language": "en-US",     // language identifier as defined by IETF BCP 47
                                                "gender": "female",      // "male" or "female"
                                                "version": "3.2.0",      // the voice version
                                                "frequency": "24000",    // voice sampling frequency in Hertz
                                            },
                                            { 
                                                "name": "Judith_NTTS",
                                                "display_name": "Judith (NTTS)",
                                                "language": "en-GB",
                                                "gender": "female",
                                                "version": "5.0.0",
                                                "frequency": "24000",
                                            },
                                            {
                                                "name": "Arnaud_enjoue",
                                                "display_name": "Arnaud enjoué",
                                                "language": "fr-FR",
                                                "gender": "female",
                                                "version": "3.2.0",
                                                "frequency": "24000",
                                            }
                                        ],

                                        // Constrained parameters (if any) for TTS requests and restricted values
                                        "parameters": {
                                            // text parameter is never listed here as its content is not restricted
                                            // voice parameter is never listed here. Its value should be one of the names listed in "voices" (see above)
                                            "parsing": ["ssml", "noparsing"],
                                            "frequency": [8000, 16000],
                                            "header": ["wav-header", "headerless"],
                                            "coding": ["lin", "A"],
                                        },

                                        // Default values
                                        "default": {
                                            "voice": "Jenny",
                                            "parsing": "ssml",
                                            "frequency": 24000,
                                            "header": "wav-header",
                                            "coding": "A"
                                        },
                                        
                                        // user lexicons
                                        "lexicons": [
                                            "french.bin",
                                            "english.bin"
                                        ]
                                    }
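
One use of this response is to validate a TTS request locally before sending it. The Python sketch below assumes the response has already been parsed into a dictionary (for example with json.loads); the function names check_against_account and effective_parameters are hypothetical. Note that the comparison is type-sensitive, so a frequency must be given as an integer to match the integers listed under "parameters".

```python
def check_against_account(info, requested):
    """Return a list of problems found in 'requested', given a parsed /info response."""
    problems = []
    # The voice must be one of the names listed under "voices"
    voice_names = {voice["name"] for voice in info.get("voices", [])}
    if "voice" in requested and requested["voice"] not in voice_names:
        problems.append("voice %r is not available for this account" % requested["voice"])
    # Constrained parameters must take one of their restricted values
    for name, allowed in info.get("parameters", {}).items():
        if name in requested and requested[name] not in allowed:
            problems.append("%s=%r is not one of %r" % (name, requested[name], allowed))
    return problems

def effective_parameters(info, requested):
    """Show the values that apply once omitted parameters take the account defaults."""
    merged = dict(info.get("default", {}))
    merged.update(requested)
    return merged
```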
                                

Reporting API

You can monitor your data consumption at the URL https://ws.voxygen.fr/report?user=XXX&password=YYY, where XXX is your login and YYY is your monitoring password (different from the password used to compute the hmac!). Two modes are available, selected with the type parameter:

Parameter Required Value
user yes User id. See authentication section
password yes password provided by Voxygen
type no
  • graph: interactive reporting dashboard
  • json: raw data reporting
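
For completeness, a reporting URL can be assembled as in the following Python sketch. The function name report_url is hypothetical; the parameter names and the two type values come from the table above.

```python
from urllib.parse import urlencode

def report_url(user, password, report_type=None):
    """Build the reporting URL; report_type is None, "graph" or "json"."""
    params = {"user": user, "password": password}
    if report_type is not None:
        if report_type not in ("graph", "json"):
            raise ValueError("type must be 'graph' or 'json'")
        params["type"] = report_type
    return "https://ws.voxygen.fr/report?" + urlencode(params)
```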

Authentication

Authentication is based on an hmac value computed for, and inserted in, each request.
The hmac computation involves a hash function and a secret key. The hash function is MD5, so the algorithm is HMAC-MD5. All major programming languages (at least Java, PHP, Python, C# and JavaScript) already have a library that implements this algorithm.
The secret key is the password that comes with the user.
The procedure is as follows:

  1. sort all the request parameters in alphabetical order of parameter name
  2. initialize the hmac with the password as the secret key
  3. update the hmac with "parameter=value" (or "parameter=" if the value is empty) for each parameter in the request (except hmac itself)
  4. append the parameter "hmac=computed_value" to the request

Here is a small example: user=demo, password=demo_password, text=Hello world!
https://ws.voxygen.fr/ws/tts1?user=demo&text=Hello+world%21&hmac=8a38fdf476212b2ce4f8a2dd14bb0d99
The demo user is inactive, so the response is always 'Account expired'.
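
It can help to see exactly which bytes are signed. Updating the hmac with each "parameter=value" pair in turn is equivalent to signing their plain concatenation, with no separator between pairs. The Python sketch below makes that string explicit; the function names signing_string and compute_hmac are hypothetical. It signs the raw values, before any URL encoding, which is an assumption based on the source examples in this document.

```python
import hashlib
import hmac

def signing_string(params):
    """Concatenate 'name=value' pairs, sorted by name, without any separator."""
    return "".join("%s=%s" % (name, value)
                   for name, value in sorted(params.items())
                   if name != "hmac")  # the hmac parameter itself is never signed

def compute_hmac(password, params):
    """Return the hex HMAC-MD5 of the signing string, keyed with the account password."""
    message = signing_string(params).encode("utf-8")
    return hmac.new(password.encode("utf-8"), message, hashlib.md5).hexdigest()
```

For the example above, the signed string is "text=Hello world!user=demo": the space and the exclamation mark are not percent-encoded at this stage.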

Source examples

Here are source examples that demonstrate how to compute the hmac value in various languages.

PHP

                                    <?php
                                    $password="provided_by_voxygen";
                                    $arguments = array(
                                        'user'  => "provided_by_voxygen",
                                        'voice' => "Jenny",
                                        'text'  => 'Hello world!'
                                    );
                                    
                                    $ctx=hash_init('md5',HASH_HMAC,$password);
                                    ksort($arguments);
                                    $url="https://ws.voxygen.fr/ws/tts1?";
                                    foreach ($arguments as $name=>$val)
                                    {
                                        hash_update($ctx, $name."=".$val);
                                        $url.=$name."=".urlencode($val)."&";
                                    }
                                    $hmac=hash_final($ctx);
                                    $url.="hmac=".$hmac;
                                    return $url;
                                    ?>
                                

Python 3

                                    #!/usr/bin/env python3
                                    #-*- coding: utf-8 -*-
                                    import urllib.parse
                                    import hmac
                                    import hashlib

                                    def validatehmac(password,arguments):
                                        # 'arguments' is a dict of elements {"user":"provided_by_voxygen","voice":"Jenny","text":"Hello world!",...}
                                        url = "https://ws.voxygen.fr/ws/tts1?"
                                        computedHMAC=hmac.new(password.encode(),digestmod=hashlib.md5) # password provided by voxygen
                                        for parameter,value in sorted(arguments.items()): # the list is sorted by parameter
                                            computedHMAC.update(("%s=%s" % (parameter,value)).encode())
                                        url = url + urllib.parse.urlencode(arguments) + "&hmac="+computedHMAC.hexdigest()
                                        return url
                                

Objective-C

                                    #import <Foundation/Foundation.h>
                                    #import <CommonCrypto/CommonHMAC.h>
                                    
                                    NSString* getWebServiceURL();
                                    
                                    int main(int argc, const char * argv[])
                                    {
                                        @autoreleasepool
                                        {
                                            NSLog(@"%@", getWebServiceURL());
                                        }
                                        
                                        return 0;
                                    }
                                    
                                    NSString* getWebServiceURL()
                                    {
                                        const char password[] = "provided_by_voxygen"; // Password of your account
                                        
                                        CCHmacContext ctx;
                                        unsigned char hmac[CC_MD5_DIGEST_LENGTH];
                                        char hexHmac[2 * CC_MD5_DIGEST_LENGTH + 1];
                                        
                                        NSMutableString *url = [[NSMutableString alloc] initWithString:@"https://ws.voxygen.fr/ws/tts1?"];
                                        NSMutableDictionary *param = [[NSMutableDictionary alloc] init];
                                        NSArray *sortedKey;
                                        
                                        NSString *tmpParam;
                                        
                                        CCHmacInit(&ctx, kCCHmacAlgMD5, password, strlen(password));
                                        
                                        [param setObject:@"provided_by_voxygen" forKey:@"user"];
                                        [param setObject:@"Hello world" forKey:@"text"];
                                        [param setObject:@"Jenny" forKey:@"voice"];
                                        [param setObject:@"24000" forKey:@"frequency"];
                                        [param setObject:@"headerless" forKey:@"header"];
                                        [param setObject:@"mp3:64-3" forKey:@"coding"];
                                        
                                        
                                        sortedKey = [[param allKeys] sortedArrayUsingSelector:@selector(compare:)]; // Get the dictionary's keys, sorted
                                        
                                        for (NSString *key in sortedKey)
                                        {
                                            tmpParam = [NSString stringWithFormat:@"%@=%@", key, [param objectForKey:key]]; // Get "key=value" string
                                            CCHmacUpdate(&ctx, [tmpParam UTF8String], strlen([tmpParam UTF8String])); // Update hmac with the string
                                            
                                            tmpParam = CFBridgingRelease(CFURLCreateStringByAddingPercentEscapes(NULL, 
                                                                            (CFStringRef)tmpParam,
                                                                            NULL, 
                                                                            (CFStringRef)@";?@$+{}<>,éçèàù%",
                                                                            CFStringConvertNSStringEncodingToEncoding(NSUTF8StringEncoding)
                                                                        )); // Encode the url
                                            
                                            [url appendFormat:@"%@&", tmpParam]; // Append the percent-encoded string to the url
                                        }
                                        
                                        CCHmacFinal(&ctx, hmac); // Finalize HMAC
                                        
                                        char *p = hexHmac;
                                        
                                        for (int i = 0; i < CC_MD5_DIGEST_LENGTH; i++, p += 2) // Convert HMAC into string
                                            snprintf(p, 3, "%02x", hmac[i]);
                                        
                                        [url appendFormat:@"hmac=%s", hexHmac]; // Append the HMAC string
                                        
                                        return url;
                                        
                                    }
                                

C#

                                    using System;
                                    using System.Collections.Generic;
                                    using System.Linq;
                                    using System.Text;
                                    using System.Web;
                                    
                                    namespace ConsoleApplication1
                                    {
                                        class Program
                                        {
                                            public static String ByteToString(byte[] ba)
                                            {
                                                StringBuilder hex = new StringBuilder(ba.Length * 2);
                                                foreach (byte b in ba)
                                                    hex.AppendFormat("{0:x2}", b);
                                                return hex.ToString();
                                            }
                                            
                                            static void Main(string[] args)
                                            {
                                                // SortedList keeps the parameters ordered by name, as required for the hmac
                                                SortedList<string, string> paramsOrdered = new SortedList<string, string>();
                                                paramsOrdered.Add("user", "provided_by_voxygen");
                                                paramsOrdered.Add("voice", "Jenny");
                                                paramsOrdered.Add("coding", "mp3:160-0");
                                                paramsOrdered.Add("parsing", "tags");
                                                paramsOrdered.Add("frequency", "48000");
                                                paramsOrdered.Add("header", "headerless");
                                                
                                                byte[] bytes = Encoding.UTF8.GetBytes("Hello World!");
                                                Console.Write(BitConverter.ToString(bytes));
                                                paramsOrdered.Add("text", Encoding.UTF8.GetString(bytes));
                                                
                                                // Compute the hash value
                                                System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
                                                byte[] keyByte = encoding.GetBytes("provided_by_voxygen");
                                                
                                                System.Security.Cryptography.HMACMD5 hmacMD5 = new System.Security.Cryptography.HMACMD5(keyByte);
                                                hmacMD5.Initialize();
                                                
                                                string concatParams = "";
                                                
                                                for (int i = 0; i < paramsOrdered.Count; i++)
                                                    concatParams += paramsOrdered.ElementAt(i).Key + "=" + paramsOrdered.ElementAt(i).Value;
                                                
                                                byte[] hashByte = hmacMD5.ComputeHash(encoding.GetBytes(concatParams));
                                                
                                                // Build the invocation URL
                                                string webRequestString = @"https://ws.voxygen.fr/ws/tts1?";
                                                for (int i = 0; i < paramsOrdered.Count; i++)
                                                {
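                                                    if (i > 0) webRequestString += "&";
                                                    // Percent-encode each value so that spaces and reserved characters survive in the URL
                                                    webRequestString += paramsOrdered.ElementAt(i).Key + "=" + Uri.EscapeDataString(paramsOrdered.ElementAt(i).Value);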
                                                }
                                                
                                                webRequestString += "&hmac=" + ByteToString(hashByte);
                                                Console.Write(webRequestString);
                                                Console.ReadLine();
                                            }
                                        }
                                    }
                                

Android

                                    package HmacAndroid;
                                    
                                    import java.io.UnsupportedEncodingException;
                                    import java.math.BigInteger;
                                    import java.net.URLEncoder;
                                    import java.security.InvalidKeyException;
                                    import java.security.NoSuchAlgorithmException;
                                    import java.util.TreeMap;
                                    import java.util.logging.Level;
                                    import java.util.logging.Logger;
                                    import javax.crypto.Mac;
                                    import javax.crypto.spec.SecretKeySpec;
                                    
                                    public class HmacAndroid {
                                        
                                        private static final String user = "provided_by_voxygen";
                                        private static final String voice = "Jenny";
                                        private static final String texte = "Hello world!";
                                        private static final String passwd = "provided_by_voxygen";
                                        
                                        public static void main(String[] args) {
                                            try {
                                                String url_base = "https://ws.voxygen.fr/ws/tts1";
                                                // TreeMap keeps the parameters sorted by name
                                                TreeMap<String, String> am = new TreeMap<>();
                                                am.put("user", user);
                                                am.put("voice", voice);
                                                am.put("text", texte);
                                                am.put("header", "wav-header");
                                                am.put("coding", "lin");
                                                am.put("frequency", "16000");
                                                
                                                // compute hmac
                                                String hmac = calculhmac(am);
                                                // encode text
                                                String textEncoded = URLEncoder.encode(texte, "utf-8");
                                                // update TreeMap
                                                am.put("text", textEncoded);
                                                // create request, concat hmac at the end
                                                String request = treeMapToUrl(url_base, am);
                                                if (!"".equals(hmac)) {
                                                    request = request + "&hmac=" + hmac;
                                                }
                                                
                                            } catch (UnsupportedEncodingException ex) {
                                                Logger.getLogger(HmacAndroid.class.getName()).log(Level.SEVERE, null, ex);
                                            }
                                        }
                                        
                                        private static String calculhmac(TreeMap<String, String> am) {
                                            String result = "";
                                            try {
                                                Mac m = Mac.getInstance("HmacMD5");
                                                SecretKeySpec k = new SecretKeySpec((passwd).getBytes("UTF-8"), "HmacMD5");
                                                m.init(k);
                                                
                                                for (String key : am.keySet()) {
                                                    String value = am.get(key);
                                                    m.update((key + "=" + value).getBytes("UTF-8"));
                                                }
                                                
                                                byte s[] = m.doFinal();
                                                /* zero-pad to 32 hex digits: BigInteger.toString(16) drops leading zeros */
                                                result = String.format("%032x", new BigInteger(1, s));
                                            } catch (NoSuchAlgorithmException | InvalidKeyException | IllegalStateException e) {
                                                e.printStackTrace();
                                            } catch (UnsupportedEncodingException ex) {
                                                Logger.getLogger(HmacAndroid.class.getName()).log(Level.SEVERE, null, ex);
                                            }
                                            return result;
                                        }
                                        
                                        private static String treeMapToUrl(String url, TreeMap<String, String> am) {
                                            String request = url + "?";
                                            for (String key : am.keySet()) {
                                                String value = am.get(key);
                                                request += key + "=" + value + "&";
                                            }
                                            request = request.substring(0, request.length() - 1);
                                            return request;
                                        }
                                    }