NTTS (Neural Text-to-Speech) voices are a type of synthetic voice generated using machine learning algorithms, specifically neural networks. These voices are designed to sound more natural and human-like than standard TTS (text-to-speech) voices, which are typically based on concatenating pre-recorded snippets of speech. NTTS voices are created by training a neural network on a large dataset of recorded speech, and are able to generate speech in a more continuous and natural-sounding manner.
The Voxygen Cloud API is a "REST-like" API. This means that each request, made from any client, contains all of the information necessary to service the request. The server does not establish sessions with clients. Authentication is nevertheless required for each request.
Three resources are available:
The service is invoked with an HTTPS GET or POST request carrying a set of parameters, some of which are required.
The MIME type of encoded data is application/x-www-form-urlencoded.
When sent in an HTTPS GET request, data should be included in the query component of the request URI.
When sent in an HTTPS POST request, the data should be placed in the body of the message.
In the latter case the Content-Type header is not mandatory but, if it is present, it must be equal to application/x-www-form-urlencoded.
Parameter names and values must be UTF-8 encoded.
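The two ways of sending the data can be sketched in Python as follows. This is only an illustration: the helper names `build_get_url` and `build_post_request` are hypothetical, the endpoint is the one used in the source examples on this page, and the parameter values are placeholders (a real request also needs the hmac parameter).

```python
import urllib.request
from urllib.parse import urlencode

ENDPOINT = "https://ws.voxygen.fr/ws/tts1"  # endpoint used in the examples on this page


def build_get_url(endpoint, params):
    """Place the form-encoded parameters in the query component of the URI."""
    # urlencode percent-encodes names and values; UTF-8 is requested explicitly.
    return endpoint + "?" + urlencode(params, encoding="utf-8")


def build_post_request(endpoint, params):
    """Place the same form-encoded parameters in the body of a POST request."""
    body = urlencode(params, encoding="utf-8").encode("ascii")
    # The header is optional for the API, but if present it must have this value.
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Placeholder values, for illustration only.
params = {"user": "demo", "text": "Hello world!"}
print(build_get_url(ENDPOINT, params))
request = build_post_request(ENDPOINT, params)
print(request.get_method(), request.data)
```

Nothing is sent over the network here; the `Request` object would be passed to `urllib.request.urlopen` to perform the call.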
List of possible HTTP status codes:
Marking up the text to vocalize makes it possible to dynamically influence the behaviour of the speech synthesis system: changing the voice, regulating the flow, changing the volume, and so on. Two mark-up formats are available:
These requests convert the value of the
Use tts1 requests when low latency is needed, because the audio signal is streamed as the text is processed (the response is an audio signal sent in chunks over HTTP/1.1 and above). See usage notes below.
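A client can take advantage of the streaming by consuming the response piece by piece instead of waiting for the whole signal. The sketch below assumes a file-like response object, such as the one returned by Python's `urllib.request.urlopen`, which decodes chunked transfer encoding transparently. The helper name `save_stream` is hypothetical, and an in-memory buffer stands in for the real response so that the example runs offline.

```python
import io


def save_stream(response, out_file, chunk_size=4096):
    """Copy an audio stream to out_file piece by piece; return the bytes written."""
    total = 0
    while True:
        chunk = response.read(chunk_size)  # returns b"" once the stream is exhausted
        if not chunk:
            break
        out_file.write(chunk)  # a player could start consuming the audio here
        total += len(chunk)
    return total


# Stand-in for a real tts1 response (assumption: any object with a read() method).
fake_response = io.BytesIO(b"\x00\x01" * 5000)
sink = io.BytesIO()
written = save_stream(fake_response, sink)
print(written, "bytes copied")
```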
Here are the parameters for one-stage requests.
If optional parameters are omitted, the server assigns a default value that depends on the configuration of the account.
| Parameter | Required | Value |
|---|---|---|
| | yes | User id. See authentication section. |
| | yes | hmac computed value. See authentication section. |
| | yes | Text to be synthesized, UTF-8 encoded. The mark-up formats ssml and tags are available (max = 2000 characters). |
| | no | Voice name to use for synthesis. Must be one of the voices available for the account. |
| | no | An integer value: the sampling frequency in Hertz, between 6000 Hz and 48000 Hz. |
| | no | The generated audio file can be a WAV, MP3 or OGG file. The format of the audio file depends on these two parameters application/octet-stream: |
| | no | |
| | no | Set current volume to <volume>. Accepted values for <volume> are: |
| | no | Set speech rate to <articulation-rate>. Accepted values for <articulation-rate> are: |
| | no | Set pauses rate to <pause-rate>. The rate is applied to pauses originating from the TTS engine (\break values are not affected). Accepted values for <pause-rate> are: |
| | no | Set timbre coefficient to <timbre>. Accepted values for <timbre> are: |
| | no | Set pitch baseline to <pitch-height>. More than 99% of pitch values are in the interval [<baseline>;<baseline>+<range>]. <baseline> is the lower bound of the pitch in Hertz (limited to [30;300]), <range> is the degree of additional pitch in Hertz (limited to [0;300]). Accepted values for <pitch-height> are: |
| | no | Set pitch range to <pitch-range>. More than 99% of pitch values are in the interval [<baseline>;<baseline>+<range>]. <baseline> is the lower bound of the pitch in Hertz (limited to [30;300]), <range> is the degree of additional pitch in Hertz (limited to [0;300]). Accepted values for <pitch-range> are: |
| | no | The name of a user lexicon to be used. The user lexicons are stored in the user's cloud account. |
The result of the request contains only meta-information about the generated audio file (duration, possible synchronization information, location of the file...). Once generated, the audio file remains accessible for five minutes at the location indicated in the response.
The parameters for the two-stage TTS request are the same as for the one-stage TTS request above.
An additional parameter 'event' is available with the two-stage TTS request; it adds synchronization events to the JSON response.
| Parameter | Required | Value |
|---|---|---|
| event | no | An integer value, from 1 to 3: |
The response of a tts2 request is an application/json object that contains information related to the generated audio file, with the following structure:

    {
      "url": "https://ws.voxygen.fr/ws/audio/abcdefg",    // where to fetch the audio file
      "signal": "https://ws.voxygen.fr/ws/audio/abcdefg", // This field is deprecated (same value as 'url').
      "warnings": [],                                     // may contain strings (warning messages)
      "events": []                                        // may contain a list of events, according to the requested level (see 'event' parameter above)
    }
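As an illustration, a client could extract the useful fields of such a response as sketched below. The function name `parse_tts2_response` is hypothetical, the sample body reuses the example values above, and falling back to the deprecated 'signal' field when 'url' is absent is an assumption of this sketch rather than documented behaviour.

```python
import json


def parse_tts2_response(body):
    """Return (audio_url, warnings, events) from the JSON body of a tts2 response."""
    data = json.loads(body)
    # Prefer 'url'; 'signal' is deprecated and carries the same value.
    audio_url = data.get("url") or data.get("signal")
    return audio_url, data.get("warnings", []), data.get("events", [])


# Sample body built from the example values shown above (comments removed).
sample = """{
  "url": "https://ws.voxygen.fr/ws/audio/abcdefg",
  "signal": "https://ws.voxygen.fr/ws/audio/abcdefg",
  "warnings": [],
  "events": []
}"""

audio_url, warnings, events = parse_tts2_response(sample)
print(audio_url)
for message in warnings:
    print("warning:", message)
```

The audio file would then be fetched from `audio_url` within the availability window.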
An event in the JSON response has the following structure:

    [
      float,    // timestamp from the beginning of the audio signal, in milliseconds
      keyword,  // identifies the type of event
      value     // depends on the type of event
    ]
Here are keywords and values (type for values is always string):
Type of event | Keyword | Value |
---|---|---|
Marker | MARK | The name of the marker |
Edge of a sentence | PHR | None (empty string) |
Word | WORD | Text of a word |
Silence | SIL | None (empty string) |
Voice change | VOICE | Voice name |
Word separator | SEPR | Separator string |
Punctuation | PUNCT | Punctuation symbol |
Syllable | SYL | None (empty string) |
Viseme | VISEME | Viseme type |
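The event list lends itself to simple post-processing, for example to highlight words in sync with the audio. The sketch below is illustrative only: the `Event` tuple and both helper names are hypothetical, and the sample events are invented values that merely follow the [timestamp, keyword, value] structure described above.

```python
from collections import namedtuple

# One event: timestamp in milliseconds, keyword, and a string value.
Event = namedtuple("Event", ["time_ms", "keyword", "value"])


def decode_events(raw_events):
    """Turn the raw [timestamp, keyword, value] triples into Event tuples."""
    return [Event(float(t), keyword, value) for t, keyword, value in raw_events]


def words_with_times(events):
    """Keep only WORD events, as (timestamp, word) pairs."""
    return [(e.time_ms, e.value) for e in events if e.keyword == "WORD"]


# Invented sample data, for illustration only.
raw = [
    [0.0, "PHR", ""],
    [12.5, "WORD", "Hello"],
    [410.0, "WORD", "world"],
    [800.0, "PUNCT", "!"],
]

for time_ms, word in words_with_times(decode_events(raw)):
    print(f"{time_ms:8.1f} ms  {word}")
```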
This request gives details about the account.
| Parameter | Required | Value |
|---|---|---|
| | yes | User id. See authentication section. |
| | yes | hmac computed value. See authentication section. |
The response is an application/json object with the following structure (with example values):

    {
      // List of voices
      "voices": [
        { // each voice is described in an object
          "name": "Jenny",          // the voice name, which is used as a value for parameter voice
          "display_name": "Jenny",  // voice name which can be used for display
          "language": "en-US",      // language identifier as defined by IETF BCP 47
          "gender": "female",       // "male" or "female"
          "version": "3.2.0",       // the voice version
          "frequency": "24000"      // voice sampling frequency in Hertz
        },
        {
          "name": "Judith_NTTS",
          "display_name": "Judith (NTTS)",
          "language": "en-GB",
          "gender": "female",
          "version": "5.0.0",
          "frequency": "24000"
        },
        {
          "name": "Arnaud_enjoue",
          "display_name": "Arnaud enjoué",
          "language": "fr-FR",
          "gender": "female",
          "version": "3.2.0",
          "frequency": "24000"
        }
      ],
      // Constrained parameters (if any) for TTS requests and restricted values
      "parameters": {
        // text parameter is never listed here as its content is not restricted
        // voice parameter is never listed here. Its value should be one of the names listed in "voices" (see above)
        "parsing": ["ssml", "noparsing"],
        "frequency": [8000, 16000],
        "header": ["wav-header", "headerless"],
        "coding": ["lin", "A"]
      },
      // Default values
      "default": {
        "voice": "Jenny",
        "parsing": "ssml",
        "frequency": 24000,
        "header": "wav-header",
        "coding": "A"
      },
      // user lexicons
      "lexicons": [
        "french.bin",
        "english.bin"
      ]
    }
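A client might use this response to pick a voice and to check a parameter value before issuing a TTS request. The following sketch works on a trimmed copy of the example above. Both helper names are hypothetical, and reading the "parameters" entries as lists of allowed values is this sketch's interpretation of "restricted values".

```python
def voices_for_language(account, language):
    """Names of the account's voices for a given BCP 47 language tag."""
    return [v["name"] for v in account.get("voices", []) if v.get("language") == language]


def resolve_parameter(account, name, requested=None):
    """Return the requested value if allowed, or the account default if none is given."""
    if requested is None:
        return account.get("default", {}).get(name)
    allowed = account.get("parameters", {}).get(name)
    # A parameter missing from "parameters" is treated as unconstrained.
    if allowed is not None and requested not in allowed:
        raise ValueError(f"{name}={requested!r} is not one of {allowed}")
    return requested


# Trimmed copy of the example response above.
account = {
    "voices": [
        {"name": "Jenny", "language": "en-US"},
        {"name": "Judith_NTTS", "language": "en-GB"},
        {"name": "Arnaud_enjoue", "language": "fr-FR"},
    ],
    "parameters": {"frequency": [8000, 16000], "coding": ["lin", "A"]},
    "default": {"voice": "Jenny", "coding": "A"},
}

print(voices_for_language(account, "en-US"))
print(resolve_parameter(account, "coding"))
print(resolve_parameter(account, "frequency", 16000))
```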
You can monitor your data consumption at url
| Parameter | Required | Value |
|---|---|---|
| | yes | User id. See authentication section. |
| | yes | Password provided by Voxygen. |
| type | no | |
Authentication is based on an hmac value that is computed for, and inserted in, each request.
Computing the hmac involves a hash function and a secret key.
The hash function is MD5; the algorithm is called HMAC-MD5. All major programming languages (at least Java, PHP, Python, C# and JavaScript) already have a library that implements this algorithm.
The secret key is the password that comes with the
The procedure is as follows:
Here is a small example: user=demo, password=demo_password, text=Hello world!
This user is inactive, so the response is always 'Account expired'.
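For these example values, the computation can be sketched in Python as follows. The steps are inferred from the source examples on this page rather than from a formal specification: the parameters are sorted by name, each "name=value" pair is fed to HMAC-MD5 keyed with the password, and the hexadecimal digest is sent as the hmac parameter. The function name `compute_hmac` is hypothetical.

```python
import hashlib
import hmac


def compute_hmac(password, arguments):
    """HMAC-MD5 over the 'name=value' pairs, taken in order of parameter name."""
    mac = hmac.new(password.encode("utf-8"), digestmod=hashlib.md5)
    for name, value in sorted(arguments.items()):
        # The raw (not URL-encoded) value is hashed.
        mac.update(f"{name}={value}".encode("utf-8"))
    return mac.hexdigest()


# Values from the small example above.
signature = compute_hmac("demo_password", {"user": "demo", "text": "Hello world!"})
print(signature)  # 32 hexadecimal characters
```

Feeding the pairs one by one is equivalent to hashing their concatenation, here the string `text=Hello world!user=demo`.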
Here are source examples that demonstrate how to compute the hmac value in various languages.
    <?php
    $password = "provided_by_voxygen"; // password of your account, used as the HMAC key
    $arguments = array(
        'user'  => "provided_by_voxygen",
        'voice' => "Jenny",
        'text'  => 'Hello world!'
    );
    $ctx = hash_init('md5', HASH_HMAC, $password);
    ksort($arguments); // parameters are processed sorted by name
    $url = "https://ws.voxygen.fr/ws/tts1?";
    foreach ($arguments as $name => $val) {
        hash_update($ctx, $name . "=" . $val);       // the hmac is computed on the raw value
        $url .= $name . "=" . urlencode($val) . "&"; // the URL carries the encoded value
    }
    $hmac = hash_final($ctx);
    $url .= "hmac=" . $hmac;
    return $url;
    ?>
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import urllib.parse
    import hmac
    import hashlib

    def validatehmac(password, arguments):
        # 'arguments' is a dict of elements {"user":"provided_by_voxygen","voice":"Jenny","text":"Hello world!",...}
        url = "https://ws.voxygen.fr/ws/tts1?"
        computedHMAC = hmac.new(password.encode(), digestmod=hashlib.md5)  # password provided by voxygen
        for parameter, value in sorted(arguments.items()):  # the list is sorted by parameter
            computedHMAC.update(("%s=%s" % (parameter, value)).encode())
        url = url + urllib.parse.urlencode(arguments) + "&hmac=" + computedHMAC.hexdigest()
        return url
    #import <Foundation/Foundation.h>
    #import <CommonCrypto/CommonHMAC.h>

    NSString* getWebServiceURL();

    int main(int argc, const char * argv[]) {
        @autoreleasepool {
            NSLog(@"%@", getWebServiceURL());
        }
        return 0;
    }

    NSString* getWebServiceURL() {
        const char password[] = "provided_by_voxygen"; // Password of your account
        CCHmacContext ctx;
        unsigned char hmac[CC_MD5_DIGEST_LENGTH];
        char hexHmac[2 * CC_MD5_DIGEST_LENGTH + 1];
        NSMutableString *url = [[NSMutableString alloc] initWithString:@"https://ws.voxygen.fr/ws/tts1?"];
        NSMutableDictionary *param = [[NSMutableDictionary alloc] init];
        NSArray *sortedKey;
        NSString *tmpParam;

        CCHmacInit(&ctx, kCCHmacAlgMD5, password, strlen(password));

        [param setObject:@"provided_by_voxygen" forKey:@"user"];
        [param setObject:@"Hello world" forKey:@"text"];
        [param setObject:@"Jenny" forKey:@"voice"];
        [param setObject:@"24000" forKey:@"frequency"];
        [param setObject:@"headerless" forKey:@"header"];
        [param setObject:@"mp3:64-3" forKey:@"coding"];

        sortedKey = [[param allKeys] sortedArrayUsingSelector:@selector(compare:)]; // Get the dictionary's keys, sorted

        for (NSString *key in sortedKey) {
            tmpParam = [NSString stringWithFormat:@"%@=%@", key, [param objectForKey:key]]; // Get "key=value" string
            CCHmacUpdate(&ctx, [tmpParam UTF8String], strlen([tmpParam UTF8String])); // Update hmac with the string
            tmpParam = CFBridgingRelease(CFURLCreateStringByAddingPercentEscapes(NULL,
                (CFStringRef)tmpParam,
                NULL,
                (CFStringRef)@";?@$+{}<>,éçèàù%",
                CFStringConvertNSStringEncodingToEncoding(NSUTF8StringEncoding))); // Encode the url
            [url appendFormat:@"%@&", tmpParam]; // Append the percent-encoded string to the url
        }

        CCHmacFinal(&ctx, hmac); // Finalize HMAC

        char *p = hexHmac;
        for (int i = 0; i < CC_MD5_DIGEST_LENGTH; i++, p += 2) // Convert HMAC into string
            snprintf(p, 3, "%02x", hmac[i]);

        [url appendFormat:@"hmac=%s", hexHmac]; // Append the HMAC string
        return url;
    }
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace ConsoleApplication1
    {
        class Program
        {
            // Convert a byte array to a lowercase hexadecimal string
            public static String ByteToString(byte[] ba)
            {
                StringBuilder hex = new StringBuilder(ba.Length * 2);
                foreach (byte b in ba)
                    hex.AppendFormat("{0:x2}", b);
                return hex.ToString();
            }

            static void Main(string[] args)
            {
                // Parameters are kept sorted by name (ordinal comparison)
                SortedList<string, string> paramsOrdered = new SortedList<string, string>(StringComparer.Ordinal);
                paramsOrdered.Add("user", "provided_by_voxygen");
                paramsOrdered.Add("voice", "Jenny");
                paramsOrdered.Add("coding", "mp3:160-0");
                paramsOrdered.Add("parsing", "tags");
                paramsOrdered.Add("frequency", "48000");
                paramsOrdered.Add("header", "headerless");
                byte[] bytes = Encoding.UTF8.GetBytes("Hello World!");
                Console.Write(BitConverter.ToString(bytes));
                paramsOrdered.Add("text", Encoding.UTF8.GetString(bytes));

                // Compute the hash value
                System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
                byte[] keyByte = encoding.GetBytes("provided_by_voxygen");
                System.Security.Cryptography.HMACMD5 hmacMD5 = new System.Security.Cryptography.HMACMD5(keyByte);
                hmacMD5.Initialize();
                string concatParams = "";
                for (int i = 0; i < paramsOrdered.Count; i++)
                    concatParams += paramsOrdered.ElementAt(i).Key + "=" + paramsOrdered.ElementAt(i).Value;
                byte[] hashByte = hmacMD5.ComputeHash(encoding.GetBytes(concatParams));

                // Build the invocation URL (values are percent-encoded here, not in the hmac input)
                string webRequestString = @"https://ws.voxygen.fr/ws/tts1?";
                for (int i = 0; i < paramsOrdered.Count; i++)
                {
                    if (i > 0)
                        webRequestString += "&";
                    webRequestString += paramsOrdered.ElementAt(i).Key + "=" + Uri.EscapeDataString(paramsOrdered.ElementAt(i).Value);
                }
                webRequestString += "&hmac=" + ByteToString(hashByte);
                Console.Write(webRequestString);
                Console.ReadLine();
            }
        }
    }
    package HmacAndroid;

    import java.io.UnsupportedEncodingException;
    import java.math.BigInteger;
    import java.net.URLEncoder;
    import java.security.InvalidKeyException;
    import java.security.NoSuchAlgorithmException;
    import java.util.TreeMap;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class HmacAndroid {

        private static final String user = "provided_by_voxygen";
        private static final String voice = "Jenny";
        private static final String texte = "Hello world!";
        private static final String passwd = "provided_by_voxygen";

        public static void main(String[] args) {
            try {
                String url_base = "https://ws.voxygen.fr/ws/tts1";
                // A TreeMap keeps the parameters sorted by name
                TreeMap<String, String> am = new TreeMap<>();
                am.put("user", user);
                am.put("voice", voice);
                am.put("text", texte);
                am.put("header", "wav-header");
                am.put("coding", "lin");
                am.put("frequency", "16000");
                // compute the hmac on the raw values
                String hmac = calculhmac(am);
                // encode text
                String textEncoded = URLEncoder.encode(texte, "utf-8");
                // update TreeMap
                am.put("text", textEncoded);
                // create request, concat hmac at the end
                String request = treeMapToUrl(url_base, am);
                if (!"".equals(hmac)) {
                    request = request + "&hmac=" + hmac;
                }
                System.out.println(request);
            } catch (UnsupportedEncodingException ex) {
                Logger.getLogger(HmacAndroid.class.getName()).log(Level.SEVERE, null, ex);
            }
        }

        private static String calculhmac(TreeMap<String, String> am) {
            String result = "";
            try {
                Mac m = Mac.getInstance("HmacMD5");
                SecretKeySpec k = new SecretKeySpec((passwd).getBytes("UTF-8"), "HmacMD5");
                m.init(k);
                for (String key : am.keySet()) {
                    String value = am.get(key);
                    m.update((key + "=" + value).getBytes("UTF-8"));
                }
                byte s[] = m.doFinal();
                /* this is important: BigInteger.toString(16) leaves out leading zeros,
                   so the digest is padded to its full 32 hexadecimal characters */
                result = String.format("%032x", new BigInteger(1, s));
            } catch (NoSuchAlgorithmException | InvalidKeyException | IllegalStateException e) {
                e.printStackTrace();
            } catch (UnsupportedEncodingException ex) {
                Logger.getLogger(HmacAndroid.class.getName()).log(Level.SEVERE, null, ex);
            }
            return result;
        }

        private static String treeMapToUrl(String url, TreeMap<String, String> am) {
            String request = url + "?";
            for (String key : am.keySet()) {
                String value = am.get(key);
                request += key + "=" + value + "&";
            }
            // drop the trailing '&'
            request = request.substring(0, request.length() - 1);
            return request;
        }
    }