Speech Models - Parameters Reference
This reference lists the parameters accepted by each speech generation model in the Scenario API. Each model is identified by a unique modelId and accepts its own set of parameters for controlling speech generation.
For each model, the tables below give the model ID and, for every parameter, its type, default, allowed range or values, and notes on what it does.
ElevenLabs
ElevenLabs V3
Model ID: model_elevenlabs-tts-v3
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| text | Text | string | – | – | – | – | Required. Up to 40k characters |
| voice | Voice | select | Aria | – | – | "Aria", "Roger", "Sarah", "Laura", "Charlie", "George", "Callum", "River", "Liam", "Charlotte", "Alice", "Matilda", "Will", "Jessica", "Eric", "Chris", "Brian", "Daniel", "Lily", "Bill" | |
| stability | Stability | number | 0.5 | 0 | 1 | – | |
| similarityBoost | Similarity Boost | number | 0.5 | 0 | 1 | – | |
| style | Style Exaggeration | number | 0 | 0 | 1 | – | |
| speed | Speed | number | 1 | 0.7 | 1.2 | – | <1 slows; >1 speeds up |
| previousText | Previous Text | string | – | – | – | – | optional context |
| nextText | Next Text | string | – | – | – | – | optional context |
| languageCode | Language Code | select | "" | – | – | ISO 639-1 codes | |
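
The ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Scenario API: the helper name `build_elevenlabs_v3_params` is hypothetical, and how the resulting dictionary is submitted (endpoint, authentication, request envelope) is outside what this reference covers.

```python
from typing import Any

# Ranges copied from the ElevenLabs V3 table: name -> (min, max, default).
NUMERIC_RANGES = {
    "stability": (0.0, 1.0, 0.5),
    "similarityBoost": (0.0, 1.0, 0.5),
    "style": (0.0, 1.0, 0.0),
    "speed": (0.7, 1.2, 1.0),
}
MAX_TEXT_CHARS = 40_000  # "Up to 40k characters"


def build_elevenlabs_v3_params(text: str, voice: str = "Aria", **overrides: float) -> dict[str, Any]:
    """Hypothetical helper: return a parameter dict or raise ValueError."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    params: dict[str, Any] = {"text": text, "voice": voice}
    for name, (low, high, default) in NUMERIC_RANGES.items():
        value = overrides.pop(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        params[name] = value
    if overrides:
        # Anything left over is not a numeric parameter listed in the table.
        raise ValueError(f"unknown parameters: {sorted(overrides)}")
    return params


params = build_elevenlabs_v3_params("Hello there.", voice="Sarah", speed=1.1)
```

Parameters that are not overridden fall back to the defaults from the table, so `params` always carries an explicit value for every numeric input.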
ElevenLabs Turbo v2.5
Model ID: elevenlabs-turbo-v2-5
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| text | Text | string | – | – | – | – | Required. Text to convert to speech (max 40000 chars) |
| voice | Voice | string | Aria | – | – | Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill | Voice preset |
| stability | Stability | number | 0.5 | 0 | 1 | – | Controls voice stability |
| similarityBoost | Similarity Boost | number | 0.5 | 0 | 1 | – | Closeness to selected voice |
| styleExaggeration | Style Exaggeration | number | 0 | 0 | 1 | – | Boosts emotional expression |
| speed | Speed | number | 1 | 0.7 | 1.2 | – | <1 slows, >1 speeds up |
| previousText | Previous Text | string | – | – | – | – | Optional. Helps continuity across multi-part generation (max 10000 chars) |
| nextText | Next Text | string | – | – | – | – | Optional. Helps continuity (max 10000 chars) |
| languageCode | Language Code | string | "" | – | – | "" (auto), en, ca, es, fr, de, it, ja, ko, zh, ru, ar, hi, bn, pa, ta, te, mr, ur, fa, tr, nl, sv, da, no, fi, el, ro, hu, cs, sk, sl, pt, id, th, vi, ms, tl, yo, ig, ha, am, az, be, bg, hr | Forces language for synthesis |
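
The `previousText` and `nextText` inputs are described as helping continuity across multi-part generation. One way to use them, sketched below under assumptions, is to split a long script into chunks and give each chunk its neighbours as context. The function names and the word-based splitting strategy are illustrative choices, not something the API prescribes; only the 40000 and 10000 character limits come from the table.

```python
MAX_TEXT_CHARS = 40_000     # limit on "text"
MAX_CONTEXT_CHARS = 10_000  # limit on "previousText" and "nextText"


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_chunk_params(text: str, max_chars: int = MAX_TEXT_CHARS) -> list[dict[str, str]]:
    """Hypothetical helper: one parameter dict per chunk, with neighbour context."""
    chunks = split_into_chunks(text, max_chars)
    result = []
    for i, chunk in enumerate(chunks):
        params = {"text": chunk}
        if i > 0:
            # Keep the end of the previous chunk, closest to this one.
            params["previousText"] = chunks[i - 1][-MAX_CONTEXT_CHARS:]
        if i < len(chunks) - 1:
            # Keep the start of the next chunk.
            params["nextText"] = chunks[i + 1][:MAX_CONTEXT_CHARS]
        result.append(params)
    return result


# Tiny limit used only to make the splitting visible.
demo = build_chunk_params("one two three four", max_chars=7)
```

Splitting on sentence boundaries instead of words would likely give more natural joins; the word-based version is used here only to keep the sketch short.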
ElevenLabs Multilingual v2
Model ID: model_elevenlabs-multilingual-v2
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| text | Text | string | – | – | – | – | Required. Up to 40k characters |
| voice | Voice | select | Aria | – | – | "Aria", "Roger", "Sarah", "Laura", "Charlie", "George", "Callum", "River", "Liam", "Charlotte", "Alice", "Matilda", "Will", "Jessica", "Eric", "Chris", "Brian", "Daniel", "Lily", "Bill" | |
| stability | Stability | number | 0.5 | 0 | 1 | – | |
| similarityBoost | Similarity Boost | number | 0.5 | 0 | 1 | – | |
| style | Style Exaggeration | number | 0 | 0 | 1 | – | |
| speed | Speed | number | 1 | 0.7 | 1.2 | – | <1 slows; >1 speeds up |
| previousText | Previous Text | string | – | – | – | – | optional context |
| nextText | Next Text | string | – | – | – | – | optional context |
| languageCode | Language Code | select | "" | – | – | ISO 639-1 codes | |
Minimax
Minimax Speech 2.6 HD
Model ID: model_minimax-speech-2-6-hd
| Input | Label | Type | Default | Min | Max | Allowed Values |
|---|---|---|---|---|---|---|
| text | Text | string | – | – | – | – |
| voiceId | Voice Id | select | Wise_Woman | – | – | Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl |
| speed | Speed | number | 1 | 0.5 | 2 | – |
| volume | Volume | number | 1 | 0 | 10 | – |
| pitch | Pitch | number | 0 | -12 | 12 | – |
| emotion | Emotion | select | auto | – | – | auto, neutral, happy, sad, angry, fearful, disgusted, surprised |
| englishNormalization | English Normalization | boolean | false | – | – | – |
| sampleRate | Sample Rate | number | 32000 | – | – | 8000, 16000, 22050, 24000, 32000, 44100 |
| bitrate | Bitrate | number | 128000 | – | – | 32000, 64000, 128000, 256000 |
| channel | Channel | select | mono | – | – | mono, stereo |
| languageBoost | Language Boost | select | Automatic | – | – | (list of 25 language options) |
Minimax Speech 2.6 Turbo
Model ID: model_minimax-speech-2-6-turbo
| Input | Label | Type | Default | Min | Max | Allowed Values |
|---|---|---|---|---|---|---|
| text | Text | string | – | – | – | – |
| voiceId | Voice Id | select | Wise_Woman | – | – | Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl |
| speed | Speed | number | 1 | 0.5 | 2 | – |
| volume | Volume | number | 1 | 0 | 10 | – |
| pitch | Pitch | number | 0 | -12 | 12 | – |
| emotion | Emotion | select | auto | – | – | auto, neutral, happy, sad, angry, fearful, disgusted, surprised |
| englishNormalization | English Normalization | boolean | false | – | – | – |
| sampleRate | Sample Rate | number | 32000 | – | – | 8000, 16000, 22050, 24000, 32000, 44100 |
| bitrate | Bitrate | number | 128000 | – | – | 32000, 64000, 128000, 256000 |
| channel | Channel | select | mono | – | – | mono, stereo |
| languageBoost | Language Boost | select | Automatic | – | – | (list of 25 language options) |
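
Both Minimax Speech 2.6 models share the same inputs, several of which accept only a fixed set of values. The sketch below models them as a dataclass that rejects values outside the listed ranges and options. The class `MinimaxSpeechParams` is hypothetical and exists only for illustration. `languageBoost` is left out because the table does not enumerate its 25 options, and `voiceId` is not checked to keep the example short.

```python
from dataclasses import asdict, dataclass

# Allowed values copied from the Minimax Speech 2.6 tables.
EMOTIONS = {"auto", "neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
CHANNELS = {"mono", "stereo"}


@dataclass
class MinimaxSpeechParams:
    """Hypothetical container; field names mirror the Input column."""

    text: str
    voiceId: str = "Wise_Woman"
    speed: float = 1.0
    volume: float = 1.0
    pitch: float = 0.0
    emotion: str = "auto"
    englishNormalization: bool = False
    sampleRate: int = 32000
    bitrate: int = 128000
    channel: str = "mono"

    def __post_init__(self) -> None:
        # Inclusive numeric ranges from the Min and Max columns.
        for name, low, high in (("speed", 0.5, 2), ("volume", 0, 10), ("pitch", -12, 12)):
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        # Enumerated options from the Allowed Values column.
        for name, allowed in (
            ("emotion", EMOTIONS),
            ("sampleRate", SAMPLE_RATES),
            ("bitrate", BITRATES),
            ("channel", CHANNELS),
        ):
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


params = asdict(MinimaxSpeechParams(text="Good morning.", emotion="happy", sampleRate=44100))
```

Because the field names match the Input column exactly, `asdict` yields a dictionary whose keys can be used as parameter names without further mapping.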
Tada
Tada 1B Text to Speech
Model ID: model_tada-1b-text-to-speech
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| audio | Reference Audio | file | None | None | None | None | Required. Reference audio for voice cloning. |
| prompt | Prompt | string | None | None | 10000 | None | Required. Text to synthesize with the reference voice. |
| transcript | Reference Transcript | string | "" | None | None | None | Transcript of the reference audio. Required for non-English references. |
| language | Language | string | en | None | None | en, ar, ch, de, es, fr, it, ja, pl, pt | Language used for text alignment. |
| outputFormat | Output Format | string | wav | None | None | wav, mp3 | Output audio file format. |
| numInferenceSteps | Inference Steps | number | 20 | 1 | 50 | None | Number of ODE solver steps for acoustic generation. |
| speedUpFactor | Speed Up Factor | number | 1 | 0.5 | 2 | None | Values > 1 speed up and values < 1 slow down speech. |
| temperature | Temperature | number | 0.6 | 0 | 2 | None | Sampling temperature for text token generation. |
| topP | Top P | number | 0.9 | 0 | 1 | None | Top-p nucleus sampling value. |
| repetitionPenalty | Repetition Penalty | number | 1.1 | 1 | 2 | None | Penalty applied to repeated tokens. |
| acousticCfgScale | Acoustic CFG Scale | number | 1.6 | 0 | 10 | None | Classifier-free guidance scale for acoustic generation. |
| noiseTemperature | Noise Temperature | number | 0.9 | 0 | 2 | None | Temperature for diffusion noise during flow matching. |
| numExtraSteps | Extra Steps | number | 0 | 0 | 50 | None | Additional autoregressive steps for continuation. |
Tada 3B Text to Speech
Model ID: model_tada-3b-text-to-speech
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| audio | Reference Audio | file | None | None | None | None | Required. Reference audio for voice cloning. |
| prompt | Prompt | string | None | None | 10000 | None | Required. Text to synthesize with the reference voice. |
| transcript | Reference Transcript | string | "" | None | None | None | Transcript of the reference audio. Required for non-English references. |
| language | Language | string | en | None | None | en, ar, ch, de, es, fr, it, ja, pl, pt | Language used for text alignment. |
| outputFormat | Output Format | string | wav | None | None | wav, mp3 | Output audio file format. |
| numInferenceSteps | Inference Steps | number | 20 | 1 | 50 | None | Number of ODE solver steps for acoustic generation. |
| speedUpFactor | Speed Up Factor | number | 1 | 0.5 | 2 | None | Values > 1 speed up and values < 1 slow down speech. |
| temperature | Temperature | number | 0.6 | 0 | 2 | None | Sampling temperature for text token generation. |
| topP | Top P | number | 0.9 | 0 | 1 | None | Top-p nucleus sampling value. |
| repetitionPenalty | Repetition Penalty | number | 1.1 | 1 | 2 | None | Penalty applied to repeated tokens. |
| acousticCfgScale | Acoustic CFG Scale | number | 1.6 | 0 | 10 | None | Classifier-free guidance scale for acoustic generation. |
| noiseTemperature | Noise Temperature | number | 0.9 | 0 | 2 | None | Temperature for diffusion noise during flow matching. |
| numExtraSteps | Extra Steps | number | 0 | 0 | 50 | None | Additional autoregressive steps for continuation. |
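
For both Tada models, the `transcript` input is optional for English reference audio but required otherwise. A pre-flight check along the following lines can catch a missing transcript before any audio is uploaded. This is a sketch with two labeled assumptions: the helper `check_tada_inputs` is hypothetical, and it treats the `language` value as the language of the reference audio, which the table does not state explicitly.

```python
# Language codes copied verbatim from the Allowed Values column.
TADA_LANGUAGES = {"en", "ar", "ch", "de", "es", "fr", "it", "ja", "pl", "pt"}
MAX_PROMPT_CHARS = 10_000


def check_tada_inputs(prompt: str, language: str = "en", transcript: str = "") -> list[str]:
    """Hypothetical helper: return a list of problems; an empty list means OK."""
    problems: list[str] = []
    if not prompt:
        problems.append("prompt is required")
    elif len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if language not in TADA_LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    elif language != "en" and not transcript.strip():
        # Assumption: `language` also describes the reference audio.
        problems.append("transcript is required for non-English references")
    return problems


ok = check_tada_inputs("Hello, world.")
missing = check_tada_inputs("Bonjour tout le monde.", language="fr")
```

Returning a list rather than raising on the first problem lets a caller report every issue at once, which is convenient when the inputs come from a form.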
Lux TTS
Model ID: model_lux-tts
| Input | Label | Type | Default | Min | Max | Allowed Values | Notes |
|---|---|---|---|---|---|---|---|
| prompt | Prompt | string | None | None | 10000 | None | Required. Text to convert to speech. |
| audio | Reference Audio | file | None | None | None | None | Required. Reference audio for voice cloning. |
| guidanceScale | Guidance Scale | number | 3 | 0 | 10 | None | Higher values increase adherence to the reference voice. |
| numInferenceSteps | Inference Steps | number | 4 | 1 | 16 | None | Number of flow-matching inference steps. |
| maxRefLength | Max Reference Length | number | 5 | 1 | 15 | None | Maximum reference audio duration used for voice encoding (seconds). |
| seed | Seed | number | None | 0 | 2147483647 | None | Seed for reproducible outputs. |
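
The `seed` input has no default, so a run that omits it cannot be repeated later unless the seed is chosen and recorded by the caller. The sketch below shows one way to do that on the client side. The helper `with_explicit_seed` is hypothetical; only the 0 to 2147483647 range comes from the table.

```python
import random
from typing import Any, Optional

MAX_SEED = 2_147_483_647  # Max column for "seed"


def with_explicit_seed(params: dict[str, Any], rng: Optional[random.Random] = None) -> dict[str, Any]:
    """Hypothetical helper: return a copy of params that always carries a seed."""
    rng = rng or random.Random()
    result = dict(params)
    seed = result.get("seed")
    if seed is None:
        # Pick a seed locally so it can be logged and reused.
        result["seed"] = rng.randint(0, MAX_SEED)
    elif not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed={seed} is outside [0, {MAX_SEED}]")
    return result


first = with_explicit_seed({"prompt": "Testing, one two three."})
# Reusing the recorded seed keeps the parameters identical for a repeat run.
repeat = with_explicit_seed({"prompt": "Testing, one two three.", "seed": first["seed"]})
```

Storing `first["seed"]` alongside the generated audio makes it possible to request the same parameters again later.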