Hosted Speech to Text API
Introduction
Our Speech to Text API offers a streamlined, easy-to-use solution, eliminating the need for complex setup and maintenance. Perfect for organizations focused on rapid development, our hosted API handles all processing on our secure servers, ensuring that your data remains protected from third-party access. This allows you to integrate advanced speech recognition capabilities effortlessly, saving valuable time and resources.
Authentication
To use the Banafo Speech-To-Text API, you need to provide your API key as an apiKey
query parameter in your WebSocket handshake.
Banafo AI users can generate API keys, limited to a maximum of 5 active keys per user.
To get an API key follow these simple steps:
- Log in to your Banafo AI account
- Click the API Authentication sidebar menu
- Click the "Generate new API key" button
Once generated, your API keys will not expire or become invalid unless you revoke them manually in the Banafo AI Dashboard.
Copy your API key and keep it in a secure place. A lost API key cannot be retrieved; you will need to generate a new one.
Streaming
Step by step guide
- Generate your authentication API key - check the Authentication section
- Initialize an audio streaming session - check the WebSocket handshake
- Stream your audio to Banafo and retrieve the transcript - Streaming your audio data
WebSocket handshake
All connections to this API start with a WebSocket handshake HTTP request containing the needed authorization info.
wss://app.banafo.ai/api/v1/transcripts/streaming
Query parameters:
apiKey
- required; if it is missing or an invalid one is set, the WebSocket connection is closed with code 4001.
languageCode
- required; a code specifying the language for your transcript.
endpoints
- optional; controls whether the transcript is split into segments. If omitted, it defaults to true (new lines in the response are represented by the segment number).
Available languages here
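As a sketch, the handshake URL can be assembled like this using only the Python standard library. The string serialization of the endpoints flag as "true"/"false" is an assumption, not something this document specifies:

```python
from urllib.parse import urlencode

STREAMING_URL = "wss://app.banafo.ai/api/v1/transcripts/streaming"

def build_streaming_url(api_key: str, language_code: str, endpoints: bool = True) -> str:
    """Assemble the WebSocket handshake URL with the required query parameters."""
    params = {
        "apiKey": api_key,
        "languageCode": language_code,
        # How the boolean is serialized is an assumption; the API defaults to true.
        "endpoints": "true" if endpoints else "false",
    }
    return f"{STREAMING_URL}?{urlencode(params)}"
```

Pass the resulting URL to your WebSocket client of choice to open the connection.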
After the request is successfully authorized, you will receive a message containing connection status.
{
"type": "connected",
"id": "5959b2e0-7569-11ed-9e85-b55fdb596ba0"
}
Streaming your audio data
Once the message with "type": "connected"
is received, you can start sending binary WebSocket messages containing audio data and retrieving the transcript results from Banafo.
Audio requirements:
- 1 channel of PCM 32-bit floating-point audio (IEEE-754 Standard)
- sample rate - 16 kHz (16 000 Hz)
- up to 4 seconds of audio data per message is allowed
If your audio is in a different format, it needs to be prepared according to the requirements above before sending.
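At these settings, 4 seconds of audio is 4 × 16,000 samples × 4 bytes = 256,000 bytes. A minimal sketch that splits a raw float32 PCM buffer into messages that respect the 4-second limit:

```python
# 4 s * 16,000 samples/s * 4 bytes per 32-bit float sample = 256,000 bytes
MAX_CHUNK_BYTES = 4 * 16000 * 4

def iter_audio_chunks(pcm_f32le: bytes, chunk_bytes: int = MAX_CHUNK_BYTES):
    """Split raw little-endian float32 PCM into messages of at most 4 s each."""
    for offset in range(0, len(pcm_f32le), chunk_bytes):
        yield pcm_f32le[offset:offset + chunk_bytes]
```

Each yielded chunk is sent as one binary WebSocket message.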
Transcript results
After processing each audio chunk your client sends, you will receive a JSON response containing the transcribed text. Below is an example of a single line of transcribed text:
{
"type": "partial",
"text": "Text from the current segment",
"segment": 0,
"startedAt": 0.0
}
Definition:
- type: Indicates the type of response. The value can be either:
- "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
- "final": Indicates that this is the complete transcription of the audio segment.
- text: Contains the transcribed text from the current segment of audio.
- segment: The index of the audio segment.
- startedAt: The start time of the segment in seconds.
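A minimal sketch of parsing these messages on the client side, using the field names from the example above:

```python
import json

def parse_transcript(raw: str) -> tuple[bool, int, str]:
    """Parse one transcript message into (is_final, segment, text).

    Partial results for a segment are superseded by later messages;
    a final result is the complete transcription of that segment."""
    msg = json.loads(raw)
    return msg["type"] == "final", msg["segment"], msg["text"]
```

A client would typically display partial results as they arrive and keep only the final text per segment.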
Pre-recorded
Step by step guide
- Generate your authentication API key - check the Authentication section
- Initialize an audio streaming session - check the WebSocket handshake
- Stream your audio to Banafo and retrieve the transcript - check the Streaming your audio data section
WebSocket handshake
All connections to this API should start with a WebSocket handshake HTTP request containing needed authorization info.
wss://app.banafo.ai/api/v1/transcripts/pre-recorded
Query parameters:
apiKey
- required; if it is missing or an invalid one is set, the WebSocket connection is closed with code 4001.
languageCode
- required; a code specifying the language for your transcript.
Available languages here
After the request is successfully authorized, you will receive a message containing connection status.
{
"type": "connected",
"id": "5959b2e0-7569-11ed-9e85-b55fdb596ba0"
}
Streaming your audio data
Once the message with "type": "connected"
is received, you can start sending your audio file in chunks; you will receive the transcription after the entire file has reached the Banafo server.
Before the audio data itself, the first part of the buffer you send must contain an 8-byte header with metadata about the audio.
Metadata requirements:
- Sample rate: the first 4 bytes, little-endian (for example, 16000 for 16 kHz audio)
- Sample size: the next 4 bytes, little-endian (the total number of bytes of audio data)
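The two little-endian 32-bit values can be packed into the 8-byte header like this:

```python
import struct

def make_header(sample_rate: int, audio_size_bytes: int) -> bytes:
    """Pack the 8-byte metadata header: two unsigned 32-bit little-endian ints
    ("<II"): sample rate first, then the total size of the audio in bytes."""
    return struct.pack("<II", sample_rate, audio_size_bytes)
```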
Audio requirements:
- 1 channel of PCM 32-bit floating-point audio (IEEE-754 Standard)
- Sample rate: could vary, but it's recommended to be 16 kHz (16 000 Hz)
Multiple files can be sent in one Banafo WebSocket session, and a transcript result is received at the end of each streamed file.
To close the connection and notify the Banafo API that your session is over, send a "Done"
text message.
If your audio is in a different format, it needs to be prepared according to the requirements above before sending.
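Putting the pieces together, a sketch of the message sequence for a single-file session. Whether the header must share a WebSocket message with the first audio chunk is not specified here, so this sketch sends it as its own binary message, and the chunk size is an arbitrary choice:

```python
import struct

def prerecorded_messages(sample_rate: int, pcm_f32le: bytes, chunk_bytes: int = 65536):
    """Yield the messages for one pre-recorded file, in order: the 8-byte
    metadata header, the audio data in chunks, then the "Done" text
    message that tells the API the session is over."""
    yield struct.pack("<II", sample_rate, len(pcm_f32le))
    for offset in range(0, len(pcm_f32le), chunk_bytes):
        yield pcm_f32le[offset:offset + chunk_bytes]
    yield "Done"  # sent as a text message, not binary
```

For a multi-file session, you would repeat the header-plus-audio part per file and send "Done" only once, at the end.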
Transcript results
After processing each audio file, you will receive a JSON response containing the transcribed text and detailed elements for the entire file, including segments and words with timestamps. Below is an example of a single line of transcribed text:
...
{
"type": "final",
"text": "Text from the current segment",
"segment": 0,
"startedAt": 0.0,
"elements": {
"segments": [
{
"type": "segment",
"text": "",
"startedAt": 0.0,
"segment": 0
}
],
"words": [
{
"type": "word",
"text": "",
"startedAt": 0.0,
"segment": 0
}
]
}
}
...
Definition:
- type: Indicates the type of response. The value can be either:
- "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
- "final": Indicates that this is the complete transcription of the audio segment.
- text: Contains the transcribed text from the current segment of audio.
- segment: The index of the audio segment.
- startedAt: The start time of the segment in seconds.
- elements: This field contains detailed breakdowns of the transcription:
- segments: An array of segment objects.
- type: The type of the element, which is "segment" for segment objects.
- text: The text within this particular segment.
- startedAt: The start time of this segment in seconds.
- segment: The index of the segment.
- words: An array of word objects.
- type: The type of the element, which is "word" for word objects.
- text: The individual word text.
- startedAt: The start time of this word in seconds.
- segment: The index of the segment that this word belongs to.
This detailed response structure allows you to not only retrieve the transcribed text but also analyze the timing and segmentation of the transcription.
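For example, per-word timing can be pulled out of a final response with a few lines (field names as defined above):

```python
import json

def word_timestamps(response_json: str) -> list[tuple[str, float]]:
    """Extract (text, startedAt) pairs for every word in a final response."""
    msg = json.loads(response_json)
    return [(w["text"], w["startedAt"])
            for w in msg.get("elements", {}).get("words", [])]
```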
Common errors
Bad url:
- Code: 4004;
- Message: Not found;
Bad credentials:
- Code: 4001;
- Message: Unauthorized;
Limit reached & Payment required:
- Code: 4003;
- Message: Limit reached;
Invalid languageCode:
- Code: 4004;
- Message: Invalid language code;
Data is not binary:
- Code: 4004;
- Message: Data is not binary;
Upstream closed:
- Code: 4010;
- Message: Upstream closed;
Upstream error:
- Code: 4010;
- Message: Upstream error;
Unknown error:
- Code: 4010;
- Message: Server error;
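A sketch mapping the close codes above to human-readable hints. Note that 4004 and 4010 are each shared by several error conditions, so the close message is needed to disambiguate:

```python
CLOSE_CODE_HINTS = {
    4001: "Unauthorized: missing or invalid apiKey",
    4003: "Limit reached or payment required",
    4004: "Not found, invalid language code, or data is not binary",
    4010: "Upstream closed, upstream error, or server error",
}

def describe_close(code: int) -> str:
    """Return a hint for a WebSocket close code from this API."""
    return CLOSE_CODE_HINTS.get(code, f"Unexpected close code {code}")
```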