Hosted Speech to Text API


Introduction

Our Speech to Text API offers a streamlined, easy-to-use solution that eliminates the need for complex setup and maintenance. Designed for organizations focused on rapid development, our hosted API handles all processing on our secure servers, keeping your data protected from third-party access. This lets you integrate advanced speech recognition capabilities effortlessly, saving valuable time and resources.


Authentication

To use the Banafo Speech-To-Text API, you need to provide your API key as the apiKey query parameter in your WebSocket handshake.

Banafo AI users can generate API keys, up to a maximum of 5 active keys per user.

To get an API key follow these simple steps:

  • Log in to your Banafo AI account
  • Click the API Authentication sidebar menu
  • Click the "Generate new API key" button

Once generated, your API keys do not expire and are not invalidated unless you revoke them manually in the Banafo AI Dashboard.

Copy your API key and keep it in a secure place. A lost API key cannot be retrieved; you will need to generate a new one.


Streaming


Step by step guide

  1. Generate your authentication API key - see the Authentication section
  2. Initialize an audio streaming session - see WebSocket handshake
  3. Stream your audio to Banafo and retrieve the transcript - see Streaming your audio data

WebSocket handshake

All connections to this API start with a WebSocket handshake: an HTTP upgrade request that carries the required authorization information.

wss://app.banafo.ai/api/v1/transcripts/streaming

Query parameters:

  • apiKey - required; if it is missing or invalid, the WebSocket connection is closed with code 4001.
  • languageCode - required; the language code for your transcript.
  • endpoints - optional; controls whether the transcript is split into segments. If omitted, it defaults to true (new lines in the response are represented by the segment number).

Available languages here
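The handshake URL with its query parameters can be assembled as in the following sketch. The API key and the "en-US" language code are placeholders; substitute your own key and a code from the available-languages list.

```python
from urllib.parse import urlencode

BASE_URL = "wss://app.banafo.ai/api/v1/transcripts/streaming"

def handshake_url(api_key, language_code, endpoints=None):
    """Build the WebSocket handshake URL with the documented query parameters."""
    params = {"apiKey": api_key, "languageCode": language_code}
    if endpoints is not None:  # optional parameter; the API defaults it to true
        params["endpoints"] = "true" if endpoints else "false"
    return BASE_URL + "?" + urlencode(params)

# "your-api-key" and "en-US" are placeholder values for illustration.
print(handshake_url("your-api-key", "en-US"))
```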

After the request is successfully authorized, you will receive a message containing the connection status.

{
    "type": "connected",
    "id": "5959b2e0-7569-11ed-9e85-b55fdb596ba0"
}

Streaming your audio data

Once the message with "type": "connected" is received, you can start sending binary WebSocket messages containing audio data and receiving transcript results from Banafo.

Audio requirements:

  • 1 channel of PCM 32-bit floating-point audio (IEEE 754 standard)
  • Sample rate: 16 kHz (16,000 Hz)
  • Up to 4 seconds of audio data per message

If your audio is in a different format, it needs to be prepared according to the requirements above before sending.
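A common preparation step is converting 16-bit integer PCM (the usual WAV format) to the required 32-bit float samples, then splitting the result into messages of at most 4 seconds. The sketch below uses only the standard library; the chunking boundary follows the 4-second limit above.

```python
import struct

SAMPLE_RATE = 16_000          # required sample rate: 16 kHz
MAX_SECONDS_PER_MESSAGE = 4   # at most 4 seconds of audio per message
MAX_BYTES = SAMPLE_RATE * MAX_SECONDS_PER_MESSAGE * 4  # float32 = 4 bytes/sample

def int16_to_float32_bytes(pcm16: bytes) -> bytes:
    """Convert little-endian 16-bit PCM to IEEE-754 32-bit float PCM in [-1.0, 1.0)."""
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16)
    return struct.pack(f"<{n}f", *(s / 32768.0 for s in samples))

def chunk_audio(float32_pcm: bytes):
    """Split float32 audio into binary messages of at most 4 seconds each."""
    for i in range(0, len(float32_pcm), MAX_BYTES):
        yield float32_pcm[i:i + MAX_BYTES]
```

Each chunk yielded by chunk_audio can be sent as one binary WebSocket message.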


Transcript results

After processing each audio chunk your client sends, you will receive a JSON response containing the transcribed text. Below is an example of a single line of transcribed text:

{
    "type": "partial",
    "text": "Text from the current segment",
    "segment": 0,
    "startedAt": 0.0
}

Definition:

  • type: Indicates the type of response. The value can be either:
    • "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
    • "final": Indicates that this is the complete transcription of the audio segment.
  • text: Contains the transcribed text from the current segment of audio.
  • segment: The index of the audio segment.
  • startedAt: The start time of the segment in seconds.
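Based on the response fields above, a client can ignore "partial" updates and join the "final" text of each segment into a full transcript. A minimal sketch, assuming messages arrive as raw JSON strings:

```python
import json

def collect_transcript(messages):
    """Join the "final" text of each segment, ordered by segment index.

    `messages` is an iterable of raw JSON strings received over the socket;
    "connected" and "partial" messages are skipped.
    """
    finals = {}
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "final":
            finals[msg["segment"]] = msg["text"]
    return " ".join(finals[s] for s in sorted(finals))
```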

Pre-recorded


Step by step guide

  1. Generate your authentication API key - see the Authentication section
  2. Initialize an audio streaming session - see WebSocket handshake
  3. Stream your audio to Banafo and retrieve the transcript - see Streaming your audio data

WebSocket handshake

All connections to this API start with a WebSocket handshake: an HTTP upgrade request that carries the required authorization information.

wss://app.banafo.ai/api/v1/transcripts/pre-recorded

Query parameters:

  • apiKey - required; if it is missing or invalid, the WebSocket connection is closed with code 4001.
  • languageCode - required; the language code for your transcript.

Available languages here

After the request is successfully authorized, you will receive a message containing the connection status.

{
    "type": "connected",
    "id": "5959b2e0-7569-11ed-9e85-b55fdb596ba0"
}

Streaming your audio data

Once the message with "type": "connected" is received, you can start sending your audio file in chunks; the transcription is returned after the entire file has been processed by the Banafo server.

Before the audio data itself, each file must start with an 8-byte header carrying metadata about the audio.

Metadata requirements:

  • Sample rate: the first 4 bytes, little-endian (for example, 16000 for 16 kHz audio)
  • Sample size: the next 4 bytes, little-endian (the total number of bytes of audio data)

Audio requirements:

  • 1 channel of PCM 32-bit floating-point audio (IEEE 754 standard)
  • Sample rate: may vary, but 16 kHz (16,000 Hz) is recommended

Multiple files can be sent in one Banafo WebSocket session; the transcript result for each file is received at the end of that file's stream.

To close the connection and notify the Banafo API that your session is over, send a text message containing "Done".

If your audio is in a different format, it needs to be prepared according to the requirements above before sending.
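The header and chunking described above can be sketched as follows. The 64 KB chunk size is an arbitrary choice for illustration; the API does not fix a chunk size for pre-recorded uploads.

```python
import struct

CHUNK_BYTES = 64_000  # hypothetical chunk size; not mandated by the API

def file_messages(float32_pcm: bytes, sample_rate: int = 16_000):
    """Yield binary WebSocket messages for one pre-recorded file.

    The stream starts with the 8-byte header: sample rate, then total audio
    byte count, both as little-endian 32-bit unsigned integers.
    """
    payload = struct.pack("<II", sample_rate, len(float32_pcm)) + float32_pcm
    for i in range(0, len(payload), CHUNK_BYTES):
        yield payload[i:i + CHUNK_BYTES]
```

After streaming all files this way, send the text message "Done" to end the session.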


Transcript results

After processing each audio file, you will receive a JSON response containing the transcribed text and detailed elements for the entire file, including segments and words with timestamps. Below is an example response:

...
{
    "type": "final",
    "text": "Text from the current segment",
    "segment": 0,
    "startedAt": 0.0,
    "elements": {
        "segments": [
            {
                "type": "segment",
                "text": "",
                "startedAt": 0.0,
                "segment": 0
            }
        ],
        "words": [
            {
                "type": "word",
                "text": "",
                "startedAt": 0.0,
                "segment": 0
            }
        ]
    }
}
...

Definition:

  • type: Indicates the type of response. The value can be either:
    • "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
    • "final": Indicates that this is the complete transcription of the audio segment.
  • text: Contains the transcribed text from the current segment of audio.
  • segment: The index of the audio segment.
  • startedAt: The start time of the segment in seconds.
  • elements: This field contains detailed breakdowns of the transcription:
    • segments: An array of segment objects.
      • type: The type of the element, which is "segment" for segment objects.
      • text: The text within this particular segment.
      • startedAt: The start time of this segment in seconds.
      • segment: The index of the segment.
    • words: An array of word objects.
      • type: The type of the element, which is "word" for word objects.
      • text: The individual word text.
      • startedAt: The start time of this word in seconds.
      • segment: The index of the segment that this word belongs to.

This detailed response structure allows you to not only retrieve the transcribed text but also analyze the timing and segmentation of the transcription.
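For example, the word-level timing can be pulled out of a "final" response like this (a minimal sketch over the structure documented above):

```python
import json

def word_timestamps(final_raw: str):
    """Return (text, startedAt, segment) tuples for each word in a "final" result."""
    msg = json.loads(final_raw)
    return [(w["text"], w["startedAt"], w["segment"])
            for w in msg["elements"]["words"]]
```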


Common errors

Bad url:

  • Code: 4004;
  • Message: Not found;

Bad credentials:

  • Code: 4001;
  • Message: Unauthorized;

Limit reached & Payment required:

  • Code: 4003;
  • Message: Limit reached;

Invalid languageCode:

  • Code: 4004;
  • Message: Invalid language code;

Data is not binary:

  • Code: 4004;
  • Message: Data is not binary;

Upstream closed:

  • Code: 4010;
  • Message: Upstream closed;

Upstream error:

  • Code: 4010;
  • Message: Upstream error;

Unknown error:

  • Code: 4010;
  • Message: Server error;
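A client can map these close codes to human-readable descriptions when the connection ends. Note that 4004 and 4010 each cover several distinct conditions, so the close message is needed to tell them apart; the sketch below only groups them by code.

```python
# Close codes reported by the API, grouped by code.
CLOSE_CODES = {
    4001: "Unauthorized (missing or invalid apiKey)",
    4003: "Limit reached / payment required",
    4004: "Not found, invalid language code, or data is not binary",
    4010: "Upstream closed, upstream error, or server error",
}

def describe_close(code: int) -> str:
    """Return a description for a close code, or a fallback for unknown codes."""
    return CLOSE_CODES.get(code, f"Unknown close code {code}")
```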