On-premise Speech to Text models
Introduction
The main benefit of our on-premise models is that your sensitive information stays entirely in your hands. Designed for organizations that prioritize data privacy and security, our on-premise ASR models ensure that all processing is done locally, safeguarding your data from external access.
Security
Our on-premise ASR models ensure that none of your private data passes through our servers. Your recordings, transcripts, or any other identifying data remain on your systems and are never shared with us or third parties. By using our on-premise ASR models, you retain full control over your sensitive information, ensuring it remains in your infrastructure at all times. Once Banafo's on-premise models are set up on your machine, they will only contact the Banafo licensing server to:
- Validate your license and allowed features: This includes sending your license key and machine fingerprint to the licensing server. The machine fingerprint is used to control the number of machines on which your on-premise models are activated.
- Report usage information: This includes sending the audio duration processed by the model to the licensing server.
Streaming models
Introduction
In this section you will learn how to use our on-premise streaming models to add real-time transcripts securely and efficiently. Whether you are streaming from a microphone or another audio source, our models provide seamless and accurate transcription. Interaction with our on-premise streaming API is done through the WebSocket protocol.
Ideal for:
- Live captions
- Accessibility
- Voice command services
- Chatbots
- Virtual assistants
Hardware and software requirements
- supported on latest Linux distributions
- Windows support – coming soon
Setup
- Download the model:
  Begin by downloading the on-premise streaming model.
- Start the server:
  To start the server listening on the default port, use the following command:
  ./streaming-websocket-server --key=LICENSE_KEY
- Specify a custom port:
  To start the server on a specific port, use:
  ./streaming-websocket-server --key=LICENSE_KEY --port=6007
  Note: Replace LICENSE_KEY with the activation key provided by Banafo.
- Endpoints specifics:
Endpoints are moments in a transcription where the streaming model determines that the previous transcription will no longer be modified. These typically occur during pauses in speech, but an additional rule prevents excessively long intervals if no pauses are detected.
Endpointing Rules
Each rule defines specific conditions under which an endpoint is triggered. If any one rule's conditions are met, an endpoint occurs.
Rule 1
- --rule1-must-contain-nonsilence: If true, this rule applies only when there is non-silence in the best-path traceback. A non-blank token is considered non-silence. (Default: false)
- --rule1-min-trailing-silence: The duration of trailing silence (in seconds) must be at least this value. (Default: 2.4)
- --rule1-min-utterance-length: The utterance length (in seconds) must be at least this value. (Default: 0)
Rule 2
- --rule2-must-contain-nonsilence: If true, this rule applies only when there is non-silence in the best-path traceback. (Default: true)
- --rule2-min-trailing-silence: The trailing silence (in seconds) must be at least this value. (Default: 1.2)
- --rule2-min-utterance-length: The utterance length (in seconds) must be at least this value. (Default: 0)
Rule 3
- --rule3-must-contain-nonsilence: If true, this rule applies only when there is non-silence in the best-path traceback. (Default: false)
- --rule3-min-trailing-silence: The trailing silence (in seconds) must be at least this value. (Default: 0)
- --rule3-min-utterance-length: The utterance length (in seconds) must be at least this value. (Default: 20)
How Endpoints Are Triggered
An endpoint is triggered when all conditions of any single rule are met. Only one rule needs to be satisfied for an endpoint to occur.
Important Notes:
- A value of 0 disables a rule; it does not mean "equal to or longer than 0 seconds."
- Typically, at least one parameter in a rule should be disabled to allow flexibility in endpointing.
Default Behavior
With the default settings:
- Rule 1 triggers after 2.4 seconds of silence, even if nothing was transcribed.
- Rule 2 triggers after 1.2 seconds of silence, but only if something was transcribed.
- Rule 3 triggers when the utterance reaches 20 seconds, regardless of silence.
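The default rules above can be sketched as a small decision function. This is an illustrative model of the documented behavior, not the server's actual implementation; the names `EndpointRule` and `is_endpoint` are ours.

```python
# Sketch of the endpointing decision under the three default rules
# described above. An endpoint fires when all conditions of any single
# rule hold.

from dataclasses import dataclass

@dataclass
class EndpointRule:
    must_contain_nonsilence: bool  # rule applies only if something was transcribed
    min_trailing_silence: float    # required trailing silence, in seconds
    min_utterance_length: float    # required utterance length, in seconds

# Defaults from the documentation above (rules 1, 2, and 3).
DEFAULT_RULES = [
    EndpointRule(False, 2.4, 0.0),
    EndpointRule(True, 1.2, 0.0),
    EndpointRule(False, 0.0, 20.0),
]

def is_endpoint(has_nonsilence: bool, trailing_silence: float,
                utterance_length: float, rules=DEFAULT_RULES) -> bool:
    """Return True if any single rule's conditions are all met."""
    for rule in rules:
        if rule.must_contain_nonsilence and not has_nonsilence:
            continue
        if trailing_silence < rule.min_trailing_silence:
            continue
        if utterance_length < rule.min_utterance_length:
            continue
        return True
    return False
```

For example, 1.3 seconds of silence triggers an endpoint only when something was transcribed (rule 2), while 2.5 seconds triggers one unconditionally (rule 1).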
Explanation of Parameters
- *-must-contain-nonsilence: If true, the rule applies only if there is non-silence in the segment.
- *-min-trailing-silence: The rule is triggered if the trailing silence is equal to or longer than the specified value.
- *-min-utterance-length: The rule is triggered once the segment length reaches or exceeds the specified value.
Check full list of parameters:
For a complete list of available parameters, run:
./streaming-websocket-server --help
Getting started
The Banafo streaming model uses the WebSocket protocol for streaming audio data and receiving transcriptions in real-time.
Audio requirements:
- 1 channel of PCM 32-bit floating-point audio (IEEE-754 standard)
- Sample rate: 16 kHz (16,000 Hz)
- Up to 4 seconds of audio data per message
If your audio is in a different format, it needs to be prepared according to the requirements of the on-premise models.
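As a sketch of that preparation, the snippet below converts mono 16-bit PCM samples to little-endian IEEE-754 32-bit floats and splits them into messages of at most 4 seconds each, matching the requirements above. The function names are illustrative and use only the Python standard library.

```python
# Prepare audio for the streaming server: scale 16-bit PCM to float32
# and split into chunks of at most 4 seconds, as required above.

import struct

SAMPLE_RATE = 16_000               # required: 16 kHz
MAX_CHUNK_SAMPLES = SAMPLE_RATE * 4  # required: at most 4 s per message

def int16_to_float32(samples):
    """Scale signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

def to_chunks(samples):
    """Yield binary messages, each holding at most 4 s of float32 audio."""
    floats = int16_to_float32(samples)
    for i in range(0, len(floats), MAX_CHUNK_SAMPLES):
        chunk = floats[i:i + MAX_CHUNK_SAMPLES]
        # '<%df' packs the chunk as little-endian 32-bit floats
        yield struct.pack('<%df' % len(chunk), *chunk)
```

Each yielded bytes object would then be sent to the server as one binary WebSocket message.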
Once the server is started as described in the installation guidelines and your audio is prepared, you can easily try out our on-premise models using our code samples, prepared in different languages.
Transcript results
After processing each audio chunk your client sends, you will receive a JSON response containing the transcribed text and detailed elements for this part, including segments and words with timestamps. Below is an example of a single line of transcribed text:
...
{
"type": "partial",
"text": "Text from the current segment",
"segment": 0,
"startedAt": 0.0,
"elements": {
"segments": [
{
"type": "segment",
"text": "",
"startedAt": 0.0,
"segment": 0
}
],
"words": [
{
"type": "word",
"text": "",
"startedAt": 0.0,
"segment": 0
}
]
}
}
...
Definition:
- type: Indicates the type of response. The value can be either:
- "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
- "final": Indicates that this is the complete transcription of the audio segment.
- text: Contains the transcribed text from the current segment of audio.
- segment: The index of the audio segment.
- startedAt: The start time of the segment in seconds.
- elements: This field contains detailed breakdowns of the transcription:
- segments: An array of segment objects.
- type: The type of the element, which is "segment" for segment objects.
- text: The text within this particular segment.
- startedAt: The start time of this segment in seconds.
- segment: The index of the segment.
- words: An array of word objects.
- type: The type of the element, which is "word" for word objects.
- text: The individual word text.
- startedAt: The start time of this word in seconds.
- segment: The index of the segment that this word belongs to.
This detailed response structure allows you to not only retrieve the transcribed text but also analyze the timing and segmentation of the transcription.
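A minimal sketch of handling one such message, based on the structure documented above (the helper name `handle_message` is ours):

```python
# Parse one transcript message from the streaming server and pull out
# the finality flag, the text, and per-word timestamps.

import json

def handle_message(raw: str):
    """Return (is_final, text, word_timings) for one server response."""
    msg = json.loads(raw)
    is_final = msg["type"] == "final"
    words = [(w["text"], w["startedAt"]) for w in msg["elements"]["words"]]
    return is_final, msg["text"], words

# Example message shaped like the response documented above.
raw = json.dumps({
    "type": "final",
    "text": "hello world",
    "segment": 0,
    "startedAt": 0.0,
    "elements": {
        "segments": [{"type": "segment", "text": "hello world",
                      "startedAt": 0.0, "segment": 0}],
        "words": [{"type": "word", "text": "hello", "startedAt": 0.0, "segment": 0},
                  {"type": "word", "text": "world", "startedAt": 0.42, "segment": 0}],
    },
})
is_final, text, words = handle_message(raw)
```

In a real client you would call such a handler for every text frame received on the WebSocket, typically displaying "partial" results provisionally and committing "final" ones.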
Post-processed models
Introduction
In this section you will learn how to use our on-premise post-processed models to add transcripts securely and efficiently after your call is over. Whether handling recorded audio or processing data post-event, our models ensure high-quality transcription. Interaction with our on-premise post-processed API is done through the WebSocket protocol.
Ideal for:
- Meeting transcripts - online and offline
- Visual voicemail
- Voice memo transcripts
- Content creation
- Productivity and analytics
- Movie subtitles
Hardware and software requirements
- supported on latest Linux distributions
- Windows support – coming soon
Setup
- Download the model:
  Begin by downloading the on-premise post-processed model.
- Start the server:
  To start the server listening on the default port, use the following command:
  ./post-processed-websocket-server --key=LICENSE_KEY --port=PORT --num-io-threads=NUMBER_OF_THREADS
  Note:
  - Replace LICENSE_KEY with the activation key provided by Banafo.
  - Replace NUMBER_OF_THREADS with the number of concurrent file uploads you want to handle, based on your needs and hardware setup. For example, if you set this to 3 and try to stream 5 large files to your post-processed server, the first 3 will start uploading and the rest will wait, which might result in a client timeout. If omitted, the default number of concurrent file streams is 3.
  - Replace PORT with your desired custom port number. If omitted, the default port is 6006.
For a complete list of available parameters, run:
./post-processed-websocket-server --help
Getting started
The Banafo post-processed model uses the WebSocket protocol for streaming your audio file in chunks and receiving the transcription after the entire file has been received by the server.
Before sending your audio, you need to provide some metadata about it in an 8-byte header at the start of the stream.
Metadata requirements:
- Sample rate: the first 4 bytes, little endian (for example, 16000 for 16 kHz audio)
- Sample size: the next 4 bytes, little endian (the total number of bytes in the audio)
Audio requirements:
- 1 channel of PCM 32-bit floating-point audio (IEEE-754 standard)
- Sample rate: may vary, but 16 kHz (16,000 Hz) is recommended
Multiple files can be sent in one Banafo WebSocket session, and the transcript result will be received at the end of each streamed file.
To close the connection and notify your on-premise model that your session is over, send a "Done" text message.
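Building the 8-byte metadata header described above can be sketched as follows; the helper name `build_header` is ours, and the snippet uses only the Python standard library.

```python
# Pack the post-processed server's 8-byte header: sample rate in the
# first 4 bytes and total audio byte count in the next 4, both as
# little-endian unsigned 32-bit integers.

import struct

def build_header(sample_rate: int, audio: bytes) -> bytes:
    """Pack sample rate and audio size as two little-endian uint32 values."""
    return struct.pack('<II', sample_rate, len(audio))

# Example: 1 s of 16 kHz float32 silence -> 64,000 bytes of audio data.
audio = b'\x00' * (16_000 * 4)
header = build_header(16_000, audio)
# The header is sent first, followed by the audio data in chunks.
```

The header precedes each file's audio data, so when streaming multiple files in one session you would emit a fresh header before every file.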
If your audio is in a different format, it needs to be prepared according to the requirements of the on-premise models.
Once the server is started as described in the installation guidelines and your audio is prepared, you can easily try out our on-premise models using our code samples, prepared in different languages.
Transcript results
After processing each audio file, you will receive a JSON response containing the transcribed text and detailed elements for the entire file, including segments and words with timestamps. Below is an example of a single line of transcribed text:
...
{
"type": "final",
"text": "Text from the current segment",
"segment": 0,
"startedAt": 0.0,
"elements": {
"segments": [
{
"type": "segment",
"text": "",
"startedAt": 0.0,
"segment": 0
}
],
"words": [
{
"type": "word",
"text": "",
"startedAt": 0.0,
"segment": 0
}
]
}
}
...
Definition:
- type: Indicates the type of response. The value can be either:
- "partial": Signifies that this is an intermediate result and the transcription is not yet complete.
- "final": Indicates that this is the complete transcription of the audio segment.
- text: Contains the transcribed text from the current segment of audio.
- segment: The index of the audio segment.
- startedAt: The start time of the segment in seconds.
- elements: This field contains detailed breakdowns of the transcription:
- segments: An array of segment objects.
- type: The type of the element, which is "segment" for segment objects.
- text: The text within this particular segment.
- startedAt: The start time of this segment in seconds.
- segment: The index of the segment.
- words: An array of word objects.
- type: The type of the element, which is "word" for word objects.
- text: The individual word text.
- startedAt: The start time of this word in seconds.
- segment: The index of the segment that this word belongs to.
This detailed response structure allows you to not only retrieve the transcribed text but also analyze the timing and segmentation of the transcription.
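As one illustration of using that timing information, the sketch below turns the final response's segments into SRT subtitle entries (movie subtitles being one of the use cases listed above). The response carries only start times, so each entry's end time is approximated here by the next segment's start, with a fixed 2-second padding for the last one; that approximation and both helper names are our assumptions, not part of the documented API.

```python
# Convert transcript segments (text + startedAt) into SRT subtitle text.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    ms = round(seconds * 1000)
    h, rest = divmod(ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments, last_padding=2.0) -> str:
    """Number each segment and emit start --> end lines in SRT format."""
    entries = []
    for i, seg in enumerate(segments):
        start = seg["startedAt"]
        # End time: next segment's start, or a fixed padding for the last one.
        end = segments[i + 1]["startedAt"] if i + 1 < len(segments) else start + last_padding
        entries.append(f"{i + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{seg['text']}\n")
    return "\n".join(entries)
```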
Common errors
- License does not exist - This error indicates that your activation key is not found in our system. Please verify the key you received from us. If the key matches but the error persists, contact us at licensing@banafo.com for assistance.
- License is suspended - This error means that your license is suspended, and transcript generation is no longer possible. This can happen if the payment for your billing period was not processed correctly. Please contact us at licensing@banafo.com so we can help resolve the issue.
- License is expired - This error occurs if you are using a trial license that has expired, making transcript generation no longer possible. Contact us at licensing@banafo.com for assistance.
- License is missing one or more required entitlements - This error indicates that you are using the wrong key for your model. This might occur if you have multiple licenses or received an incorrect key. Please verify the key you received from us for the specific model you are trying to activate. If the key matches but the error persists, contact us at licensing@banafo.com for assistance.
- License machine count has exceeded maximum allowed for license (3) - This error means you are trying to start the model on more machines than your current license permits. If you mistakenly attempted to start it on another machine, please switch back to the correct one. If you wish to move your model to a different machine, contact us at licensing@banafo.com for help.
If you encounter any of the above errors or other issues not listed here, please contact us at licensing@banafo.com, and we will assist you in resolving the problem.