Releases: Wordcab/wordcab-transcribe

v0.5.3

01 Apr 21:21
419278c

This release introduces an engine system for swapping the Whisper engine between faster-whisper and TensorRT-LLM.

API

  • MIT License!
  • Added the ability to swap the Whisper "engine" from the default faster-whisper to TensorRT-LLM, which is much faster. #285
  • Added support for distil models like distil-large-v2 and distil-large-v3. These work with the TensorRT-LLM engine.
  • Added a batch_size parameter for the endpoints. It doesn't do anything yet, but the TensorRT-LLM engine supports batch processing of files, and the idea is to add this feature along with dynamic batching.
  • Overall tighter control over dependencies, and various dependency updates.
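As a rough sketch of what an engine-swap mechanism can look like (all names below are hypothetical, not the project's actual API), a registry can map engine names to loader callables:

```python
from typing import Callable, Dict

# Hypothetical registry; the real project wires its engines differently.
ENGINE_LOADERS: Dict[str, Callable[[str], str]] = {}

def register_engine(name: str):
    """Register a loader callable under an engine name."""
    def decorator(loader: Callable[[str], str]):
        ENGINE_LOADERS[name] = loader
        return loader
    return decorator

@register_engine("faster-whisper")
def load_faster_whisper(model: str) -> str:
    # Placeholder: a real loader would build a faster-whisper model here.
    return f"faster-whisper:{model}"

@register_engine("tensorrt-llm")
def load_tensorrt_llm(model: str) -> str:
    # Distil models such as distil-large-v3 work with this engine.
    return f"tensorrt-llm:{model}"

def load_engine(name: str, model: str) -> str:
    """Look up and invoke the loader for the requested engine."""
    try:
        return ENGINE_LOADERS[name](model)
    except KeyError:
        raise ValueError(f"Unknown engine: {name!r}") from None
```

This keeps engine selection data-driven, so adding a new backend only means registering one more loader.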

Diarization

  • Started implementing NVIDIA NeMo's new long-form diarization class. It currently still consumes too much memory.

Documentation

  • Added examples of offline models for various backends #288 #289

Thanks to contributor @aleksandr-smechov, to the NeMo team for their work, and to the WhisperS2T project for the initial TensorRT-LLM backend code (and, by extension, TensorRT-LLM's Whisper example).

v0.5.2

12 Oct 12:19
8c967bd

This release introduces remote execution and single-service deployment.

API

  • Added the possibility to choose RemoteExecution or LocalExecution for the transcription and diarization services #258
  • Implemented single-service deployment with the new only_transcription and only_diarization asr types #261
  • Added new endpoints to manage remote execution servers #263
  • Allowed the user to auto-switch between local and remote execution for all services #266
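The auto-switching behavior can be sketched as a small resolver that prefers remote execution whenever a server URL is configured (the settings shape and names here are illustrative, not the project's real config):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExecutionSettings:
    """Hypothetical settings object: a list of remote server URLs per service.
    An empty list means the service runs locally."""
    transcription_urls: List[str] = field(default_factory=list)
    diarization_urls: List[str] = field(default_factory=list)

def resolve_execution(settings: ExecutionSettings, service: str) -> str:
    """Pick 'remote' when at least one server URL is configured for the
    service, otherwise fall back to 'local'."""
    urls = getattr(settings, f"{service}_urls", [])
    return "remote" if urls else "local"
```

With this shape, switching a deployment from local to remote is purely a configuration change, with no code change per service.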

Diarization

  • Adjusted VAD speech padding and updated the diarization logic #271

Bugs and Fixes

  • Fixed a newly introduced dual_channel bug #268
  • Added a fix to avoid empty utterances #273

Documentation

  • Added new documentation to the project via mkdocs-material and GitHub pages #269

Thanks to contributors @aleksandr-smechov @chainyo

v0.5.1

25 Sep 10:50
bcc7e98

API

  • Added the offset_start and offset_end parameters to the API
  • Added live transcription #241 #247
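Conceptually, offset_start and offset_end trim the decoded audio before transcription. A minimal sketch, assuming audio arrives as a flat list of samples (helper name and signature are hypothetical):

```python
from typing import List, Optional

def apply_offsets(
    samples: List[int],
    sample_rate: int,
    offset_start: Optional[float] = None,
    offset_end: Optional[float] = None,
) -> List[int]:
    """Keep only the samples between offset_start and offset_end (seconds).
    A missing offset leaves that side of the audio untouched."""
    start = int(offset_start * sample_rate) if offset_start is not None else 0
    end = int(offset_end * sample_rate) if offset_end is not None else len(samples)
    return samples[start:end]
```

For example, with 16 kHz audio, offsets of 1.0 and 2.0 seconds keep exactly 16,000 samples starting at sample 16,000.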

Transcription

  • Updated dual_channel to multi_channel with auto-detection feature #239 #244 #248
  • Let faster-whisper handle the model path for custom models hosted on HF #251
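The multi_channel auto-detection presumably inspects the number of channels in the uploaded file. A stdlib-only sketch of such a check (the project's actual detection logic may differ):

```python
import io
import wave

def channel_count(wav_bytes: bytes) -> int:
    """Return the number of channels in a WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnchannels()

def make_silent_wav(channels: int, frames: int = 160, rate: int = 16000) -> bytes:
    """Build a tiny silent 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * frames * channels)
    return buf.getvalue()
```

A service could then route files with more than one channel to the multi-channel pipeline automatically.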

Bugs and Fixes

  • Fixed Python 3.8 compatibility #235
  • Fixed the hms format for timestamps #237
  • Fixed the extra_languages config #235

v0.5.0

01 Sep 16:10
7e08cc9

This release is a significant change from poetry to hatch, with many improvements to CI, tests, local development, and dependency handling.

  • Huge project updates #227
  • Updated tests, dependencies, config defaults #229

API

  • Added a warmup for inference #201
  • Added repetition_penalty parameter #207
  • Added num_speakers parameter #195
  • Improved the time_and_tell function #213
  • Updated the API schemas #188
  • Added transcription parameters for control #213
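The new parameters above can be pictured as a request model. A hypothetical sketch (field names follow the release notes, but the defaults here are guesses, not the project's actual defaults):

```python
from dataclasses import dataclass, asdict

@dataclass
class TranscribeParams:
    """Illustrative bundle of transcription controls."""
    num_speakers: int = -1           # -1: let diarization infer the count (assumed default)
    repetition_penalty: float = 1.0  # 1.0: no penalty applied (assumed default)

    def as_payload(self) -> dict:
        """Serialize to a plain dict, e.g. for a JSON request body."""
        return asdict(self)
```

Bundling the knobs into one typed object keeps endpoint signatures stable as more parameters are added.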

Transcription

  • Added bfloat16 to compute types #209

Diarization

  • Added empty audio catch during diarization #223 #225
  • Reimplemented the entire diarization module to skip NeMo module installation #186 #202

CI

  • Added concurrency on CI tests #191

Contributors:
@aleksandr-smechov @chainyo

v0.4.0

02 Aug 13:10
ad689f4

This release includes a lot of improvements and a new license, starting with v0.4.0 of wordcab-transcribe (inspired by the HFOIL).

The new license: WTL v0.1

The new license prevents anyone, from v0.4.0 (inclusive) onward, from selling a self-hosted version of this software without an agreement from Wordcab.

You can still use the project for research, personal use, or as a backend tool for your own projects.

API

  • Fixed CortexResponse for Svix size limit #101
  • Made alignment non-critical if the process fails #105
  • Added multi-GPU support for transcription, alignment, and diarization #114
  • Added the audio_duration (in seconds) in the API response #127
  • Added a catch for invalid or empty audio file #128
  • Added a log about the number of detected and used GPUs at launch #138
  • Updated pydantic to v2 #157
  • Added an audio file global download queue #168
  • Added the new WTL v0.1 License #177 #183 #184

Transcription

  • Added the vocab feature #124
  • Added an internal_vad parameter that helps with empty utterances #142 #173
  • Added a new fallback for empty segments during transcription #149
  • Added the float32 compute type for the transcription model #157
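The empty-utterance handling mentioned above can be sketched as a filter with a retry fallback (helper names are hypothetical, not the project's actual functions):

```python
from typing import Callable, Dict, List

Segment = Dict[str, str]

def drop_empty_segments(segments: List[Segment]) -> List[Segment]:
    """Filter out segments whose text is empty or whitespace-only."""
    return [s for s in segments if s.get("text", "").strip()]

def transcribe_with_fallback(
    segments: List[Segment],
    retry: Callable[[], List[Segment]],
) -> List[Segment]:
    """If filtering leaves nothing, fall back to a retry callable,
    e.g. re-transcribing with different VAD settings."""
    kept = drop_empty_segments(segments)
    return kept if kept else retry()
```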

Diarization

  • Decomposed the diarization process into sub-modules and optimized diarization inference #180

Alignment

  • Added new cs, in, sl and th alignment models #164

Post-processing

  • Improved the post-processing strategy #136 #157
  • Fixed word_timestamps parameter for dual_channel #152

Instructions

  • Improved the contribution instructions #131

Deploy

  • Updated the error payload for Svix in the cortex endpoint #118
  • Updated the Docker image to cuda:11.7.1 #133
  • Updated the Svix payload in the cortex endpoint #144
  • Added an Nginx configuration file for custom deployments #146

Needs improvement / Not fully working

  • Added the possibility to use extra transcription models for specific languages #110

Contributors:
@chainyo @aleksandr-smechov @jissagn

v0.3.1

07 Jun 19:55
d63237c

TL;DR: Transcription is now on steroids, 2x faster than the stock faster-whisper implementation.

API

  • Add time_and_tell decorator on specific functions to time individual processes on debug=True #77
  • Add a LoggingMiddleware on debug=True #77
  • Add a fallback for dual_channel if the audio file is not stereo #87
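A hypothetical reimplementation of a time_and_tell-style decorator (the project's own version may differ) that records and prints a function's wall-clock duration when debug is on:

```python
import functools
import time

def time_and_tell(func=None, *, debug: bool = True):
    """Time a function call and report its duration when debug is True.
    Usable bare (@time_and_tell) or with arguments (@time_and_tell(debug=False))."""
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = f(*args, **kwargs)
            if debug:
                wrapper.last_duration = time.perf_counter() - start
                print(f"{f.__name__} took {wrapper.last_duration:.3f}s")
            return result
        wrapper.last_duration = None
        return wrapper
    return decorator if func is None else decorator(func)
```

Gating the reporting on debug keeps production request paths free of timing overhead in log output.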

Transcription

  • Add quality metrics for the batch process and fallback if the quality is under defined thresholds #89
  • Implement word_timestamps for the batch process #91

Post-processing

  • Fix timestamps format during the post-processing step #86

Contributors:
@chainyo

v0.3.0

02 Jun 07:20
c015930

Documentation

  • Improve .env readability for an easier API configuration #52
  • Add README instructions for profiling container #72

API

  • Add authentication when the API is not in debug mode #56
  • Fix the audio file endpoint inputs #59
  • All submitted files are converted into .wav 16kHz for consistency #60
  • Reworked and more coherent Request/Response models for the API endpoints #60
  • Streamline the post-process functions (with or without alignment/diarization) #63
  • Simplify timestamps conversion in outputs #63
  • Fix blocking non-async functions #67
  • Huge API rework for handling concurrent requests better #71
  • Fix Exception/Error returns through the API so raised errors are more transparent to the user #72
  • VAD now uses the ONNX-based faster-whisper implementation #72

AI models

  • Add alignment (from whisperX) as a new possible step #51
  • Fix alignment for fr, de, es, and it models #59
  • Add dual_channel transcription process for stereo audio files #60
  • Add the choice to use diarization or not #63
  • Implement Batch request process for transcription #72

Deploy

  • Docker is aligned with the local setup now #55
  • Improve Dockerfile and commands to use cache for models #55

Contributors:
@aleksandr-smechov @chainyo

v0.2.0

10 May 13:57
dea46b9

Changes from #31 @chainyo

  • Replace diarization with the NVIDIA NeMo ASR toolkit
  • Update config.py and add validators for necessary config settings
  • Update the Docker image with the latest from NVIDIA
  • Fix dependencies and versions
  • Pin the Python version to 3.9 locally and in Docker
  • New available timestamps format: ms. Users can now choose between hms, s (default), and ms.
  • Remove unused num_speakers parameter.
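A small illustrative helper for the three timestamp formats (not the project's actual conversion code):

```python
def format_timestamp(seconds: float, fmt: str = "s"):
    """Convert a timestamp in seconds to one of the supported formats:
    'hms' -> "HH:MM:SS" string, 's' -> seconds (default), 'ms' -> milliseconds."""
    if fmt == "s":
        return seconds
    if fmt == "ms":
        return int(round(seconds * 1000))
    if fmt == "hms":
        hours, rem = divmod(int(seconds), 3600)
        minutes, secs = divmod(rem, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    raise ValueError(f"Unknown timestamp format: {fmt!r}")
```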

v0.1.0

31 Mar 15:42
a58fd59

First official release:

  • Open-source
  • Fast
  • Easy to deploy
  • Batch requests
  • Cost-effective
  • Easy-to-use