Merge pull request #39 from NavodPeiris/dev
added fp16 support when running regular whisper on gpu
NavodPeiris authored Jun 4, 2024
2 parents f8a4d03 + d94bc22 commit 7c2414f
Showing 7 changed files with 80 additions and 273 deletions.
267 changes: 24 additions & 243 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion examples/transcribe.py
@@ -1,6 +1,6 @@
from speechlib import Transcriptor

file = "obama1.wav" # your audio file
file = "obama_zach.wav" # your audio file
voices_folder = "voices" # voices folder containing voice samples for recognition
language = "en" # language code
log_folder = "logs" # log folder for storing transcripts
52 changes: 36 additions & 16 deletions library.md
@@ -1,3 +1,9 @@
+### Run your IDE as administrator
+
+You will get the following error if administrator permission is missing:
+
+**OSError: [WinError 1314] A required privilege is not held by the client**

### Requirements

* Python 3.8 or greater
@@ -31,13 +37,13 @@ This library does speaker diarization, speaker recognition, and transcription on

This library contains the following audio preprocessing functions:

-1. convert mp3 to wav
+1. convert other audio formats to wav

2. convert stereo wav file to mono

3. re-encode the wav file to have 16-bit PCM encoding

-Transcriptor method takes 6 arguments.
+Transcriptor method takes 7 arguments.

1. file to transcribe

@@ -47,9 +53,11 @@ Transcriptor method takes 6 arguments.

4. model size ("tiny", "small", "medium", "large", "large-v1", "large-v2", "large-v3")

-5. voices_folder (contains speaker voice samples for speaker recognition)
+5. ACCESS_TOKEN: huggingface access token (also get permission to access `pyannote/speaker-diarization@2.1`)

+6. voices_folder (contains speaker voice samples for speaker recognition)

-6. quantization: this determine whether to use int8 quantization or not. Quantization may speed up the process but lower the accuracy.
+7. quantization: this determines whether to use int8 quantization or not. Quantization may speed up the process but lower the accuracy (see the sketch below).

voices_folder should contain subfolders named with speaker names. Each subfolder belongs to one speaker and can contain many voice samples. These samples are used for speaker recognition to identify the speaker.
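
The boolean quantization argument appears to map onto faster-whisper's `compute_type` option (the note in the example below says quantization only works on faster-whisper). As a rough illustration, a minimal sketch using faster-whisper directly, assuming the pinned faster-whisper package and an illustrative model size and audio file:

```
# sketch of int8 quantization in faster-whisper itself, not speechlib's API;
# the model size and file name here are illustrative only
from faster_whisper import WhisperModel

# compute_type="int8" trades some accuracy for speed and memory;
# omit it for the slower, more accurate default float computation
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("obama_zach.wav", language="en")
for segment in segments:
    print(segment.start, segment.end, segment.text)
```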

@@ -64,26 +72,34 @@ transcript will also indicate the timeframe in seconds where each speaker speaks
```
from speechlib import Transcriptor
file = "obama_zach.wav"
voices_folder = "voices"
language = "en"
log_folder = "logs"
modelSize = "medium"
file = "obama_zach.wav" # your audio file
voices_folder = "voices" # voices folder containing voice samples for recognition
language = "en" # language code
log_folder = "logs" # log folder for storing transcripts
modelSize = "tiny" # size of model to be used [tiny, small, medium, large-v1, large-v2, large-v3]
quantization = False # setting this 'True' may speed up the process but lower the accuracy
ACCESS_TOKEN = "your huggingface access token" # get permission to access pyannote/speaker-diarization@2.1 on huggingface
transcriptor = Transcriptor(file, log_folder, language, modelSize, voices_folder, quantization)
# quantization only works on faster-whisper
transcriptor = Transcriptor(file, log_folder, language, modelSize, ACCESS_TOKEN, voices_folder, quantization)
res = transcriptor.transcribe()
# use normal whisper
res = transcriptor.whisper()
# use faster-whisper (simply faster)
res = transcriptor.faster_whisper()
res --> [["start", "end", "text", "speaker"], ["start", "end", "text", "speaker"]...]
```

+#### if you don't want speaker names: keep voices_folder as an empty string ""
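
For instance, a sketch reusing the variables from the block above; with an empty voices_folder, speaker recognition is skipped:

```
# hypothetical call: same signature as above, voices_folder left empty
transcriptor = Transcriptor(file, log_folder, language, modelSize, ACCESS_TOKEN, "", quantization)
res = transcriptor.faster_whisper()
```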

start: starting time of speech in seconds
end: ending time of speech in seconds
text: transcribed text for speech during start and end
speaker: speaker of the text
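
A short sketch of consuming that structure, assuming `res` came from one of the calls above:

```
# res is a list of [start, end, text, speaker] entries
for start, end, text, speaker in res:
    print(f"{speaker} ({start}s - {end}s): {text}")
```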

-voices_folder structure:
+#### voices folder structure:
```
voices_folder
|---> person1
@@ -116,15 +132,16 @@ supported language names:
from speechlib import PreProcessor
file = "obama1.mp3"
+#initialize
+prep = PreProcessor()

# convert mp3 to wav
-wav_file = PreProcessor.convert_to_wav(file)
+wav_file = prep.convert_to_wav(file)

# convert wav file from stereo to mono
-PreProcessor.convert_to_mono(wav_file)
+prep.convert_to_mono(wav_file)

# re-encode wav file to have 16-bit PCM encoding
-PreProcessor.re_encode(wav_file)
+prep.re_encode(wav_file)
```

### Performance
@@ -170,6 +187,9 @@ metrics for faster-whisper "large" model:
transcription time: 343s
```

+#### why not use pyannote/speaker-diarization-3.1, speechbrain >= 1.0.0, faster-whisper >= 1.0.0:
+
+Because the older versions give more accurate transcriptions. This was tested.

This library uses the following huggingface models:

17 changes: 9 additions & 8 deletions requirements.txt
@@ -1,8 +1,9 @@
-transformers
-torch
-torchaudio
-pydub
-pyannote.audio
-speechbrain
-accelerate
-faster-whisper
+transformers==4.36.2
+torch==2.1.2
+torchaudio==2.1.2
+pydub==0.25.1
+pyannote.audio==3.1.1
+speechbrain==0.5.16
+accelerate==0.26.1
+faster-whisper==0.10.1
+openai-whisper==20231117
2 changes: 1 addition & 1 deletion setup.py
@@ -5,7 +5,7 @@

setup(
name="speechlib",
version="1.1.0",
version="1.1.2",
description="speechlib is a library that can do speaker diarization, transcription and speaker recognition on an audio file to create transcripts with actual speaker names. This library also contains audio preprocessor functions.",
packages=find_packages(),
long_description=long_description,
2 changes: 1 addition & 1 deletion setup_instruction.md
@@ -9,7 +9,7 @@ for publishing:
pip install twine

to install locally for testing:
-pip install dist/speechlib-1.1.0-py3-none-any.whl
+pip install dist/speechlib-1.1.2-py3-none-any.whl

finally run:
twine upload dist/*
11 changes: 8 additions & 3 deletions speechlib/transcribe.py
@@ -32,9 +32,14 @@ def transcribe(file, language, model_size, whisper_type, quantization):
Exception("Language code not supported.\nThese are the supported languages:\n", model.supported_languages)
else:
try:
-model = whisper.load_model(model_size)
-result = model.transcribe(file, language=language)
-res = result["text"]
+if torch.cuda.is_available():
+    model = whisper.load_model(model_size, device="cuda")
+    result = model.transcribe(file, language=language, fp16=True)
+    res = result["text"]
+else:
+    model = whisper.load_model(model_size, device="cpu")
+    result = model.transcribe(file, language=language, fp16=False)
+    res = result["text"]

return res
except Exception as err:
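
Read outside the diff, the new code amounts to the following pattern. A standalone sketch, assuming the pinned openai-whisper and torch packages, a hypothetical helper name, and an illustrative audio file:

```
import torch
import whisper  # openai-whisper

# hypothetical helper, not part of speechlib's public API
def transcribe_plain(path, language="en", model_size="tiny"):
    # fp16 halves memory use and speeds up inference on GPU; openai-whisper
    # warns and falls back to fp32 on CPU, so fp16 is enabled only with CUDA
    use_cuda = torch.cuda.is_available()
    model = whisper.load_model(model_size, device="cuda" if use_cuda else "cpu")
    result = model.transcribe(path, language=language, fp16=use_cuda)
    return result["text"]

print(transcribe_plain("obama_zach.wav"))
```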
