Commit

resolve conflicts
NavodPeiris committed Jun 3, 2024
0 parents commit f8a4d03
Showing 37 changed files with 1,723 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
venv
build
dist
speechlib.egg-info
.env
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Navod Peiris

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
427 changes: 427 additions & 0 deletions README.md

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions examples/.gitignore
@@ -0,0 +1,7 @@
example1.wav
temp
segments
pretrained_models
audio_cache
__pycache__
logs
3 changes: 3 additions & 0 deletions examples/README.md
@@ -0,0 +1,3 @@
##### Run transcribe.py to transcribe an audio file

##### Run preprocess.py to preprocess an audio file
Binary file added examples/obama1.mp3
Binary file not shown.
Binary file added examples/obama1.wav
Binary file not shown.
Binary file added examples/obama_zach.wav
Binary file not shown.
13 changes: 13 additions & 0 deletions examples/preprocess.py
@@ -0,0 +1,13 @@
from speechlib import PreProcessor

file = "obama1.mp3"
# initialize the preprocessor
prep = PreProcessor()
# convert mp3 to wav
wav_file = prep.convert_to_wav(file)

# convert wav file from stereo to mono
prep.convert_to_mono(wav_file)

# re-encode wav file to have 16-bit PCM encoding
prep.re_encode(wav_file)
18 changes: 18 additions & 0 deletions examples/transcribe.py
@@ -0,0 +1,18 @@
from speechlib import Transcriptor

file = "obama1.wav" # your audio file
voices_folder = "voices" # voices folder containing voice samples for recognition
language = "en" # language code
log_folder = "logs" # log folder for storing transcripts
modelSize = "tiny" # size of model to be used [tiny, small, medium, large-v1, large-v2, large-v3]
quantization = False # setting this 'True' may speed up the process but lower the accuracy
ACCESS_TOKEN = "your huggingface access token" # get permission to access pyannote/speaker-diarization@2.1 on huggingface

# quantization only works with faster-whisper
transcriptor = Transcriptor(file, log_folder, language, modelSize, ACCESS_TOKEN, voices_folder, quantization)

# option 1: use normal whisper
res = transcriptor.whisper()

# option 2: use faster-whisper (simply faster); this overwrites res from option 1
res = transcriptor.faster_whisper()
Binary file added examples/voices/obama/obama1.wav
Binary file not shown.
Binary file added examples/voices/obama/obama2.wav
Binary file not shown.
Binary file added examples/voices/zach/zach1.wav
Binary file not shown.
Binary file added examples/voices/zach/zach2.wav
Binary file not shown.
178 changes: 178 additions & 0 deletions library.md
@@ -0,0 +1,178 @@
### Requirements

* Python 3.8 or greater

### GPU execution

GPU execution needs CUDA 11.

GPU execution requires the following NVIDIA libraries to be installed:

* [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
* [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)

There are multiple ways to install these libraries. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.

### Google Colab:

On Google Colab, run this to install the CUDA dependencies:
```
!apt install libcublas11
```

You can see this example [notebook](https://colab.research.google.com/drive/1lpoWrHl5443LSnTG3vJQfTcg9oFiCQSz?usp=sharing)

### Installation:
```
pip install speechlib
```

This library does speaker diarization, speaker recognition, and transcription on a single wav file to provide a transcript with actual speaker names. This library will also return an array containing result information. ⚙

This library contains the following audio preprocessing functions:

1. convert mp3 to wav

2. convert stereo wav file to mono

3. re-encode the wav file to have 16-bit PCM encoding
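
To decide whether a file needs the preprocessing steps listed above, its header can be inspected first. The following is a minimal sketch using only Python's standard `wave` module, not speechlib itself; the helper names `describe_wav` and `needs_preprocessing` are hypothetical, and the check assumes that "ready" means mono with 16-bit samples.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, frame_rate) for a PCM wav file or file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def needs_preprocessing(source):
    """True if the wav is not mono or not 16-bit (hypothetical readiness check)."""
    channels, bits, _ = describe_wav(source)
    return channels != 1 or bits != 16


# Demo: build a short stereo, 8-bit wav in memory and inspect it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)      # stereo
    wav.setsampwidth(1)      # 1 byte per sample = 8-bit
    wav.setframerate(16000)
    wav.writeframes(bytes(2 * 100))  # 100 zero-valued stereo frames
buffer.seek(0)
print(needs_preprocessing(buffer))
```

The standard `wave` module only reads uncompressed PCM data, so an mp3 input would still have to be converted to wav before a check like this can run.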

The Transcriptor constructor takes 6 arguments.

1. file to transcribe

2. log_folder to store transcription

3. language used for transcribing (language code is used)

4. model size ("tiny", "small", "medium", "large", "large-v1", "large-v2", "large-v3")

5. voices_folder (contains speaker voice samples for speaker recognition)

6. quantization: this determines whether to use int8 quantization or not. Quantization may speed up the process but lower the accuracy.

voices_folder should contain subfolders named after the speakers. Each subfolder belongs to one speaker and can contain many voice samples. These samples are used for speaker recognition to identify the speaker.

If voices_folder is not provided, speaker tags will be arbitrary.

log_folder is where the final transcript is stored as a text file.

The transcript also indicates the timeframe, in seconds, in which each speaker speaks.

### Transcription example:

```
from speechlib import Transcriptor
file = "obama_zach.wav"
voices_folder = "voices"
language = "en"
log_folder = "logs"
modelSize = "medium"
quantization = False # setting this 'True' may speed up the process but lower the accuracy
transcriptor = Transcriptor(file, log_folder, language, modelSize, voices_folder, quantization)
res = transcriptor.transcribe()
# res --> [["start", "end", "text", "speaker"], ["start", "end", "text", "speaker"], ...]
```

* start: starting time of speech in seconds
* end: ending time of speech in seconds
* text: transcribed text for speech during start and end
* speaker: speaker of the text
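
The result rows described above can be turned into readable transcript lines with a few lines of plain Python. This is only a sketch: `format_transcript` is a hypothetical helper, not part of speechlib, and the sample `res` values are made up to match the documented `[start, end, text, speaker]` shape.

```python
def format_transcript(segments):
    """Render [start, end, text, speaker] rows as one readable line each."""
    lines = []
    for start, end, text, speaker in segments:
        lines.append(f"{speaker} ({float(start):.1f}s - {float(end):.1f}s): {text}")
    return "\n".join(lines)


# Made-up result in the documented shape (not real transcription output).
res = [
    [0.0, 4.2, "Hello and welcome.", "obama"],
    [4.2, 7.5, "Thanks for having me.", "zach"],
]
print(format_transcript(res))
```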

voices_folder structure:
```
voices_folder
|---> person1
| |---> sample1.wav
| |---> sample2.wav
| ...
|
|---> person2
| |---> sample1.wav
| |---> sample2.wav
| ...
|--> ...
```
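
Before running speaker recognition, it can help to confirm that the folder follows the layout above. The sketch below uses only the standard library; `list_voice_samples` is a hypothetical helper, and it assumes samples are `.wav` files placed directly inside each speaker subfolder.

```python
import tempfile
from pathlib import Path


def list_voice_samples(voices_folder):
    """Map each speaker subfolder name to its sorted .wav sample file names."""
    speakers = {}
    for speaker_dir in sorted(Path(voices_folder).iterdir()):
        if speaker_dir.is_dir():
            speakers[speaker_dir.name] = sorted(p.name for p in speaker_dir.glob("*.wav"))
    return speakers


# Demo: build a throwaway folder with the documented structure and list it.
demo_layout = {"person1": ["sample1.wav", "sample2.wav"], "person2": ["sample1.wav"]}
with tempfile.TemporaryDirectory() as tmp:
    for speaker, samples in demo_layout.items():
        folder = Path(tmp) / speaker
        folder.mkdir()
        for name in samples:
            (folder / name).touch()  # empty placeholder files, not real audio
    layout = list_voice_samples(tmp)
print(layout)
```

A speaker that maps to an empty list has no usable samples in its subfolder.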

supported language codes:

```
"af", "am", "ar", "as", "az", "ba", "be", "bg", "bn", "bo", "br", "bs", "ca", "cs", "cy", "da", "de", "el", "en", "es", "et", "eu", "fa", "fi", "fo", "fr", "gl", "gu", "ha", "haw", "he", "hi", "hr", "ht", "hu", "hy", "id", "is", "it", "ja", "jw", "ka", "kk", "km", "kn", "ko", "la", "lb", "ln", "lo", "lt", "lv", "mg", "mi", "mk", "ml", "mn", "mr", "ms", "mt", "my", "ne", "nl", "nn", "no", "oc", "pa", "pl", "ps", "pt", "ro", "ru", "sa", "sd", "si", "sk", "sl", "sn", "so", "sq", "sr", "su", "sv", "sw", "ta", "te", "tg", "th", "tk", "tl", "tr", "tt", "uk", "ur", "uz", "vi", "yi", "yo", "zh", "yue"
```

supported language names:

```
"Afrikaans", "Amharic", "Arabic", "Assamese", "Azerbaijani", "Bashkir", "Belarusian", "Bulgarian", "Bengali", "Tibetan", "Breton", "Bosnian", "Catalan", "Czech", "Welsh", "Danish", "German", "Greek", "English", "Spanish", "Estonian", "Basque", "Persian", "Finnish", "Faroese", "French", "Galician", "Gujarati", "Hausa", "Hawaiian", "Hebrew", "Hindi", "Croatian", "Haitian", "Hungarian", "Armenian", "Indonesian", "Icelandic", "Italian", "Japanese", "Javanese", "Georgian", "Kazakh", "Khmer", "Kannada", "Korean", "Latin", "Luxembourgish", "Lingala", "Lao", "Lithuanian", "Latvian", "Malagasy", "Maori", "Macedonian", "Malayalam", "Mongolian", "Marathi", "Malay", "Maltese", "Burmese", "Nepali", "Dutch", "Norwegian Nynorsk", "Norwegian", "Occitan", "Punjabi", "Polish", "Pashto", "Portuguese", "Romanian", "Russian", "Sanskrit", "Sindhi", "Sinhalese", "Slovak", "Slovenian", "Shona", "Somali", "Albanian", "Serbian", "Sundanese", "Swedish", "Swahili", "Tamil", "Telugu", "Tajik", "Thai", "Turkmen", "Tagalog", "Turkish", "Tatar", "Ukrainian", "Urdu", "Uzbek", "Vietnamese", "Yiddish", "Yoruba", "Chinese", "Cantonese"
```
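
The two lists above appear to be in the same order, so a code can be paired with its name by position. The sketch below assumes that ordering holds and uses only the first five entries of each list; `language_name` is a hypothetical helper, not a speechlib function.

```python
# First five entries of each list above; extend both lists in the same order.
CODES = ["af", "am", "ar", "as", "az"]
NAMES = ["Afrikaans", "Amharic", "Arabic", "Assamese", "Azerbaijani"]

# Pair codes and names by position (assumes the lists are aligned).
CODE_TO_NAME = dict(zip(CODES, NAMES))


def language_name(code):
    """Return the language name for a code, or raise ValueError if unsupported."""
    try:
        return CODE_TO_NAME[code]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None


print(language_name("ar"))
```

Checking the code this way before starting a transcription fails fast on a typo, before any model is loaded.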

### Audio preprocessing example:

```
from speechlib import PreProcessor
file = "obama1.mp3"
# convert mp3 to wav
wav_file = PreProcessor.convert_to_wav(file)
# convert wav file from stereo to mono
PreProcessor.convert_to_mono(wav_file)
# re-encode wav file to have 16-bit PCM encoding
PreProcessor.re_encode(wav_file)
```

### Performance
```
These metrics are from Google Colab tests.
These metrics do not take into account model download times.
These metrics are done without quantization enabled.
(quantization will make this even faster)

metrics for faster-whisper "tiny" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 64s

metrics for faster-whisper "small" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 95s

metrics for faster-whisper "medium" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 193s

metrics for faster-whisper "large" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 343s
```
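
One way to compare these figures is the real-time factor: total processing time divided by audio duration, where a value below 1 means faster than real time. The sketch below only does arithmetic on the numbers reported above; the names `TIMINGS` and `real_time_factor` are illustrative, and summing the three stages assumes they run one after another.

```python
AUDIO_SECONDS = 6 * 60 + 36  # obama_zach.wav duration: 6 min 36 s

# (diarization, speaker recognition, transcription) times in seconds, from the metrics above.
TIMINGS = {
    "tiny": (24, 10, 64),
    "small": (24, 10, 95),
    "medium": (24, 10, 193),
    "large": (24, 10, 343),
}


def real_time_factor(stage_seconds, audio_seconds=AUDIO_SECONDS):
    """Total processing time divided by audio duration."""
    return sum(stage_seconds) / audio_seconds


for model, stages in TIMINGS.items():
    print(f"{model}: total {sum(stages)}s, real-time factor {real_time_factor(stages):.2f}")
```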


This library uses the following Hugging Face models:

#### https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb
#### https://huggingface.co/Ransaka/whisper-tiny-sinhala-20k-8k-steps-v2
#### https://huggingface.co/pyannote/speaker-diarization
39 changes: 39 additions & 0 deletions metrics.txt
@@ -0,0 +1,39 @@
These metrics are from Google Colab tests.
These metrics do not take into account model download times.
These metrics are done without quantization enabled.
(quantization will make this even faster)

metrics for faster-whisper "tiny" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 64s


metrics for faster-whisper "small" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 95s


metrics for faster-whisper "medium" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 193s


metrics for faster-whisper "large" model:
on gpu:
audio name: obama_zach.wav
duration: 6 min 36 s
diarization time: 24s
speaker recognition time: 10s
transcription time: 343s
21 changes: 21 additions & 0 deletions pyannote-audio_LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
8 changes: 8 additions & 0 deletions requirements.txt
@@ -0,0 +1,8 @@
transformers
torch
torchaudio
pydub
pyannote.audio
speechbrain
accelerate
faster-whisper
24 changes: 24 additions & 0 deletions setup.py
@@ -0,0 +1,24 @@
from setuptools import find_packages, setup

# library.md contains non-ASCII characters, so read it explicitly as UTF-8
with open("library.md", "r", encoding="utf-8") as f:
    long_description = f.read()

setup(
    name="speechlib",
    version="1.1.0",
    description="speechlib is a library that can do speaker diarization, transcription and speaker recognition on an audio file to create transcripts with actual speaker names. This library also contains audio preprocessor functions.",
    packages=find_packages(),
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/NavodPeiris/speechlib",
    author="Navod Peiris",
    author_email="navodpeiris1234@gmail.com",
    license="MIT",
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.10",
        "Operating System :: OS Independent",
    ],
    install_requires=["transformers==4.36.2", "torch==2.1.2", "torchaudio==2.1.2", "pydub==0.25.1", "pyannote.audio==3.1.1", "speechbrain==0.5.16", "accelerate==0.26.1", "faster-whisper==0.10.1", "openai-whisper==20231117"],
    python_requires=">=3.8",
)
19 changes: 19 additions & 0 deletions setup_instruction.md
@@ -0,0 +1,19 @@
to set up for building:
pip install setuptools
pip install wheel

from the repository root, build the package:
python setup.py sdist bdist_wheel

for publishing:
pip install twine

to install locally for testing:
pip install dist/speechlib-1.1.0-py3-none-any.whl

finally, to upload, run:
twine upload dist/*

when prompted, fill in as follows:
username: __token__
password: {your token value}