.sttWithMetadata() produces inaccurate token timings #3687

nikola-ctx · 2021-09-13T22:02:56Z

When transcribing audio (.wav, 16 kHz) using DeepSpeech's sttWithMetadata method, I get inaccurate token start_time values. Compared against the actual waveform in tools such as Audacity, the DeepSpeech timings are invariably late.
Here's the code I use to generate the metadata/transcription (deepspeech version 0.9.3):

import deepspeech
from scipy.io import wavfile

model_path = "deepspeech-0.9.3-models.pbmm"
scorer_path = "deepspeech-0.9.3-models.scorer"
ds_model = deepspeech.Model(model_path)
ds_model.enableExternalScorer(scorer_path)

sr, audio_signal = wavfile.read("some_audio.wav")
metadata = ds_model.sttWithMetadata(audio_signal)
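
To compare the reported timings against a waveform editor, it may help to collapse the per-character tokens into per-word start times and convert them to sample indices. The following is a minimal sketch, not part of the original report: `Token` is a hypothetical stand-in for the token objects found in `metadata.transcripts[0].tokens` (which expose `text` and `start_time`), and `word_start_times` and `to_sample_index` are illustrative helper names. The default sample rate of 16000 is assumed from the audio described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Token:
    # Hypothetical stand-in for a DeepSpeech token: one character of the
    # transcript plus the time (in seconds) at which it was emitted.
    text: str
    start_time: float


def word_start_times(tokens: Iterable[Token]) -> List[Tuple[str, float]]:
    """Group character tokens into (word, start time of first character)."""
    words: List[Tuple[str, float]] = []
    chars: List[str] = []
    start: Optional[float] = None
    for tok in tokens:
        if tok.text == " ":
            # A space token closes the current word, if any.
            if chars:
                words.append(("".join(chars), start))
            chars, start = [], None
        else:
            if not chars:
                start = tok.start_time
            chars.append(tok.text)
    # Flush the last word when the transcript does not end with a space.
    if chars:
        words.append(("".join(chars), start))
    return words


def to_sample_index(seconds: float, sample_rate: int = 16000) -> int:
    """Convert a time in seconds to the nearest sample index."""
    return round(seconds * sample_rate)
```

With a real model, the tokens would presumably come from `metadata.transcripts[0].tokens`; the resulting sample indices can then be checked against the word onsets visible in Audacity.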


ftyers · 2022-09-29T13:54:46Z

See #3693

ftyers closed this as completed Sep 29, 2022
