wsobanski/transcription

The goal of this project is to create a script for transcribing YouTube videos.

Background

The idea behind this project (not originally mine) is to compare two corpora of political speech, one Polish and one French. To apply NLP tools and methods, the data has to be available as text. After thorough research, it turned out that very few sources provide transcriptions of political speeches, especially in French and Polish.

As a solution, I decided to use speech-to-text tools to transcribe selected YouTube videos.

Implementation

Download YouTube video

The first step of the pipeline is to download the YouTube videos. The pytube library is used for that purpose.

The snippet below shows a simplified version of what the implementation does. First, pytube connects to the given YouTube video and lists all available streams. The results are then filtered down to mp4 files. The only_video=False option is meant to ensure that the selected stream contains sound. Finally, the first stream that meets the criteria is downloaded.

from pytube import YouTube

# Keep only mp4 streams and download the first match to the current directory.
YouTube(link).streams.filter(file_extension='mp4',
                             only_video=False).first().download()
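
As a slightly more defensive variant of the one-liner above, the sketch below wraps the download in a function. It is an illustration rather than the project's actual code: the function name `download_mp4` and the `output_dir` parameter are assumptions, and `progressive=True` is swapped in for `only_video=False` because pytube's progressive streams bundle video and audio in a single file. It also covers the case where `first()` returns `None` because no stream matched the filter.

```python
from typing import Optional


def download_mp4(link: str, output_dir: str = ".") -> Optional[str]:
    """Download the first progressive mp4 stream; return its path or None.

    Hypothetical helper, not part of the original project.
    """
    # Imported lazily so this module can be loaded without pytube installed.
    from pytube import YouTube

    stream = (
        YouTube(link)
        .streams
        .filter(file_extension="mp4", progressive=True)
        .first()
    )
    if stream is None:
        # No stream matched the filter, so there is nothing to download.
        return None
    # download() returns the path of the saved file.
    return stream.download(output_path=output_dir)
```

Returning `None` instead of raising keeps a batch run going when a single video has no suitable stream; the caller can log the link and move on.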

Convert video to audio file

To convert an mp4 file to an audio file (such as mp3), you can use ffmpeg together with the pydub library. Once ffmpeg is installed and configured properly on your machine, you can convert files as shown below:

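from pydub import AudioSegment

# Read the downloaded video; the format argument names the input container.
sound = AudioSegment.from_file(input_file, format="mp4")
# Write the audio in the desired output format.
sound.export(output_file, format="mp3")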
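
The conversion step can also be wrapped so that the output path is derived from the input path. This is a minimal sketch under stated assumptions: `convert_to_audio` is a hypothetical helper name, and the input format is guessed from the file extension, which may not hold for every file.

```python
from pathlib import Path


def convert_to_audio(input_file: str, target_format: str = "mp3") -> str:
    """Convert a media file to audio and return the new file's path.

    Hypothetical helper, not part of the original project.
    """
    # Imported lazily; pydub relies on ffmpeg to decode formats such as mp4.
    from pydub import AudioSegment

    source = Path(input_file)
    # Assumption: the file extension names the input format (e.g. "mp4").
    input_format = source.suffix.lstrip(".").lower()
    # Same directory and stem, new extension.
    output_file = source.with_suffix(f".{target_format}")

    sound = AudioSegment.from_file(str(source), format=input_format)
    sound.export(str(output_file), format=target_format)
    return str(output_file)
```

For example, `convert_to_audio("clips/speech.mp4")` would write `clips/speech.mp3` next to the original file.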

Transcribe audio file

One of the most prominent speech-to-text tools is Google Speech-to-Text, which is exposed as an API and can be called through the Python client. Once it is set up and enabled on GCP, you can send a short audio file (less than 1 minute or 10 MB) directly from your local machine:

from google.cloud import speech


def speech_to_text(
    config: speech.RecognitionConfig,
    audio: speech.RecognitionAudio,
) -> speech.RecognizeResponse:
    client = speech.SpeechClient()

    # Synchronous speech recognition request
    response = client.recognize(config=config, audio=audio)

    return response


def print_response(response: speech.RecognizeResponse):
    for result in response.results:
        print_result(result)


def print_result(result: speech.SpeechRecognitionResult):
    best_alternative = result.alternatives[0]
    print("-" * 80)
    print(f"language_code: {result.language_code}")
    print(f"transcript:    {best_alternative.transcript}")
    print(f"confidence:    {best_alternative.confidence:.0%}")
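
Because the synchronous request only accepts short audio, it can help to check a file locally before sending it. The sketch below does this for WAV files with the standard library only. The helper names are hypothetical, both limits are treated as required at once, and 10 MB is taken as 10 * 1024 * 1024 bytes; all three are assumptions rather than details confirmed by this project.

```python
import os
import wave

# Limits quoted in the text above; the exact byte count is an assumption.
MAX_SYNC_SECONDS = 60
MAX_SYNC_BYTES = 10 * 1024 * 1024


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_sync_request(path: str) -> bool:
    """Return True if the WAV file is under both the size and duration limits."""
    small_enough = os.path.getsize(path) < MAX_SYNC_BYTES
    short_enough = wav_duration_seconds(path) < MAX_SYNC_SECONDS
    return small_enough and short_enough
```

A file that fails this check would need to go through the storage bucket described in the Usage section instead of being sent inline.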

Usage

To run this, you need to create a new GCP project and enable the Speech-to-Text API. You also need to set up a storage bucket and provide its URI and its name as environment variables. After successful authorization, you should be able to run the transcription job.
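
One way to read the bucket settings and fail early when they are missing is sketched below. The variable names `BUCKET_NAME` and `BUCKET_URI` are hypothetical placeholders, since the names the script actually expects are not stated here; adjust them to match the real configuration.

```python
import os
from typing import Mapping, NamedTuple


class BucketSettings(NamedTuple):
    name: str
    uri: str


def load_bucket_settings(env: Mapping[str, str] = os.environ) -> BucketSettings:
    """Read bucket settings from environment variables.

    BUCKET_NAME and BUCKET_URI are assumed names, not taken from the project.
    """
    required = ("BUCKET_NAME", "BUCKET_URI")
    # Treat unset and empty variables the same way.
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return BucketSettings(name=env["BUCKET_NAME"], uri=env["BUCKET_URI"])
```

Passing the environment in as a parameter keeps the function easy to exercise with a plain dictionary.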

For now, files are transcribed in batches. Pass a text file with links to YouTube videos and specify the transcription language as a BCP 47 code:

python main.py --file='path/to/file.txt' --language='en-US'
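
The command-line interface above could be parsed roughly as follows. This is a sketch, not the contents of main.py: the helper names are hypothetical, and the links file is assumed to hold one link per line.

```python
import argparse
from pathlib import Path
from typing import List, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the --file and --language options shown in the command above."""
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos in batch.")
    parser.add_argument("--file", required=True,
                        help="text file with links to YouTube videos")
    parser.add_argument("--language", required=True,
                        help="BCP 47 language code, for example pl-PL or fr-FR")
    return parser.parse_args(argv)


def read_links(path: str) -> List[str]:
    """Return the non-empty lines of the links file (assumed one link per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

With these helpers, `parse_args(["--file=links.txt", "--language=pl-PL"])` yields a namespace whose `file` and `language` attributes can be passed on to the download and transcription steps.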

About

Transcribe any youtube video.
