This is the code behind the politsAIkroonika project on Instagram and YouTube. With just a single command, you can produce a short video clip featuring a fictional crime news story in the style of a certain Estonian 90s TV show. The story, audio and video are all 100% AI-generated using various models. The Estonian text-to-speech model used for the news reporter's voice has been custom trained for maximum authenticity.
A brief overview of the process:
- Generate story title, summary and script using OpenAI GPT-3.5 and GPT-4
- Convert script to audio using Voice Cloning App by BenAAndrew
- Generate video clips to illustrate the story using ModelScope Text-to-Video
- Enhance the video using Topaz Video AI (optional but highly recommended - improves resolution and frame rate)
- Merge video clips, audio and subtitles using ffmpeg
- Upload to Google Drive (optional - for the convenience of sharing the clip)
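As an illustration of the final merge step, the ffmpeg invocation can be assembled along the following lines. This is a sketch only: the file names are made up for the example, and the real pipeline may use different flags.

```python
def build_merge_cmd(concat_list, audio, subtitles, output):
    """Assemble an ffmpeg command that concatenates the video clips,
    muxes in the narration audio and burns in the subtitles."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", concat_list,  # text file listing the clips
        "-i", audio,                                      # narration track
        "-map", "0:v:0", "-map", "1:a:0",                 # video from input 0, audio from input 1
        "-vf", f"subtitles={subtitles}",                  # burn in subtitles (requires libass)
        "-c:v", "libx264", "-c:a", "aac",
        "-shortest", output,
    ]

cmd = build_merge_cmd("clips.txt", "narration.wav", "episode.srt", "episode.mp4")
print(" ".join(cmd))
```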
Installing the package and its dependencies is a bit more involved than usual due to the need to install and configure the Voice Cloning App and Topaz Video AI. The following instructions are for Windows, but should be easily adaptable to Linux.
- Python 3.8 or 3.9
- Poetry (tested with 1.6.1)
- ffmpeg
- NVIDIA GPU with at least 8 GB of VRAM (tested on a GTX 1070)
- OpenAI account with API key (API usage is paid, but an episode costs only a few cents)
Optional:
- Topaz Video AI (tested with version 3.2.0)
Whilst this is paid software, it is currently the best available option for frame interpolation and upscaling. Open-source options do exist (RIFE/CAIN/DAIN, etc.), but they would require additional development to integrate.
Clone the repository and install the dependencies using Poetry:
poetry install
Voice Cloning App is used for the text-to-speech functionality and is executed under its own virtual environment. This is because it requires specific versions of various libraries that may conflict with the versions required by this package.
Follow the manual install instructions here, except install the requirements into a virtual environment under `/Voice-Cloning-App/.venv`:
cd Voice-Cloning-App
python -m venv .venv
.venv\Scripts\activate # Or the Linux equivalent
pip install -r requirements.txt
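Because the Voice Cloning App lives in its own virtual environment, it has to be invoked through that environment's interpreter rather than imported directly. A minimal sketch of that pattern is shown below; the script name and its arguments are illustrative, not the actual Voice Cloning App entry point.

```python
import sys
from pathlib import Path

def venv_python(venv_dir):
    """Return the Python executable inside a virtual environment,
    accounting for the Windows (Scripts/) vs POSIX (bin/) layout."""
    venv = Path(venv_dir)
    if sys.platform == "win32":
        return venv / "Scripts" / "python.exe"
    return venv / "bin" / "python"

def tts_command(text, out_wav):
    # Hypothetical invocation; the real synthesis script and its
    # arguments differ - see the Voice-Cloning-App documentation.
    return [str(venv_python("Voice-Cloning-App/.venv")),
            "synthesize.py", "--text", text, "--output", out_wav]

# In real use this would be passed to subprocess.run(cmd, check=True)
print(tts_command("Tere õhtust", "line01.wav"))
```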
The code requires several environment variables to be configured. You may choose to set these in your system environment variables, or in a `.env` file in the root of the repository. For the latter option, you need to install the Poetry dotenv plugin.
The following environment variables are required:
- `OPENAI_API_KEY` - get from your OpenAI account (instructions here)
- The `ffmpeg` executable must be in your `PATH` variable
If you are using Topaz Video AI, the following environment variables are also required:
- `TVAI_MODEL_DIR` and `TVAI_MODEL_DATA_DIR` - set according to instructions here
- `TVAI_FFMPEG` - set to the path of the `ffmpeg` executable in your Topaz Video AI installation (e.g. `C:\Program Files\Topaz Labs LLC\Topaz Video AI\ffmpeg.exe`)
If you are using Google Drive, the following environment variable is also required:
- `GOOGLE_DRIVE_FOLDER_ID` - the ID of the Google Drive folder where the videos will be uploaded. This is a long string of letters and numbers that can be found in the URL of the folder in Google Drive.
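A quick way to verify the configuration is to check for the variables before running anything. The variable names below are taken from the lists above; the helper itself is not part of the project.

```python
import os

REQUIRED = ["OPENAI_API_KEY"]
TOPAZ = ["TVAI_MODEL_DIR", "TVAI_MODEL_DATA_DIR", "TVAI_FFMPEG"]
GDRIVE = ["GOOGLE_DRIVE_FOLDER_ID"]

def missing_vars(use_topaz=False, use_gdrive=False):
    """Return the names of environment variables that still need to be set."""
    wanted = list(REQUIRED)
    if use_topaz:
        wanted += TOPAZ
    if use_gdrive:
        wanted += GDRIVE
    return [name for name in wanted if not os.environ.get(name)]

print(missing_vars(use_topaz=True, use_gdrive=True))
```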
Once everything has been installed, you will need to download the models below and place them in the correct directories (relative to the Voice-Cloning-App directory). If the directories do not exist, create them.
- Voice model - download from here and place in `data/models/reporter`
- Vocoder model - download from here, rename from `g_02500000` to `model.pt` and place in `data/hifigan/vctk`
- Vocoder model config file - download from here and place in `data/hifigan/vctk`
- Alphabet file - copy from `alphabets/Estonian.txt` to `data/languages/Estonian` and rename to `alphabet.txt`
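The expected layout can be sanity-checked with a few lines before the first run. The paths follow the list above; the config file name is an assumption, as the actual downloaded file name may differ.

```python
from pathlib import Path

MODEL_FILES = [
    "data/models/reporter",                 # voice model directory
    "data/hifigan/vctk/model.pt",           # vocoder model (renamed from g_02500000)
    "data/hifigan/vctk/config.json",        # vocoder config (assumed file name)
    "data/languages/Estonian/alphabet.txt", # alphabet file
]

def missing_model_files(root="Voice-Cloning-App"):
    """Return the expected model paths that do not exist yet."""
    base = Path(root)
    return [p for p in MODEL_FILES if not (base / p).exists()]

print(missing_model_files())
```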
The text-to-video model is automatically downloaded by the code.
For Topaz Video AI, if you have a fresh install, you may need to run the GUI first to download the required models. Simply load a video file and process it using the same models that the code uses:
- Apollo v8 (`apo-8`) - frame interpolation
- Theia Fine Tune Detail v3 (`thd-3`) - upscaling
If everything has been installed correctly, you should be able to run the following command to generate a new episode:
poetry run python .\politsaikroonika\make_episode.py
The above command will generate a new episode using the default settings. You can customise the episode using various command line arguments. For example, to avoid the topics of animals, theft, stealing and robbery, and to include fireworks and "new year's celebration", you can run the following command:
poetry run python .\politsaikroonika\make_episode.py -v --interactive --avoid animals,theft,stealing,robbery --include fireworks --include "new year's celebration"
The `-v` flag is for verbose output, and the `--interactive` flag is for interactive mode, which prompts you to confirm the generated text parts before proceeding and lets you override them if you wish.
More information on the available command line arguments can be found by running:
poetry run python .\politsaikroonika\make_episode.py --help
If you are interested in training your own text-to-speech model, you can follow the instructions in the Voice Cloning App repository. For reference, the training data used for the Estonian model included over 1000 sentences with a total duration of around 1.5 hours. The training took approximately 2 days on a GTX 1070.
To gather the training data, I processed all publicly available clips of the original TV show and extracted the audio track. Then, I transcribed the audio using tekstiks.ee (with a fair amount of manual corrections) and used `split_audio.py` and various scripts under `scripts` to split the audio into individual sentences. Background noise was removed using OpenVINO's noise-suppression-poconetlike-0001 model. Finally, the audio was upsampled using NU-Wave2.
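The sentence-splitting step can be approximated with silence detection. The toy below operates on a plain list of amplitude values rather than audio files, and its thresholds are guesses; the actual `split_audio.py` works differently.

```python
def split_on_silence(samples, threshold=0.01, min_gap=3):
    """Split a sequence of amplitude values into segments separated by
    runs of at least `min_gap` near-silent samples. Silent samples are
    dropped, so each returned segment contains only audible values."""
    segments, current = [], []
    quiet_run = 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            # A long enough silent run closes the current segment
            if quiet_run >= min_gap and current:
                segments.append(current)
                current = []
        else:
            quiet_run = 0
            current.append(s)
    if current:
        segments.append(current)
    return segments

print(split_on_silence([0.5, 0.6, 0, 0, 0, 0.4, 0.3, 0, 0.2]))
```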