Audio and video consumption is more prominent than ever, and as a result far more content is available than we have time to go through. Just as an abstract does for a research paper, a quick summary of a video would make it easier to understand what the video is trying to convey. In this project I try to solve this problem using recent developments in deep learning: STT (Speech-to-Text) and TTT (Text-to-Text) transformer models.
Next, we can try to use the generated summary text to build a Text-to-Speech model trained on the voice in the video. So when Obama is the speaker, the summarized text would also be spoken in Obama's voice. Using the video's metadata, we can also try to capture the timeline of the transcription and attempt adaptive automatic video editing for the generated summary.
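As a rough illustration of the timeline idea, the sketch below maps each summary sentence back to the transcript segment it most resembles and returns the time ranges that an editor could cut together. It is not part of the current project: `Segment`, `best_segment`, and `clips_for_summary` are hypothetical names, the sketch assumes the speech-to-text step yields segments with start and end times in seconds, and plain word overlap (Jaccard similarity) stands in for whatever matching method is eventually used. Because a generated summary may paraphrase the speaker, this kind of matching is only approximate.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped piece of the transcript (assumed STT output shape)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def _words(text):
    # Lowercased set of word tokens, ignoring punctuation.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def best_segment(sentence, segments):
    """Return the segment whose words overlap most with the sentence (Jaccard)."""
    target = _words(sentence)

    def score(seg):
        words = _words(seg.text)
        union = target | words
        return len(target & words) / len(union) if union else 0.0

    return max(segments, key=score)


def clips_for_summary(summary_sentences, segments, max_gap=0.5):
    """Pick one segment per summary sentence and merge clips closer than max_gap seconds."""
    if not segments:
        return []
    picked = sorted(
        {(seg.start, seg.end)
         for seg in (best_segment(s, segments) for s in summary_sentences)}
    )
    merged = []
    for start, end in picked:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous clip: extend it instead of cutting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up transcript and summary, for illustration only.
transcript = [
    Segment(0.0, 4.0, "Thank you all for coming today"),
    Segment(4.0, 9.5, "We must invest in clean energy and new jobs"),
    Segment(20.0, 26.0, "Education is the key to our future"),
]
summary = ["Invest in clean energy jobs", "Education is key to the future"]
print(clips_for_summary(summary, transcript))  # [(4.0, 9.5), (20.0, 26.0)]
```

The resulting `(start, end)` pairs could then be handed to a video tool to trim and concatenate the matching parts of the original video. Merging nearby clips is a design choice meant to avoid many very short cuts.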