
Video Analysis (VIA) Framework

This repository contains the Video Analysis (VIA) Framework, a collection of Google Cloud services that you can use to transcribe video.

(Figure: Video Analysis Framework architecture)

The repository also contains an extended version of the Video Analysis (VIA) Framework, which adds components including Elasticsearch and a web interface that you can use to search for words and phrases within your videos.

(Figure: Extended Video Analysis Framework architecture)

The framework can:

  • Process video files uploaded to Cloud Storage.
  • Enrich the processed video files with the Google Cloud Video Intelligence API.
  • Write the enriched data to BigQuery.
  • With the extended version, add the enriched data to an Elasticsearch index and provide a user interface to search for words and phrases.

The life of a video file in the VIA Framework:

  1. A video file is uploaded to Cloud Storage.
  2. The Cloud Function is triggered on object creation (the google.storage.object.finalize event).
  3. The Cloud Function sends a long-running job request to the Video Intelligence API.
  4. The Video Intelligence API starts processing the video file.
  5. The Cloud Function then sends the job ID from the Video Intelligence API, with additional metadata, to Cloud Pub/Sub.
  6. The Cloud Dataflow job enriches the data.
  7. Cloud Dataflow then writes the data to BigQuery.
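
For orientation, here is a minimal Python sketch of steps 3 and 5. The actual Cloud Function in via-longrun-job-func is written in Node.js, and the feature choice and message shape here are illustrative assumptions, not the repo's exact code:

import json

from google.cloud import pubsub_v1, videointelligence


def on_video_uploaded(bucket: str, name: str, project: str, topic: str) -> None:
    """Start a long-running Video Intelligence job and publish its ID to Pub/Sub."""
    video_client = videointelligence.VideoIntelligenceServiceClient()

    # Request speech transcription for the newly uploaded file (illustrative feature choice).
    context = videointelligence.VideoContext(
        speech_transcription_config=videointelligence.SpeechTranscriptionConfig(
            language_code="en-US"
        )
    )
    operation = video_client.annotate_video(
        request={
            "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
            "input_uri": f"gs://{bucket}/{name}",
            "video_context": context,
        }
    )

    # Publish the long-running job ID plus file metadata for the Dataflow job to pick up.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    message = {"job_id": operation.operation.name, "file": f"gs://{bucket}/{name}"}
    publisher.publish(topic_path, data=json.dumps(message).encode("utf-8"))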

Extended version: the life of a video file in the VIA Framework:

Scroll to the bottom for instructions on how to install the extended version.

  1. A video file is uploaded to Cloud Storage.
  2. The Cloud Function is triggered on object creation (the google.storage.object.finalize event).
  3. The Cloud Function sends a long-running job request to the Video Intelligence API.
  4. The Video Intelligence API starts processing the video file.
  5. The Cloud Function then sends the job ID from the Video Intelligence API, with additional metadata, to Cloud Pub/Sub.
  6. The Cloud Dataflow job enriches the data.
  7. Cloud Dataflow then writes the data to BigQuery.
  8. The pipeline then writes the enriched data to an Elasticsearch index (a sketch follows this list).
  9. The data is now ready to be searched with Elasticsearch.
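
As a sketch of step 8, the extended pipeline can index each enriched record with the elasticsearch 7.x Python client. The credentials and index name below are illustrative assumptions, not the repo's exact values:

import apache_beam as beam
from elasticsearch import Elasticsearch


class IndexToElasticsearch(beam.DoFn):
    """Index each enriched record into an Elasticsearch index."""

    def __init__(self, hosts, index):
        self.hosts = hosts
        self.index = index
        self.client = None

    def setup(self):
        # One client per worker; the credentials are illustrative placeholders.
        self.client = Elasticsearch(self.hosts, http_auth=("elastic", "YOUR_PASSWORD"))

    def process(self, record):
        self.client.index(index=self.index, body=record)
        yield record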

How to install the Video Analysis Framework

  1. Install the Google Cloud SDK.

  2. Create a storage bucket for the Dataflow staging files:

gsutil mb gs://[BUCKET_NAME]/

  3. Through the Google Cloud Console, create a folder named tmp in the newly created bucket for the Dataflow staging files.

  4. Create a storage bucket for uploaded video files:

gsutil mb gs://[BUCKET_NAME]/

  5. Create a BigQuery dataset:

bq mk [YOUR_BIG_QUERY_DATASET_NAME]

  6. Create a Cloud Pub/Sub topic:

gcloud pubsub topics create [YOUR_TOPIC_NAME]

  7. Enable the Cloud Dataflow API:

gcloud services enable dataflow.googleapis.com

  8. Enable the Cloud Video Intelligence API:

gcloud services enable videointelligence.googleapis.com

  9. Deploy the Google Cloud Function.
  • In the cloned repo, go to the via-longrun-job-func directory and deploy the following Cloud Function:

gcloud functions deploy viaLongRunJobFunc --region=us-central1 --stage-bucket=[YOUR_UPLOADED_VIDEO_FILES_BUCKET_NAME] --runtime=nodejs8 --trigger-event=google.storage.object.finalize --trigger-resource=[YOUR_UPLOADED_VIDEO_FILES_BUCKET_NAME]

  10. Deploy the Cloud Dataflow pipeline.
  • Confirm your Python version with python3 --version (the pipeline was developed with Python 3.7.8).
  • In the cloned repo, go to the via-longrun-job-dataflow directory. Run the commands below to set up the environment:

# macOS/Linux
python3 -m venv env
source env/bin/activate
pip3 install 'apache-beam[gcp]'
pip3 install dateparser

  • The Dataflow job will create the BigQuery table you specify in the parameters.
  • Deploy the pipeline with the command below; it might take a few minutes to complete.

python3 vialongrunjobdataflow.py --project=[YOUR_PROJECT_ID] --input_topic=projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME] --runner=DataflowRunner --temp_location=gs://[YOUR_DATAFLOW_STAGING_BUCKET]/tmp --output_bigquery=[DATASET NAME].[TABLE] --requirements_file="requirements.txt" --region=[GOOGLE_CLOUD_REGION]
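
For reference, the deployed pipeline follows the standard streaming Beam shape: read the Pub/Sub messages, enrich them with the Video Intelligence results, and write rows to BigQuery. A minimal sketch, assuming an illustrative message format and schema (this is not the repo's exact code):

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(input_topic: str, output_table: str) -> None:
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=input_topic)
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # The real pipeline polls the Video Intelligence job named in the
            # message and attaches its transcription results; this placeholder
            # just tags the record.
            | "Enrich" >> beam.Map(lambda record: {**record, "enriched": True})
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                output_table,
                schema="file:STRING,enriched:BOOLEAN",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )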

How to install the Extended version of the Video Analysis Framework

The extended VIA Framework requires a working Elasticsearch installation. For more information, see Managed Elasticsearch on Google Cloud.

  1. Install the Google Cloud SDK.

  2. Create a storage bucket for the Dataflow staging files:

gsutil mb gs://[BUCKET_NAME]/

  3. Through the Google Cloud Console, create a folder named tmp in the newly created bucket for the Dataflow staging files.

  4. Create a storage bucket for uploaded video files:

gsutil mb gs://[BUCKET_NAME]/

  5. Create a BigQuery dataset:

bq mk [YOUR_BIG_QUERY_DATASET_NAME]

  6. Create a Cloud Pub/Sub topic:

gcloud pubsub topics create [YOUR_TOPIC_NAME]

  7. Enable the Cloud Dataflow API:

gcloud services enable dataflow.googleapis.com

  8. Enable the Cloud Video Intelligence API:

gcloud services enable videointelligence.googleapis.com

  9. Deploy the Google Cloud Function.
  • In the cloned repo, go to the via-longrun-job-func directory and deploy the following Cloud Function:

gcloud functions deploy viaLongRunJobFunc --region=us-central1 --stage-bucket=[YOUR_UPLOADED_VIDEO_FILES_BUCKET_NAME] --runtime=nodejs8 --trigger-event=google.storage.object.finalize --trigger-resource=[YOUR_UPLOADED_VIDEO_FILES_BUCKET_NAME]

  10. Deploy the Cloud Dataflow pipeline.
  • Confirm your Python version with python3 --version (the pipeline was developed with Python 3.7.8).
  • In the cloned repo, go to the via-longrun-job-dataflow-extended directory.
  • Edit the pipeline to include your Elasticsearch settings on line 100 (see the sketch after this step).
  • Run the commands below to set up the environment:

# macOS/Linux
python3 -m venv env
source env/bin/activate
pip3 install 'apache-beam[gcp]'
pip3 install dateparser
pip3 install elasticsearch

  • The Dataflow job will create the BigQuery table you specify in the parameters.
  • Deploy the pipeline with the command below; it might take a few minutes to complete.

python3 viaextendedlongrunjobdataflow.py --project=[YOUR_PROJECT_ID] --input_topic=projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME] --runner=DataflowRunner --temp_location=gs://[YOUR_DATAFLOW_STAGING_BUCKET]/tmp --output_bigquery=[DATASET NAME].[TABLE] --requirements_file="requirements.txt" --region=[GOOGLE_CLOUD_REGION]
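
The Elasticsearch settings referenced in step 10 might look roughly like this; the endpoint and credentials below are placeholders for your own deployment:

from elasticsearch import Elasticsearch

# Hypothetical connection settings; substitute your own endpoint and credentials.
es_client = Elasticsearch(
    ["https://YOUR_DEPLOYMENT.es.us-central1.gcp.cloud.es.io:9243"],
    http_auth=("elastic", "YOUR_PASSWORD"),
)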
  11. Deploy the search interface.
  • In the cloned repo, go to the via-web/src directory. Edit the Settings.js file to include your Elasticsearch parameters.
  • Run the commands below in the via-web directory to deploy the search interface:

npm run build
gcloud app deploy

  12. The search interface requires Google Cloud Identity-Aware Proxy (IAP); enable IAP for the deployed App Engine service.

  13. Browse to the newly created App Engine service URL.

Notes

  • To search for a phrase, enter your text in quotes, for example: "cloud functions"

(Figure: Video Analysis phrase search)

  • To search for multiple words, enter the words separated by spaces, for example: cloud functions

(Figure: Video Analysis word search)
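
These two styles correspond naturally to Elasticsearch phrase and word matching. An illustrative query pair with the Python client follows; the index and field names are assumptions, not necessarily the interface's exact implementation:

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://YOUR_ELASTICSEARCH_HOST:9243"])

# Quoted input: match the exact phrase.
es.search(index="videos", body={"query": {"match_phrase": {"transcript": "cloud functions"}}})

# Space-separated input: match any of the words.
es.search(index="videos", body={"query": {"match": {"transcript": "cloud functions"}}})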

This is not an officially supported Google product
