
Input Dimensionality Mismatch #56

Answered by ghenter
Shrey-55 asked this question in Q&A

Hi @Shrey-55,

My understanding is that the first step is that each frame (the high-dimensional text+audio vector) is passed through a feed-forward network that encodes it into a 124-dimensional vector. (If you are familiar with CNNs, this can alternatively be seen as a "1x1 convolution".) I don't know where in the code this happens, but the paper does include a description of this dimensionality reduction:

First, the text and audio features of each frame are jointly encoded by a feed-forward neural network to reduce dimensionality.
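To make the "1x1 convolution" analogy concrete, here is a minimal PyTorch sketch (not code from this repository; the input feature size, batch size, and sequence length are made up for illustration). The same frame-wise projection can be written either as an nn.Linear applied to every frame or as an nn.Conv1d with kernel size 1 over the time axis, and the two produce identical outputs:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only: each frame is a concatenated
# text+audio feature vector of size frame_dim, reduced to reduced_dim.
frame_dim = 900     # assumed combined text+audio feature size (not from the repo)
reduced_dim = 124   # the reduced dimensionality mentioned above
batch, seq_len = 8, 50

frames = torch.randn(batch, seq_len, frame_dim)

# View 1: a feed-forward (Linear) layer applied independently to every frame.
ff = nn.Linear(frame_dim, reduced_dim)
out_ff = ff(frames)                                   # (batch, seq_len, reduced_dim)

# View 2: the same projection expressed as a 1x1 convolution over time.
conv = nn.Conv1d(frame_dim, reduced_dim, kernel_size=1)
with torch.no_grad():
    # Copy the Linear weights so both views compute the same function.
    conv.weight.copy_(ff.weight.unsqueeze(-1))        # (out, in, 1)
    conv.bias.copy_(ff.bias)
out_conv = conv(frames.transpose(1, 2)).transpose(1, 2)

print(torch.allclose(out_ff, out_conv, atol=1e-6))    # True
```

Either formulation processes each frame independently, so no information is mixed across time at this stage; the reduction is purely per-frame.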

Answer selected by Svito-zar