Chapter 11, part 2, cannot load GloVe #214

Open
zdjordje123 opened this issue Oct 30, 2022 · 1 comment

@zdjordje123

When trying to run the code that prepares the GloVe word-embeddings matrix, towards the end of the notebook for chapter 11, part 2, I get an error:
```python
embedding_dim = 100

# Retrieve the vocabulary indexed by our previous TextVectorization layer.
vocabulary = text_vectorization.get_vocabulary()
# Use it to create a mapping from words to their index in the vocabulary.
word_index = dict(zip(vocabulary, range(len(vocabulary))))

# Prepare a matrix that will be filled with the GloVe vectors.
embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Fill entry i in the matrix with the word vector for index i.
        # Words not found in the embedding index will be all zeros.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
```
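For context, `embeddings_index` and `max_tokens` are defined in earlier cells of the notebook that are not shown here. A minimal sketch of that GloVe-parsing step, assuming the standard glove.6B.100d.txt download and a vocabulary size of 20,000, with the file opened using an explicit UTF-8 encoding so the Windows locale default (often cp1252) cannot interfere:

```python
import numpy as np

max_tokens = 20000  # assumed vocabulary size from the earlier cells
path_to_glove_file = "glove.6B.100d.txt"  # assumed path to the GloVe download

embeddings_index = {}
# Passing encoding="utf-8" matters on Windows, where open() otherwise
# falls back to the locale encoding and can garble non-ASCII tokens.
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        embeddings_index[word] = np.array(coefs.split(), dtype="float32")
```

The full traceback from running the embedding-matrix cell: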
```
UnicodeDecodeError                        Traceback (most recent call last)
Input In [29], in <cell line: 4>()
      1 embedding_dim = 100
      3 # Retrieve the vocabulary indexed by our previous TextVectorization layer.
----> 4 vocabulary = text_vectorization.get_vocabulary()
      5 # Use it to create a mapping from words to their index in the vocabulary.
      6 word_index = dict(zip(vocabulary, range(len(vocabulary))))

File C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\keras\layers\preprocessing\text_vectorization.py:448, in TextVectorization.get_vocabulary(self, include_special_tokens)
    439 def get_vocabulary(self, include_special_tokens=True):
    440     """Returns the current vocabulary of the layer.
    441
    442     Args:
    (...)
    446         vocabulary will not include any padding or OOV tokens.
    447     """
--> 448     return self._lookup_layer.get_vocabulary(include_special_tokens)

File C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\keras\layers\preprocessing\index_lookup.py:336, in IndexLookup.get_vocabulary(self, include_special_tokens)
    334 keys, values = self.lookup_table.export()
    335 vocab, indices = (values, keys) if self.invert else (keys, values)
--> 336 vocab, indices = (self._tensor_vocab_to_numpy(vocab), indices.numpy())
    337 lookup = collections.defaultdict(lambda: self.oov_token,
    338                                  zip(indices, vocab))
    339 vocab = [lookup[x] for x in range(self.vocabulary_size())]

File C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\keras\layers\preprocessing\string_lookup.py:401, in StringLookup._tensor_vocab_to_numpy(self, vocabulary)
    399 def _tensor_vocab_to_numpy(self, vocabulary):
    400     vocabulary = vocabulary.numpy()
--> 401     return np.array([tf.compat.as_text(x, self.encoding) for x in vocabulary])

File C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\keras\layers\preprocessing\string_lookup.py:401, in <listcomp>(.0)
    399 def _tensor_vocab_to_numpy(self, vocabulary):
    400     vocabulary = vocabulary.numpy()
--> 401     return np.array([tf.compat.as_text(x, self.encoding) for x in vocabulary])

File C:\ProgramData\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\util\compat.py:110, in as_text(bytes_or_text, encoding)
    108   return bytes_or_text
    109 elif isinstance(bytes_or_text, bytes):
--> 110   return bytes_or_text.decode(encoding)
    111 else:
    112   raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
```
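The decode fails inside Keras itself: `StringLookup._tensor_vocab_to_numpy()` assumes every stored token is valid UTF-8, and here a token begins with 0xC3, the lead byte of a two-byte UTF-8 sequence, with the continuation byte missing, which points to a multi-byte character having been truncated somewhere upstream. One possible debugging workaround (an assumption, not code from the book or this thread; it uses the same private `_lookup_layer` attribute that appears in the traceback, so it may break across Keras versions) is to export the raw byte vocabulary and decode it leniently:

```python
# Hypothetical workaround: bypass get_vocabulary() and build word_index
# directly from the raw lookup table, decoding each token leniently so
# invalid UTF-8 becomes U+FFFD instead of raising UnicodeDecodeError.
keys, values = text_vectorization._lookup_layer.lookup_table.export()
vocab_bytes = keys.numpy()   # token byte strings
indices = values.numpy()     # their integer indices
word_index = {
    token.decode("utf-8", errors="replace"): int(i)
    for token, i in zip(vocab_bytes, indices)
}
```

This only sidesteps the symptom: any token containing the replacement character will not match a GloVe entry, so its row in `embedding_matrix` stays all zeros.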

@ifond commented Oct 30, 2022 via email
