
Matching with Synonyms using KeyLLM OR KeyBERT #245

Open
ChettakattuA opened this issue Jul 29, 2024 · 5 comments

Comments

@ChettakattuA

I have been playing with KeyBERT and KeyLLM for a while now, and here is something I would like to achieve.

If I have the text "CO2 emissions are high these days" and a list of candidate words that contains "carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "carbon dioxide" as a match?

Text = "CO2 emissions are high these days"
Candidate keyword list contains ["carbon dioxide"] but not "CO2"

Expected output = ["carbon dioxide"]

@MaartenGr
Owner

If I have the text "CO2 emissions are high these days" and a list of candidate words that contains "carbon dioxide" but not "CO2", will KeyBERT or KeyLLM find "carbon dioxide" as a match?

I think it should be possible if you use it as a candidate word. Have you tried it out?

@ChettakattuA
Author

[screenshot: KeyBERT extraction results]

In this result, the acronym and the synonyms are not identified by KeyBERT:

acronym used: CO2 -> carbon dioxide
synonym used: emission -> release
plural used: emission -> emissions

The code used:

from keybert import KeyBERT

kw_model = KeyBERT()
text = "CO2 emissions are high these days"
candidates = ["carbon dioxide", "emissions", "release", "emission", "co2"]
keywords = kw_model.extract_keywords(text, candidates=candidates)

Is there some way to resolve this?

@MaartenGr
Owner

Ah right, that's because candidates have to appear in the original document in order to be found. Instead, you might want to use the seed_keywords parameter, which allows you to steer the model towards certain words. Note that you might have to use the global perspective here.
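For reference, the kind of synonym matching asked about can also be sketched by hand with embeddings and cosine similarity, outside KeyBERT's candidate mechanism. This is only an illustration: the embed function below is a toy hashed character-trigram vectorizer that keeps the example self-contained, so it only rewards surface overlap; to actually match "CO2" with "carbon dioxide" you would swap in a real sentence-embedding model such as the one KeyBERT loads.

```python
import zlib

DIM = 4096

def embed(text: str) -> list[int]:
    # Toy stand-in for a real embedding model: a hashed
    # character-trigram count vector. Replace with a real
    # sentence-embedding model to capture true synonymy.
    vec = [0] * DIM
    t = text.lower()
    for i in range(len(t) - 2):
        vec[zlib.crc32(t[i : i + 3].encode()) % DIM] += 1
    return vec

def cosine(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sum(x * x for x in a) ** 0.5 * sum(y * y for y in b) ** 0.5
    return dot / norm if norm else 0.0

def best_match(doc: str, candidates: list[str]) -> str:
    # Pick the candidate whose embedding is closest to the document's,
    # regardless of whether it appears verbatim in the document.
    doc_emb = embed(doc)
    return max(candidates, key=lambda c: cosine(doc_emb, embed(c)))
```

With this structure the candidate never has to occur verbatim in the text; the quality of the match depends entirely on the embedding model plugged in.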

@ChettakattuA
Author

But do you know why it requires the word itself to appear in the text? What I understood from the documentation is that it uses embeddings and cosine similarity. Isn't that enough to recognize similar words or synonyms across the text and the candidates?

@MaartenGr
Owner

@ChettakattuA That depends on what you want. Generally, keywords are derived directly from the article itself, as they are often written for SEO reasons. In KeyBERT, candidates are passed to the CountVectorizer as a vocabulary, which means they have to appear in the original documents (the vectorizer is fitted on those documents):

KeyBERT/keybert/_model.py

Lines 163 to 182 in f0f96a6

# Extract potential words using a vectorizer / tokenizer
if vectorizer:
    count = vectorizer.fit(docs)
else:
    try:
        count = CountVectorizer(
            ngram_range=keyphrase_ngram_range,
            stop_words=stop_words,
            min_df=min_df,
            vocabulary=candidates,
        ).fit(docs)
    except ValueError:
        return []
# Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0
# and will be removed in 1.2. Please use get_feature_names_out instead.
if version.parse(sklearn_version) >= version.parse("1.0.0"):
    words = count.get_feature_names_out()
else:
    words = count.get_feature_names()
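The effect of that vocabulary= argument can be seen in isolation with scikit-learn alone: a candidate that never occurs in the fitted document simply gets a count of zero, so it can never surface as a keyword. A minimal demonstration (other parameters from the snippet left at their defaults; ngram_range=(1, 2) is needed so the bigram candidate can be tokenized at all, which KeyBERT exposes as keyphrase_ngram_range):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["CO2 emissions are high these days"]
candidates = ["carbon dioxide", "co2", "emissions"]

# The same call KeyBERT makes internally: the candidate list becomes
# the vectorizer's fixed vocabulary.
count = CountVectorizer(ngram_range=(1, 2), vocabulary=candidates).fit(docs)
matrix = count.transform(docs).toarray()[0]

for term, idx in sorted(count.vocabulary_.items(), key=lambda kv: kv[1]):
    print(term, matrix[idx])
# "carbon dioxide" never appears in the document, so its count stays 0;
# only "co2" and "emissions" are eligible to be scored as keywords.
```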
