community[minor]: add document transformer for extracting links #24186
This makes it easy to package up one or more link extractors that operate on `Document` to add links.
libs/community/langchain_community/graph_vectorstores/extractors/link_extractor_transformer.py
libs/community/langchain_community/graph_vectorstores/extractors/link_extractor.py
Looks good -- pending questions from author. It would be good to stop mutating content in place, since that will likely lead to bugs in users' code. We can make a shallow copy of the metadata dict if we're only mutating the root of the namespace, and that shouldn't have much of a performance cost.
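The shallow-copy suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Doc`, `add_links`, and the `"links"` metadata key are hypothetical stand-ins, and the real `Document` class and link storage may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def add_links(doc: Doc, new_links: List[str]) -> Doc:
    """Return a new Doc with links added, leaving the input untouched."""
    # Shallow copy: only the top-level metadata dict is duplicated.
    metadata = dict(doc.metadata)
    # Build a fresh list instead of appending to the caller's list,
    # because a shallow copy still shares nested values with the original.
    metadata["links"] = [*metadata.get("links", []), *new_links]
    return Doc(page_content=doc.page_content, metadata=metadata)


original = Doc("hello", {"source": "a.html", "links": ["x"]})
updated = add_links(original, ["y"])
```

Only the root of the metadata dict is copied, so the cost is proportional to the number of top-level keys rather than the size of the nested values.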
Head branch was pushed to by a user without write access
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
from langchain_community.graph_vectorstores.extractors.keybert_link_extractor import (
    KeybertInput,
    KeybertLinkExtractor,
)
from langchain_community.graph_vectorstores.extractors.link_extractor import (

from .html_link_extractor import (
It seems absolute imports are preferred over relative ones in the codebase (although there are some relative imports here and there).
@eyurtsev?
Yes, generally we use absolute imports rather than explicit relative imports. Absolute imports can be ambiguous in some cases, but they tend to be easier to work with when refactoring.
extract_links.transform_documents(docs)
"""

def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):
This should probably be a `Sequence` in this case, i.e., the length is known and it can be iterated more than once?
- def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):
+ def __init__(self, link_extractors: Sequence[LinkExtractor[Document]]):
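One way to see the concern behind the `Sequence` suggestion: an arbitrary `Iterable` may be a one-shot generator, which is exhausted after the first pass. The sketch below is illustrative only; `Extractor`, `make_extractors`, and `Transformer` are hypothetical names, not the classes from this PR.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical extractor type: takes text, returns a list of "links".
Extractor = Callable[[str], List[str]]


def make_extractors() -> Iterator[Extractor]:
    """Yield extractors lazily; the resulting generator can be consumed once."""
    yield lambda text: [text.upper()]
    yield lambda text: [text.lower()]


class Transformer:
    def __init__(self, extractors: Sequence[Extractor]):
        # Stored as given; the annotation is not enforced at runtime.
        self.extractors = extractors

    def transform(self, texts: List[str]) -> List[List[str]]:
        # The extractors are iterated once per text, so they must be re-iterable.
        return [[link for ex in self.extractors for link in ex(t)] for t in texts]


# Passing a generator (which a type checker would flag against Sequence):
# the second text gets no links because the generator is already exhausted.
lazy_result = Transformer(make_extractors()).transform(["Ab", "Cd"])  # type: ignore[arg-type]

# Passing a real sequence: every text sees every extractor.
eager_result = Transformer(list(make_extractors())).transform(["Ab", "Cd"])
```

Annotating the parameter as `Sequence` lets a type checker reject the generator case up front; materializing with `list(...)` inside `__init__` would be another way to guard against it.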
Feel free to make a PR for remaining updates -- they're not blocking
`LinkExtractor`s and adding the extracted links to each document.