Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community[minor]: add document transformer for extracting links #24186

Merged

Conversation

bjchambers
Copy link
Contributor

  • Description: Add a DocumentTransformer for executing one or more LinkExtractors and adding the extracted links to each document.
  • Issue: n/a
  • Depedencies: none

This makes it easy to package up one or more link extractors that
operate on `Document` to add links.
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jul 12, 2024
Copy link

vercel bot commented Jul 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 19, 2024 6:13pm

@dosubot dosubot bot added community Related to langchain-community 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 12, 2024
@eyurtsev eyurtsev self-assigned this Jul 12, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 15, 2024
@eyurtsev eyurtsev added the waiting-on-author PR Status: Confirmation from author is required label Jul 15, 2024
@eyurtsev
Copy link
Collaborator

looks good -- pending questions from author. It would be good to fix up the mutation of content in place since that will likely lead to bugs in users code. We can do a shallow copy of the metadata dict if we're just mutating the root of the namespace, and that shouldn't have too much of a performance consequence

@bjchambers bjchambers requested a review from eyurtsev July 15, 2024 16:34
@eyurtsev eyurtsev changed the title community: add document transformer for extracting links community[minor]: add document transformer for extracting links Jul 16, 2024
@eyurtsev eyurtsev enabled auto-merge (squash) July 16, 2024 13:46
auto-merge was automatically disabled July 16, 2024 14:45

Head branch was pushed to by a user without write access

@bjchambers bjchambers requested a review from eyurtsev July 18, 2024 13:26
@eyurtsev eyurtsev enabled auto-merge (squash) July 18, 2024 13:55
auto-merge was automatically disabled July 18, 2024 14:52

Head branch was pushed to by a user without write access

@bjchambers bjchambers requested a review from eyurtsev July 18, 2024 14:54
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
from langchain_community.graph_vectorstores.extractors.keybert_link_extractor import (
KeybertInput,
KeybertLinkExtractor,
)
from langchain_community.graph_vectorstores.extractors.link_extractor import (

from .html_link_extractor import (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems absolute imports are preferred in the codebase rather than relative ones ? (although we can find some relative imports here and there).
@eyurtsev ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, generally we use absolute imports rather than explicit relative. Absolute imports can be ambiguous in some cases, but it tends to be easier to work with them w/ respect to refactors

extract_links.transform_documents(docs)
"""

def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be a Sequence in this case? i.e., length is known and one can materialize it more than once?

Suggested change
def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):
def __init__(self, link_extractors: Sequence[LinkExtractor[Document]]):

Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to make a PR for remaining updates -- they're not blocking

@eyurtsev eyurtsev merged commit 5ac936a into langchain-ai:master Jul 23, 2024
97 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files. waiting-on-author PR Status: Confirmation from author is required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants