community[minor]: add document transformer for extracting links #24186

bjchambers · 2024-07-12T15:30:49Z

Description: Add a DocumentTransformer for executing one or more LinkExtractors and adding the extracted links to each document.
Issue: n/a
Dependencies: none

This makes it easy to package up one or more link extractors that operate on `Document` to add links.
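For readers without the diff open, here is a rough sketch of the idea described above: a transformer that runs several link extractors over each document and records the combined links. It is not the code from this PR. `Doc`, `hashtag_extractor`, `url_extractor`, `LinkTransformerSketch`, and the `"links"` metadata key are all hypothetical stand-ins chosen so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence, Set


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# In this sketch an "extractor" is any callable mapping a document to link strings.
Extractor = Callable[[Doc], Set[str]]


def hashtag_extractor(doc: Doc) -> Set[str]:
    """Treat whitespace-separated tokens starting with '#' as links."""
    return {word for word in doc.page_content.split() if word.startswith("#")}


def url_extractor(doc: Doc) -> Set[str]:
    """Treat whitespace-separated tokens starting with 'https://' as links."""
    return {word for word in doc.page_content.split() if word.startswith("https://")}


class LinkTransformerSketch:
    """Run every extractor on every document and store the union of the links."""

    def __init__(self, extractors: Sequence[Extractor]) -> None:
        self.extractors = list(extractors)

    def transform_documents(self, docs: Sequence[Doc]) -> List[Doc]:
        transformed = []
        for doc in docs:
            links: Set[str] = set()
            for extract in self.extractors:
                links |= extract(doc)
            # Build a new metadata dict so the input document is left untouched.
            metadata = dict(doc.metadata)
            metadata["links"] = sorted(links)
            transformed.append(Doc(doc.page_content, metadata))
        return transformed


docs = [Doc("see #python at https://example.com", {"source": "a"})]
out = LinkTransformerSketch([hashtag_extractor, url_extractor]).transform_documents(docs)
```

Packaging the extractors behind one `transform_documents` call means callers configure the extractor list once and then treat link extraction like any other document transformation step.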

libs/community/langchain_community/graph_vectorstores/extractors/link_extractor_transformer.py

libs/community/langchain_community/graph_vectorstores/extractors/link_extractor.py

eyurtsev · 2024-07-15T15:34:01Z

Looks good, pending questions from the author. It would be good to fix the in-place mutation of content, since that will likely lead to bugs in users' code. We can do a shallow copy of the metadata dict if we're only mutating the root of the namespace, and that shouldn't have much of a performance cost.
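A small illustration of the concern raised here, using plain dicts rather than the PR's code. The helper names `add_links_in_place` and `add_links_copy` and the `"links"` key are made up for this sketch.

```python
from typing import Any, Dict, List


def add_links_in_place(metadata: Dict[str, Any], links: List[str]) -> Dict[str, Any]:
    """Mutates the caller's dict: the caller sees a new 'links' key appear."""
    metadata["links"] = links
    return metadata


def add_links_copy(metadata: Dict[str, Any], links: List[str]) -> Dict[str, Any]:
    """Shallow-copies first, so only the new dict gains the 'links' key."""
    new_metadata = dict(metadata)  # copies top-level keys only
    new_metadata["links"] = links
    return new_metadata


original = {"source": "a.html", "tags": ["x"]}
updated = add_links_copy(original, ["https://example.com"])

mutated_input = {"source": "b.html"}
add_links_in_place(mutated_input, ["https://example.com"])
```

The shallow copy is cheap because nested values such as the `tags` list are shared between the two dicts, not duplicated. That is safe as long as only top-level keys are written, which matches the "root of the namespace" condition in the comment above.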

libs/core/langchain_core/graph_vectorstores/links.py

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>

cbornet · 2024-07-22T16:16:42Z

libs/community/langchain_community/graph_vectorstores/extractors/__init__.py

 from langchain_community.graph_vectorstores.extractors.keybert_link_extractor import (
    KeybertInput,
    KeybertLinkExtractor,
 )
-from langchain_community.graph_vectorstores.extractors.link_extractor import (
+
+from .html_link_extractor import (


It seems absolute imports are preferred over relative ones in the codebase? (Although we can find some relative imports here and there.)
@eyurtsev ?

Yes, generally we use absolute imports rather than explicit relative ones. Absolute imports can be ambiguous in some cases, but they tend to be easier to work with when refactoring.

eyurtsev · 2024-07-23T02:00:38Z

libs/community/langchain_community/graph_vectorstores/extractors/link_extractor_transformer.py

+            extract_links.transform_documents(docs)
+    """
+
+    def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):


Should probably be a Sequence in this case? i.e., length is known and one can materialize it more than once?

Suggested change

-    def __init__(self, link_extractors: Iterable[LinkExtractor[Document]]):
+    def __init__(self, link_extractors: Sequence[LinkExtractor[Document]]):
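To make the reasoning behind this suggestion concrete, here is a standalone sketch (not taken from the PR) of what happens when a value typed as `Iterable` turns out to be a one-shot generator and is iterated twice. `count_twice` and the `names` list are illustrative only.

```python
from typing import Iterable, List, Tuple


def count_twice(items: Iterable[str]) -> Tuple[int, int]:
    """Iterate over `items` two times and report how many elements each pass saw."""
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)
    return first, second


names: List[str] = ["html", "keybert"]

# A generator satisfies Iterable but is exhausted after the first pass.
gen_result = count_twice(name for name in names)

# A list is a Sequence: it has a known length and can be iterated repeatedly.
seq_result = count_twice(names)
```

A transformer that loops over its extractors once per document would silently apply no extractors after the first document if it were handed a generator. Typing the parameter as `Sequence`, or materializing the argument with `list(...)` in the constructor, rules that out.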

eyurtsev

Feel free to make a PR for remaining updates -- they're not blocking

add document transformer for extracting links

3789a2e

This makes it easy to package up one or more link extractors that operate on `Document` to add links.

dosubot bot added the size:L label Jul 12, 2024

dosubot bot added the community and 🤖:improvement labels Jul 12, 2024

eyurtsev self-assigned this Jul 12, 2024

bjchambers added 2 commits July 12, 2024 11:54

lint tests

dce486b

imports

6c949ff

eyurtsev approved these changes Jul 15, 2024

dosubot bot added the lgtm label Jul 15, 2024

eyurtsev added the waiting-on-author label Jul 15, 2024

shallow-copy metadata

3c828f1

bjchambers requested a review from eyurtsev July 15, 2024 16:34

eyurtsev approved these changes Jul 16, 2024

eyurtsev changed the title ~~community: add document transformer for extracting links~~ community[minor]: add document transformer for extracting links Jul 16, 2024

eyurtsev enabled auto-merge (squash) July 16, 2024 13:46

fix tests

e4b6311

auto-merge was automatically disabled July 16, 2024 14:45
Head branch was pushed to by a user without write access

bjchambers requested a review from eyurtsev July 18, 2024 13:26

eyurtsev approved these changes Jul 18, 2024

eyurtsev enabled auto-merge (squash) July 18, 2024 13:55

format

55dac68

auto-merge was automatically disabled July 18, 2024 14:52
Head branch was pushed to by a user without write access

bjchambers requested a review from eyurtsev July 18, 2024 14:54

bjchambers added 5 commits July 19, 2024 08:11

Merge branch 'master' into link-extractor-document-transformer

ff6f4c4

Merge branch 'master' into link-extractor-document-transformer

44ff56b

merge

535042c

Merge branch 'master' into link-extractor-document-transformer

45a3a2d

Merge branch 'master' into link-extractor-document-transformer

89fbc3f

fix test and mypy

85f4dd8

eyurtsev reviewed Jul 19, 2024

libs/core/langchain_core/graph_vectorstores/links.py (outdated, resolved)

Update libs/core/langchain_core/graph_vectorstores/links.py

55e33e1

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>

vercel bot deployed to Preview July 19, 2024 18:13

bjchambers requested a review from eyurtsev July 22, 2024 15:32

cbornet reviewed Jul 22, 2024

eyurtsev reviewed Jul 23, 2024

eyurtsev approved these changes Jul 23, 2024

eyurtsev merged commit 5ac936a into langchain-ai:master Jul 23, 2024
97 checks passed
