Add html parsers #4874

Merged
merged 9 commits into from
May 18, 2023

eyurtsev
Collaborator

Add bs4 html parser

  • Some minor refactors
  • Extract the bs4 html parsing code from the bs html loader
  • Move some tests from integration tests to unit tests
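The split described in the list above, with parsing pulled out of the loader, can be sketched roughly as follows. This is only an illustration and not the code from this PR: `SketchHTMLParser` and `SketchHTMLLoader` are hypothetical names, and the standard library's `html.parser` is swapped in for bs4 so that the sketch has no third-party dependencies.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document (stdlib stand-in for bs4)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


class SketchHTMLParser:
    """Hypothetical parser: turns an HTML string into text and does no I/O."""

    def parse(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return " ".join(collector.chunks)


class SketchHTMLLoader:
    """Hypothetical loader: reads the file and delegates parsing to the parser."""

    def __init__(self, path: str, parser: Optional[SketchHTMLParser] = None) -> None:
        self.path = path
        self.parser = parser or SketchHTMLParser()

    def load(self) -> str:
        with open(self.path, encoding="utf-8") as f:
            return self.parser.parse(f.read())
```

Because the parser in this sketch takes a plain string, it can be exercised in a unit test without touching the filesystem or the network.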

Contributor

@dev2049 dev2049 left a comment

🥳

def __init__(
    self,
    features: str = "lxml",
    bs_kwargs: Optional[Mapping[str, Any]] = None,
Contributor

ooc (out of curiosity) how do you think about passing in kwargs as dict vs **kwargs?

Collaborator Author

why not :)

My main concern tends to be that there might be another function that'll require them... but here maybe it's OK to assume it goes into the initializer by default, and everything else has to be explicitly specified
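To make the trade-off in this thread concrete, here is a minimal sketch of the two styles side by side. Everything in it is hypothetical: `build_soup` and `extract_text` are stand-ins for a constructor call and a second downstream call, and `text_kwargs` is an illustrative parameter that does not appear in this PR.

```python
from typing import Any, Dict, Mapping, Optional


def build_soup(markup: str, features: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a constructor that accepts extra options."""
    return {"markup": markup, "features": features, **kwargs}


def extract_text(soup: Dict[str, Any], separator: str = "") -> str:
    """Hypothetical stand-in for a second call with options of its own."""
    return separator.join(soup["markup"].split())


class DictStyleParser:
    """Explicit mappings: each downstream call gets its own named bag of options."""

    def __init__(
        self,
        features: str = "lxml",
        bs_kwargs: Optional[Mapping[str, Any]] = None,
        text_kwargs: Optional[Mapping[str, Any]] = None,
    ) -> None:
        self.features = features
        self.bs_kwargs = dict(bs_kwargs or {})
        self.text_kwargs = dict(text_kwargs or {})

    def parse(self, markup: str) -> str:
        soup = build_soup(markup, self.features, **self.bs_kwargs)
        return extract_text(soup, **self.text_kwargs)


class StarStyleParser:
    """**kwargs: anything not named explicitly is assumed to go to the constructor."""

    def __init__(self, features: str = "lxml", **bs_kwargs: Any) -> None:
        self.features = features
        self.bs_kwargs = bs_kwargs

    def parse(self, markup: str) -> str:
        soup = build_soup(markup, self.features, **self.bs_kwargs)
        return extract_text(soup)
```

In the dict style, adding options for a second function later is a new named parameter; in the `**kwargs` style, the catch-all is already claimed by the constructor, so any other options have to be spelled out explicitly, which matches the assumption described in the reply above.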


def __init__(
    self,
    features: str = "lxml",
Contributor

should we add this to bs4html loader as well?

Collaborator Author

it's already there, just hidden in the init

@eyurtsev eyurtsev merged commit 0dc304c into master May 18, 2023
@eyurtsev eyurtsev deleted the eugene/add_html_parsers branch May 18, 2023 02:39
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
This was referenced Jun 25, 2023
2 participants