
[SCRAPER]: Initial implementation #1233

Open: wants to merge 19 commits into main from feature-request/scraper
Conversation

@amadolid (Collaborator) commented Nov 10, 2023

SCRAPER (Playwright Python)

!IMPORTANT!

The remote action scraper needs to be run via FastAPI instead of jaseci serv_action. Remote actions run under uvicorn, which requires running Playwright asynchronously; since remote actions currently use non-async functions, this PR adds a dedicated async scrape API. This allows both local and remote scraping.
To run remotely:

export SCRAPER_SOCKET_ENABLED=true
export SCRAPER_SOCKET_URL=ws://your-websocket/ws # defaults to ws://jaseci-socket/ws
uvicorn jac_misc.scraper:app --host 0.0.0.0 --port 80 --reload
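The PR only names the FastAPI app, not the exact HTTP endpoint it exposes, so the helper below is only a sketch of how a request body for the scrape API could be assembled from the documented argument names (`pages`, `pre_configs`, `detailed`, `target`); everything beyond those names is an assumption:

```python
import json


def build_scrape_payload(pages, pre_configs=None, detailed=False, target=None):
    """Assemble a JSON body from the documented scrape arguments.

    The argument names mirror wbs.scrape; the endpoint that would consume
    this payload is an assumption, as the PR only names the FastAPI app.
    """
    payload = {
        "pages": pages,
        "pre_configs": pre_configs or [],
        "detailed": detailed,
    }
    if target is not None:
        # optional websocket client id for progress notifications
        payload["target"] = target
    return json.dumps(payload)
```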

wbs.scrape

Arguments:
pages: list (structure below)
pre_configs: list (structure below)
detailed: bool = False
target: str = None

Return:
str or dict

Usage:
Scrape the specified URLs.

Remarks:
detailed = True returns a dict that includes the scanned and scraped URLs
target is an optional websocket client id for progress notifications
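As a sketch of consuming the two return shapes (a plain string, or a dict when detailed is set; the exact keys of the detailed dict are an assumption, since the remarks only say it includes the scanned/scraped urls):

```python
def summarize(result):
    """Normalize wbs.scrape output: str when detailed=False, dict otherwise.

    The "scanned"/"scraped" key names are assumptions based on the remark
    that the detailed dict contains the scanned and scraped urls.
    """
    if isinstance(result, str):
        return {"content": result}
    return {
        "content": result.get("content", ""),
        "scanned": result.get("scanned", []),  # urls visited by the crawler
        "scraped": result.get("scraped", []),  # urls whose text was extracted
    }
```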

STRUCTURE
###########################################################################
#                             pages structure                             #
###########################################################################
[{
    # required
    # https://playwright.dev/python/docs/api/class-page#page-goto
    # this will load the targeted URL
    "goto": {
        # required
        "url": "",
        "wait_until": "networkidle",

        # -- these next fields will be popped before goto is called --

        # optional
        # all pre and post scripts have same structure
        "pre_scripts": [{
            # methods from playwright.sync_api.Page
            # https://playwright.dev/python/docs/api/class-page#methods
            "method": "wait_for_selector",

            # all other fields other than "method" will be used as **kwargs
            "**": "value"
        }],
        # optional
        "post_scripts": []
    },

    # optional
    # this will be used for scraping the loaded page
    "getters": [{
        # "selector" | "custom" | "none" | else default
        "method": "default",

        # optional
        # selector == css query selector targeting the elements whose textContent to extract
        # custom == your custom js script that returns a string
        # none == returns an empty string
        # anything else == the whole document.body
        # "expression" is only used by the selector and custom methods
        "expression": "",

        # optional
        # defaults to ["script", "style", "link", "noscript"]
        # elements to remove before extracting textContent
        # only used by the selector and default methods
        "excluded_element": [],

        # optional
        "pre_scripts": [],
        # optional
        "post_scripts": []
    }],

    # optional
    # this option is for crawling clickable navigation such as "a" tags with an href
    # these urls will be appended to the pages field with the default structure,
    # unless they match a regex in pre_configs
    # use pre_configs to apply a different structure to urls matching your preferred regex
    "crawler": {
        # required
        # list of query selection with different attributes
        "queries": [{
                # css query selector
                "selector": "",
                # element attribute that holds the url to crawl
                "attribute": ""
        }],

        # list of regex string that will be included in crawler
        # empty will allow everything
        "filters": [],

        # crawl depth; defaults to zero
        # zero stops further crawling
        "depth": 1,

        "pre_scripts": [],
        "post_scripts": []
    }
}]
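To make the getter methods concrete, here is an illustrative (not the actual) dispatch for one "getters" entry, written against a Playwright-style `page.evaluate`; the real logic lives in jac_misc.scraper, so the function name and exact scripts below are assumptions:

```python
def run_getter(page, getter):
    """Illustrative interpretation of one "getters" entry (a sketch, not
    the implementation; only the method/expression semantics come from
    the structure notes above)."""
    method = getter.get("method", "default")
    expression = getter.get("expression", "")
    if method == "none":
        return ""  # none == empty
    if method == "custom":
        # custom == your js script that returns a string
        return page.evaluate(expression)
    if method == "selector":
        # selector == textContent of elements matched by the css selector
        return page.evaluate(
            f"Array.from(document.querySelectorAll({expression!r}))"
            ".map(e => e.textContent).join('\\n')"
        )
    # anything else == whole document.body
    return page.evaluate("document.body.textContent")
```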

###########################################################################
#                          pre_configs structure                          #
###########################################################################
[{
    # if a crawled url matches this regex, the scraper field below is the
    # structure appended to the pages field for that url
    "regex": "",

    # similar to pages structure without goto.url
    "scraper": {
        "goto": {
            "wait_until": "networkidle",
            "pre_scripts": [],
            "post_scripts": []
        }
    }
}]
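The crawler-to-pre_configs matching described above can be sketched as follows (illustrative names; the real implementation lives in jac_misc.scraper):

```python
import copy
import re


def config_for(url, pre_configs):
    """Pick the scraper structure for a crawled url (illustrative sketch).

    The first pre_config whose regex matches the url wins; otherwise the
    default pages structure is used. goto.url is filled from the crawled link.
    """
    for conf in pre_configs:
        if re.search(conf["regex"], url):
            scraper = copy.deepcopy(conf["scraper"])
            scraper.setdefault("goto", {})["url"] = url
            return scraper
    # no match: append with the default structure, per the crawler notes
    return {"goto": {"url": url, "wait_until": "networkidle"}}
```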
HOW TO TRIGGER
wbs.scrape(
    pages = [{
        "goto": {
            "url": "http://google.com",
            "wait_until": "networkidle",
            "pre_scripts": [],
            "post_scripts": [{
                "method": "evaluate",
                "expression": """
                try {
                    document.querySelector("textarea[id=APjFqb]").value = "speed test";
                    document.querySelector("form[action='/search']:has(input[type=submit][value='Google Search'])").submit();
                } catch (err) { }
                """
            },{
                "method": "wait_for_selector",
                "selector": "#result-stats",
                "state": "visible"
            }]
        },
        "getters": [{
            "method": "default",
        }],
        "crawler": {
            "filters": ["^((?!google\\.com).)*$"],
            "depth": 1
        }
    }],
    pre_configs = [{
        "regex": "speedtest\\.net",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    },{
        "regex": "fast\\.com",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    },{
        "regex": "speedcheck\\.org",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    }],
    detailed = True
)

@amadolid amadolid marked this pull request as ready for review November 28, 2023 13:45