
[SCRAPER]: Initial implementation #1233

Open: wants to merge 19 commits into main from feature-request/scraper
Conversation

@amadolid (Collaborator) commented Nov 10, 2023

SCRAPER (Playwright Python)

!IMPORTANT!

The remote action scraper needs to be run via FastAPI instead of jaseci serv_action. Remote actions run under uvicorn, which requires running Playwright asynchronously; since remote actions currently use non-async functions, this PR adds a dedicated async scrape API. This allows both local and remote scraping.
To run remotely:

export SCRAPER_SOCKET_ENABLED=true
export SCRAPER_SOCKET_URL=ws://your-websocket/ws # defaults to ws://jaseci-socket/ws
uvicorn jac_misc.scraper:app --host 0.0.0.0 --port 80 --reload
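The PR only names the FastAPI app, not the exact HTTP endpoint it exposes, so the helper below is only a sketch of how a request body for the scrape API could be assembled from the documented argument names (`pages`, `pre_configs`, `detailed`, `target`); everything beyond those names is an assumption:

```python
import json


def build_scrape_payload(pages, pre_configs=None, detailed=False, target=None):
    """Assemble a JSON body from the documented scrape arguments.

    The argument names mirror wbs.scrape; the endpoint that would consume
    this payload is an assumption, as the PR only names the FastAPI app.
    """
    payload = {
        "pages": pages,
        "pre_configs": pre_configs or [],
        "detailed": detailed,
    }
    if target is not None:
        # optional websocket client id for progress notifications
        payload["target"] = target
    return json.dumps(payload)
```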

wbs.scrape

Arguments:
pages: list (structure below)
pre_configs: list (structure below)
detailed: bool = False
target: str = None

Return:
str or dict

Usage:
Scrape the specified URLs.

Remarks:
detailed = True returns a dict that includes the scanned and scraped URLs
target is an optional websocket client id for progress notifications
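As a sketch of consuming the two return shapes (a plain string, or a dict when detailed is set; the exact keys of the detailed dict are an assumption, since the remarks only say it includes the scanned/scraped urls):

```python
def summarize(result):
    """Normalize wbs.scrape output: str when detailed=False, dict otherwise.

    The "scanned"/"scraped" key names are assumptions based on the remark
    that the detailed dict contains the scanned and scraped urls.
    """
    if isinstance(result, str):
        return {"content": result}
    return {
        "content": result.get("content", ""),
        "scanned": result.get("scanned", []),  # urls visited by the crawler
        "scraped": result.get("scraped", []),  # urls whose text was extracted
    }
```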

STRUCTURE
###########################################################################
#                             pages structure                             #
###########################################################################
[{
    # required
    # https://playwright.dev/python/docs/api/class-page#page-goto
    # this will load the targeted URL
    "goto": {
        # required
        "url": "",
        "wait_until": "networkidle",

        # -- these next fields will be popped before goto is called --

        # optional
        # all pre and post scripts have same structure
        "pre_scripts": [{
            # methods from playwright.sync_api.Page
            # https://playwright.dev/python/docs/api/class-page#methods
            "method": "wait_for_selector",

            # all other fields other than "method" will be used as **kwargs
            "**": "value"
        }],
        # optional
        "post_scripts": []
    },

    # optional
    # this will be used for scraping the loaded page
    "getters": [{
        # "selector" | "custom" | "none" | else default
        "method": "default",

        # optional
        # selector == css query selector targeting the elements whose textContent to extract
        # custom == your custom js script that returns a string
        # none == returns an empty string
        # anything else == the whole document.body
        # "expression" is only used by the selector and custom methods
        "expression": "",

        # optional
        # defaults to ["script", "style", "link", "noscript"]
        # elements to remove before extracting textContent
        # only used by the selector and default methods
        "excluded_element": [],

        # optional
        "pre_scripts": [],
        # optional
        "post_scripts": []
    }],

    # optional
    # this option is for crawling clickable navigation such as "a" tags with an href
    # these urls will be appended to the pages field with the default structure,
    # unless they match a regex in pre_configs
    # use pre_configs to apply a different structure to urls matching your preferred regex
    "crawler": {
        # required
        # list of query selection with different attributes
        "queries": [{
                # css query selector
                "selector": "",
                # element attribute that holds the url to crawl
                "attribute": ""
        }],

        # list of regex string that will be included in crawler
        # empty will allow everything
        "filters": [],

        # crawl depth; defaults to zero
        # zero stops further crawling
        "depth": 1,

        "pre_scripts": [],
        "post_scripts": []
    }
}]
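To make the getter methods concrete, here is an illustrative (not the actual) dispatch for one "getters" entry, written against a Playwright-style `page.evaluate`; the real logic lives in jac_misc.scraper, so the function name and exact scripts below are assumptions:

```python
def run_getter(page, getter):
    """Illustrative interpretation of one "getters" entry (a sketch, not
    the implementation; only the method/expression semantics come from
    the structure notes above)."""
    method = getter.get("method", "default")
    expression = getter.get("expression", "")
    if method == "none":
        return ""  # none == empty
    if method == "custom":
        # custom == your js script that returns a string
        return page.evaluate(expression)
    if method == "selector":
        # selector == textContent of elements matched by the css selector
        return page.evaluate(
            f"Array.from(document.querySelectorAll({expression!r}))"
            ".map(e => e.textContent).join('\\n')"
        )
    # anything else == whole document.body
    return page.evaluate("document.body.textContent")
```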

###########################################################################
#                          pre_configs structure                          #
###########################################################################
[{
    # if a crawled url matches this regex, the scraper field below is the
    # structure appended to the pages field for that url
    "regex": "",

    # similar to pages structure without goto.url
    "scraper": {
        "goto": {
            "wait_until": "networkidle",
            "pre_scripts": [],
            "post_scripts": []
        }
    }
}]
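The crawler-to-pre_configs matching described above can be sketched as follows (illustrative names; the real implementation lives in jac_misc.scraper):

```python
import copy
import re


def config_for(url, pre_configs):
    """Pick the scraper structure for a crawled url (illustrative sketch).

    The first pre_config whose regex matches the url wins; otherwise the
    default pages structure is used. goto.url is filled from the crawled link.
    """
    for conf in pre_configs:
        if re.search(conf["regex"], url):
            scraper = copy.deepcopy(conf["scraper"])
            scraper.setdefault("goto", {})["url"] = url
            return scraper
    # no match: append with the default structure, per the crawler notes
    return {"goto": {"url": url, "wait_until": "networkidle"}}
```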
HOW TO TRIGGER
wbs.scrape(
    pages = [{
        "goto": {
            "url": "http://google.com",
            "wait_until": "networkidle",
            "pre_scripts": [],
            "post_scripts": [{
                "method": "evaluate",
                "expression": """
                try {
                    document.querySelector("textarea[id=APjFqb]").value = "speed test";
                    document.querySelector("form[action='/search']:has(input[type=submit][value='Google Search'])").submit();
                } catch (err) { }
                """
            },{
                "method": "wait_for_selector",
                "selector": "#result-stats",
                "state": "visible"
            }]
        },
        "getters": [{
            "method": "default",
        }],
        "crawler": {
            "filters": ["^((?!google\\.com).)*$"],
            "depth": 1
        }
    }],
    pre_configs = [{
        "regex": "speedtest\\.net",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    },{
        "regex": "fast\\.com",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    },{
        "regex": "speedcheck\\.org",
        "scraper": {
            "goto": {
                "wait_until": "load"
            },
            "getters": [{
                "method": "default",
            }],
        }
    }],
    detailed = True
)

@amadolid amadolid marked this pull request as ready for review November 28, 2023 13:45