This project hosts a powerful and versatile web crawler module designed for PowerShell. Harness the capabilities of this module to explore and interact with the web, gather data, and automate web-related tasks with ease.

PSWebCrawler

PowerShell Web Crawler


Overview

PSWebCrawler is a PowerShell module that provides web crawling and URL processing functionality. It allows you to crawl web pages, extract information, and store data for analysis or further processing. The module depends on the HtmlAgilityPack (HAP) library, a .NET library for parsing and manipulating HTML documents.

PSWebCrawler module features

  • Web crawling functionality to retrieve HTML content from specified URLs
  • Ability to extract and save images from web pages
  • Extraction and storage of contact information from web pages
  • Retrieval and storage of HTML head data
  • Cache folder creation and data folder setup

Installation

The module is available on PowerShell Gallery.

Install-Module -Name PSWebCrawler

Import module:

Import-Module -Name PSWebCrawler

To list all commands in the installed module, including cmdlets, functions, and aliases:

Get-Command -Module PSWebCrawler
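
To check which version is installed, or to update to the latest release, the standard PowerShellGet cmdlets apply:

Get-Module -ListAvailable -Name PSWebCrawler
Update-Module -Name PSWebCrawler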

Usage

Crawl

  • PSWC (alias of Start-PSWebCrawler). When the PSWC command is run alone, it displays a menu of options and usage examples for the PSWebCrawler module. The menu shows how to use PSWC with different parameters: crawling web pages to a specified depth, displaying the cache folder, extracting specific elements from web pages, and so on.

    Start module to display possible options and examples:

    PSWC

    or

    Start-PSWebCrawler
  • When the PSWC command is used with the url and depth parameters, it initiates web crawling starting from the specified URL and continuing to the specified depth. This lets you extract various elements from the crawled pages, such as href elements, non-href elements, domains, and internal links, up to that depth. The command also logs and stores the extracted data in the default log folder (the user's Documents folder), giving you a record of the crawling process and the extracted elements for further analysis or reference; a sketch of the underlying link extraction follows this list:

    PSWC -Url "https://example.com" -Depth 2
  • The onlyDomains switch restricts the crawler to the domains found in the href elements of the page being crawled. The crawler stays within the domains referenced by the links on the page rather than following links to external domains:

    PSWC -Url "https://example.com" -Depth 2 -onlyDomains
  • The outputFolder parameter specifies the path where the log and data folders are created when the Url and Depth parameters are used. This is useful for organizing and managing the output of a crawl. If outputFolder is not provided, the log and data folders are stored in the default location: the user's Documents folder, under the 'PSWebCrawler' directory:

    PSWC -Url "https://example.com" -Depth 2 -outputFolder "C:\temp\logs\"

    More about log and data folder: Default Log and Data Folder

  • The resolve switch is used when initiating a web crawl with the Url and Depth parameters. When it is set, the crawler resolves the domain name of each processed URL to an IP address:

    PSWC -Url "https://example.com" -Depth 2 -resolve
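
The crawl itself builds on HtmlAgilityPack. The sketch below is illustrative only, not the module's internal code; it assumes the HtmlAgilityPack assembly is already loaded and shows how href extraction at a single crawl level might look:

# Illustrative sketch only -- not the module's actual implementation.
# Assumes the HtmlAgilityPack assembly is already loaded.
$web = [HtmlAgilityPack.HtmlWeb]::new()
$doc = $web.Load('https://example.com')
$hrefs = $doc.DocumentNode.SelectNodes('//a[@href]') |
    ForEach-Object { $_.GetAttributeValue('href', '') } |
    Sort-Object -Unique
$hrefs   # candidate links for the next crawl level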

ShowCacheFolder

  • The ShowCacheFolder option opens the default log and data folder (the user's Documents folder under the 'PSWebCrawler' directory) in Windows File Explorer:

    PSWC -ShowCacheFolder

    More about log and data folder: Default Log and Data Folder
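
    For reference, this is roughly what the option does under the default location (an assumption for illustration, not the module's exact code):

    # Roughly equivalent to -ShowCacheFolder (illustration only):
    Invoke-Item (Join-Path ([Environment]::GetFolderPath('MyDocuments')) 'PSWebCrawler')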

ShowAllElements

  • The PSWC command with the ShowAllElements option extracts elements from the specified URL: href elements, non-href elements, domains, and internal links. The Type parameter specifies which type of elements to extract. It can be set to one of the following values:

    • Href: extracts only elements with an href attribute.
    • noHref: extracts only elements without an href attribute.
    • onlyDomains: extracts only the domains from the href attributes.
    • All: extracts all elements from the specified URL.

    To use this option, you can run the following command:

    PSWC -ShowAllElements -Type All -Url 'https://www.example.com'

    This example extracts all elements from the URL "https://www.example.com".
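
    To list only the domains referenced by the page, run the same command with a different Type value:

    PSWC -ShowAllElements -Type onlyDomains -Url 'https://www.example.com'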

GetImageUrls

  • The GetImageUrls option in PSWC retrieves the URLs of all images in an HTML document. The retrieved URLs are saved in a text file named Images.txt in the session folder. To use this option, run the command:

    PSWC -GetImageUrls -url "https://example.com"

    The saved file path will be displayed in the console output.
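
    As a hypothetical follow-up, the saved list can drive a bulk download; $imagesTxt below stands for the Images.txt path printed to the console:

    # Hypothetical follow-up; $imagesTxt holds the path printed by PSWC.
    Get-Content $imagesTxt | ForEach-Object {
        $name = [IO.Path]::GetFileName(([Uri]$_).AbsolutePath)
        if ($name) { Invoke-WebRequest -Uri $_ -OutFile (Join-Path $env:TEMP $name) }
    }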

GetHTMLMetadata

  • The GetHTMLMetadata option in the PSWC module retrieves metadata from the specified HTML content and displays it. The metadata includes the title, description, keywords, author, copyright, robots, viewport, and generator. The retrieved data is saved in a text file named metadata.txt in the session folder. To use this option, run the command:

    PSWC -GetHTMLMetadata -url "https://example.com"

    The saved file path will be displayed in the console output.
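
    This kind of extraction maps naturally onto HtmlAgilityPack XPath queries. The following is an illustrative sketch, not the module's actual code:

    # Illustrative only -- reading common <meta> values with HtmlAgilityPack.
    $doc = [HtmlAgilityPack.HtmlWeb]::new().Load('https://example.com')
    $metadata = @{}
    $titleNode = $doc.DocumentNode.SelectSingleNode('//title')
    if ($titleNode) { $metadata['title'] = $titleNode.InnerText.Trim() }
    foreach ($name in 'description','keywords','author','copyright','robots','viewport','generator') {
        $node = $doc.DocumentNode.SelectSingleNode("//meta[@name='$name']")
        if ($node) { $metadata[$name] = $node.GetAttributeValue('content', '') }
    }
    $metadata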

GetContactInformation

  • The GetContactInformation option in the PSWebCrawler PowerShell module extracts contact information (emails, addresses, phone numbers) from an HTML document. Once the function is executed, it will return the extracted contact information in JSON format. Additionally, the option saves the extracted contact information to a file named Contact.json in the session folder. The session folder is created automatically and its path is logged in the "Settings.log" file. To call the option, use the following command:

    PSWC -GetContactInformation -url "https://example.com"

    The saved file path will be displayed in the console output.
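
    Because the output is JSON, the saved file can be loaded back as objects; $contactJson below stands for the Contact.json path printed to the console:

    # $contactJson holds the path printed by PSWC.
    $contact = Get-Content -Raw $contactJson | ConvertFrom-Json
    $contact   # inspect the extracted emails, addresses, and phone numbers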

GetHeadersAndValues

  • The GetHeadersAndValues option in the PSWebCrawler PowerShell module extracts headers and their corresponding values from the <head> section of an HTML document. To use this option, provide the URL of the website you want to crawl via the url parameter. The extracted headers and values are returned as a hashtable. You can save the output to a specific location by supplying the outputFolder parameter; without outputFolder, the extracted data is saved to a file named HTMLhead.json in the default place, the session folder. To call the option, use the following command:

    PSWC -GetHeadersAndValues -url "https://example.com"
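
    Since the result is returned as a hashtable, it can be captured and queried directly (the available keys depend on the page):

    $head = PSWC -GetHeadersAndValues -url "https://example.com"
    $head.Keys   # list the extracted header names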

Default Log and Data Folder

The PSWebCrawler module uses a default folder for storing log files and data. If no specific path is provided, the module uses the <User's Documents folder>/PSWebCrawler/ folder.

You can override this default by providing a custom path via the outputFolder parameter when using the module's functions. For example:

PSWC -Url 'https://example.com' -Depth 2 -outputFolder 'C:\Crawl\LOGandDATA\'    

INFO: The default location for the log and data folder is the user's Documents folder. If the folder does not exist, it is created automatically when the module is imported.

Versioning

We use SemVer for versioning.

Contributing

We welcome contributions from the community! Feel free to submit pull requests, report issues, or suggest new features to make the framework even more powerful and user-friendly.

To get started, clone the PSWebCrawler repository to your local machine.

License

PSWebCrawler is released under the MIT License.

Contact: If you have any questions or need assistance, please feel free to reach out to us via GitHub Issues.

Join us on the journey to make PowerShell scripting a truly awesome experience!
