Website Searcher

Summary

This naive implementation of a website searcher uses a producer/consumer pattern to search a list of websites for a particular pattern. To make automated testing easy, the reader, the writer, and the strategy for retrieving URLs are injected by the calling code through the constructors of the WebsiteSearcher and WebsiteSearcherWorker classes.

The WebsiteSearcher class corresponds to the searcher thread. It reads the input from the given reader and adds it to a shared input queue, then polls a shared output queue for results. It is the only thread that writes to the result output file.

Each worker thread, as defined by the WebsiteSearcherWorker class, consumes messages from the input queue one at a time. For each URL/regex pattern pair it consumes, it retrieves the content of the URL using its URL streaming strategy and runs the regular expression against the retrieved content. If any of the content matches, the worker adds the URL to the output queue for the searcher thread to consume.
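
The flow described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `PipelineSketch`, `ContentFetcher`, `Task`, `worker`, and `search` are hypothetical names, the fetcher is stubbed with an in-memory map so the example runs without network access, and the searcher side stops after a quiet period purely to keep the demo finite.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

// All names here are hypothetical; they are not the project's real classes.
class PipelineSketch {

    /** Stand-in for the injected URL streaming strategy. */
    interface ContentFetcher {
        String fetch(String url) throws Exception;
    }

    /** One input message: a URL paired with the pattern to search for. */
    static final class Task {
        final String url;
        final Pattern pattern;

        Task(String url, Pattern pattern) {
            this.url = url;
            this.pattern = pattern;
        }
    }

    /** Worker loop: take a task, fetch its content, publish the URL on a match. */
    static Runnable worker(BlockingQueue<Task> input, BlockingQueue<String> output,
                           ContentFetcher fetcher) {
        return () -> {
            try {
                while (true) {
                    Task task = input.take();
                    try {
                        String content = fetcher.fetch(task.url);
                        if (task.pattern.matcher(content).find()) {
                            output.put(task.url);
                        }
                    } catch (InterruptedException e) {
                        throw e; // handled below: stop the worker
                    } catch (Exception e) {
                        // Mirrors the README: report the failure and move on.
                        System.err.println("Could not search " + task.url + ": " + e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** Searcher side: enqueue every URL, then drain the output queue. */
    static Set<String> search(List<String> urls, Pattern pattern,
                              ContentFetcher fetcher, int workerCount) {
        BlockingQueue<Task> input = new LinkedBlockingQueue<>();
        BlockingQueue<String> output = new LinkedBlockingQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(worker(input, output, fetcher));
            t.setDaemon(true);
            t.start();
            workers.add(t);
        }
        Set<String> matches = new TreeSet<>();
        try {
            for (String url : urls) {
                input.put(new Task(url, pattern));
            }
            // Demo-only shortcut: stop after a quiet period instead of blocking forever.
            String hit;
            while ((hit = output.poll(1, TimeUnit.SECONDS)) != null) {
                matches.add(hit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            workers.forEach(Thread::interrupt);
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://a.example", "cats and dogs");
        pages.put("http://b.example", "nothing here");
        // Map::get plays the role of a real HTTP fetch.
        System.out.println(search(new ArrayList<>(pages.keySet()),
                Pattern.compile("and"), pages::get, 2));
    }
}
```

Because the fetcher is passed in, a test can supply canned content as above, which is the kind of injection the summary describes.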

Build

This project uses Gradle; see http://www.gradle.org for more information.

To clean: ./gradlew clean
To build: ./gradlew build

Running

This project is written using Java 8 and provides a Java 8 compatible jar file.

To run (where project_dir is the root of the project as checked out from git):

cd project_dir/dist
java -jar website-searcher.jar

This assumes you have Java installed and available on your path.

The main method of this application spawns a WebsiteSearcher that parses the urls.txt file in the working directory and writes results to results.txt in the same directory. Each WebsiteSearcherWorker produces output for the searcher thread to consume if the content retrieved from the URL contains the word "and" anywhere.
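
As a rough illustration of what matching "and" anywhere in the content can mean, the sketch below assumes the pattern is the literal regex `and` applied with `Matcher.find()`; the project's actual regex is not shown in this README, so treat that as an assumption. `MatchSketch` and its members are hypothetical names. A word-boundary variant is included only for contrast, since a plain substring search also matches words such as "sandwich".

```java
import java.util.regex.Pattern;

// Hypothetical helper; the pattern below is an assumption, not the project's code.
class MatchSketch {
    // Matches the letters "and" anywhere, including inside longer words.
    static final Pattern ANYWHERE = Pattern.compile("and");
    // Stricter variant: "and" only as a standalone word.
    static final Pattern WHOLE_WORD = Pattern.compile("\\band\\b");

    static boolean containsAnywhere(String content) {
        // find() looks for a match at any position; matches() would need the whole input to match.
        return ANYWHERE.matcher(content).find();
    }

    static boolean containsWholeWord(String content) {
        return WHOLE_WORD.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAnywhere("salt and pepper"));  // true
        System.out.println(containsAnywhere("sandwich"));         // true
        System.out.println(containsWholeWord("sandwich"));        // false
        System.out.println(containsAnywhere("AND"));              // false (case-sensitive)
    }
}
```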

Caveats

Error handling is fairly minimal, since the product specification does not give much detail on how errors ought to be handled. Errors in the worker threads are printed to standard error and ignored (note that this may create rather verbose output when a site is unreachable or its content unreadable). Other errors generally propagate to the top level. This could be refined in future iterations of this product.

This application does not terminate on its own. The searcher thread has no way of knowing how many results it will receive (the workers could return anywhere from 0 to N matches, where N is the number of inputs) or when they will arrive, so it continues to block on the output queue waiting for more. To terminate the application, press CTRL+C or send SIGTERM from a terminal, and the application will clean up and exit. SIGKILL also stops the process, but it cannot be caught, so no cleanup runs.
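
One possible refinement for the termination caveat, sketched here under assumptions rather than taken from the project: if every input produced exactly one outcome (a match or an explicit "no match"), the searcher could count outcomes and stop after N of them. `TerminationSketch` and `collectMatches` are hypothetical names, and the `Predicate` stands in for the fetch-and-match step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

// Hypothetical sketch of a self-terminating variant; not the project's behavior.
class TerminationSketch {

    static List<String> collectMatches(List<String> urls, Predicate<String> matches,
                                       int workerCount) {
        BlockingQueue<Optional<String>> outcomes = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (String url : urls) {
            pool.execute(() -> {
                Optional<String> outcome = Optional.empty();
                try {
                    if (matches.test(url)) {
                        outcome = Optional.of(url);
                    }
                } catch (RuntimeException e) {
                    System.err.println("Could not search " + url + ": " + e);
                }
                // Always report, even on failure, so the consumer's count stays exact.
                outcomes.add(outcome);
            });
        }
        List<String> found = new ArrayList<>();
        try {
            // Exactly one outcome per input, so the loop knows when it is done.
            for (int i = 0; i < urls.size(); i++) {
                outcomes.take().ifPresent(found::add);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example", "http://b.example");
        // The predicate is a placeholder for "fetch the page and test the regex".
        System.out.println(collectMatches(urls, url -> url.contains("a."), 2));
    }
}
```

The trade-off is extra traffic on the output queue (one message per input rather than one per match) in exchange for a clean exit without CTRL+C.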

Repository: dkaminsky/website-searcher
