Python web scraper ScrApp

Setting up the virtual environment

This is similar to npm init, but Python involves a bit more setup.

Let's initialize the virtual environment with:

  $ python3 -m venv env

Then, to keep the project's dependencies isolated, we need to activate it with:

  $ source env/bin/activate

It can later be deactivated simply by running $ deactivate.
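As an optional sanity check, Python itself can tell whether the interpreter currently running lives inside the venv:

  import sys

  # Inside an activated venv, sys.prefix points at the env folder,
  # while sys.base_prefix still points at the system installation.
  print(sys.prefix != sys.base_prefix)  # True when the venv is active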

Installing the dependencies

We are going to use the requests and BeautifulSoup packages, which we can now install inside the venv with:

  $ pip install requests BeautifulSoup4

Following the BeautifulSoup documentation, the best way to use it is with a third-party parser, lxml. Let us install it as well:

  $ pip install lxml
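To show where the parser comes in: when building a soup, the parser is passed explicitly as the second argument (the markup below is just a placeholder):

  from bs4 import BeautifulSoup

  html = "<html><body><h1>Hello</h1></body></html>"  # placeholder markup
  soup = BeautifulSoup(html, "lxml")  # ask BeautifulSoup to use lxml
  print(soup.h1.text)  # -> Hello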

In order to communicate with the MySQL server, we also need to install a MySQL/Python connector:

  $ pip install mysql-connector-python

To sum up, the extra dependencies needed are:

  • requests
  • BeautifulSoup4
  • lxml
  • mysql-connector-python

Moreover, the scraper will depend on the following standard libraries:

  • time
  • json
  • contextlib
  • hashlib

These should already come with a basic Python installation, but if the program does not run, make sure they are installed.
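To give an idea of how these pieces fit together, here is a minimal, illustrative sketch; the function name, parameters, and flow are assumptions for illustration, not code taken from scrApp.py:

  import hashlib
  import json
  import time
  from contextlib import closing

  import requests

  def fetch_movie(title, api_key):
      # query the OMDb API, making sure the connection is closed afterwards
      with closing(requests.get("http://www.omdbapi.com/",
                                params={"apikey": api_key, "t": title},
                                stream=True)) as response:
          data = json.loads(response.text)
      # hashlib yields a stable filename for the downloaded poster
      poster_name = hashlib.md5(data["Poster"].encode()).hexdigest()
      time.sleep(1)  # throttle successive requests
      return data, poster_name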

The scraper

The scraper is named scrApp.py; it depends on files from the ./Modules folder, which contain function definitions arranged so as to keep everything neat. These files are found in the repo.

In addition to these files, two other files need to be created by hand: the keys for the API and the MySQL database. For this purpose, create a file called api.key in which you insert the OMDb API key; for clarity (this is not an actual key):

S0m3NumB3R54nDd1git5

Then, touch a text file named database_credential.key with the following structure:

<database-user-name>
<database-user-password>
<host-address>
<database-name>
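For illustration, the two key files could be read at startup like this (a minimal sketch; scrApp.py may load them differently):

  with open("api.key") as key_file:
      api_key = key_file.read().strip()

  with open("database_credential.key") as cred_file:
      # one value per line: user, password, host, database name
      db_user, db_password, db_host, db_name = cred_file.read().split()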

I chose to run the MySQL database in a Docker container, which was created with the following command:

  $ docker run -p 3306:3306 -d --rm --name <container-name> \
      -e MYSQL_USER=<database-user-name> \
      -e MYSQL_ROOT_PASSWORD=<database-root-password> \
      -e MYSQL_DATABASE=<database-name> \
      -e MYSQL_PASSWORD=<database-user-password> \
      -v $(pwd)/.data:/var/lib/mysql \
      mysql:5.7

The flags have the following meanings:

  • -p ...: opens the port between localhost and the container. The second port number is the standard MySQL port.
  • -d: gives focus back to the shell. If this is not passed, once the container is launched the shell will wait for it to stop before regaining focus; Ctrl-D will not stop it.
  • --name ...: assigns a name to the container. Very useful; otherwise Docker generates a name automatically, but it is usually long. In any case, the first four characters of the container hash suffice to refer back to it.
  • -e ...: passes an environment variable. The available variables are defined in the Dockerfile of the base image, so further info can be looked up on the mysql Docker Hub page.
  • -v ...: bind-mounts a local folder into the container. As we are running it, $(pwd) executes the pwd command and then concatenates the .data folder, which is created if it does not exist already. After the colon is the path inside the container that is linked to the folder.
  • mysql:5.7: specifies the base image. This is found either locally or on the Docker Hub servers; after the colon is the version being used.

In this way the container runs in its own isolated environment, and it is easy to fire it up or tear it down. Make sure that the fields in the command above match the entries in the credential file.

If you decide to use the MySQL database on your local machine, then the host field in the credential file should simply be "localhost", optionally followed by the port. However, if you go the container route instead, retrieve the correct address by running:

  $ docker inspect <container-name>

and then looking at the "IPAddress" entry.
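Whichever address you end up with, a minimal connection test with mysql-connector-python looks like this (the variable names follow the credential-reading sketch above and are illustrative):

  import mysql.connector

  connection = mysql.connector.connect(
      user=db_user,
      password=db_password,
      host=db_host,  # container IP from docker inspect, or "localhost"
      database=db_name,
  )
  cursor = connection.cursor()
  cursor.execute("SELECT VERSION()")
  print(cursor.fetchone())
  connection.close()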

Extracting data from the database inside the container

During the course of the project, we needed to manually extract the data inserted into the database. This was done by logging into the container with:

  $ docker exec -it <container-name> bash

Once inside, move to the folder /var/run/mysqld and run the following command:

  $ mysqldump -u <database-user-name> -p <database-name> > <database-record>.sql

To copy it outside of the container, run the following command:

  $ docker cp <container-name>:/var/run/mysqld/<database-record>.sql <path/to/folder/on/local/machine>

This can be useful for backing up the database.

Shipping ScrApp to container

To build the container for the Python 3 app, we are going to touch two files: a Dockerfile and a .dockerignore. In the latter, we exclude every file in the folder except scrapp.py and the Modules folder, which contains the functions needed by the main program; a possible layout is sketched below.
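Here is one way the .dockerignore could look; note that requirements.txt must also reach the build context, since the pip install step in the Dockerfile needs it (this layout is an assumption, not the repo's actual file):

  # ignore everything ...
  *
  # ... except the files the image actually needs
  !scrapp.py
  !Modules
  !requirements.txt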

The Dockerfile has the following content:

### Dockerfile for ScrApp.py
# base for the container
FROM python:3.8-alpine

# metadata
LABEL author="ninocan" maintainer="ninocangialosi@yahoo.it"

# import the project files and move to the folder
# where they are located in the container
COPY . /app
WORKDIR /app
RUN mkdir Posters

# needed to avoid errors when installing lxml library
RUN apk add --update --no-cache g++ gcc libxslt-dev
# install dependencies
RUN pip install -r requirements.txt

# open port for connection
EXPOSE 5000

# launch the application
CMD ["python", "./scrapp.py"]

Let's comment on each instruction:

  • FROM python:3.8-alpine: fetches the base image from Docker Hub. This should be the lightest base for Python 3.8; the package manager of Alpine is apk.
  • LABEL ...: includes metadata in the produced image. We are just listing the author and the email for convenience.
  • COPY . /app: copies the files from the local machine to the container. Thanks to the .dockerignore file, we are only including the files of interest.
  • WORKDIR /app: moves inside the folder of the application. The container acts as an independent filesystem; this is where we do the dirty job.
  • RUN mkdir Posters: creates a new folder inside /app. We are not including Posters in the repo, so we need to create it for the app to work properly.
  • RUN apk ...: installs g++, gcc, and libxslt-dev. The first version of the Dockerfile did not have this line and the build failed when installing lxml; this line fixes it. Source: Stack Overflow.
  • RUN pip install ...: installs the Python dependencies of the application. This requires the creation of a requirements.txt file; details below.
  • EXPOSE 5000: opens a port for external connections (optional).
  • CMD ...: launches the app.
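As promised, here is what a minimal requirements.txt looks like; running $ pip freeze > requirements.txt inside the venv produces the same list with exact versions pinned (this minimal variant is a sketch, not the repo's actual file):

  requests
  beautifulsoup4
  lxml
  mysql-connector-python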

About

This is the scraper for the vonDD webapp.
