Python Web Development Tools
- Link and words harvester Ripper, with parsers html.parser, html5lib (`$ pip install html5lib`), lxml, lxml-xml
- Text generator LoremPysum
$ pip install pywebber --upgrade # latest release
$ pip install https://github.com/Parousiaic/pywebber/archive/master.zip # development version from GitHub
$ from pywebber import Ripper
Accessing words and links is easy
$ page = Ripper('http://python.org')
$ soup = page.soup # the raw BeautifulSoup4 object
$ uncleaned_links = page.raw_links # all raw <a> tags on page as bs4 objects
$ cleaned_links = page.links() # generator of all links in the form `http://www.domain.location`
$ words = page.words() # a generator of words between <p> tags
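For readers curious about what harvesting links and paragraph words involves, here is a minimal sketch that uses only the standard library's html.parser. It is not pywebber's implementation (Ripper builds on BeautifulSoup4); the class name `MiniHarvester` and the sample HTML are assumptions made purely for illustration.

```python
from html.parser import HTMLParser


class MiniHarvester(HTMLParser):
    """Hypothetical stand-in: collects href values of <a> tags and words inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._p_depth = 0  # greater than 0 while we are inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth:
            self._p_depth -= 1

    def handle_data(self, data):
        # Only text that sits between <p> and </p> contributes words.
        if self._p_depth:
            self.words.extend(data.split())


sample = '<p>Hello <a href="http://www.python.org">Python</a> world</p><a href="/about">About</a>'
harvester = MiniHarvester()
harvester.feed(sample)
harvester.close()
print(harvester.links)
print(harvester.words)
```

Unlike `page.links()`, this sketch returns href values exactly as written, so relative links such as `/about` are not normalised.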
The following object creation options are available:
- url: Defaults to url="http://python.org".
- parser: Defaults to parser="html.parser". To see a complete list of parsers, use object_instance.parsers.
- refresh: Defaults to refresh=False. The first time Ripper hits a page, it saves a prettified BeautifulSoup4 object of the scraped page in a text file, and subsequent calls of the class read from that file. If set to True, Ripper hits the site to get its data every time it is called to construct the page object.
- save_path: Defaults to save_path=None. In this case, Ripper creates a folder on your USER DESKTOP. The folder name has the format domainName_extension. Every page scraped from that site is saved inside this folder. It is also possible to set save_path=/some/other/path. The save file name has the format page_url.txt.
- split_string: Defaults to the characters of string.punctuation extended with ["n", " ", "://"]. You can supply a list to add to this set.
- stop_words: Defaults to ['', '#', '\n', 'the', 'to', "but", "and"]. These are words that should not be included when object_instance.words() is called. You can supply a list to add to this set.
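The caching and word-filtering options can be pictured with the rough sketch below. It is not Ripper's code: the helper names `cache_file_for`, `load_page` and `filter_words`, the exact file-naming rule, and the injected `fetch` callable are all assumptions chosen to mirror the description of refresh, save_path, split_string and stop_words.

```python
import re
import string
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def cache_file_for(url, save_path):
    """Hypothetical mapping of a URL to <save_path>/<domainName_extension>/<page_url>.txt."""
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    folder = host.replace(".", "_")  # python.org becomes python_org
    page = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return Path(save_path) / folder / (page + ".txt")


def load_page(url, save_path, fetch, refresh=False):
    """Return page text, calling fetch(url) only if nothing is saved yet or refresh=True."""
    path = cache_file_for(url, save_path)
    if path.exists() and not refresh:
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


def filter_words(text, split_string, stop_words):
    """Split text on every separator in split_string, then drop stop words."""
    pattern = "|".join(re.escape(sep) for sep in split_string)
    return [w for w in re.split(pattern, text) if w and w not in stop_words]


calls = []  # records every simulated network hit


def fake_fetch(url):
    calls.append(url)
    return "<html>saved copy</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = load_page("http://python.org", tmp, fake_fetch)                # fetched
    second = load_page("http://python.org", tmp, fake_fetch)               # read from disk
    third = load_page("http://python.org", tmp, fake_fetch, refresh=True)  # fetched again

separators = list(string.punctuation) + ["\n", " ", "://"]
kept = filter_words("the cat, and the dog", separators, ["", "the", "and"])
print(len(calls), kept)
```

Passing the fetch function in as an argument keeps the sketch runnable without network access; a real implementation would download the page at that point.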
$ from pywebber import LoremPysum
Create a single LoremPysum instance with default Lorem Ipsum text
$ p = LoremPysum(*args, domains=None, lorem=True)
You can also include your own words alongside the standard lorem ipsum text. If you want only your own words, pass lorem=False like this:
$ p = LoremPysum(*args, domains=None, lorem=False)
*args is an optional list of files from which to get the words to be used. Pass any number of text files as shown below:
$ p = LoremPysum("file1_path.txt", "file2_path.txt", domains=None, lorem=True)
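To make the interplay between the file arguments and the lorem flag concrete, here is a small sketch of how such a word pool could be assembled. The function `build_word_pool` and the shortened `LOREM` text are hypothetical; LoremPysum's internals may well differ.

```python
import tempfile
from pathlib import Path

LOREM = "lorem ipsum dolor sit amet consectetur adipiscing elit"  # shortened stand-in text


def build_word_pool(*files, lorem=True):
    """Hypothetical helper: gather words from text files, optionally with lorem ipsum words."""
    words = LOREM.split() if lorem else []
    for name in files:
        words.extend(Path(name).read_text(encoding="utf-8").split())
    return words


with tempfile.TemporaryDirectory() as tmp:
    custom = Path(tmp) / "file1_path.txt"
    custom.write_text("spam eggs ham", encoding="utf-8")
    mixed = build_word_pool(custom, lorem=True)      # lorem ipsum words plus file words
    own_only = build_word_pool(custom, lorem=False)  # file words only

print(len(mixed), own_only)
```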
The following methods are defined
$ p.email() # return a single email address. You can pass in a file containing a list of domains. The defaults are `[".com", ".info", ".net", ".org"]`
$ p.name() # return a name in the form "firstname I. lastname".
$ p.sentence() # generate a single sentence.
$ p.paragraphs() # return a single paragraph of standard Lorem Ipsum text.
$ p.paragraphs(count=3) # return 3 paragraphs where the first paragraph is the standard text.
$ p.paragraphs(common=False) # return a single paragraph where the first paragraph is random.
$ p.title() # generate a title-case string with 2 to n words. The default for n is 5. Good for article titles.
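As a rough picture of how generators like these can work, the sketch below draws random words from a pool. The free functions `sentence`, `title` and `email`, the short `WORDS` list, and the 4 to 8 word sentence length are assumptions for illustration only, not LoremPysum's actual behaviour.

```python
import random

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()
DOMAINS = [".com", ".info", ".net", ".org"]


def sentence(rng, words=WORDS):
    """Pick 4 to 8 random words, capitalise the first, and end with a full stop."""
    chosen = [rng.choice(words) for _ in range(rng.randint(4, 8))]
    return " ".join(chosen).capitalize() + "."


def title(rng, n=5, words=WORDS):
    """Build a title-case string of 2 to n random words."""
    return " ".join(rng.choice(words) for _ in range(rng.randint(2, n))).title()


def email(rng, words=WORDS, domains=DOMAINS):
    """Build an address of the form word@word followed by a domain ending."""
    return f"{rng.choice(words)}@{rng.choice(words)}{rng.choice(domains)}"


rng = random.Random(0)  # seeded so repeated runs give the same output
print(sentence(rng))
print(title(rng))
print(email(rng))
```

Passing a `random.Random` instance around, rather than using the module-level functions, makes the output reproducible when generating fixtures for tests.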
In case you want to look into the words used, the following instance attributes are defined:
$ p.common # A list of the first few words in the lorem ipsum text
$ p.words # A list of all the words in the lorem ipsum text.
$ p.standard # Standard lorem ipsum text. Usually the first 1/3rd portion of a sample file.
$ p.domains # list of domain name endings
- Luca De Vitis for the inspiration and starter code for LoremPysum
- 'BeautifulSoup documentation'