The crawler implemented is specific-topic, since it will collect the 1000 pages most similar to a default one, and all are expected to address the same topic or well-related topics.
- NLTK stopwords (execute with python nltk.download('stopwords'))
- NLTK WordNet (execute with python nltk.download('wordnet'))
- Scrapy (install with pip install scrapy)
- NLTK (install with pip install nltk)
- Spacy (install with pip install spacy)
- Spacy's spanish model (install with python -m spacy download es_core_news_sm)
The implementation in particular is a Scrapy project, a very popular Python framework in web crawling.
The most modified section was __init__.py file which contains the QuotesSpider
class with the definition of the parse()
method that defines the algorithm of crawling as such.
def parse(self, response:HtmlResponse):
texto = self.extract_text(response)
vector = vectorice(texto,self.indexed)
similarity = compute_cosine_similarity(vector,self.madre_vetor)
print('url = {0}'.format(response.url))
print('similarity = {0}'.format(similarity))
input('look')
if similarity < 0.2:
return
splited = response.url.split('/')
splited = [x for x in splited if x != '']
page_name = ''.join(splited[len(splited)-1])
with open('web_pages/{0}.html'.format(page_name),'wb') as f:
f.write(response.body)
self.descargas += 1
input('downloaded: {0}'.format(response.url))
input('descargas = {0}'.format(self.descargas))
if self.descargas == self.to_recolect:
return
extractor = LinkExtractor()
links = extractor.extract_links(response)
for link in links:
if link is not None:
requests
yield Request(link.url)
Since the task does not have to be shared with other crawlers, the seed URLs are a set of popular pages, chosen in this case with a query to Google with the word "paint", those with the highest PageRank.
The set of indexed terms, ideally, is chosen by a group of specialists. Alternatively, this can be automated, and there are different ways to do it. The two main ones are: selecting the nouns in the text, or choosing the groups of nouns that encompass the same concept. In this case, we will proceed with the first of the variants on the page indicated by Pintura.
This phase is directly affected by stemming and normalization. .
- Lexical analysis: digits, punctuation marks, hyphens are eliminated and words are normalized (upper or lower case).
- Elimination of stopwords (words that do not provide transcendent meaning).
- Stemming: reduce each word to its root, includes removing prefixes and suffixes.
From the chosen indexed terms, each document is a vector in which the -ith position contains the frequency in the document of that term.
For each crawled page, its similarity with the parent page Pintura will be found by the cosine criterion[1].
[1] Modern information retrieval. Baeza-Yates, Ricardo and Ribeiro-Neto, Berthier and others. (1999).