forked from WING-NUS/RSScrawler-1
rsscrawler-doc.txt
Help on module rsscrawler:
NAME
rsscrawler
FILE
c:\users\fangning\desktop\rssarchive\rsscrawler.py
DESCRIPTION
The following list explains the purpose of the files generated during a crawl:
1. rsscrawler.cfg stores configuration options.
2. RSSName.db stores the word-frequency data generated from the titles of fetched web pages.
3. fileName.db (where fileName is derived from the address of one RSS source link) stores the links of all web pages fetched from that RSS source.
4. filename.html stores a whole fetched web page, where filename is derived from the news item's title.
5. RSSName.txt stores the word-frequency file generated from the titles of fetched web pages.
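
To make the word-frequency files concrete, here is a minimal sketch of how word counts could be derived from page titles and rendered one record per line. It is not taken from rsscrawler.py: the function names, the tokenization rule, and the space-separated layout (modeled on the "date source words frequency" format mentioned under updateWordsFrequency) are assumptions for illustration, and the code targets Python 3.

```python
import re
from collections import Counter


def count_title_words(titles):
    """Count how often each lowercased word appears across a list of titles.

    Hypothetical helper; the real crawler may tokenize differently.
    """
    counts = Counter()
    for title in titles:
        # Assumed tokenization: runs of letters, digits, and apostrophes are words.
        counts.update(re.findall(r"[a-z0-9']+", title.lower()))
    return counts


def format_frequency_lines(date, source, counts):
    """Render one 'date source word frequency' line per word, most frequent first."""
    return [
        "%s %s %s %d" % (date, source, word, n)
        for word, n in counts.most_common()
    ]


if __name__ == "__main__":
    titles = ["Markets rally", "Markets fall"]
    for line in format_frequency_lines("06.10.2012", "cnn", count_title_words(titles)):
        print(line)
```

Keeping the counting step separate from the formatting step means the same counts could be written to either the .db or the .txt file without recounting.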
CLASSES
rsscrawler
class rsscrawler
| Methods defined here:
|
| __init__(self)
|
| calWordsFrequency(self, content, daytime)
| # calculate word frequency
|
| createAllFetchedLinks(self, fileName)
| # create the database file that holds all fetched links
|
| determineDuplication(self, fileName, link)
| # determine whether a fetched link is a duplicate
|
| fetchAllImages(self, images, webpage_link)
| # fetch all images
|
| fetchNews(self)
| # fetch all RSS sources belonging to one news site (one source file)
|
| fetchWebpage(self, link)
| # fetch a whole webpage
|
| fetchXML(self, rssResource)
| # fetch an XML document from an RSS source
|
| loadWordsFrequency(self)
| # load word-frequency records from the cnn.txt file
|
| processArgs(self, argv)
| # process the input arguments
|
| replaceAll4FileName(self, oldName)
| # Every fetched web page is stored on disk in its own HTML file.
| # The file name is the page title, so characters that are illegal in file names must be removed from the title.
|
| storeNewLinkInMERGEandHTML(self, file_id, rssResource, page, title, firstPage_link, link, page_num, date)
| # store newly fetched links in MERGE.TXT and the whole web page in an .html file
|
| updateNewLinks(self, fileName, newlink)
| # insert a newly fetched link into the database
|
| updateWordsFrequency(self)
| # the word-frequency file format is "date source words frequency"
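
The title cleanup described for replaceAll4FileName can be sketched as follows. This is an illustrative guess rather than the module's actual code: the function name, the set of characters treated as illegal (those rejected by Windows file systems, plus control characters), the "_" replacement, and the "untitled" fallback are all assumptions, and the code targets Python 3.

```python
import re

# Assumed illegal set: characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_title_for_filename(title, replacement="_"):
    """Return a version of `title` that is safe to use as a file name.

    Hypothetical stand-in for replaceAll4FileName; the real method may
    remove or replace a different set of characters.
    """
    cleaned = _ILLEGAL_CHARS.sub(replacement, title)
    # Windows also rejects names that end with a dot or a space.
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "untitled"


if __name__ == "__main__":
    print(sanitize_title_for_filename('Who won? A/B: "test"') + ".html")
```

Replacing illegal characters instead of deleting them keeps word boundaries visible in the resulting file name, although two different titles can still map to the same name, so a caller may want to check for an existing file first.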
DATA
__author__ = 'Fang Ning'
__credits__ = "This code mainly complete to fetch all RSS news ... but...
__date__ = '06.10.2012'
__version__ = '1.5.0'
VERSION
1.5.0
DATE
06.10.2012
AUTHOR
Fang Ning
CREDITS
This code mainly complete to fetch all RSS news .
You may freely use it in your project , but please remain the head's information part.