rsscrawler-doc.txt

Help on module rsscrawler:

NAME
    rsscrawler

FILE
    c:\users\fangning\desktop\rssarchive\rsscrawler.py

DESCRIPTION
    Following explanation gives purposes of some files generated in the process :
    1. rsscrawler.cfg store some options.
    2. RSSName.db stores words frequency file generated from fetched web pages' title.
    3. fileName.db (fileName means one RSS source link address) stores all links of web pages fetched in terms of one RSS link  source.
    4. filename.html stores the whole webpage fetched, where filename is gotten by news' title. 
    5. RSSName.txt stores words frequency file generated from fetched web pages' tilte.

CLASSES
    rsscrawler
    
    class rsscrawler
     |  Methods defined here:
     |  
     |  __init__(self)
     |  
     |  calWordsFrequency(self, content, daytime)
     |      # calculate word frequency
     |  
     |  createAllFetchedLinks(self, fileName)
     |      # create all fetched links database file
     |  
     |  determineDuplication(self, fileName, link)
     |      # determine duplication of fetched links
     |  
     |  fetchAllImages(self, images, webpage_link)
     |      # fetch all images
     |  
     |  fetchNews(self)
     |      # fetch all RSS sources in terms of one news site (a source file)
     |  
     |  fetchWebpage(self, link)
     |      # fetch a whole webpage
     |  
     |  fetchXML(self, rssResource)
     |      # fetch a XML from a RSS source
     |  
     |  loadWordsFrequency(self)
     |      # load word frequency record from cnn.txt file
     |  
     |  processArgs(self, argv)
     |      # process argments input
     |  
     |  replaceAll4FileName(self, oldName)
     |      # every webpage fetched is stored in a specified html file in hard disk.
     |      # file name is its title, so we need to remove illegal chars from its title in order to consist with file name requirement.
     |  
     |  storeNewLinkInMERGEandHTML(self, file_id, rssResource, page, title, firstPage_link, link, page_num, date)
     |      # store new fetched links into MERGE.TXT and a whole webpage into .html file
     |  
     |  updateNewLinks(self, fileName, newlink)
     |      # insert a new fetched link into database
     |  
     |  updateWordsFrequency(self)
     |      # words frequency file format is "date  source  words   frequency"

DATA
    __author__ = 'Fang Ning'
    __credits__ = "This code mainly complete to fetch all RSS news ... but...
    __date__ = '06.10.2012'
    __version__ = '1.5.0'

VERSION
    1.5.0

DATE
    06.10.2012

AUTHOR
    Fang Ning

CREDITS
    This code mainly complete to fetch all RSS news . 
    You may freely use it in your project , but please remain the head's  information part.