
Crawler #11

Closed · wants to merge 29 commits

Conversation

casssidyHong (Contributor):

Same storage approach as before: each crawled news article is stored as a dict and then appended to a list.
modified_date is stored in the form "year/month/day hour:minute" (e.g. 2023/12/23 13:59); if the news site only keeps the date, the time is automatically padded to 00:00.
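A minimal sketch of this storage scheme (the helper and the field values are illustrative, not the PR's actual code):

    # Hypothetical helper: pad a date-only string with a 00:00 time.
    def normalize_date(raw: str) -> str:
        return raw if ' ' in raw else raw + ' 00:00'

    news_list: list[dict] = []
    news_list.append({
        'title': 'Example headline',
        'modified_date': normalize_date('2023/12/23'),  # -> '2023/12/23 00:00'
    })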

@casssidyHong self-assigned this on Jan 7, 2024
@david20571015 (Collaborator) left a comment:

Thanks for your contribution @blablablahoyo, after reviewing your program I have included some general comments here:

  1. Move the tryy/ directory into sync_crawler/ and consider renaming it to crawler.
  2. Delete the unused (commented-out) code. If needed, you can restore it via Git later.
  3. Refer to the code in sync_crawler/ or search on Google to understand how to implement inheritance in Python. That way you won't need to copy and paste existing code (just as with functions); see the sketch after this list.
  4. Add typing to improve readability.
  5. Did you forget to git add pyproject.toml poetry.lock or poetry add requests beautifulsoup4? You should include them if you add any dependencies.

If the workload is too heavy for you, it's okay to focus on improving the base class and one of the crawlers in this pull request (move the others to another branch).🤩
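For item 3, a minimal sketch of the inheritance pattern (method names and bodies are illustrative, not the PR's actual code):

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        # Shared logic lives here once instead of being copied into every crawler.
        def fetch(self, url: str) -> str:
            return f'<html from {url}>'  # placeholder for the real fetching code

        @abstractmethod
        def parse(self, html: str) -> dict:
            # Each site-specific crawler implements only its own parsing.
            ...

    class CnaCrawler(BaseCrawler):
        def parse(self, html: str) -> dict:
            return {'source': 'cna', 'raw': html}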

tryy/base_class.py (outdated, resolved)
tryy/cna.py (outdated)
Comment on lines 72 to 81:

    news_dict = {}
    news_dict['title'] = title
    news_dict['content'] = content
    news_dict['category'] = category
    news_dict['modified_date'] = modified_date
    news_dict['media'] = media
    news_dict['tags'] = tags
    news_dict['url'] = url
    news_dict['url_hash'] = instance.url_hash
    news_dict['content_hash'] = instance.content_hash
Collaborator:

You should use proto.news_pb2.News directly for this.
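A hedged sketch of that change, reusing the variables from the snippet above and assuming the generated News message defines matching fields:

    from proto import news_pb2

    news = news_pb2.News(
        title=title,
        content=content,
        category=category,
        modified_date=modified_date,
        media=media,
        url=url,
        url_hash=instance.url_hash,
        content_hash=instance.content_hash,
    )
    news.tags.extend(tags)  # repeated protobuf fields are extended, not assigned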

tryy/cna.py (outdated)
Comment on lines 56 to 61:

    tags = []
    tag_links = soup.select('div.keywordTag a')
    for tag in tag_links:
        tag = tag.get_text()
        tag = tag.replace('#', '')
        tags.append(tag)
Collaborator:

Consider using a list comprehension.

Contributor (Author):

I changed it to:

    tag_links = soup.select('div.keywordTag a')
    tags = [tag.get_text().replace('#', '') for tag in tag_links]

Collaborator:

LGTM. You can push a new commit with your modification; GitHub will then mark this section as outdated and I will review it again.

tryy/cna.py (outdated)

        category=None,
        modified_date=None,
        media=None)
    soup = temp_base.get_page('https://www.cna.com.tw/list/aall.aspx', headers)
Collaborator:

Check whether soup is None.

Contributor (Author):

I added it to the base class like this:

    def get_page(self, url, headers):
        try:
            r = requests.get(url, headers)
            r.encoding = 'UTF-8'
            soup = BeautifulSoup(r.text, "html.parser")
            if soup is None:
                print("Soup object is None. Parsing failed.")
            return soup
        except requests.RequestException as e:
            print(f"Error fetching page")
            return None

Collaborator:

> I add it to the base_class like this: def get_page(self, url, headers): try: r = requests.get(url, headers) ...

FYI, you can make good use of code blocks to improve readability.

tryy/cna.py (outdated)
Comment on lines 24 to 26:

    urls = []
    for s in sel:
        urls.append(s.find('a')['href'])
Collaborator:

Consider using a list comprehension.

Contributor (Author):

I changed it to:

    urls = [s.find('a')['href'] for s in sel if s.find('a')]

tryy/cna.py (outdated, resolved)
tryy/base_class.py (outdated, resolved)
tryy/base_class.py (outdated, resolved)
tryy/cna.py (outdated)
            print(e)
            continue

    return article_list
Collaborator:

We prefer to yield each element instead of returning a list, to reduce memory cost.
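A sketch of the generator version (parse_article is a hypothetical stand-in for the PR's per-article logic):

    def crawl(self, urls):
        for url in urls:
            try:
                # Yield articles one at a time so the caller never holds the whole list.
                yield self.parse_article(url)
            except Exception as e:
                print(e)
                continue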

casssidyHong (Contributor) commented Jan 8, 2024 via email

david20571015 (Collaborator) commented Jan 8, 2024:

I can't see the image 🫠 It seems email replies don't carry images?
You probably need to use the GitHub website for #11; replying at the bottom there lets you paste images directly.

casssidyHong (Contributor):

[Screenshot 2024-01-08 10:48 AM] After git pull origin try it shows a config; how should I handle this?

david20571015 (Collaborator) commented Jan 8, 2024:

git pull --rebase
If that doesn't work, git pull --ff-only.

Also, just to note: this is a conflict, not a config.

casssidyHong (Contributor):

[Screenshot 2024-01-08 11:02 AM] OK, got it. Now it shows this; should I change it to "git branch --set-upstream-to=origin/ try"?

david20571015 (Collaborator):

git switch try
git branch --set-upstream-to=origin/try try

casssidyHong (Contributor):

After running git switch try and git branch --set-upstream-to=origin/try try, the push still fails.
[Screenshot 2024-01-08 11:17 AM]

david20571015 (Collaborator):

> git pull --rebase; if that doesn't work, git pull --ff-only.
> Also, just to note: this is a conflict, not a config.

This one. And since the upstream branch is already set up, you don't need to type origin try anymore.

Comment on lines 37 to 40:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

This section is a little hard to read; use

    if ...:
        ...
    elif ...:
        ...
    else:
        ...

here.
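For example, the url_hash assignment above could become:

    if url_hash:
        self.url_hash = url_hash
    elif url:
        self.url_hash = self.generate_hash(url)
    else:
        self.url_hash = None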

Comment on lines 58 to 67:

    def get_content(self, soup, content_sel, title):
        content_sel = soup.select(content_sel)
        article_content = []
        content_str = ""
        content_str += title
        for s in content_sel:
            s = s.text.replace('\n', '')
            article_content.append(s)
            content_str += s
        return article_content
david20571015 (Collaborator), Jan 8, 2024:

What is content_str used for? Remove it if we don't need it.

Comment on lines 48 to 49:

    except requests.RequestException as e:
        print(f"Error fetching page")
Collaborator:

You can print out the error, and don't use f-strings if the string is hard-coded.
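For instance, the except branch could become:

    except requests.RequestException as e:
        print('Error fetching page:', e)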

        return soup
    except requests.RequestException as e:
        print(f"Error fetching page")
        return None
Collaborator:

Python implicitly returns None for us, so there's no need to explicitly return it.

        return date_text[:16]
    except Exception as e:
        print(f"Error getting modified date {e}")
        return None
david20571015 (Collaborator), Jan 8, 2024:

Python implicitly returns None for us, so there's no need to explicitly return it.

tryy/cna.py (outdated)
Comment on lines 9 to 12:

    headers = {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    }
Collaborator:

This should be a member of the base class because all crawlers use the same header.
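A sketch of that refactor (the attribute name HEADERS is illustrative):

    class BaseCrawler:
        # One shared header for every crawler, defined once on the base class.
        HEADERS = {
            'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/119.0.0.0 Safari/537.36')
        }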

@david20571015 changed the title from Try to Crawler on Jan 8, 2024
Comment on lines 31 to 34:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

Maybe you accidentally duplicated this?

    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None

    @abstractmethod
Collaborator:

Does the use of @abstractmethod match your intention? It requires all derived classes to implement the get_page function.

Comment on lines 57 to 58:

    content_str = ""
    content_str += title
Collaborator:

What is content_str used for? Remove it if unused.

Comment on lines 70 to 71:

    for c in category:
        print(c.text(), " ")
Collaborator:

Please remove this, as it seems like a simple test that we won't need in a production scenario.

        category = soup.select(category_sel)
        return category

    def find_category(self, soup, type, class_):
Collaborator:

type is a built-in function; consider renaming it.

Comment on lines 56 to 62:

    article_content = []
    content_str = ""
    content_str += title
    for s in content_sel:
        s = s.text.replace('\n', '')
        article_content.append(s)
    return article_content
Collaborator:

It would be better to use a list comprehension.
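For example:

    article_content = [s.text.replace('\n', '') for s in content_sel]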

david20571015 (Collaborator) commented Feb 5, 2024:

Hi @blablablahoyo, please run git pull before starting to work on this PR. I have moved the crawler code and removed other code that might not be relevant to this PR. If needed, you can recover that code using git checkout.

@david20571015 (Collaborator) left a comment:

  1. Please be consistent in your choice of string quote character; we use ' in this project.
  2. Add Python typing annotations to every function (see the sketch below).
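For item 2, a sketch of an annotated signature (the parameter and return types are assumptions based on get_page's current body):

    from bs4 import BeautifulSoup

    class BaseCrawler:
        # The | union syntax requires Python 3.10+.
        def get_page(self, url: str, headers: dict[str, str]) -> BeautifulSoup | None:
            ...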

    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None

    @abstractmethod
Collaborator:

This method should not be abstract because you implemented it!
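A sketch of the distinction (crawl here is a hypothetical method that genuinely is abstract):

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        # Implemented here, so not abstract: subclasses inherit it as-is.
        def get_page(self, url, headers):
            ...

        # Keep @abstractmethod only where every subclass must supply its own version.
        @abstractmethod
        def crawl(self):
            ...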

    @abstractmethod
    def get_page(self, url, headers):
        try:
            r = requests.get(url, headers)
david20571015 (Collaborator), Feb 5, 2024:

  1. The missing timeout argument for requests.get can cause your program to hang indefinitely (see the sketch below).
  2. You can set the header as a property (data member) of this class.
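For item 1, a sketch (timeout=10 is an arbitrary choice; note also that headers must be passed by keyword, since requests.get's second positional parameter is params, not headers):

    r = requests.get(url, headers=headers, timeout=10)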

            r.encoding = 'UTF-8'
            soup = BeautifulSoup(r.text, "html.parser")
            if soup is None:
                print("Soup object is None. Parsing failed.")
Collaborator:

You should raise an error when parsing fails.
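One way to act on this (a sketch; soup.find() returns None when the parsed document contains no tags, and the error type and message are illustrative):

    soup = BeautifulSoup(r.text, 'html.parser')
    if soup.find() is None:  # no tags were parsed at all
        raise ValueError('Parsing failed: document contains no tags')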

Comment on lines 27 to 34:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

What are these used for? Remove them if not needed.

Comment on lines 7 to 8:

    def __init__(self, custom_property, *args, **kwargs):
        super().__init__(*args, **kwargs)
Collaborator:

Remove this; if a subclass doesn't define __init__, Python automatically falls back to the parent's __init__ for you.

Comment on lines 7 to 9:

    def __init__(self, custom_property, *args, **kwargs):
        # Call the parent class's __init__ and pass the required arguments
        super().__init__(*args, **kwargs)
Collaborator:

Remove this; if a subclass doesn't define __init__, Python automatically falls back to the parent's __init__ for you.
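A sketch of what is meant (assuming the subclass adds nothing beyond forwarding arguments):

    class CnaCrawler(BaseCrawler):
        # No __init__ override: Python falls back to BaseCrawler.__init__
        # when the subclass doesn't define one.
        ...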

        super().__init__(*args, **kwargs)

    def cna_urls(self, headers):
        url = 'https://www.cna.com.tw/list/aall.aspx'
Collaborator:

This might be treated as a class constant:

    class CnaCrawler(BaseCrawler):

        URL = 'https://www.cna.com.tw/list/aall.aspx'

    ...

        super().__init__(*args, **kwargs)

    def cts_urls(self, headers):
        url = 'https://news.cts.com.tw/real/index.html'
Collaborator:

This might be treated as a class constant:

    class CtsCrawler(BaseCrawler):

        URL = 'https://news.cts.com.tw/real/index.html'

    ...

david20571015 (Collaborator):

Hello @blablablahoyo, sorry for the late review.
I have requested some changes; please make them and let me know if you have any questions.
