Crawler #11
Conversation
Thanks for your contribution @blablablahoyo, after reviewing your program I have included some general comments here:

- Move the `tryy/` directory into `sync_crawler/` and consider renaming it to `crawler`.
- Delete the unused code (commented-out code). If needed, you can restore it from Git in the future.
- Refer to the code in `sync_crawler/` or search on Google to understand how to implement inheritance in Python. In this way, you won't need to copy and paste existing code (just like using functions).
- Add typing to improve readability.
- Did you forget to `git add pyproject.toml poetry.lock` or `poetry add requests beautifulsoup4`? You should include them if adding any dependencies.

If the workload is too heavy for you, it's okay to focus on improving the base class and one of the crawlers in this pull request (move the others to another branch). 🤩
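The inheritance suggestion above can be sketched minimally as follows (class and method names here are hypothetical, just to illustrate how a subclass reuses the base class instead of copy-pasting):

```python
class BaseCrawler:
    # Shared behaviour lives here exactly once.
    def clean(self, text: str) -> str:
        return text.replace('#', '').strip()


class CnaCrawler(BaseCrawler):
    # Inherits clean() without re-implementing it.
    def parse_tag(self, raw: str) -> str:
        return self.clean(raw)


crawler = CnaCrawler()
print(crawler.parse_tag('#politics '))  # → politics
```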
tryy/cna.py
Outdated
```python
news_dict = {}
news_dict['title'] = title
news_dict['content'] = content
news_dict['category'] = category
news_dict['modified_date'] = modified_date
news_dict['media'] = media
news_dict['tags'] = tags
news_dict['url'] = url
news_dict['url_hash'] = instance.url_hash
news_dict['content_hash'] = instance.content_hash
```
You should use `proto.news_pb2.News` directly for this.
tryy/cna.py
Outdated
```python
tags = []
tag_links = soup.select('div.keywordTag a')
for tag in tag_links:
    tag = tag.get_text()
    tag = tag.replace('#', '')
    tags.append(tag)
```
Consider using list comprehension.
I changed it to:

```python
tag_links = soup.select('div.keywordTag a')
tags = [tag.get_text().replace('#', '') for tag in tag_links]
```
LGTM
You can push a new commit with your modification; GitHub will then mark this section as outdated and I will review it again.
tryy/cna.py
Outdated
```python
category=None,
modified_date=None,
media=None)
soup = temp_base.get_page('https://www.cna.com.tw/list/aall.aspx', headers)
```
Check whether `soup` is `None`.
I added it to the base class like this:

```python
def get_page(self, url, headers):
    try:
        r = requests.get(url, headers)
        r.encoding = 'UTF-8'
        soup = BeautifulSoup(r.text, "html.parser")
        if soup is None:
            print("Soup object is None. Parsing failed.")
        return soup
    except requests.RequestException as e:
        print(f"Error fetching page")
        return None
```
I add it to the base_class like this: def get_page(self, url, headers): try: r = requests.get(url, headers) r.encoding = 'UTF-8' soup = BeautifulSoup(r.text, "html.parser") if soup is None: print("Soup object is None. Parsing failed.") return soup except requests.RequestException as e: print(f"Error fetching page") return None
FYI, you can make good use of code blocks to improve the readability.
tryy/cna.py
Outdated
```python
urls = []
for s in sel:
    urls.append(s.find('a')['href'])
```
Consider using list comprehension.
I changed it to:

```python
urls = [s.find('a')['href'] for s in sel if s.find('a')]
```
tryy/cna.py
Outdated
```python
        print(e)
        continue

    return article_list
```
We prefer to `yield` each element instead of returning a `list`, to reduce the memory cost.
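The pattern suggested above can be sketched as a generator (names hypothetical; the dict stands in for a parsed article):

```python
def iter_articles(urls):
    # Yield one parsed article at a time instead of accumulating a list,
    # so only a single item is held in memory at once.
    for url in urls:
        yield {'url': url}  # placeholder for the real parsing step


# The caller iterates lazily, processing articles as they arrive.
for article in iter_articles(['https://example.com/a1']):
    print(article['url'])
```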
It will become like this:
[image: 截圖 2024-01-08 上午10.42.09.png]
David Chiu ***@***.***> wrote on Monday, January 8, 2024 at 10:37 AM:
… Senior, now that I've made the changes, how do I push them again? If I just `git push origin try` it will config
Try `git pull` first
I can't see the image 🫠 It seems replies sent by email don't include images?
Also, a side note: this is a conflict, not config.
```shell
git switch try
git branch --set-upstream-to=origin/try try
```
This one.
tryy/base_class.py
Outdated
```python
self.url_hash = url_hash if url_hash else self.generate_hash(
    url) if url else None
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None
```
This section is a little hard to read; use

```python
if ...:
    ...
elif ...:
    ...
else:
    ...
```

here.
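One possible refactor of that nested conditional expression into explicit branches, sketched as a standalone helper (names are hypothetical; behaviour is assumed from the original one-liner):

```python
def resolve_hash(explicit_hash, source, generate):
    # An explicitly supplied hash wins; otherwise derive one from the
    # source if it exists; otherwise fall back to None.
    if explicit_hash:
        return explicit_hash
    elif source:
        return generate(source)
    else:
        return None


print(resolve_hash(None, 'abc', len))  # → 3
```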
tryy/base_class.py
Outdated
```python
def get_content(self, soup, content_sel, title):
    content_sel = soup.select(content_sel)
    article_content = []
    content_str = ""
    content_str += title
    for s in content_sel:
        s = s.text.replace('\n', '')
        article_content.append(s)
        content_str += s
    return article_content
```
What is `content_str` used for? Remove it if we don't need it.
tryy/base_class.py
Outdated
```python
except requests.RequestException as e:
    print(f"Error fetching page")
```
You can print out the error, and don't use f-strings if the string is hard-coded.
tryy/base_class.py
Outdated
```python
    return soup
except requests.RequestException as e:
    print(f"Error fetching page")
    return None
```
Python implicitly returns None for us, so there's no need to explicitly return it.
tryy/base_class.py
Outdated
```python
    return date_text[:16]
except Exception as e:
    print(f"Error getting modified date {e}")
    return None
```
Python implicitly returns None for us, so there's no need to explicitly return it.
tryy/cna.py
Outdated
```python
headers = {
    "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
```
This should be a member of the base class, because all crawlers use the same header.
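Moving the shared header into the base class might look like this (a sketch; class names are assumed from the project layout):

```python
class BaseCrawler:
    # One shared header for every crawler; subclasses inherit it and
    # only override if a site needs something different.
    HEADERS = {
        'User-Agent':
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
            'AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/119.0.0.0 Safari/537.36'
    }


class CnaCrawler(BaseCrawler):
    pass  # no per-crawler header needed


print(CnaCrawler.HEADERS['User-Agent'][:11])  # → Mozilla/5.0
```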
tryy/base_class.py
Outdated
```python
self.url_hash = url_hash if url_hash else self.generate_hash(
    url) if url else None
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None
```
Maybe you accidentally duplicated this?
tryy/base_class.py
Outdated
```python
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None

@abstractmethod
```
Does the use of `@abstractmethod` meet your intention? It requires all derived classes to implement the `get_page` function.
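To illustrate what `@abstractmethod` enforces (a sketch with stand-in bodies): subclasses that implement the method can be instantiated, while subclasses that don't will raise `TypeError` at instantiation time.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    @abstractmethod
    def get_page(self, url):
        """Every concrete crawler must provide its own get_page."""


class CnaCrawler(BaseCrawler):
    def get_page(self, url):
        return '<html></html>'  # stand-in implementation


class BrokenCrawler(BaseCrawler):
    pass  # get_page not implemented


CnaCrawler()  # fine
try:
    BrokenCrawler()  # abstract method missing
except TypeError as e:
    print('cannot instantiate:', e)
```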
tryy/base_class.py
Outdated
```python
content_str = ""
content_str += title
```
What is `content_str` used for? Remove it if unused.
tryy/base_class.py
Outdated
```python
for c in category:
    print(c.text(), " ")
```
Please remove this, as it seems like a simple test that we won't need in a production scenario.
tryy/base_class.py
Outdated
```python
    category = soup.select(category_sel)
    return category

def find_category(self, soup, type, class_):
```
`type` is a built-in function; consider renaming it.
tryy/base_class.py
Outdated
```python
article_content = []
content_str = ""
content_str += title
for s in content_sel:
    s = s.text.replace('\n', '')
    article_content.append(s)
return article_content
```
It would be better to use list comprehension.
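The loop above could collapse into a single comprehension, for example (a sketch; `SimpleNamespace` stands in for BeautifulSoup tag objects that expose `.text`):

```python
from types import SimpleNamespace


def extract_paragraphs(nodes):
    # One comprehension replaces the append loop; strips newlines per node.
    return [n.text.replace('\n', '') for n in nodes]


nodes = [SimpleNamespace(text='first\nline'), SimpleNamespace(text='second')]
print(extract_paragraphs(nodes))  # → ['firstline', 'second']
```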
Hi @blablablahoyo, please run
- Please be consistent with your choice of string quote character; we choose to use `'` in this project.
- Add Python typing annotations to every function.
sync_crawler/crawler/base_crawler.py
Outdated
```python
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None

@abstractmethod
```
This method should not be abstract because you implemented it!
sync_crawler/crawler/base_crawler.py
Outdated
```python
@abstractmethod
def get_page(self, url, headers):
    try:
        r = requests.get(url, headers)
```
- The missing timeout argument to `requests.get` can cause your program to hang indefinitely.
- You can set the header as a property (data member) of this class.
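A sketch of a fetch helper with an explicit timeout (the function name and the 10-second value are hypothetical choices, not from the project):

```python
import requests

DEFAULT_TIMEOUT = 10  # seconds; a hypothetical default


def fetch_page(url, headers=None, timeout=DEFAULT_TIMEOUT):
    # timeout bounds both the connect and read phases, so the call
    # cannot hang indefinitely on an unresponsive server.
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of hiding them
    return response.text
```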
sync_crawler/crawler/base_crawler.py
Outdated
```python
        r.encoding = 'UTF-8'
        soup = BeautifulSoup(r.text, "html.parser")
        if soup is None:
            print("Soup object is None. Parsing failed.")
```
You should raise an error when parsing fails.
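Raising instead of printing might look like this (the exception name is hypothetical; a plain value stands in for the parsed document):

```python
class PageParseError(RuntimeError):
    """Hypothetical exception for pages that cannot be parsed."""


def ensure_parsed(soup):
    # Fail loudly so callers must handle the failure rather than
    # silently continuing with bad data.
    if soup is None:
        raise PageParseError('parsing failed: soup is None')
    return soup
```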
sync_crawler/crawler/base_crawler.py
Outdated
```python
self.url_hash = url_hash if url_hash else self.generate_hash(
    url) if url else None
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None
self.url_hash = url_hash if url_hash else self.generate_hash(
    url) if url else None
self.content_hash = content_hash if content_hash else self.generate_hash(
    content) if content else None
```
What are these used for? Remove them if not needed.
sync_crawler/crawler/cna_crawler.py
Outdated
```python
def __init__(self, custom_property, *args, **kwargs):
    super().__init__(*args, **kwargs)
```
Remove this because Python will automatically call `super().__init__()` for you.
sync_crawler/crawler/cts_crawler.py
Outdated
```python
def __init__(self, custom_property, *args, **kwargs):
    # Call the parent class's __init__ and pass the required arguments
    super().__init__(*args, **kwargs)
```
Remove this because Python will automatically call `super().__init__()` for you.
sync_crawler/crawler/cna_crawler.py
Outdated
```python
    super().__init__(*args, **kwargs)

def cna_urls(self, headers):
    url = 'https://www.cna.com.tw/list/aall.aspx'
```
This might be treated as a class constant.

```python
class CnaCrawler(BaseCrawler):
    URL = 'https://www.cna.com.tw/list/aall.aspx'
    ...
```
sync_crawler/crawler/cts_crawler.py
Outdated
```python
    super().__init__(*args, **kwargs)

def cts_urls(self, headers):
    url = 'https://news.cts.com.tw/real/index.html'
```
This might be treated as a class constant.

```python
class CtsCrawler(BaseCrawler):
    URL = 'https://news.cts.com.tw/real/index.html'
    ...
```
Hello @blablablahoyo, sorry for the late review.
Same storage approach as before: each crawled news article is stored as a dict and then appended to a list.
The modified date is stored in the form "year/month/day hour:minute" (e.g. 2023/12/23 13:59); if the news site only keeps the date, the time is automatically padded to 00:00.
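The padding rule described above could be implemented roughly like this (function name is hypothetical; the length check assumes the fixed "YYYY/MM/DD" date format):

```python
def normalize_modified_date(raw):
    # Full 'YYYY/MM/DD HH:MM' values pass through unchanged; date-only
    # values get the time padded to 00:00, matching the rule above.
    raw = raw.strip()
    if len(raw) == len('2023/12/23'):
        return raw + ' 00:00'
    return raw


print(normalize_modified_date('2023/12/23'))        # → 2023/12/23 00:00
print(normalize_modified_date('2023/12/23 13:59'))  # → 2023/12/23 13:59
```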