
Crawler #11

Closed · wants to merge 29 commits

Conversation

casssidyHong (Contributor):

Same storage approach as before: each crawled news article is stored as a dict and then appended to a list.
modified_date is stored in the form "year/month/day hour:minute" (e.g. 2023/12/23 13:59); if the news site only keeps the date, the time is automatically padded to 00:00.
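A minimal sketch of this storage scheme (the helper and the field values are illustrative, not the PR's actual code):

    # Hypothetical helper: pad a date-only string with a 00:00 time.
    def normalize_date(raw: str) -> str:
        return raw if ' ' in raw else raw + ' 00:00'

    news_list: list[dict] = []
    news_list.append({
        'title': 'Example headline',
        'modified_date': normalize_date('2023/12/23'),  # -> '2023/12/23 00:00'
    })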

@casssidyHong self-assigned this on Jan 7, 2024
@david20571015 (Collaborator) left a comment:

Thanks for your contribution @blablablahoyo, after reviewing your program I have included some general comments here:

  1. Move the tryy/ directory into sync_crawler/ and consider renaming it to crawler.
  2. Delete the unused (commented-out) code. If needed, you can restore it via Git later.
  3. Refer to the code in sync_crawler/ or search on Google to understand how to implement inheritance in Python. That way you won't need to copy and paste existing code (just as with functions); see the sketch after this list.
  4. Add typing to improve readability.
  5. Did you forget to git add pyproject.toml poetry.lock or poetry add requests beautifulsoup4? You should include them if you add any dependencies.

If the workload is too heavy for you, it's okay to focus on improving the base class and one of the crawlers in this pull request (move the others to another branch).🤩
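For item 3, a minimal sketch of the inheritance pattern (method names and bodies are illustrative, not the PR's actual code):

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        # Shared logic lives here once instead of being copied into every crawler.
        def fetch(self, url: str) -> str:
            return f'<html from {url}>'  # placeholder for the real fetching code

        @abstractmethod
        def parse(self, html: str) -> dict:
            # Each site-specific crawler implements only its own parsing.
            ...

    class CnaCrawler(BaseCrawler):
        def parse(self, html: str) -> dict:
            return {'source': 'cna', 'raw': html}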

tryy/base_class.py (outdated, resolved)
tryy/cna.py (outdated)
Comment on lines 72 to 81:

    news_dict = {}
    news_dict['title'] = title
    news_dict['content'] = content
    news_dict['category'] = category
    news_dict['modified_date'] = modified_date
    news_dict['media'] = media
    news_dict['tags'] = tags
    news_dict['url'] = url
    news_dict['url_hash'] = instance.url_hash
    news_dict['content_hash'] = instance.content_hash
Collaborator:

You should use proto.news_pb2.News directly for this.
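A hedged sketch of that change, reusing the variables from the snippet above and assuming the generated News message defines matching fields:

    from proto import news_pb2

    news = news_pb2.News(
        title=title,
        content=content,
        category=category,
        modified_date=modified_date,
        media=media,
        url=url,
        url_hash=instance.url_hash,
        content_hash=instance.content_hash,
    )
    news.tags.extend(tags)  # repeated protobuf fields are extended, not assigned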

tryy/cna.py (outdated)
Comment on lines 56 to 61:

    tags = []
    tag_links = soup.select('div.keywordTag a')
    for tag in tag_links:
        tag = tag.get_text()
        tag = tag.replace('#', '')
        tags.append(tag)
Collaborator:

Consider using a list comprehension.

Contributor (Author):

I changed it to:

    tag_links = soup.select('div.keywordTag a')
    tags = [tag.get_text().replace('#', '') for tag in tag_links]

Collaborator:

LGTM. You can push a new commit with your modification; GitHub will then mark this section as outdated and I will review it again.

tryy/cna.py (outdated)

        category=None,
        modified_date=None,
        media=None)
    soup = temp_base.get_page('https://www.cna.com.tw/list/aall.aspx', headers)
Collaborator:

Check whether soup is None.

Contributor (Author):

I added it to the base class like this:

    def get_page(self, url, headers):
        try:
            r = requests.get(url, headers)
            r.encoding = 'UTF-8'
            soup = BeautifulSoup(r.text, "html.parser")
            if soup is None:
                print("Soup object is None. Parsing failed.")
            return soup
        except requests.RequestException as e:
            print(f"Error fetching page")
            return None

Collaborator:

> I add it to the base_class like this: def get_page(self, url, headers): try: r = requests.get(url, headers) ...

FYI, you can make good use of code blocks to improve readability.

tryy/cna.py (outdated)
Comment on lines 24 to 26:

    urls = []
    for s in sel:
        urls.append(s.find('a')['href'])
Collaborator:

Consider using a list comprehension.

Contributor (Author):

I changed it to:

    urls = [s.find('a')['href'] for s in sel if s.find('a')]

tryy/cna.py (outdated, resolved)
tryy/base_class.py (outdated, resolved)
tryy/base_class.py (outdated, resolved)
tryy/cna.py (outdated)
            print(e)
            continue

    return article_list
Collaborator:

We prefer to yield each element instead of returning a list, to reduce memory cost.
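A sketch of the generator version (parse_article is a hypothetical stand-in for the PR's per-article logic):

    def crawl(self, urls):
        for url in urls:
            try:
                # Yield articles one at a time so the caller never holds the whole list.
                yield self.parse_article(url)
            except Exception as e:
                print(e)
                continue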

casssidyHong (Contributor) commented Jan 8, 2024 via email

david20571015 (Collaborator) commented Jan 8, 2024:

I can't see the image 🫠 It seems email replies don't carry images?
You probably need to use the GitHub website for #11; replying at the bottom there lets you paste images directly.

casssidyHong (Contributor):

[Screenshot 2024-01-08 10:48 AM] After git pull origin try it shows a config; how should I handle this?

david20571015 (Collaborator) commented Jan 8, 2024:

git pull --rebase
If that doesn't work, git pull --ff-only.

Also, just to note: this is a conflict, not a config.

casssidyHong (Contributor):

[Screenshot 2024-01-08 11:02 AM] OK, got it. Now it shows this; should I change it to "git branch --set-upstream-to=origin/ try"?

david20571015 (Collaborator):

git switch try
git branch --set-upstream-to=origin/try try

casssidyHong (Contributor):

After running git switch try and git branch --set-upstream-to=origin/try try, the push still fails.
[Screenshot 2024-01-08 11:17 AM]

david20571015 (Collaborator):

> git pull --rebase; if that doesn't work, git pull --ff-only.
> Also, just to note: this is a conflict, not a config.

This one. And since the upstream branch is already set up, you don't need to type origin try anymore.

Comment on lines 37 to 40:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

This section is a little hard to read; use

    if ...:
        ...
    elif ...:
        ...
    else:
        ...

here.
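For example, the url_hash assignment above could become:

    if url_hash:
        self.url_hash = url_hash
    elif url:
        self.url_hash = self.generate_hash(url)
    else:
        self.url_hash = None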

Comment on lines 58 to 67:

    def get_content(self, soup, content_sel, title):
        content_sel = soup.select(content_sel)
        article_content = []
        content_str = ""
        content_str += title
        for s in content_sel:
            s = s.text.replace('\n', '')
            article_content.append(s)
            content_str += s
        return article_content
david20571015 (Collaborator), Jan 8, 2024:

What is content_str used for? Remove it if we don't need it.

Comment on lines 48 to 49:

    except requests.RequestException as e:
        print(f"Error fetching page")
Collaborator:

You can print out the error, and don't use f-strings if the string is hard-coded.
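For instance, the except branch could become:

    except requests.RequestException as e:
        print('Error fetching page:', e)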

        return soup
    except requests.RequestException as e:
        print(f"Error fetching page")
        return None
Collaborator:

Python implicitly returns None for us, so there's no need to explicitly return it.

        return date_text[:16]
    except Exception as e:
        print(f"Error getting modified date {e}")
        return None
david20571015 (Collaborator), Jan 8, 2024:

Python implicitly returns None for us, so there's no need to explicitly return it.

tryy/cna.py (outdated)
Comment on lines 9 to 12:

    headers = {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    }
Collaborator:

This should be a member of the base class because all crawlers use the same header.
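A sketch of that refactor (the attribute name HEADERS is illustrative):

    class BaseCrawler:
        # One shared header for every crawler, defined once on the base class.
        HEADERS = {
            'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/119.0.0.0 Safari/537.36')
        }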

@david20571015 changed the title from Try to Crawler on Jan 8, 2024
Comment on lines 31 to 34:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

Maybe you accidentally duplicated this?

    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None

    @abstractmethod
Collaborator:

Does the use of @abstractmethod match your intention? It requires all derived classes to implement the get_page function.

Comment on lines 57 to 58:

    content_str = ""
    content_str += title
Collaborator:

What is content_str used for? Remove it if unused.

Comment on lines 70 to 71:

    for c in category:
        print(c.text(), " ")
Collaborator:

Please remove this, as it seems like a simple test that we won't need in a production scenario.

        category = soup.select(category_sel)
        return category

    def find_category(self, soup, type, class_):
Collaborator:

type is a built-in function; consider renaming it.

Comment on lines 56 to 62:

    article_content = []
    content_str = ""
    content_str += title
    for s in content_sel:
        s = s.text.replace('\n', '')
        article_content.append(s)
    return article_content
Collaborator:

It would be better to use a list comprehension.
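For example:

    article_content = [s.text.replace('\n', '') for s in content_sel]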

david20571015 (Collaborator) commented Feb 5, 2024:

Hi @blablablahoyo, please run git pull before starting to work on this PR. I have moved the crawler code and removed other code that might not be relevant to this PR. If needed, you can recover that code using git checkout.

@david20571015 (Collaborator) left a comment:

  1. Please be consistent in your choice of string quote character; we use ' in this project.
  2. Add Python typing annotations to every function (see the sketch below).
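For item 2, a sketch of an annotated signature (the parameter and return types are assumptions based on get_page's current body):

    from bs4 import BeautifulSoup

    class BaseCrawler:
        # The | union syntax requires Python 3.10+.
        def get_page(self, url: str, headers: dict[str, str]) -> BeautifulSoup | None:
            ...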

    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None

    @abstractmethod
Collaborator:

This method should not be abstract because you implemented it!
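A sketch of the distinction (crawl here is a hypothetical method that genuinely is abstract):

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        # Implemented here, so not abstract: subclasses inherit it as-is.
        def get_page(self, url, headers):
            ...

        # Keep @abstractmethod only where every subclass must supply its own version.
        @abstractmethod
        def crawl(self):
            ...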

    @abstractmethod
    def get_page(self, url, headers):
        try:
            r = requests.get(url, headers)
david20571015 (Collaborator), Feb 5, 2024:

  1. The missing timeout argument for requests.get can cause your program to hang indefinitely (see the sketch below).
  2. You can set the header as a property (data member) of this class.
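For item 1, a sketch (timeout=10 is an arbitrary choice; note also that headers must be passed by keyword, since requests.get's second positional parameter is params, not headers):

    r = requests.get(url, headers=headers, timeout=10)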

            r.encoding = 'UTF-8'
            soup = BeautifulSoup(r.text, "html.parser")
            if soup is None:
                print("Soup object is None. Parsing failed.")
Collaborator:

You should raise an error when parsing fails.
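One way to act on this (a sketch; soup.find() returns None when the parsed document contains no tags, and the error type and message are illustrative):

    soup = BeautifulSoup(r.text, 'html.parser')
    if soup.find() is None:  # no tags were parsed at all
        raise ValueError('Parsing failed: document contains no tags')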

Comment on lines 27 to 34:

    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
    self.url_hash = url_hash if url_hash else self.generate_hash(
        url) if url else None
    self.content_hash = content_hash if content_hash else self.generate_hash(
        content) if content else None
Collaborator:

What are these used for? Remove them if not needed.

Comment on lines 7 to 8:

    def __init__(self, custom_property, *args, **kwargs):
        super().__init__(*args, **kwargs)
Collaborator:

Remove this; if a subclass doesn't define __init__, Python automatically falls back to the parent's __init__ for you.

Comment on lines 7 to 9:

    def __init__(self, custom_property, *args, **kwargs):
        # Call the parent class's __init__ and pass the required arguments
        super().__init__(*args, **kwargs)
Collaborator:

Remove this; if a subclass doesn't define __init__, Python automatically falls back to the parent's __init__ for you.
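A sketch of what is meant (assuming the subclass adds nothing beyond forwarding arguments):

    class CnaCrawler(BaseCrawler):
        # No __init__ override: Python falls back to BaseCrawler.__init__
        # when the subclass doesn't define one.
        ...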

        super().__init__(*args, **kwargs)

    def cna_urls(self, headers):
        url = 'https://www.cna.com.tw/list/aall.aspx'
Collaborator:

This might be treated as a class constant:

    class CnaCrawler(BaseCrawler):

        URL = 'https://www.cna.com.tw/list/aall.aspx'

    ...

        super().__init__(*args, **kwargs)

    def cts_urls(self, headers):
        url = 'https://news.cts.com.tw/real/index.html'
Collaborator:

This might be treated as a class constant:

    class CtsCrawler(BaseCrawler):

        URL = 'https://news.cts.com.tw/real/index.html'

    ...

david20571015 (Collaborator):

Hello @blablablahoyo, sorry for the late review.
I have requested some changes; please make them and let me know if you have any questions.
