A Python package that helps crawl updates from top Vietnamese news providers.
- You can install the latest vnnews crawler version from source with the following command:
pip install git+https://github.com/thinh-vu/vnnews.git@main
- Install the stable version:
pip install vnnews
(*) You might need to insert a ! before your command when running terminal commands on Google Colab.
- To start using functions, you need to import them:
from vnnews import *
- VN Express
- Tuổi trẻ Online
- CafeF
- Cafebiz
- Kinh tế Sài Gòn Online
- VN Economy
- Pháp Luật Tp.HCM
- Đầu tư Online
- Nhịp cầu đầu tư
- Diễn đàn doanh nghiệp
- url_extract(url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')
- Purpose: Extract article info from a news source, using BeautifulSoup to pull data from the HTML/XML web page.
- Arguments:
- url (:obj:`str`, required): url of the target news source. Eg. 'https://cafef.vn/'
- key (:obj:`str`, required): HTML tag which contains the information that you want to extract. Eg. 'h3', 'article', 'div'
- tag_class (:obj:`str`, optional): The HTML class attribute of the target element, which specifies one or more class names. Defaults to ''. Eg. 'pdate' in the tag <span class="pdate">19-11-2022 - 15:32 PM</span> on CafeF.
- type (:obj:`str`, optional): 'link' as default, to extract only the article links from a news homepage. Use a blank value '' when extracting article details on the article page.
- bs_on (:obj:`bool`, optional): True as default. Pass a blank value '' (or False) if an issue is raised.
- user_agent (:obj:`str`, optional): A default desktop value is provided. You can find more user agent values here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
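To make the roles of key and tag_class more concrete, here is a minimal standalone sketch of tag-and-class based link extraction. It is not the vnnews implementation: it uses only the standard library `html.parser` instead of BeautifulSoup, and `LinkCollector`, `collect_links`, the sample HTML and the returned (text, href) shape are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags nested inside a chosen tag."""

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key                # tag to look inside, e.g. 'h3'
        self.tag_class = tag_class    # optional class filter, '' means any
        self.depth = 0                # > 0 while inside a matching element
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.key:
            classes = (attrs.get("class") or "").split()
            matches = not self.tag_class or self.tag_class in classes
            if self.depth > 0 or matches:
                self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((text, self.current_href))
            self.current_href = None
        elif tag == self.key and self.depth > 0:
            self.depth -= 1


def collect_links(html, key, tag_class=""):
    """Hypothetical helper: return the links found inside `key` elements."""
    parser = LinkCollector(key, tag_class)
    parser.feed(html)
    return parser.links


# Made-up sample page: two headlines and one unrelated link.
sample_html = """
<h3 class="title"><a href="/bai-viet-1.htm">Bài viết 1</a></h3>
<p><a href="/quang-cao.htm">Quảng cáo</a></p>
<h3><a href="/bai-viet-2.htm">Bài viết 2</a></h3>
"""

print(collect_links(sample_html, key="h3"))
print(collect_links(sample_html, key="h3", tag_class="title"))
```

The first call keeps every link under an h3 and skips the link in the paragraph; the second narrows the match to h3 elements whose class list contains 'title'.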
- fix_url(host, url)
- Purpose: Build the full article url from the host name and a url string that might not contain the host at the beginning.
- Arguments:
- host (:obj:`str`, required): the host name of the news source. Eg. 'https://vneconomy.vn'
- url (:obj:`str`, required): the url string of the target news source. This might not contain the host at the beginning. Eg. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
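As an illustration of why a host argument is useful, the sketch below joins a host with a relative article path using the standard library `urllib.parse.urljoin`. This is an assumption about the general technique, not the actual body of fix_url; `join_host` is a hypothetical name.

```python
from urllib.parse import urljoin


def join_host(host, url):
    """Hypothetical helper: return an absolute url for `url` on `host`.

    urljoin leaves `url` unchanged when it is already absolute.
    """
    return urljoin(host, url)


relative = "/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm"
print(join_host("https://vneconomy.vn", relative))
print(join_host("https://vneconomy.vn", "https://vneconomy.vn" + relative))
```

Both calls print the same absolute url, so the helper is safe to apply to a mixed list of relative and absolute links.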
- VN Express
- Get the list of article urls:
url_extract('https://vnexpress.net/kinh-doanh', key='h3')
- Extract article details:
url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
- Tuổi trẻ Online
- Get the list of article urls:
url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
- Extract article details:
url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
- CafeF
- Get the list of article urls:
url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
- Extract article details:
url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
- Cafebiz
- Get the list of article urls:
url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')
- Kinh tế Sài Gòn Online
- Get the list of article urls:
url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
- VN Economy
- Get the list of article urls:
url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
- Extract article details:
url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
- Pháp Luật Tp.HCM
- Get the list of article urls:
url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
- Extract article details:
test = url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
- Đầu tư Online
- Get the list of article urls:
url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
- Extract article details:
url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
- Nhịp cầu đầu tư
- Get the list of article urls:
url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
- Extract article details:
url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')
- Diễn đàn doanh nghiệp
- Get the list of article urls:
url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
- Diễn đàn kinh tế Việt Nam - Vietnamnet
- Get the list of article urls:
url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
- Extract article details: ``
- Forbes Việt Nam
- Get the list of article urls:
url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
- Vietstock
- Get the list of article urls:
url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
- Extract article details:
url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
- Tin nhanh chứng khoán
- Get the list of article urls (doesn't work):
url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
- Extract article details:
url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
- Cafe Land
- Get the list of article urls:
url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
- Kenh14
- Get the list of article urls:
url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
- Extract article details:
url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
- Dân trí
- Get the list of article urls:
url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
- Extract article details:
url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
- Thanh niên
- Get the list of article urls: ``
- Extract article details: ``
- Vietnamnet
- Get the list of article urls: ``
- Extract article details: ``
- Nhân dân điện tử
- Get the list of article urls: ``
- Extract article details: ``
- Lao động
- Get the list of article urls: ``
- Extract article details: ``
- Đời sống & pháp luật
- Get the list of article urls: ``
- Extract article details: ``
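Since each source needs its own key and options, the per-source calls above can be kept in a small table and run in one loop. The sketch below is only a suggestion: `SOURCES` and `crawl_sources` are hypothetical names, the settings are copied from three of the examples above, and the extractor is passed in as a parameter so the loop can be tried without network access.

```python
# Per-source settings copied from the examples above.
SOURCES = {
    "VN Express": {"url": "https://vnexpress.net/kinh-doanh", "key": "h3"},
    "CafeF": {"url": "https://cafef.vn/bat-dong-san.chn", "key": "h3", "type": "link"},
    "Đầu tư Online": {"url": "https://baodautu.vn/", "key": "article", "type": "link", "bs_on": ""},
}


def crawl_sources(sources, extract):
    """Call `extract` once per source; collect results and errors by name."""
    results, errors = {}, {}
    for name, options in sources.items():
        options = dict(options)       # copy so the table is not mutated
        url = options.pop("url")
        try:
            results[name] = extract(url, **options)
        except Exception as exc:      # one failing site should not stop the batch
            errors[name] = repr(exc)
    return results, errors


def fake_extract(url, key, **options):
    """Stand-in extractor used to try the loop offline."""
    return [(key, url)]


results, errors = crawl_sources(SOURCES, fake_extract)
print(sorted(results), errors)
```

With the package installed, the same loop should work with the real function: `from vnnews import url_extract`, then `crawl_sources(SOURCES, url_extract)`.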
- Demo video: How to select the key
- Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
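For context on what the user_agent argument changes, the sketch below builds an HTTP request that carries a custom User-Agent header using the standard library `urllib.request`. How vnnews sends its requests internally is not shown in this README, so treat this as a general illustration; `build_request` is a hypothetical helper, and the mobile user agent string is the one from the Nhịp cầu đầu tư example.

```python
import urllib.request

MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)"


def build_request(url, user_agent):
    """Return a Request with the given User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://m.nhipcaudautu.vn/kinh-doanh/", MOBILE_UA)
# urllib stores header names capitalized as 'User-agent'.
print(request.get_header("User-agent"))

# To actually fetch the page (needs network access):
# html = urllib.request.urlopen(request, timeout=10).read()
```

Some sites serve a different layout to mobile and desktop browsers, which is one reason a key that works with one user agent might not work with another.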
You can contact me at one of my social network profiles:
If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fee and lets me develop high-quality content.