In this chapter, we build a pipeline that gathers the data the project needs.
The data is collected from popular news sources such as VNExpress and CBSNews, across different categories.
Because the data varies, each kind of data is scheduled to be crawled at its own interval and time of day.
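To make the idea of per-category crawl intervals concrete, here is a minimal sketch in plain Python. It is not the project's scheduling code (in Dagster this would normally be expressed as schedules). The category names and intervals are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical crawl intervals per category (names and values are assumptions).
CRAWL_INTERVALS = {
    "breaking_news": timedelta(minutes=15),
    "business": timedelta(hours=1),
    "sports": timedelta(hours=6),
}


def due_categories(last_crawled, now):
    """Return the categories whose interval has elapsed since their last crawl."""
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawled.get(category)
        # A category that has never been crawled is always due.
        if last is None or now - last >= interval:
            due.append(category)
    return sorted(due)


now = datetime(2024, 1, 1, 12, 0)
last = {
    "breaking_news": datetime(2024, 1, 1, 11, 50),  # crawled 10 minutes ago
    "business": datetime(2024, 1, 1, 10, 30),       # crawled 90 minutes ago
}
print(due_categories(last, now))  # ['business', 'sports']
```

Keeping the intervals in a single mapping makes it easy to see, and later adjust, how often each category is crawled.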
Before running the project, install all of its dependencies with the following command:
pip install -r requirements.txt
You can run the project with either the Dagster CLI or the Dagster UI (Dagit). This project prefers Dagit, because interacting with jobs in the UI is intuitive and straightforward.
- Run the command
dagit -p 3141
- You can now access Dagit (UI) at
localhost:3141
Schedules and sensors only run while the Dagster Daemon is running. To set up the daemon:
- Create a dagster_home folder using the command
mkdir -p dagster_home
- Set the DAGSTER_HOME environment variable in your shell to the newly created dagster_home folder:
export DAGSTER_HOME=[dagster_home's path]
- With the path set, create a YAML file in the dagster_home folder for further configuration:
touch dagster_home/dagster.yaml
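The setup steps above can also be scripted. The following is a minimal sketch in Python using only the standard library. The function name `prepare_dagster_home` is hypothetical and not part of this project or of Dagster. The path is resolved to an absolute one on the assumption that Dagster expects DAGSTER_HOME to be absolute.

```python
import os
from pathlib import Path


def prepare_dagster_home(base_dir):
    """Create dagster_home with an empty dagster.yaml and set DAGSTER_HOME."""
    home = Path(base_dir).resolve() / "dagster_home"
    home.mkdir(parents=True, exist_ok=True)  # same effect as: mkdir -p dagster_home
    (home / "dagster.yaml").touch(exist_ok=True)  # same effect as: touch dagster_home/dagster.yaml
    # This only affects the current process and the processes it starts,
    # not the shell that launched it.
    os.environ["DAGSTER_HOME"] = str(home)
    return home


if __name__ == "__main__":
    print(prepare_dagster_home("."))
```

Because the environment variable is set only for the current process, the `export` in your shell is still needed for commands you run by hand. The daemon itself is typically started with `dagster-daemon run` once DAGSTER_HOME is set.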
All commit messages must follow the Conventional Commits guide so that they carry semantic meaning. Commits that do not follow it are rejected automatically by the commit hook.
<type>(<scope>): <short summary>
│ │ │
│ │ └─⫸ Summary in present tense. Not capitalized. No period at the end.
│ │
│ └─⫸ Commit Scope: Feature scopes
│
└─⫸ Commit Type: build|ci|docs|feat|fix|perf|refactor|test|chore
The <type> and <summary> fields are mandatory; the (<scope>) field is optional.
The type must be one of the following:
- build: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
- ci: Changes to our CI configuration files and scripts (example scopes: Circle, BrowserStack, SauceLabs)
- docs: Documentation only changes
- feat: A new feature
- fix: A bug fix
- perf: A code change that improves performance
- refactor: A code change that neither fixes a bug nor adds a feature
- test: Adding missing tests or correcting existing tests
- chore: A commit that is not related to code (resolving conflicts, etc.)
The scope should be the name of the feature you are developing. It is OPTIONAL, so feel free to skip it if you want to be more generic.
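The header format above can be checked mechanically. The following is a minimal sketch of such a check in Python. It is not the project's actual commit hook, and the names `HEADER_RE` and `is_valid_header` are hypothetical. It assumes a scope contains no spaces or parentheses.

```python
import re

COMMIT_TYPES = ("build", "ci", "docs", "feat", "fix", "perf", "refactor", "test", "chore")

# <type>(<scope>): <short summary>, where the scope is optional.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<summary>\S.*)$"
)


def is_valid_header(header):
    """Return True if the first line of a commit message follows the format."""
    match = HEADER_RE.match(header)
    if match is None:
        return False
    summary = match.group("summary")
    # Summary rules from the diagram: not capitalized, no period at the end.
    return not summary[0].isupper() and not summary.endswith(".")


print(is_valid_header("feat(crawler): add category schedule"))  # True
print(is_valid_header("fix: Handle empty feed."))               # False
```

The check covers only the header line. Whether the summary is written in the present tense is left to the author and to review.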