Number of queries to Airflow database in "DAG File Processing Stats" #40282

Open
2 tasks done
MaksYermak opened this issue Jun 17, 2024 · 6 comments
Labels
kind:feature Feature Requests

Comments

@MaksYermak
Contributor

Description

This feature adds a new column to the "DAG File Processing Stats" table in the DAG processor logs. The column shows the number of queries made to the Airflow database per DAG file.

Use case/motivation

This column would be convenient when debugging issues related to high load on the Airflow database. A typical scenario is a DAG file that runs many database queries in top-level code, so those queries are executed every time the file is parsed. One common example is excessive use of "Variable.get" as a top-level statement in DAG files.

Having the number of queries to the Airflow database per DAG file would help a lot when debugging high database load or slow parsing of DAG files.
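
To illustrate the idea of a per-file query count, here is a minimal sketch. It is not the implementation proposed for Airflow: it uses the standard-library sqlite3 module as a stand-in for the metadata database, and the names count_queries, fake_variable_get, parse_heavy_dag and parse_light_dag are hypothetical helpers invented for this example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn: sqlite3.Connection):
    """Count SQL statements executed on `conn` inside the with-block."""
    counter = {"n": 0}

    def _on_statement(statement: str) -> None:
        # sqlite3 invokes this once per statement that is actually executed.
        counter["n"] += 1

    conn.set_trace_callback(_on_statement)
    try:
        yield counter
    finally:
        conn.set_trace_callback(None)  # stop counting


def fake_variable_get(conn: sqlite3.Connection, key: str):
    """Hypothetical stand-in for a lookup that hits the metadata database."""
    row = conn.execute("SELECT val FROM variable WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def parse_heavy_dag(conn: sqlite3.Connection):
    # Simulates three top-level lookups that run on every parse of the file.
    return [fake_variable_get(conn, key) for key in ("a", "b", "c")]


def parse_light_dag(conn: sqlite3.Connection):
    # Simulates a file whose top-level code never touches the database.
    return []


# Autocommit mode, so the module adds no implicit BEGIN statements to the count.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO variable VALUES ('a', '1'), ('b', '2'), ('c', '3')")

stats = {}
for name, parse in (("heavy_dag.py", parse_heavy_dag), ("light_dag.py", parse_light_dag)):
    with count_queries(conn) as counter:
        parse(conn)
    stats[name] = counter["n"]

print(stats)
```

A file with a high count relative to its neighbours is the kind of outlier the new column is meant to surface.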

Related issues

Thread with discussion in the Airflow community: https://lists.apache.org/thread/9j6q2lq521rt5zx46l2dvow2c85sgqwb

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@MaksYermak MaksYermak added the kind:feature (Feature Requests) and needs-triage (label for new issues that we didn't triage yet) labels on Jun 17, 2024
@raphaelauv
Contributor

raphaelauv commented Jun 17, 2024

Good idea!

I would also love a global Airflow option, DAG_PARSING_METADATABASE_CALL_FORBIDEN, that could be set to True so that parsing fails for DAGs that use variables, connections, ... at the top level. The error log could then promote the use of Dynamic Task Mapping.

@troxil

troxil commented Jun 17, 2024

@raphaelauv Wouldn't such a feature conflict with the theory of the Secrets Cache that was recently released too?

@raphaelauv
Contributor

raphaelauv commented Jun 18, 2024

AIRFLOW__SECRETS__USE_CACHE defaults to false.

And yes, that feature will no longer be useful when DAG_PARSING_METADATABASE_CALL_FORBIDEN is set to True.

@VladaZakharova
Contributor

@eladkal @potiuk @kaxil
Let's continue our discussion here about adding more logs to our output.
@raphaelauv Thank you for your ideas, but I think this is not directly related to this specific issue, which is about adding specific logs to the output. It may be more efficient to create a new issue for this. WDYT?

@kaxil
Member

kaxil commented Jun 19, 2024

> @raphaelauv Thank you for your ideas, but I think this is not directly related to this specific issue, which is about adding specific logs to the output. It may be more efficient to create a new issue for this. WDYT?

Yeah, let's keep it separate

@MaksYermak
Contributor Author

In my PR for this feature I have added a new column to the table of processing results in the log file. I saw some ideas in the discussion thread about adding this information to the UI or the DB, but I didn't notice a strong consensus, so I decided not to make any UI or DB changes in this PR. I am ready to create a new PR in the future once we have a clear vision of where in the UI and DB this data should live.

6 participants