
feat(bigquery): add better timers around every API call #8626

Merged: 20 commits into datahub-project:master on Sep 15, 2023

Conversation

@mayurinehate (Collaborator) commented on Aug 14, 2023:

Tested with bigquery connector-tests for accuracy.

Primary changes:

  1. Updated the PerfTimer implementation to allow pausing the timer (see the sketch after these lists).
  2. Renamed BigQueryDataDictionary to BigQuerySchemaApi and added timers for the calls in this class.
  3. Refactored core audit log extraction methods into a new class, BigQueryAuditLogApi, and added timers for the calls in this class.

Refactors:

  1. Refactored connection config and related methods into the BigQueryConnectionConfig class.
  2. Refactored BigQueryDataDictionary to hold the BigQuery client as an instance attribute and moved related static methods to instance methods.
  3. Refactored lineage code to live solely in BigQueryLineageExtractor.
  4. Refactored and moved all queries/query templates to bigquery_v2/queries.py.

Functional changes:

  1. Fixed repetitive access to BQ exported audit logs in the lineage case.
  2. Deduplicated insertId via QUALIFY in BQ exported audit logs in the lineage case.
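
As a rough illustration of the pausing behavior, here is a minimal sketch of a pausable timer. It is illustrative only; the actual PerfTimer in metadata-ingestion/src/datahub/utilities/perf_timer.py differs in details such as error handling and reporting, and run_query/process in the usage note are hypothetical names.

import time
from typing import Optional


class PerfTimer:
    """Minimal pausable timer sketch; not the exact implementation in this PR."""

    def __init__(self) -> None:
        self.start_time: Optional[float] = None
        self.paused: bool = False
        self._elapsed: float = 0.0

    def start(self) -> None:
        self.start_time = time.perf_counter()
        self.paused = False

    def pause(self) -> "PerfTimer":
        # Bank the time accumulated so far and stop counting until the
        # surrounding `with` block exits.
        assert self.start_time is not None
        self._elapsed += time.perf_counter() - self.start_time
        self.paused = True
        return self

    def __enter__(self) -> "PerfTimer":
        if not self.paused:  # entering a paused-timer context is a no-op
            self.start()
        return self

    def __exit__(self, *exc) -> None:
        if self.paused:
            # Leaving a pause context: resume counting.
            self.start_time = time.perf_counter()
            self.paused = False
        else:
            self._elapsed += time.perf_counter() - self.start_time
            self.start_time = None

    def elapsed_seconds(self) -> float:
        if self.paused or self.start_time is None:
            return self._elapsed
        return self._elapsed + (time.perf_counter() - self.start_time)


# Usage: the API call time is measured, but time spent by downstream consumers
# of the yielded rows is excluded via pause().
# with PerfTimer() as timer:
#     for row in run_query():        # hypothetical query call, counted
#         with timer.pause():
#             process(row)           # hypothetical processing, not counted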

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Aug 14, 2023
@asikowitz (Collaborator) left a comment:

Had some trouble reviewing this because there's so much copied/moved code; I ended up only skimming through most of the lineage and usage changes. Looks very good to me. My main concern is around the assertions in the perf timer. I think we should avoid erroring due to the perf timer and instead note that something went wrong in the report, if possible.

try:
-    projects = BigQueryDataDictionary.get_projects(conn)
+    projects = self.bigquery_data_dictionary.get_projects()
except Exception as e:
Collaborator:
Might change my mind after seeing the data dictionary, but I think it might make sense to catch this in the data dictionary method rather than here.

Collaborator (Author):
not sure if it makes a big difference but I moved it as suggested.

Comment on lines +41 to +48
bigquery_audit_metadata_query_template: Callable[
    [
        str,  # dataset: str
        bool,  # use_date_sharded_tables: bool
        Optional[int],  # limit: Optional[int] = None
    ],
    str,
],
Collaborator:
This is fine here, but in general I'd like to avoid passing complex functions because it's kinda ugly and hard to extend

Collaborator (Author):
agree
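
For illustration, one way to avoid a bare Callable like this is a small typing.Protocol that names the parameters. This is only a sketch of the idea discussed above, not what the PR actually does; AuditMetadataQueryBuilder is a hypothetical name.

from typing import Optional, Protocol


class AuditMetadataQueryBuilder(Protocol):
    # Hypothetical replacement for the bare Callable annotation above:
    # the parameter names become part of the documented interface.
    def __call__(
        self,
        dataset: str,
        use_date_sharded_tables: bool,
        limit: Optional[int] = None,
    ) -> str:
        ...


# The field annotation would then read:
# bigquery_audit_metadata_query_template: AuditMetadataQueryBuilder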

Comment on lines 90 to 98
if rate_limiter:
    with rate_limiter:
        for entry in query_job:
            with current_timer.pause_timer():
                yield entry
else:
    for entry in query_job:
        with current_timer.pause_timer():
            yield entry
Collaborator:
This is pretty awkward. Maybe we can set the rate limiter to an empty context manager, or put the decision of whether to rate limit inside the rate limiter (we'd probably have to wrap it then)? Not sure.
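
A minimal sketch of the empty-context-manager idea using contextlib.nullcontext, reusing rate_limiter, query_job, and current_timer from the snippet above (illustrative only; this may not be what the final code does):

from contextlib import nullcontext

# Fall back to a no-op context manager when no rate limiter is configured,
# so the iteration loop only has to be written once.
# (This lives inside the same generator function as the original snippet.)
with rate_limiter if rate_limiter else nullcontext():
    for entry in query_job:
        with current_timer.pause_timer():
            yield entry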

Collaborator (Author):
Refactored it slightly. I think it looks better now.

Comment on lines 127 to 130
if i == 0:
    logger.info(
        f"Starting log load from GCP Logging for {client.project}"
    )
Collaborator:
This seems unnecessary -- we can get this info from the next log.

Collaborator (Author):
removing this

        self.report = report

    def get_client(self) -> bigquery.Client:
        assert self.bq_client is not None
Collaborator:
This shouldn't be necessary, right?

Collaborator (Author):
yep, removing this method altogether.

metadata-ingestion/src/datahub/utilities/perf_timer.py (outdated, resolved)
Comment on lines 33 to 38
        assert (
            not self.paused and not self.end_time
        ), "Can not pause a paused/stopped timer"
        assert (
            self.start_time is not None
        ), "Can not pause a timer that hasn't started. Did you forget to start the timer ?"
Collaborator:
For safety, since we're not great about wrapping everything with try / excepts, what do you think about a logger warning or error here instead? I would rather complete ingestion with faulty timers than have an uncaught exception. Or perhaps we can store an error state in this object, so it can be seen in the report and we know not to trust the time

Collaborator (Author):
Right. Let me add warning logs in this case, and also add an error-state indicator for the reported time in that case.
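
A rough sketch of what that could look like (hedged; only the guard logic is shown, the attribute names are assumptions, and the merged code may differ):

import logging
from typing import Optional

logger = logging.getLogger(__name__)


class PerfTimer:
    def __init__(self) -> None:
        self.start_time: Optional[float] = None
        self.end_time: Optional[float] = None
        self.paused: bool = False
        self._error_state: bool = False  # surfaced in the report instead of raising

    def pause(self) -> "PerfTimer":
        if self.paused or self.end_time is not None:
            # Warn instead of asserting so ingestion can continue.
            logger.warning("Cannot pause a paused/stopped timer")
            self._error_state = True
        elif self.start_time is None:
            logger.warning("Cannot pause a timer that hasn't started")
            self._error_state = True
        else:
            self.paused = True  # elapsed-time bookkeeping omitted in this sketch
        return self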

Comment on lines +52 to +55
        if self.paused:  # Entering paused timer context, NO OP
            pass
        else:
            self.start()
Collaborator:
Suggested change:
-        if self.paused:  # Entering paused timer context, NO OP
-            pass
-        else:
-            self.start()
+        if not self.paused:
+            self.start()

        if self.paused:
            self.paused = False

    def pause_timer(self) -> "PerfTimer":
Collaborator:
Thoughts on renaming this to just pause()? Generally it's clear that the object being paused is a timer.

Collaborator (Author):
makes sense.

    time.sleep(1)
    assert round(timer.elapsed_seconds()) == 1

    assert round(timer.elapsed_seconds()) == 1
Collaborator:
Suggested change:
-    assert round(timer.elapsed_seconds()) == 1
+    assert pytest.approx(timer.elapsed_seconds()) == 1

Can do the same for others

Collaborator (Author):
done
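
For context, a short example of how an approx-based timing assertion with an explicit tolerance can look. It is illustrative only: the test name is hypothetical, and the tolerances and exact assertions used in the PR's tests may differ.

import time

import pytest

from datahub.utilities.perf_timer import PerfTimer


def test_perf_timer_simple() -> None:
    with PerfTimer() as timer:
        time.sleep(1)
    # An explicit absolute tolerance makes the expected precision visible
    # instead of relying on round().
    assert timer.elapsed_seconds() == pytest.approx(1, abs=0.1)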

@asikowitz (Collaborator):
Looks like a legitimate failure on FAILED tests/unit/test_bigquery_source.py::test_get_projects_list_failure because of the project list change

@mayurinehate (Collaborator, Author) commented on Sep 15, 2023:

> Looks like a legitimate failure on FAILED tests/unit/test_bigquery_source.py::test_get_projects_list_failure because of the project list change

@asikowitz the test is fixed.

@hsheth2 merged commit cdb9f5b into datahub-project:master on Sep 15, 2023. 58 checks passed.
hsheth2 added a commit to hsheth2/datahub that referenced this pull request Sep 20, 2023
Labels: ingestion (PR or Issue related to the ingestion of metadata)
3 participants