Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-12855: Add previous event IDs to provenance events to facilitate full graph traversal #9171

Closed
wants to merge 1 commit into from

Conversation

mattyb149
Copy link
Contributor

Summary

NIFI-12855 This PR adds "previousEventIds" to provenance events. This can help graph-based applications/databases provide correct lineage more quickly (namely, without having to sort the events by timestamp)

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@joewitt
Copy link
Contributor

joewitt commented Aug 11, 2024

Is this PR intended for Apache NiFI? It appears to fail the build due to usage of internal/fork versions of NiFi components.

It appears to be very similar to the second of two PRs previously offered for this JIRA specifically #8498

In that previous PR (8498) a concern was raised and appears to remain which relates to 'previousEventIds'. That comment can be found here #8498 (review). I echo that concern.

I left comments/questions on the previous PR as well. I don't see responses on that PR or the JIRA or any comments in this PR to address those concerns/questions. Possibly I'm just missing where that dialogue happened and if so please direct where it is.

@mattyb149
Copy link
Contributor Author

My mistake on the version, I pushed the fix. I believe the discussion took place on Slack but it's too old to retrieve from the Apache NiFi Slack as we don't have the pro version. Basically the idea is to provide a sort of "linked-list adjacency" that you see implemented fully as "index-free adjacency" in graph databases. Currently the lineage computation involves getting every event for a FlowFile, then sorting by timestamp, then repeating for all parent and child FlowFiles. This is too slow for determining an entire provenance graph. Consider GenerateFlowFile -> UpdateAttribute -> UpdateAttribute -> , the UUID for the FlowFile remains the same but to tell which UpdateAttribute came first, you have to sort them by time. previousEventIds is used to enable a sort of "on-the-fly" streaming behavior to stream new provenance events into a graph database without having to re-link any of the previous nodes, just adding new nodes and links from the previous event(s) to the new one.

* @return a unique ID for the "parent" Provenance Event, namely the one that came directly before this event
* for the given FlowFile. For source events such as CREATE, this value should be set to -1
*/
Set<Long> getPreviousEventIds();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why choose '-1' as a value to indicate this is the first in a chain rather than an empty set which would also be consistent with other methods?

What scenario could result in more than one previous event identifier? In the case this event represents a JOIN then there will have been one to many parent flow files and we already have those captured as parent flow file references. That seems correct as the mechanism then to follow the provenance event trail of those flowfiles already referenced here. Capturing both the event ids and the parent flowfile references seems unnecessary. If there is no previousEventIdentifier it should suggest you check the parentFlowFiles to iterate and follow their provenance path.

* @param record The record for which to update the previous event IDs
* @param previousIds the list of previous event IDs to set for the record, or null to remove
*/
void updatePreviousEventIds(ProvenanceEventRecord record, Set<Long> previousIds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would they need to be updated rather than just set?

@@ -122,6 +122,8 @@ public class JdbcCommon {

public static final Pattern LONG_PATTERN = Pattern.compile("^-?\\d{1,19}$");
public static final Pattern SQL_TYPE_ATTRIBUTE_PATTERN = Pattern.compile("sql\\.args\\.(\\d+)\\.type");

public static final Pattern JDBC_ARRAY_ELEMENT_CLASS_TYPE_PATTERN = Pattern.compile("JavaType\\(class (.*)\\) ARRAY");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this JDBC utility change necessary for the provenance change being proposed in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, will remove

@@ -51,5 +51,11 @@
<artifactId>lucene-backward-codecs</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.nifi</groupId>
<artifactId>nifi-framework-core-api</artifactId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is happening that is requiring a new dependency on the nifi-framework-core-api to be introduced here? And the reference to compile - is it intended this API is copied into the NAR?

@@ -33,6 +33,7 @@ public class LookupTableEventRecordFields {

// General Event fields.
public static final RecordField RECORD_IDENTIFIER_OFFSET = new SimpleRecordField(EventFieldNames.EVENT_IDENTIFIER, FieldType.INT, EXACTLY_ONE);
public static final RecordField PREVIOUS_EVENT_IDENTIFIERS = new SimpleRecordField(EventFieldNames.PREVIOUS_EVENT_IDENTIFIERS, FieldType.LONG, ZERO_OR_MORE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it would be 'one or more' given the '-1' mechanism. But zero or more does seem more correct. Left previous comment on that

import org.apache.nifi.provenance.serialization.StorageSummary;

public class StorageSummaryEvent implements ProvenanceEventRecord {
public class StorageSummaryEvent implements UpdateableProvenanceEventRecord {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change to an 'UpdateableProvenanceEventRecord' abstraction and addition of setters stands out. What is it about this new concept (previous event identifiers) that is causing a change in abstraction ?

@joewitt
Copy link
Contributor

joewitt commented Aug 11, 2024

Understood on Slack. It is well understood about that limitation. Please put the necessary context in the PR or JIRA so reviewers can participate and/or know their comments are being factored in. You can copy/paste from slack.

I better understand the intent based on your recent comment. Effectively while the key information is provided in our current provenance event/flowfile model it puts too much strain on the destination database to determine what the lineage graph looked like in terms of ordering. This is because we provide the necessary relationships and timestamps but it would require the downstream database to reconstruct them once all events are completed as timestamp is currently a key indicator of order of operations for a given set of events associated with a single flow file.

I can see how this could be beneficial to resolve.

What other solutions were considered to address this?

The proposed change creates a link by the long identifier of all possible 'previous' events for a given provenance event and adds an alternative linkage over the current model which is purely based on the relationship of a flow file, the type of event, and any parent flowfiles from which it stems.

The core issue is ordering within the lifespan of a single flowfile. That seems solvable with a single 'previous event identifier'. In the case the previous event identifier is not established this is the start of this flowfile's life. That could mean it was a created flowfile or it was the result of various different sources being combined into a new child object. We already know and capture the parent flowfile identifier so this chain can be followed by the downstream database.

I left a couple comments as it relates to changes in the abstractions as well (Abstract repository, UpdateableEvent...). It isn't clear to me why such changes are desirable.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattyb149 Thanks for providing a bit of additional context around the difficulties related to streaming Provenance Records into a graph database as they stand. It is helpful to understand the intent in addition to the code itself.

With that being said, I am a -1 on the current approach. Making changes to the public nifi-api for Provenance Records, with the associated impact on implementation components, does not seem like the right solution. I have several concerns at the implementation level as well, but I'm skipping over those for now.

As I highlighted in previous pull requests, and as @joewitt noted, the parent FlowFile link already exists in the Provenance Record.

I would also be interested to know what other approaches were considered. The ProvenanceRepository in a framework-level extension, which is one potential alternative for supporting other solutions.

Although the use case is worth consideration, this is an important area to evaluate possible options. We should give careful thought to introducing public API changes for something that is outside of a standard framework concern.

@mattyb149
Copy link
Contributor Author

I will investigate a single previous event ID, hopefully that will be good enough for lineage purposes.

@mattyb149 mattyb149 closed this Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants