
feat: Add --max-comments-per-command configuration #3905

Merged
merged 19 commits into from
Jun 27, 2024

Conversation

glasser
Contributor

@glasser glasser commented Oct 31, 2023

what

Allow Atlantis administrators to cap the number of VCS comments that will be created for a single command. (The default value is 0, which means unlimited comments.)
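For illustration, enabling the cap at server start might look like this (the flag name comes from this PR's title; the environment-variable form is an assumption based on Atlantis's usual flag-to-`ATLANTIS_*` convention and should be checked against the released docs):

```shell
# Illustrative only: cap each command at 10 VCS comments.
# 0 (the default) means unlimited. --repo-allowlist is unrelated
# boilerplate shown for context.
atlantis server \
  --repo-allowlist='github.com/myorg/*' \
  --max-comments-per-command=10

# Assumed environment-variable form (verify against the released docs):
# ATLANTIS_MAX_COMMENTS_PER_COMMAND=10
```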

why

If you're trying something like TF_LOG=debug, Atlantis can produce so many comments that it becomes challenging to read your PR, or worse still, your VCS might rate-limit Atlantis, effectively breaking your ability to use it for an extended period of time. While this PR does not change the default behavior, it's probably a good idea to set this flag to something like 10.

tests

Unit tests for common.SplitComment. (There don't appear to be other tests of comment splitting.)

references

@glasser
Contributor Author

glasser commented Oct 31, 2023

Hi! This is my first Atlantis PR and is a basic attempt to fix #416. I'd love to have some high-level feedback as to whether this is a reasonable approach that would be accepted before I finish it.

So far I've just focused on the top level structure of how to add a new entry; before this could be merged I would write implementations for non-GitHub VCS implementations.

@glasser glasser changed the title Add --max-comments-per-command configuration feat: Add --max-comments-per-command configuration Oct 31, 2023
Fixes runatlantis#416.
@jamengual jamengual added the waiting-on-review Waiting for a review from a maintainer label Nov 1, 2023
@lukemassa
Contributor

@glasser thanks for the contribution! I think this is definitely a step in the right direction.

I've recently been a bit worried about the growing number of server configuration options. I wonder if a first pass could set a hard-coded value like 10, or even 100? We can always make it configurable later; I just feel like the marginal benefit of additional comments goes way down after even just a couple, and if you need to capture 327,680 (or 3M) characters, you can (and should) be using a "debug workflow" with a custom command. We could even write a brief "how to" and link to it in the footer. I'm interested in what the maintainers have to say about the tradeoff between not changing current behavior vs. stemming the rising tide of configuration options. And also, @glasser, whether there's a hard-coded number you think would be sufficient (or even, given you've implemented this, what you had planned to set it to for your setup)?

Another, unrelated concern I have is clipping the end, when likely the last few lines will have a lot more value than the lines in the middle. I'm not sure how to address that, and it could well be a future improvement, something akin to golang/go#7181. For example, if there are determined to be n comments' worth of text and the max is k < n, we could show the first k-2, one comment that says "elided pages of text", then one last comment with the nth part. Again, this is just an idea for a possible future improvement; I don't think it needs to complicate this PR.
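Sketching that keep-the-tail idea (purely illustrative; this is not Atlantis's implementation, and all names here are made up):

```go
package main

import "fmt"

// capKeepLast sketches the elision idea from the discussion: given n chunks
// and a cap of k < n, emit the first k-2 chunks, one marker comment noting
// how much was elided, and then the final chunk (which usually carries the
// error summary).
func capKeepLast(parts []string, k int) []string {
	if k <= 0 || len(parts) <= k {
		return parts // cap disabled, or already under the cap
	}
	if k < 3 {
		return parts[len(parts)-k:] // too few slots to elide the middle; keep the tail
	}
	elided := len(parts) - (k - 1) // chunks replaced by the marker
	out := append([]string{}, parts[:k-2]...)
	out = append(out, fmt.Sprintf("(%d comments elided; see the full log in the Atlantis UI)", elided))
	return append(out, parts[len(parts)-1])
}

func main() {
	parts := []string{"p1", "p2", "p3", "p4", "p5"}
	// With k=3: keep p1, insert a marker covering p2-p4, keep p5.
	fmt.Println(capKeepLast(parts, 3))
}
```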

All that said, certainly some logs are better than the crashing behavior we have now, so I definitely want to pursue this.

@glasser
Contributor Author

glasser commented Nov 4, 2023

I do like the idea of preserving (say) the last page. That said, you can also view the full log at the web UI anyway, right? Which in practice is what I'd do for any log that triggers this.

I was trying to avoid backwards-incompatible changes without any way to even revert it, thus the configuration. If you'd prefer an uncontrollable change, I'd get that. My project will probably limit to 10.

@lukemassa
Contributor

I do like the idea of preserving (say) the last page. That said, you can also view the full log at the web UI anyway, right? Which in practice is what I'd do for any log that triggers this.

Yeah good point. Let's leave this as "maybe do" for later.

I was trying to avoid backwards-incompatible changes without any way to even revert it, thus the configuration. If you'd prefer an uncontrollable change, I'd get that. My project will probably limit to 10.

That's my personal preference, others may differ. We can always make it configurable later if need be, I just want to defer adding a server flag if possible, since those are harder to remove.

I personally agree that getting back >10 comments, each with 30k characters of text, feels pretty cumbersome, and users should at that point be encouraged to pursue a different way of saving/shipping logs.

@glasser
Contributor Author

glasser commented Nov 4, 2023

Ok, cool. I'm not sure how the process works around here -- before I go and rip out the option, is this a relatively solid decision and you're likely to merge the PR, or should we wait for more thoughts?

@jamengual
Contributor

@GenPage, @nitrocode and I are the maintainers, and @lukemassa has been helping with a lot of contributions lately.

let's wait for us to catch up with this PR in a few weeks.

Thanks

@lukemassa
Contributor

Ah sorry, I should have been more clear; I was just offering my opinion, I don't have any authority on the matter :) I'll let @jamengual take it from here with a recommendation, apologies for not clarifying that.

@jamengual
Contributor

Ah sorry, I should have been more clear; I was just offering my opinion, I don't have any authority on the matter :) I'll let @jamengual take it from here with a recommendation, apologies for not clarifying that.

all good @lukemassa your input is super valuable to us and is always welcome.

I hope next week we can find more time to do reviews; plus, we are in the process of changing the release pipeline, so it is going to take some time.

@brandon-fryslie

brandon-fryslie commented Nov 8, 2023

Hi, I'm just a regular engineer and I absolutely do not mean to knock the work you've put in. I really appreciate every contributor to this project, and it's helped us so much. I think this PR is pretty useful functionality to have, maybe with some minor tweaks. I see a few issues:

  • The argument to GithubClient seems unrelated to the rest of the arguments. Could there be a better place to get this value, or otherwise a more generic pattern for passing user configuration to SCM providers? For example, where is CHECKOUT_DEPTH provided? Of all the arguments, it is certainly the odd one out.
  • I echo the previous concerns about truncating. Truncating the middle is vastly preferable. You'd miss a huge amount of context with this, like how many changes were made, what failed, any error messages. Unless the logs were saved somewhere else (like disk, an S3 bucket, etc.), the logs of the run would be pretty much useless.
  • I'd almost prefer some sort of "minification" before truncation. I find most of my "many comments" PRs come from changes to inline policy templates used across many roles. Terraform doesn't do any sort of comparison right now and lists the entire output of both the "removed" role and then the "new" role. Most of that is useless, so maybe diffs larger than 100-200-whatever lines get truncated in the middle. That would dovetail nicely with functionality to save these logs off to S3 or somewhere else, while hardly impacting usability in any way.
  • I'm not exactly sure how to achieve this (I'm not familiar with all of the SCM architecture in Atlantis), but I don't see why this would be implemented separately for each SCM (GitHub/GitLab/whatever the other one is). It doesn't seem like configuring these separately per SCM would be especially valuable.

These are just my 2c; we do run into this issue sometimes. We would never use TF_LOG=debug in Atlantis since we're admins and can run things locally, but honestly I don't think you'd ever want that output directly in a PR comment. It might contain tokens or sensitive values, and lots of people can see those PR comments. It would be far better to filter ALL debug output from the comments and have a secondary storage medium for unfiltered output (that is, of course, highly restricted like Atlantis itself should be). Disk is easiest at first, but then you have to think about cleanup. S3 is AWS-specific, but there are tools that allow mounting a bucket as a file path. Then you can push cleanup onto the users and avoid a whole mess of complexity around filling your disk within Atlantis itself.

@glasser
Contributor Author

glasser commented Nov 8, 2023

The argument to GithubClient seems unrelated to the rest of the arguments. Could there be a better place to get this value, or otherwise a more generic pattern for passing user configuration to SCM providers? For example, where is CHECKOUT_DEPTH provided? Of all the arguments, it is certainly the odd one out.

Sure, it would be reasonable to have a struct for common configuration for creating comments.

I'm not exactly sure how to achieve this (I'm not familiar with all of the SCM architecture in Atlantis), but I don't see why this would be implemented separately for each SCM (GitHub/GitLab/whatever the other one is). It doesn't seem like configuring these separately per SCM would be especially valuable.

It's not being configured separately per SCM, just passed through. (The different SCMs do have different formats for their headers/footers and length limits, which is why the split call is happening inside the SCM code.)

@brandon-fryslie

brandon-fryslie commented Nov 8, 2023

It's not being configured separately per SCM, just passed through. (The different SCMs do have different formats for their headers/footers and length limits.)

Totally missed that. Awesome, great work. It wouldn't especially help our team/teams (we have 10+ Atlantis instances running internally; my team runs 2, one for dev and one for prod), but what I am seeing is that teams are simply taking the output and persisting it elsewhere via a Python script.

I (highly) appreciate everyone who has contributed to this project. My ideas were mostly suggestions to spur future thoughts; I don't expect them to make it into your PR or want to complicate things. However, middle truncation is kind of a base-level functionality. It would almost be better to limit the maximum number of projects that can run, or to allow custom filtering inline.

Many different people are using Atlantis. If this were, e.g., a default of 10 comments per Atlantis command, e.g., atlantis plan, then nearly every single PR we have would be truncated. We have a monorepo (this was to get the team familiar with TF; now we're splitting new things out) with (these are made-up numbers) ~40 root modules and ~40 child modules. We have 8 separate environments. So if you were to make a change that impacted 4 modules used in all environments, that's 32 different runs of terraform alone. It's easy to make a change that requires 50 or 60 terraform executions. If you have even a single module that needs to split its output into multiple comments, you're automatically looking at 70+ comments for a SINGLE execution of atlantis plan. It's not uncommon to have 150+ comments from Atlantis by the time it's done.

I'm not expecting you to change anything on our behalf. Just trying to give some perspective on different use cases in case it's useful. Huge appreciation for the maintainers, along with everyone who contributes great, useful features like this!

@glasser
Contributor Author

glasser commented Nov 8, 2023

That seems like a good data point about the question @lukemassa raised as to whether it would be reasonable to have this be a non-configurable hardcoded value (with the answer "no").

@brandon-fryslie

brandon-fryslie commented Nov 8, 2023

That seems like a good data point about the question @lukemassa raised as to whether it would be reasonable to have this be a non-configurable hardcoded value (with the answer "no").

We would be stuck on whatever version of Atlantis allowed us to use it, I guess :)

If there were some sort of mechanism to persist the full logs to "offsite storage" (a bucket seems logical, since there are relatively few restrictions beyond access, and it removes the burden of cleanup from Atlantis entirely; let users clean up their own bucket), that would completely resolve any concerns about truncation, number of comments, whatever.

The logs in the "Detail Link" are kind of a joke unless you're just testing a single module to learn terraform and have time to sit around and watch it. 95% of the time the data is gone by the time we need it. Since Atlantis is in a completely isolated environment, non-admins can't get anywhere near that cluster. I don't think more advanced devops people would care if it was removed entirely, for the most part. Our users can't access it, and it would be empty if they did. Without persistence, pre/post workflow hooks, stuff like that, it's maybe 0-5% useful in my experience. It's certainly not close to a replacement for the PR comments, which are available for more than the 3-4 minutes it takes for the module to run. The single use case is "my module has been running for 45 minutes and I have no idea what's going on, let's see what's up". But out of thousands and thousands of deployments we've never actually needed those logs to debug that issue (hint: it's network connectivity, or you're replacing node groups on an EKS cluster).

The point there is that those live streaming logs are not a replacement for anything. Possibly "offsite" persistence via s3 could actually make those logs useful, who knows. I really apologize if one of you implemented those. Absolutely nothing personal and I'll buy you a drink next time you're in town. Sorry for being brutally honest

@brandon-fryslie

We have teams persisting to gist but I don't think that's generic enough for Atlantis code. There just needs to be an extremely simple adapter to optionally persist the full, unaltered logs to a second "offsite" location. Lifecycle rules, cleanup, whatever, could be entirely left to the user. No need to complicate Atlantis.

That would remove any pressure on changes to the PR comments: truncation, formatting, anything. No matter what, you'd always be able to refer to the actual unfiltered output. The PR comments are critical right now because you can't do anything without them. They're the source of truth. Full, unaltered text output could become the source of truth, and give more flexibility to alterations in the content of the comments.

And hey, you could probably read those files back in when someone wants to see the output of a job via the "Show Details" link. Then, rather than the constant disappointment that greets users when they see that empty, black screen, there's a chance it could help them solve a problem.

@brandon-fryslie

brandon-fryslie commented Nov 8, 2023

Just to recap, since my comments might be interpreted as off-topic, here's the gist:

  • This directly relates to the "comments issue" because it would always provide a fallback option if something got truncated in the comments. Otherwise, no matter what, truncation is losing some data.
  • Right now the PR comments are the ONLY viable way to see what happened during an atlantis execution.
  • The "Show Details" link is not helpful because it's empty most of the time; it seems to be emptied as soon as the build finishes.
  • With fully enabled and unfiltered "offsite" storage, comments could be experimented with and made much more flexible without breaking the single interface to your job results.
  • Offsite storage (vs. storing it on the volume) means that Atlantis doesn't need to implement ANY logic for cleanup. Make users do it: there are so many options; S3 has lifecycle rules for pretty much anything, and a tiny lambda could do the rest. Atlantis shouldn't need to worry about it.

There is no fallback right now if truncation is merged and breaks a 1-in-10,000 use case. Changes to the primary interface with zero fallbacks are inherently risky. People will come in hot with stuff like "". At least if the logs are in S3 or somewhere, you can tell them to use that instead. Like git plumbing vs. porcelain commands: offsite output is plumbing; PR comments are porcelain.

@jamengual
Contributor

@glasser @brandon-fryslie I'm a bit conflicted by this feature. It is trying to solve a problem that does not pertain to Atlantis's core features, and I will explain why.

  • Terraform debug: debug output is usually considered sensitive; it is not a good idea to expose it anywhere it can be saved, especially in a repo where it will be held forever.
  • Debug logs: debug/trace logs are usually too lengthy and are traditionally sent to other systems for aggregation, searching, etc.

Based on that, if you want to debug a run in Atlantis, you should enable debug and use the Detail live logs to see the logs, but I won't recommend running a debug run based on my earlier comment.

Atlantis support for external log storage would be better suited for this kind of thing, where you can send logs to S3, CloudWatch, fluentd or any other system instead of your PR comments. You could have a server-side repo config like:

```yaml
external-logging: enabled
external-log-facility: localhost:2020
external-log-strategy: debug  # options could be: debug, all
```

That could point to your log aggregator to then send the log somewhere else to be seen by whoever is trying to debug.

That feature does not exist yet and would need to be built.

@benoittoulme benoittoulme dismissed stale reviews from lukemassa and GenPage via 96c71a3 June 18, 2024 18:03
@benoittoulme
Contributor

The changes here don't seem to incorporate #3905 (comment); however, seeing as they are backwards compatible, if people are happy with the approach, this could be merged as is, and I could implement my proposal in a separate PR. That might make it clearer to separate the implementation from the "policy".

@lukemassa sorry for the back-and-forth. I addressed the first 2 points of your comment. For 3) and 4), I would argue that this would be better in its own PR, as AFAIK no such mechanism exists. Also, I would argue (and plan to do so in my company) that post-plan or post-apply hooks would allow someone to implement such a mechanism, for example by uploading the plan / command results to S3. But I doubt we can have a solution baked into Atlantis that works for all.
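As a sketch of that hook-based approach (hypothetical bucket name and log file path; the exact hook syntax and the environment variables available to hooks should be verified against your Atlantis version's server-side repo config docs), shipping command output to S3 after each run might look like:

```yaml
# Server-side repos.yaml: after each command, ship whatever log file your
# workflow wrote to object storage. Everything below is a placeholder
# sketch, not a tested configuration.
repos:
  - id: /.*/
    post_workflow_hooks:
      - run: |
          aws s3 cp ./output.log \
            "s3://my-atlantis-logs/$BASE_REPO_OWNER/$BASE_REPO_NAME/$PULL_NUM/output.log"
```

Bucket lifecycle rules can then handle retention, keeping cleanup logic out of Atlantis entirely.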

So please, could you do another review, and thank you for your feedback!

@lukemassa
Contributor

lukemassa commented Jun 26, 2024

@benoittoulme looks good to me! @GenPage if this looks good to you, I'm happy to go ahead and merge.

NOTE: If we accept this as is, it'll need to go out in a minor release, and there should be a callout in the release notes. This changes default functionality, such that if a user relies on plans longer than 100 comments, they will need to set this value to 0 to restore the current behavior.

Overall I still think we're more likely to fix problems than cause them with this change (famous last words...), so I'm still supportive of the non-backwards compatible approach in this instance, as long as it's properly announced with a description of how to keep prior behavior.

@GenPage GenPage added feature New functionality/enhancement breaking-change and removed waiting-on-response Waiting for a response from the user needs discussion Large change that needs review from community/maintainers labels Jun 26, 2024
Member

@GenPage GenPage left a comment


LGTM. I added the feature and breaking-change labels and removed the docs label, which should highlight this in the release notes.

@GenPage GenPage removed the docs Documentation label Jun 26, 2024
@lukemassa lukemassa merged commit 5a918e3 into runatlantis:main Jun 27, 2024
28 checks passed
@glasser
Contributor Author

glasser commented Jun 27, 2024

@benoittoulme Thanks for finishing this up!

(I removed you from my fork.)

@RodrigoMenezes-Vantage

Thanks for all the work everyone put into this.

Any idea when this will be getting shipped out? Just tried to use the flag only to realize it's not included in the latest release.

@benoittoulme
Contributor

Thanks for all the work everyone put into this.

Any idea when this will be getting shipped out? Just tried to use the flag only to realize it's not included in the latest release.

I believe this will be included in the v0.29.0 release, because of the breaking change (I am also waiting for this release :) )

@benoittoulme
Contributor

Thanks for all the work everyone put into this.
Any idea when this will be getting shipped out? Just tried to use the flag only to realize it's not included in the latest release.

I believe this will be included in the v0.29.0 release, because of the breaking change (I am also waiting for this release :) )

🥳 🎉 https://github.com/runatlantis/atlantis/releases/tag/v0.29.0
Thank you!

@mubarak-j

@benoittoulme Thank you for working on this awesome addition, much appreciated!
I filed #4920 which seems like a regression when combining this new flag with hide-prev-plan-comments. Assuming it’s not intentional, is this something you can replicate in your setup?


Successfully merging this pull request may close these issues.

Don't post comment if it's too large
9 participants