Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus Analysis SigV4 Support #2458

Closed
lewinkedrs opened this issue Dec 6, 2022 · 21 comments · Fixed by #2489 or #2794
Closed

Prometheus Analysis SigV4 Support #2458

lewinkedrs opened this issue Dec 6, 2022 · 21 comments · Fixed by #2489 or #2794
Labels
enhancement New feature or request

Comments

@lewinkedrs
Copy link
Contributor

Summary

Prometheus analysis should support SigV4 when creating the prometheus client. This would allow it to query Amazon Managed Prometheus, which requires SigV4 authentication.

The prometheus client already supports a SigV4 RoundTripper. This can be used when constructing the client to allow a user to authenticate with Amazon Managed Prometheus using the default GO credential provider chain.

Use Cases

When performing prometheus analysis using Amazon Managed Prometheus as a metric data store.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@danil-smirnov
Copy link
Contributor

Hi @zachaller

We tried to leverage this feature using the latest tag of the AR controller image but we were hit by an error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1d986c4]
goroutine 353 [running]:
github.com/prometheus/common/sigv4.NewSigV4RoundTripper(0x0, {0x0?, 0x0?})
	/go/pkg/mod/github.com/prometheus/common/sigv4@v0.1.0/sigv4.go:57 +0x64
github.com/argoproj/argo-rollouts/metricproviders/prometheus.NewPrometheusAPI({{0xc000fbbcc0, 0xc}, {0xc000fbbccc, 0x2}, {0x0, 0x0}, 0x0, {0xc000fbbcd0, 0x10}, {0xc000fbbce0, ...}, ...})
	/go/src/github.com/argoproj/argo-rollouts/metricproviders/prometheus/prometheus.go:176 +0x334
github.com/argoproj/argo-rollouts/metricproviders.(*ProviderFactory).NewProvider(_, {0xc0001f6000, 0xc000c2eb70, {0x0, 0x0, 0x0}, 0x0, 0x0, {0x0, 0x0}, ...}, ...)
	/go/src/github.com/argoproj/argo-rollouts/metricproviders/metricproviders.go:40 +0xe98
github.com/argoproj/argo-rollouts/analysis.(*Controller).runMeasurements.func1({{{0xc000fbbcc0, 0xc}, {0xc000fbbccc, 0x2}, {0x0, 0x0}, 0x0, {0xc000fbbcd0, 0x10}, {0xc000fbbce0, ...}, ...}, ...})
	/go/src/github.com/argoproj/argo-rollouts/analysis/analysis.go:328 +0x3fd
created by github.com/argoproj/argo-rollouts/analysis.(*Controller).runMeasurements
	/go/src/github.com/argoproj/argo-rollouts/analysis/analysis.go:319 +0x38a

We use the following code in our AnalysisTemplate:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: {{ include "damage-report-helm.backend.fullname" . }}-errors-rate
spec:
  args:
  - name: targetGroupName
  dryRun:
  - metricName: success-rate
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: isNaN(result[0])
    failureCondition: result[0] >= 1
    failureLimit: 1
    provider:
     prometheus:
       address: https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/ws-f77bee0c-0494-4267-b64c-91c938eb734b
       query: |
         sum(irate(http_requests_total{job="damage-report-backend",status="5xx"}[5m]))

I'm not sure if should I create another issue since release 1.5 is not out yet...

@lewinkedrs
Copy link
Contributor Author

lewinkedrs commented Mar 24, 2023

I think I see why this is happening. The RoundTripper should be initialized as a pointer and then passed as a reference.

	if strings.Contains(address, "aps-workspaces") {
		var cfg *sigv4.SigV4Config
		var next http.RoundTripper
		sigv4RoundTripper, err := sigv4.NewSigV4RoundTripper(cfg, next)

should be:

	if strings.Contains(address, "aps-workspaces") {
		var cfg sigv4.SigV4Config
		var next http.RoundTripper
		sigv4RoundTripper, err := sigv4.NewSigV4RoundTripper(&cfg, next)

Will submit a fix for this.

@zachaller
Copy link
Collaborator

@lewinkedrs would you be able to get a fix in soon, I was planning on cutting a a 1.5-rc here today or monday but will possibly hold for this even though it is a bug so could go in a patch release as well or 1.5 GA.

@lewinkedrs
Copy link
Contributor Author

submitted #2680

zachaller pushed a commit that referenced this issue Mar 24, 2023
fixing nil pointer

Signed-off-by: Kevin Lewin <lewinke@amazon.com>
@danil-smirnov
Copy link
Contributor

danil-smirnov commented Apr 20, 2023

@zachaller @lewinkedrs Was it tested against the real AMP? I am trying to make it work but it still fails with the 403 error

@lewinkedrs
Copy link
Contributor Author

@danil-smirnov I was able to get this working with AMP. This is just using the go-sdk default provider chain. Do you have the proper environment variables/credential file set up to provide authorization to AMP? Does the error provide any additional details?

@danil-smirnov
Copy link
Contributor

danil-smirnov commented Apr 21, 2023

@lewinkedrs I'm using AR Docker image with the Service Account attached. Not sure if this is enough to pass the credentials. I see only the following environment when describing the AR pod:

    Environment:
      AWS_REGION:                   eu-central-1
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_ROLE_ARN:                 arn:aws:iam::1234567890:role/argo-rollouts-service-account
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token

Also I can not exec into the pod to check the environment from inside since the image is distroless (

@jeromeinsf
Copy link

1234567890 looks like a placeholder value

@danil-smirnov
Copy link
Contributor

1234567890 looks like a placeholder value

I've spoiled it intentionally.

@jeromeinsf
Copy link

@lewinkedrs should this come with some documentation/read.me on how to use it?

@lewinkedrs
Copy link
Contributor Author

@lewinkedrs I'm using AR Docker image with the Service Account attached. Not sure if this is enough to pass the credentials. I see only the following environment when describing the AR pod:

    Environment:
      AWS_REGION:                   eu-central-1
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_ROLE_ARN:                 arn:aws:iam::1234567890:role/argo-rollouts-service-account
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token

Also I can not exec into the pod to check the environment from inside since the image is distroless (

Are you using the same analysis template in your top level comment? The prometheus address needs to use the 'query' end point of amazon managed prometheus to accept the query. The address should look like this:

https://aps-workspaces.Region.amazonaws.com/workspaces/Workspace-id/api/v1/query

@jeromeinsf I will work on a write up for this.

@danil-smirnov
Copy link
Contributor

danil-smirnov commented Apr 21, 2023

@lewinkedrs If I change that to https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/my-id/api/v1/query I'm getting client_error: client error: 404 error instead of 403 as before.

@danil-smirnov
Copy link
Contributor

@lewinkedrs Stil can't make this work. ( I'm getting 403 or 404 errors depending on the URL I use. Did you prepare any docs on how to use this feature already?

@lewinkedrs
Copy link
Contributor Author

@danil-smirnov I will work on writing this up now. But as long as the IRSA you are using has permissions to read and write from the AMP workspace you should have no problem.

As a test, could you set up a regular prometheus server helm chart on your cluster using the same IRSA and see if you can remote-write to your AMP workspace? You can find instructions for that here. This PR just uses the same Sigv4 signing process that prometheus is using in their client so i'd like to see if you can reproduce your error there. If it works with prometheus but not argo, we may need to revisit this PR and figure out why the signing process is not working properly for you.

@moharam-dev
Copy link

moharam-dev commented May 17, 2023

Hi,

I think there is an issue with IRSA in the current implementation.

I'm trying to utilize AWS AMP in my AnalysisTemplates in EKS cluster. I configured Argo Rollouts controller pod to utilize IRSA and I'm still getting "403" when I try to run my analysis templates.

To test my setup I just added a dummy Ubuntu pod configured with the same SA as that used for my AR controller pod. From within this dummy pod, I can list AMP workspaces using awscli and I used awscurl to successfully query the AMP workspace referred in my AnalysisTemplate.

Then I copied partial go code from AR Prometheus metrics provider for testing purposes. Run it on my dummy test pod, and I got the same error "403"

following my test code (this is my first time to write golang so pls mind any mistakes !!)

package main
import (
        "fmt"
        "context"
        "time"
        "net/http"

        v1 "github.com/prometheus/client_golang/api/prometheus/v1"

        "github.com/prometheus/client_golang/api"
        "github.com/prometheus/common/sigv4"
)
func main() {
    fmt.Println("Hello, World!")
        # masked workspace id 
        prometheusApiConfig := api.Config{
                Address: "https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/xxxxxxxxxxxxxxxxxxxxxxxx",
        }
        var cfg sigv4.SigV4Config
        var next http.RoundTripper
        sigv4RoundTripper, err := sigv4.NewSigV4RoundTripper(&cfg, next)
        if err != nil {
                fmt.Println("Error creating SigV4 RoundTripper: %v", err)
                return 
        }
        prometheusApiConfig.RoundTripper = sigv4RoundTripper
        client, err := api.NewClient(prometheusApiConfig)
        if err != nil {
                fmt.Println("Error in getting prometheus client: %v", err)
                return
        }
        api := v1.NewAPI(client)
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        response, warnings, err := api.Query(ctx, "avg{dummy}", time.Now())
        if err != nil {
                fmt.Println("Error in getting prometheus client: %v", err)
        }
        if err != nil {
                fmt.Println("Warning in getting prometheus client: %v", warnings)
        }
        if response != nil {
                fmt.Println("Prometheus client response is : %v", response)
        }
}

and my mod file is as following:

module example/hello

go 1.19

require (
        github.com/aws/aws-sdk-go-v2 v1.17.7
        github.com/aws/aws-sdk-go-v2/config v1.18.19
        github.com/prometheus/client_golang v1.14.0
        github.com/prometheus/client_model v0.3.0
        github.com/prometheus/common v0.42.0
        github.com/prometheus/common/sigv4 v0.1.0
)

require (
        github.com/aws/aws-sdk-go v1.38.35 // indirect
        github.com/beorn7/perks v1.0.1 // indirect
        github.com/cespare/xxhash/v2 v2.1.2 // indirect
        github.com/golang/protobuf v1.5.2 // indirect
        github.com/jmespath/go-jmespath v0.4.0 // indirect
        github.com/jpillora/backoff v1.0.0 // indirect
        github.com/json-iterator/go v1.1.12 // indirect
        github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
        github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
        github.com/modern-go/reflect2 v1.0.2 // indirect
        github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f // indirect
        github.com/prometheus/procfs v0.8.0 // indirect
        golang.org/x/net v0.7.0 // indirect
        golang.org/x/oauth2 v0.5.0 // indirect
        golang.org/x/sys v0.5.0 // indirect
        golang.org/x/text v0.7.0 // indirect
        google.golang.org/appengine v1.6.7 // indirect
        google.golang.org/protobuf v1.28.1 // indirect
        gopkg.in/yaml.v2 v2.4.0 // indirect
)

the output for my go run is:

Hello, World!
Error in getting prometheus client: %v client_error: client error: 403
Warning in getting prometheus client: %v []

@danil-smirnov
Copy link
Contributor

@lewinkedrs In this comment you are insisting on /api/v1/query suffix for the query url.
But in this docs you use it without the suffix in your example.
I'm now totally confused. Which URL should we use?

@lewinkedrs
Copy link
Contributor Author

lewinkedrs commented May 17, 2023 via email

@lewinkedrs
Copy link
Contributor Author

Ok I've partially figured out what is going on here. I was able to reproduce the error @danil-smirnov is seeing with a 403 when just trying to use the service account credentials. I believe it's because there is only a role provided but no region is provided. According to the docs it looks like a region is required to be passed in the signing key. I don't believe IRSA provides this and I did not catch this because I was testing in a local environment. Using the testing scenario above from @moharam-dev I am able to get the same 403 error when not providing a region.

Hello, World!
Error in getting prometheus client: %v client_error: client error: 403

But if I use the following code instead and explicitly provide a region to the sigv4 config. We have success:

func main() {
	fmt.Println("Hello, World!")
	// masked workspace id
	prometheusApiConfig := api.Config{
		Address: "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/someworkspace",
	}
	cfg := sigv4.SigV4Config{
		Region: "us-east-2",
	}
	var next http.RoundTripper
	sigv4RoundTripper, err := sigv4.NewSigV4RoundTripper(&cfg, next)
	if err != nil {
		fmt.Println("Error creating SigV4 RoundTripper: %v", err)
		return
	}
	prometheusApiConfig.RoundTripper = sigv4RoundTripper
	//fmt.Println(&prometheusApiConfig.RoundTripper)
	client, err := api.NewClient(prometheusApiConfig)
	if err != nil {
		fmt.Println("Error in getting prometheus client: %v", err)
		return
	}
	api := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	response, warnings, err := api.Query(ctx, "string", time.Now())
	if err != nil {
		fmt.Println("Error in getting prometheus client: %v", err)
	}
	if err != nil {
		fmt.Println("Warning in getting prometheus client: %v", warnings)
	}
	if response != nil {
		fmt.Println("Prometheus client response is successful!")
	}
}
➜  sig4test go run .
Hello, World!
Prometheus client response is successful!

This must be why other tools that provide sigV4 functionalities tend to just encourage a region be provided and not rely on one to be provided from the default provider chain. One example of this is the managed prometheus ingest documentation , which has the example written like this:

serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations: 
      eks.amazonaws.com/role-arn: ${IAM_PROXY_PROMETHEUS_ROLE_ARN}
server:
  remoteWrite:
    - url: https://aps-workspaces.${REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write
      sigv4:
        region: ${REGION}
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500

We probably need some way to just grab this from the yaml config and we can update the docs to inform others to include the region.

@danil-smirnov
Copy link
Contributor

danil-smirnov commented May 18, 2023

Thank you for your research @lewinkedrs , is there any workaround for the time being? We are using the Argo Rollouts helm chart and we're already setting the region like this:

  controller:
    extraEnv:
    - name: AWS_REGION
      value: eu-central-1

@lewinkedrs
Copy link
Contributor Author

Thank you for your research @lewinkedrs , is there any workaround for the time being? We are using the Argo Rollouts helm chart and we're already setting the region like this:

  controller:
    extraEnv:
    - name: AWS_REGION
      value: eu-central-1

The best work around right now is probably to just a SigV4 proxy and then here it is as an admission controller If you prefer that method of deployment. I will work on putting together the fix to explicitly pass the region from yaml config, probably just in the analysis template we put it there.

@ju187
Copy link

ju187 commented Sep 25, 2023

The sigv4 parameters need to be set in the yaml file which defines the analysistemplate. I think it should also be able to use the environment variables just like the Prometheus address field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
6 participants