SPIKE: Investigate why production-deployment workflow fails #1778

Closed
ADPennington opened this issue May 11, 2022 · 14 comments
Labels: dev, Refined

Comments

@ADPennington
Collaborator

ADPennington commented May 11, 2022

Description:

On 5/10, we attempted the production-deployment workflow from hhs:master, and it failed at the "login to cloud.gov and set application targets" step (see screenshot below).
[screenshot: proddeploy1]

This ticket tracks troubleshooting.

ADPennington self-assigned this May 11, 2022
@ADPennington
Collaborator Author

ADPennington commented May 11, 2022

  1. Confirmed the correct key value pairs for the following environment variables in CircleCI:

CF_ORG
CF_USERNAME_PROD and CF_PASSWORD_PROD (via cf service-key tanf-keys deployer ref)
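
A check like the one above can also be scripted as a preflight step. This is a minimal sketch, assuming only the three variable names listed in this comment; `missing_vars` is a hypothetical helper, not part of the project's scripts. It can only tell whether values are present, not whether the credentials are the right ones.

```python
import os

# Variable names taken from the comment above; the helper itself is illustrative.
REQUIRED_VARS = ("CF_ORG", "CF_USERNAME_PROD", "CF_PASSWORD_PROD")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or blank in the mapping `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]
```

In a CI job this could be called as `missing_vars(os.environ)` and the deploy aborted if the returned list is non-empty.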

@ADPennington
Collaborator Author

ADPennington commented May 11, 2022

  1. Checked the cloudgov traceback which indicated the following:
10:34:20.680: [APP/PROC/WEB.0] ERROR Invalid HTTP_HOST header: 'tdp-backend-prod.app.cloud.gov'. You may need to add 'tdp-backend-prod.app.cloud.gov' to ALLOWED_HOSTS.
10:34:20.684: [APP/PROC/WEB.0] Invalid HTTP_HOST header: 'tdp-backend-prod.app.cloud.gov'. You may need to add 'tdp-backend-prod.app.cloud.gov' to ALLOWED_HOSTS.
  1. Checked which routes prod apps are bound to:
apennington@HHSL9XYTF13 MINGW64 / 07:01:00 (master)
$ cf apps
Getting apps in org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.gov...

name                requested state   processes   routes
clamav-rest         started           web:1/1     tanf-prod-clamav-rest.apps.internal
tdp-backend-prod    started           web:1/1     tdp-backend-prod.app.cloud.gov
tdp-frontend-prod   started           web:1/1     tdp-frontend-prod.app.cloud.gov
  1. Not sure how this happened given #726, but I unmapped and deleted the temp routes, remapped the apps to the correct routes, and confirmed:
apennington@HHSL9XYTF13 MINGW64 / 08:40:23 (master)
$ cf unmap-route tdp-backend-prod app.cloud.gov --hostname tdp-backend-prod
Removing route tdp-backend-prod.app.cloud.gov from app tdp-backend-prod in org hhs-acf-ofa / space
OK

apennington@HHSL9XYTF13 MINGW64 / 08:45:11 (master)
$ cf unmap-route tdp-frontend-prod app.cloud.gov --hostname tdp-frontend-prod
Removing route tdp-frontend-prod.app.cloud.gov from app tdp-frontend-prod in org hhs-acf-ofa / spac
OK

apennington@HHSL9XYTF13 MINGW64 / 08:43:25 (master)
$ cf map-route tdp-backend-prod api.tanfdata.acf.hhs.gov
Mapping route api.tanfdata.acf.hhs.gov to app tdp-backend-prod in org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.
OK

apennington@HHSL9XYTF13 MINGW64 / 08:45:56 (master)
$ cf map-route tdp-frontend-prod tanfdata.acf.hhs.gov
Mapping route tanfdata.acf.hhs.gov to app tdp-frontend-prod in org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.gov
OK


$ cf apps
Getting apps in org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.gov...

name                requested state   processes   routes
clamav-rest         started           web:1/1     tanf-prod-clamav-rest.apps.internal
tdp-backend-prod    started           web:1/1     api.tanfdata.acf.hhs.gov
tdp-frontend-prod   started           web:1/1     tanfdata.acf.hhs.gov

apennington@HHSL9XYTF13 MINGW64 / 09:55:20 (master)
$ cf routes
Getting routes for org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.gov...

space       host                    domain                     port   path   protocol   apps
tanf-prod                           api.tanfdata.acf.hhs.gov                 http       tdp-backend-prod
tanf-prod                           tanfdata.acf.hhs.gov                     http       tdp-frontend-prod
tanf-prod   tanf-prod-clamav-rest   apps.internal                            http       clamav-rest



re-deployment failed.
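
The manual before/after comparison of `cf apps` output in this comment could be automated roughly as follows. This is a sketch under assumptions: `parse_cf_apps` and `route_mismatches` are hypothetical helpers, and the parser assumes the column layout shown in the listings above (one token each for name, requested state, and processes, followed by the routes).

```python
def parse_cf_apps(output):
    """Return {app_name: [routes]} parsed from the text printed by `cf apps`."""
    apps = {}
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "name":
            # Header row: name / requested state / processes / routes.
            in_table = True
            continue
        if in_table:
            # Assumes one token in each of the first three columns, as in the
            # listings above; any remaining tokens are treated as routes.
            apps[fields[0]] = [route.rstrip(",") for route in fields[3:]]
    return apps


def route_mismatches(output, expected):
    """Return {app: actual_routes} for apps not bound to their expected route."""
    actual = parse_cf_apps(output)
    return {
        app: actual.get(app, [])
        for app, route in expected.items()
        if route not in actual.get(app, [])
    }


# Expected production routes, taken from the remapping commands in this comment.
EXPECTED_ROUTES = {
    "tdp-backend-prod": "api.tanfdata.acf.hhs.gov",
    "tdp-frontend-prod": "tanfdata.acf.hhs.gov",
}

BEFORE = """\
Getting apps in org hhs-acf-ofa / space tanf-prod as alexandra.pennington@acf.hhs.gov...

name                requested state   processes   routes
clamav-rest         started           web:1/1     tanf-prod-clamav-rest.apps.internal
tdp-backend-prod    started           web:1/1     tdp-backend-prod.app.cloud.gov
tdp-frontend-prod   started           web:1/1     tdp-frontend-prod.app.cloud.gov
"""

AFTER = """\
name                requested state   processes   routes
clamav-rest         started           web:1/1     tanf-prod-clamav-rest.apps.internal
tdp-backend-prod    started           web:1/1     api.tanfdata.acf.hhs.gov
tdp-frontend-prod   started           web:1/1     tanfdata.acf.hhs.gov
"""

print(route_mismatches(BEFORE, EXPECTED_ROUTES))
print(route_mismatches(AFTER, EXPECTED_ROUTES))
```

An empty result only says the apps are bound to the expected routes; as the failed re-deployment shows, that alone does not mean the workflow will pass.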

@ADPennington
Collaborator Author

ADPennington commented May 11, 2022

  1. looking in cloud.gov, the prod routes look as follows (note the . that precedes the routes):

[screenshot: proddeploy2]

Why is this?

@ADPennington
Collaborator Author

ADPennington commented May 11, 2022

  1. Confirmed the correct key value pairs for the following environment variables in CircleCI:

CF_ORG
CF_USERNAME_PROD and CF_PASSWORD_PROD (via cf service-key tanf-keys deployer ref)

Update: there are multiple keys associated with the tanf-keys service instance, which can be confirmed via cf service-keys tanf-keys. The credentials being used in raft circleci were associated with the test-deploy-key, and the username for this key has a tanf-prod space role in cloudgov. The credentials in HHS circleci are associated with the deployer key, and the username for this key does not have a tanf-prod space role in cloudgov.

For the purposes of testing, I updated the credentials in HHS circleci with the test-deploy credentials to confirm that the deploy-infrastructure-production job can run. It was successful. The deploy-production job was also successful.
[screenshot: proddeploy5]

@andrew-jameson can we discuss the appropriate path forward? There are at least a couple of items worth revisiting:

  • appropriate credentials to use for cf_username_prod and cf_password_prod
  • updating tanf-prod space users with appropriate tanf-keys service instance credentials
  • cleaning up routes (see below) and updating allowed_hosts (per @raftmsohani)
    [screenshot: proddeploy4]

@andrew-jameson
Collaborator

> @andrew-jameson can we discuss the appropriate path forward? There are at least a couple of items worth revisiting:
>
> * appropriate credentials to use for `cf_username_prod` and `cf_password_prod`
> * updating tanf-prod space users with appropriate tanf-keys service instance credentials
> * cleaning up routes (see below) and updating allowed_hosts (per @raftmsohani)

Yes, happy to discuss when you're available. While the user/pass tasks were noted in #1764, I don't believe I was looped in on all that was done there. I'm also a little curious as to why #1766 does not fully address the issues here, given that it was functioning prior to recent deployments.

@ADPennington
Collaborator Author

ADPennington commented May 17, 2022

5/17 Update

Notes below include prod deployment-related details that helped me confirm that apps deploy to the correct routes and that login buttons redirect to authentication services. (Neither service is currently working: login.gov will be replaced with xms, and we are still troubleshooting AMS in #1627.)


  • HHS CircleCI needs at least the following env vars set:
BASE_URL  # including "/v1" at end
FRONTEND_BASE_URL
CF_ORG
CF_PASSWORD_PROD  # using the password for the tanf-keys "deployer" key
CF_USERNAME_PROD  # using the username for the tanf-keys "deployer" key
DJANGO_SU_NAME
JWT_KEY  # using the production key from login.gov dashboard
LOGGING_LEVEL
CLAMAV_NEEDED

  • I also added to HHS Circle CI AMS_CONFIGURATION_ENDPOINT, AMS_CLIENT_ID, and AMS_CLIENT_SECRET
  • At least a few scripts need updating:
    • removing /v1 from this line in set-backend-env-vars.sh (given note above re: base url). Tested successfully on my fork.
    • also need to add AV_SCAN_URL to set-backend-env-vars.sh to ensure files can be scanned upon upload.
    • deploy-backend.sh and deploy-frontend.sh have a line where the routes are mapped to the apps. This could be updated to include the production context. Tested this successfully on my fork.
  • cloudgov.py should be updated to include tdp-backend-prod route.
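
For context on the cloudgov.py item: Django compares each request's Host header against ALLOWED_HOSTS and rejects anything that does not match, which is what the "Invalid HTTP_HOST header" errors earlier in this thread report. Below is a simplified re-implementation of the documented matching rules, not Django's own code (it skips details such as port stripping). `host_allowed` and the sample host lists are hypothetical; the real contents of cloudgov.py are not shown in this thread.

```python
def host_allowed(host, allowed_hosts):
    """Simplified version of Django's documented ALLOWED_HOSTS matching."""
    host = host.lower()
    for pattern in allowed_hosts:
        pattern = pattern.lower()
        if pattern == "*":
            # Wildcard entry: every host is accepted.
            return True
        if pattern.startswith(".") and (host.endswith(pattern) or host == pattern[1:]):
            # Leading dot: matches the domain itself and all of its subdomains.
            return True
        if pattern == host:
            # Plain entry: exact, case-insensitive match only.
            return True
    return False


# Hypothetical settings: only the custom domain is listed.
current_hosts = ["api.tanfdata.acf.hhs.gov"]
# Hypothetical fix: also list the cloud.gov-provided backend route.
updated_hosts = current_hosts + ["tdp-backend-prod.app.cloud.gov"]

print(host_allowed("tdp-backend-prod.app.cloud.gov", current_hosts))
print(host_allowed("tdp-backend-prod.app.cloud.gov", updated_hosts))
```

Listing the exact hostname is narrower than a leading-dot entry such as `.app.cloud.gov`, which would also accept every other app hosted under that shared domain.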

@andrew-jameson
Collaborator

Deleted the CDN service; telling it to forward cookie and other auth headers made no difference. Once this operation is complete, will run this:

cf create-service external-domain domain tdp-prod-domain -c '{"domains": ["api.tanfdata.acf.hhs.gov", "tanfdata.acf.hhs.gov"]}'
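
The inline JSON passed to `-c` is easy to mis-quote by hand. A small sketch, assuming the same service, plan, instance name, and domains as the command above; `external_domain_command` is a hypothetical helper that only builds the argument list and does not run `cf`.

```python
import json
import shlex


def external_domain_command(instance_name, domains):
    """Build the argument list for the `cf create-service` call shown above."""
    # json.dumps guarantees the -c value is valid JSON.
    params = json.dumps({"domains": list(domains)})
    return ["cf", "create-service", "external-domain", "domain", instance_name, "-c", params]


cmd = external_domain_command(
    "tdp-prod-domain",
    ["api.tanfdata.acf.hhs.gov", "tanfdata.acf.hhs.gov"],
)
# shlex.join (Python 3.8+) renders a copy-pasteable, shell-quoted command line.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run` would avoid shell quoting entirely.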

@ADPennington
Collaborator Author

since we're tracking prod-related issues in this ticket, including a link to the most recent update: #726 (comment)

@andrew-jameson
Collaborator

Sent follow-up to cloud.gov support ticket #4217 detailing CDN/domain service changes. Next step will be to try circumventing DNS alias records using just cloud.gov provided tdp-backend-prod.app.cloud.gov.

@andrew-jameson
Collaborator

andrew-jameson commented Jun 7, 2022

Circumventing DNS aliases allowed successful login.gov authentication. No cookie magic needed. Still working with cloud.gov so we can use the official ACF domains. @ADPennington has reached out to AMS for us to set up a session with them to try AMS using these provided domains.

[screenshot: Image Pasted at 2022-6-7 14-36]

@ADPennington
Collaborator Author

testing login.gov flow 2 ways:

@ADPennington
Collaborator Author

acf ams authentication service now works on tdp-frontend-prod.app.cloud.gov

1782

stevenino added the Refined label Jun 10, 2022
@ADPennington
Collaborator Author

reached out to the hhs help desk re: new DNS for the backend app, given the problem below.
[screenshot: backenddns]

this issue was escalated for additional support on 6/24.

@ADPennington
Collaborator Author

production-deployment workflow has been functioning since this update. There's still some cleanup of environment variables needed to ensure this continues to work properly, but this will be tracked in #897. Closing this ticket @stevenino
