tf_job_simple_test results not being report #1426

Closed
jlewi opened this issue Aug 25, 2018 · 3 comments

jlewi commented Aug 25, 2018

Here's a postsubmit run
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubeflow_kubeflow/kubeflow-postsubmit/619

There are 10 passed tests:
test_jsonnet
deploy-kubeflow-deploy_argo-test-argo-deploy
deploy-kubeflow-deploy_minikube
deploy-kubeflow-deploy_model-mnist-cpu
deploy-kubeflow-deploy_pytorchjob-pytorch-job
deploy-kubeflow-teardown
deploy-kubeflow-teardown_minikube
smoke-tfjob-gke
test_jsonnet_formatting
tf-serving-image-mnist-cpu

1 failed test:
deploy-kubeflow-deploy_model-mnist-gpu

There is no report for the simple TFJob prototype test.
Looking at Argo, it looks like the test ran:
http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-postsubmit-kfctl-a41ba72-619-0fc2?tab=workflow&nodeId=kubeflow-postsubmit-kfctl-a41ba72-619-0fc2-203334643

There's no indication that the test completed successfully. According to
https://github.com/kubeflow/kubeflow/blob/master/testing/tf_job_simple_test.py#L111
the test should print "TFJob launched successfully." on success.

There's also no indication in the logs that we saved the results/failure to GCS as an example file

jlewi added the priority/p1, area/tfjob, and area/testing labels and removed the priority/p1 label Aug 25, 2018

jlewi commented Aug 25, 2018

INFO|2018-08-25T01:35:06|util.py:41| Running: ks apply default -c tf-job-simple
cwd=None
INFO|2018-08-25T01:35:06|util.py:56| Subprocess output:
INFO|2018-08-25T01:35:12|util.py:62| level=info msg="Updating tfjobs kubeflow.tf-job-simple"

So the test didn't create a new job; it updated an existing job, which is a bit strange.
Also, it looks like the test expects the job to be named mycnnjob.
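One way to make this situation easier to spot is to scan the `ks apply` output for "Updating" lines. The sketch below is only an illustration: the helper name `updated_resources` is hypothetical, and the regular expression assumes the log format shown in the output above.

```python
import re

# Matches log lines of the form seen above, e.g.
#   level=info msg="Updating tfjobs kubeflow.tf-job-simple"
# The format is assumed from that single log line.
_UPDATING_RE = re.compile(r'msg="Updating (?P<kind>\S+) (?P<name>[^"]+)"')


def updated_resources(output):
    """Return (kind, name) pairs for every resource the apply step updated."""
    return [(m.group("kind"), m.group("name"))
            for m in _UPDATING_RE.finditer(output)]


if __name__ == "__main__":
    log = 'level=info msg="Updating tfjobs kubeflow.tf-job-simple"'
    for kind, name in updated_resources(log):
        print("apply updated an existing %s: %s" % (kind, name))
```

A test could treat a non-empty result as a warning that a job from an earlier run was still present.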


jlewi commented Aug 25, 2018

The list of uploaded XML files is here:
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubeflow_kubeflow/kubeflow-postsubmit/619/artifacts/

It looks like generate_xml might not be logging a message when it writes the output:
https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/test_helper.py#L47
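A minimal sketch of the kind of logging that would help here, assuming a writer that emits one junit XML file per test suite. This is not the actual `generate_xml` code; `write_junit_xml` and its parameters are hypothetical names used only for illustration.

```python
import logging
import os
import xml.etree.ElementTree as ET


def write_junit_xml(suite_name, case_name, failure, artifacts_dir):
    """Write a one-test junit file and log where it was written.

    Hypothetical helper; the real generate_xml may be structured differently.
    """
    suite = ET.Element("testsuite", name=suite_name, tests="1",
                       failures="1" if failure else "0")
    case = ET.SubElement(suite, "testcase", name=case_name)
    if failure:
        ET.SubElement(case, "failure").text = failure

    path = os.path.join(artifacts_dir, "junit_" + suite_name + ".xml")
    # Logging the path makes it obvious from the test logs whether the
    # file was written at all, and under what name.
    logging.info("Writing file: %s", path)
    ET.ElementTree(suite).write(path)
    return path
```

With a log line like this, a missing or misnamed results file shows up directly in the test output instead of having to be inferred from the artifacts listing.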

jlewi added a commit to jlewi/testing that referenced this issue Aug 25, 2018
* This is intended to help debug kubeflow/kubeflow#1426 by showing whether
  generate_xml is called.
jlewi added a commit to jlewi/kubeflow that referenced this issue Aug 25, 2018
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Aug 25, 2018

jlewi commented Aug 25, 2018

After adding logging in #197 and #1427, we see the following logging output:

ERROR|2018-08-25T19:55:49|tf_job_simple_test.py:115| Test failed waiting for job; Could not find pods with label tf_job_name=mycnnjob
INFO|2018-08-25T19:55:49|test_helper.py:60| Writing file: /output/artifacts/junit_.xml

So it looks like the XML file is named incorrectly (junit_.xml, with no test name after the underscore), and that probably prevents Gubernator from detecting the test.
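The file name in the log is what you would get if an empty name were substituted into a `junit_<name>.xml` pattern. The pattern itself is an assumption inferred from the log line above, and `junit_file_name` is a hypothetical helper, but the sketch shows how a guard could turn the silent misnaming into an explicit error.

```python
def junit_file_name(suite_name):
    """Build the junit file name, refusing an empty suite name.

    Assumes the naming pattern junit_<name>.xml, inferred from the log output.
    """
    if not suite_name:
        raise ValueError("test suite name must be set; "
                         "otherwise the file would be named junit_.xml")
    return "junit_{0}.xml".format(suite_name)


if __name__ == "__main__":
    # Without the guard, an empty name silently produces the bad file name.
    print("junit_{0}.xml".format(""))
    print(junit_file_name("tf-job-simple"))
```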

jlewi added a commit to jlewi/testing that referenced this issue Aug 25, 2018
* name is used in the XML file containing the test results. If name
  isn't set the XML file won't be created correctly and therefore
  not surfaced in gubernator correctly; see kubeflow/kubeflow#1426

* Related to kubeflow/kubeflow#1426
jlewi added a commit to jlewi/kubeflow that referenced this issue Aug 25, 2018
* Fix kubeflow#1426

There are two problems with the test

  1. Test isn't properly reporting results to gubernator; so test failures
     aren't being noticed.
  2. Test needs to be updated to work with v1alpha2.

* The TestSuite name needs to be set because this is used as the name
  of the junit XML file.

* simple-prototype-test should set test_dir and artifacts_dir.

* Fix the test; use tf_job_client to wait for the job to be in the Running
  condition. This should be more reliable than checking for actual pods.

* The test has probably been broken for a while but this went unnoticed
  because results weren't being properly surfaced in test grid because
  the XML file is improperly named. I suspect things broke as part of
  the switch to v1alpha2 which changed the names of the pods.
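The idea of waiting on the job's Running condition rather than looking for pods by label can be sketched roughly as follows (Python 3). This is not the `tf_job_client` implementation: `get_job` is a hypothetical callable that returns the TFJob as a dictionary, and the `status.conditions` layout with `type` and `status` fields is an assumption about the v1alpha2 resource.

```python
import time


def has_condition(job, condition_type):
    """Return True if the job reports the given condition with status "True"."""
    conditions = (job.get("status") or {}).get("conditions") or []
    return any(c.get("type") == condition_type and c.get("status") == "True"
               for c in conditions)


def wait_for_condition(get_job, condition_type="Running",
                       timeout_seconds=300, poll_seconds=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_job() until the condition is set or the timeout expires.

    get_job is a hypothetical callable returning the job as a dict; sleep
    and clock are injectable so the loop can be tested without real delays.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = get_job()
        if has_condition(job, condition_type):
            return job
        if clock() >= deadline:
            raise TimeoutError(
                "Timed out waiting for condition %s" % condition_type)
        sleep(poll_seconds)
```

Because this checks the status reported on the job itself, it does not depend on pod names or labels, which is where the original test broke.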
k8s-ci-robot pushed a commit that referenced this issue Aug 27, 2018
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Aug 30, 2018
richardsliu pushed a commit to richardsliu/testing that referenced this issue Sep 21, 2018
richardsliu pushed a commit to richardsliu/testing that referenced this issue Sep 21, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
This will change the katib-controller and katib-ui
roles to clusterroles.

Additionally Dominik Fleischmann is being added to
the owners of the katib operators.
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Migrate Istio and Dex to V3

* Roll back AWS change