
Periodic polling of container images #17

Closed
tumido opened this issue Jan 31, 2022 · 17 comments
Assignees
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/backlog: Higher priority than priority/awaiting-more-evidence.
wg/byon-build-pipelines: Categorizes an issue or PR as relevant to WG BYON Build Pipelines.

Comments

tumido (Member) commented Jan 31, 2022

Investigate how to manage imagestream updates. This issue is looking for answers/analysis on:

Based on my initial screening, there seems to be no way to gate an image stream tag import once it is set to scheduled. It is also unclear when the cache is cleaned or accessed (is this per node?) and how an image is cached. Investigate what happens when pods are started: is the image pulled from the remote repository, or is it mirrored to the internal container registry on first start and pulled from there? What happens when a pod is scheduled to a different node (again, does it pull from the internal or the external registry)? Experiments needed. This affects when we get the updated image and whether we know about it, or whether a user is served a newer version without the admin/validation pipeline noticing first.

Besides the scheduled import described above, there is no common reactive facility to initiate/trigger notifications from remote registries (webhooks etc.); this is specific to the remote repository and must be set up on the remote repository side.

tumido (Member, Author) commented Jan 31, 2022

/assign @tumido

tumido (Member, Author) commented Jan 31, 2022

/label wg/byon-build-pipelines

sesheta (Contributor) commented Jan 31, 2022

@tumido: The label(s) /label wg/byon-build-pipelines cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected

In response to this:

> /label wg/byon-build-pipelines

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pacospace added the wg/byon-build-pipelines label Feb 1, 2022
tumido (Member, Author) commented Feb 3, 2022

/assign @mimotej

mimotej (Member) commented Feb 9, 2022

So far I have conducted some experiments and I have the following information:

  1. importPolicy.scheduled on an ImageStream causes OCP to periodically check (every 15 minutes by default) whether the image has changed (i.e., whether its digest has changed). This is useful for automatic updates of images, but it does not generate any event, so it might be harder to detect.
  2. referencePolicy works the same way in OCP 4.x as in OCP 3.x. I am not sure why it is not mentioned in the docs, but there does not seem to be any change. One bit that is mentioned in the docs: when using referencePolicy.type Local we have to set importPolicy.insecure, otherwise the image will not be pulled; with referencePolicy.type Source this is not the case and you don't have to set importPolicy.insecure. Interestingly, it seems that even with referencePolicy.type Source the ImageStream stores the image in the internal registry, but only instructs clients to pull from the external registry when possible (it also points to the manifest SHA address, not the tag). In my test I created an ImageStream with referencePolicy.type Source, and even after I deleted the image from the external registry it was still able to pull the image that no longer existed there. (note *)
  3. Another important part is the pull policy, defined at pod creation. imagePullPolicy can have three values: IfNotPresent, Always, and Never. Never isn't useful for us, because it only starts the container when the image is pre-cached on the node. IfNotPresent checks whether an image with the same name and tag exists (it does not consider differences in digests); if one is available it creates the container, otherwise it pulls the image. Always also checks whether the image is pre-cached, but compares digests as well; if they differ (for example the tag didn't change, but there were security patches), it pulls the new image.
  4. Triggering pipelines is another problem we need to resolve. I have found examples from Andrew Block on how to trigger pipelines when an image changes, but as mentioned above, an ImageStream does not generate an event upon image change, so this option is not usable in our case.
    Combining policies 1-3 seems to result in predictable behavior that follows what is written in the docs and what I have written above; a minimal sketch of such a configuration follows this list.
    There are still some scenarios that need further testing; once I try the other possible scenarios I will extend this comment with those notes as well.
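
For illustration, a minimal sketch of the configuration points 1-3 describe, assuming a hypothetical image name, namespace, and remote repository (the fields themselves are the standard image.openshift.io/v1 and Pod APIs):

```yaml
# ImageStream that polls the remote repository and serves the image
# through the internal registry (names are hypothetical).
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-notebook
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook:latest
      importPolicy:
        scheduled: true       # re-import periodically (default ~15 min)
      referencePolicy:
        type: Local           # resolve pulls via the internal registry
```

```yaml
# Pod consuming the imported image; Always re-checks the digest on each start.
apiVersion: v1
kind: Pod
metadata:
  name: notebook-test
spec:
  containers:
    - name: notebook
      image: image-registry.openshift-image-registry.svc:5000/example-ns/custom-notebook:latest
      imagePullPolicy: Always
```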

*EDIT: It seems that OCP has a bug which causes the ImageStream to not be resolved properly; more about it here: https://bugzilla.redhat.com/show_bug.cgi?id=2000216. In my testing I used a workaround where I pointed the image to the internal registry, which means that at least my observation about referencePolicy might not be 100% correct. I will try looking into this problem more (maybe use annotation-based triggers, as was suggested in the thread linked above...).

tumido (Member, Author) commented Feb 11, 2022

Great research! Thank you!

> ...but it does not generate any event, so it might be harder to detect.

I think it won't be a problem. It generates an event/action via additional Kubernetes resources, like the creation of an Image resource and the update of the ImageStreamTag resource. That should do the trick.

> ...in my test I have created ImageStream with referencePolicy.type Source and even though I deleted image from external image registry it was still able to pull image which didn't exist.

I think this is the behavior we want to achieve, correct? Cache the image in the local registry and reuse it from there, without ever pulling from a remote for pod spawns. Good.

> ...IfNotPresent checks if an image exists with same name and tag (it doesn't consider differences in digests) and if it is available it creates a container, otherwise it pulls an image.

What does "if the image is available" mean? In the internal registry or in the node cache? When it is not available, where does it pull from (the internal registry or the remote location)? Is this affected by referencePolicy.type?

> ...how to trigger pipelines when image changes, but as I mentioned earlier ImageStream does not generate event upon image change

Yes, something like Andrew's demo will solve it for us. Again, we don't need Events: Knative can trigger on API actions/changes, and that's all we need. I was eyeballing Knative for this before I knew about Andrew's demo. I think we're heading in the right direction here.
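
As a rough sketch of that direction: a Knative ApiServerSource can forward API-server changes to ImageStream objects as CloudEvents to a sink, which a Trigger can then route to start a PipelineRun. The names, the service account, and the Broker sink below are assumptions for illustration, not taken from Andrew's demo:

```yaml
apiVersion: sources.knative.dev/v1
kind: ApiServerSource
metadata:
  name: imagestream-watcher           # hypothetical name
spec:
  serviceAccountName: imagestream-watcher-sa  # hypothetical; needs RBAC to watch imagestreams
  mode: Resource                      # send the full resource in the event payload
  resources:
    - apiVersion: image.openshift.io/v1
      kind: ImageStream
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
```

The referenced service account would need get/list/watch permissions on imagestreams for the source to start.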

tumido (Member, Author) commented Feb 15, 2022

In my test I've used an image stream with referencePolicy.type: Local and importPolicy.scheduled: true:

  1. Spawn a pod in JupyterHub - it pulls the image from the internal registry
  2. Shut down the pod
  3. Spawn a new pod in JupyterHub - it pulls the image from the node cache
  4. Shut down the pod
  5. Update the image tag in the remote container image repository - after a while this triggers a scheduled import
  6. Spawn a new pod in JupyterHub - it pulls the image from the internal registry
  7. Shut down the pod
  8. Delete the tag in the remote container image repository - after a while the ImageStream reports an import error
  9. Spawn a new pod in JupyterHub - it pulls the image from the node cache
  10. Shut down the pod
  11. Mark the node of the previous pods unschedulable
  12. Spawn a pod in JupyterHub - it pulls the image from the internal registry

I think this solution works, wdyt @mimotej ?

mimotej (Member) commented Feb 16, 2022

Yes, I think this configuration will give us the most consistent results out of all the options 👍 .

tumido (Member, Author) commented Feb 16, 2022

Good, then the proposed design will be:

```mermaid
flowchart TD
A[Image updated in remote container repository]
B[ImageStreamTag updated in internal container repository]
C[Knative eventing triggers a PipelineRun]
D[Mark phase 'Updating/Validating' and hide it from UI]
E[Run validation checks]
F{Did checks pass?}
G[Mark phase 'Success' and show in UI]
H[Mark phase 'Failed']
A --> B --> C --> D --> F
C --> E --> F
F -- Yes --> G
F -- No --> H
```

This approach has a few drawbacks:

  1. The image stream disappears without explanation from the user-facing UI (we probably need to show it as "disabled" to reduce user confusion).
  2. There's a small window between the validation pipeline marking the imagestream as unavailable (due to the running validation) and the actual imagestream update, potentially allowing users to click the image in the Spawner UI while the image is already updated but not yet validated.
  3. The automatic update and validation may result in an image failing validation. In that case ImageStreams don't support any rollback options, so we would need to work around that (possibly by adding a backup tag pointing to the previously resolved image, but I'm not sure yet if that's possible).

Local resolution has one significant positive effect:

  • The image is available in the internal container registry even if it disappears from the source registry = stable behavior.

Questions: Is there a way to immediately trigger the image stream update?
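
(One likely answer, hedged: `oc import-image <imagestream>:<tag>` asks the API server to re-run the import immediately rather than waiting for the next scheduled interval; whether it is usable from our pipeline here would need checking.)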

Possible solution:

  1. Each ImageStream contains 2 tags which are initially the same:
  • 1st tag is a "served" one, on display.
  • 2nd tag is hidden from UI and has scheduled: true.
  2. If an update is received, the second tag is updated and validated by a pipeline.
  3. During this whole time the 1st tag is still being served.
  4. If validation on the 2nd tag passes, the 1st tag is updated to point to this image. If it fails, the 1st tag stays the same and is kept/marked as outdated. (A sketch of such a two-tag ImageStream follows the diagram below.)
```mermaid
flowchart TD
A[Image updated in remote container repository]
B[ImageStreamTag updated in internal container repository on the 'backup' tag]
C[Knative eventing triggers a PipelineRun]
D[Mark phase 'Updating']
E[Run validation checks]
F{Did checks pass?}
G[Mark phase 'Success']
H[Mark phase 'Outdated / Update Failed']
I[Replace pointer in visible tag with the updated backup]
A --> B --> C --> D --> F
C --> E --> F
F -- Yes --> G --> I
F -- No --> H
```
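
A minimal sketch of what such a two-tag ImageStream could look like, with hypothetical names and a placeholder digest; the promotion in step 4 would repoint the served tag, e.g. via `oc tag`:

```yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-notebook               # hypothetical name
spec:
  tags:
    # 1st tag: served to users, pinned to the last validated image by digest.
    # Only the validation pipeline moves it, e.g.:
    #   oc tag custom-notebook:incoming custom-notebook:latest
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook@sha256:<validated-digest>
      referencePolicy:
        type: Local
    # 2nd tag: hidden from the UI, tracks the remote repository on a schedule.
    - name: incoming
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook:latest
      importPolicy:
        scheduled: true
      referencePolicy:
        type: Local
```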

tumido (Member, Author) commented Feb 22, 2022

Decision: this is postponed/deprioritized due to the necessity of introducing a Knative dependency. Revisit post-MVP.

tumido changed the title from "Periodic pooling of container images" to "Periodic polling of container images" Mar 31, 2022
sesheta (Contributor) commented Jun 29, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta added the lifecycle/stale label Jun 29, 2022
sesheta (Contributor) commented Jul 29, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta added the lifecycle/rotten label and removed the lifecycle/stale label Jul 29, 2022
codificat (Member) commented

going to
/lifecycle frozen
this to prevent auto-closure for the time being

sesheta added the lifecycle/frozen label and removed the lifecycle/rotten label Aug 23, 2022
goern (Contributor) commented Sep 5, 2022

/kind feature
/priority backlog

sesheta added the kind/feature and priority/backlog labels Sep 5, 2022
schwesig (Member) commented

mentioned to @codificat

codificat (Member) commented

As BYON evolved into Custom Notebook Images, which currently use an operator to manage the imagestreams and pipeline runs, I believe this is not relevant anymore.

/close

Having said that, it's worth exploring what meteor-operator could do to keep custom images current. But that will be handled in its own repo and issue.

sesheta closed this as completed Jan 31, 2023
sesheta (Contributor) commented Jan 31, 2023

@codificat: Closing this issue.

In response to this:

> As BYON evolved into Custom Notebook Images, which currently use an operator to manage the imagestreams and pipeline runs, I believe this is not relevant anymore.
>
> /close
>
> Having said that, it's worth exploring what meteor-operator could do to keep custom images current. But that will be handled in its own repo and issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
