Revamp FlightRecorder API #72

Merged 11 commits into cryostatio:master on Mar 24, 2020

Conversation

@ebaron commented Feb 11, 2020

This PR contains the first steps toward a new version of the FlightRecorder API. It addresses a few key shortcomings of the current version:

  1. No way to delete unwanted recordings
  2. No JFR event listing in the API
  3. Cannot manually stop continuous recordings

In order to delete recordings with the API, it made most sense to me to promote recordings to be individual Kubernetes objects. When the Recording object gets deleted, we could simply instruct Container JFR to also delete the recording. The FlightRecorder resource will remain, and contain information that is common across all recordings. This makes it a good spot to hold a JFR event listing.

The FlightRecorder is still created by the operator and responsible for one service, linked by an ObjectReference as before. The FlightRecorder is then associated with any number of Recording objects created for that service. Consider FlightRecorder -> Recording analogous to Deployment -> Pod. This association is handled with a label selector, allowing for fast lookups. This requires each new Recording to be created with a specific label, but in practice this API will be consumed by tools rather than used directly.

The Recording spec contains the options necessary to create a new flight recording, and the status is updated by the controller with information from Container JFR about that recording. In order to support manually stopping continuous recordings, we use an approach similar to the scale subresource for deployments. The spec contains a state field where the consumer sets a desired state of the recording. This should realistically be either running or stopped. The status contains a corresponding state field reflecting the actual current state of the recording, and the operator will try to reconcile this with what is requested in the spec.

The interesting code can be found in:
pkg/apis/rhjmc/v1alpha2/flightrecorder_types.go
pkg/apis/rhjmc/v1alpha2/recording_types.go
(particularly the spec and status types within)
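
Roughly, the Recording spec and status take this shape (heavily simplified; field and constant names here are illustrative, see the files above for the actual definitions):

```go
package v1alpha2

// RecordingState represents the state of a flight recording.
type RecordingState string

const (
	RecordingStateRunning RecordingState = "RUNNING"
	RecordingStateStopped RecordingState = "STOPPED"
)

// RecordingSpec holds the options used to create the recording and the state
// the consumer wants it to be in (realistically RUNNING or STOPPED).
type RecordingSpec struct {
	// ... options needed to create the flight recording ...

	// Desired state of the recording.
	State RecordingState `json:"state"`
}

// RecordingStatus is updated by the controller with information from
// Container JFR about the recording.
type RecordingStatus struct {
	// Actual state of the recording, which the operator reconciles toward
	// the state requested in the spec.
	State RecordingState `json:"state,omitempty"`
}
```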

There is no controller code in this PR that uses these modified API types; I just wanted to get some feedback on this idea before proceeding further.

@andrewazores

Makes sense to me. The concept does feel nicely integrated into the custom resources API.

@ebaron commented Feb 12, 2020

Glad to hear it! I'll start working on some controller code then.

@ebaron commented Feb 14, 2020

I've added a boolean archive property to Recording.Spec now to allow the user to specify if they want to store the JFR file in persistent storage. My thinking is that we should delete the JFR file from persistent storage when the user deletes the Recording object. Any thoughts on that approach?
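
As a rough sketch (illustrative placement and naming, not the literal code):

```go
package v1alpha2

type RecordingSpec struct {
	// ... existing recording options ...

	// Archive requests that the JFR file be saved to persistent storage
	// once the recording completes; the archived file would then be
	// deleted along with the Recording object.
	Archive bool `json:"archive"`
}
```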

@andrewazores

Recordings will be preserved when the target JVM exits, right? If so then I think that seems like a sensible approach. I think it would generally behave as a user would expect, and also matches what container-jfr is doing behind the scenes.

@ebaron commented Feb 18, 2020

That's right. I think Recordings will exist until the user deletes them, and FlightRecorders will still be garbage collected when their owner (service) is deleted.
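
That garbage collection relies on the owner reference set on the FlightRecorder; with controller-runtime, wiring that up looks roughly like this (a sketch with the concrete types elided, not the operator's actual code):

```go
package flightrecorder

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Making the service the controlling owner of the FlightRecorder lets
// Kubernetes garbage-collect the FlightRecorder when the service is deleted.
func setOwner(svc, recorder metav1.Object, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(svc, recorder, scheme)
}
```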

One more thing about the archive flag I forgot to mention. Right now, in order to allow download links to work without Container JFR already being connected to the proper JVM, we automatically save recordings when they complete. So the archive property will basically be ignored for now.

@andrewazores

Cool, that all sounds good.

@ebaron commented Mar 20, 2020

I think this PR is complete enough now. Some noteworthy changes:

FlightRecorder -> Recording is now linked by a LocalObjectReference. This prevents a Recording from being created without a FlightRecorder link. The label and selector are now only used by clients for fast lookups.
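
In sketch form (field name illustrative):

```go
package v1alpha2

import corev1 "k8s.io/api/core/v1"

type RecordingSpec struct {
	// FlightRecorder object this recording belongs to. Making this a
	// required LocalObjectReference means a Recording cannot be created
	// without naming its FlightRecorder.
	FlightRecorder corev1.LocalObjectReference `json:"flightRecorder"`

	// ... recording options, desired state, archive flag ...
}
```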

Both controllers have their own WebSocket connections to Container JFR. They synchronize with a Mutex to perform their operations without interfering with each other. We don't want one controller to connect to a different JVM while the other is still working.
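
A minimal sketch of the idea, assuming a shared package-level mutex and illustrative type names:

```go
package recording

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// jfrClientLock is shared by the FlightRecorder and Recording controllers so
// that only one of them drives Container JFR (and therefore a target JVM
// connection) at a time.
var jfrClientLock sync.Mutex

// ReconcileRecording is an illustrative reconciler type.
type ReconcileRecording struct{}

func (r *ReconcileRecording) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	jfrClientLock.Lock()
	defer jfrClientLock.Unlock()
	// Connect to the target JVM through Container JFR and perform this
	// reconcile pass's recording operations while holding the lock.
	return reconcile.Result{}, nil
}
```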

Bumped version since this is a breaking change, and generated new bundle. I'll have to build and push all necessary images to Quay once merged.

Delete recording functionality is still not there. I plan to implement it using a Finalizer [1]. This PR has already gotten so large, I'll save deletion for a separate one.

[1] https://book.kubebuilder.io/reference/using-finalizers.html
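
Roughly, following the pattern from [1], the finalizer handling would look something like this (helper names and the Container JFR cleanup hook are stand-ins, not code from this PR):

```go
package recording

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recordingFinalizer is an illustrative finalizer name.
const recordingFinalizer = "recording.finalizer.rhjmc.redhat.com"

// reconcileFinalizer sketches the kubebuilder finalizer pattern: while the
// Recording is live, make sure our finalizer is registered; once it is marked
// for deletion, clean up the recording in Container JFR and then remove the
// finalizer so Kubernetes can finish deleting the object.
// deleteInContainerJFR stands in for the actual cleanup call.
func reconcileFinalizer(ctx context.Context, c client.Client, obj runtime.Object,
	meta metav1.Object, deleteInContainerJFR func() error) error {
	finalizers := meta.GetFinalizers()
	if meta.GetDeletionTimestamp() == nil {
		// Not being deleted: register the finalizer if it is missing.
		if !containsString(finalizers, recordingFinalizer) {
			meta.SetFinalizers(append(finalizers, recordingFinalizer))
			return c.Update(ctx, obj)
		}
		return nil
	}
	// Being deleted: clean up, then remove the finalizer.
	if containsString(finalizers, recordingFinalizer) {
		if err := deleteInContainerJFR(); err != nil {
			return err
		}
		meta.SetFinalizers(removeString(finalizers, recordingFinalizer))
		return c.Update(ctx, obj)
	}
	return nil
}

func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

func removeString(slice []string, s string) []string {
	result := make([]string, 0, len(slice))
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}
```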

@ebaron changed the title from "[WIP] Revamp FlightRecorder API" to "Revamp FlightRecorder API" on Mar 20, 2020
@andrewazores

Nice work! I just gave it a read-through and everything looks great. I'm trying to manually test it to verify how it all works and see how it "feels" to use the API, but I'm having issues with the built image. I did export IMAGE_STREAM=quay.io/andrewazores/container-jfr-operator; export IMAGE_TAG=0.4.0; make clean image && podman push $IMAGE_STREAM:$IMAGE_TAG && make deploy sample_app2. After waiting a bit, I see that the operator pod is in ImagePullBackOff state, with the following events:

Events:
  Type     Reason     Age                From                         Message
  ----     ------     ----               ----                         -------
  Normal   Scheduled  <unknown>          default-scheduler            Successfully assigned default/container-jfr-operator-765684dc54-4xjjs to crc-pxt56-master-0
  Normal   Pulling    14s (x2 over 29s)  kubelet, crc-pxt56-master-0  Pulling image "quay.io/andrewazores/container-jfr-operator:0.4.0"
  Warning  Failed     13s (x2 over 28s)  kubelet, crc-pxt56-master-0  Failed to pull image "quay.io/andrewazores/container-jfr-operator:0.4.0": rpc error: code = Unknown desc = Image operating system mismatch: image uses "", expecting "linux"
  Warning  Failed     13s (x2 over 28s)  kubelet, crc-pxt56-master-0  Error: ErrImagePull
  Normal   BackOff    0s (x2 over 27s)   kubelet, crc-pxt56-master-0  Back-off pulling image "quay.io/andrewazores/container-jfr-operator:0.4.0"
  Warning  Failed     0s (x2 over 27s)   kubelet, crc-pxt56-master-0  Error: ImagePullBackOff

I'm guessing something has (temporarily) gone wrong with the ubi base image used by operator-sdk. I'll try it again later on and take a look around to see if I can find anyone else reporting the same issue currently.

@andrewazores commented Mar 24, 2020

Okay, so I still see that issue above when I try building current master with no local changes. podman inspect reveals that the image ends up built with an empty OS string in the metadata.

Pulling and inspecting registry.access.redhat.com/ubi7/ubi-minimal:7.6 shows that the base image at tag 7.6 does have Os: "linux" in its metadata, but latest (which is currently also 7.7) does not. 7.7-98, the first 7.7 release, still does. I'm working on bisecting this to figure out which image was the last to have it. I'm not sure why this changed, but it really seems that at some level, the Kubernetes/OpenShift stack within CRC 1.7.0 doesn't want to run an image without this metadata.

I'm not sure what to do with this information once I do narrow it down, however. There must be some reason that the images are now being published with that string empty.

@andrewazores

... I have no idea what just happened. I pulled another couple of image tags with podman, and now when I inspect any of the images I have locally, they all have the correct Os metadata.

Some podman bug, then? That was very confusing.

@andrewazores

Okay, but when I do a make image and inspect the resulting operator image, it still has Os: "".

@ebaron can you confirm if you're seeing any of this? I'm not sure now if what I'm seeing is maybe just something messed up with my podman installation, or something along those lines.

@andrewazores

Building this branch with no local changes but using buildah (i.e. BUILDER=buildah make image) results in an image with the correct metadata. I'll go ahead with testing it that way. I really don't understand what's happening under the hood of my podman installation if buildah, which is also using libpod and AFAIK the same storage as podman, is working as expected. Seems likely this is just a local installation problem on my end.

@andrewazores commented Mar 24, 2020

Ah, on a directly relevant note, should API.md be updated yet to reflect these changes?

@ebaron commented Mar 24, 2020

> Building this branch with no local changes but using buildah (i.e. BUILDER=buildah make image) results in an image with the correct metadata. I'll go ahead with testing it that way. I really don't understand what's happening under the hood of my podman installation if buildah, which is also using libpod and AFAIK the same storage as podman, is working as expected. Seems likely this is just a local installation problem on my end.

I'm seeing the same as you. Building with podman gives empty Os and Architecture metadata, but they're present when building with buildah.

@ebaron commented Mar 24, 2020

> Ah, on a directly relevant note, should API.md be updated yet to reflect these changes?

Yes, it should be. Should I do that as part of this PR or a follow-up one?

@andrewazores

> > Ah, on a directly relevant note, should API.md be updated yet to reflect these changes?
>
> Yes, it should be. Should I do that as part of this PR or a follow-up one?

I'm fine with that coming as a follow-up PR, or maybe just as part of the recording deletion PR you're planning.

@andrewazores

Found the relevant podman bug: containers/podman#5503

@andrewazores left a comment

Just got through some manual testing and everything looks like it's working perfectly. Awesome work.

@ebaron mentioned this pull request on Mar 24, 2020
@ebaron commented Mar 24, 2020

Thanks for the review! I created #78, which I'll address along with the delete recording functionality.

@ebaron merged commit 3d67cf5 into cryostatio:master on Mar 24, 2020