Upgrade all VMs to AMD64 Ubuntu 22.04 #86194

Merged
merged 19 commits into from
Aug 24, 2023

Conversation

richlander
Member

@richlander richlander commented May 13, 2023

Ubuntu 18.04 has transitioned from standard to ESM support -> https://wiki.ubuntu.com/Releases. Our VM hosts should always run on standard support Ubuntu versions.

All VMs need to move to 22.04 unless there is a specific reason. This PR is a bulk upgrade of the VMs. Nothing has been tested nor do I know how to (other than making the PR and running CI).

This PR is specific to AMD64 to make the change smaller. Other changes need to follow. If this change is still too big, it can be further reduced.

@ghost

ghost commented May 13, 2023

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

All VMs need to move to 22.04 unless there is a specific reason. This PR is a bulk upgrade of the VMs. Nothing has been tested nor do I know how to (other than making the PR and running CI).

This PR is specific to AMD64 to make the change smaller. Other changes need to follow. If this change is still too big, it can be further reduced.

Author: richlander
Assignees: richlander
Labels:

area-Infrastructure, needs-area-label

Milestone: -

@teo-tsirpanis teo-tsirpanis removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 13, 2023
@agocke
Member

agocke commented May 16, 2023

If we're moving forward machine versions by policy, do we want to extract the name of the "current standard linux testing VM" into a general variable like "linux-latest"?

@richlander
Member Author

If we're moving forward machine versions by policy, do we want to extract the name of the "current standard linux testing VM" into a general variable like "linux-latest"?

I like the idea in theory, however a scheme like that forces us to make everything compatible in one PR. The current scheme seems better.

@carlossanlop
Member

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

Nice @carlossanlop -- It is ironic (and a little embarrassing) that I was changing these pipelines with all these conditions for the various run types and didn't realize that this PR requires running the full matrix of runs. Oops!

@carlossanlop
Member

No worries @richlander, the runtime test matrix is super complex. Thanks for sending the change.

Co-authored-by: Jeremy Koritzinsky <jkoritzinsky@gmail.com>
@akoeplinger
Member

akoeplinger commented May 17, 2023

Regarding the Android Helix queues: those only run on 18.04 because that was the common version when the queues were created. We can ask core-eng to create 22.04-based ones.

@carlossanlop
Member

We also want to backport this to 7.0 and 6.0. Right?

@richlander
Member Author

richlander commented May 17, 2023

@carlossanlop -- yes, we want to backport. Might we wait until we have the .NET 8 project done (so that we understand the full scope, including where we needed to make extra changes)? If the best pattern is to just port as we go, that's fine, too.

Yes @akoeplinger -- please ask for that. FYI @ilyas1974

@agocke
Member

agocke commented May 18, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@radical
Member

radical commented May 18, 2023

/azp run runtime-wasm

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wfurt
Member

wfurt commented Aug 10, 2023

This may be related to #89788.
But that should already be in main. @MihaZupan looked at this area recently.

@richlander
Member Author

Just re-ran the failed checks. Looks like a repeat. Should we disable the test again?

@wfurt
Member

wfurt commented Aug 14, 2023

The HTTP test failures seem environmental. I set up a helix-repro machine and could not reproduce the failure.

However, looking at one of the test failures:

      Output:
        http.client.active_requests=1 [url.scheme=https, server.address=127.0.0.1, server.port=34583, http.request.method=GET]
        http.client.open_connections=1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]
        http.client.open_connections=-1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]
        http.client.open_connections=1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=active]
        http.client.request.time_in_queue=0.0417808 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, http.request.method=GET]
        http.client.open_connections=-1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=active]
        http.client.open_connections=1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]
        http.client.active_requests=-1 [url.scheme=https, server.address=127.0.0.1, server.port=34583, http.request.method=GET]
        http.client.request.duration=0.0422139 [url.scheme=https, server.address=127.0.0.1, server.port=34583, http.request.method=GET, http.response.status_code=200, network.protocol.version=2]
        http.client.open_connections=-1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]
        http.client.open_connections=1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=active]
        http.client.open_connections=-1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=active]
        http.client.open_connections=1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]
        http.client.connection.duration=0.064 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1]
        http.client.open_connections=-1 [network.protocol.version=2, url.scheme=https, server.address=127.0.0.1, server.port=34583, server.socket.address=::ffff:127.0.0.1, http.connection.state=idle]

There are more events than expected, and it really looks like there is an extra request, e.g. multiple tests interfering.
The test infrastructure assumes that binding on anonymous ports gives a unique port.
But that may not be the case on Linux when mixing with dual-mode sockets - this smells like #79820, fixed by #80715 when we wrote our own port reservation system.
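
As a rough illustration of the "more events than expected" observation, the sketch below tallies the up/down counter lines from the output above. The `EVENTS` list is transcribed by hand from that output (histogram measurements such as `http.client.request.duration` are left out), and `summarize` is a hypothetical helper, not part of the .NET test infrastructure. Reading a second idle-to-active transition as a sign of an extra request is the interpretation suggested in this comment, not something the tally proves.

```python
from collections import Counter

# (instrument, delta, http.connection.state), in the order the counter
# measurements appear in the test output quoted above.
EVENTS = [
    ("http.client.active_requests", 1, None),
    ("http.client.open_connections", 1, "idle"),
    ("http.client.open_connections", -1, "idle"),
    ("http.client.open_connections", 1, "active"),
    ("http.client.open_connections", -1, "active"),
    ("http.client.open_connections", 1, "idle"),
    ("http.client.active_requests", -1, None),
    ("http.client.open_connections", -1, "idle"),
    ("http.client.open_connections", 1, "active"),
    ("http.client.open_connections", -1, "active"),
    ("http.client.open_connections", 1, "idle"),
    ("http.client.open_connections", -1, "idle"),
]


def summarize(events):
    """Return net value per (instrument, state), plus how often the
    connection became active and how many requests were started."""
    net = Counter()
    activations = 0
    requests = 0
    for name, delta, state in events:
        net[(name, state)] += delta
        if name == "http.client.open_connections" and state == "active" and delta > 0:
            activations += 1
        if name == "http.client.active_requests" and delta > 0:
            requests += 1
    return net, activations, requests


net, activations, requests = summarize(EVENTS)
print("connection became active:", activations, "time(s)")
print("requests started:", requests)
```

Every counter nets out to zero, so nothing leaks; the mismatch is only that the connection becomes active twice while a single request is recorded.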

It is possible that something changed in the new kernel and port conflicts are more likely.
If that is the case, disabling a single test will not help IMHO - we would need to disable them all.
We would need to change the HTTP test infrastructure - cc @tmds for more thoughts.
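
For readers unfamiliar with the anonymous-port pattern mentioned above, here is a minimal sketch of the assumption: bind to port 0, let the kernel choose, and read the port back. It uses plain IPv4 loopback sockets in Python rather than the .NET test infrastructure, and `bind_two_anonymous_ports` is a hypothetical helper name. It does not reproduce the dual-mode socket interaction suspected in #79820; it only shows the simple case in which the assumption holds.

```python
import socket


def bind_two_anonymous_ports(host="127.0.0.1"):
    """Bind two listening sockets to port 0 and return the assigned ports.

    Both sockets stay open until the ports have been read, so the kernel
    cannot hand the same IPv4 port to both of them.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))   # port 0 asks the kernel for a free port
        first.listen(1)
        second.bind((host, 0))
        second.listen(1)
        return first.getsockname()[1], second.getsockname()[1]
    finally:
        first.close()
        second.close()


port_a, port_b = bind_two_anonymous_ports()
print("kernel-assigned ports:", port_a, port_b)
```

Once a socket is closed, its port may be handed out again, and sockets of a different address family are tracked separately, which is why a test harness that relies on this pattern alone can be fragile.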

While 18.04 is out of support we can still move to 20.04, right @richlander? It should be supported for another two years.

If this turns out to be an interaction between the kernel and the test infrastructure, there may not be any quick and easy fix.
In order to get confidence, we would need to be able to reproduce it locally and investigate with tracing and packet captures.

cc: @karelz

@richlander
Member Author

richlander commented Aug 14, 2023

While 18.04 is out of support we can still move to 20.04,

This doesn't make sense as a strategy. We support 22.04. If our test machines cannot successfully run on it, that's a problem. For the life of .NET 8, 22.04 will be much more important than 20.04. We should focus on it.

@wfurt
Member

wfurt commented Aug 14, 2023

While 18.04 is out of support we can still move to 20.04,

This doesn't make sense as a strategy. We support 22.04. If our test machines cannot successfully run on it, that's a problem. For the life of .NET 8, 22.04 will be much more important than 20.04. We should focus on it.

Yes, it is. But as I mentioned, I don't see an easy fix at the moment, i.e. it will take some time to get stable runs IMHO.
The failed tests I looked at are different from those originally reported, so I still think the whole runs are unstable.

@danmoseley
Member

The test infrastructure assumes that binding on anonymous ports gives unique port.

Is this mainly a test issue (apparently customers haven't reported this since 22.04 was released)? In which case, that's small beans compared to having an impactful bug on 22.04.

@richlander
Member Author

So, what should we do? I see no reason to give up on 22.04.

@wfurt
Member

wfurt commented Aug 15, 2023

So, what should we do? I see no reason to give up on 22.04.

We should investigate IMHO. I saw some metric failures on other platforms -> may be just a flaky test. I also saw random HTTP/2 failures in my local run. Everything seems happier after reverting #89788 -> may (or may not) be a very recent failure. Both @MihaZupan and @antonfirsov are out this week. I can keep looking into it, since the RC1 checkpoint is reached as of today. I'll discuss next steps with @karelz tomorrow and we should be able to come up with some action plan.

@antonfirsov
Member

We should also update http and ssl stress (here or in a separate PR):

demands: ImageOverride -equals 1es-ubuntu-1804-open

@wfurt which test does #86194 (comment) refer to? Now I only see Http2_MultipleConnectionsEnabled_OpenAndCloseMultipleConnections_Success failing

@radical
Member

radical commented Aug 21, 2023

I'm trying to upgrade the VM images used by wasm to 22.04 also. For that I opened dotnet/dotnet-buildtools-prereqs-docker#911. But I'm wondering whether there is any way to use the image produced by CI for that PR in a dotnet/runtime PR, so I can check and debug it with the various wasm builds.

The alternative would be to merge a new PR there for any change/debugging that needs to be done with the image.

@sbomer
Member

sbomer commented Aug 21, 2023

One way to do it is to build the container locally, and push to your own container registry. Then you can make a runtime PR which uses that image. I used our Azure credit to create a container registry last time I did this.

Here's an example of a commit where I pointed some of the legs to my own container registry: ce31e13

@wfurt
Member

wfurt commented Aug 23, 2023

We talked about this with the team, @richlander, and you should probably just go ahead and merge this. Some of the tests that failed here in the past failed elsewhere, with a low failure rate, so it does not seem unique to Ubuntu 22. We have ongoing monitoring and we are ready to react if we see an increased number of failures. In the meantime, the investigation into how to improve stability is going on, but that IMHO should not block this. And the linux_musl failure is clearly unrelated.

@richlander
Member Author

Thanks. The musl failure is also an HTTP test, yes? Do you have thoughts on it, or do I misunderstand?

@wfurt
Member

wfurt commented Aug 24, 2023

Thanks. The musl failure is also an HTTP test, yes? Do you have thoughts on it, or do I misunderstand?

yes, Http2 / Multiple connections.
AFAIK it is being looked at... but not by me so I don't have any details.

@richlander
Member Author

Is there an open issue to include here?

Member

@ivdiazsa ivdiazsa left a comment


Taking into account what @wfurt explained here #86194 (comment), we are good to go to merge this PR. Approving right now to unblock it.

Member

@wfurt wfurt left a comment


LGTM. We can deal with tests later if we see a spike of failures.

@richlander
Member Author

Thanks! Glad to have this one merged.

@richlander richlander merged commit 658bdd0 into dotnet:main Aug 24, 2023
169 of 173 checks passed
@richlander richlander deleted the vm-host-upgrade branch August 24, 2023 17:21
@richlander
Member Author

In case it becomes important, this is the relevant issue that was still active.

#91075

@ghost ghost locked as resolved and limited conversation to collaborators Sep 23, 2023