Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test test-dns #2468

Closed
joaocgreis opened this issue Aug 20, 2015 · 31 comments
Closed

Investigate flaky test test-dns #2468

joaocgreis opened this issue Aug 20, 2015 · 31 comments
Labels
build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests.

Comments

@joaocgreis
Copy link
Member

Examples of failures:

@joaocgreis joaocgreis added build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests. labels Aug 20, 2015
@joaocgreis
Copy link
Member Author

It seems that this test always fails on the platforms where it fails, and always passes where it passes. Perhaps this can be related to the DNS configuration on the machines.

@Trott
Copy link
Member

Trott commented Aug 25, 2015

It should perhaps be noted that the flaky test is test/internet/test-dns.js and not test/parallel/test-dns.js.

@Trott
Copy link
Member

Trott commented Aug 26, 2015

On the CentOS and Ubuntu failures, it seems likely that the issue (or maybe just an issue?) is that they are not configured for IPv6?

@rvagg
Copy link
Member

rvagg commented Aug 27, 2015

could be, on DO you have to be in the right datacenter to get IPv6 and these might not be

@Trott
Copy link
Member

Trott commented Sep 4, 2015

For the two CentOS 5 setups that fail with this test, I strongly suspect they would be fixed by adding this to /etc/hosts:

::1               localhost.localdomain localhost

I don't think that failure is a bug in Node. I think Node is correctly getting the name resolution response from the operating system.

I guess this is really an issue to open over in the build repo. I'll go do that now...

@rvagg
Copy link
Member

rvagg commented Sep 4, 2015

@Trott added, care to submit a few runs https://ci.nodejs.org/job/node-test-commit-linux/ to try it out?

@Trott
Copy link
Member

Trott commented Sep 4, 2015

Looks like that did the trick. The two CentOS 5 setups are now passing the test.

All looks good on CentOS 5.

@Trott
Copy link
Member

Trott commented Sep 4, 2015

By the way, any guidance would be welcome on how I might be able to troubleshoot stuff like this on the CI server without playing quite so many shenanigans. (For example, I submitted a test job with code in it that dumped /etc/hosts so I could see what was in it.)

With 4.0 about to drop, I imagine I should wait until after the dust settles from that and ask again. But just to get the question/request out there while I'm thinking of it...

@Trott
Copy link
Member

Trott commented Sep 6, 2015

Looks like the failure on fedora22 for this test is also related to /etc/hosts. The problem is this line:

127.0.0.1 {{fqdn}} {{hostname}}

That causes the lookup to return {{fqdn}} which common.isValidHostname() rejects.

@Trott
Copy link
Member

Trott commented Sep 7, 2015

The FreeBSD boxes are giving an error on this test file because the test for IPv6 lookups with hints uses the V4MAPPED flag. The FreeBSD man page for getaddrinfo(3) says in the BUGS section:

The getaddrinfo function as implemented in FreeBSD currently does not support AI_ALL and AI_V4MAPPED flags and returns EAI_BADFLAGS if one of them is specified.

Sure enough, the test fails with EAI_BADFLAGS.

I'll open a PR to skip the one relevant test on FreeBSD and include a comment explaining that if/when the bug is fixed, then the code skipping the test can be removed.

@jbergstroem
Copy link
Member

@Trott FWIW, we're not using 7.2-RELEASE (currently on a rc of 10, moving to 10.2 in a few days) -- support for V4MAPPED was removed in more recent versions.

We filtered out this flagged if passed since a while back: https://github.com/nodejs/node/blob/master/lib/net.js#L954

(relevant commit) 9bc2e26

@Trott
Copy link
Member

Trott commented Sep 7, 2015

@jbergstroem OK, so if I'm understanding correctly, we shouldn't add code to the test to skip FreeBSD. Node tries to filter out the V4MAPPED flag where it needs to. So if the test is still failing with EAI_BADFLAGS on FreeBSD (and it is) then that is likely indicative of an actual bug in Node. Does that sound about right?

@Trott
Copy link
Member

Trott commented Sep 7, 2015

It seems that the code to screen out the flag on FreeBSD is in a private function that is only called by Socket.connect(). It seems that dns.lookup() bypasses this.

@jbergstroem
Copy link
Member

Just had a look as well -- seems about right. Do we add a similar check to lib/dns.js?

@Trott
Copy link
Member

Trott commented Sep 7, 2015

I don't know. Is it better to honor the flags the user sends and if the OS blows up, then throw an error so the user knows what went wrong? Or do we try to smooth the path, but then risk inserting magic that frustrates the user? "I'm specifying the flag! Why is it ignoring it?!?!"

@jbergstroem
Copy link
Member

In this case its probably better to just let it bubble up. DNS error messages are already a bit of a mess.

@Trott
Copy link
Member

Trott commented Sep 7, 2015

Good enough for me. So skipping the test on FreeBSD it is.

@jbergstroem
Copy link
Member

Cool, feel free to R=me.

@Trott
Copy link
Member

Trott commented Sep 8, 2015

Looks like armv7-ubuntu1404 failure is likely the same as the cause on CentOS 5. This line (or something like it) probably needs to be added to /etc/hosts:

::1               localhost locahost.localdomain

@rvagg
Copy link
Member

rvagg commented Sep 8, 2015

@Trott they already had:

::1             localhost ip6-localhost ip6-loopback

but I've added localhost.localdomain to the list. If this works would you mind filing an issue at nodejs/build about this so we remember to get this in to the build scripts and don't regress next time we replace the machine(s)?

@rvagg
Copy link
Member

rvagg commented Sep 8, 2015

ooops, wrong machine, you said ubunu1404, that didn't have a ::1 at all, fixed that now

@Trott
Copy link
Member

Trott commented Sep 9, 2015

armv7-ubuntu1404 now passes. Thanks, @rvagg! https://ci.nodejs.org/job/node-test-commit-arm/566/nodes=armv7-ubuntu1404/

@Trott
Copy link
Member

Trott commented Sep 9, 2015

So, the /etc/hosts addition has fixed the test for 3 of the 8 builds where it was broken. There's a pull request I have open to fix the test for the two FreeBSD builds that has a couple LGTM's and I'm just waiting at this point for the 48 hours to go by before merging. That will mean 5 of 8 fixed.

That leaves the two Windows builds and fedora22. The problem on fedora22 appears to stem from this line in /etc/hosts:

127.0.0.1 {{fqdn}} {{hostname}}

That seems weird to me, as 127.0.0.1 should not resolve to the fully qualified domain name. Or at least, it seems to me that it shouldn't. Can we just remove that line somehow? /cc @rvagg

UPDATE: And regardless of whether it should resolve to the fqdn, the fact is that it doesn't on that box. I imagine that might be because the fqdn may not actually be set. Just a guess though. That line causes 127.0.0.1 to resolve, literally, to the string {{fqdn}} and the test here decides that {{fqdn}} is not a valid hostname so the test fails.

Trott added a commit to Trott/io.js that referenced this issue Sep 9, 2015
FreeBSD does not support the V4MAPPED flag so expect an error.

This is a partial fix for nodejs#2468.
It only fixes it on FreeBSD. Failures on other platforms are due to
other reasons and need to be fixed separately.

PR-URL: nodejs#2724
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Fixes: nodejs#2468
@Trott
Copy link
Member

Trott commented Sep 9, 2015

FreeBSD issue fixed in f8152df. We're down to just fedora22, win2008r2, and win2012r2.

@rvagg
Copy link
Member

rvagg commented Sep 9, 2015

I'm not particularly happy with having to modify test machines to make this work because it indicates that we maybe need more robust tests -- when someone tries to run our test suite on their computer and it fails are we happy to tell them that their distro shipped a screwey /etc/hosts, obviously we're dealing with a multiplicity of opinions about exactly what a 127.0.0.* and ::1 should map to so maybe we should be taking that into account in our tests?

@Trott
Copy link
Member

Trott commented Sep 9, 2015

Perhaps if HOSTALIASES environment variable is honored on the machines this issue affects, we can use it to point to our own hosts file so that we have complete predictability across those systems concerning how hostnames in that file will be resolved.

@Trott
Copy link
Member

Trott commented Sep 9, 2015

In other news, it looks like this test is no longer a problem on win2008r2. Just win2012r2.

https://ci.nodejs.org/job/node-test-commit-windows/578/
https://ci.nodejs.org/job/node-test-commit-windows/577/

@Trott
Copy link
Member

Trott commented Sep 9, 2015

win2012r2 is failing because 127.0.0.1 resolves to iojs-rackspace- on that machine and common.isValidHostname() rejects that because hostnames cannot end with -.

So here's another option for win2012r2 and fedora22 that won't involve messing with hosts files: Don't check if the result is a valid hostname. Just check that it's a non-empty string. We already check that an error object isn't sent. And, as we're seeing, resolution from hosts files is not guaranteed to provide a valid hostname. So just go with what the OS gives you.

We could (if we think there's any value in it) still run the hostname through common.isValidHostname() but only print a warning if that fails.

@Fishrock123
Copy link
Contributor

Me and @jasnell ran into this while trying to run the tests on awful nodeconf.eu wifi. it's possible that with a connection without full connectivity, these tests get timed out from c-ares, while working when there is no internet at all.

Trott added a commit to Trott/io.js that referenced this issue Sep 10, 2015
Operating systems can and do return invalid hostnames if that's
what they have (for example) in /etc/hosts. Test passes if no
error is thrown and the hostname string is not empty.

Fixes: nodejs#2468
Trott added a commit that referenced this issue Sep 10, 2015
Operating systems can and do return invalid hostnames if that's
what they have (for example) in /etc/hosts. Test passes if no
error is thrown and the hostname string is not empty.

Fixes: #2468
PR-URL: #2785
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Trott added a commit that referenced this issue Sep 11, 2015
FreeBSD does not support the V4MAPPED flag so expect an error.

This is a partial fix for #2468.
It only fixes it on FreeBSD. Failures on other platforms are due to
other reasons and need to be fixed separately.

PR-URL: #2724
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Fixes: #2468
Trott added a commit that referenced this issue Sep 11, 2015
Operating systems can and do return invalid hostnames if that's
what they have (for example) in /etc/hosts. Test passes if no
error is thrown and the hostname string is not empty.

Fixes: #2468
PR-URL: #2785
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Trott added a commit that referenced this issue Sep 12, 2015
FreeBSD does not support the V4MAPPED flag so expect an error.

This is a partial fix for #2468.
It only fixes it on FreeBSD. Failures on other platforms are due to
other reasons and need to be fixed separately.

PR-URL: #2724
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Johan Bergström <bugs@bergstroem.nu>
Fixes: #2468
Trott added a commit that referenced this issue Sep 12, 2015
Operating systems can and do return invalid hostnames if that's
what they have (for example) in /etc/hosts. Test passes if no
error is thrown and the hostname string is not empty.

Fixes: #2468
PR-URL: #2785
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@Trott
Copy link
Member

Trott commented Sep 12, 2015

At this point, the only PR remaining to be merged to close out this bug is #2802. It's a pretty straightforward "split up this monolith of too many tests that cumulatively take too long and cause the file to timeout in CI on Windows" thing, so I'm hoping someone will give it the ol' LGTM soon-ish...

Trott added a commit that referenced this issue Sep 12, 2015
For whatever reason, the CI win2012 machine was timing out on the
internet test-dns file. Split out ipv4 and ipv6 specific tests to
separate files so tests do not time out. (Each file is given a 60
second timeout on CI. Tests within a file are run in sequence.)

PR-URL: #2802
Fixes: #2468
Reviewed-By: Roman Reiss <me@silverwind.io>
@Trott
Copy link
Member

Trott commented Sep 12, 2015

Between machine config changes made by @rvagg in the course of this issue discussion and PRs #2802, #2785, and #2724, this issue is now resolved. Closing.

@Trott Trott closed this as completed Sep 12, 2015
Trott added a commit that referenced this issue Sep 15, 2015
For whatever reason, the CI win2012 machine was timing out on the
internet test-dns file. Split out ipv4 and ipv6 specific tests to
separate files so tests do not time out. (Each file is given a 60
second timeout on CI. Tests within a file are run in sequence.)

PR-URL: #2802
Fixes: #2468
Reviewed-By: Roman Reiss <me@silverwind.io>
Trott added a commit that referenced this issue Sep 15, 2015
For whatever reason, the CI win2012 machine was timing out on the
internet test-dns file. Split out ipv4 and ipv6 specific tests to
separate files so tests do not time out. (Each file is given a 60
second timeout on CI. Tests within a file are run in sequence.)

PR-URL: #2802
Fixes: #2468
Reviewed-By: Roman Reiss <me@silverwind.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

5 participants