Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

cluster: tests multiple open of ports in workers #8027

Closed
wants to merge 144 commits into from
Closed

cluster: tests multiple open of ports in workers #8027

wants to merge 144 commits into from

Conversation

sam-github
Copy link

Note that test/simple/test-cluster-dgram-4.js is a simplified case, since
dgram.send() for clients does an implict .bind(0), so it could also be written
as:

  A = dgram.createSocket('udp4')
  A.send(...)
  A.close();
  ... time passes...
  B = dgram.createSocket('udp4')
  B.send(...)
  B.close();

Above was the original form, dscape/lynx kills the cluster master when run in
a worker.

IMO, dgram-4 is the most serious of this, using multiple dgram client sockets
is currently impossible with cluster.

Also note test/simple/test-cluster-net-listen-2.js, the TCP equivalent of
dgram-4, passes, because TCP close causes the cluster master to be notified,
UDP close does not.

Also, note that I use the ephemeral "flag" port 0 in the tests, the same
failure is seen with any port.

@sam-github
Copy link
Author

I spent about an hour trying to fix this, and can see some of the problems, but threading through the cluster messaging protocol to fix this is more than I can do tonight.

  • the master worker's tracking assumes a port has only been listened on once in a worker. I'm not sure how easy that would be to fix, when RR sends a TCP connection to a worker, which server would the connection be emitted on?
  • UDP is symmetrical, unlike TCP, so all UDP sockets are treated as servers by node, and all node dgram objects are shared among workers if they share the same port, and all client/ephemerally bound sockets share the same ephemeral port in node... I suspect that the implicit bind() in dgram.send() should probably NOT create a shareable socket, and instead create a worker-local socket.
  • Either way, if a dgram socket is closed, the master needs to know, at least one test passes with TCP and not UDP because of this lack.

sam-github added a commit to strongloop/strong-agent-statsd that referenced this pull request Jul 31, 2014
See nodejs/node-v0.x-archive#8027, work-around is to create a single dgram socket,
and to pass it to Lynx.
@trevnorris
Copy link

@sam-github Thanks much for the tests.

@indutny while I hate failing tests, these do point out issues in the implementation. What do you think about merging them?

@indutny
Copy link
Member

indutny commented Aug 2, 2014

OK :)

@tjfontaine
Copy link

I'm -1 on including tests that are failing, unless we're able to bring in the fixes at the same time. We want to ship with a green build.

My main question is if these are differences between how it works in 0.10 and 0.11?

Do you want these to be considered blockers for releasing 0.12?

Figuring out the right api for worker exclusive listening is desired, but I don't think we can get that properly addressed before 0.12.

@trevnorris
Copy link

@sam-github So, for the v0.12 release we'll have to do one of the following:

  1. Hold off on merging the tests until after v0.12 is cut, and then readdress these in v0.13 (or whatever it is).
  2. Include fixes for these tests, and document the change in behavior (since it'll be different than v0.10).

While no one thinks the current implementation is 100% correct, we won't want to merge failing tests because that would indicate something that we've added to the roadmap. In your opinion, since you obviously have a lot of experience with this issue, what's the work effort to fix these failing tests?

Meaning, if they could be fixed we'd be more than happy to merge both the tests and the fixes.

@sam-github
Copy link
Author

I think the cluster master asserting and dying should be blocking for 0.12.

This isn't really a PR, its OK not to merge, its more of an issue with attached reproductions in node unit test form, though if you agree that some of the failures are blocking, I can seperate them out.

Below is rundown of test output for 0.10 and 0.11:

dgram-3: pass on 0.10, asserts the cluster master in 0.11. I think this is 0.12 blocking.

assert.js:98
  throw new assert.AssertionError({
        ^
AssertionError: false == true
    at SharedHandle.add (cluster.js:79:3)
    at queryServer (cluster.js:411:12)
    at Worker.onmessage (cluster.js:368:7)
    at ChildProcess.<anonymous> (cluster.js:622:8)
    at ChildProcess.emit (events.js:129:20)
    at handleMessage (child_process.js:326:10)
    at Pipe.channel.onread (child_process.js:354:11)

dgram-4: fails on both 10 and 11, I don't know why, but I suspect that the second open is returning the first, now closed, socket instead of an open one. possibly just "unsupported", though that sucks

dgram.js:440
    throw new Error('Not running'); // error message from dgram_legacy.js
          ^
Error: Not running
    at Socket._healthCheck (dgram.js:440:11)
    at Socket.address (dgram.js:350:8)
    at Socket.<anonymous> (cluster.js:499:25)
    at Socket.g (events.js:199:16)
    at Socket.emit (events.js:129:20)
    at startListening (dgram.js:137:10)
    at dgram.js:201:9
    at shared (cluster.js:518:5)
    at Worker.<anonymous> (cluster.js:493:9)
    at process.<anonymous> (cluster.js:622:8)

listen-2 and listen-3 both fail on 0.10, it seems closing servers does not work even in 0.10

cluster.js:575
      port: tcpSelf.address().port || port,
                             ^
TypeError: Cannot read property 'port' of null
    at Server.<anonymous> (cluster.js:575:30)
    at Server.g (events.js:180:16)
    at Server.emit (events.js:117:20)
    at net.js:1055:10
    at process._tickCallback (node.js:419:13)

listen-2 passes on 0.11

listen-3 fails on 0.11, it kills the cluster master I think not being able to close a server is pretty close to release blocking, and that asserting the cluster master is definitely release blocking

assert.js:98
  throw new assert.AssertionError({
        ^
AssertionError: false == true
    at RoundRobinHandle.add (cluster.js:122:3)
    at queryServer (cluster.js:411:12)
    at Worker.onmessage (cluster.js:368:7)
    at ChildProcess.<anonymous> (cluster.js:622:8)
    at ChildProcess.emit (events.js:129:20)
    at handleMessage (child_process.js:326:10)
    at Pipe.channel.onread (child_process.js:354:11)

@trevnorris
Copy link

@bnoordhuis I'd like your feedback on the current dgram implementation for cluster.

@tjfontaine Based on #8027 (comment) I'd like your feedback on if any/all of those cases would/should be considered blockers for v0.12.

I've had little to do with either of these areas of code thus far, but if we do consider any of these a v0.12 blocker I'm willing to work on it.

@bnoordhuis
Copy link
Member

@bnoordhuis I'd like your feedback on the current dgram implementation for cluster.

I hate the implementation - it's one horrible hack on top of another - but feature-wise, it seems pretty okay.

@sam-github
Copy link
Author

Why does dgram do an explicit bind(0) on send()? The OS will do that implicitly on first sendto(), assigning a unique ephemeral port. Did it just seem convenient at the time, or is there more to it, perhaps an explicit bind causes some necessare libuv calls?

This would fit my expectations of how dgram would work in cluster. bind to a port (including 0), and you get a socket/port shareable between workers. Don't explicitly bind, and the workers socket is its own, dynamically and ephemerally bound to a un-shared port (as net's actively connected stream sockets are).

@bnoordhuis
Copy link
Member

Why does dgram do an explicit bind(0) on send()? The OS will do that implicitly on first sendto(), assigning a unique ephemeral port. Did it just seem convenient at the time, or is there more to it, perhaps an explicit bind causes some necessare libuv calls?

I'm not 100% sure but it could be legacy from the libev days. The explicit bind() isn't necessary with libuv though it doesn't hurt either. I suppose one minor upside is that it catches 'out of ephemeral ports' errors early.

@sam-github
Copy link
Author

Note that the explicit bind runs afoul of #5587, causing even dgram sockets that don't need any sharing to fail to be useful on Windows, and in a rather nasty way: they assert the cluster master.

@trevnorris
Copy link

From @bnoordhuis' response, it seems like the cluster/dgram implementation could/should use improvement.

@tjfontaine Think we should consider this a blocker for v0.12 until these tests pass?

@sam-github If so, can you help out creating fixes for these tests?

@trevnorris trevnorris added this to the v0.12 milestone Aug 9, 2014
@rmg
Copy link

rmg commented Sep 4, 2014

@sam-github there's now a work around for the implicit .bind() problem. #3856 was closed by 029cfc1 which adds an exclusive option to .bind() that lets you bypass the cluster behaviour in workers.

guisouza and others added 11 commits September 17, 2014 12:19
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Fedor Indutny <fedor@indutny.com>
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Fedor Indutny <fedor@indutny.com>
Adding a compatibility section to node.exe embedded manifest so that
Node is declared explicitly compatible with Windows 8.1. Required so
that os.release() can return the correct version on Windows 8.1.

See http://msdn.microsoft.com/en-us/library/windows/desktop/ms724451(v=vs.85).aspx

Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
SSL_CTX is shared between multiple connections and is not a right place
to store per-connection data.

fix #8348

Reviewed-By: Trevor Norris
Add a simple test to cover workers' implementation of
Worker.prototype.destroy(). Before adding this test, this code wouldn't
be covered by the tests suite, and any regression introduced in workers'
implementation of Worker.prototype.destroy wouldn't be caught.

Fixes: #8223
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
When V8 started supporting Promises natively it also introduced a
microtack queue. This feature operates similar to process.nextTick(),
and created an issue where neither knew when the other had run. This
patch has nextTick() call the microtask queue runner at the end of
processing callbacks in the nextTickQueue.

Fixes: #7714
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Increase the performance of new Buffer construction by initializing all
properties before SetIndexedPropertiesToExternalArrayData call.

Reviewed-by: Trevor Norris <trev.norris@gmail.com>
When calling write() after end() has been called on an OutgoingMessage,
an error is emitted and the write's callback is called with an instance
of Error.

Fix #7477.

Reviewed-By: Fedor Indutny <fedor@indutny.com>
Export External getters for a internal structs: SSL, SSL_CTX.
trevnorris and others added 28 commits December 5, 2014 04:56
Expose basic hooks for AsyncWrap via the async_wrap binding. Right now
only the PROVIDER types are exposed. This is a preliminary step before
more functionality is added.

PR-URL: #8110
Signed-off-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Fedor Indutny <fedor@indutny.com>
Reviewed-by: Alexis Campailla <alexis@janeasystems.com>
Reviewed-by: Julien Gilli <julien.gilli@joyent.com>
When instantiating a new AsyncWrap allow the parent AsyncWrap to be
passed. This is useful for cases like TCP incoming connections, so the
connection can be tied to the server receiving the connection.

Because the current architecture instantiates the *Wrap inside a
v8::FunctionCallback, the parent pointer is currently wrapped inside a
new v8::External every time and passed as an argument. This adds ~80ns
to instantiation time.

A future optimization would be to add the v8::External as the data field
when creating the v8::FunctionTemplate, change the pointer just before
making the call then NULL'ing it out afterwards. This adds enough code
complexity that it will not be attempted until the current approach
demonstrates it is a bottle neck.

PR-URL: #8110
Signed-off-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Fedor Indutny <fedor@indutny.com>
Reviewed-by: Alexis Campailla <alexis@janeasystems.com>
Reviewed-by: Julien Gilli <julien.gilli@joyent.com>
Call a user-defined callback at specific points in the lifetime of an
asynchronous event. Which are on instantiation, just before/after the
callback has been run.

**If any of these callbacks throws an exception, there is no forgiveness
or recovery. A message will be displayed and a core file dumped.**

Currently these only tie into AsyncWrap, meaning no call to a hook
callback will be made for timers or process.nextTick() events. Though
those will be added in a future commit.

Here are a few notes on how to make the hooks work:

- The "this" of all event hook callbacks is the request object.

- The zero field (kCallInitHook) of the flags object passed to
  setupHooks() must be set != 0 before the init callback will be called.

- kCallInitHook only affects the calling of the init callback. If the
  request object has been run through the create callback it will always
  run the before/after callbacks. Regardless of kCallInitHook.

- In the init callback the property "_asyncQueue" must be attached to
  the request object. e.g.

  function initHook() {
    this._asyncQueue = {};
  }

- DO NOT inspect the properties of the object in the init callback.
  Since the object is in the middle of being instantiated there are some
  cases when a getter is not complete, and doing so will cause Node to
  crash.

PR-URL: #8110
Signed-off-by: Trevor Norris <trev.norris@gmail.com>
Reviewed-by: Fedor Indutny <fedor@indutny.com>
Reviewed-by: Alexis Campailla <alexis@janeasystems.com>
Reviewed-by: Julien Gilli <julien.gilli@joyent.com>
Float libuv/libuv@484a3a9 to fix incorrect
indentation in REPL.
Add documentation for the callback parameter of http.ClientRequest's and
http.ServerResponse's write and end methods.
In `tls.markdown`, there was a misuse of 'a' which has been replaced
with 'an'.

In `timers.markdown`...
  line 31: misuse of 'a', replaced with 'an'
  line 59: unclear wording, haywire 'a', added new comma
Dtrace probes were removed from libuv recently, but their usage by node
was not completely removed, causing build breaks on SmartOS.

Even though the build is working on other platforms, these probes are
not fired by libuv anymore, so there's no point in using them on these
platforms too.

Reviewed-by: Trevor Norris <trev.norris@gmail.com>
PR-URL: #8847
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Marking these two tests as flaky, since they have been failing
intermittenly in recent builds:
test-debug-signal-cluster
test-cluster-basic
PR-URL: #8546
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
PR-URL: #8845
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Clarify the fd option: it is preferred to the path parameter, omits
the "open" event if given, and is available on WriteStreams as well.

PR-URL: #7707
Fixes: #7707
Fixes: #7708
Fixes: #4367
Reviewed-By: Chris Dickinson <christopher.s.dickinson@gmail.com>
Reviewed-By: Fedor Indutny <fedor@indutny.com>
PR-URL: #6442
Fix Interface.setBreakpoint() to correctly handle an attempt to set a
breakpoint in the current script when there is no current script.
This usually happens when the debugged process is not paused.

Fixes: #6453
PR-URL: #6460
Reviewed-By: Chris Dickinson <christopher.s.dickinson@gmail.com>
Fix a Windows-only build error that was introduced in
commit 1183ba4 ("zlib: support concatenated gzip files").

Rename the NO_ERROR and FAILED enumerations, they conflict
with macros of the same name in <winerror.h>.

PR-URL: #8893
Reviewed-By: Fedor Indutny <fedor@indutny.com>
Reviewed-By: Rod Vagg <rod@vagg.org>
Reviewed-by: Timothy J Fontaine <tjfontaine@gmail.com>
In cases where many small writes are made to a stream
lacking _writev, the array data structure backing the
WriteReq buffer would greatly increase GC pressure.

Specifically, in the fs.WriteStream case, the
clearBuffer routine would only clear a single WriteReq
from the buffer before exiting, but would cause the
entire backing array to be GC'd. Switching to [].shift
lessened pressure, but still the bulk of the time was
spent in memcpy.

This replaces that structure with a linked list-backed
queue so that adding and removing from the queue is O(1).
In the _writev case, collecting the buffer requires an
O(N) loop over the buffer, but that was already being
performed to collect callbacks, so slowdown should be
neglible.

PR-URL: #8826
Reviewed-by: Timothy J Fontaine <tjfontaine@gmail.com>
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Add documentation for the callback parameter of http.ClientRequest's and
http.ServerResponse's end methods.

Signed-off-by: Julien Gilli <julien.gilli@joyent.com>
The url.parse() function now checks whether an escapable character is in
the URL before trying to escape it.

PR-URL: #8638
[trev.norris@gmail.com: Switch to use continue instead of if]
Signed-off-by: Trevor Norris <trev.norris@gmail.com>
Make "--with-intl=none" the default and add "intl-none" option to
vcbuild.bat.

If icu data is missing print a warning unless either --download=all or
--download=icu is set. If set then automatically download, verify (MD5)
and unpack the ICU data if not already available.

There's a "list" of URLs being used, but right now only the first is
picked up. The logic works something like this:

* If there is no directory deps/icu,
  * If no zip file (currently icu4c-54_1-src.zip),
    * Download zip file (icu-project.org -> sf.net)
  * Verify the MD5 sum of the zipfile
    * If bad, print error and exit
  * Unpack the zipfile into deps/icu
* If deps/icu now exists, use it, else fail with help text

Add the configuration option "--with-icu-source=..."

Usage:
  * --with-icu-source=/path/to/my/other/icu
  * --with-icu-source=/path/to/icu54.zip
  * --with-icu-source=/path/to/icu54.tgz
  * --with-icu-source=http://example.com/icu54.tar.bz2

Add the configuration option "--with-icu-locals=...".  Allows choosing
which locales are used in the "small-icu" case.

Example:
    configure --with-intl=small-icu --with-icu-locales=tlh,grc,nl

(Also note that as of this writing, neither Klingon nor Ancient Greek
are in upstream CLDR data. Serving suggestion only.)

Don't use hard coded ../../out paths on windows. This was suggested by
@misterdjules as it causes test failures.  With this fix, "out" is no
longer created on windows and the following can run properly:

    python tools/test.py simple

Reduce space by about 1MB with ICU 54 (over without this patch). Also
trims a few other source files, but only conditional on the exact ICU
version used. This is to future-proof - a file that is unneeded now may
be needed in future ICUs.

Also:
  * Update distclean to remove icu related files
  * Refactor some code into tools/configure.d/nodedownload.py
  * Update docs
  * Add test

PR-URL: #8719
Fixes: #7676 (comment)
[trev.norris@gmail.com small change to test's whitespace and logic]
Signed-off-by: Trevor Norris <trev.norris@gmail.com>
PR-URL: #8964
Reviewed-by: Trevor Norris <trev.norris@gmail.com>
Socket.prototype.connect() sometimes throws on bad inputs
after an asynchronous operation. This commit makes the input
validation synchronous. This commit also removes some hard
coded IP addresses.

PR-URL: #8180
Fixes: #8140
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Fedor Indutny <fedor@indutny.com>
Reviewed-By: Timothy J Fontaine <tjfontaine@gmail.com>
If the data length passed to smalloc.alloc() the array_length will be
zero, causing an overflow check to fail. This prevents that from
happening.

Signed-off-by: Trevor Norris <trev.norris@gmail.com>
Revert "src: fix windows build error" and "zlib: support
concatenated gzip files". Treating subsequent data as a
concatenated stream breaks npm install.

This reverts commits 93533e9
and 6f6a979.

Fixes: #8962
PR-URL: #8985
Reviewed-By: Julien Gilli <julien.gilli@joyent.com>
b636ba8 caused a regression
on Windows due to the way server handles are cleaned up. This
commit fixes the test by allowing the handle to be cleaned up.

Fixes: #8986
PR-URL: #8998
Reviewed-By: Julien Gilli <julien.gilli@joyent.com>
Reviewed-By: Bert Belder <bertbelder@gmail.com>
Note that test/simple/test-cluster-dgram-4.js is a simplified case, since
dgram.send() for clients does an implict .bind(0), so it could also be written
as:

  A = dgram.createSocket('udp4')
  A.send(...)
  A.close();
  ... time passes...
  B = dgram.createSocket('udp4')
  B.send(...)
  B.close();

Above was the original form, dscape/lynx kills the cluster master when run in
a worker.

IMO, dgram-4 is the most serious of this, using multiple dgram client sockets
is currently impossible with cluster.

Also note test/simple/test-cluster-net-listen-2.js, the TCP equivalent of
dgram-4, passes, because TCP close causes the cluster master to be notified,
UDP close does not.

Also, note that I use the ephemeral "flag" port `0` in the tests, the same
failure is seen with any port.
In case that would help the tests pass... but it does not.
@sam-github sam-github closed this Jan 12, 2015
@sam-github sam-github deleted the fix-cluster-multi-ephemeral-bind branch January 12, 2015 20:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.