Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osx, build: Node v6.3.0 won't build on Clang 3.2 (OSX 10.8 Darwin 12.5.0) #7618

Closed
gibfahn opened this issue Jul 8, 2016 · 18 comments
Closed
Labels
build Issues and PRs related to build files or the CI. macos Issues and PRs related to the macOS platform / OSX.

Comments

@gibfahn
Copy link
Member

gibfahn commented Jul 8, 2016

  • Version: v6.3.0
  • Platform: Mac OSX 10.8 (Darwin 12.5.0)
  • Subsystem: build

The build is failing on OSX 10.8 but passing on 10.10. It looks like the problem is due to this commit: 4014ecb, which was added in #7157 and backported in #7546. I note that @zbjornson said that:

Tested with gcc/g++ 4.8.3, clang 3.4 and MSVC 2015 (i.e. something like #4284 shouldn't happen here).

But I'm still getting this problem. I assume something similar to #4290 would fix it. @bnoordhuis

Failing machine:
Darwin Kernel Version 12.5.0
$ cc --version
Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn)
Target: x86_64-apple-darwin12.5.0
Passing machine:
Darwin Kernel Version 14.0.0
$ cc --version
Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix
Error:
../src/node_buffer.cc:1193:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ˆ
../src/node_buffer.cc:56:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ˆ
  c++ '-D_DARWIN_USE_64_BIT_INODE=1' '-DNODE_ARCH="x64"' '-DNODE_WANT_INTERNALS=1' '-DV8_DEPRECATION_WARNINGS=1' '-DNODE_USE_V8_PLATFORM=1' '-DNODE_TAG="-IBMBuild-"' '-DNODE_HAVE_I18N_SUPPORT=1' '-DNODE_HAVE_SMALL_ICU=1' '-DHAVE_INSPECTOR=1' '-DV8_INSPECTOR_USE_STL=1' '-DHAVE_OPENSSL=1' '-DHAVE_DTRACE=1' '-D__POSIX__' '-DNODE_PLATFORM="darwin"' '-DUCONFIG_NO_TRANSLITERATION=1' '-DUCONFIG_NO_SERVICE=1' '-DUCONFIG_NO_REGULAR_EXPRESSIONS=1' '-DU_ENABLE_DYLOAD=0' '-DU_STATIC_IMPLEMENTATION=1' '-DU_HAVE_STD_STRING=0' '-DUCONFIG_NO_BREAK_ITERATION=0' '-DUCONFIG_NO_LEGACY_CONVERSION=1' '-DUCONFIG_NO_CONVERSION=1' '-DHTTP_PARSER_STRICT=0' '-D_LARGEFILE_SOURCE' '-D_FILE_OFFSET_BITS=64' -I../src -I../tools/msvs/genfiles -I../deps/uv/src/ares -I/build/node/out/Release/obj/gen -I../deps/v8_inspector -I../deps/v8_inspector/deps/wtf -I/build/node/out/Release/obj/gen/blink -I../deps/v8/include -I../deps/icu-small/source/i18n -I../deps/icu-small/source/common -I../deps/openssl/openssl/include -I../deps/zlib -I../deps/http_parser -I../deps/cares/include -I../deps/uv/include  -Os -gdwarf-2 -mmacosx-version-min=10.7 -arch x86_64 -Wall -Wendif-labels -W -Wno-unused-parameter -std=gnu++0x -fno-rtti -fno-exceptions -fno-threadsafe-statics -fno-strict-aliasing -MMD -MF /build/node/out/Release/.deps//build/node/out/Release/obj.target/node/src/node_file.o.d.raw   -c -o /build/node/out/Release/obj.target/node/src/node_file.o ../src/node_file.cc
1 error generated.
make[2]: *** [/build/node/out/Release/obj.target/node/src/node_buffer.o] Error 1
make[2]: *** Waiting for unfinished jobs....
rm _node_deps_v8_inspector_platform_v8_inspector_v8_inspector_gyp_protocol_sources_target_generateV8InspectorProtocolBackendSources.intermediate
make[1]: *** [node] Error 2
make: *** [node-v7.0.0-darwin-x64.tar] Error 2
@gibfahn gibfahn changed the title osx, build: Node v6.3.0 won't build on OSX 10.8 Darwin 12.5.0 osx, build: Node v6.3.0 won't build on OSX 10.8 Darwin 12.5.0 Clang 4.2 Jul 8, 2016
@MylesBorins
Copy link
Contributor

/cc @nodejs/build regarding our OSX testing infra

@mscdex mscdex added build Issues and PRs related to build files or the CI. macos Issues and PRs related to the macOS platform / OSX. labels Jul 8, 2016
@mscdex
Copy link
Contributor

mscdex commented Jul 8, 2016

I'm not familiar with clang, so I'm not entirely sure which version we should be looking at, but the last version number on the failing machine shows 3.2 which is less than 3.4. shrug

@zbjornson
Copy link
Contributor

Urg. I tested 3.4 because that was the oldest version listed in https://github.com/nodejs/node/blob/master/BUILDING.md. If 3.2 needs to be supported, then we can detect builtin presence and fallback to one of the alternative definitions.

@gibfahn
Copy link
Member Author

gibfahn commented Jul 8, 2016

@mscdex @zbjornson I had assumed that the clang version was the 425.0.28 (i.e. 4.2), but if it is the 3.2 then our clang level might be out of date.

I think this clang level is the default one for OSX 10.8, which is theoretically supported by node.

@zbjornson
Copy link
Contributor

Apple versions clang kinda unusually. Their 4.2 is based on LLVM 3.2, and their 5.1 is based on LLVM 3.4.

I can work on a patch today, unless the answer is to require people who build to update xcode to a version with LLVM 3.4+ support.

@zbjornson
Copy link
Contributor

zbjornson commented Jul 10, 2016

@gibfahn when you run ./configure do you get the message "WARNING: C++ compiler too old, need g++ 4.8 or clang++ 3.4 (CXX=clang++)"? I don't have a mac to play with, and I'm facing problems unrelated to 4014ecb trying to get LLVM clang 3.2 to compile node on ubuntu.

@bnoordhuis
Copy link
Member

@zbjornson No, no warning. OS X 10.8 with Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn).

For anyone who wants to try it, you have to build with make CXX.host=c++ now because apple's g++ driver doesn't cut it, it doesn't understand the -std=gnu++0x switch.

@bnoordhuis
Copy link
Member

Opening gambit: #7644

@gibfahn gibfahn changed the title osx, build: Node v6.3.0 won't build on OSX 10.8 Darwin 12.5.0 Clang 4.2 osx, build: Node v6.3.0 won't build on Clang 3.2 (OSX 10.8 Darwin 12.5.0) Jul 10, 2016
@zbjornson
Copy link
Contributor

Contra gambit: #7645 :)

bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Jul 11, 2016
Said builtins are not supported by older versions of apple-gcc, breaking
the build on OS X 10.8.

Fixes: nodejs#7618
Refs: nodejs#4290
Refs: nodejs#7157
@mhdawson
Copy link
Member

Any update on this ?

@zbjornson
Copy link
Contributor

zbjornson commented Aug 22, 2016

I have a small change (addressing my own comment) that I seem to have forgotten to push and that's sitting on a computer that I can't get to 'till tonight. Not sure what @bnoordhuis wants to do, but I'd propose moving forward with #7645 because it benchmarks faster at least on Windows.

Will push that change and post the updated benchmarks tonight.

@bnoordhuis
Copy link
Member

I'm not hugely invested in this issue so I don't really care which way it goes. The thing I like about my own PR is that it reduces line count and removes the intrinsics (something I've never been a fan of) but if it's that much slower with VS, then I guess we should go with Zach's PR.

@zbjornson
Copy link
Contributor

zbjornson commented Sep 19, 2016

(Really sorry about the delay.)

I'm also not highly invested in this.

#7645 currently benchmarks faster on windows in 5 of 6 cases (slower for aligned 16-bit; faster for other combos of size/alignment) and I think would be okay to merge. If anyone wants, I could work to make the final case to match Ben's on perf. I haven't had time to benchmark on linux recently.

If we go with Ben's, I do have an npm package for fast byte swapping that would satisfy anyone else who needs as-fast-as-possible swapping. Thus, I don't really care which moves forward.

Edit Looked at linux benchmarks. Even with -march=native gcc will only use movbe element-wise and won't vectorize (whereas msvc does). Vectorizing it by hand works nicely, but (a) requires more preprocessor macros and builtins, (b) as discussed won't help with the default build config, and (c) because of (a) means it's harder to get test coverage on. So, I don't think it's a great idea to pursue that, but can if someone thinks it's the right path. (I'm putting the manual vectorization for GCC into my package though.)

@gibfahn
Copy link
Member Author

gibfahn commented Sep 21, 2016

So to summarise, the two options are #7644 (@bnoordhuis ), which makes the code cleaner/more maintainable but reduces performance on windows, or #7645 (@zbjornson ), which increases performance on Windows but keeps the intrinsics.

@zbjornson Do you have a rough estimate of how significant the speedup will be? Is it only windows which is currently slower on #7644? As VS 2013 has been dropped on master, the windows performance loss only affects us if it happens on VS 2015, so it'd be useful to have some specific numbers. If you've written a benchmark you could point me to, I'd be happy to try it with VS 2015.

It'd be great to have master building on Clang 3.2 again either way!

EDIT: Looking at #7645 (comment), it sounds like #7644 also includes the performance boost from #7157, so we'll get @zbjornson's performance improvements either way!

@zbjornson
Copy link
Contributor

zbjornson commented Sep 21, 2016

Mostly correct, except for the edit:

Less important:

  • src: fix build failure for clang 3.2; consolidate byte swapping code; fix buffer writes for unaligned ucs2 strings #7645 even with -march=native doesn't get much faster with GCC (up to a snapshot of GCC 7) because it only emits movbe and not [v]pshufb. MSVC emits pshufb with the default build params. GCC emits bswap (32/64) or ror (16).
  • Vectorizing the loop by hand is effective for GCC and clang (emit vpshufb), but requires another intrinsic that is only available when non-default compiler flags are used. I'm taking this approach in my bswap lib, but it seems pointless given that it would be rarely engaged and hard to maintain in node.js core.

If you'd like to play with benchmarks, they're in https://github.com/nodejs/node/blob/master/benchmark/buffers/buffer-swap.js. You probably want to just pick one len (a high value like 8192 so it uses the C++ impl) and drop the n, otherwise it's dreadfully slow to run them all.

Given the above, I think it would make sense to use the intrinsics on Windows only and remove them from the rest, which will keep perf where it's easy and commonly available, and cut the line count down. I can try to get to this tonight.

Edit example godbolt: https://godbolt.org/g/4rvmAa

@gibfahn
Copy link
Member Author

gibfahn commented Sep 21, 2016

@zbjornson You're right, the edit was unclear. I meant that the performance improvements from #7157 were in both #7644 and #7645 except for windows as mentioned earlier.

It would seem to make a lot of sense to only keep the intrinsics in for windows if that's where they make a big performance difference. Is there a way to get to at least the previous performance for aligned 16-bit types? The speedup for the other types might make it worth it even if not.

#7645 even with -march=native doesn't get much faster with GCC (up to a snapshot of GCC 7) because it only emits movbe and not [v]pshufb. MSVC emits pshufb with the default build params. GCC emits bswap (32/64) or ror (16).

Could you explain a little more here? #7645 doesn't get much faster that what? Are you saying that #7645 doesn't get much faster than #7644 (i.e. using the builtins doesn't give a large boost without using non-default compiler flags)?

So I guess your updated PR would be halfway between #7645 and #7644? That makes sense to me (as long as @bnoordhuis approves).

@zbjornson
Copy link
Contributor

zbjornson commented Sep 22, 2016

Is there a way to get to at least the previous performance for aligned 16-bit types?

Sorry, to be clear: #7645 does not have a perf regression for aligned 16-bit types on Windows, if that's what you mean. Rather, Ben's PR gives an improvement to this case.

Could you explain a little more here?

I meant, even if someone were to compile node with -march=native to allow SSE/AVX, GCC emits MOVBE (which is fast for one element) for the bswap builtin, but it does not vectorize the loop, so you don't see huge perf gains with or without the builtin. MSVC vectorizes and ends up a lot faster.

Here's what the hand-vectorized code looks like:
https://github.com/zbjornson/node-bswap/blob/master/src/bswap.cc#L13-L56
and here's how it benchmarks on win and linux:
https://github.com/zbjornson/node-bswap#benchmarks
The node values are equivalent to what we would get from #7645. The bswap.native column is what it could be if we went all-out and risked breaking non-x86 platforms (~30% faster on Windows and 2 to 7x faster on Linux).

Working on the halfway PR now that I've figured that stuff out. :)

@zbjornson
Copy link
Contributor

zbjornson commented Sep 22, 2016

OK, #7645 is now modified to use builtins for Windows only and vanilla C++ for the rest.

Benchmarks faster than or comparably to #7644 in all cases on Windows, and aligned cases on Linux:

image

I could keep fiddling with this to improve unaligned perf on Linux, but given that unaligned is less common than aligned, I'd also be happy with this as-is.

zbjornson added a commit to zbjornson/node that referenced this issue Sep 30, 2016
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Fixes: nodejs#7618
@gibfahn gibfahn closed this as completed in 7420835 Oct 4, 2016
jasnell pushed a commit that referenced this issue Oct 6, 2016
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Fishrock123 pushed a commit that referenced this issue Oct 11, 2016
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

 Conflicts:
	src/node_buffer.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. macos Issues and PRs related to the macOS platform / OSX.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants