
Request to improve performance of stringifying in internal codes #2498

Closed
ywave620 opened this issue Dec 5, 2023 · 24 comments · Fixed by #2501
Labels
enhancement New feature or request

Comments

@ywave620

ywave620 commented Dec 5, 2023

Bug Description

The C++ binding function utf8Slice, called by Buffer.toString(), accounts for a large proportion of the CPU profile produced by benchmarking an HTTP proxy built on undici.

By using Uint8Array, we could eliminate the overhead incurred by the JS-to-C++ language transition.

Reproducible By

const http = require('node:http');
const { Pool } = require('undici');

const dispatcher = new Pool('http://127.0.0.1:8090' /* a 4-worker nginx */, {
  pipelining: 30,
  connections: 100,
});

const proxy = http.createServer((req, res) => {
  dispatcher.request({ // the constructor of RequestHandler in undici/lib/api/api-request.js
    method: req.method,
    path: req.url,
    headers: req.headers,
    body: req,
  }, (err, { statusCode, headers, body /** a Readable */ }) => {
    if (err) {
      throw err
    }
    res.writeHead(statusCode, headers);
    pipe(body, res);
  });
});

function pipe(src, dst) {
  src.on('readable', () => {
    let chunk;
    while (null !== (chunk = src.read(65536))) {
      dst.write(chunk);
    }
  });
  src.on('end', () => {
    dst.end()
  });
}

Expected Behavior

Logs & Screenshots

[screenshot: CPU profile]

Environment

node v22
linux

Additional context

I hope the maintainers can provide some guidance and, more importantly, caveats about applying this idea.

@ywave620 ywave620 added the bug Something isn't working label Dec 5, 2023
@ronag ronag added enhancement New feature or request and removed bug Something isn't working labels Dec 5, 2023
@ronag
Member

ronag commented Dec 5, 2023

FastBuffer is pretty much a Uint8Array with some helper methods. How exactly would this make it faster?

@ywave620
Author

ywave620 commented Dec 5, 2023

At least there would be no language-switch overhead from these:
[screenshot: CPU profile]

@ywave620 ywave620 changed the title Request to replace FastBuffer with Uint8Arrray in internal codes for performance Request to improve performance of stringifying in internal codes Dec 5, 2023
@ywave620
Author

ywave620 commented Dec 5, 2023

One caveat I found is nodejs/node#17431: %TypedArray%.prototype.subarray suffers from an upstream bug and performs slowly. However, I think %TypedArray%.prototype.toString will be faster than Buffer.prototype.toString, thanks to the saved language switch.

@ronag
Member

ronag commented Dec 5, 2023

When is utf8Slice called? And what faster option does Uint8Array provide?

@ronag
Member

ronag commented Dec 5, 2023

However, I think %TypedArray%.prototype.toString will be faster than Buffer.prototype.toString, thanks to the saved language switch

Do you have a benchmark?

@ronag
Member

ronag commented Dec 5, 2023

I think this should be fixed in Node if it's faster.

@ronag
Member

ronag commented Dec 5, 2023

Uint8Array.toString and Buffer.toString don't do the same thing...

@ywave620
Author

ywave620 commented Dec 5, 2023

When is utf8Slice called? And what faster option does Uint8Array provide?

[screenshot: CPU profile]

Sorry for the careless reply. The original motivation for this issue is that the binding function utf8Slice has become the second-largest CPU contributor among all functions in the HTTP module, so I think we should work on a more efficient alternative, probably a pure-JS one.

Yes, Uint8Array.toString is not what we want. TextDecoder is one thing I came up with, but my benchmark shows it has no advantage, which was unexpected:

const buf = Buffer.from('Content-Type')

console.time()
for(let i = 0; i < 1e6; i ++) {
  buf.toString();
}
console.timeEnd()

const decoder = new TextDecoder();
const arr = new Uint8Array([0x43,0x6f,0x6e,0x74,0x65,0x6e,0x74,0x2d,0x54,0x79,0x70,0x65])
console.time()
for(let i = 0; i < 1e6; i ++) {
  decoder.decode(arr)
}
console.timeEnd()
default: 116.781ms
default: 113.501ms

@ywave620
Author

ywave620 commented Dec 5, 2023

So if Buffer.toString is slow and has no room for improvement, can we skip it as much as possible?
Maybe in

const key = headers[i].toString()

If headers[i] represents a known header field, we can use a cached string rather than calling toString to create one.

@ronag
Member

ronag commented Dec 5, 2023

If headers[i] represents a known header field, we can use the cached string rather than call toString on it to get one.

The question is how to achieve that... you would need to somehow determine that a specific buffer corresponds to a specific string.

@ronag
Member

ronag commented Dec 5, 2023

@anonrig

@ronag
Member

ronag commented Dec 5, 2023

We have two constraints in our favor:

  • header keys are ascii (or are they?)
  • they are mostly from a limited set of possible values

@ronag
Member

ronag commented Dec 5, 2023

@lemire

@lemire
Member

lemire commented Dec 5, 2023

You are taking small buffers and converting them into v8 strings. It is highly likely that the bottleneck is in the speed of creating v8 strings from an array of bytes.

We can do some profiling to verify...

Regarding toString() (first half of the benchmark), the time spent is as follows (Node 20):

          10.17%        [.] node::encoding_binding::BindingData::DecodeUTF8                                                                             ▒
           8.51%        [.] Builtins_LoadIC                                                                                                             ▒
           6.46%        [.] v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString                                                       ▒
           5.01%        [.] v8::internal::Factory::AllocateRaw                                                                                          ▒
           4.12%        [.] v8::internal::Factory::NewStringFromUtf8                                                                                    ▒
           2.79%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Utf8DecoderBase                                                   ▒
           2.78%        [.] v8::ArrayBufferView::ByteLength                                                                                             ▒
           2.66%        [.] Builtins_CallApiCallback                                                                                                    ▒
           2.51%        [.] v8::ArrayBufferView::CopyContents                                                                                           ▒
           2.35%        [.] v8::Value::IsArrayBufferView                                                                                                ▒
           1.92%        [.] v8::String::NewFromUtf8                                                                                                     ▒
           1.73%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Decode<unsigned char>                                             ▒
           1.58%        [.] node::StringBytes::Encode                                                                                                   ▒
           1.31%        [.] v8::Value::IsSharedArrayBuffer                                                                                              ▒
           1.18%        [.] v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap                                                ▒
           1.10%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadSingleBytecodeData<v8::internal::SlotAccessorForHeapObject>          ▒
           1.02%        [.] v8::Value::IsArrayBuffer                                                                                                    ▒
           1.01%        [.] v8::Context::GetNumberOfEmbedderDataFields                                                                                  ▒
           0.98%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadObject

We can compare Bun with Node.

Node: default: 128.842ms

Bun: 58.57ms

So Bun is 2x faster.

Regarding the second half...

          10.22%        [.] node::encoding_binding::BindingData::DecodeUTF8                                                                             ▒
           6.60%        [.] Builtins_LoadIC                                                                                                             ▒
           5.15%        [.] Builtins_CallApiCallback                                                                                                    ▒
           5.01%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Utf8DecoderBase                                                   ▒
           3.85%        [.] v8::internal::Factory::NewStringFromUtf8                                                                                    ▒
           3.37%        [.] v8::ArrayBufferView::ByteLength                                                                                             ▒
           3.37%        [.] v8::ArrayBufferView::CopyContents                                                                                           ▒
           3.10%        [.] v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString                                                       ▒
           2.81%        [.] v8::Isolate::GetCurrentContext                                                                                              ▒
           2.65%        [.] node::StringBytes::Encode                                                                                                   ▒
           2.64%        [.] v8::internal::Factory::AllocateRaw                                                                                          ▒
           2.04%        [.] v8::Value::IsSharedArrayBuffer                                                                                              ▒
           2.04%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Decode<unsigned char>                                             ▒
           1.61%        [.] v8::internal::JSTypedArray::element_size                                                                                    ▒
           1.61%        [.] v8::Value::IsArrayBufferView                                                                                                ▒
           1.48%        [.] v8::Context::GetNumberOfEmbedderDataFields                                                                                  ▒
           1.41%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadObject                                                               ▒
           1.32%        [.] v8::internal::CopyChars<unsigned char, unsigned char>                                                                       ▒
           1.31%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadSingleBytecodeData<v8::internal::SlotAccessorForHeapObject>          ▒
           1.02%        [.] v8::String::NewFromUtf8

We can compare Bun with Node.

Node: default: 128.932ms

Bun: 58.88ms

So Bun is 2x faster, again.

So DecodeUTF8 is a bottleneck in both instances. Its code looks like this...

void BindingData::DecodeUTF8(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);  // list, flags

  CHECK_GE(args.Length(), 1);

  if (!(args[0]->IsArrayBuffer() || args[0]->IsSharedArrayBuffer() ||
        args[0]->IsArrayBufferView())) {
    return node::THROW_ERR_INVALID_ARG_TYPE(
        env->isolate(),
        "The \"list\" argument must be an instance of SharedArrayBuffer, "
        "ArrayBuffer or ArrayBufferView.");
  }

  ArrayBufferViewContents<char> buffer(args[0]);

  bool ignore_bom = args[1]->IsTrue();
  bool has_fatal = args[2]->IsTrue();

  const char* data = buffer.data();
  size_t length = buffer.length();

  if (has_fatal) {
    auto result = simdutf::validate_utf8_with_errors(data, length);

    if (result.error) {
      return node::THROW_ERR_ENCODING_INVALID_ENCODED_DATA(
          env->isolate(), "The encoded data was not valid for encoding utf-8");
    }
  }

  if (!ignore_bom && length >= 3) {
    if (memcmp(data, "\xEF\xBB\xBF", 3) == 0) {
      data += 3;
      length -= 3;
    }
  }

  if (length == 0) return args.GetReturnValue().SetEmptyString();

  Local<Value> error;
  MaybeLocal<Value> maybe_ret =
      StringBytes::Encode(env->isolate(), data, length, UTF8, &error);
  Local<Value> ret;

  if (!maybe_ret.ToLocal(&ret)) {
    CHECK(!error.IsEmpty());
    env->isolate()->ThrowException(error);
    return;
  }

  args.GetReturnValue().Set(ret);
}

The relevant part is this...

  Local<Value> error;
  MaybeLocal<Value> maybe_ret =
      StringBytes::Encode(env->isolate(), data, length, UTF8, &error);

It calls...

      {
        val = String::NewFromUtf8(isolate,
                                  buf,
                                  v8::NewStringType::kNormal,
                                  buflen);
        Local<String> str;
        if (!val.ToLocal(&str)) {
          *error = node::ERR_STRING_TOO_LONG(isolate);
        }
        return str;
      }

... which calls....

    i::Isolate* i_isolate = reinterpret_cast<internal::Isolate*>(v8_isolate); \
    ENTER_V8_NO_SCRIPT_NO_EXCEPTION(i_isolate);                               \
    API_RCS_SCOPE(i_isolate, class_name, function_name);                      \
    if (length < 0) length = StringLength(data);                              \
    i::Handle<i::String> handle_result =                                      \
        NewString(i_isolate->factory(), type,                                 \
                  base::Vector<const Char>(data, length))                     \
            .ToHandleChecked();                                               \
    result = Utils::ToLocal(handle_result);  

which calls...

inline i::MaybeHandle<i::String> NewString(i::Factory* factory,
                                           NewStringType type,
                                           base::Vector<const char> string) {
  if (type == NewStringType::kInternalized) {
    return factory->InternalizeUtf8String(string);
  }
  return factory->NewStringFromUtf8(string);
}

And down, down we go... turtles all the way down.

So yeah, it is likely a fairly expensive process.

And as long as you must create a v8 string, I don't see much in the way of possible optimization. It costs what it costs.

@ronag
Member

ronag commented Dec 5, 2023

I think the point here is that we might not have to create the strings. We could e.g. hash the buffer and reuse strings, or find some other way to look up common header names from buffer to string.
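A minimal sketch of that idea, assuming a small table of well-known header names (the hash function, table, and names here are illustrative assumptions, not undici's actual code):

```javascript
// Sketch of the hash-and-reuse idea: map a byte-level hash of common
// header-name buffers to pre-built strings, so the hot path can skip
// the utf8Slice binding call entirely.
const knownHeaders = ['content-length', 'content-type', 'connection']

// FNV-1a over the raw bytes; cheap and good enough for a small table.
function fnv1a (buf) {
  let h = 0x811c9dc5
  for (let i = 0; i < buf.length; i++) {
    h = Math.imul(h ^ buf[i], 0x01000193) >>> 0
  }
  return h
}

const lookup = new Map(
  knownHeaders.map((name) => {
    const buf = Buffer.from(name)
    return [fnv1a(buf), { buf, name }]
  })
)

function headerToString (buf) {
  const hit = lookup.get(fnv1a(buf))
  // Verify with a real byte comparison so a hash collision cannot
  // silently return the wrong header name.
  if (hit !== undefined && hit.buf.equals(buf)) return hit.name
  return buf.toString()
}
```

The byte-comparison guard means a collision only costs a failed `equals` call before falling back to `toString()`, so correctness never depends on the hash being perfect.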

@lemire
Member

lemire commented Dec 5, 2023

You can use the fact that UTF-8 validation can be done nearly for free in Node, directly on a buffer, thanks to @anonrig.

@ronag
Member

ronag commented Dec 6, 2023

Do we even need to UTF-8-validate header strings? Shouldn't they be ASCII? I think llhttp might already validate that.

@ronag
Member

ronag commented Dec 6, 2023

So I was thinking something like:

function headerToString (buf) {
  if (buf[0] === CHAR_C && buf.equals(BUF_CONTENT_LENGTH)) {
    return 'content-length'
  }
  return buf.toString()
}

// or, with a hash table:

function headerToString (buf) {
  const key = hash(buf) // sufficiently unique hash
  return lookup.get(key) ?? buf.toString()
}

@tsctx
Member

tsctx commented Dec 6, 2023

I think I could use a tree structure.
It is more efficient because there is no need to lowercase.

Benchmark
// bench.mjs
import { bench, group, run } from "mitata";
const buffer = Buffer.from("content-type");

const decoder = new TextDecoder();

bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});

class Tree {
  #node;
  constructor() {
    this.#node = {};
  }
  append(value) {
    const target = Buffer.from((value = value.toLowerCase()));
    let node = this.#node;
    for (let i = 0; i < target.length; ++i) {
      const key = target[i];
      node[key] ??= {};
      if (0x61 <= key && key <= 0x7a) {
        node[key - 32] ??= node[key];
      }
      node = node[key];
    }
    node.value = value;
  }
  lookup(buffer) {
    const length = buffer.length;
    let node = this.#node;
    for (let i = 0; i < length; ++i) {
      const key = buffer[i];
      if (!(key in node)) return null;
      node = node[key];
    }
    // A prefix of an appended name has no `value`; report it as a miss.
    return node.value ?? null;
  }
}

const tree = new Tree();

tree.append("content-type");

const fromCharCode = String.fromCharCode;

group("lookup", () => {
  bench("String.fromCharCode.apply", () => {
    fromCharCode.apply(null, buffer);
  });
  bench("Buffer#toString", () => {
    buffer.toString("latin1");
  });
  bench("new TextDecoder", () => {
    new TextDecoder().decode(buffer);
  });
  bench("TextDecoder", () => {
    decoder.decode(buffer);
  });
  bench("tree.lookup", () => {
    tree.lookup(buffer);
  });
});

group("lookup with toLowerCase", () => {
  bench("String.fromCharCode.apply", () => {
    fromCharCode.apply(null, buffer).toLowerCase();
  });
  bench("Buffer#toString", () => {
    buffer.toString("latin1").toLowerCase();
  });
  bench("new TextDecoder", () => {
    new TextDecoder().decode(buffer).toLowerCase();
  });
  bench("TextDecoder", () => {
    decoder.decode(buffer).toLowerCase();
  });
  bench("tree.lookup", () => {
    tree.lookup(buffer);
  });
});

await new Promise((resolve) => setTimeout(resolve, 7000));

await run();
• lookup
----------------------------------------------------------------- -----------------------------
String.fromCharCode.apply  142.27 ns/iter (123.52 ns … 386.93 ns) 141.83 ns 304.22 ns 325.85 ns
Buffer#toString            137.04 ns/iter (120.27 ns … 412.27 ns) 137.81 ns 251.07 ns 277.63 ns
new TextDecoder            273.48 ns/iter (190.98 ns … 506.72 ns) 316.53 ns 504.65 ns 506.72 ns
TextDecoder                172.61 ns/iter (114.26 ns … 381.04 ns)  169.4 ns 357.04 ns 373.75 ns
tree.lookup                 45.02 ns/iter  (38.88 ns … 144.39 ns)  47.52 ns  74.91 ns   80.5 ns

summary for lookup
  tree.lookup
   3.04x faster than Buffer#toString
   3.16x faster than String.fromCharCode.apply
   3.83x faster than TextDecoder
   6.07x faster than new TextDecoder
• lookup with toLowerCase
----------------------------------------------------------------- -----------------------------
String.fromCharCode.apply  152.95 ns/iter (129.08 ns … 433.39 ns) 148.95 ns 376.16 ns 387.05 ns
Buffer#toString            141.65 ns/iter  (125.11 ns … 453.8 ns)  146.7 ns  251.4 ns 266.22 ns
new TextDecoder            285.69 ns/iter (219.42 ns … 629.02 ns) 282.76 ns 589.22 ns 629.02 ns
TextDecoder                149.78 ns/iter   (116.24 ns … 3.53 µs) 133.05 ns 400.96 ns 988.68 ns
tree.lookup                 53.91 ns/iter   (41.39 ns … 449.1 ns)  53.85 ns 139.21 ns 162.82 ns

summary for lookup with toLowerCase
  tree.lookup
   2.63x faster than Buffer#toString
   2.78x faster than TextDecoder
   2.84x faster than String.fromCharCode.apply
   5.3x faster than new TextDecoder

@ronag
Member

ronag commented Dec 6, 2023

What about const key = buffer[i] & ~32; to lowercase?
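As an aside on that bit trick (a standalone sketch, not undici code): for ASCII letters, bit 5 controls case, so `| 32` sets the lowercase bit, while `& ~32` clears it and maps letters to uppercase instead:

```javascript
// For ASCII letters, setting bit 5 (| 0x20) lowercases; clearing it
// (& ~0x20) uppercases. Digits and '-' (0x2d) already have bit 5 set,
// so | 0x20 leaves them unchanged; arbitrary bytes would need a range
// check in general.
const buf = Buffer.from('Content-Type')
let lowered = ''
for (let i = 0; i < buf.length; i++) {
  lowered += String.fromCharCode(buf[i] | 0x20)
}
console.log(lowered) // 'content-type'
```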

@tsctx
Member

tsctx commented Dec 6, 2023

I can work on this, what do you want me to do?

@ronag
Member

ronag commented Dec 6, 2023

I can work on this, what do you want me to do?

  • Create two methods: headerToString (don't assume ascii) and headerToStringUnsafe (assume ascii).
  • Update parseHeaders to use them with an option "encoding".

Your tree lookup method seems like a good first try.
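The two proposed methods could be sketched like this (the shapes and the option name are assumptions for illustration; the real implementation landed in the linked PR). The unsafe variant relies on llhttp having already constrained header names to ASCII, so a byte-for-byte latin1 decode can skip UTF-8 validation:

```javascript
// General case: full UTF-8 decode with validation.
function headerToString (buf) {
  return buf.toString('utf8')
}

// Fast path: assumes the bytes are ASCII (latin1 maps each byte
// directly to a code point, with no validation pass).
function headerToStringUnsafe (buf) {
  return buf.toString('latin1')
}

// llhttp-style raw headers arrive as a flat [name, value, ...] array
// of buffers; the option picks which decoder to use.
function parseHeaders (rawHeaders, { allowUnsafe = false } = {}) {
  const decode = allowUnsafe ? headerToStringUnsafe : headerToString
  const headers = {}
  for (let i = 0; i < rawHeaders.length; i += 2) {
    headers[decode(rawHeaders[i]).toLowerCase()] = decode(rawHeaders[i + 1])
  }
  return headers
}
```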

@tsctx
Member

tsctx commented Dec 6, 2023

Let's go with that.
I think allowUnsafe is better than encoding.

@tsctx
Member

tsctx commented Dec 6, 2023

What about const key = buffer[i] & ~32; to lowercase?

This is not necessary, because even if it is uppercase, it already has a lowercase reference.
