
Request to improve performance of stringifying in internal codes #2498

Closed
ywave620 opened this issue Dec 5, 2023 · 24 comments · Fixed by #2501
Labels
enhancement New feature or request

Comments

@ywave620

ywave620 commented Dec 5, 2023

Bug Description

The C++ binding function utf8Slice, called by Buffer.toString(), accounts for a large proportion of the CPU profile produced by benchmarking an HTTP proxy built on undici.

By using Uint8Array, we could eliminate the overhead incurred by the JS-to-C++ language transition.

Reproducible By

const http = require('node:http');
const { Pool } = require('undici');

const dispatcher = new Pool('http://127.0.0.1:8090' /* a 4-worker nginx */, {
  pipelining: 30,
  connections: 100,
});

const proxy = http.createServer((req, res) => {
  dispatcher.request({ // the constructor of RequestHandler in undici/lib/api/api-request.js
    method: req.method,
    path: req.url,
    headers: req.headers,
    body: req,
  }, (err, { statusCode, headers, body /** a Readable */ }) => {
    if (err) {
      throw err
    }
    res.writeHead(statusCode, headers);
    pipe(body, res);
  });
});

function pipe(src, dst) {
  src.on('readable', () => {
    let chunk;
    while (null !== (chunk = src.read(65536))) {
      dst.write(chunk);
    }
  });
  src.on('end', () => {
    dst.end()
  });
}

Expected Behavior

Logs & Screenshots

[screenshot: CPU profile]

Environment

node v22
linux

Additional context

I hope the maintainers can provide some guidance and, more importantly, caveats about applying this idea.

@ywave620 ywave620 added the bug Something isn't working label Dec 5, 2023
@ronag ronag added enhancement New feature or request and removed bug Something isn't working labels Dec 5, 2023
@ronag
Member

ronag commented Dec 5, 2023

FastBuffer is pretty much a Uint8Array with some helper methods. How exactly would this make it faster?

@ywave620
Author

ywave620 commented Dec 5, 2023

At least there would be no language-switch overhead from these:
[screenshot: CPU profile]

@ywave620 ywave620 changed the title Request to replace FastBuffer with Uint8Arrray in internal codes for performance Request to improve performance of stringifying in internal codes Dec 5, 2023
@ywave620
Author

ywave620 commented Dec 5, 2023

One caveat I found is nodejs/node#17431: %TypedArray%.prototype.subarray suffers from an upstream bug and performs slowly. However, I think %TypedArray%.prototype.toString will be faster than Buffer.prototype.toString, thanks to the saved language switch.

@ronag
Member

ronag commented Dec 5, 2023

When is utf8Slice called? And what faster option does Uint8Array provide?

@ronag
Member

ronag commented Dec 5, 2023

However, I think %TypedArray%.prototype.toString will be faster than Buffer.prototype.toString, thanks to the saved language switch

Do you have a benchmark?

@ronag
Member

ronag commented Dec 5, 2023

I think this should be fixed in Node if it's faster.

@ronag
Member

ronag commented Dec 5, 2023

Uint8Array.toString and Buffer.toString don't do the same thing...

@ywave620
Author

ywave620 commented Dec 5, 2023

When is utf8Slice called? And what faster option does Uint8Array provide?

[screenshot: CPU profile]

Sorry for the careless reply. The original motivation for this issue is that the binding function utf8Slice has become the second-largest CPU contributor among all functions in the HTTP module, so I think we should work on a more efficient alternative, probably a pure-JS one.

Yes, Uint8Array.toString is not what we want. TextDecoder is one thing I came up with, but my benchmark shows it has no advantage, which was unexpected:

const buf = Buffer.from('Content-Type')

console.time()
for(let i = 0; i < 1e6; i ++) {
  buf.toString();
}
console.timeEnd()

const decoder = new TextDecoder();
const arr = new Uint8Array([0x43,0x6f,0x6e,0x74,0x65,0x6e,0x74,0x2d,0x54,0x79,0x70,0x65])
console.time()
for(let i = 0; i < 1e6; i ++) {
  decoder.decode(arr)
}
console.timeEnd()
default: 116.781ms
default: 113.501ms

@ywave620
Author

ywave620 commented Dec 5, 2023

So if Buffer.toString is slow and has no room for improvement, can we skip it as much as possible?
Maybe in

const key = headers[i].toString()

If headers[i] represents a known header field, we can use a cached string rather than calling toString to create one.

@ronag
Member

ronag commented Dec 5, 2023

If headers[i] represents a known header field, we can use the cached string rather than call toString on it to get one.

The question is how to achieve that... you would need to somehow determine that a specific buffer corresponds to a specific string.

@ronag
Member

ronag commented Dec 5, 2023

@anonrig

@ronag
Member

ronag commented Dec 5, 2023

We have two constraints in our favor:

  • header keys are ascii (or are they?)
  • they are mostly from a limited set of possible values

@ronag
Member

ronag commented Dec 5, 2023

@lemire

@lemire
Member

lemire commented Dec 5, 2023

You are taking small buffers and converting them into v8 strings. It is highly likely that the bottleneck is in the speed of creating v8 strings from an array of bytes.

We can do some profiling to verify...

Regarding toString() (first half of the benchmark), the time spent is as follows (Node 20):

          10.17%        [.] node::encoding_binding::BindingData::DecodeUTF8                                                                             ▒
           8.51%        [.] Builtins_LoadIC                                                                                                             ▒
           6.46%        [.] v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString                                                       ▒
           5.01%        [.] v8::internal::Factory::AllocateRaw                                                                                          ▒
           4.12%        [.] v8::internal::Factory::NewStringFromUtf8                                                                                    ▒
           2.79%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Utf8DecoderBase                                                   ▒
           2.78%        [.] v8::ArrayBufferView::ByteLength                                                                                             ▒
           2.66%        [.] Builtins_CallApiCallback                                                                                                    ▒
           2.51%        [.] v8::ArrayBufferView::CopyContents                                                                                           ▒
           2.35%        [.] v8::Value::IsArrayBufferView                                                                                                ▒
           1.92%        [.] v8::String::NewFromUtf8                                                                                                     ▒
           1.73%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Decode<unsigned char>                                             ▒
           1.58%        [.] node::StringBytes::Encode                                                                                                   ▒
           1.31%        [.] v8::Value::IsSharedArrayBuffer                                                                                              ▒
           1.18%        [.] v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap                                                ▒
           1.10%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadSingleBytecodeData<v8::internal::SlotAccessorForHeapObject>          ▒
           1.02%        [.] v8::Value::IsArrayBuffer                                                                                                    ▒
           1.01%        [.] v8::Context::GetNumberOfEmbedderDataFields                                                                                  ▒
           0.98%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadObject

We can compare Bun with Node.

Node: default: 128.842ms

Bun: 58.57ms

So Bun is 2x faster.

Regarding the second half...

          10.22%        [.] node::encoding_binding::BindingData::DecodeUTF8                                                                             ▒
           6.60%        [.] Builtins_LoadIC                                                                                                             ▒
           5.15%        [.] Builtins_CallApiCallback                                                                                                    ▒
           5.01%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Utf8DecoderBase                                                   ▒
           3.85%        [.] v8::internal::Factory::NewStringFromUtf8                                                                                    ▒
           3.37%        [.] v8::ArrayBufferView::ByteLength                                                                                             ▒
           3.37%        [.] v8::ArrayBufferView::CopyContents                                                                                           ▒
           3.10%        [.] v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString                                                       ▒
           2.81%        [.] v8::Isolate::GetCurrentContext                                                                                              ▒
           2.65%        [.] node::StringBytes::Encode                                                                                                   ▒
           2.64%        [.] v8::internal::Factory::AllocateRaw                                                                                          ▒
           2.04%        [.] v8::Value::IsSharedArrayBuffer                                                                                              ▒
           2.04%        [.] v8::internal::Utf8DecoderBase<v8::internal::Utf8Decoder>::Decode<unsigned char>                                             ▒
           1.61%        [.] v8::internal::JSTypedArray::element_size                                                                                    ▒
           1.61%        [.] v8::Value::IsArrayBufferView                                                                                                ▒
           1.48%        [.] v8::Context::GetNumberOfEmbedderDataFields                                                                                  ▒
           1.41%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadObject                                                               ▒
           1.32%        [.] v8::internal::CopyChars<unsigned char, unsigned char>                                                                       ▒
           1.31%        [.] v8::internal::Deserializer<v8::internal::Isolate>::ReadSingleBytecodeData<v8::internal::SlotAccessorForHeapObject>          ▒
           1.02%        [.] v8::String::NewFromUtf8

We can compare Bun with Node.

Node: default: 128.932ms

Bun: 58.88ms

So Bun is 2x faster, again.

So DecodeUTF8 is a bottleneck in both instances. Its code looks like this...

void BindingData::DecodeUTF8(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);  // list, flags

  CHECK_GE(args.Length(), 1);

  if (!(args[0]->IsArrayBuffer() || args[0]->IsSharedArrayBuffer() ||
        args[0]->IsArrayBufferView())) {
    return node::THROW_ERR_INVALID_ARG_TYPE(
        env->isolate(),
        "The \"list\" argument must be an instance of SharedArrayBuffer, "
        "ArrayBuffer or ArrayBufferView.");
  }

  ArrayBufferViewContents<char> buffer(args[0]);

  bool ignore_bom = args[1]->IsTrue();
  bool has_fatal = args[2]->IsTrue();

  const char* data = buffer.data();
  size_t length = buffer.length();

  if (has_fatal) {
    auto result = simdutf::validate_utf8_with_errors(data, length);

    if (result.error) {
      return node::THROW_ERR_ENCODING_INVALID_ENCODED_DATA(
          env->isolate(), "The encoded data was not valid for encoding utf-8");
    }
  }

  if (!ignore_bom && length >= 3) {
    if (memcmp(data, "\xEF\xBB\xBF", 3) == 0) {
      data += 3;
      length -= 3;
    }
  }

  if (length == 0) return args.GetReturnValue().SetEmptyString();

  Local<Value> error;
  MaybeLocal<Value> maybe_ret =
      StringBytes::Encode(env->isolate(), data, length, UTF8, &error);
  Local<Value> ret;

  if (!maybe_ret.ToLocal(&ret)) {
    CHECK(!error.IsEmpty());
    env->isolate()->ThrowException(error);
    return;
  }

  args.GetReturnValue().Set(ret);
}

The relevant part is this...

  Local<Value> error;
  MaybeLocal<Value> maybe_ret =
      StringBytes::Encode(env->isolate(), data, length, UTF8, &error);

It calls...

      {
        val = String::NewFromUtf8(isolate,
                                  buf,
                                  v8::NewStringType::kNormal,
                                  buflen);
        Local<String> str;
        if (!val.ToLocal(&str)) {
          *error = node::ERR_STRING_TOO_LONG(isolate);
        }
        return str;
      }

... which calls....

    i::Isolate* i_isolate = reinterpret_cast<internal::Isolate*>(v8_isolate); \
    ENTER_V8_NO_SCRIPT_NO_EXCEPTION(i_isolate);                               \
    API_RCS_SCOPE(i_isolate, class_name, function_name);                      \
    if (length < 0) length = StringLength(data);                              \
    i::Handle<i::String> handle_result =                                      \
        NewString(i_isolate->factory(), type,                                 \
                  base::Vector<const Char>(data, length))                     \
            .ToHandleChecked();                                               \
    result = Utils::ToLocal(handle_result);  

which calls...

inline i::MaybeHandle<i::String> NewString(i::Factory* factory,
                                           NewStringType type,
                                           base::Vector<const char> string) {
  if (type == NewStringType::kInternalized) {
    return factory->InternalizeUtf8String(string);
  }
  return factory->NewStringFromUtf8(string);
}

And down, down we go... turtles all the way down.

So yeah, it is likely a fairly expensive process.

And as long as you must create a v8 string, I don't see much in the way of possible optimization. It costs what it costs.

@ronag
Member

ronag commented Dec 5, 2023

I think the point here is that we might not have to create the strings. We could e.g. hash the buffer and reuse strings, or find some other way to look up common header names from buffer to string.
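A minimal sketch of that idea, assuming a small table of well-known header names (the hash function, table, and names here are illustrative assumptions, not undici's actual code):

```javascript
// Sketch of the hash-and-reuse idea: map a byte-level hash of common
// header-name buffers to pre-built strings, so the hot path can skip
// the utf8Slice binding call entirely.
const knownHeaders = ['content-length', 'content-type', 'connection']

// FNV-1a over the raw bytes; cheap and good enough for a small table.
function fnv1a (buf) {
  let h = 0x811c9dc5
  for (let i = 0; i < buf.length; i++) {
    h = Math.imul(h ^ buf[i], 0x01000193) >>> 0
  }
  return h
}

const lookup = new Map(
  knownHeaders.map((name) => {
    const buf = Buffer.from(name)
    return [fnv1a(buf), { buf, name }]
  })
)

function headerToString (buf) {
  const hit = lookup.get(fnv1a(buf))
  // Verify with a real byte comparison so a hash collision cannot
  // silently return the wrong header name.
  if (hit !== undefined && hit.buf.equals(buf)) return hit.name
  return buf.toString()
}
```

The byte-comparison guard means a collision only costs a failed `equals` call before falling back to `toString()`, so correctness never depends on the hash being perfect.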

@lemire
Member

lemire commented Dec 5, 2023

You can use the fact that UTF-8 validation can be done nearly for free in Node, directly on a buffer, thanks to @anonrig.

@ronag
Member

ronag commented Dec 6, 2023

Do we even need to UTF-8-validate header strings? Shouldn't they be ASCII? I think llhttp might already validate that.

@ronag
Member

ronag commented Dec 6, 2023

So I was thinking something like:

function headerToString (buf) {
  if (buf[0] === CHAR_C && buf.equals(BUF_CONTENT_LENGTH)) {
    return 'content-length'
  }
  return buf.toString()
}

// or, with a hash table:

function headerToString (buf) {
  const key = hash(buf) // sufficiently unique hash
  return lookup.get(key) ?? buf.toString()
}

@tsctx
Member

tsctx commented Dec 6, 2023

I think I could use a tree structure.
It is more efficient because there is no need to lowercase.

Benchmark
// bench.mjs
import { bench, group, run } from "mitata";
const buffer = Buffer.from("content-type");

const decoder = new TextDecoder();

bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});
bench("noop", () => {});

class Tree {
  #node;
  constructor() {
    this.#node = {};
  }
  append(value) {
    const target = Buffer.from((value = value.toLowerCase()));
    let node = this.#node;
    for (let i = 0; i < target.length; ++i) {
      const key = target[i];
      node[key] ??= {};
      if (0x61 <= key && key <= 0x7a) {
        node[key - 32] ??= node[key];
      }
      node = node[key];
    }
    node.value = value;
  }
  lookup(buffer) {
    const length = buffer.length;
    let node = this.#node;
    for (let i = 0; i < length; ++i) {
      const key = buffer[i];
      if (!(key in node)) return null;
      node = node[key];
    }
    // A prefix of an appended name has no `value`; report it as a miss.
    return node.value ?? null;
  }
}

const tree = new Tree();

tree.append("content-type");

const fromCharCode = String.fromCharCode;

group("lookup", () => {
  bench("String.fromCharCode.apply", () => {
    fromCharCode.apply(null, buffer);
  });
  bench("Buffer#toString", () => {
    buffer.toString("latin1");
  });
  bench("new TextDecoder", () => {
    new TextDecoder().decode(buffer);
  });
  bench("TextDecoder", () => {
    decoder.decode(buffer);
  });
  bench("tree.lookup", () => {
    tree.lookup(buffer);
  });
});

group("lookup with toLowerCase", () => {
  bench("String.fromCharCode.apply", () => {
    fromCharCode.apply(null, buffer).toLowerCase();
  });
  bench("Buffer#toString", () => {
    buffer.toString("latin1").toLowerCase();
  });
  bench("new TextDecoder", () => {
    new TextDecoder().decode(buffer).toLowerCase();
  });
  bench("TextDecoder", () => {
    decoder.decode(buffer).toLowerCase();
  });
  bench("tree.lookup", () => {
    tree.lookup(buffer);
  });
});

await new Promise((resolve) => setTimeout(resolve, 7000));

await run();
• lookup
----------------------------------------------------------------- -----------------------------
String.fromCharCode.apply  142.27 ns/iter (123.52 ns … 386.93 ns) 141.83 ns 304.22 ns 325.85 ns
Buffer#toString            137.04 ns/iter (120.27 ns … 412.27 ns) 137.81 ns 251.07 ns 277.63 ns
new TextDecoder            273.48 ns/iter (190.98 ns … 506.72 ns) 316.53 ns 504.65 ns 506.72 ns
TextDecoder                172.61 ns/iter (114.26 ns … 381.04 ns)  169.4 ns 357.04 ns 373.75 ns
tree.lookup                 45.02 ns/iter  (38.88 ns … 144.39 ns)  47.52 ns  74.91 ns   80.5 ns

summary for lookup
  tree.lookup
   3.04x faster than Buffer#toString
   3.16x faster than String.fromCharCode.apply
   3.83x faster than TextDecoder
   6.07x faster than new TextDecoder
• lookup with toLowerCase
----------------------------------------------------------------- -----------------------------
String.fromCharCode.apply  152.95 ns/iter (129.08 ns … 433.39 ns) 148.95 ns 376.16 ns 387.05 ns
Buffer#toString            141.65 ns/iter  (125.11 ns … 453.8 ns)  146.7 ns  251.4 ns 266.22 ns
new TextDecoder            285.69 ns/iter (219.42 ns … 629.02 ns) 282.76 ns 589.22 ns 629.02 ns
TextDecoder                149.78 ns/iter   (116.24 ns … 3.53 µs) 133.05 ns 400.96 ns 988.68 ns
tree.lookup                 53.91 ns/iter   (41.39 ns … 449.1 ns)  53.85 ns 139.21 ns 162.82 ns

summary for lookup with toLowerCase
  tree.lookup
   2.63x faster than Buffer#toString
   2.78x faster than TextDecoder
   2.84x faster than String.fromCharCode.apply
   5.3x faster than new TextDecoder

@ronag
Member

ronag commented Dec 6, 2023

What about const key = buffer[i] & ~32; to lowercase?
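As an aside on that bit trick (a standalone sketch, not undici code): for ASCII letters, bit 5 controls case, so `| 32` sets the lowercase bit, while `& ~32` clears it and maps letters to uppercase instead:

```javascript
// For ASCII letters, setting bit 5 (| 0x20) lowercases; clearing it
// (& ~0x20) uppercases. Digits and '-' (0x2d) already have bit 5 set,
// so | 0x20 leaves them unchanged; arbitrary bytes would need a range
// check in general.
const buf = Buffer.from('Content-Type')
let lowered = ''
for (let i = 0; i < buf.length; i++) {
  lowered += String.fromCharCode(buf[i] | 0x20)
}
console.log(lowered) // 'content-type'
```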

@tsctx
Member

tsctx commented Dec 6, 2023

I can work on this, what do you want me to do?

@ronag
Member

ronag commented Dec 6, 2023

I can work on this, what do you want me to do?

  • Create two methods: headerToString (don't assume ascii) and headerToStringUnsafe (assume ascii).
  • Update parseHeaders to use them with an option "encoding".

Your tree lookup method seems like a good first try.
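The two proposed methods could be sketched like this (the shapes and the option name are assumptions for illustration; the real implementation landed in the linked PR). The unsafe variant relies on llhttp having already constrained header names to ASCII, so a byte-for-byte latin1 decode can skip UTF-8 validation:

```javascript
// General case: full UTF-8 decode with validation.
function headerToString (buf) {
  return buf.toString('utf8')
}

// Fast path: assumes the bytes are ASCII (latin1 maps each byte
// directly to a code point, with no validation pass).
function headerToStringUnsafe (buf) {
  return buf.toString('latin1')
}

// llhttp-style raw headers arrive as a flat [name, value, ...] array
// of buffers; the option picks which decoder to use.
function parseHeaders (rawHeaders, { allowUnsafe = false } = {}) {
  const decode = allowUnsafe ? headerToStringUnsafe : headerToString
  const headers = {}
  for (let i = 0; i < rawHeaders.length; i += 2) {
    headers[decode(rawHeaders[i]).toLowerCase()] = decode(rawHeaders[i + 1])
  }
  return headers
}
```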

@tsctx
Member

tsctx commented Dec 6, 2023

Let's go with that.
I think allowUnsafe is better than encoding.

@tsctx
Member

tsctx commented Dec 6, 2023

What about const key = buffer[i] & ~32; to lowercase?

This is not necessary, because even if it is uppercase, it already has a lowercase reference.
