digitalmars.D.announce - HTTP frameworks benchmark focused on D libraries

tchaloupka <chalucha gmail.com> writes:
Hi,
as this topic pops up now and then (most recently in 
https://forum.dlang.org/thread/qttjlgxjmrzzuflrjiio forum.dlang.org), I wanted
to see how the various D libraries perform against each other too and ended up
with https://github.com/tchaloupka/httpbench

It's just a simple plaintext response test (nothing as fancy as 
in TechEmpower), but this interests me the most as it gives an 
idea of the potential of each library.

More details in the README.

Hope it helps to test some ideas or improve the current solutions.

Tom
Sep 20 2020
Adam D. Ruppe <destructionator gmail.com> writes:
With my lib, the -version=embedded_httpd_threads build should 
give more consistent results in tests like this.

The process pool it uses by default in a dub build is more crash 
resilient, but does have a habit of dropping excessive concurrent 
connections. This forces them to retry which slaughters 
benchmarks like this. It will have like 5 ms 99th percentile (2x 
faster than the same test with the threads version btw), but then 
that final 1% of responses can take several seconds to complete 
(indeed with 256 concurrent on my box it takes a whopping 30 
seconds!). Even with only like 40 concurrent, there's a final 1% 
spike there, but it is more like 10ms so it isn't so noticeable, 
but with hundreds it grows fast.
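
To make the "good 99th percentile, terrible final 1%" effect concrete, here is a minimal Python sketch. The latency values (990 responses at 5 ms, 10 stuck for 30 s) are invented for illustration and are not measurements from cgi.d; `percentile` is a hypothetical helper using the nearest-rank method.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 99% fast, 1% stuck in connection retries.
latencies_ms = [5.0] * 990 + [30_000.0] * 10

p99 = percentile(latencies_ms, 99)            # still looks great
worst = max(latencies_ms)                     # the tail the p99 does not show
mean = sum(latencies_ms) / len(latencies_ms)  # dragged up by the few slow responses
```

With these made-up numbers the 99th percentile stays at 5 ms while the mean lands around 305 ms, which is why a requests-per-second figure can collapse even though almost every response is fast.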

That's probably what you're seeing here. The thread build accepts 
more smoothly and thus evens it out giving a nicer benchmark 
number... but it actually performs worse on average in real world 
deployments in my experience and is not as resilient to buggy 
code segfaulting (with processes, the individual handler respawns 
and resets that individual connection with no other requests 
affected. With threads, the whole server must respawn which also 
often slips by unnoticed but is more likely to disrupt unrelated 
users).

There is a potential "fix" for the process handler to complete 
these benchmarks more smoothly too, but it comes at a cost: even 
in the long retry cases, at least the client has some feedback. 
It knows its connection is not accepted and can respond 
appropriately. At a minimum, they won't be shoveling data at you 
yet. The "fix" though breaks this - you accept ALL the 
connections, even if you are too busy to actually process them. 
This leads to more inbound data potentially worsening the 
existing congestion and leaving users more likely to just hang. 
At least the unaccepted connection is specified (by TCP) to retry 
later automatically, but if it is accepted, acknowledged, yet 
unprocessed, it is unclear what to do. Odds are the user will 
just be left hanging until the browser decides to timeout and 
display its error which can actually take longer than the TCP 
retry window.

My threads version does it this way anyway though. So it'd 
probably look better on the benchmark.


But BTW stuff like this is why I don't put too much stock in 
benchmarks. Even if you aren't "cheating" like checking length 
instead of path and other tricks like that (which btw I think are 
totally legitimate in some cases, I said recently I see it as a 
*strength* when you can do that), it still leaves some nuance on 
the ground. Is it crash resilient? Debuggable when it crashes? Is 
it compatible with third-party libraries or does it force you to choose 
from ones that share your particular event loop at risk of 
blocking the whole server when you disobey? Does it *actually* 
provide the scalability it claims to under real world conditions, 
or did it optimize to the controlled conditions of benchmarks at 
the expense of dynamic adaptation to reality?

Harder to measure those.
Sep 20 2020
Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Sunday, 20 September 2020 at 20:03:27 UTC, tchaloupka wrote:
> Hi,
> as this topic pops up now and then (most recently in
> https://forum.dlang.org/thread/qttjlgxjmrzzuflrjiio forum.dlang.org), I wanted
> to see how the various D libraries perform against each other too and ended up
> with https://github.com/tchaloupka/httpbench
>
> It's just a simple plaintext response test (nothing as fancy as
> in TechEmpower), but this interests me the most as it gives an
> idea of the potential of each library.
>
> More details in the README.
>
> Hope it helps to test some ideas or improve the current
> solutions.
>
> Tom

Cool! Nice to see such good results for D. Did you try netcore 3.1 btw? 🤔
Sep 20 2020
tchaloupka <chalucha gmail.com> writes:
On Monday, 21 September 2020 at 05:48:54 UTC, Imperatorn wrote:
> On Sunday, 20 September 2020 at 20:03:27 UTC, tchaloupka wrote:
>> Hi,
>> as this topic pops up now and then (most recently in
>> https://forum.dlang.org/thread/qttjlgxjmrzzuflrjiio forum.dlang.org), I wanted
>> to see how the various D libraries perform against each other too and ended up
>> with https://github.com/tchaloupka/httpbench
>>
>> It's just a simple plaintext response test (nothing as fancy
>> as in TechEmpower), but this interests me the most as it gives
>> an idea of the potential of each library.
>>
>> More details in the README.
>>
>> Hope it helps to test some ideas or improve the current
>> solutions.
>>
>> Tom
>
> Cool! Nice to see such good results for D. Did you try netcore 3.1 btw? 🤔

There's really no reason for D to be any slower than others. It's just about the whole library package and how efficiently it's written. Eventcore is probably closest to the system and everything above it just adds more overhead.

I've tried to run .NET Core out of docker (I'm using podman actually) and it seems to be more performant than .NET Core 5. But it was out of the container so maybe it's just that.

I've added switches to the CLI to set some load generator parameters so we can test scaling more easily.

Thanks to Adam I've also pushed tests for the arsd:cgi package. It's in its own category as the others are using async I/O loops. But everything has its pros and cons.
Sep 21 2020
ikod <igor.khasilev gmail.com> writes:
On Sunday, 20 September 2020 at 20:03:27 UTC, tchaloupka wrote:
> Hi,
> as this topic pops up now and then (most recently in
> https://forum.dlang.org/thread/qttjlgxjmrzzuflrjiio forum.dlang.org), I wanted
> to see how the various D libraries perform against each other too and ended up
> with https://github.com/tchaloupka/httpbench
>
> It's just a simple plaintext response test (nothing as fancy as
> in TechEmpower), but this interests me the most as it gives an
> idea of the potential of each library.
>
> More details in the README.
>
> Hope it helps to test some ideas or improve the current
> solutions.
>
> Tom

Thanks! Very good news.
Sep 21 2020
James Blachly <james.blachly gmail.com> writes:
On 9/20/20 4:03 PM, tchaloupka wrote:
> Hi,
> as this topic pops up now and then (most recently in
> https://forum.dlang.org/thread/qttjlgxjmrzzuflrjiio forum.dlang.org), I
> wanted to see how the various D libraries perform against each other too
> and ended up with https://github.com/tchaloupka/httpbench
>
> It's just a simple plaintext response test (nothing as fancy as in
> TechEmpower), but this interests me the most as it gives an idea of
> the potential of each library.
>
> More details in the README.
>
> Hope it helps to test some ideas or improve the current solutions.
>
> Tom

Thank you for doing this!

One of the most fascinating things, I think, is how photon really shines when concurrency gets dialed up. With 8 workers, it performs about as well as, but below, the rest of the micro category, and also below the Rust and Go /platforms/. However, at 64 concurrent workers, photon rises to the top of the stack, performing about as well as eventcore and hunt.

When going all the way up to 256, it was the only one that demonstrated **consistent performance** -- about the same as w/64, whereas ALL others dropped off, performing WORSE with 256 workers.
Sep 21 2020
tchaloupka <chalucha gmail.com> writes:
Hi all, I've just pushed the updated results.

Test suite modifications:

* added runner command to list available tests
* possibility to switch off keepalive connections - causes `hey` 
to make a new connection for each request
* added parameter to run each test multiple times and choose the 
best result out of the runs

Tests additions:

* new RAW tests in C to utilize epoll and io_uring (using 
liburing) event loops - so we have some ground base we can 
compare against
* same RAW tests but in Dlang too - both in betterC, epoll is 
basically the same, io_uring differs in that it uses my during[1] 
library - so we can see if there are some performance problems 
(as it should perform basically the same as C variant)

Some insights:

I've found the test results from hey[2] pretty inconsistent (run 
locally or over the network). That's the reason I've added the 
`bestof` switch to the runner. And the current test results are 
the best of 10 runs for each of them.
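
The "best of N runs" idea can be sketched in a few lines of Python. This is only an illustration, not the httpbench runner's actual code: `best_of` is a hypothetical helper, and the requests-per-second values are invented stand-ins for real load generator output.

```python
def best_of(run_once, runs):
    """Call the benchmark `runs` times and keep the highest requests/second."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return max(run_once() for _ in range(runs))

# Hypothetical noisy results (requests per second) replacing real benchmark runs.
fake_results = iter([41_000.0, 47_500.0, 39_800.0])
best = best_of(lambda: next(fake_results), runs=3)
```

Taking the maximum filters out runs disturbed by unrelated system noise, at the price of hiding how much the results vary from run to run.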

Some results are a bit surprising, i.e. that even with 10 runs 
there are tests that are faster than C/dlang raw tests - as they 
should be at the top because they really don't do anything with 
HTTP handling.. And eventcore/fibers to beat raw C epoll loop 
with fibers overhead? It just seems odd..

I'll probably add wrk[3] load generator too to see a difference 
with longer-running tests.

[1] https://github.com/tchaloupka/during
[2] https://github.com/rakyll/hey
[3] https://github.com/wg/wrk
Sep 27 2020
ikod <igor.khasilev gmail.com> writes:
On Sunday, 27 September 2020 at 10:08:24 UTC, tchaloupka wrote:
> Hi all, I've just pushed the updated results.
> * new RAW tests in C to utilize epoll and io_uring (using
> liburing) event loops - so we have some ground base we can
> I'll probably add wrk[3] load generator too to see a difference
> with longer-running tests.
>
> [1] https://github.com/tchaloupka/during
> [2] https://github.com/rakyll/hey
> [3] https://github.com/wg/wrk

Thanks for this work. It may be worth adding nginx as a baseline for a real C-based server. I'll add my framework as soon as it is ready.
Sep 27 2020
Adam D. Ruppe <destructionator gmail.com> writes:
I fixed my event loop last night so I'll prolly release that at 
some point after a lil more testing, it fixes my keep-alive 
numbers... but harms the others so I wanna see if I can maintain 
those too.
Sep 27 2020
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 27 September 2020 at 10:08:24 UTC, tchaloupka wrote:
> * new RAW tests in C to utilize epoll and io_uring (using
> liburing) event loops - so we have some ground base we can
> compare against

I fixed some buffering issues in cgi.d and, if you have the right concurrency level that happens to align with the number of worker processes... I'm getting incredible results. 65k rps. It *might* just beat the raw there. The kernel does a really good job.

Of course, it still will make other connections wait forever... but my new event loop in threads mode is now also giving me a pretty solid 26k rps on random concurrency levels with the buffering fix.

I just need to finish testing this to get some confidence before I push live but here it is on a github branch if you're curious to look:

https://github.com/adamdruppe/arsd/blob/cgi_preview/cgi.d

Compile with `-version=embedded_httpd_threads -version=cgi_use_fiber` to opt into the new event loop. But the buffering improvements should register in all usage modes.
Sep 27 2020
James Blachly <james.blachly gmail.com> writes:
On 9/27/20 6:08 AM, tchaloupka wrote:
> Hi all, I've just pushed the updated results.

Thanks for continuing to work on this!

vibe-core performs quite well -- scaling up with additional workers from 8 through 256, whereas the vibe-d platform tops out at around 35,000-45,000 RPS irrespective of simultaneous workers (plateauing between 8-64 workers).

Given the outstanding performance of vibe-core it looks like there is room to continue to improve the vibe-d platform.

Cheers again for your work.
Sep 27 2020
Daniel Kozak <kozzi11 gmail.com> writes:
On Sun, Sep 27, 2020 at 12:10 PM tchaloupka via Digitalmars-d-announce <
digitalmars-d-announce puremagic.com> wrote:

> ...
> Some results are a bit surprising, i.e. that even with 10 runs
> there are tests that are faster than C/dlang raw tests - as they
> should be at the top because they really don't do anything with
> HTTP handling.. And eventcore/fibers to beat raw C epoll loop
> with fibers overhead? It just seems odd..
> ...

I do not see TCP_NODELAY anywhere in your code for raw tests, so maybe you should try that.
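
For readers unfamiliar with the option, here is a minimal Python sketch of enabling `TCP_NODELAY` (which disables Nagle's algorithm) on an accepted connection. It is not taken from the httpbench sources; the loopback address and ephemeral port are assumptions made only so the example is self-contained.

```python
import socket

# Listening socket on an ephemeral loopback port (chosen only for this demo).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

# Disable Nagle's algorithm so small responses are sent without waiting to coalesce.
conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay_enabled = conn.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

for s in (conn, client, server):
    s.close()
```

The same `setsockopt` call exists in C and D with identical constants, so the raw tests would need only one extra call per socket.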
Sep 28 2020
tchaloupka <chalucha gmail.com> writes:
On Monday, 28 September 2020 at 09:44:14 UTC, Daniel Kozak wrote:
> I do not see TCP_NODELAY anywhere in your code for raw tests,
> so maybe you should try that

I've added new results with these changes:

* added NGINX test
* edge and level triggered variants for epoll tests (level should be faster in this particular test)
* new hybrid server variant of ARSD (thanks Adam)
* added TCP_NODELAY to listen socket in some tests (client sockets should inherit this)
* make response sizes even for all tests
* errors and response size columns are printed out only when there's some difference as they are pretty meaningless otherwise
* switch to wrk load generator as a default (hey is still supported) - as it is less resource demanding and gives better control over client connections and its worker threads

Some tests insights:

* arsd - I'm more inclined to switch it to multiCore category, at least the hybrid variant now (as it's not too fair against the others that should run just one worker with one thread event loop) - see https://github.com/tchaloupka/httpbench/pull/5 for discussion
   * ideal would be to add current variant to multiCore tests and limited variant to singleCore
* photon - I've assumed that it's working in a single thread, but it doesn't seem to (see https://github.com/tchaloupka/httpbench/issues/7)
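
The difference between the level and edge triggered epoll variants mentioned above can be shown with a small Linux-only Python sketch. It only demonstrates the notification semantics and says nothing about which mode is faster; `ready_counts` is a hypothetical helper, and a `socketpair` stands in for a real client connection.

```python
import select
import socket

def ready_counts(flags):
    """Return how many events two consecutive polls report while data stays unread."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        a.send(b"x")                    # make b readable and never read it
        ep.register(b.fileno(), flags)
        first = len(ep.poll(0.1))
        second = len(ep.poll(0.1))
        return first, second
    finally:
        ep.close()
        a.close()
        b.close()

level = ready_counts(select.EPOLLIN)                   # level-triggered (the default)
edge = ready_counts(select.EPOLLIN | select.EPOLLET)   # edge-triggered
```

Level-triggered mode keeps reporting the socket as readable until the data is consumed, while edge-triggered mode reports it once per state change, so an edge-triggered server has to drain the socket before waiting again.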
Sep 28 2020
tchaloupka <chalucha gmail.com> writes:
Hi,
as there are two more HTTP server implementations:

* [Serverino](https://forum.dlang.org/thread/bqsatbwjtoobpbzxdpkf forum.dlang.org)
* [Archttp](https://forum.dlang.org/thread/jckjrgnmgsulewnrefpr forum.dlang.org)

It was time to update some numbers!

Latest results can be seen 
[here](https://github.com/tchaloupka/httpbench#multi-core-results) - it's a lot
of numbers..

Some notes:

* I've updated all frameworks and compilers to latest versions
* tests have been run on the same host but separated using VMs 
(for workload generator and servers) with pinned CPUs (so they 
don't interfere with each other)
* as I have "only" 16 available threads to be used and in 12 vs 4 
CPUs scenario wrk saturated all 12 CPUs, I had to switch it to 
14/2 CPUs to give wrk some space
* virtio bridged network between VMs
* `Archttp` has some problem with only 2 CPUs so it's included 
only in the first test (it was ok with 4 CPUs and was approx. 2x 
faster than `hunt-web`)
* `Serverino` is set to use the same number of processes as there are CPUs 
(leaving it to default was slower so I kept it set like that)

One may notice some strange `adio-http` in the results. Well, 
it's a WIP framework (`adio` as in "async dlang I/O") that is 
not public (yet). It has some design goals (due to its targeted 
usage) that some may prefer and some won't like at all:

* `betterC` - so no GC, no delegates, no classes (D ones), no 
exceptions, etc.
   * should be possible later to work with full D too, but it's 
easier to go from `betterC` to full D than other way around and 
is not in the focus now
* linux as the only target atm.
* `epoll` and `io_uring` async I/O api backends (can be extended 
with `IOCP` or `Kqueue`, but linux is main target now)
* performance, simplicity, safety in this order (and yes with 
`betterC` there are many pointers, function callbacks, manual 
memory management, etc. - thanks for `asan` ldc team ;-))
* middleware support - one can easily set up a router with e.g. request 
logger, gzip, auth middlewares (REST API middleware would 
be one of them)
* can be used with just callbacks or combined with fibers (http 
server itself is fibers only as it would be a callback hell 
otherwise)
* each async operation can be set with timeout to simplify usage

It doesn't use any "hacks" in the benchmark. Just a real HTTP 
parser, simple path router, real headers writing, real `Date` 
header, etc. But it has tuned parameters (no timeouts set - which 
the others don't use either).
It'll be released when the API settles a bit and real usage with 
sockets, websockets, http clients, REST API, etc. is 
possible.
May 26 2022
zoujiaqing <zoujiaqing gmail.com> writes:
On Thursday, 26 May 2022 at 07:49:23 UTC, tchaloupka wrote:
> Some notes:
>
> * `Archttp` has some problem with only 2 CPUs so it's included
> only in the first test (it was ok with 4 CPUs and was approx. 2x
> faster than `hunt-web`)

Hi tchaloupka,

First, thank you for the benchmark project!

I fixed the performance bug right away. (The default HTTP 1.1 connection is keep-alive.)

Archttp version 1.0.2 has been released, and retesting has yielded significant performance improvements.

-- zoujiaqing
May 27 2022
tchaloupka <chalucha gmail.com> writes:
On Friday, 27 May 2022 at 20:51:14 UTC, zoujiaqing wrote:
> On Thursday, 26 May 2022 at 07:49:23 UTC, tchaloupka wrote:
>
> I fixed the performance bug right away. (The default HTTP
> 1.1 connection is keep-alive.)
>
> Archttp version 1.0.2 has been released, and retesting has
> yielded significant performance improvements.
>
> -- zoujiaqing

Hi, thanks for the PR. I've rerun the tests for `archttp` and it is indeed much better. Now on par with `vibe-d`.

Some more notes for better performance (it's the same with `vibe-d` too).

See what syscalls are called during the request processing:

```
[pid 1453] read(10, "GET / HTTP/1.1\r\nHost: 192.168.12"..., 1024) = 117
[pid 1453] write(10, "HTTP/1.1 200 OK\r\nDate: Sat, 28 M"..., 173) = 173
[pid 1453] write(10, "Hello, World!", 13) = 13
```

It means two separate syscalls for header and body. This alone has a huge impact on the performance and if it can be avoided, it would be much better.

Also, read/write are a bit slower than recv/send when working with a socket.
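
One way to avoid the second syscall is to assemble header and body into a single buffer and hand it to the socket with one `send`. The Python sketch below illustrates only that idea; it is not archttp or vibe-d code, `build_response` is a hypothetical helper, and a `socketpair` replaces a real client connection.

```python
import socket

def build_response(body: bytes) -> bytes:
    """Assemble status line, headers and body into one contiguous buffer."""
    header = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/plain\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return header + body

server_side, client_side = socket.socketpair()
response = build_response(b"Hello, World!")
sent = server_side.send(response)    # one send() instead of two write() calls
received = client_side.recv(4096)
server_side.close()
client_side.close()
```

The trade-off is one extra copy of the body into the combined buffer, which is usually cheap for small responses like this one.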
May 27 2022
ikod <igor.khasilev gmail.com> writes:
On Saturday, 28 May 2022 at 05:44:11 UTC, tchaloupka wrote:
> On Friday, 27 May 2022 at 20:51:14 UTC, zoujiaqing wrote:
> It means two separate syscalls for header and body. This alone
> has a huge impact on the performance and if it can be avoided,
> it would be much better.

sendv/writev can also help to save syscalls when you have to send data from non-contiguous buffers.
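
As a rough illustration of the scatter/gather approach, this Python sketch passes separate header and body buffers to the kernel in a single `writev` call (`os.writev` is available on Unix-like systems; `socket.sendmsg` accepts a list of buffers in the same way). The response contents are made up for the example and a `socketpair` stands in for a client connection.

```python
import os
import socket

header = b"HTTP/1.1 200 OK\r\nContent-Length: 13\r\n\r\n"
body = b"Hello, World!"

server_side, client_side = socket.socketpair()
# One writev() call submits both buffers without first copying them into one.
written = os.writev(server_side.fileno(), [header, body])
received = client_side.recv(4096)
server_side.close()
client_side.close()
```

Compared with concatenating the buffers first, this saves the copy, which matters more as the body grows.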
May 27 2022
Andrea Fontana <nospam example.org> writes:
On Thursday, 26 May 2022 at 07:49:23 UTC, tchaloupka wrote:
> Hi,
> as there are two more HTTP server implementations:
>
> * [Serverino](https://forum.dlang.org/thread/bqsatbwjtoobpbzxdpkf forum.dlang.org)

Thank you! Since it's just a young library, those results sound promising. I'm just working on the next version, focusing on performance enhancement and Windows support :)

I see there is a test where the numbers are identical to the arsd ones; is it a typo or a coincidence?

Andrea
May 28 2022
tchaloupka <chalucha gmail.com> writes:
On Sunday, 29 May 2022 at 06:22:43 UTC, Andrea Fontana wrote:
> On Thursday, 26 May 2022 at 07:49:23 UTC, tchaloupka wrote:
>
> I see there is a test where the numbers are identical to the arsd ones;
> is it a typo or a coincidence?
> Andrea

Hi Andrea,

it was just a coincidence, a straight copy of the tool results. But as I've found some bugs in calculating percentiles from `hey`, I've updated the results after the fix.

I've also added results for `geario` (thanks #zoujiaqing).

For `serverino`, I've added a variant that uses 16 worker subprocesses in the pool, which should lead to less blocking and worse per-request times in the test environment.

Tom
May 30 2022
Andrea Fontana <nospam example.com> writes:
On Monday, 30 May 2022 at 20:57:02 UTC, tchaloupka wrote:
> On Sunday, 29 May 2022 at 06:22:43 UTC, Andrea Fontana wrote:
>> On Thursday, 26 May 2022 at 07:49:23 UTC, tchaloupka wrote:
>>
>> I see there is a test where the numbers are identical to the arsd
>> ones; is it a typo or a coincidence?
>> Andrea
>
> Hi Andrea, it was just a coincidence, a straight copy of the tool results. But as I've found some bugs in calculating percentiles from `hey`, I've updated the results after the fix. I've also added results for `geario` (thanks #zoujiaqing). For `serverino`, I've added a variant that uses 16 worker subprocesses in the pool, which should lead to less blocking and worse per-request times in the test environment. Tom

Thanks again! Benchmarks are always welcome :)
May 31 2022