
digitalmars.D - Asynchronicity in D

reply Max Klyga <max.klyga gmail.com> writes:
I've been thinking about things I can change in my GSoC proposal to make 
it stronger, and noticed that Phobos currently does not address 
asynchronous I/O of any kind.

A number of threads on this newsgroup have mentioned this problem or 
shown how other languages address asynchronicity.

I want to ask the D community about plans for asynchronicity in Phobos.
Has anyone on the Phobos team thought about a possible design?
How does asynchronicity stack with ranges?
What model should D adopt?
etc.
Mar 31 2011
next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

Yes, an asynchronous networking API would be more scalable. If you're collecting information about async I/O, please take a look at libevent and libev, and also at NT's completion ports, FreeBSD's kqueue and Linux's epoll. Protocols implemented using event-driven APIs should scale to thousands of connections using a few worker threads. Moreover, async protocols can be wrapped to appear synchronous (but not the other way around) and used just like the well-known blocking APIs.

Basically, with async I/O you request some data to be written and then wait for a completion event (usually via a callback function). Likewise, you request that some allocated buffer be read into and then wait until the network stack fills it up. You do not block in calls like send() or recv(); instead you can do useful processing between events.
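For illustration, here is a minimal readiness-based event loop of the kind epoll, kqueue and select enable, sketched in Python (whose selectors module wraps whichever of those the platform provides). This is a toy echo server, not a proposal for the Phobos API:

```python
# Minimal readiness-based echo server: one thread multiplexes all
# connections via OS readiness notification (epoll/kqueue/select,
# whichever selectors.DefaultSelector picks on this platform).
import selectors
import socket

sel = selectors.DefaultSelector()

def on_readable(conn):
    data = conn.recv(1024)      # will not block: the OS said it is ready
    if data:
        conn.sendall(data)      # echo it back
    else:
        sel.unregister(conn)    # peer closed the connection
        conn.close()

def on_accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_readable)

def serve_once(rounds):
    # Instead of blocking in recv()/accept(), we block only in select(),
    # then dispatch each ready socket to its registered callback.
    for _ in range(rounds):
        for key, _ in sel.select(timeout=1.0):
            key.data(key.fileobj)
```

The key property Piotr describes is visible here: the only blocking call is the event wait itself, so a handful of threads (here just one) can serve many sockets.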
Mar 31 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Piotr Szturmaj (bncrbme jadamspam.pl)'s article
 Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

Yes, an asynchronous networking API would be more scalable. If you're collecting information about async I/O, please take a look at libevent and libev, and also at NT's completion ports, FreeBSD's kqueue and Linux's epoll. Protocols implemented using event-driven APIs should scale to thousands of connections using a few worker threads. Moreover, async protocols can be wrapped to appear synchronous (but not the other way around) and used just like the well-known blocking APIs. Basically, with async I/O you request some data to be written and then wait for a completion event (usually via a callback function). Likewise, you request that some allocated buffer be read into and then wait until the network stack fills it up. You do not block in calls like send() or recv(); instead you can do useful processing between events.

Forgive any naiveness here, but isn't this just a special case of future/promise parallelism? Using my proposed std.parallelism module:

auto myTask = task(&someNetworkClass.recv);

// Use a new thread, but this could also be executed on a task
// queue to keep the number of threads down.
myTask.executeInNewThread();

// Do other stuff.

auto recvResults = myTask.yieldWait();

// Do stuff with recvResults

If I understand correctly (though it's very likely I don't, since I've never written any serious networking code before), such a thing can and should be implemented on top of more general parallelism primitives rather than being baked directly into the networking design.
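The D snippet above maps directly onto the generic future/promise pattern. Roughly the same flow, sketched with Python's concurrent.futures for illustration (fake_recv is a made-up stand-in for a blocking network recv, not anything from std.parallelism):

```python
# Future/promise rendering of the task(&recv) idea, using Python's
# concurrent.futures; fake_recv stands in for a blocking recv().
from concurrent.futures import ThreadPoolExecutor
import time

def fake_recv():
    time.sleep(0.05)                      # pretend to wait on the network
    return b"payload"

pool = ThreadPoolExecutor(max_workers=2)  # the "task queue" variant:
                                          # bounded threads, queued tasks
task = pool.submit(fake_recv)             # ~ task(...).executeInNewThread()
# ... do other useful work here while the recv is in flight ...
recv_results = task.result()              # ~ myTask.yieldWait(): block only now
```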
Mar 31 2011
parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
dsimcha wrote:
 == Quote from Piotr Szturmaj (bncrbme jadamspam.pl)'s article
 Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

Yes, an asynchronous networking API would be more scalable. If you're collecting information about async I/O, please take a look at libevent and libev, and also at NT's completion ports, FreeBSD's kqueue and Linux's epoll. Protocols implemented using event-driven APIs should scale to thousands of connections using a few worker threads. Moreover, async protocols can be wrapped to appear synchronous (but not the other way around) and used just like the well-known blocking APIs. Basically, with async I/O you request some data to be written and then wait for a completion event (usually via a callback function). Likewise, you request that some allocated buffer be read into and then wait until the network stack fills it up. You do not block in calls like send() or recv(); instead you can do useful processing between events.

Forgive any naiveness here, but isn't this just a special case of future/promise parallelism? Using my proposed std.parallelism module:

auto myTask = task(&someNetworkClass.recv);

// Use a new thread, but this could also be executed on a task
// queue to keep the number of threads down.
myTask.executeInNewThread();

// Do other stuff.

auto recvResults = myTask.yieldWait();

// Do stuff with recvResults

If I understand correctly (though it's very likely I don't, since I've never written any serious networking code before), such a thing can and should be implemented on top of more general parallelism primitives rather than being baked directly into the networking design.

Asynchronous tasks are a great thing, but async networking I/O, aka overlapped I/O, is something different. Its efficiency comes from direct interaction with the operating system. With tasks you need one thread for each task, whereas with overlapped I/O you just request some well-known I/O operation, which is completed by the OS in the background. You don't need any threads besides those which handle completion events. Here is a good explanation of how it works in WinNT: http://en.wikipedia.org/wiki/Overlapped_I/O
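The distinction Piotr draws can be modeled in miniature (Python for illustration; a queue plays the role of a completion port, and helper threads play the role of the OS doing the work in the background, which a real completion port would not need):

```python
# Toy model of completion-based ("overlapped") I/O: operations are
# submitted and return immediately; completions are posted to a queue
# (the analogue of an I/O completion port) and drained by one consumer.
import queue
import threading

completions = queue.Queue()

def submit_read(op_id, source):
    # Returns immediately; the "kernel" (here, a helper thread) performs
    # the read and posts a completion event when it is done.
    def work():
        data = source()
        completions.put((op_id, data))
    threading.Thread(target=work, daemon=True).start()

# Submit several reads up front, then handle completions as they arrive;
# note there is no thread dedicated to each pending operation's result.
for i in range(3):
    submit_read(i, lambda i=i: f"chunk-{i}".encode())

done = {}
while len(done) < 3:                        # one consumer handles all events
    op_id, data = completions.get(timeout=2)
    done[op_id] = data
```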
Mar 31 2011
prev sibling parent Aleksandar Ružičić <ruzicic.aleksandar gmail.com> writes:
I really like the design of node.js (http://nodejs.org): it's internally
based on libev, and everything runs in a single-threaded event loop.
It's proven to be highly concurrent and memory-efficient.

Maybe a wrapper around libev(ent) for D à la node.js would be a good
solution for an asynchronous API, in addition to the thread approach (I
always like to have more than one option and to choose the one that
suits the concrete task I'm dealing with best).

Whatever solution is chosen, I'd like to have an API like this:

readTextAsync(filename, (string contents) {
   // do something with contents
});
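A single-threaded sketch of what such a callback API could look like, in Python for illustration (readTextAsync and run_loop are hypothetical names following the post's example, not a real Phobos or node.js API):

```python
# node.js-style callback API on a single-threaded event loop:
# readTextAsync() never blocks the caller; the callback fires when the
# loop gets around to the job.
from collections import deque

pending = deque()                  # the event loop's job queue

def readTextAsync(filename, callback):
    def job():
        with open(filename) as f:  # a real loop would read non-blockingly
            callback(f.read())
    pending.append(job)            # defer: the caller returns immediately

def run_loop():
    while pending:                 # one thread drains the queue
        pending.popleft()()
```

The essential property is that readTextAsync only queues work, so the caller can register many reads and then let the loop interleave them.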


On Thu, Mar 31, 2011 at 2:04 PM, Piotr Szturmaj <bncrbme jadamspam.pl> wrote:
 Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

Yes, an asynchronous networking API would be more scalable. If you're collecting information about async I/O, please take a look at libevent and libev, and also at NT's completion ports, FreeBSD's kqueue and Linux's epoll. Protocols implemented using event-driven APIs should scale to thousands of connections using a few worker threads. Moreover, async protocols can be wrapped to appear synchronous (but not the other way around) and used just like the well-known blocking APIs. Basically, with async I/O you request some data to be written and then wait for a completion event (usually via a callback function). Likewise, you request that some allocated buffer be read into and then wait until the network stack fills it up. You do not block in calls like send() or recv(); instead you can do useful processing between events.

Mar 31 2011
prev sibling next sibling parent Aleksandar Ružičić <ruzicic.aleksandar gmail.com> writes:
I really like the design of node.js (http://nodejs.org): it's internally
based on libev, and everything runs in a single-threaded event loop.
It's proven to be highly concurrent and memory-efficient.

Maybe a wrapper around libev(ent) for D à la node.js would be a good
solution for an asynchronous API, in addition to the thread approach (I
always like to have more than one option and to choose the one that
suits the concrete task I'm dealing with best).

Whatever solution is chosen, I'd like to have an API like this:

readTextAsync(filename, (string contents) {
  // do something with contents
});
Mar 31 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent).

Thanks,

Andrei
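The thread-plus-queue design described here can be sketched in a few lines (Python for illustration; by_line_async and fake_fetch are hypothetical stand-ins for the proposed byLineAsync and for the libcurl side of the fence):

```python
# Sketch of the byLineAsync idea: a producer thread fetches lines and
# throws them "over the fence" into a bounded queue, while the consumer
# iterates what looks like an ordinary synchronous range.
import queue
import threading

_DONE = object()                       # sentinel: the transfer finished

def by_line_async(fetch_lines, maxsize=64):
    buf = queue.Queue(maxsize)
    def producer():                    # runs in the "network" thread
        for line in fetch_lines():
            buf.put(line)              # blocks if the consumer falls behind
        buf.put(_DONE)
    threading.Thread(target=producer, daemon=True).start()
    while True:                        # consumer side looks synchronous
        line = buf.get()
        if line is _DONE:
            return
        yield line

def fake_fetch():                      # stands in for the HTTP transfer
    yield from ["line 1", "line 2", "line 3"]
```

The bounded queue is what keeps the two threads decoupled: the network side keeps buffering ahead of the consumer, but only up to maxsize items.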
Mar 31 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent). Thanks, Andrei

Is this basically std.parallelism.asyncBuf (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html#asyncBuf) or something different?
Mar 31 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/31/11 11:43 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent). Thanks, Andrei

Is this basically std.parallelism.asyncBuf (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html#asyncBuf) or something different?

asyncBuf would be an excellent backend for that, but the entire thing needs encapsulation so as to not expose user code to the risks of undue sharing.

Andrei
Mar 31 2011
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/31/11 11:43 AM, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent). Thanks, Andrei

Is this basically std.parallelism.asyncBuf (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html#asyncBuf) or something different?

asyncBuf would be an excellent backend for that, but the entire thing needs encapsulation so as to not expose user code to the risks of undue sharing.

Andrei

Ok. If there are any enhancements that would make asyncBuf work better for this, let me know.
Mar 31 2011
prev sibling parent Jonas Drewsen <jdrewsen nospam.com> writes:
On 31/03/11 18.43, dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent). Thanks, Andrei

Is this basically std.parallelism.asyncBuf (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html#asyncBuf) or something different?

Cool! I've been thinking about creating such a class myself. I definitely think that asyncBuf fits in with the 'foreach' support in the curl wrapper.
Apr 01 2011
prev sibling next sibling parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 31/03/2011 17:26, Andrei Alexandrescu wrote:
 foreach (line; byLineAsync("http://d-programming-language.org"))
 {
 ... use line ...
 }

What would be awesome is if this were backed by fibers: then you would have a really simple and easy wrapper for doing async I/O, handling lots of connections in one thread as the data comes in. Of course, a non-by-line version would also be excellent, given that a lot of I/O doesn't care about newlines.

-- 
Robert
http://octarineparrot.com/
Mar 31 2011
parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 31/03/2011 17:53, Robert Clipsham wrote:
 On 31/03/2011 17:26, Andrei Alexandrescu wrote:
 foreach (line; byLineAsync("http://d-programming-language.org"))
 {
 ... use line ...
 }

What would be awesome is if this were backed by fibers: then you would have a really simple and easy wrapper for doing async I/O, handling lots of connections in one thread as the data comes in. Of course, a non-by-line version would also be excellent, given that a lot of I/O doesn't care about newlines.

-- 
Robert
http://octarineparrot.com/

To clarify, this isn't much use for clients, but for servers it could be useful, or if you want to act as multiple clients.

-- 
Robert
http://octarineparrot.com/
Mar 31 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?

Here are some key differences between fibers (as currently implemented in core.thread; I have no idea how this applies to the general case in other languages) and threads:

1.  Fibers can't be used to implement parallelism.  If you have N > 1 fibers running on one hardware thread, your code will only use a single core.

2.  Fibers use cooperative concurrency, threads use preemptive concurrency.  This means three things:

     a.  It's the programmer's responsibility to determine how execution time is split between a group of fibers, not the OS's.

     b.  If one fiber goes into an endless loop, all fibers executing on that thread will hang.

     c.  Getting concurrency issues right is easier, since fibers can't be implicitly pre-empted by other fibers in the middle of some operation.  All context switches are explicit, and as mentioned there is no true parallelism.

3.  Fibers are implemented in userland, and context switches are a lot cheaper (IIRC an order of magnitude or more, on the order of 100 clock cycles for fibers vs. 1000 for OS threads).
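Point 2, explicit cooperative context switches under a scheduler the programmer controls, can be sketched with generators standing in for fibers (Python for illustration; this models the concept, not core.thread's implementation):

```python
# Cooperative "fibers" modeled with generators: every yield is an
# explicit context switch back to a round-robin scheduler. No fiber can
# be preempted mid-operation, and nothing runs in parallel.
from collections import deque

def fiber(name, steps, log):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield                          # explicit switch to the scheduler

def run_round_robin(fibers):
    ready = deque(fibers)
    while ready:
        f = ready.popleft()
        try:
            next(f)                    # resume until the next yield
            ready.append(f)            # still alive: requeue it
        except StopIteration:
            pass                       # finished: drop it

log = []
run_round_robin([fiber("a", 2, log), fiber("b", 2, log)])
# the interleaving is fully determined by the scheduler, not the OS
```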
Mar 31 2011
next sibling parent Jonas Drewsen <jdrewsen nospam.com> writes:
On 31/03/11 20.48, dsimcha wrote:
 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?

 Here are some key differences between fibers (as currently implemented in
 core.thread; I have no idea how this applies to the general case in other
 languages) and threads:

 1.  Fibers can't be used to implement parallelism.  If you have N > 1 fibers
 running on one hardware thread, your code will only use a single core.

The fastest web servers out there (e.g. Zeus, nginx, lighttpd) also use some kind of fibers, and they solve this problem by simply forking the process and sharing the listening socket between processes. That way you get the best of both worlds.

/Jonas
 2.  Fibers use cooperative concurrency, threads use preemptive concurrency.
 This means three things:

      a.  It's the programmer's responsibility to determine how execution time
 is split between a group of fibers, not the OS's.

      b.  If one fiber goes into an endless loop, all fibers executing on that
 thread will hang.

      c.  Getting concurrency issues right is easier, since fibers can't be
 implicitly pre-empted by other fibers in the middle of some operation.  All
 context switches are explicit, and as mentioned there is no true parallelism.

 3.  Fibers are implemented in userland, and context switches are a lot cheaper
 (IIRC an order of magnitude or more, on the order of 100 clock cycles for
 fibers vs. 1000 for OS threads).
Mar 31 2011
prev sibling next sibling parent Jonas Drewsen <jdrewsen nospam.com> writes:
On 31/03/11 21.19, Torarin wrote:
 I'm currently working on an http and networking library that uses
 asynchronous sockets running in fibers and an event loop a la libev.
 These async sockets have the same interface as regular Berkeley
 sockets, so clients can choose whether to be synchronous, asynchronous
 or threaded with template arguments.

 For instance, it has HttpClient!AsyncSocket and HttpClient!Socket.

 Torarin

Very interesting! Do you have a GitHub repo we can see?

/Jonas
Mar 31 2011
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 On Mar 31, 2011, at 11:48 AM, dsimcha wrote:
 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?

 Here are some key differences between fibers (as currently implemented in
 core.thread; I have no idea how this applies to the general case in other
 languages) and threads:

 1.  Fibers can't be used to implement parallelism.  If you have N > 1 fibers
 running on one hardware thread, your code will only use a single core.

It bears mentioning that this has interesting implications for the default thread-local storage of statics. All fibers running on a thread will currently share the thread's static data. This could be worked around by doing TLS manually at the fiber level, but it's a non-trivial change.

Let's assume for the sake of argument that we are otherwise ready to make said change. What would the performance implications of this be for programs using TLS heavily but not using fibers? My gut feeling is that, if this has considerable performance implications for non-fiber-using programs, it should be left alone long-term, or fiber-local storage should be added as a separate entity.
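The "TLS manually at the fiber level" idea can be modeled in a few lines (Python generators stand in for fibers; this is only a sketch of the concept, not of how core.thread actually stores statics):

```python
# Sketch of fiber-local storage: the scheduler swaps in each fiber's
# private storage dict on every context switch, so two fibers
# multiplexed on one thread no longer share their "statics".
from collections import deque

current = {}                           # stands in for thread-local statics

def counter_fiber(results):
    current.setdefault("n", 0)
    for _ in range(2):
        current["n"] += 1              # mutate "my" static
        yield
    results.append(current["n"])       # each fiber reports its own count

def run(fibers):
    global current
    ready = deque((f, {}) for f in fibers)   # (fiber, its private storage)
    while ready:
        f, storage = ready.popleft()
        current = storage              # context switch: swap storage in
        try:
            next(f)
            ready.append((f, current))
        except StopIteration:
            pass                       # fiber done; discard its storage
```

The cost hinted at above is visible even in the sketch: every context switch now carries the extra storage swap, whether or not the code uses any statics.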
Mar 31 2011
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 On Mar 31, 2011, at 4:03 PM, dsimcha wrote:
 [snip]
 Let's assume for the sake of argument that we are otherwise ready to make
 said change.  What would the performance implications of this be for
 programs using TLS heavily but not using fibers?  My gut feeling is that,
 if this has considerable performance implications for non-fiber-using
 programs, it should be left alone long-term, or fiber-local storage
 should be added as a separate entity.
It's more an issue of creating an understandable programming model. If someone is using statics, the result should be the same regardless of whether the code gets a dedicated thread or is multiplexed with other code on one thread. ie. fibers are ideally an implementation detail.

Yes, but what would be the likely performance cost of doing so?
Apr 01 2011
prev sibling parent Robert Clipsham <robert octarineparrot.com> writes:
On 31/03/2011 19:34, Andrej Mitrovic wrote:
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?

I've written up a first draft of an article about this at: http://octarineparrot.com/article/view/getting-more-fiber-in-your-diet

I'd be grateful if the people replying to this thread could take a look over it.

-- 
Robert
http://octarineparrot.com/
Apr 05 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Are fibers really better/faster than threads? I've heard rumors that
they perform exactly the same, and that there's no benefit of using
fibers over threads. Is that true?
Mar 31 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 31 Mar 2011 14:48:13 -0400, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?

 Here are some key differences between fibers (as currently implemented in
 core.thread; I have no idea how this applies to the general case in other
 languages) and threads:

 1.  Fibers can't be used to implement parallelism.  If you have N > 1 fibers
 running on one hardware thread, your code will only use a single core.

 2.  Fibers use cooperative concurrency, threads use preemptive concurrency.
 This means three things:

      a.  It's the programmer's responsibility to determine how execution time
 is split between a group of fibers, not the OS's.

      b.  If one fiber goes into an endless loop, all fibers executing on that
 thread will hang.

      c.  Getting concurrency issues right is easier, since fibers can't be
 implicitly pre-empted by other fibers in the middle of some operation.  All
 context switches are explicit, and as mentioned there is no true parallelism.

 3.  Fibers are implemented in userland, and context switches are a lot cheaper
 (IIRC an order of magnitude or more, on the order of 100 clock cycles for
 fibers vs. 1000 for OS threads).

4. Often there is an OS limit on how many threads a process can create. There is no such limit on fibers (only memory). Using fibers can increase the number of simultaneous tasks that can be run by quite a bit.

-Steve
Mar 31 2011
prev sibling next sibling parent Torarin <torarind gmail.com> writes:
I'm currently working on an http and networking library that uses
asynchronous sockets running in fibers and an event loop a la libev.
These async sockets have the same interface as regular Berkeley
sockets, so clients can choose whether to be synchronous, asynchronous
or threaded with template arguments.

For instance, it has HttpClient!AsyncSocket and HttpClient!Socket.

Torarin
Mar 31 2011
prev sibling next sibling parent reply Jonas Drewsen <jdrewsen nospam.com> writes:
On 31/03/11 18.26, Andrei Alexandrescu wrote:
 On 3/31/11 6:35 AM, Max Klyga wrote:
 I've been thinking on things I can change in my GSoC proposal to make it
 stronger and noticed that currently Phobos does not address asynchronous
 I/O of any kind.

 A number of threads on this newsgroup have mentioned this problem or
 shown how other languages address asynchronicity.

 I want to ask the D community about plans for asynchronicity in Phobos.
 Has anyone on the Phobos team thought about a possible design?
 How does asynchronicity stack with ranges?
 What model should D adopt?
 etc.

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him, and at best Jonas would agree to become a mentor.

I've posted a couple of weeks earlier how I think that could work with ranges: the range maintains the asynchronous state and has a queue of already-available buffers received. The network traffic occurs in a different thread; the range throws requests over the fence to libcurl, and libcurl throws buffers over the fence back to the range. The range offers a seemingly synchronous interface:

foreach (line; byLineAsync("http://d-programming-language.org"))
{
    ... use line ...
}

except that the processing and the fetching of data occur in distinct threads.

Server-side code such as network servers etc. would also be an interesting topic. Let me know if you're versed in the likes of libev(ent). Thanks, Andrei

I believe that we would need both the threaded async I/O that you describe and a select-based one. The thread-based approach is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by the sysadmin). The select-based approach is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions, fibers would be nice to take into consideration as well.

What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html, which fits in naturally with this topic I think.

I may very well agree to mentoring if we get a solid proposal out of this.

/Jonas
Mar 31 2011
parent reply Max Klyga <max.klyga gmail.com> writes:
On 2011-03-31 22:35:43 +0300, Jonas Drewsen said:

 On 31/03/11 18.26, Andrei Alexandrescu wrote:
 snip

I believe that we would need both the threaded async I/O that you describe and a select-based one. The thread-based approach is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by the sysadmin). The select-based approach is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions, fibers would be nice to take into consideration as well. What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html, which fits in naturally with this topic I think. I may very well agree to mentoring if we get a solid proposal out of this.

I'm very glad to hear this. Now my motivation doubled!
 
 /Jonas

Any comments on whether this proposal should be more focused on asynchronous networking or should it address asynchronicity in Phobos in general? I researched a little about libev and libevent. Both seem to have some limitations on the Windows platform. libev can only be used to deal with sockets on Windows and uses select, which limits libev to 64 file handles per thread. libevent uses Windows overlapping I/O, but this thread[1] shows that the current implementation has performance limitations. So one option may be to use either libev or libevent, and implement things on top of them. Another is to make a new implementation (from scratch, or reusing some code from Boost.ASIO[2]) using threads or fibers, or maybe both. 1. http://www.mail-archive.com/libevent-users monkey.org/msg01730.html 2. http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio.html
Mar 31 2011
parent reply Jonas Drewsen <jdrewsen nospam.com> writes:
On 31/03/11 23.20, Max Klyga wrote:
 On 2011-03-31 22:35:43 +0300, Jonas Drewsen said:

 On 31/03/11 18.26, Andrei Alexandrescu wrote:
 snip

I believe that we would need both the threaded async IO that you describe but also a select based one. The thread based is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by sysadmin). The select based is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions fibers would be nice to take into consideration as well. What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html which fits in naturally with this topic I think. I may very well agree mentoring if we get a solid proposal out of this.

I'm very glad to hear this. Now my motivation doubled!
 /Jonas

Any comments on whether this proposal should be more focused on asynchronous networking or should it address asynchronicity in Phobos in general? I researched a little about libev and libevent. Both seem to have some limitations on the Windows platform. libev can only be used to deal with sockets on Windows and uses select, which limits libev to 64 file handles per thread.

Actually it seems the limit is OS version dependent and for NT it is 32767 per process: http://support.microsoft.com/kb/111855
 libevent uses Windows overlapping I/O, but this thread[1] shows that
 current implementation has performance limitations.
 So one option may be to use either libev or libevent, and implement
 things on top of them.
 Another is to make a new implementation (from scratch, or reuse some
 code from Boost.ASIO[2]) using threads or fibers, or maybe both.

 1. http://www.mail-archive.com/libevent-users monkey.org/msg01730.html
 2. http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio.html

Mar 31 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Jonas Drewsen (jdrewsen nospam.com)'s article
 On 31/03/11 23.20, Max Klyga wrote:
 On 2011-03-31 22:35:43 +0300, Jonas Drewsen said:

 On 31/03/11 18.26, Andrei Alexandrescu wrote:
 snip

I believe that we would need both the threaded async IO that you describe but also a select based one. The thread based is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by sysadmin). The select based is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions fibers would be nice to take into consideration as well. What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html which fits in naturally with this topic I think. I may very well agree mentoring if we get a solid proposal out of this.

I'm very glad to hear this. Now my motivation doubled!
 /Jonas

Any comments on whether this proposal should be more focused on asynchronous networking or should it address asynchronicity in Phobos in general? I researched a little about libev and libevent. Both seem to have some limitations on the Windows platform. libev can only be used to deal with sockets on Windows and uses select, which limits libev to 64 file handles per thread.

32767 per process: http://support.microsoft.com/kb/111855

Again forgive my naiveness, as most of my experience with concurrency is concurrency to implement parallelism, not concurrency for its own sake. Shouldn't 32,000 threads be more than enough for anything? I can't imagine what kinds of programs would really need this level of concurrency, or how bad performance on any specific thread would be when you have this many. Right now in my Task Manager the program with the most threads is explorer.exe, with 28.
Mar 31 2011
parent reply Jonas Drewsen <jdrewsen nospam.com> writes:
On 01/04/11 01.07, dsimcha wrote:
 == Quote from Jonas Drewsen (jdrewsen nospam.com)'s article
 On 31/03/11 23.20, Max Klyga wrote:
 On 2011-03-31 22:35:43 +0300, Jonas Drewsen said:

 On 31/03/11 18.26, Andrei Alexandrescu wrote:
 snip

I believe that we would need both the threaded async IO that you describe but also a select based one. The thread based is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by sysadmin). The select based is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions fibers would be nice to take into consideration as well. What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html which fits in naturally with this topic I think. I may very well agree mentoring if we get a solid proposal out of this.

I'm very glad to hear this. Now my motivation doubled!
 /Jonas

Any comments on whether this proposal should be more focused on asynchronous networking or should it address asynchronicity in Phobos in general? I researched a little about libev and libevent. Both seem to have some limitations on the Windows platform. libev can only be used to deal with sockets on Windows and uses select, which limits libev to 64 file handles per thread.

32767 per process: http://support.microsoft.com/kb/111855

Again forgive my naiveness, as most of my experience with concurrency is concurrency to implement parallelism, not concurrency for its own sake. Shouldn't 32,000 threads be more than enough for anything? I can't imagine what kinds of programs would really need this level of concurrency, or how bad performance on any specific thread would be when you have this many. Right now in my Task Manager the program with the most threads is explorer.exe, with 28.

There doesn't have to be a thread for each socket. Actually many servers have very few threads with many sockets each. 32000 sockets is not unimaginable for certain server loads e.g. websockets or game servers. But I know it is not that common. /Jonas
Apr 01 2011
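The "few threads, many sockets" model Jonas describes can be sketched with a readiness API. This is an illustrative Python sketch using the stdlib selectors module (which picks epoll/kqueue/select per platform), demonstrated on a single socketpair rather than 32000 real connections:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD, else select

def on_readable(sock):
    data = sock.recv(4096)
    if data:
        sock.sendall(data)      # echo the payload back
    else:
        sel.unregister(sock)    # peer closed the connection
        sock.close()

# stand-in for one of many accepted client sockets
a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ, on_readable)

a.sendall(b"ping")
for key, _ in sel.select(timeout=1):
    key.data(key.fileobj)       # dispatch to the registered callback
echoed = a.recv(4096)
print(echoed)
```

One thread blocks in select() over the whole set of registered sockets and only wakes to run callbacks for the ready ones, which is why a handful of threads can service tens of thousands of mostly idle connections.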
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:
 On 01/04/11 01.07, dsimcha wrote:
 snip
 There doesn't have to be a thread for each socket. Actually many servers
 have very few threads with many sockets each. 32000 sockets is not
 unimaginable for certain server loads e.g. websockets or game servers.
 But I know it is not that common.
 Hopefully not at all common. With that level of concurrency the process
 will spend more time context switching than executing code.

...or use such huge timeslices that the illusion of simultaneous execution breaks down.
Apr 01 2011
parent Jonas Drewsen <jdrewsen nospam.com> writes:
On 01/04/11 18.12, dsimcha wrote:
 == Quote from Sean Kelly (sean invisibleduck.org)'s article
 On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:
 snip
 Hopefully not at all common. With that level of concurrency the process
 will spend more time context switching than executing code.
 ...or use such huge timeslices that the illusion of simultaneous
 execution breaks down.

I guess multiple cores will help out there.
Apr 01 2011
prev sibling next sibling parent reply Jonas Drewsen <jdrewsen nospam.com> writes:
On 01/04/11 17.21, Sean Kelly wrote:
 On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:

 On 01/04/11 01.07, dsimcha wrote:
 Again forgive my naiveness, as most of my experience with concurrency is
 concurrency to implement parallelism, not concurrency for its own sake. 
Shouldn't
 32,000 threads be more than enough for anything?  I can't imagine what kinds of
 programs would really need this level of concurrency, or how bad performance on
 any specific thread would be when you have this many.  Right now in my Task
 Manager the program with the most threads is explorer.exe, with 28.

There doesn't have to be a thread for each socket. Actually many servers have very few threads with many sockets each. 32000 sockets is not unimaginable for certain server loads e.g. websockets or game servers. But I know it is not that common.

Hopefully not at all common. With that level of concurrency the process will spend more time context switching than executing code.

For services where clients spend most time inactive this works. An example could be a server for messenger like clients. Most of the time the clients are just connected waiting for messages. As long as nothing is transmitted no context switching is done. Or maybe I've misunderstood the reason for the context switching? /Jonas
Apr 01 2011
next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Sean Kelly wrote:
 Fair enough.  Though I'd still say it's a terrible use of resources,
 given available asynchronous socket APIs.  And as an aside,
 I think 32K sockets per process is not at all surprising.
 I've seen apps that use orders of magnitude more than that, though
 breaking the 64K barrier does get a bit weird.

Breaking that barrier requires more than one IP address :)
Apr 01 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits and
 memory.  It runs a very standard 1 thread per cpu model.  Unfortunately,
 not yet in D.
 Later,
 Brad

Why/how do you have all these connections open concurrently with only a few threads? Fibers? A huge asynchronous message queue to deal with new requests from connections that aren't idle?
Apr 01 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 4/1/2011 7:27 PM, Sean Kelly wrote:
 On Apr 1, 2011, at 2:24 PM, dsimcha wrote:

 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits and
 memory.  It runs a very standard 1 thread per cpu model.  Unfortunately,
 not yet in D.
 Later,
 Brad

Why/how do you have all these connections open concurrently with only a few threads? Fibers? A huge asynchronous message queue to deal with new requests from connections that aren't idle?

A huge asynchronous message queue. State is handled either explicitly or implicitly via fibers. After reading Brad's statement, I'd be interested in seeing a comparison of the memory and performance differences of a thread per socket vs. asynchronous model though (assuming that sockets don't need to interact, so no need for synchronization).

From the discussions lately I'm thoroughly surprised at just how specialized a field massively concurrent server programming apparently is. Since it's so far from the type of programming I do, my naive opinion was that it wouldn't take a Real Programmer from another specialty (though I emphasize Real Programmer, not code monkey) long to get up to speed.
Apr 01 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 1:43 PM, Piotr Szturmaj wrote:

 Sean Kelly wrote:
 Fair enough.  Though I'd still say it's a terrible use of resources,
 given available asynchronous socket APIs.  And as an aside,
 I think 32K sockets per process is not at all surprising.
 I've seen apps that use orders of magnitude more than that, though
 breaking the 64K barrier does get a bit weird.

Breaking that barrier requires more than one IP address :)

That's why it gets weird :-)
Apr 01 2011
prev sibling parent reply Max Klyga <max.klyga gmail.com> writes:
Jonas, thanks for your valuable feedback.

You've expressed interest in mentoring a networking 
project and since I couldn't find any other way to contact you 
directly, I'll post my message here.

As was discussed later, your work on curl supersedes my future effort 
on network clients. You stated that a foundation for implementing web 
servers would be a good project.

Web servers/clients would benefit from a framework similar to 
Boost.ASIO or libev.

So I would like to ask you to contact me directly or write a message 
here about what I need to do to interest you in mentoring such a 
project.

I plan to post my updated proposal tomorrow and gather some more 
feedback while I still have time until the deadline.

 Comments are welcome.
Apr 04 2011
parent Jonas Drewsen <jdrewsen nospam.com> writes:
On 04/04/11 22.23, Max Klyga wrote:
 Jonas, thanks for your valuable feedback.

 You've expressed interest in mentoring a networking project
 and since I couldn't find any other way to contact you directly, I'll
 post my message here.

 As was discussed later, your work on curl supersedes my future effort on
 network clients. You stated that a foundation for implementing web
 servers would be a good project.

 Web servers/clients would benefit from a framework similar to Boost.ASIO
 or libev.

 So I would like to ask you to contact me directly or write a message
 here about what do I need to do to interest you in mentoring such a
 project.

 I plan to post my updated proposal tomorrow and gather some more
 feedback while I still have time until the deadline.

 Comments are welcome.

Both are excellent frameworks to get inspired from and would definitely catch my interest. And as you can see from the news threads about networking and asynchronicity there are a lot of people who have experience on that topic and can provide help/feedback. I have signed up to be a mentor but I still need to be accepted. Looking forward to the updated proposal. /Jonas
Apr 05 2011
prev sibling parent Brad Roberts <braddr slice-2.puremagic.com> writes:
On Fri, 1 Apr 2011, dsimcha wrote:

 On 4/1/2011 7:27 PM, Sean Kelly wrote:
 On Apr 1, 2011, at 2:24 PM, dsimcha wrote:
 
 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits
 and
 memory.  It runs a very standard 1 thread per cpu model.  Unfortunately,
 not yet in D.
 Later,
 Brad

Why/how do you have all these connections open concurrently with only a few threads? Fibers? A huge asynchronous message queue to deal with new requests from connections that aren't idle?

A huge asynchronous message queue. State is handled either explicitly or implicitly via fibers. After reading Brad's statement, I'd be interested in seeing a comparison of the memory and performance differences of a thread per socket vs. asynchronous model though (assuming that sockets don't need to interact, so no need for synchronization).

From the discussions lately I'm thoroughly surprised just how specialized a field massively concurrent server programming apparently is. Since it's so far from the type of programming I do my naive opinion was that it wouldn't take a Real Programmer from another specialty (though I emphasize Real Programmer, not code monkey) long to get up to speed.

I won't go into the why part, it's not interesting here, and I probably can't talk about it anyway. The simplified view of how: No fibers, just a small number of kernel threads (via pthread). An epoll thread that queues tasks that are pulled by the 1 per cpu worker threads. The queue is only as big as the outstanding work to do. Assuming that the rate of socket events is less than the time it takes to deal with the data, the queue stays empty. It's actually quite a simple architecture at the 50k foot view. Having recently hired some new people, I've got recent evidence... it doesn't take a lot of time to fully 'get' the network layer of the system. There's other parts that are more complicated, but they're not part of this discussion. A thread per socket would never handle this load. Even with a 4k stack (which you'd have to be _super_ careful with since C/C++/D does nothing to help you track), you'd be spending 4G of ram on just the stacks. And that's before you get near the data structures for all the sockets, etc. Later, Brad
Apr 01 2011
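Brad's layout above (one poller thread enqueueing ready events, a worker thread per cpu draining the queue) can be sketched as follows. Python for illustration only, with plain integers standing in for epoll_wait results and a list append standing in for protocol processing:

```python
import queue
import threading

tasks = queue.Queue()   # filled by the (here simulated) epoll thread
handled = []            # list.append is atomic under the GIL

def worker():
    while True:
        fd = tasks.get()
        if fd is None:              # shutdown sentinel
            return
        handled.append(fd)          # stand-in for protocol processing

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for fd in range(8):                 # stand-in for epoll_wait results
    tasks.put(fd)
for _ in workers:                   # one sentinel per worker
    tasks.put(None)
for w in workers:
    w.join()
print(len(handled))
```

As Brad notes, when events arrive more slowly than they can be processed the queue stays near empty, so the per-connection cost is just the socket state, not a stack.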
prev sibling parent Max Klyga <max.klyga gmail.com> writes:
On 2011-04-01 01:45:54 +0300, Jonas Drewsen said:

 On 31/03/11 23.20, Max Klyga wrote:
 On 2011-03-31 22:35:43 +0300, Jonas Drewsen said:
 
 On 31/03/11 18.26, Andrei Alexandrescu wrote:
 snip

I believe that we would need both the threaded async IO that you describe but also a select based one. The thread based is important e.g. in order to keep buffering incoming data while processing elements in the range (the OS will only buffer the number of bytes allowed by sysadmin). The select based is important in order to handle _many_ connections at the same time (think D as the killer app for websockets). As Robert mentions fibers would be nice to take into consideration as well. What I also see as an unresolved issue is non-blocking handling in http://erdani.com/d/phobos/std_stream2.html which fits in naturally with this topic I think. I may very well agree mentoring if we get a solid proposal out of this.

I'm very glad to hear this. Now my motivation doubled!
 
 /Jonas

Any comments on whether this proposal should be more focused on asynchronous networking or should it address asynchronicity in Phobos in general? I researched a little about libev and libevent. Both seem to have some limitations on the Windows platform. libev can only be used to deal with sockets on Windows and uses select, which limits libev to 64 file handles per thread.

Actually it seems the limit is OS version dependent and for NT it is 32767 per process: http://support.microsoft.com/kb/111855

That page also mentions that the actual limit is 64 by default and is adjustable, but adjusting it requires recompilation, because it is defined in a macro (FD_SETSIZE).
 
 libevent uses Windows overlaping I/O, but this thread[1] shows that
 current implementation has perfomance limitations.
 So one option may be to use either libev or libevent, and implement
 things on top of them.
 Another is to make a new implementation (from scratch, or reuse some
 code from Boost.ASIO[2]) using threads or fibers, or maybe both.
 
 1. http://www.mail-archive.com/libevent-users monkey.org/msg01730.html
 2. http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio.html


Mar 31 2011
prev sibling next sibling parent Max Klyga <max.klyga gmail.com> writes:
On 2011-03-31 19:26:45 +0300, Andrei Alexandrescu said:

 On 3/31/11 6:35 AM, Max Klyga wrote:
 snip

I think that would be a good contribution that would complement Jonas'. You'll need to discuss cooperation with him and at best Jonas would agree to become a mentor.

Jonas agreed to become a mentor if I make this proposal strong/interesting enough.
 
 I've posted a couple of weeks earlier how I think that could work with 
 ranges: the range maintains the asynchronous state and has a queue of 
 already-available buffers received. The network traffic occurs in a 
 different thread; the range throws requests over the fence to libcurl 
 and libcurl throws buffers over the fence back to the range. The range 
 offers a seemingly synchronous interface:
 
 foreach (line; byLineAsync("http://d-programming-language.org"))
 {
     ... use line ...
 }
 
 except that the processing and the fetching of data occur in distinct threads.

I thought about the same.
 
 Server-side code such as network servers etc. would also be an 
 interesting topic. Let me know if you're versed in the likes of 
 libev(ent).

I have no experience with libev/libevent, but I see no problem working with either, after reading some exapmles or/and documentation.
Mar 31 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 31, 2011, at 11:48 AM, dsimcha wrote:

 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Are fibers really better/faster than threads? I've heard rumors that
 they perform exactly the same, and that there's no benefit of using
 fibers over threads. Is that true?
 Here are some key differences between fibers (as currently implemented in
 core.thread; I have no idea how this applies to the general case in other
 languages) and threads:

 1.  Fibers can't be used to implement parallelism.  If you have N > 1 fibers
 running on one hardware thread, your code will only use a single core.

It bears mentioning that this has interesting implications for the default thread-local storage of statics. All fibers running on a thread will currently share the thread's static data. This could be worked around by doing TLS manually at the fiber level, but it's a non-trivial change.
Mar 31 2011
prev sibling next sibling parent Torarin <torarind gmail.com> writes:
2011/3/31 Jonas Drewsen <jdrewsen nospam.com>:
 On 31/03/11 21.19, Torarin wrote:
 I'm currently working on an http and networking library that uses
 asynchronous sockets running in fibers and an event loop a la libev.
 These async sockets have the same interface as regular Berkeley
 sockets, so clients can choose whether to be synchronous, asynchronous
 or threaded with template arguments.

 For instance, it has HttpClient!AsyncSocket and HttpClient!Socket.

 Torarin

Very interesting! Do you have a github repos we can see? /Jonas

I just put one up: https://github.com/torarin/net Here's an example: https://github.com/torarin/net/blob/master/example.d Torarin
Apr 01 2011
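Torarin's idea of one client parameterized by its socket type (HttpClient!AsyncSocket vs HttpClient!Socket) can be sketched as below. Python duck typing stands in for D template arguments; FakeSocket, HttpClient and get_status_line are hypothetical names for illustration, not Torarin's actual API:

```python
class FakeSocket:
    """Stand-in transport exposing a Berkeley-socket-style interface."""
    def __init__(self, canned):
        self.canned = canned    # bytes this fake peer will "send"
        self.sent = b""
    def sendall(self, data):
        self.sent += data
    def recv(self, n):
        out, self.canned = self.canned[:n], self.canned[n:]
        return out

class HttpClient:
    """Client parameterized by its socket factory (a template arg in D)."""
    def __init__(self, sock_factory):
        self.sock = sock_factory()
    def get_status_line(self):
        self.sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        return self.sock.recv(17)

# swap in a synchronous, asynchronous or threaded socket here
client = HttpClient(lambda: FakeSocket(b"HTTP/1.0 200 OK\r\n"))
banner = client.get_status_line()
print(banner)
```

The protocol code is written once against the common socket interface; whether the transport blocks, runs in a fiber, or uses an event loop is decided by the caller's choice of socket type.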
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Mar 31, 2011, at 4:03 PM, dsimcha wrote:

 == Quote from Sean Kelly (sean invisibleduck.org)'s article
 On Mar 31, 2011, at 11:48 AM, dsimcha wrote:
 snip
 It bears mentioning that this has interesting implications for the
 default thread-local storage of statics.  All fibers running on a thread
 will currently share the thread's static data.  This could be worked
 around by doing TLS manually at the fiber level, but it's a non-trivial
 change.
 Let's assume for the sake of argument that we are otherwise ready to make
 this change.  What would the performance implications of this be for programs
 using TLS heavily but not using fibers?  My gut feeling is that, if this has
 significant performance implications for non-fiber-using programs, it should
 be deferred long-term, or fiber-local storage should be added as a separate
 feature.

It's more an issue of creating an understandable programming model. If someone is using statics, the result should be the same regardless of whether the code gets a dedicated thread or is multiplexed with other code on one thread. ie. fibers are ideally an implementation detail.
Apr 01 2011
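The statics-sharing problem Sean describes (all fibers multiplexed on one thread see the same thread-local slot) can be demonstrated outside of D. In this illustrative Python sketch, coroutines stand in for fibers: threading.local shows the sharing, and contextvars plays the role of the per-fiber storage that D would have to implement manually:

```python
import asyncio
import contextvars
import threading

tls = threading.local()                 # one slot per *thread*
fls = contextvars.ContextVar("fls")     # one slot per *task* ("fiber")

async def worker(name, results):
    tls.value = name        # every coroutine on this thread shares this slot
    fls.set(name)           # isolated per task
    await asyncio.sleep(0)  # yield, letting the other worker run
    results.append((tls.value, fls.get()))

async def main():
    results = []
    await asyncio.gather(worker("a", results), worker("b", results))
    return results

results = asyncio.run(main())
print(results)  # worker "a" finds its thread-local slot overwritten by "b"
```

After both workers yield once, the thread-local value seen by "a" is whatever the last scheduled task wrote, while the task-local value is still its own, which is exactly the surprise Sean warns statics users would hit under fiber multiplexing.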
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:

 On 01/04/11 01.07, dsimcha wrote:
 Again forgive my naiveness, as most of my experience with concurrency is
 concurrency to implement parallelism, not concurrency for its own sake.
 Shouldn't 32,000 threads be more than enough for anything?  I can't imagine
 what kinds of programs would really need this level of concurrency, or how
 bad performance on any specific thread would be when you have this many.
 Right now in my Task Manager the program with the most threads is
 explorer.exe, with 28.
 There doesn't have to be a thread for each socket. Actually many servers
 have very few threads with many sockets each. 32000 sockets is not
 unimaginable for certain server loads e.g. websockets or game servers. But
 I know it is not that common.

Hopefully not at all common. With that level of concurrency the process will spend more time context switching than executing code.
Apr 01 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 8:47 AM, dsimcha wrote:

 == Quote from Sean Kelly (sean invisibleduck.org)'s article
 It's more an issue of creating an understandable programming model.  If
 someone is using statics, the result should be the same regardless of
 whether the code gets a dedicated thread or is multiplexed with other
 code on one thread.  ie. fibers are ideally an implementation detail.
 Yes, but what would be the likely performance cost of doing so?

The cost of context-switching with fibers is significantly smaller than kernel threads--I think Mikola Lysenko's talk at the Tango conference a few years back may have some numbers. The performance of using TLS in general isn't great regardless of whether fibers are involved. Manually implemented TLS maybe requires one additional lookup? TLS is done manually on OSX right now, so the code for how it would work is already in place.
Apr 01 2011
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On Fri, 1 Apr 2011, Sean Kelly wrote:

 On Apr 1, 2011, at 12:59 PM, Jonas Drewsen wrote:
 
 On 01/04/11 17.21, Sean Kelly wrote:
 
 On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:
 
 There doesn't have to be a thread for each socket. Actually many servers have
very few threads with many sockets each. 32000 sockets is not unimaginable for
certain server loads e.g. websockets or game servers. But I know it is not that
common.

Hopefully not at all common. With that level of concurrency the process will spend more time context switching than executing code.

For services where clients spend most time inactive this works. An example could be a server for messenger like clients. Most of the time the clients are just connected waiting for messages. As long as nothing is transmitted no context switching is done.

Fair enough. Though I'd still say it's a terrible use of resources, given available asynchronous socket APIs. And as an aside, I think 32K sockets per process is not at all surprising. I've seen apps that use orders of magnitude more than that, though breaking the 64K barrier does get a bit weird.

I've got an app that regularly runs with hundreds of thousands of connections (though most of them are mostly idle). I haven't seen it break 1M yet, but the only thing stopping it is file descriptor limits and memory. It runs a very standard 1 thread per cpu model. Unfortunately, not yet in D. Later, Brad
Apr 01 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 2:24 PM, dsimcha wrote:

 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits and
 memory.  It runs a very standard 1 thread per cpu model.  Unfortunately,
 not yet in D.
 Later,
 Brad
 Why/how do you have all these connections open concurrently with only a few
 threads?  Fibers?  A huge asynchronous message queue to deal with new
 requests from connections that aren't idle?

A huge asynchronous message queue. State is handled either explicitly or implicitly via fibers. After reading Brad's statement, I'd be interested in seeing a comparison of the memory and performance differences of a thread per socket vs. asynchronous model though (assuming that sockets don't need to interact, so no need for synchronization).
Apr 01 2011
prev sibling next sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
There are several difficult issues connected with asynchronicity, high  
performance networking and related topics.
I had to deal with them developing blip ( http://fawzi.github.com/blip ).
My goal with it was to have a good basis for my program dchem; as a  
consequence it is not especially optimized, in particular for non-recursive  
tasks, and it is D1, but I think that the issues are generally relevant.

i/o and asynchronicity is a very important aspect and one that will  
tend to "pollute" many parts of the library, and introduce  
dependencies that are difficult to remove, thus these choices have to  
be made carefully.

Overview:
========

Threads vs fibers:
-----------------------

* an issue not yet brought up is that thread wire some memory, and so  
have an extra cost that fibers don't.
* the evaluation strategy of fibers can be chosen by the user; this is  
relevant for recursive tasks where each task
   spawns other tasks, because different strategies use very different  
resources (breadth-first evaluation,
   as threads use, needs a *lot* more resources than depth-first, by  
having many more tasks concurrently in  
evaluation)

Otherwise the relevant points already  brought forth by others are:

- context switch of fibers (assuming that memory is active) is much  
faster
- context switches are chosen by the user in fibers (cooperative  
multitasking); this allows
   one to choose the optimal point to switch, but a "bad" fiber can  
ruin the response time of the others.
- d is not stackless (like Go for example), so each fiber needs to  
have enough space for the stack
   (something that often is not so easy to predict). This makes fiber  
still a bit costly if one really needs a lot of them.
   64 bit can help here, because hopefully the active part is small,  
and it can be kept in RAM, even using a rather
   large virtual space. Still, as correctly said by Brad, for heavily  
uniform handling of many tasks manual
   management (and using stateless functions as much as possible) can  
be much more efficient.

Closures
------------
When possible, and for the low level (often used) operations, delegates  
and function calls are a better solution; structs and manual  
memory handling for "closures" are a good choice for low level  
operations, because one can avoid the heap allocation connected with  
the automatic closure.
This approach cannot be avoided in D1, whereas D2 has the very useful  
closures, but at low level their cost should be avoided when possible.
About using structs there are subtle issues that I think are connected  
with optimization of the compiler (I never really investigated them, I  
always changed the code, or resorted to heap allocation.
The main issue is that one would like to optimize as much as possible,  
and to do it it normally assumes that the current thread is the only  
user of the stack. If you pass stack stored structures to other  
threads these assumptions aren't true anymore, so the memory of a  
stack allocated struct might be reused even before the function  
returns (unless I am mistaken and the ABI forbids it, in this case  
tell me).

Async i/o
----------

* almost always i/o is much slower than the CPU, so an i/o operation  
is bound to make the cpu wait, and one wants to use that wait  
efficiently.
   - A very simple way is to just use blocking i/o, and have other  
threads do other work.
   - async i/o allows overlap of several operations in a single thread.
   - for files, an even more efficient way is to communicate by  
sharing the buffer with the kernel (aio_*)
   - an important issue is avoiding waste of cpu cycles while waiting;  
to achieve this one can collect several waiting operations and use a  
single thread to wait on all of them. select, poll and epoll allow  
this, and increase the efficiency of several kinds of programs
   - libev and libevent are cross platform libraries that help with an  
event based approach, taking care to check a large number of events  
and call a user defined callback when they happen, in a robust cross  
platform way

Locks, semaphores
------------
Locks and semaphores are the standard way to synchronize between  
threads.
One has to be careful when mixing them with fiber scheduling, as one  
can easily deadlock.

Hardware information
-----------------------------
Efficient usage of computational resources also depends on being able  
to identify the available hardware.
Don did quite some hacking to get useful information out of cpuinfo,  
but if one is interested in more complex machines, more info would be  
nice.
I use hwloc for this purpose; it is cross platform and can be embedded.

Possible solutions
==============

Async i/o can be presented as normal synchronous (blocking) i/o, but  
this makes sense only if one has several objects waiting, or uses  
fibers and executes another fiber while waiting.
How acceptable is it to rely on (and thus introduce a dependency on)  
things like libev or hwloc?
For my purposes using them was ok, and they are cross platform and  
embeddable, but is that true for Phobos too?

Asynchronicity means being able to have work executed concurrently and  
then resynchronize at a later point.
One can use processes (which also give memory protection), threads, or  
fibers to achieve this.
If one uses just threads, then asynchronous i/o makes sense only with  
fully manual (explicit) handling of it; hiding it away would be  
equivalent to blocking i/o.
Fibers allow one to hide async i/o and make it look blocking, but as  
Sean said there are issues with using fibers with D2 TLS.
I kind of dislike the use of TLS for non-low-level infrastructure  
stuff, but around here it seems that is just me.

In blip I chose to go with fiber based switching.
I wrapped libev both at a low level and at a higher level, in such a  
way that one can use it directly (for maximum performance).
For the sockets I use non blocking calls and a single "waiting" (io)  
thread, but hide them so that they are used just like blocking calls.

An important design decision when using fibers is whether one should  
be able to have a "naked" thread, or whether the fiber scheduling  
should be hidden in every thread.
In blip I went for allowing naked threads, because blip is realized  
entirely as a normal library, but that gives some ugly corner cases  
when one uses a method that wants to suspend a thread that doesn't  
have a scheduling place.
Building the scheduling into all threads is probably cleaner if one  
goes with fibers.
The problem of TLS and fibers remains though, especially if one allows  
the migration of fibers from one thread to another (as I do in blip).

An important design choice in blip was being able to cope with  
recursive parallelism (typical of computational tasks), not just with  
the concurrent parallelism that is typical of servers.
I feel that this is important, but others might not see it as such.

To do
====
Now, about async i/o: the first step is surely to expose an  
asynchronous API. This doesn't influence or depend on other parts of  
the library much.
An important decision is if, and which, external libraries one can  
rely on.

Making the async API nicer to use, or even using it "behind the  
scenes" as I do in blip, needs more complex choices about the basic  
handling of suspension and synchronization.
Something like that is bound to be used in several parts of Phobos, so  
a careful choice is needed.

These parts are also partially connected with high performance  
networking (another GSoC project).

Fawzi

Apr 02 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 6:08 PM, Brad Roberts wrote:

 On Fri, 1 Apr 2011, dsimcha wrote:

 On 4/1/2011 7:27 PM, Sean Kelly wrote:
 On Apr 1, 2011, at 2:24 PM, dsimcha wrote:

 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits
 and memory.  It runs a very standard 1 thread per cpu model.
 Unfortunately, not yet in D.

 I won't go into the why part, it's not interesting here, and I probably
 can't talk about it anyway.

 The simplified view of how: No fibers, just a small number of kernel
 threads (via pthread).  An epoll thread that queues tasks that are
 pulled by the 1 per cpu worker threads.  The queue is only as big as the
 outstanding work to do.  Assuming that the rate of socket events is less
 than the time it takes to deal with the data, the queue stays empty.

 It's actually quite a simple architecture at the 50k foot view.  Having
 recently hired some new people, I've got recent evidence... it doesn't
 take a lot of time to fully 'get' the network layer of the system.
 There's other parts that are more complicated, but they're not part of
 this discussion.

 A thread per socket would never handle this load.  Even with a 4k stack
 (which you'd have to be _super_ careful with since C/C++/D does nothing
 to help you track), you'd be spending 4G of ram on just the stacks.  And
 that's before you get near the data structures for all the sockets, etc.

I misread your prior post as one thread per socket and was a bit
baffled.  Makes a lot more sense now.  Potentially one read event per
socket still means a pretty long queue though.

Regarding the stack size... is that much of an issue with 64-bit
processes?  Figure a few pages of committed memory per thread plus a
large reserved range that shouldn't impact things at all.  Definitely
more than the event model, but maybe not tremendously so?
Apr 04 2011
prev sibling parent Jose Armando Garcia <jsancio gmail.com> writes:
The problem with threads is the context switch, not really the stack
size. Threads are not the solution to increase performance. In high
performance systems threads are used for fairness in the
request-response pipeline, not for performance. Obviously, this fact
is not argued when talking about uni-processors. With the availability
of cheap multi-processor, multi-core and hyper-threading hardware,
multiple threads are needed to keep all logical processors busy. In
other words, multiple threads are needed to get the most out of the
hardware even if you don't care about fairness.

Now, the argument above doesn't take implementability into account.
Most people write sequential multithreaded code because it is "easier"
(I personally think it is harder not to violate invariants in the
presence of concurrency/sharing). Many people feel it is easier to
teach programmers a sequential shared-memory model than to make the
paradigm switch to an event based model.

On Mon, Apr 4, 2011 at 6:49 PM, Sean Kelly <sean invisibleduck.org> wrote:
 On Apr 1, 2011, at 6:08 PM, Brad Roberts wrote:

 On Fri, 1 Apr 2011, dsimcha wrote:

 On 4/1/2011 7:27 PM, Sean Kelly wrote:
 On Apr 1, 2011, at 2:24 PM, dsimcha wrote:

 == Quote from Brad Roberts (braddr puremagic.com)'s article
 I've got an app that regularly runs with hundreds of thousands of
 connections (though most of them are mostly idle).  I haven't seen it
 break 1M yet, but the only thing stopping it is file descriptor limits
 and memory.  It runs a very standard 1 thread per cpu model.
 Unfortunately, not yet in D.

 I won't go into the why part, it's not interesting here, and I probably
 can't talk about it anyway.

 The simplified view of how: No fibers, just a small number of kernel
 threads (via pthread).  An epoll thread that queues tasks that are
 pulled by the 1 per cpu worker threads.  The queue is only as big as the
 outstanding work to do.  Assuming that the rate of socket events is less
 than the time it takes to deal with the data, the queue stays empty.

 It's actually quite a simple architecture at the 50k foot view.  Having
 recently hired some new people, I've got recent evidence... it doesn't
 take a lot of time to fully 'get' the network layer of the system.
 There's other parts that are more complicated, but they're not part of
 this discussion.

 A thread per socket would never handle this load.  Even with a 4k stack
 (which you'd have to be _super_ careful with since C/C++/D does nothing
 to help you track), you'd be spending 4G of ram on just the stacks.  And
 that's before you get near the data structures for all the sockets, etc.

 I misread your prior post as one thread per socket and was a bit
 baffled.  Makes a lot more sense now.  Potentially one read event per
 socket still means a pretty long queue though.

 Regarding the stack size... is that much of an issue with 64-bit
 processes?  Figure a few pages of committed memory per thread plus a
 large reserved range that shouldn't impact things at all.  Definitely
 more than the event model, but maybe not tremendously so?
Apr 04 2011
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Apr 1, 2011, at 12:59 PM, Jonas Drewsen wrote:

 On 01/04/11 17.21, Sean Kelly wrote:

 On Apr 1, 2011, at 7:49 AM, Jonas Drewsen wrote:

 There doesn't have to be a thread for each socket. Actually many [...]
 not unimaginable for certain server loads e.g. websockets or game
 servers. But I know it is not that common.

 Hopefully not at all common.  With that level of concurrency the [...]

 For services where clients spend most time inactive this works. An [...]
 the clients are just connected waiting for messages. As long as nothing
 is transmitted no context switching is done.

Fair enough.  Though I'd still say it's a terrible use of resources,
given available asynchronous socket APIs.  And as an aside, I think 32K
sockets per process is not at all surprising.  I've seen apps that use
orders of magnitude more than that, though breaking the 64K barrier does
get a bit weird.
Apr 01 2011