
Integrate libtuv thread pool to eliminate thread creation overhead #93

Merged (2 commits, Feb 12, 2019)

Conversation

@marktwtn (Collaborator) commented Jan 30, 2019

To reduce the overhead of repeatedly creating and destroying threads,
we integrate the libtuv thread pool as a git submodule.
The pthread-related functions and data types are replaced with the
corresponding libtuv ones.
The compilation of the libtuv library is handled in mk/submodule.mk.
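
For readers unfamiliar with the libuv-style API that libtuv mirrors, a
minimal sketch of the replacement pattern follows. The names pow_task_t
and spawn_workers are hypothetical, not dcurl's actual symbols; only
uv_queue_work() and its callback signatures are the real interface.

    #include <uv.h>

    /* Hypothetical per-task context; dcurl's real structures differ. */
    typedef struct {
        uv_work_t req;
        int thread_id;
    } pow_task_t;

    /* Runs on a pooled worker thread. This is the body that used to be
     * a pthread start routine. */
    static void work_cb(uv_work_t *req)
    {
        pow_task_t *task = (pow_task_t *) req->data;
        (void) task; /* ... perform one PoW worker's share ... */
    }

    /* Runs on the event-loop thread (during uv_run) once the work is done. */
    static void after_work_cb(uv_work_t *req, int status)
    {
        (void) req;
        (void) status;
    }

    static void spawn_workers(uv_loop_t *loop, pow_task_t *tasks, int n)
    {
        for (int i = 0; i < n; i++) {
            tasks[i].thread_id = i;
            tasks[i].req.data = &tasks[i];
            /* Replaces pthread_create(): the job is handed to an already
             * running pool thread instead of a freshly created one. */
            uv_queue_work(loop, &tasks[i].req, work_cb, after_work_cb);
        }
    }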

Experiment:
Call clock_gettime() right before and after the functions that acquire a
thread: pthread_create() (without thread pool) and uv_queue_work() (with
thread pool).
test-multi-pow.py is used as the test case since it initializes and
destroys dcurl only once and performs the PoW multiple times, which is
what IRI does.
The result below shows the time to acquire each thread; each PoW
execution uses 7 threads.
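
The instrumentation is straightforward; a minimal sketch follows
(measure_spawn and elapsed_sec are illustrative names, not the actual
test code):

    #include <stdio.h>
    #include <time.h>

    static double elapsed_sec(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
    }

    static void measure_spawn(int thread_id)
    {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* pthread_create(...) or uv_queue_work(...) goes here */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("thread%d: %.9f\n", thread_id, elapsed_sec(&start, &end));
    }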

Experiment result (unit: second):
Without thread pool
thread0: 0.000028384
thread1: 0.000025127
thread2: 0.000024748
thread3: 0.000023925
thread4: 0.000024126
thread5: 0.000025328
thread6: 0.000052900
thread0: 0.000049344
thread1: 0.000039575
thread2: 0.000036720
thread3: 0.000036249
thread4: 0.000034606
thread5: 0.000034676
thread6: 0.000033444

With thread pool
thread0: 0.000124327
thread1: 0.000002084
thread2: 0.000001052
thread3: 0.000000150
thread4: 0.000000121
thread5: 0.000000080
thread6: 0.000000090
thread0: 0.000000291
thread1: 0.000000080
thread2: 0.000000050
thread3: 0.000000050
thread4: 0.000000050
thread5: 0.000000060
thread6: 0.000000050

The first acquisition of a thread from the thread pool takes longer
since that call also preallocates and initializes the pool's threads.

Close #58.

@jserv (Member) commented Jan 31, 2019

The commit messages shall contain the relevant experiment results for reference purposes.

README.md review thread (outdated, resolved)
Makefile review thread (resolved)
@jserv (Member) commented Jan 31, 2019

For performance measurement, you should take C++11 threads, affinity, and hyperthreading into consideration. Thread affinity might make a dramatic difference.

src/pow_c.c review thread (outdated, resolved)
@jserv changed the title from "Integrate libtuv thread pool into dcurl" to "Integrate libtuv thread pool to eliminate thread creation overhead" on Jan 31, 2019
@jserv (Member) commented Jan 31, 2019

How about the C and OpenCL implementations?

@marktwtn (Collaborator, Author) commented Feb 11, 2019

The commit messages shall contain the relevant experiment results for reference purposes.

The experiment results have been added to the git commit message and to the first comment of this pull request.


For performance measurement, you should take C++11 threads, affinity, and hyperthreading into consideration. Thread affinity might make a dramatic difference.

I think this can be opened as another issue?


How about the C and OpenCL implementations?

The C implementation does include the libtuv thread pool.
The OpenCL implementation seems to have nothing to do with threads?

@jserv (Member) commented Feb 11, 2019

For performance measurement, you should take C++11 threads, affinity, and hyperthreading into consideration. Thread affinity might make a dramatic difference.

I think this can be opened as another issue?

Yes, please do. The skeleton implementation looks like the following:

#define _GNU_SOURCE /* for cpu_set_t, CPU_ZERO/CPU_SET and sched_setaffinity() */
#include <sched.h>
#include <sys/resource.h>
#include <unistd.h>

static int num_processors;

int main(int argc, char *argv[]) {
    ...
    num_processors = sysconf(_SC_NPROCESSORS_CONF);
    ...
}

/* Fall back to the default time-sharing scheduler. */
static inline void drop_policy(void) {
    struct sched_param param = { .sched_priority = 0 };
    sched_setscheduler(0, SCHED_OTHER, &param);
}

/* Pin the calling thread to the given CPU. */
static inline void affine_to_cpu(int id, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

static void *worker_thread(void *userdata) {
    int thread_id = ((thread_info *) userdata)->id;
    ...
    /* Raise the worker's priority when running as root, then drop back
     * to the default scheduling policy. No need for this to be an
     * error if either call fails.
     */
    if (!geteuid())
        setpriority(PRIO_PROCESS, 0, -14);
    drop_policy();

    /* CPU affinity only makes sense if the number of threads is a multiple
     * of the number of CPUs.
     */
    affine_to_cpu(thread_id, thread_id % num_processors);
    ...
}
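
A hypothetical driver for this skeleton might look as follows;
thread_info and start_workers are illustrative names completing the
elided parts, not existing dcurl code:

    #include <pthread.h>

    typedef struct {
        int id;
    } thread_info;

    /* Spawn num_threads workers and wait for them to finish. */
    int start_workers(int num_threads) {
        pthread_t threads[num_threads];
        thread_info info[num_threads];
        for (int i = 0; i < num_threads; i++) {
            info[i].id = i;
            if (pthread_create(&threads[i], NULL, worker_thread, &info[i]))
                return -1;
        }
        for (int i = 0; i < num_threads; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }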

mk/submodule.mk review thread (outdated, resolved)
@jserv (Member) commented Feb 11, 2019

The OpenCL implementation seems to have nothing to do with threads?

(off-topic) Is pthread_mutex_lock necessary in the OpenCL backend? Can we simply synchronize at a barrier?

@jserv (Member) commented Feb 11, 2019

Experiment results shall come with a listing of the hardware configuration for reference purposes.

@wusyong added this to the sprint-201902 milestone on Feb 11, 2019
@jserv requested a review from ajblane on February 11, 2019
@jserv (Member) commented Feb 11, 2019

Since this pull request dramatically changes the flow of execution, there should be a dedicated note in the docs/ directory briefing the fundamental design. Something like docs/threading-model.md would be nice, where we can discuss the thread pool, SMP affinity, load balancing, etc.

@jserv (Member) commented Feb 11, 2019

Rebasing is required due to the recent document re-organization.

mk/submodule.mk review thread (outdated, resolved)

Commit message:

To reduce the overhead of repeatedly creating and destroying threads,
we integrate the libtuv thread pool as a git submodule.
The pthread-related functions and data types are replaced with the
corresponding libtuv ones.
The compilation of the libtuv library is handled in mk/submodule.mk.

Experiment:
Call clock_gettime() right before and after the functions that acquire a
thread: pthread_create() (without thread pool) and uv_queue_work() (with
thread pool).
test-multi-pow.py is used as the test case since it initializes and
destroys dcurl only once and performs the PoW multiple times, which is
what IRI does.
The result shows the time to acquire each thread; each PoW execution
uses 7 threads.

Hardware information:
architecture - x86_64
CPU          - AMD Ryzen 5 2400G (4 cores/8 threads)

Experiment result (unit: second):
Without thread pool
thread0: 0.000028384
thread1: 0.000025127
thread2: 0.000024748
thread3: 0.000023925
thread4: 0.000024126
thread5: 0.000025328
thread6: 0.000052900
thread0: 0.000049344
thread1: 0.000039575
thread2: 0.000036720
thread3: 0.000036249
thread4: 0.000034606
thread5: 0.000034676
thread6: 0.000033444

With thread pool
thread0: 0.000124327
thread1: 0.000002084
thread2: 0.000001052
thread3: 0.000000150
thread4: 0.000000121
thread5: 0.000000080
thread6: 0.000000090
thread0: 0.000000291
thread1: 0.000000080
thread2: 0.000000050
thread3: 0.000000050
thread4: 0.000000050
thread5: 0.000000060
thread6: 0.000000050

The first acquisition of a thread from the thread pool takes longer
since that call also preallocates and initializes the pool's threads.

Close DLTcollab#58.
@marktwtn force-pushed the libtuv-thread-pool-integration branch from 97613e3 to 728aa2a on February 12, 2019
@@ -0,0 +1,21 @@
# Copy from the Makefile of libtuv to support different platforms
UNAME_M := $(shell uname -m)
UNAME_S := $(shell uname -s)
Review comment (Member):

Add a FIXME to mention the limitation of the supported operating system listing.

@marktwtn (Collaborator, Author) replied:

Umm... I'm not sure what you expect to see.
Do you mean listing the operating systems that dcurl supports but libtuv does not, or vice versa?

@marktwtn (Collaborator, Author) commented:

Experiment results shall come with a listing of the hardware configuration for reference purposes.

I have added the architecture and CPU information to the commit message.


Since this pull request dramatically changes the flow of execution, there should be a dedicated note in the docs/ directory briefing the fundamental design. Something like docs/threading-model.md would be nice, where we can discuss the thread pool, SMP affinity, load balancing, etc.

I will record it as a TO-DO item.


Rebasing is required due to the recent document re-organization.

This has been done without any problem.

@marktwtn (Collaborator, Author) commented:

The OpenCL implementation seems to have nothing to do with threads?

(off-topic) Is pthread_mutex_lock necessary in the OpenCL backend? Can we simply synchronize at a barrier?

It is necessary.
It ensures that we select the right GPU device when more than one GPU is present.
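
To illustrate the point, here is a minimal sketch of the kind of
critical section involved; select_device, device_lock, and next_device
are hypothetical names, not dcurl's actual OpenCL backend code:

    #include <pthread.h>

    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_device;

    /* Serialize device selection so that concurrent PoW requests do not
     * race on the shared device index. */
    static int select_device(int num_devices)
    {
        pthread_mutex_lock(&device_lock);
        int id = next_device;
        next_device = (next_device + 1) % num_devices;
        pthread_mutex_unlock(&device_lock);
        return id;
    }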

@jserv merged commit c5147ab into DLTcollab:dev on Feb 12, 2019
@marktwtn deleted the libtuv-thread-pool-integration branch on February 14, 2019