
Facilitate thread pool to eliminate overhead while creating threads frequently #58

Closed
furuame opened this issue Aug 12, 2018 · 7 comments
Assignees
Labels
enhancement (New feature or request), feature (Outstanding features we should implement)
Milestone

Comments

@furuame
Member

furuame commented Aug 12, 2018

Since every CPU implementation, such as the SSE and AVX ones, creates many threads to find the nonce each time, the overhead cannot be ignored. libtuv offers a lightweight thread pool implementation for embedded systems.

This feature is expected to be implemented after issue #57 is resolved. Once the new interface is applied in dcurl, the thread pool can be hidden inside the CPU PoW algorithm context.
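
A minimal sketch of what that could look like, assuming libtuv keeps libuv's uv_queue_work()/uv_run() interface; pow_ctx_t, nonce_search_worker(), and NUM_WORKERS are illustrative names, not existing dcurl code:

```c
/* Hypothetical sketch: hiding a libtuv thread pool inside the CPU PoW
 * context instead of calling pthread_create() for every nonce search. */
#include <uv.h>

#define NUM_WORKERS 7

typedef struct {
    uv_loop_t *loop;              /* event loop owning the worker queue, e.g. uv_default_loop() */
    uv_work_t reqs[NUM_WORKERS];  /* reusable work requests, one per worker */
} pow_ctx_t;

/* Runs on a pooled thread; does one slice of the nonce search. */
static void nonce_search_worker(uv_work_t *req)
{
    /* ... SSE/AVX nonce search over this worker's range ... */
    (void) req;
}

static void nonce_search_done(uv_work_t *req, int status)
{
    (void) req;
    (void) status;
}

int pow_search(pow_ctx_t *ctx)
{
    /* Reuse pooled threads instead of spawning new ones on every call. */
    for (int i = 0; i < NUM_WORKERS; i++)
        uv_queue_work(ctx->loop, &ctx->reqs[i],
                      nonce_search_worker, nonce_search_done);
    /* Block until all queued work requests have completed. */
    return uv_run(ctx->loop, UV_RUN_DEFAULT);
}
```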

@furuame furuame added the enhancement New feature or request label Aug 12, 2018
@jserv jserv changed the title A threadpool for CPU implementations to avoid creating threads overhead Facilitate thread pool to eliminate overhead while creating threads frequently Aug 12, 2018
@marktwtn
Collaborator

marktwtn commented Nov 13, 2018

IOTA IRI has added a new commit for the --pow-threads command-line option.

It affects the number of threads to use for the proof-of-work calculation.
The thread pool would be affected, too.
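
A hedged sketch of how such an option could be plumbed through on the dcurl side; dcurl_set_pow_threads() and pow_task_count() are hypothetical names, not part of the actual dcurl or IRI interfaces:

```c
/* Illustrative only: how a --pow-threads value might bound the number of
 * nonce-search tasks dcurl queues on the thread pool. */
#include <unistd.h>

static int g_pow_threads;  /* 0 means "pick a default" */

void dcurl_set_pow_threads(int n)
{
    g_pow_threads = n > 0 ? n : 0;
}

/* Number of nonce-search tasks to queue for one PoW call. */
int pow_task_count(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    if (g_pow_threads > 0)
        return g_pow_threads;
    return online > 1 ? (int) online - 1 : 1;  /* leave one core free by default */
}
```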

@jserv
Member

jserv commented Nov 30, 2018

IOTA IRI has added a new commit for the --pow-threads command-line option.
It affects the number of threads to use for the proof-of-work calculation.
The thread pool would be affected, too.

Resolved in #87

@marktwtn
Collaborator

marktwtn commented Jan 21, 2019

I referenced the branch integrate-libtuv-threadpool created by @2henwei.

The current libtuv is compiled into a static library (.a) without the -fPIC flag, so it cannot be used to generate the libdcurl.so file.
We have to modify libtuv and add the flag so that it builds a library suitable for linking into the shared object.

@marktwtn
Collaborator

A rough time comparison of creating threads directly versus reusing threads from the libtuv thread pool in dcurl:

without threadpool (unit: second)

time: 0.000371
time: 0.000216
time: 0.000256
time: 0.000223
time: 0.000246
time: 0.000232
time: 0.000228
time: 0.000229
time: 0.000222
time: 0.000221

with threadpool (unit: second)

time: 0.000411
time: 0.000015
time: 0.000033
time: 0.000016
time: 0.000028              
time: 0.000013
time: 0.000078
time: 0.000013
time: 0.000034
time: 0.000018

These are the first 10 results of running PoW 100 times.


However, some modifications to libtuv are needed to integrate it with dcurl.
I will open a pull request to libtuv.
After it is accepted, I will send the pull request to dcurl.

If it is not accepted, we might need to figure out another way to integrate it.
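
For reference, a minimal sketch of how such a comparison could be taken, wrapping the thread hand-off calls with clock_gettime(CLOCK_MONOTONIC, ...); the worker bodies are stubs, and the real measurement wraps the PoW thread launch inside dcurl:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <uv.h>

static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

static void *pthread_stub(void *arg) { return arg; }
static void uv_stub(uv_work_t *req) { (void) req; }
static void uv_done(uv_work_t *req, int status) { (void) req; (void) status; }

int main(void)
{
    struct timespec t0, t1;
    pthread_t tid;
    uv_work_t req;

    /* Cost of spawning a fresh thread. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&tid, NULL, pthread_stub, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pthread_create: %f\n", elapsed_sec(t0, t1));
    pthread_join(tid, NULL);

    /* Cost of handing work to the pool (the first call also creates the pool,
     * which matches the larger first sample in the numbers above). */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uv_queue_work(uv_default_loop(), &req, uv_stub, uv_done);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("uv_queue_work:  %f\n", elapsed_sec(t0, t1));

    uv_run(uv_default_loop(), UV_RUN_DEFAULT);
    return 0;
}
```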

@furuame
Member Author

furuame commented Jan 23, 2019

A rough time comparison of creating threads directly versus reusing threads from the libtuv thread pool in dcurl:

without threadpool (unit: second)

time: 0.000371
time: 0.000216
time: 0.000256
time: 0.000223
time: 0.000246
time: 0.000232
time: 0.000228
time: 0.000229
time: 0.000222
time: 0.000221

with threadpool (unit: second)

time: 0.000411
time: 0.000015
time: 0.000033
time: 0.000016
time: 0.000028              
time: 0.000013
time: 0.000078
time: 0.000013
time: 0.000034
time: 0.000018

These are the first 10 results of running PoW 100 times.

Wow! It seems there is an obvious improvement with libtuv.

However, some modifications to libtuv are needed to integrate it with dcurl.

If you are referring to the compiler-flag modification, I have already reported a similar issue to them:
Samsung/libtuv#128

You can reopen it.

I will open a pull request to libtuv.
After it is accepted, I will send the pull request to dcurl.

If it is not accepted, we might need to figure out another way to integrate it.

@jserv
Member

jserv commented Jan 23, 2019

However, some modifications to libtuv are needed to integrate it with dcurl.
I will open a pull request to libtuv.
After it is accepted, I will send the pull request to dcurl.

As @2henwei suggested, we can reopen the existing libtuv issue. Meanwhile, we can rely on our fork, which includes the necessary build fixes, as another Git submodule.

@jserv
Member

jserv commented Jan 23, 2019

A rough time comparison of creating threads directly versus reusing threads from the libtuv thread pool in dcurl:
without threadpool (unit: second)

time: 0.000371
time: 0.000216
time: 0.000256

with threadpool (unit: second)

time: 0.000411
time: 0.000015
time: 0.000033

We should eliminate the impact of cache misses and page faults.
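
One hedged way to do that is to run a few untimed warm-up rounds so the code, data pages, and thread pool are already faulted in before timing starts; run_pow_once() below is a stub standing in for the measured dcurl PoW call:

```c
#include <stdio.h>
#include <time.h>

#define WARMUP_ROUNDS 5
#define TIMED_ROUNDS  100

/* Placeholder for the dcurl PoW call being measured. */
static void run_pow_once(void) {}

int main(void)
{
    struct timespec t0, t1;

    /* Warm-up: populate caches, fault in pages, spin up the thread pool. */
    for (int i = 0; i < WARMUP_ROUNDS; i++)
        run_pow_once();

    /* Only the later iterations are timed and reported. */
    for (int i = 0; i < TIMED_ROUNDS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_pow_once();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("time: %f\n", (t1.tv_sec - t0.tv_sec)
                             + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }
    return 0;
}
```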

marktwtn added a commit to marktwtn/dcurl that referenced this issue Jan 30, 2019
To reduce the overhead of creating and destroying threads repeatedly,
we integrate the thread pool of libtuv as a git submodule.
The pthread-related functions and data types are replaced with the corresponding
ones from libtuv.
The compilation of the libtuv library is handled in the file mk/submodule.mk.
The README.md asks the user to initialize and update the git submodule
right after downloading the repository.

Close DLTcollab#58.
marktwtn added a commit to marktwtn/dcurl that referenced this issue Feb 11, 2019
To reduce the overhead of creating and destroying threads repeatedly,
we integrate the thread pool of libtuv as a git submodule.
The pthread-related functions and data types are replaced with the corresponding
ones from libtuv.
The compilation of the libtuv library is handled in the file mk/submodule.mk.

Experiment:
Call clock_gettime() right before and after the functions for getting a thread.
The functions are pthread_create() (without thread pool)
and uv_queue_work() (with thread pool).
Use test-multi-pow.py as the test case since it initializes and destroys dcurl only once and
does the PoW multiple times, like what IRI does.
The experiment result shows the time of getting each thread;
the number of threads in one PoW execution is 7.

Experiment result (unit: second):
Without thread pool
thread0: 0.000028384
thread1: 0.000025127
thread2: 0.000024748
thread3: 0.000023925
thread4: 0.000024126
thread5: 0.000025328
thread6: 0.000052900
thread0: 0.000049344
thread1: 0.000039575
thread2: 0.000036720
thread3: 0.000036249
thread4: 0.000034606
thread5: 0.000034676
thread6: 0.000033444

With thread pool
thread0: 0.000124327
thread1: 0.000002084
thread2: 0.000001052
thread3: 0.000000150
thread4: 0.000000121
thread5: 0.000000080
thread6: 0.000000090
thread0: 0.000000291
thread1: 0.000000080
thread2: 0.000000050
thread3: 0.000000050
thread4: 0.000000050
thread5: 0.000000060
thread6: 0.000000050

The first acquisition of a thread from the thread pool takes longer
since it also preallocates and initializes the threads.

Close DLTcollab#58.
@wusyong wusyong added this to the sprint-201902 milestone Feb 11, 2019
@jserv jserv added the feature Outstanding features we should implement label Feb 11, 2019
marktwtn added a commit to marktwtn/dcurl that referenced this issue Feb 12, 2019
To reduce the overhead of creating and destroying threads repeatedly,
we integrate the thread pool of libtuv as a git submodule.
The pthread-related functions and data types are replaced with the corresponding
ones from libtuv.
The compilation of the libtuv library is handled in the file mk/submodule.mk.

Experiment:
Call clock_gettime() right before and after the functions for getting a thread.
The functions are pthread_create() (without thread pool)
and uv_queue_work() (with thread pool).
Use test-multi-pow.py as the test case since it initializes and destroys dcurl only once and
does the PoW multiple times, like what IRI does.
The experiment result shows the time of getting each thread;
the number of threads in one PoW execution is 7.

Hardware information:
architecture - x86_64
CPU         - AMD Ryzen 5 2400G (4 cores/8 threads)

Experiment result (unit: second):
Without thread pool
thread0: 0.000028384
thread1: 0.000025127
thread2: 0.000024748
thread3: 0.000023925
thread4: 0.000024126
thread5: 0.000025328
thread6: 0.000052900
thread0: 0.000049344
thread1: 0.000039575
thread2: 0.000036720
thread3: 0.000036249
thread4: 0.000034606
thread5: 0.000034676
thread6: 0.000033444

With thread pool
thread0: 0.000124327
thread1: 0.000002084
thread2: 0.000001052
thread3: 0.000000150
thread4: 0.000000121
thread5: 0.000000080
thread6: 0.000000090
thread0: 0.000000291
thread1: 0.000000080
thread2: 0.000000050
thread3: 0.000000050
thread4: 0.000000050
thread5: 0.000000060
thread6: 0.000000050

The first acquisition of a thread from the thread pool takes longer
since it also preallocates and initializes the threads.

Close DLTcollab#58.
@jserv jserv closed this as completed in #93 Feb 12, 2019