Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of state to improve performance (…failed?) #4

Open
DoubleHyphen opened this issue Mar 31, 2021 · 3 comments
Open

Get rid of state to improve performance (…failed?) #4

DoubleHyphen opened this issue Mar 31, 2021 · 3 comments

Comments

@DoubleHyphen
Copy link
Contributor

I thought I'd try to figure out a way to get rid of the state parametre in both the LUT's index and the results it outputs. I succeeded in achieving that, but to my astonishment it ended up consistently ~50% slower! I tried optimising things further here and there, but no consistent result came up. Here is the main.rs file that explains how I did all those things; in the lib.rs file you'll see the actual methods I used.

That said…

You said in your benchmarks that fast_hilbert is around 2-2.5 times faster than other comparable methods. My own benchmarks show the improvement to be on the order of 8-10 times faster! I strongly suspect that our benchmarks lead to different results. Thus, if this seems interesting to you, feel free to experiment as well and see; otherwise, just close the issue.

PS: The main reason I ultimately decided to show you the work I did is because I think that this will help illuminate the way to generalise to other dimensions.

@becheran
Copy link
Owner

Interesting.

Did you use the extra two bits to increase the LUT size before you benchmarked? The performance drop can be explained by the bit operation overhead (which is low, but not zero). A LUT lookup on a modern CPU with a hot cache is very faster. But that said, my implementation might be slower on a system without a cache such as a low cost microcontoller. On the other hand on such a system you might better also want to avoid a 256 Byte LUT anyways...

Actually I ran the benchmark again on a full 256 * 256 2D space and the performance gain was more in the range of 5 times faster. See the updated readme. But again, it also depends on the system you are benchmarking on. And the 2.5 times was before your pr which did also increase the performance because counting zeros is faster than the iteration over the LUT and saves time for smaller numbers.

@DoubleHyphen DoubleHyphen changed the title Get rid of state to improve performance (FAILED) Get rid of state to improve performance (…failed?) Apr 5, 2021
@DoubleHyphen
Copy link
Contributor Author

DoubleHyphen commented Apr 5, 2021

I figured out how to make it go faster. However, if I make it calculate the new state imperatively, it merely matches your performance; to make it go faster (by about a third) I had to also extract the state into a new LUT. Please find an example implementation for [u32; 2] -> u64 below:

const STATE_LUT: [u8; 256] = [0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0];


const LUT_4_BPC: [u8; 256] = [
    000, 001, 014, 015, 016, 019, 020, 021, 234, 235, 236, 239, 240, 241, 254, 255, 
    003, 002, 013, 012, 017, 018, 023, 022, 233, 232, 237, 238, 243, 242, 253, 252, 
    004, 007, 008, 011, 030, 029, 024, 025, 230, 231, 226, 225, 244, 247, 248, 251, 
    005, 006, 009, 010, 031, 028, 027, 026, 229, 228, 227, 224, 245, 246, 249, 250, 
    058, 057, 054, 053, 032, 035, 036, 037, 218, 219, 220, 223, 202, 201, 198, 197, 
    059, 056, 055, 052, 033, 034, 039, 038, 217, 216, 221, 222, 203, 200, 199, 196, 
    060, 061, 050, 051, 046, 045, 040, 041, 214, 215, 210, 209, 204, 205, 194, 195, 
    063, 062, 049, 048, 047, 044, 043, 042, 213, 212, 211, 208, 207, 206, 193, 192, 
    064, 067, 068, 069, 122, 123, 124, 127, 128, 131, 132, 133, 186, 187, 188, 191, 
    065, 066, 071, 070, 121, 120, 125, 126, 129, 130, 135, 134, 185, 184, 189, 190, 
    078, 077, 072, 073, 118, 119, 114, 113, 142, 141, 136, 137, 182, 183, 178, 177, 
    079, 076, 075, 074, 117, 116, 115, 112, 143, 140, 139, 138, 181, 180, 179, 176, 
    080, 081, 094, 095, 096, 097, 110, 111, 144, 145, 158, 159, 160, 161, 174, 175, 
    083, 082, 093, 092, 099, 098, 109, 108, 147, 146, 157, 156, 163, 162, 173, 172, 
    084, 087, 088, 091, 100, 103, 104, 107, 148, 151, 152, 155, 164, 167, 168, 171, 
    085, 086, 089, 090, 101, 102, 105, 106, 149, 150, 153, 154, 165, 166, 169, 170, 
];
    fn hilbert_index(&self) -> u64 {
        fn separate_nybbles(x: u32) -> u64 {
            let mut x = x as u64;
            x |= x<<16;
            x &= 0x0000_FFFF_0000_FFFF;
            x |= x<<8;
            x &= 0x00FF_00FF_00FF_00FF;
            x |= x<<4;
            x &= 0x0F0F_0F0F_0F0F_0F0F;
            x
        }
        let bloated_1 = separate_nybbles(self[0]);
        let bloated_2 = separate_nybbles(self[1]);
        let key_0 = bloated_2 << 4 | bloated_1;
// Now, each byte of key_0 has one nybble of x and one nybble of y.
        let key_1 = bloated_1 << 4 | bloated_2;
        let key_2 = !key_0;
        let key_3 = !key_1;
        let combineds = [key_0, key_1, key_2, key_3];
        let mut state = 0u8;
        let mut result = [0u8; 8];
        for i in 0..8 {
            let index = combineds[state as usize].to_be_bytes()[i];
            let new_thing = LUT_4_BPC[index as usize];
            result[i] = new_thing;
            state ^= STATE_LUT[new_thing as usize];
        }
        u64::from_be_bytes(result)
    }

Since we have the state LUT already, and since it's indexed by the Hilbert index's keys rather than X and Y, I think it would probably be possible to also use the same LUT to also hep accelerate the inverse transformation. Any ideas?

@becheran
Copy link
Owner

becheran commented Apr 6, 2021

I am not 100% sure how you crated the state LUT. But you still have 6 "free" bits in the state LUT since you only used 2 so far for the state. So one pragmatic approach could be to compute the inverser LUT and add the result in the existing LUT as well. Does this make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants