Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Main keystore #2

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft

Main keystore #2

wants to merge 18 commits into from

Conversation

peteralfonsi
Copy link
Owner

@peteralfonsi peteralfonsi commented Sep 22, 2023

Description

This change adds an RBM implementation of KeyLookupStore, designed to store integer hashcodes of keys for a future disk cache in a memory-efficient way. It stores values in the RBM with a modulo, as very sparse RBMs are not memory-efficient. This comes with a tradeoff of some rare collisions (~0.3% of values for the recommended modulo 2^29 in a store with 10^7 values). To handle this, we also maintain a hash set of values which have had collisions. Values with no collisions can be safely removed without risking any false negatives in contains().

The implementation is generic and can be used in other places as well.

Testing

The implementing class was unit tested. The tests include basic functionality for all the different underlying data structures that the classes use, ability to set a memory cap to prevent the store from growing too large, and making sure the implementations are thread-safe. See RBMIntKeyLookupStoreTests.java for these tests.

Related Issues

N/A, part of larger tiered caching feature.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • [N/A] Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* int values. The internal representations may have collisions. Example transformations include a modulo
* or -abs(value), or some combination.
*/
public interface IntKeyLookupStore {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to define interface specifically for int.
You can use Generic instead and define KeyLookupStore, and accordingly parameterize the methods.
Then IntKeyLookupStore can be one of the implementation.

* Check if the implementing class supports safe removals. If it doesn't, remove() will always return false.
* @return true if the class supports safe removals, false if it doesn't.
*/
boolean supportsRemoval();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in the first place? We ideally should not expose any implementation which can't support removals as it is a basic/mandatory requirement.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that because of collisions either in the hash function or the modulo, removing values might also remove the keys which collided with those values, and we don't want to have false negatives because it might mean we have to recompute the query. In the RemovableIntKeyLookupStore we have a set of values which have had collisions which lets us safely remove un-collided values, which should be most of them. I also added a forceRemove function so you can get rid of any value for either class, at the cost of no longer guaranteeing no false negatives.

* Check if the object currently guarantees having no false negatives when running contains().
* @return false if there will not be false negatives, true if there could be false negatives.
*/
boolean canHaveFalseNegatives();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think we should disallow false negatives at all cost? As it should be a basic requirement, and can be costly depending on the use case(like computing query again though it was present in cache)

* If it uses a RBM without a modulo or doesn't use an RBM, returns 0.
* @return The modulo.
*/
int getModulo();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm don't think we need to expose this. If we use modulo 2^22 for example, the concrete class can itself can be called RoaringBitMapModule22 or something like that. I don't see a use case for this function.

/**
* Returns whether the store is at memory capacity
*/
boolean getIsAtCapacity();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to isAtCapacity() ?

* Returns true if the transformation involves taking -abs(), simplifying int[] access and sorting
* @return Whether transformed values are always negative.
*/
boolean isUsingNegativeOnly();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets keep the interface simple for now and remove this if not a strong requirement.


/**
* Returns an estimate of the store's memory usage.
* @return The memory usage, in MB
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use bytes instead of MB. And long instead of double?
Also explicitly name function like getMemorySizeInBytes() to be clear.

* int values. The internal representations may have collisions. Example transformations include a modulo
* or -abs(value), or some combination.
*/
public interface IntKeyLookupStore {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also add a clear() function to clear all the keys.

Peter Alfonsi added 4 commits September 25, 2023 12:44
@peteralfonsi peteralfonsi marked this pull request as draft September 25, 2023 23:14
* Returns the current internal data structure.
* @return A string representing the currently used internal data structure.
*/
String getCurrentStructure() throws Exception;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about it's use case, but if we want to have it, create a enum for it instead.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was mostly only useful in testing, but you're right, we can just have it return the enum it uses internally instead of a string.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it from the interface, just kept it in the hybrid class and changed it to return an enum. The RBM only class doesn't need it

* Check if the implementing class supports safe removals. If it doesn't, remove() will always return false.
* @return true if the class supports safe removals, false if it doesn't.
*/
boolean supportsRemoval();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go ahead with just Removable implementation, so I guess this will not be needed as such.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the additional overhead for the Removable implementation is very small so it's worth it

* Check if the object currently guarantees having no false negatives when running contains().
* @return false if there will not be false negatives, true if there could be false negatives.
*/
boolean canHaveFalseNegatives();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, lets not have a implementation which can lead to false negatives.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we get rid of forceRemove()? I think if we keep it, it'd still be useful to have a function keeping track of whether it's been called. But if we don't keep it no need for this function

* Returns the number of times add() has been run, including unsuccessful attempts.
* @return The number of adding attempts.
*/
int getNumAddAttempts();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename it to getTotalAdds() ?

* Returns the number of times add() has returned false due to a collision.
* @return The number of collisions.
*/
int getNumCollisions();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename it to getCollisions(), as it is a int anyways, so "Num" looks redundant.

return false;
}
int transformedValue = transform(value);
boolean alreadyContained = contains(transformedValue);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. We are passing transformedValue to contains, which internally again transforms it further. Shouldn't we be just passing contains(value) instead?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also considering this will mostly return false than true, as collisions would be less frequent, better to just call rbm.add() directly and avoid extra seek?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right it's redundant, although transform() should be idempotent. I'll change that. But I think we need to keep the call to contains(), otherwise we won't know if there is a collision, and then we can't maintain our set of values which have had collisions, which we need to safely remove entries.

Comment on lines 44 to 47
protected int size;
protected long memSizeCapInBytes;
protected int numAddAttempts;
protected int numCollisions;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned elsewhere, lets move this to separate KeyStoreStats class.
Also use CounterMetric or LongAdder to avoid concurrency related issues.

return true;
}
handleCollisions(transformedValue);
return false;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can make this function add as void itself, as client calling add() might not find this value useful.


@Override
public String getCurrentStructure() throws Exception {
return "RBM";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, define an enum for this if needed.

import org.roaringbitmap.RoaringBitmap;
import java.util.HashSet;

public class RemovableRBMIntKeyLookupStore extends RBMIntKeyLookupStore implements KeyLookupStore<Integer> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, lets only keep this implementation and avoid false negatives at all cost.. And rename it to RBMIntKeyLookupStore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants