Load and look up ProjectVIC photoDNA hashes #280

Closed
lfcnassif opened this issue Oct 9, 2020 · 3 comments · Fixed by #298

@lfcnassif
Member

Basic support was implemented in #246, without photoDNA loading. With the current hashset size, loading the photoDNA hashes on heap would take about 2 GB. I thought about refactoring photoDNA indexing and lookup to be disk based, but that would need some effort and would probably be slower. Although loading on heap is not a long-term solution, it is feasible with the current hashset size.
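
To make the heap estimate concrete, here is a rough sketch of the arithmetic. It assumes 144-byte photoDNA hashes packed into one contiguous array with no per-entry object overhead. The hash length and both helper names are assumptions for illustration, not taken from the project code.

```python
# Back-of-the-envelope heap estimate for hashes packed into one flat array.
# PHOTODNA_HASH_LEN is an assumption (144 bytes per hash), not read from the hashset.
PHOTODNA_HASH_LEN = 144


def packed_heap_bytes(num_hashes: int, hash_len: int = PHOTODNA_HASH_LEN) -> int:
    """Bytes needed to hold num_hashes fixed-width hashes back to back."""
    return num_hashes * hash_len


def max_hashes_for_budget(budget_bytes: int, hash_len: int = PHOTODNA_HASH_LEN) -> int:
    """How many fixed-width hashes fit in a given memory budget."""
    return budget_bytes // hash_len


if __name__ == "__main__":
    two_gib = 2 * 1024 ** 3
    # Number of 144-byte hashes that would fill a 2 GiB budget.
    print(max_hashes_for_budget(two_gib))
```

Storing each hash as a separate object would add per-object overhead on top of this figure, so the packed layout is the optimistic case.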

Any thoughts @tc-wleite ?

@wladimirleite
Member

Well, it will definitely be slower (but hopefully still very fast) and will require extra implementation effort.
On the other hand, although 2 GB is not that much, thinking about other hash datasets that can be included in the future, using some disk-based solution seems unavoidable.

Just a couple of quick thoughts, not sure if they make sense at this point:
One option would be to implement a more general solution for dealing with hash sets, outside the tasks and parsers(?) implementation, which would allow loading and querying datasets.
The second observation is that an option to load everything in memory could still be useful when a lot of memory is available.
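
A minimal sketch of what such a general store could look like for exact-match hashes, with one switch between heap and disk. It assumes fixed-width hashes written sorted to a single file and looked up by binary search. `SortedHashSet` and `build_sorted_hash_file` are hypothetical names, and this is not how the existing tasks are implemented.

```python
import mmap
import os
import tempfile


def build_sorted_hash_file(path, hashes, hash_len):
    """Write the unique hashes, sorted, as fixed-width records."""
    records = sorted(set(hashes))
    if any(len(h) != hash_len for h in records):
        raise ValueError("all hashes must be hash_len bytes long")
    with open(path, "wb") as f:
        for h in records:
            f.write(h)


class SortedHashSet:
    """Exact-match lookup over a sorted file of fixed-width hashes."""

    def __init__(self, path, hash_len, in_memory=False):
        self.hash_len = hash_len
        self._mmap = None
        with open(path, "rb") as f:
            if in_memory:
                # Heap mode: read the whole file into one bytes object.
                self._data = f.read()
            else:
                # Disk mode: let the OS page records in on demand.
                # Note: mmap cannot map an empty file.
                self._mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                self._data = self._mmap
        self.count = len(self._data) // hash_len

    def contains(self, h):
        """Binary search over the sorted records."""
        lo, hi = 0, self.count
        while lo < hi:
            mid = (lo + hi) // 2
            start = mid * self.hash_len
            record = self._data[start:start + self.hash_len]
            if record < h:
                lo = mid + 1
            elif record > h:
                hi = mid
            else:
                return True
        return False

    def close(self):
        if self._mmap is not None:
            self._mmap.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "hashes.bin")
    build_sorted_hash_file(path, [b"\x02" * 4, b"\x01" * 4], 4)
    store = SortedHashSet(path, 4, in_memory=False)
    print(store.contains(b"\x01" * 4))
    store.close()
```

Callers only see `contains`, so the choice between heap and disk becomes a configuration flag rather than a code change. This covers exact matches only; similarity lookup would need a different structure.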

@lfcnassif
Member Author

lfcnassif commented Oct 9, 2020

1.1: Do you mean an external application/service to be queried? I thought about this in the past and made it possible for tasks to accumulate items and send bulk requests to external services, so network latency will not hurt too much. There is an initial implementation in the batchPythonTask branch (badly named);
1.2: The KFFTask needs to be refactored so that it can support different kinds of hashsets, not only NSRL. I am not sure whether vector-based hashes with similarity distances would fit easily into the same solution;
2: Yes, we currently have a lot of memory; I was just thinking of a straightforward implementation for now.
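
The bulk-request idea in 1.1 can be sketched roughly as follows. This is not the code from the batchPythonTask branch. `BatchedLookup` and the `bulk_query` callback are hypothetical names, and the callback stands in for whatever external service would be queried.

```python
from typing import Callable, Dict, List


class BatchedLookup:
    """Accumulate item hashes and query an external service in bulk."""

    def __init__(self, bulk_query: Callable[[List[str]], Dict[str, bool]],
                 batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._bulk_query = bulk_query  # hypothetical: one network round trip per call
        self._batch_size = batch_size
        self._pending: List[str] = []
        self.results: Dict[str, bool] = {}

    def submit(self, item_hash: str) -> None:
        """Queue one hash; send the batch once it is full."""
        self._pending.append(item_hash)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is queued, e.g. at the end of processing."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.results.update(self._bulk_query(batch))
```

With a batch size of N, the per-request latency is paid once per N items instead of once per item, at the cost of results arriving only after the batch is flushed.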

@wladimirleite
Member

1.1. That can be another option, but I was thinking about an internal implementation. I meant more in terms of code organization.
1.2. I guess distance would require a specific/more sophisticated solution.
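
To illustrate why distance needs its own solution: an exact-match index answers "is this hash present?", while a similarity lookup has to compare the query against candidates and keep those under a threshold. A minimal linear-scan sketch follows, assuming hashes are equal-length byte vectors compared by squared Euclidean distance. The metric, the threshold, and the function names are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple


def squared_distance(a: bytes, b: bytes) -> int:
    """Squared Euclidean distance between two equal-length byte vectors."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_within(query: bytes, hashes: Sequence[bytes],
                   max_distance: int) -> Optional[Tuple[int, int]]:
    """Return (index, distance) of the closest hash within max_distance, else None."""
    best = None
    for idx, candidate in enumerate(hashes):
        d = squared_distance(query, candidate)
        if d <= max_distance and (best is None or d < best[1]):
            best = (idx, d)
    return best
```

Every query touches every stored hash here, so sorting or hashing the records as for exact matches does not help. A scalable version would need some kind of similarity index.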

@lfcnassif lfcnassif self-assigned this Oct 20, 2020
@lfcnassif lfcnassif changed the title Load and check ProjectVIC photoDNA hashes Load and look up ProjectVIC photoDNA hashes Nov 18, 2020
This issue was closed.