[REVIEW] Add tfidf bm25 #2353

jperez999 · 2024-06-05T16:21:07Z

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It does not require the user to work within the confines of a COO or CSR matrix; it only requires the triplets of data (row, column, value). With this information, we can preprocess the values accordingly. I am putting this up early to get eyes on it and to confirm it is heading in the right direction, or to adjust if it is not.

Unit tests are still required for these features.

Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that. These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Corey J. Nolet (https://github.com/cjnolet)

@shwina I'm going to apologize ahead of time for this, but I was trying to forward-merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix. Authors: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) Approvers: - Ray Douglass (https://github.com/raydouglass)

cjnolet

Thanks for these changes, Julio! They look great for the most part. Mostly minor things: 1) We need to use RAFT primitives wherever possible instead of thrust. 2) We should test at larger scales and write more reproducible tests by providing naive kernels to evaluate the results.
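As an illustration of the "naive kernel" idea, a host-side reference could recompute the expected values directly from the (row, column, value) triplets and compare them against the GPU output. The sketch below is only that: `doc_frequency`, `naive_tfidf`, and `naive_bm25` are hypothetical helper names, and the formulas are the textbook variants (IDF as `log(n_rows / df)`, BM25 with `k1` and `b`), which may differ from what this PR implements. It also assumes rows are documents, columns are terms, and each (row, column) pair appears at most once.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers (not RAFT APIs). Assumptions: rows are
// documents, columns are terms, values are raw term counts, and each
// (row, column) pair appears at most once.

// Number of stored entries per column, i.e. the document frequency of each term.
inline std::vector<int> doc_frequency(const std::vector<int>& cols, int n_cols)
{
  std::vector<int> df(n_cols, 0);
  for (int c : cols) df[c] += 1;
  return df;
}

// Textbook TF-IDF: value * log(n_rows / df[column]).
inline std::vector<double> naive_tfidf(const std::vector<int>& rows,
                                       const std::vector<int>& cols,
                                       const std::vector<double>& vals,
                                       int n_rows,
                                       int n_cols)
{
  assert(rows.size() == cols.size() && cols.size() == vals.size());
  const std::vector<int> df = doc_frequency(cols, n_cols);
  std::vector<double> out(vals.size());
  for (std::size_t i = 0; i < vals.size(); ++i) {
    out[i] = vals[i] * std::log(static_cast<double>(n_rows) / df[cols[i]]);
  }
  return out;
}

// Textbook BM25 with the same IDF; a row's length is the sum of its values.
inline std::vector<double> naive_bm25(const std::vector<int>& rows,
                                      const std::vector<int>& cols,
                                      const std::vector<double>& vals,
                                      int n_rows,
                                      int n_cols,
                                      double k1 = 1.2,
                                      double b  = 0.75)
{
  assert(n_rows > 0);
  assert(rows.size() == cols.size() && cols.size() == vals.size());
  const std::vector<int> df = doc_frequency(cols, n_cols);
  std::vector<double> row_len(n_rows, 0.0);
  double total = 0.0;
  for (std::size_t i = 0; i < vals.size(); ++i) {
    row_len[rows[i]] += vals[i];
    total += vals[i];
  }
  const double avg_len = total / n_rows;
  std::vector<double> out(vals.size());
  for (std::size_t i = 0; i < vals.size(); ++i) {
    const double idf  = std::log(static_cast<double>(n_rows) / df[cols[i]]);
    const double norm = 1.0 - b + b * row_len[rows[i]] / avg_len;
    out[i]            = idf * vals[i] * (k1 + 1.0) / (vals[i] + k1 * norm);
  }
  return out;
}
```

A test could then generate random triplets, run both the device code and these helpers, and compare the two result arrays within a floating-point tolerance.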

cjnolet · 2024-08-14T19:40:57Z

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

+ * limitations under the License.
+ */
+
+#include <raft/core/device_mdarray.hpp>


Ideally you include only what you need, so if you need all of these then go ahead and include them. Otherwise, try to remove anything that is unneeded.

cjnolet · 2024-08-14T19:53:36Z

/ok to test

jperez999 · 2024-09-11T17:53:07Z

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

+                             data.data_handle(),
+                             stream);
+
+  thrust::reduce_by_key(raft::resource::get_thrust_policy(handle),


I ended up using the thrust version because it can handle vectors, which lets me use the same code for both the CSR and COO versions of the encoding logic. Also, the RAFT version does not support sparse matrices.

Is this to compute the degree of each row in the sparse format? We have routines for this already, including a coo_degree function. Degree computation for CSR is really trivial: since you already have an array of offsets, you don't even need to count the columns because you can just diff the array (i.e., compute the difference between each value in the indptr array and the value before it). If you can't guarantee uniqueness, you can also use a simple mask as an efficient way to identify unique entries. For COO, you can then just add the 1s in the mask for each row segment. For a sorted COO, the degree computation is also trivial: you only need the row and column arrays and a segmented reduce.
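The two degree computations described above can be sketched on the host as follows. This is a minimal illustration, not the RAFT implementation: `csr_row_degree` and `naive_coo_row_degree` are hypothetical names, and the COO version assumes there are no duplicate (row, column) entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CSR: the degree of row i is indptr[i + 1] - indptr[i].
inline std::vector<int> csr_row_degree(const std::vector<int>& indptr)
{
  std::vector<int> degree;
  if (indptr.size() < 2) return degree;  // no rows
  degree.resize(indptr.size() - 1);
  for (std::size_t i = 0; i + 1 < indptr.size(); ++i) {
    degree[i] = indptr[i + 1] - indptr[i];
  }
  return degree;
}

// COO: count how many stored entries fall in each row.
// Assumes every (row, column) pair is unique.
inline std::vector<int> naive_coo_row_degree(const std::vector<int>& rows, int n_rows)
{
  std::vector<int> degree(n_rows, 0);
  for (int r : rows) {
    assert(r >= 0 && r < n_rows);
    degree[r] += 1;
  }
  return degree;
}
```

Both functions return the same result for the same matrix, which makes them a convenient cross-check for each other in a test.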

When we were using this function for rows, coo_degree was absolutely the right call. I was trying to reuse code, but that ended up causing problems with larger datasets (in the form of illegal memory access errors). I have changed it so that this function is only used when we need a column-wise sum of the values (not just a check for whether a value is present, as with rows). We can't just use L1 normalization, because I need the average column size across all columns as well as the individual column average. The reduce-by-key functions available in RAFT are for dense matrices only, which is why I opted for thrust reduce_by_key for the column-based processing.
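For reference, the column-wise sum described here can be expressed on the host as a sort followed by a reduction over runs of equal keys, which is the pattern `thrust::reduce_by_key` expects (it only merges consecutive equal keys). The sketch below is an assumption about the intent rather than the PR's code; `KeySums`, `sorted_reduce_by_key`, and `mean_of` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Output of the reduction: one entry per distinct key, in ascending key order.
struct KeySums {
  std::vector<int> keys;
  std::vector<double> sums;
};

// Sort entry indices by key, then sum each run of equal keys.
inline KeySums sorted_reduce_by_key(const std::vector<int>& keys,
                                    const std::vector<double>& vals)
{
  assert(keys.size() == vals.size());
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return keys[a] < keys[b];
  });

  KeySums out;
  for (std::size_t idx : order) {
    if (out.keys.empty() || out.keys.back() != keys[idx]) {
      out.keys.push_back(keys[idx]);  // a new run starts here
      out.sums.push_back(0.0);
    }
    out.sums.back() += vals[idx];
  }
  return out;
}

// Average of the per-key sums (0 when there are no keys).
inline double mean_of(const std::vector<double>& v)
{
  if (v.empty()) return 0.0;
  return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}
```

Passing the COO column indices as keys and the stored values as `vals` yields the per-column sums, and `mean_of` on those sums gives the average across the columns that have at least one entry.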

cjnolet · 2024-09-24T20:37:22Z

cpp/test/preprocess_utils.cu

+  auto keys_out   = raft::make_device_vector<int, int64_t>(handle, num_rows);
+  auto counts_out = raft::make_device_vector<int, int64_t>(handle, num_rows);
+
+  thrust::reduce_by_key(raft::resource::get_thrust_policy(handle),


We already have a great function for removing duplicates from sparse formats: it uses a simple mask to figure out where the duplicates are, and it's really efficient. Also, if the goal is to get the degree of each row of the matrix, we have functions for this.

I'm really less concerned about thrust in tests. However, it does make things easier if we can reuse RAFT routines.

The two available functions for removing duplicates that I saw are compute_duplicates_mask and max_duplicates. I did not find one that takes a mask and removes entries based on it, and what I have here is a function that uses the mask to remove the duplicates. max_duplicates works a little differently from the mask: it keeps the entry with the max value, which is not the behavior I want. I would like to keep the last vertex value that we see in the COO vectors. This aligns more closely with compute_duplicates_mask, which is what I used here, but all of this other code is still required. If you look further up in the code, you will see that we use that function to calculate the mask before we remove the duplicates, and the mask is then used in this function. If there is an existing function that takes this mask directly and removes the 0-valued indices, I would love to use it, but I could not find one.
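To make the desired behavior concrete, here is a small host-side sketch of "build a mask, then compact by it" that keeps the last occurrence of each (row, column) pair. It does not call or reproduce compute_duplicates_mask; `Coo`, `keep_last_mask`, and `compact_by_mask` are hypothetical names used only for illustration, and the mask convention (1 = keep, 0 = drop) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Minimal COO container for the example.
struct Coo {
  std::vector<int> rows;
  std::vector<int> cols;
  std::vector<double> vals;
};

// mask[i] == 1 when entry i is the last occurrence of its (row, col) pair.
inline std::vector<int> keep_last_mask(const Coo& m)
{
  std::vector<int> mask(m.rows.size(), 0);
  std::set<std::pair<int, int>> seen;
  // Walk backwards so the first time a pair is seen is its last occurrence.
  for (std::size_t i = m.rows.size(); i-- > 0;) {
    if (seen.insert({m.rows[i], m.cols[i]}).second) mask[i] = 1;
  }
  return mask;
}

// Copy only the entries whose mask value is non-zero, preserving their order.
inline Coo compact_by_mask(const Coo& m, const std::vector<int>& mask)
{
  assert(mask.size() == m.rows.size());
  Coo out;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i] != 0) {
      out.rows.push_back(m.rows[i]);
      out.cols.push_back(m.cols[i]);
      out.vals.push_back(m.vals[i]);
    }
  }
  return out;
}
```

On the device, the compaction step would typically be a stream compaction (for example a copy-if over the mask), but the host version is enough to pin down the expected output in a test.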

ajschmidt8 and others added 30 commits July 14, 2020 17:05

update master references

a6677ca

REL DOC Updates for main branch switch

ad2d7d7

[skip ci] Update master references for main branch

Merge pull request rapidsai#272 from rapidsai/branch-21.06

e3c9344

REL Fix `21.06` Release Changelog

Merge pull request rapidsai#321 from rapidsai/branch-21.08

3b0a6d2

[HOTFIX] Remove `-g` from cython compile commands

REL v21.08.00 release

309ea1a

Merge pull request rapidsai#612 from rapidsai/branch-22.04

3740998

[RELEASE] v22.04

REL v22.04.00 release

e987ec8

update changelog

229b9f8

Merge pull request rapidsai#708 from rapidsai/branch-22.06

0eded98

[RELEASE] v22.06 raft

FIX update-version.sh

3e5a625

Merge pull request rapidsai#709 from rapidsai/branch-22.06

ad50a7f

FIX update-version.sh

REL v22.06.00 release

ed2c529

Merge pull request rapidsai#782 from rapidsai/branch-22.08

aae5e34

REL v22.08.00 release

87a7d16

Merge pull request rapidsai#908 from rapidsai/branch-22.10

1de93ba

REL v22.10.00 release

31ae597

Merge pull request rapidsai#988 from rapidsai/branch-22.10

c6e6ce8

[RELEASE] raft v22.10.01

REL v22.10.01 release

f7d2335

Merge pull request rapidsai#1063 from rapidsai/branch-22.12

c16fa56

REL v22.12.00 release

9a716b7

Merge pull request rapidsai#1101 from rapidsai/branch-22.12

60936ba

[RELEASE] raft v22.12.01 [skip-gpuci]

REL v22.12.01 release

a655c9a

Merge pull request rapidsai#1250 from rapidsai/branch-23.02

9a66f42

REL v23.02.00 release

69dce2d

Merge pull request rapidsai#1405 from rapidsai/branch-23.04

1467154

REL v23.04.00 release

7d1057e

REL v23.04.01 release

dc800d6

REL Merge pull request rapidsai#1486 from rapidsai/branch-23.04

520e12c

REL Update changelog v23.04

jperez999 added 3 commits July 31, 2024 12:02

Merge branch 'branch-24.10' into add-tfidf-bm25

1fc27f3

Merge branch 'branch-24.10' into add-tfidf-bm25

82cfb1f

Merge branch 'branch-24.10' into add-tfidf-bm25

1155609

cjnolet requested changes Aug 14, 2024

cjnolet and others added 5 commits August 29, 2024 12:26

Merge branch 'branch-24.10' into add-tfidf-bm25

6302957

fix preprocessing and make tests run on r random at generation

05f4af2

remove unnecessary imports

a1e3a48

remove log for tf

44f3e1c

added more template changes

e25e2de

jperez999 commented Sep 11, 2024

cpp/test/preprocess_utils.cu

jperez999 commented Sep 11, 2024

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh Outdated

jperez999 commented Sep 11, 2024

Merge branch 'branch-24.10' into add-tfidf-bm25

187e148

jperez999 requested a review from cjnolet September 11, 2024 17:54

jperez999 and others added 5 commits September 18, 2024 12:03

Merge branch 'branch-24.10' into add-tfidf-bm25

ec4e4a2

remove excess thrust calls

e6d2c1c

add better comment on inputs for tests

5120c97

Merge branch 'add-tfidf-bm25' of https://github.com/jperez999/raft into add-tfidf-bm25

81e2a41

Merge branch 'branch-24.10' into add-tfidf-bm25

90373ab

cjnolet reviewed Sep 24, 2024

cpp/include/raft/sparse/matrix/preprocessing.cuh Outdated

cjnolet reviewed Sep 24, 2024

cpp/include/raft/sparse/neighbors/knn.cuh Outdated

cjnolet reviewed Sep 24, 2024

cpp/include/raft/sparse/neighbors/knn.cuh Outdated

cjnolet reviewed Sep 24, 2024

cjnolet changed the base branch from branch-24.10 to branch-24.12 September 26, 2024 14:47

jperez999 and others added 4 commits September 26, 2024 11:13

fixed scale errors

87a729c

remove vector based public apis

63576b0

add in bfknn tests for csr and coo sparse matrices

c123acb

Merge branch 'branch-24.12' into add-tfidf-bm25

29f14d9

jperez999 requested a review from cjnolet September 30, 2024 15:02
