Thoughts about design philosophy of RankLLM #109

Open · lintool opened this issue Apr 19, 2024 · 6 comments

@lintool (Member) commented Apr 19, 2024

What is RankLLM? I can think of two obvious answers:

Approach 1. RankLLM is a fully-integrated layer on top of Anserini and Pyserini.

If this is the case, then we need "deep" integration with Pyserini, pulling it in as a dependency (perhaps with parts of it optional, etc.). Iteration would be coupled to Pyserini's release cycle, and likely slower.

Approach 2. RankLLM is a lightweight general-purpose reranking library.

Basically, we can rerank anything: just give us something in this JSON format, and we'll rerank it for you. By the way, you can get the candidates from Pyserini; here's the command you run.

In this case, RankLLM does not need to have Pyserini as a dependency. We just need shim code in Pyserini to get its output into the right format, and the same for Anserini directly.

Integration is not as tight, but this simplifies dependencies quite a bit...
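To make Approach 2 concrete, here is a rough sketch of what the "bring your own data" contract could look like. The field names and the `rerank` entry point below are illustrative assumptions on my part, not RankLLM's actual API:

```python
# Hypothetical sketch of the "bring your own data" contract (Approach 2).
# Candidates can come from any first-stage retriever (Pyserini, Anserini,
# LangChain, a vector DB, ...), serialized in a simple JSON-style structure.
request = {
    "query": "how do neural rerankers work?",
    "candidates": [
        {"docid": "D1", "score": 12.3, "text": "first-stage hit text ..."},
        {"docid": "D2", "score": 11.8, "text": "another candidate passage ..."},
    ],
}

# RankLLM's job would then be a single call that returns the same structure,
# with candidates reordered (and rescored) by the LLM reranker, e.g.:
#
#   reranked = rerank(request, model="...")   # hypothetical entry point
#
# Retrieval itself stays outside the library; a Pyserini user would produce
# `request` with their usual search command plus a small conversion shim.
```

The key property in this sketch is that RankLLM never imports Pyserini; it only consumes and emits this shape.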


Thoughts on these two approaches, @ronakice @sahel-sh?

@sahel-sh (Member) commented

Copy-pasting comments from our Slack discussion:

Ronak: "yup these are two directions we can take it in. I am not sure what do you prefer
@sahel Sharifymoghaddam
? I think people probably want more in the Approach 2 direction. For me I always run baselines in Anserini/Pyserini so approach 1 is completely fine too;"

Sahel: "When I decided to keep it as a separate repo, I had option 2 in mind as well. That's why it has a pyserini retriever. I see pros and cons for both approaches. My main concern about the #1 is expanding rankllm. For example adding training and other types of ranking prompts like pointwise. I think having it separate might make it easier to expand. I think Pyserini on its own is large enough and expanding.
But I also see for us as a lab, a cohesive retrieval and rerank can be an umbrella for everything, for example repllama and rankllama.
I personally prefer #2 for easier maintenance and greater visibility. But I don't think we should decide based on that.
The main question is: moving forward what would be the main usage of rankllm. If it is some basic retrieval with study of llms for ranking, it is fine as is.(I.e.a pyserini wrapper inside rankllm repo for an optional retrieval, or bring your own retrieved data like Ronak does, or heavily caching/storing retrieved results)
If you think retrieval would be equally important to our users, then maybe keeping it in the same repo as pyserini guarantees better feature parity. Like some new retriever would be directly available for reranking too)"

@sahel-sh (Member) commented

Current state of the design is available in these examples:

Comment from @lintool on the current design:
I like calling approach 2 "bring your own data". The current design is the worst of both worlds, in the sense that (1) it's difficult for us to maintain, and (2) the user doesn't know what to do.

@ronakice (Member) commented

Just dropping thoughts here:

> My main concern about Option 1 is expanding RankLLM, for example adding training and other types of ranking prompts like pointwise. I think having it separate might make it easier to expand; Pyserini on its own is large enough and expanding.

I'm not sure if one would hold the other down generally (besides training). With training specifically, I think the dependency chart with Pyserini will be affected (training frameworks churn through torch/transformers versions quickly, while Pyserini moves more slowly). An additional issue with adding training is that we'll have to re-benchmark our models whenever these changes are made, especially if we maintain 2CR pages.

> I personally prefer #2 for easier maintenance and greater visibility, but I don't think we should decide based on that.
> The main question is: moving forward, what would be the main usage of RankLLM? If it is some basic retrieval with a study of LLMs for ranking, it is fine as is (i.e., a Pyserini wrapper inside the RankLLM repo for optional retrieval, or bring your own retrieved data like Ronak does, or heavily caching/storing retrieved results).
> If you think retrieval would be equally important to our users, then maybe keeping it in the same repo as Pyserini guarantees better feature parity (like a new retriever would be directly available for reranking too).

I think retrieval is always going to be important to users (even in our pipelines), but at the end of the day, they might not use Pyserini for it. Academically, we do need these coupled well, and I'm sure the community will use it. Practically, I think people will just use LangChain/LlamaIndex/Vespa most of the time, which can interface with RankLLM to form a multi-stage system that way. At least, so I think.
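Following up on that point, here is a minimal sketch of how an arbitrary first stage could feed RankLLM under Approach 2, assuming the hypothetical candidate format sketched earlier; the adapter and its field names are illustrative, not an existing RankLLM or LangChain API:

```python
# Sketch only: feeding RankLLM from a non-Pyserini first stage under Approach 2.
# Assumes any retriever that can yield (doc_id, text, score) tuples; the target
# dict layout mirrors the hypothetical "bring your own data" format above.

def to_rankllm_request(query, hits):
    """Convert arbitrary first-stage hits into the candidate structure."""
    return {
        "query": query,
        "candidates": [
            {"docid": doc_id, "score": score, "text": text}
            for doc_id, text, score in hits
        ],
    }

# Stand-in first stage (could be LangChain, LlamaIndex, Vespa, a vector DB, ...):
hits = [("D7", "some passage text", 0.83), ("D2", "another passage", 0.79)]
request = to_rankllm_request("how do neural rerankers work?", hits)
```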

@ronakice (Member) commented

Yup, I am not sure it is worse than 2; it just comes with some of the prerequisite baggage of 1, which makes 2 a bit annoying. I would say there's a lot to be done to make it easier and more accessible to use, simplifying some workflows, improving consistency, etc., but those can be worked on.

@sahel-sh (Member) commented

I agree with @ronakice: as a retriever, I think hybrid search via LangChain, or even simply BM25 via LangChain, is as popular as Pyserini, if not more so.
Decoupling RankLLM from Pyserini might increase its usability.

@lintool (Member, Author) commented Apr 21, 2024

Seems like we're leaning toward Approach 2. I concur with this decision.
