Request: Separate thresholds for valid topics and invalid topics. #13

Open
JosephCatrambone opened this issue Aug 15, 2024 · 1 comment

JosephCatrambone commented Aug 15, 2024

As of writing, there's only one threshold for the zero-shot topics, and it's used as the cutoff for whether a topic is considered 'found' or not. Having separate thresholds for the positive and negative sides of the equation would allow us to perform more nuanced filtering, like: "It might not be about sports, but it's definitely not about travel."

Consider the case where our threshold is 0.5, the default. If we assume the false-positive rate here is 4% per topic[1] and that topics are flagged independently, then adding ten negative topics means our odds of accidentally flagging something are 1 - (1 - 0.04)^10, or about 33%.
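
To make that arithmetic easy to play with, here's a small sketch. It assumes each topic is flagged independently at the same per-topic rate, which is a simplification, and the function name is just for illustration:

```python
def any_false_positive_rate(per_topic_rate: float, num_topics: int) -> float:
    """Chance that at least one of num_topics is falsely flagged.

    Assumes every topic has the same false-positive rate and that
    topics are flagged independently of each other.
    """
    return 1.0 - (1.0 - per_topic_rate) ** num_topics


# How the combined rate grows as negative topics are added.
for n in (1, 5, 10, 20):
    print(n, round(any_false_positive_rate(0.04, n), 3))
```

With a 4% per-topic rate and ten topics this comes out to roughly 0.335, in line with the figure above.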

It would be nice to be able to tune that.

I imagine the change would be something akin to:

        valid_topics = model_input["valid_topics"]
        invalid_topics = model_input["invalid_topics"]
        candidate_topics = valid_topics + invalid_topics
        # Look thresholds up by topic name rather than by position: if
        # self._classifier is a Hugging Face zero-shot pipeline, the returned
        # labels are sorted by score, not in the order of candidate_topics.
        thresholds = {topic: self._zero_shot_threshold_valid for topic in valid_topics}
        thresholds.update({topic: self._zero_shot_threshold_invalid for topic in invalid_topics})

        result = self._classifier(text, candidate_topics)
        found_topics = [
            topic
            for topic, score in zip(result["labels"], result["scores"])
            if score > thresholds[topic]
        ]
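
For anyone who wants to try the idea outside the validator, here's a self-contained sketch. `find_topics` and `fake_classifier` are hypothetical names, and the scores are made up; the stand-in only mimics the `labels`/`scores` result shape and returns labels sorted by score. Thresholds are matched to topics by name, so the result doesn't depend on the order the classifier returns them in.

```python
from typing import Callable, Dict, List


def find_topics(
    text: str,
    valid_topics: List[str],
    invalid_topics: List[str],
    classifier: Callable[[str, List[str]], Dict[str, list]],
    valid_threshold: float = 0.5,
    invalid_threshold: float = 0.5,
) -> List[str]:
    """Return topics whose score exceeds the threshold for their side."""
    thresholds = {topic: valid_threshold for topic in valid_topics}
    thresholds.update({topic: invalid_threshold for topic in invalid_topics})
    result = classifier(text, valid_topics + invalid_topics)
    return [
        topic
        for topic, score in zip(result["labels"], result["scores"])
        if score > thresholds[topic]
    ]


def fake_classifier(text: str, candidate_topics: List[str]) -> Dict[str, list]:
    # Hypothetical stand-in with fixed, made-up scores, returned
    # sorted by score in descending order.
    scores = {"sports": 0.45, "travel": 0.6, "food": 0.1}
    ranked = sorted(candidate_topics, key=lambda t: scores[t], reverse=True)
    return {"labels": ranked, "scores": [scores[t] for t in ranked]}


# Lenient on valid topics, strict on invalid ones.
found = find_topics(
    "some text", ["sports", "food"], ["travel"], fake_classifier,
    valid_threshold=0.4, invalid_threshold=0.8,
)
print(found)
```

With these made-up scores, a single 0.5 threshold would flag only "travel", while the split thresholds pick up "sports" and let "travel" through.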

[1] Source: lost the original link so the new source is 'trust me, friendo'.

@JosephCatrambone
Contributor Author

I'm not sure if this merits a separate discussion, but was the default threshold originally selected to optimize for fewer false negatives, so as to more readily defer to GPT, or was it picked as an overall optimum?

zsimjee pushed a commit that referenced this issue Oct 21, 2024
-adds dynamic metadata based filtering