Hi! I am working on fine-tuning the Kosmos-2 model for my own application. In short, the target may appear multiple times in an image (e.g., cars in a parking lot), but there can also be images with only one target.
Right now, I am preparing the dataset as follows:
```python
if len(bboxes) > 1:
    text = "<grounding>" + f"<phrase> several {target}s</phrase>"
else:
    text = "<grounding>" + f"<phrase> a {target}</phrase>"
data_list.append({'bbox': [bboxes], 'image': image, 'text': text})
```
In this code, `bboxes` is the human-annotated bounding boxes for one image, formatted as a list of lists of tuples, and `{target}` is the placeholder for my target (a noun).
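For reference, the snippet above can be written as a self-contained sketch. This is only an illustration of my setup (the `build_example` helper name and the normalized `[0, 1]` coordinate convention are assumptions, not part of the original code):

```python
def build_example(image, target, bboxes):
    """Build one training example for a single target noun.

    `bboxes`: list of (x1, y1, x2, y2) tuples for ONE image, assumed
    here to be in normalized [0, 1] coordinates; adjust if your
    annotations are in pixels.
    """
    if len(bboxes) > 1:
        phrase = f"several {target}s"
    else:
        phrase = f"a {target}"
    text = f"<grounding><phrase> {phrase}</phrase>"
    # Every box is grouped under the single phrase, so the intent is
    # to supervise the model to emit all of them, not just the first.
    return {"bbox": [bboxes], "image": image, "text": text}

example = build_example(None, "car", [(0.1, 0.2, 0.3, 0.4), (0.5, 0.5, 0.8, 0.9)])
```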
When I train the model with such prompts, it still outputs one and only one bounding box for the target, even when there are multiple targets in the image.
For example, if the target is "car", the model will only output a bounding box for one of the multiple cars in the image.
May I ask how I can solve this issue?
Note
"Car" is an random example, the target is something we believe it's rare in the Kosmos-2 pre-training data.
Hello, as you know, we haven't fine-tuned this model on any specific object detection dataset, so we cannot control how many bboxes the model will generate; it could be one or multiple.
Perhaps you can try some different prompts, e.g. `Describe this image in detail. a {target}` / `several {target}s`.
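If it helps, the suggested prompt variants could be generated with a small helper like this (a sketch only; the exact caption-style wording is illustrative, and `make_prompt` is a hypothetical name):

```python
def make_prompt(target, n_boxes):
    """Sketch of the suggested prompt variant: a caption-style prefix
    followed by a grounded phrase for the target noun."""
    noun = f"several {target}s" if n_boxes > 1 else f"a {target}"
    # The prefix wording here is only an example; trying a few
    # phrasings and comparing results is the point of the suggestion.
    return f"<grounding> Describe this image in detail: <phrase> {noun}</phrase>"
```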
Model I am using: Kosmos-2