python ./utils/trans_img2depth.py \
--input_file image.jsonl \
--output_folder <depth-map-folder/> \
--image_folder <your-image-folder/> \
--start_line 0 \
--end_line 999
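Every script in this pipeline accepts `--start_line` and `--end_line`, so a long JSONL file can be processed in slices (in parallel, or resumed after a crash). A minimal launcher sketch, assuming the line range is inclusive and your file has 10,000 records (both are assumptions; check against your data):

```python
# Hypothetical chunk launcher: run trans_img2depth.py over a long JSONL in 1000-line slices.
import subprocess

total_lines = 10_000  # assumption: number of records in image.jsonl
chunk = 1000

for start in range(0, total_lines, chunk):
    subprocess.run(
        [
            "python", "./utils/trans_img2depth.py",
            "--input_file", "image.jsonl",
            "--output_folder", "depth-map-folder/",
            "--image_folder", "your-image-folder/",
            "--start_line", str(start),
            "--end_line", str(start + chunk - 1),  # assumption: end_line is inclusive
        ],
        check=True,
    )
```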
Prepare your dataset (e.g., the image.jsonl used above) in JSONL format, one record per line:
{"image": "xxx.jpg"}
Then run:
python extract/extract_fr_img.py \
--test_task DenseCap \
--config_file ./extract/configs/GRiT_B_DenseCap_ObjectDet.yaml \
--confidence_threshold 0.55 \
--image_folder <your-image-path/> \
--input_file <path to your_image.jsonl> \
--output_file <obj_extr_from_img.jsonl> \
--start_line 0 \
--end_line 999 \
--visualize_output <visualize-output-path> \
--opts MODEL.WEIGHTS ./ckpt/grit_b_densecap_objectdet.pth
You will get <obj_extr_from_img.jsonl>:
{"image": "xxx.jpg", "extr_obj_from_img": ["obj1","obj2"], "bounding_boxes": [[206, 137, 426, 364], [418, 119, 639, 388]]}
Prepare a JSONL file that contains the image path and the corresponding description generated by MLLMs, as follows:
{"image": "xxx.jpg", "description": "xxxxxxxx."}
We utilize LLMs to extract objects from the descriptions; both a GPT-based and a LLaMA-based extraction script are provided:
- GPT version: remember to change the `query_ChatGPT` function to your own query method (see the sketch after the command below).
python extract/extract_fr_desc-gpt.py \
--input_file_path <path to your.jsonl> \
--output_file_path <obj_extr_from_desc.jsonl> \
--start_line 0 \
--end_line 999
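For reference, a minimal `query_ChatGPT` replacement using the official openai Python client might look like the following (the model name and credential setup are assumptions; adapt them to your own access):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_ChatGPT(prompt: str) -> str:
    # Send one user prompt and return the model's text reply.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-completions model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```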
- LLaMA version:
CUDA_VISIBLE_DEVICES=4,5,6,7 python ./extract/extract_fr_desc-llama.py \
--input_file description.jsonl \
--output_file <obj_extr_from_desc.jsonl> \
--stop_tokens "<|eot_id|>" \
--prompt_structure "<|begin_of_text|><|start_header_id|>user<|end_header_id|>{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" \
--start_line 0 \
--end_line 999
Then you will get <obj_extr_from_desc.jsonl>:
{"image": "xxx.jpg", "extr_obj_fr_desc": ["obj1","obj2"], "description": "xxxxxxxx."}
Run:
python filter/filter_fr_desc.py \
--model_config ./filter/GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py \
--model_checkpoint ./ckpt/groundingdino_swinb_ogc.pth \
--box_threshold 0.20 \
--text_threshold 0.18 \
--input_file <obj_extr_from_desc.jsonl> \
--output_file <hal_from_desc.jsonl> \
--image_folder <your-image-path/> \
--start_line 0 \
--end_line 999
Then you will get <hal_from_desc.jsonl>:
{"image": "xxx.jpg", "del_obj_from_desc": ["hal2"], "description": "xxxxxxxx."}
Run:
python fg_annotation/mask_depth.py \
--input_path <obj_extr_from_img.jsonl> \
--output_path <fg_anno.jsonl> \
--image_folder <your-image-folder/> \
--image_depth_folder <depth-map-folder/> \
--start_line 0 \
--end_line 999
Then you will get <fg_anno.jsonl>:
{"image": "xxx.jpg", "extr_obj_from_img": ["obj2"], "bounding_boxes": [[418, 119, 639, 388]], "object_depth": [83], "size": [12428], "width": 640, "height": 480}
First, concatenate <fg_anno.jsonl> and <hal_from_desc.jsonl> into <your.jsonl> as follows (a merge sketch is given after the example):
{"image": "xxx.jpg", "del_obj_from_desc": ["hal2"], "extr_obj_from_img": ["obj2"], "bounding_boxes": [[418, 119, 639, 388]], "object_depth": [83], "size": [12428], "width": 640, "height": 480, "description": "xxxxxxxx."}
Then run:
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./refine/add_detail.py \
--input_file <your.jsonl> \
--output_file <refined_desc.jsonl> \
--stop_tokens "<|eot_id|>" \
--prompt_structure "<|begin_of_text|><|start_header_id|>user<|end_header_id|>{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" \
--start_line 0 \
--end_line 999
You will get a more detailed description:
{"image": "xxx.jpg", "description": "xxxxxxxx."}