
Usage of IT framework

1. Transfer Image to Depth Map

python ./utils/trans_img2depth.py \
    --input_file image.jsonl \
    --output_folder <depth-map-folder/> \
    --image_folder <your-image-folder/> \
    --start_line 0 \
    --end_line 999
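
If you do not already have image.jsonl, a minimal sketch for building it from an image folder is shown below. It assumes the {"image": "xxx.jpg"} format used in step 2.1; the folder path is a placeholder to replace with your own.

import json
import os

image_folder = "your-image-folder/"   # placeholder, replace with your image folder

with open("image.jsonl", "w") as f:
    for name in sorted(os.listdir(image_folder)):
        # keep only common image extensions
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            f.write(json.dumps({"image": name}) + "\n")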

2.1 Extract Objects from Images

Prepare your dataset in jsonl format as follows:

{"image": "xxx.jpg"} 

Then run:

python extract/extract_fr_img.py \
    --test_task DenseCap \
    --config_file ./extract/configs/GRiT_B_DenseCap_ObjectDet.yaml \
    --confidence_threshold 0.55 \
    --image_folder  <your-image-path/> \
    --input_file  <path to your_image.jsonl> \
    --output_file  <obj_extr_from_img.jsonl> \
    --start_line 0 \
    --end_line 999 \
    --visualize_output <visualize-output-path> \
    --opts MODEL.WEIGHTS ./ckpt/grit_b_densecap_objectdet.pth 

You will get <obj_extr_from_img.jsonl>:

{"image": "xxx.jpg", "extr_obj_from_img": ["obj1","obj2"], "bounding_boxes": [[206, 137, 426, 364], [418, 119, 639, 388]]}

2.2 Extract Objects from Descriptions

Prepare a jsonl file that contains the image path and the corresponding description generated by MLLMs, as follows:

{"image": "xxx.jpg", "description": "xxxxxxxx."} 

We use LLMs to help with the extraction; both a GPT-based and a LLaMA-based script are provided:

  • gpt version: remember to replace the query_ChatGPT function with your own query method.
python extract/extract_fr_desc-gpt.py \
    --input_file_path <path to your.jsonl> \
    --output_file_path <obj_extr_from_desc.jsonl> \
    --start_line 0 \
    --end_line 999
  • llama version
CUDA_VISIBLE_DEVICES=4,5,6,7 python ./extract/extract_fr_desc-llama.py \
    --input_file description.jsonl \
    --output_file <obj_extr_from_desc.jsonl> \
    --stop_tokens "<|eot_id|>" \
    --prompt_structure "<|begin_of_text|><|start_header_id|>user<|end_header_id|>{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" \
    --start_line 0 \
    --end_line 999

Then you will get <obj_extr_from_desc.jsonl>:

{"image": "xxx.jpg", "extr_obj_fr_desc": ["obj1","obj2"], "description": "xxxxxxxx."}

3.1 Filter Hallucinations within Description

Run:

python filter/filter_fr_desc.py \
    --model_config ./filter/GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py \
    --model_checkpoint ./ckpt/groundingdino_swinb_ogc.pth \
    --box_threshold 0.20 \
    --text_threshold 0.18 \
    --input_file <obj_extr_from_desc.jsonl> \
    --output_file <hal_from_desc.jsonl> \
    --image_folder <your-image-path/> \
    --start_line 0 \
    --end_line 999

Then you will get <hal_from_desc.jsonl>:

{"image": "xxx.jpg", "del_obj_from_desc": ["hal2"], "description": "xxxxxxxx."}

3.2 Fine-grained Annotation for Objects in Images

Run:

python fg_annotation/mask_depth.py \
    --input_path <obj_extr_from_img.jsonl> \
    --output_path <fg_anno.jsonl> \
    --image_folder <your-image-folder/> \
    --image_depth_folder <depth-map-folder/> \
    --start_line 0 \
    --end_line 999

Then you will get <fg_anno.jsonl>:

{"image": "xxx.jpg", "extr_obj_from_img": ["obj2"], "bounding_boxes": [[418, 119, 639, 388]], "object_depth": [83], "size": [12428], "width": 640, "height": 480}

4. Textualization Recaptioning

First merge <fg_anno.jsonl> and <hal_from_desc.jsonl> into <your.jsonl> with the following format (a merging sketch is given after the example):

{"image": "xxx.jpg", "del_obj_from_desc": ["hal2"], "extr_obj_from_img": ["obj2"], "bounding_boxes": [[418, 119, 639, 388]], "object_depth": [83], "size": [12428], "width": 640, "height": 480, "description": "xxxxxxxx."}

Then run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python ./refine/add_detail.py \
    --input_file <your.jsonl> \
    --output_file <refined_desc.jsonl> \
    --stop_tokens "<|eot_id|>" \
    --prompt_structure "<|begin_of_text|><|start_header_id|>user<|end_header_id|>{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>" \
    --start_line 0 \
    --end_line 999

You will get a more detailed description:

{"image": "xxx.jpg", "description": "xxxxxxxx."}