
CLIP

C++ implementation of CLIP inference. The models are slightly modified (nothing major); the changes and the model-export code are linked in this README, and the model files, including the AX650 models, are in Releases. ChineseCLIP is now supported as well.

Demo video: zh_clip-2023-10-20_10.19.10.mp4

Other interesting project: SAM-ONNX-AX650-CPP

Build

mkdir build
cd build

If building for x86 with onnxruntime:

cmake -DONNXRUNTIME_DIR=${onnxruntime_dir} -DOpenCV_DIR=${opencv_cmake_file_dir} ..
make -j4

If building for AX650:

cmake -DONNXRUNTIME_DIR=${onnxruntime_dir} -DOpenCV_DIR=${opencv_cmake_file_dir} -DBSP_MSP_DIR=${msp_out_dir} -DBUILD_WITH_AX650=ON ..
make -j4

The AX650 build links against aarch64-none-gnu builds of these libraries:
onnxruntime
opencv

Resource

Google Drive

ONNX

Export ONNX

The export scripts live in these forks:

ZHEQIUSHUI/CLIP
ZHEQIUSHUI/Chinese-CLIP

Get the original models

Export the ONNX models yourself:

# Original Clip
git clone https://github.com/ZHEQIUSHUI/CLIP.git
cd CLIP
python onnx_export.py

# Chinese Clip
git clone https://github.com/ZHEQIUSHUI/Chinese-CLIP.git
cd Chinese-CLIP
git checkout ax650

# download weights
cd weights
./downloads.sh

# get onnx model
cd ..
./convert.sh

# onnxsim model
cd ax650
./onnxsim.sh

Or download the models directly from the Releases page:

# Chinese Clip model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.axmodel
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.img.fp32.onnx
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/cnclip/cnclip_vitb16.txt.fp32.onnx

# feature matmul model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/feature_matmul.onnx

# Original Clip model
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/image_encoder.onnx
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/image_encoder.axmodel
wget https://github.com/ZHEQIUSHUI/CLIP-ONNX-AX650-CPP/releases/download/3models/text_encoder.onnx
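
Whether you export the ONNX files yourself or download them from Releases, it can be worth confirming their input/output names and shapes before wiring them into the pipeline. Below is a minimal sketch using the onnxruntime C++ API (not part of this repo; it assumes onnxruntime 1.13+ for GetInputNameAllocated and a Linux-style char* model path):

#include <onnxruntime_cxx_api.h>
#include <cstdio>

int main(int argc, char** argv) {
    const char* model_path = argc > 1 ? argv[1] : "image_encoder.onnx";

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "probe");
    Ort::SessionOptions opts;
    Ort::Session session(env, model_path, opts);
    Ort::AllocatorWithDefaultOptions alloc;

    // Print every input name and shape (-1 marks a dynamic dimension)
    for (size_t i = 0; i < session.GetInputCount(); ++i) {
        auto name  = session.GetInputNameAllocated(i, alloc);
        auto shape = session.GetInputTypeInfo(i).GetTensorTypeAndShapeInfo().GetShape();
        printf("input  %s:", name.get());
        for (auto d : shape) printf(" %lld", (long long)d);
        printf("\n");
    }
    // Same for outputs, e.g. image_features: 1 x 512
    for (size_t i = 0; i < session.GetOutputCount(); ++i) {
        auto name  = session.GetOutputNameAllocated(i, alloc);
        auto shape = session.GetOutputTypeInfo(i).GetTensorTypeAndShapeInfo().GetShape();
        printf("output %s:", name.get());
        for (auto d : shape) printf(" %lld", (long long)d);
        printf("\n");
    }
    return 0;
}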

Run on x86 with onnxruntime

English

./main --ienc image_encoder.onnx --tenc text_encoder.onnx --dec feature_matmul.onnx -v ../vocab.txt -i ../images/ -t ../text.txt 

inputs: 
              images: 1 x 3 x 224 x 224
output: 
      image_features: 1 x 512
decode Inference Cost time : 0.00040005s

per image:
                 image path\text|                            bird|                             cat|                             dog|
              ../images/bird.jpg|                            1.00|                            0.00|                            0.00|
               ../images/cat.jpg|                            0.00|                            0.99|                            0.01|
         ../images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                            bird|                            0.87|                            0.01|                            0.12|
                             cat|                            0.00|                            0.98|                            0.02|
                             dog|                            0.00|                            0.00|                            1.00|
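
The tables above come from a softmax over the scaled image-text similarity matrix produced by feature_matmul.onnx: both encoders output 512-d features, the features are L2-normalized and multiplied, and the result is softmaxed along the text axis for the per-image rows (or along the image axis for the per-text rows). A rough sketch of that post-processing, assuming the original CLIP logit scale of 100 (the exact constant baked into feature_matmul.onnx is an assumption):

#include <algorithm>
#include <cmath>
#include <vector>

// probs[i*T + t] = softmax over t of scale * cosine(img_feat[i], txt_feat[t]).
// This reproduces the "per image" rows; softmax over i instead gives "per text".
std::vector<float> per_image_probs(const std::vector<float>& img_feats, // I x D
                                   const std::vector<float>& txt_feats, // T x D
                                   size_t I, size_t T, size_t D = 512,
                                   float scale = 100.0f) {              // assumed CLIP logit scale
    std::vector<float> probs(I * T);
    for (size_t i = 0; i < I; ++i) {
        std::vector<float> logits(T);
        float maxv = -1e30f;
        for (size_t t = 0; t < T; ++t) {
            float dot = 0.f, ni = 0.f, nt = 0.f;
            for (size_t d = 0; d < D; ++d) {
                dot += img_feats[i * D + d] * txt_feats[t * D + d];
                ni  += img_feats[i * D + d] * img_feats[i * D + d];
                nt  += txt_feats[t * D + d] * txt_feats[t * D + d];
            }
            logits[t] = scale * dot / (std::sqrt(ni) * std::sqrt(nt) + 1e-12f);
            maxv = std::max(maxv, logits[t]);
        }
        float sum = 0.f;
        for (size_t t = 0; t < T; ++t) { logits[t] = std::exp(logits[t] - maxv); sum += logits[t]; }
        for (size_t t = 0; t < T; ++t) probs[i * T + t] = logits[t] / sum;
    }
    return probs;
}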

Chinese

./main -l 1 -v ../cn_vocab.txt -t ../cn_text.txt -i ../images/ --ienc ../onnx_models/vitb16.img.fp32.onnx --tenc ../onnx_models/vitb16.txt.fp32.onnx -d ../onnx_models/feature_matmul.onnx 

inputs: 
               image: 1 x 3 x 224 x 224
output: 
unnorm_image_features: 1 x 512
[I][              load_image_encoder][  20]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.0926369s
matmul Inference Cost time : 0.00045888s

per image:
                 image path\text|                          小鸟|                          猫咪|                          狗子|
              ../images/bird.jpg|                            1.00|                            0.00|                            0.00|
               ../images/cat.jpg|                            0.00|                            0.99|                            0.01|
         ../images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                          小鸟|                            0.77|                            0.22|                            0.01|
                          猫咪|                            0.00|                            1.00|                            0.00|
                          狗子|                            0.00|                            0.00|                            1.00|

Mixed Chinese and English

./main -l 1 -v ../cn_vocab.txt -t ../cn_text_mix.txt -i ../images/ --ienc ../onnx_models/vitb16.img.fp32.onnx --tenc ../onnx_models/vitb16.txt.fp32.onnx -d ../onnx_models/feature_matmul.onnx 

inputs: 
               image: 1 x 3 x 224 x 224
output: 
unnorm_image_features: 1 x 512
[I][              load_image_encoder][  20]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.106218s
matmul Inference Cost time : 0.000361136s

per image:
                 image path\text|                        小 bird|                         cat 咪|                     小 dog 子|
              ../images/bird.jpg|                           1.00|                           0.00|                         0.00|
               ../images/cat.jpg|                           0.00|                           0.95|                         0.05|
         ../images/dog-chai.jpeg|                           0.00|                           0.01|                         0.99|


per text:
                 text\image path|              ../images/bird.jpg|               ../images/cat.jpg|         ../images/dog-chai.jpeg|
                         小 bird|                            0.96|                            0.03|                            0.00|
                          cat 咪|                            0.00|                            0.93|                            0.07|
                       小 dog 子|                            0.00|                            0.01|                            0.99|
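
Both encoders consume the 1 x 3 x 224 x 224 image blob shown in the logs above. Below is a rough OpenCV sketch of CLIP-style preprocessing (BGR to RGB, resize to 224, normalize with the standard CLIP mean/std); whether this repo resizes directly or center-crops first is an assumption:

#include <opencv2/opencv.hpp>
#include <vector>

// Turn an 8-bit BGR cv::Mat into a 1 x 3 x 224 x 224 CHW float blob,
// normalized with the standard CLIP mean/std (RGB order).
std::vector<float> to_clip_blob(const cv::Mat& bgr, int size = 224) {
    static const float mean[3] = {0.48145466f, 0.4578275f, 0.40821073f};
    static const float stdv[3] = {0.26862954f, 0.26130258f, 0.27577711f};

    cv::Mat rgb, resized;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);
    cv::resize(rgb, resized, cv::Size(size, size)); // plain resize; center-crop is an alternative

    std::vector<float> blob(3 * size * size);
    for (int y = 0; y < size; ++y) {
        const cv::Vec3b* row = resized.ptr<cv::Vec3b>(y);
        for (int x = 0; x < size; ++x) {
            for (int c = 0; c < 3; ++c) {
                float v = row[x][c] / 255.0f;
                blob[c * size * size + y * size + x] = (v - mean[c]) / stdv[c];
            }
        }
    }
    return blob;
}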

AX650

Run on the AXERA AX650 chip

English

./main --ienc image_encoder.axmodel --tenc text_encoder.onnx -d feature_matmul.onnx  -v vocab.txt -t text.txt -i images/
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
decode Inference Cost time : 0.000754583s

per image:
                 image path\text|                            bird|                             cat|                             dog|
                 images/bird.jpg|                            1.00|                            0.00|                            0.00|
                  images/cat.jpg|                            0.01|                            0.98|                            0.01|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                            bird|                            1.00|                            0.00|                            0.00|
                             cat|                            0.00|                            0.99|                            0.01|
                             dog|                            0.00|                            0.00|                            1.00|

Chinese

./main -l 1 -v cn_vocab.txt -t cn_text.txt  -i images/ --ienc cn_clip_vitb16.axmodel --tenc vitb16.txt.fp32.onnx -d feature_matmul.onnx
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
[I][              load_image_encoder][  19]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.762541s
matmul Inference Cost time : 0.0007695s

per image:
                 image path\text|                            小鸟|                             猫咪|                            狗子|
                 images/bird.jpg|                            0.99|                            0.00|                            0.01|
                  images/cat.jpg|                            0.00|                            0.98|                            0.02|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                           小鸟|                             0.43|                            0.57|                            0.00|
                           猫咪|                             0.00|                            1.00|                            0.00|
                           狗子|                             0.00|                            0.14|                            0.86|

Mixed Chinese and English

./main -l 1 -v cn_vocab.txt -t cn_text_mix.txt  -i images/ --ienc cn_clip_vitb16.axmodel --tenc vitb16.txt.fp32.onnx -d feature_matmul.onnx
Engine creating handle is done.
Engine creating context is done.
Engine get io info is done.
Engine alloc io is done.
[I][                            init][ 275]: RGB MODEL
[I][              load_image_encoder][  19]: image feature len 512
[I][               load_text_encoder][ 101]: text feature len 512
[I][                  load_tokenizer][  75]: text token len 52
encode text Inference Cost time : 0.75124s
matmul Inference Cost time : 0.000727667s

per image:
                 image path\text|                         小 bird|                          cat 咪|                        小 dog 子|
                 images/bird.jpg|                            0.99|                            0.01|                            0.00|
                  images/cat.jpg|                            0.00|                            0.94|                            0.06|
            images/dog-chai.jpeg|                            0.00|                            0.00|                            1.00|


per text:
                 text\image path|                 images/bird.jpg|                  images/cat.jpg|            images/dog-chai.jpeg|
                        小 bird|                             0.92|                            0.08|                            0.00|
                         cat 咪|                             0.00|                            1.00|                            0.00|
                      小 dog 子|                             0.00|                            0.10|                            0.90|

Reference

CLIP
Chinese-CLIP
CLIP-ImageSearch-NCNN
