Android TensorFlow Lite Model Training Workflow

智灵, June 8, 2020

References:

Official documentation:

https://www.tensorflow.org/lite/examples

Problems encountered with the TensorFlow Object Detection API and their solutions

// Automatic annotation

https://github.com/yuxitong/AutoMarKingTensorFlowPython

// TensorFlow Android demo: recognition and detection of lane lines, vehicles, faces, actions, and skeletons (smoking, phoning, eyes closed, eyes open):

https://github.com/yuxitong/TensorFlowAndroidDemo/issues?q=is%3Aissue+is%3Aclosed

// Deploying a TensorFlow model to Android (.so library approach, somewhat dated)

https://medium.com/joytunes/deploying-a-tensorflow-model-to-android-69d04d1b0cba

// Android TensorFlow Machine Learning Example (.so library approach)

https://blog.mindorks.com/android-tensorflow-machine-learning-example-ff0e9b2654cc

// Training and serving a realtime mobile object detector in 30 minutes with Cloud TPUs (reliable)

https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193

Porting ssd_mobilenet to an Android client with TensorFlow Lite (reliable):

https://blog.csdn.net/qq_26535271/article/details/83031412

Official tutorial for running models on mobile with TensorFlow Lite:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_mobile_tensorflowlite.md

Model testing and model conversion (reliable):

https://blog.csdn.net/weixin_43056275/article/details/105225089?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-6&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-6#31_config_32

https://blog.csdn.net/chenmaolin88/article/details/79357263

Model conversion reference (quantization advice):

https://www.jianshu.com/p/71de8a49efd4

Quantization example, for reference:

https://blog.csdn.net/angela_12/article/details/85000072?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1

ERROR:

OOM when allocating tensor of shape [3,3,432,1] and type float

[[node gradients/zeros_144 (defined at /tensorflow/models/research/object_detection/model_lib.py:413) ]]

Fix: the batch_size in use was 10; reducing it to 5 resolved the error.

https://github.com/tensorflow/models/issues/1993

ERROR:

not a valid absolute pattern (absolute target patterns must start with exactly two slashes

Fix: use absolute paths, and wrap the input and output paths in single quotes.

0. Delete the Main folder inside ~/train_ssd_mobilenet/roadsign_data/PascalVOC/ImageSets

Delete the ~/train_ssd_mobilenet/roadsign_data/tfrecords folder

Place the images and XML annotation files in the corresponding folders under ~/train_ssd_mobilenet/roadsign_data/PascalVOC

Edit roadsign_label_map.pbtxt under ~/train_ssd_mobilenet/roadsign_data so that it lists your own labels

Modify the config file if necessary.

Original location of the config files: /home/ubuntu/tensorflow/models/research/object_detection/samples/configs/

Reference post: https://blog.csdn.net/qq_26535271/article/details/84930868
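The label map edited in the step above is a small text file of item entries. A minimal sketch of generating one is shown here; the class names and the helper name build_label_map are made up for illustration, and the ids start at 1 because the Object Detection API expects label map ids to begin at 1.

```python
def build_label_map(class_names):
    """Return label-map text in pbtxt form, with ids starting at 1."""
    items = []
    for index, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (index, name))
    return "\n".join(items)


# Hypothetical class names; replace them with your own labels.
labels = ["stop_sign", "speed_limit"]
label_map_text = build_label_map(labels)
print(label_map_text)

# To write the file, something like this would do (path taken from the notes above):
# with open("roadsign_data/roadsign_label_map.pbtxt", "w") as f:
#     f.write(label_map_text)
```

The names here must match the class names used in the XML annotations, otherwise the TFRecord step has nothing to map them to.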

1. Prepare the dataset and build

cd ~/train_ssd_mobilenet
# Generates trainval.txt under ~/train_ssd_mobilenet/roadsign_data/PascalVOC/ImageSets
python3 build1_trainval.py
# Generates ~/train_ssd_mobilenet/roadsign_data/tfrecords
python3 build2_tf_record.py

1.1 Choosing a training configuration:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

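// TODO: I forgot to shuffle the data when building the dataset. The images were stored in sequential order (0, 1, ...), which causes problems during training. Always shuffle the dataset (shuffle=true).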
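One way to avoid the ordering problem is to shuffle the image ids before they are written to the train/val lists. The sketch below is only an illustration of that idea, not the contents of build1_trainval.py; the function name shuffled_split, the 90/10 split, and the fixed seed are all assumptions.

```python
import random


def shuffled_split(image_ids, train_fraction=0.9, seed=42):
    """Shuffle image ids reproducibly, then split them into train and val lists."""
    ids = sorted(image_ids)       # start from a stable order so the seed is meaningful
    rng = random.Random(seed)     # local generator, does not touch global random state
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical image ids (file names without extension).
image_ids = ["img_%04d" % i for i in range(10)]
train_ids, val_ids = shuffled_split(image_ids)
print(train_ids)
print(val_ids)
```

Shuffling at this stage complements, rather than replaces, the shuffle option in the input reader of the pipeline config.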

2. Training:

python3 ~/tensorflow/models/research/object_detection/model_main.py --alsologtostderr --model_dir=resnet-042401 --pipeline_config_path=./ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/pipeline.config

# PC version: Faster R-CNN
python3 ~/tensorflow/models/research/object_detection/model_main.py --alsologtostderr --model_dir=resnet-0530 --pipeline_config_path=./faster_rcnn_resnet101_coco.config

# Switched to the config from the TensorFlow samples; works so far
python3 ~/tensorflow/models/research/object_detection/model_main.py --alsologtostderr --model_dir=resnet-0523 --pipeline_config_path=ssd_mobilenet_v2_quantized_300x300_coco.config \
--num_train_steps=300000 --num_eval_steps=30

# Optional flags:
#   --num_train_steps=100000
#   --num_eval_steps=20
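Config hints:

eval_config:

num_examples: the number of images to evaluate

max_evals: the number of evaluation loops

Object Detection API directory: /home/ubuntu/GoogleAPI/models/research/object_detection

metrics_set: selects the evaluation metrics; the available options are defined in research/object_detection/legacy/evaluator.py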

Config file tuning references:

https://www.sohu.com/a/238172351_100177858

https://www.cnblogs.com/jwcz/p/11799507.html

https://github.com/tensorflow/models/blob/abd504235f3c2eed891571d62f0a424e54a2dabc/research/object_detection/protos/eval.proto#L8

Monitor progress: tensorboard --logdir resnet-042401

3. Export the model

python3 ~/tensorflow/models/research/object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path=./ssd_mobilenet_v2_quantized_300x300_coco.config \
--trained_checkpoint_prefix=./resnet-0523/model.ckpt-500000 \
--output_directory=./output-0523/tflite \
--add_postprocessing_op=true \
--max_detections=1000

# max_detections: the maximum number of objects expected to be detected

Exporting the PC model:

python3 ~/tensorflow/models/research/object_detection/export_inference_graph.py --input_type image_tensor --pipeline_config_path=./faster_rcnn_resnet101_coco.config --trained_checkpoint_prefix ./resnet-0415/model.ckpt-104947 --output_directory .

4. Convert to a TFLite model

bazel run -c opt tensorflow/lite/toco:toco -- \
--input_file='/home/ubuntu/train_ssd_mobilenet/output-0523/tflite/tflite_graph.pb' \
--output_file='/home/ubuntu/train_ssd_mobilenet/output-0523/tflite/detect.tflite' \
--input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
--input_shapes=1,900,900,3 \
--input_arrays=normalized_input_image_tensor \
--output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
--inference_type=QUANTIZED_UINT8 \
--mean_values=128.0 \
--std_values=128.0 \
--change_concat_input_ranges=false \
--allow_custom_ops

I am not sure about the following parameters and have not tested them:
--default_ranges_min=0 --default_ranges_max=6
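To make the mean_values and std_values flags above less abstract: the converter documentation describes the mapping from a quantized uint8 input to a real value as real = (quantized - mean) / std. The sketch below just evaluates that formula with the 128/128 values used in the command; the function names are my own, and the clamping in quantize is an assumption about how one would round-trip a value.

```python
def dequantize(q, mean_value=128.0, std_value=128.0):
    """Map a uint8 input value to the real value the float model would see."""
    return (q - mean_value) / std_value


def quantize(real, mean_value=128.0, std_value=128.0):
    """Inverse mapping, rounded and clamped to the uint8 range [0, 255]."""
    q = int(round(real * std_value + mean_value))
    return max(0, min(255, q))


low = dequantize(0)     # darkest pixel
mid = dequantize(128)   # mid-grey pixel
high = dequantize(255)  # brightest pixel
print(low, mid, high)
```

With mean 128 and std 128, pixel values 0..255 land roughly in [-1, 1), which matches the normalized input range SSD MobileNet models are usually trained with.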

Local model testing

(Seems to work only for models trained with Faster R-CNN? The .pb of a MobileNet-trained model returns HTTP 400.)

tensorflow_model_server --rest_api_port=9000 --model_name=aaa --model_base_path=/home/ubuntu/train_ssd_mobilenet/output-042401
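Once the server is up, it can be queried through TensorFlow Serving's REST predict endpoint (POST /v1/models/<name>:predict with an "instances" list). The sketch below only builds such a request with the standard library; it assumes the server runs on localhost with the port and model name from the command above, and that the model was exported with --input_type image_tensor so each instance is a height x width x 3 array of pixel values. The helper name build_predict_request is made up.

```python
import json
import urllib.request


def build_predict_request(image_rows, host="localhost", port=9000, model_name="aaa"):
    """Build a POST request for the TensorFlow Serving REST predict endpoint."""
    url = "http://%s:%d/v1/models/%s:predict" % (host, port, model_name)
    body = json.dumps({"instances": [image_rows]}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


# A tiny 2x2 RGB placeholder image; a real call would send decoded image pixels.
dummy_image = [[[0, 0, 0], [255, 255, 255]], [[255, 0, 0], [0, 0, 255]]]
request = build_predict_request(dummy_image)
print(request.full_url)

# Sending it requires the server to be running:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())["predictions"]
```

A 400 response usually carries a JSON error message in its body, which is worth printing when debugging the MobileNet case mentioned above.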

rcnn_web relay service dependencies (pip3, Python 3.6):

Keras (2.3.1)
Keras-Applications (1.0.8)
Keras-Preprocessing (1.1.0)

Flask (1.1.2)

Without these packages the service will not run.

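Deploying the server
  • Install: pip3 install gunicorn

  • Usage: gunicorn [OPTIONS] module_name:variable_name

gunicorn -b 0.0.0.0:5000 -w 9 wsgi:app

  • On an Alibaba Cloud ECS Ubuntu instance, gunicorn could not be found; uninstall it and install gunicorn3 instead

    gunicorn3 -b xxxx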
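For reference, the wsgi:app argument means "the variable app inside the module wsgi". The sketch below shows the smallest possible such module using a plain WSGI callable; it is a stand-in for illustration only, since the actual relay service here is a Flask app (a Flask application object is itself a WSGI callable, so it can be exposed under the same name).

```python
# wsgi.py (hypothetical contents; the module and variable names match "wsgi:app")
def app(environ, start_response):
    """Minimal WSGI application that answers every request with 'ok'."""
    body = b"ok"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

With this file in the working directory, gunicorn -b 0.0.0.0:5000 -w 9 wsgi:app would serve it the same way as the real service.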

5. Modify the official demo files:

Reference demo: https://github.com/fairytale110/Camera2WithTFLite

TFLiteObjectDetectionAPIModel.java
 // in label file and class labels start from 1 to number_of_classes+1,
// while outputClasses correspond to class index from 0 to number_of_classes
int labelOffset = 1;
- recognitions.add(
- new Recognition(
- "" + i,
- labels.get((int) outputClasses[0][i] + labelOffset),
- outputScores[0][i],
- detection));
+ final int classLabel = (int) outputClasses[0][i] + labelOffset;
+ if (inRange(classLabel, labels.size(), 0) && inRange(outputScores[0][i], 1, 0)) {
+ recognitions.add(
+ new Recognition(
+ "" + i,
+ labels.get(classLabel),
+ outputScores[0][i],
+ detection));
+ }
}
Trace.endSection(); // "recognizeImage"
return recognitions;
}

+ private boolean inRange(float number, float max, float min) {
+ return number < max && number >= min;
+ }
+

Problems encountered

Cannot copy between a TensorFlowLite tensor with shape: https://github.com/tensorflow/tensorflow/issues/29054

Solution: https://github.com/tensorflow/tensorflow/issues/22106

WARNING:tensorflow:Ignoring ground truth with image id 558212937 since it was previously added

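Solution:
Set both num_examples in the eval_config section of the config file and num_eval_steps in the training command to the number of images in the test set, or fewer. (My test set has 100 images, so I set both to 100.)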

num_examples in the config file, and num_eval_steps in the training command:

python object_detection/model_main.py \
--pipeline_config_path=object_detection/training/ssd_mobilenet_v1_coco.config \
--model_dir=object_detection/training \
--num_train_steps=50000 \
--num_eval_steps=100 \
--alsologtostderr

Original post: https://blog.csdn.net/weixin_46127907/java/article/details/105421607

Invalid argument: ValueError: Category stats do not exist Traceback (most recent call last)

tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
(1) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Shape_8/_4689]]

Cause: unknown

Solution: switch to the config file of the same name that ships with TensorFlow.

AssertionError: Bad argument number for Name: 3, expecting 4

Cause: reports online suggest it is related to the gast version? I have not dealt with it for now.

OOM when allocating tensor with shape[20,32,300,300] and type float

Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.41GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-25 11:56:55.277922: W tensorflow/core/common_runtime/bfc_allocator.cc:237]

Solution: reduce batch_size
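A rough way to see why a smaller batch_size helps: the size of a single float32 tensor is the product of its dimensions times 4 bytes. The sketch below does that arithmetic for the shape in the error above, assuming the first dimension (20) is the batch size; that reading of the shape is my assumption, and the helper name tensor_bytes is made up.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory footprint of one dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


full = tensor_bytes([20, 32, 300, 300])  # shape from the OOM message
half = tensor_bytes([10, 32, 300, 300])  # same tensor with the first dimension halved
print("%.1f MiB vs %.1f MiB" % (full / 2**20, half / 2**20))
```

Training keeps many such activations (plus their gradients) alive at once, so the total scales with the batch size even though one tensor on its own looks small.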

Ignoring detection with image id 1471494072 since it was previously added

W0425 14:46:33.695297 139864167134976 coco_evaluation.py:139] Ignoring detection with image id 1471494072 since it was previously added

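Duplicate results in the model's detections

After the output is parsed, a single object may end up with several bounding boxes. How do we decide which one is the most accurate? This is where non-maximum suppression (NMS) comes in: it suppresses the redundant boxes through an iterate, traverse, and eliminate process.

1. Sort all boxes by score and select the highest-scoring box.

2. Traverse the remaining boxes; delete any box whose overlap (IoU) with the current highest-scoring box exceeds a given threshold.

3. From the boxes not yet processed, select the next highest-scoring one and repeat the process.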
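The three steps above can be sketched in plain Python as follows. This is a minimal greedy NMS for illustration, not the implementation used inside TensorFlow; the box format (x_min, y_min, x_max, y_max), the 0.5 threshold, and the sample boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS, best score first."""
    # Step 1: sort box indices by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Step 2: drop every remaining box that overlaps the best box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step 3: the loop continues with the next highest-scoring survivor.
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap at all and survives.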

“/dev/kvm device: permission denied”

Syntax: chown [options] owner file

sudo chown -R username /dev/kvm

Further reading:

https://www.chainnews.com/articles/051424062222.htm