In this article:
- Implementation approach
- Implementation steps
- (1) Building from source
- (2) Model training and output
- (3) Model freezing
- Pitfall: the batchnorm bug
- (4) Model loading and running
- (5) Runtime issues
TensorFlow's official site currently provides release packages only for Python, C, Java, and Go; there is no C++ release package, and the site notes that stability is not guaranteed for any library other than Python. The Python API is also the most complete in terms of functionality. As everyone knows, Python has huge advantages in development efficiency and ease of use, but as an interpreted language it still has significant performance drawbacks. When productionizing AI services, the trend is to use Python as the tool for rapidly building models, and a compiled language (such as C++ or Java) to implement the serving program. This article focuses on how to serve models through TensorFlow's C++ API and the various problems encountered along the way.
Implementation approach
There are two ways to use the TensorFlow C++ library:
(1) The best approach is of course to build the graph directly in C++, but the C++ TensorFlow API is not yet as full-featured as the Python API. You can refer to the official example that builds a small graph in C++ (a minimal sketch follows this list). The C++ TensorFlow API also contains the classes that implement the CPU and GPU numeric kernels, which can be used to add new ops; see https://www.tensorflow.org/extend/adding_an_op
(2) The common approach: load a graph produced in Python from C++. This is the approach this article covers.
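As a rough illustration of option (1), here is a minimal sketch of building and running a tiny graph in-process with the C++ client API, modeled on the official small-graph example (the matrices and values are invented for illustration):

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"

int main()
{
    using namespace tensorflow;
    using namespace tensorflow::ops;
    Scope root = Scope::NewRootScope();
    // Build a tiny graph: c = (a * b) + 1
    auto a = Const(root, {{1.f, 2.f}, {3.f, 4.f}});   // 2x2 matrix
    auto b = Const(root, {{1.f}, {1.f}});             // 2x1 matrix
    auto c = Add(root, MatMul(root, a, b), Const(root, 1.f));
    // Run the graph and fetch the result
    ClientSession session(root);
    std::vector<Tensor> outputs;
    TF_CHECK_OK(session.Run({c}, &outputs));
    // outputs[0] now holds a 2x1 float tensor
    return 0;
}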
Implementation steps
(1) Build the TensorFlow C++ .so from source
(2) Train the model and export the results
(3) Freeze the model
(4) Load and run the model
(5) Runtime issues
(1) Building from source
Environment: TLinux 2.2 (our company build), GCC >= 4.8.5
Components: protobuf 3.3.0, bazel 0.5.0, Python 2.7, Java 8
Machine: at least 4 GB of RAM
a. Install Java 8
yum install java
b. Install protobuf 3.3.0
Download https://github.com/google/protobuf/archive/v3.3.0.zip
./configure && make && make install
c. Install bazel
Download https://github.com/bazelbuild/bazel/releases
sh bazel-0.5.0-installer-linux-x86_64.sh
d. Build the source
It is best to use the latest release: https://github.com/tensorflow/tensorflow/releases
bazel build //tensorflow:libtensorflow_cc.so
Problems you may encounter during the build:
Problem 1: fatal error: unsupported/Eigen/CXX11/Tensor: No such file or directory
Fix: install Eigen 3.3 or later.
Problem 2: java.io.IOException: Cannot run program "patch"
Fix: yum install patch
Problem 3: the build runs out of memory. This is why the machine requirement above calls for 4 GB of RAM; build on a machine with enough memory.
(2) Model training and output
For model training and export, you can follow this walkthrough: https://blog.metaflow.fr/tensorflow-saving-restoring-and-mixing-multiple-models-c4c94d5d7125 (there are also plenty of examples on Google). After training and saving, you end up with a set of model files: the checkpoint, graph.pb, and .meta files referenced in the freezing steps below.
(3) Model freezing
There are three ways to freeze a model:
a. The freeze_graph tool
bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
    --input_graph=graph.pb \
    --input_checkpoint=checkpoint \
    --output_graph=./frozen_graph.pb \
    --output_node_names=output/output/scores
b. The freeze_graph.py module
import os
from tensorflow.python.tools import freeze_graph

# We save out the graph to disk, and then call the const conversion
# routine. FLAGS is assumed to be defined elsewhere (e.g. via tf.app.flags).
checkpoint_state_name = "checkpoint"
input_graph_name = "graph.pb"
output_graph_name = "frozen_graph.pb"
input_graph_path = os.path.join(FLAGS.model_dir, input_graph_name)
input_saver_def_path = ""
input_binary = False
input_checkpoint_path = os.path.join(FLAGS.checkpoint_dir, 'saved_checkpoint') + "-0"
# Note that this normally should be only "output_node"!!!
output_node_names = "output/output/scores"
restore_op_name = "save/restore_all"
filename_tensor_name = "save/Const:0"
output_graph_path = os.path.join(FLAGS.model_dir, output_graph_name)
clear_devices = False
freeze_graph.freeze_graph(input_graph_path, input_saver_def_path,
                          input_binary, input_checkpoint_path,
                          output_node_names, restore_op_name,
                          filename_tensor_name, output_graph_path,
                          clear_devices, "")  # "" = no initializer nodes
c. Using the TensorFlow Python API
import os, argparse
import tensorflow as tf
from tensorflow.python.framework import graph_util

dir = os.path.dirname(os.path.realpath(__file__))

def freeze_graph(model_folder):
    # We retrieve our checkpoint fullpath
    checkpoint = tf.train.get_checkpoint_state(model_folder)
    input_checkpoint = checkpoint.model_checkpoint_path
    # We precise the file fullname of our freezed graph
    absolute_model_folder = "/".join(input_checkpoint.split('/')[:-1])
    output_graph = absolute_model_folder + "/frozen_model.pb"
    print(output_graph)
    # Before exporting our graph, we need to precise what is our output node
    # This is how TF decides what part of the graph it has to keep and what part it can dump
    # NOTE: this variable is plural, because you can have multiple output nodes
    output_node_names = "output/output/scores"
    # We clear devices to allow TensorFlow to control on which device it will load operations
    clear_devices = True
    # We import the meta graph and retrieve a Saver
    saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=clear_devices)
    # We retrieve the protobuf graph definition
    graph = tf.get_default_graph()
    input_graph_def = graph.as_graph_def()
    # Fix batch norm nodes
    for node in input_graph_def.node:
        if node.op == 'RefSwitch':
            node.op = 'Switch'
            for index in xrange(len(node.input)):
                if 'moving_' in node.input[index]:
                    node.input[index] = node.input[index] + '/read'
        elif node.op == 'AssignSub':
            node.op = 'Sub'
            if 'use_locking' in node.attr:
                del node.attr['use_locking']
    # We start a session and restore the graph weights
    with tf.Session() as sess:
        saver.restore(sess, input_checkpoint)
        # We use a built-in TF helper to export variables to constants
        output_graph_def = graph_util.convert_variables_to_constants(
            sess,  # the session is used to retrieve the weights
            input_graph_def,  # the graph_def is used to retrieve the nodes
            output_node_names.split(",")  # the output node names are used to select the useful nodes
        )
        # Finally we serialize and dump the output graph to the filesystem
        with tf.gfile.GFile(output_graph, "wb") as f:
            f.write(output_graph_def.SerializeToString())
        print("%d ops in the final graph." % len(output_graph_def.node))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_folder", type=str, help="Model folder to export")
    args = parser.parse_args()
    freeze_graph(args.model_folder)
Pitfall: the batchnorm bug
In a real project, models frozen with methods a and b failed with an error when loaded through the TensorFlow C++ API (loading the same model through the TensorFlow Python API reported the same error).
The cause is that the model uses batchnorm; the fix is the batch-norm node rewrite shown in method c above.
(4) Model loading and running
Building inputs and outputs
Handling model input and output mostly comes down to constructing the input and output matrices. Compared with Python's numpy, tensorflow::Tensor and Eigen::Tensor are quite painful to use, especially for creating matrices with dynamic shapes. If your compiler supports C++14, you can use the xtensor library, which is as powerful as numpy and has a very similar API. If you are on C++11, study the Eigen and tensorflow::Tensor documentation carefully. A few simple usage examples:
Tensor assignment:
tensorflow::Tensor four_dim_plane(tensorflow::DT_FLOAT, tensorflow::TensorShape({
    1, model_x_axis_len, model_y_axis_len, fourth_dim_size}));
auto plane_tensor = four_dim_plane.tensor<float, 4>();
for (uint32_t k = 0; k < array_plane.size(); ++k)
{
    for (uint32_t j = 0; j < model_y_axis_len; ++j)
    {
        for (uint32_t i = 0; i < model_x_axis_len; ++i)
        {
            plane_tensor(0, i, j, k) = array_plane[k](i, j);
        }
    }
}
Softmax:
Eigen::Tensor<float, 1> ModelApp::TensorSoftmax(const Eigen::Tensor<float, 1>& tensor)
{
    // Subtract the max before exponentiating for numerical stability
    Eigen::Tensor<float, 0> max = tensor.maximum();
    auto e_x = (tensor - tensor.constant(max())).exp();
    Eigen::Tensor<float, 0> e_x_sum = e_x.sum();
    return e_x / e_x_sum();
}
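As a hypothetical usage example (not from the original project), the flat output of a session run, as extracted in Predict below, could be normalized with this helper like so, assuming a rank-1 float output and a free-function version of TensorSoftmax:

#include <unsupported/Eigen/CXX11/Tensor>
#include "tensorflow/core/framework/tensor.h"

Eigen::Tensor<float, 1> ScoresToProbs(const tensorflow::Tensor& output)
{
    // flat<float>() exposes the TF buffer as an Eigen::TensorMap (RowMajor)
    auto flat = output.flat<float>();
    // Copy element by element; assigning the RowMajor map to a (ColMajor)
    // Eigen::Tensor directly would not compile
    Eigen::Tensor<float, 1> scores(flat.size());
    for (Eigen::Index i = 0; i < flat.size(); ++i)
        scores(i) = flat(i);
    return TensorSoftmax(scores);  // assumes the helper above as a free function
}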
Model loading and session initialization:
int32_t ModelApp::Init(const std::string& graph_file, Logger* logger)
{
    auto status = NewSession(SessionOptions(), &m_session);
    if (!status.ok())
    {
        LOG_ERR(logger, "New session failed! %s", status.ToString().c_str());
        return Error::ERR_FAILED_NEW_TENSORFLOW_SESSION;
    }
    GraphDef graph_def;
    status = ReadBinaryProto(Env::Default(), graph_file, &graph_def);
    if (!status.ok())
    {
        LOG_ERR(logger, "Read binary proto failed! %s", status.ToString().c_str());
        return Error::ERR_FAILED_READ_BINARY_PROTO;
    }
    status = m_session->Create(graph_def);
    if (!status.ok())
    {
        LOG_ERR(logger, "Session create failed! %s", status.ToString().c_str());
        return Error::ERR_FAILED_CREATE_TENSORFLOW_SESSION;
    }
    return Error::SUCCESS;
}
Running:
TensorFlow 0.10 and above is thread-safe, so Predict can be called from multiple threads (see the sketch after the function below).
int32_t ModelApp::Predict(const Action& action, std::vector<float>* info, Logger* logger)
{
    ...
    auto tensor_x = m_writer->Generate(action, logger);
    // Scalar bool tensor that switches batchnorm/dropout to inference mode
    Tensor phase_train(DT_BOOL, TensorShape());
    phase_train.scalar<bool>()() = false;
    std::vector<std::pair<std::string, Tensor>> inputs = {
        {"input_x", tensor_x},
        {"phase_train", phase_train}
    };
    std::vector<Tensor> result;
    auto status = m_session->Run(inputs, {"output/output/scores"}, {}, &result);
    if (!status.ok())
    {
        LOG_ERR(logger, "Session run failed! %s", status.ToString().c_str());
        return Error::ERR_FAILED_TENSORFLOW_EXECUTION;
    }
    ...
    auto scores = result[0].flat<float>();
    ...
    return Error::SUCCESS;
}
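A minimal sketch of the multi-threaded usage mentioned above, reusing the ModelApp, Action, Logger, and Error names from the snippets in this section (ServeConcurrently and the worker logic are invented for illustration; "frozen_model.pb" matches the output of the freeze script earlier):

#include <thread>
#include <vector>

void ServeConcurrently(const std::vector<Action>& actions, Logger* logger)
{
    ModelApp app;
    // Load the frozen graph once; the Session can then be shared by all threads
    if (app.Init("frozen_model.pb", logger) != Error::SUCCESS)
        return;
    std::vector<std::thread> workers;
    workers.reserve(actions.size());
    for (const auto& action : actions)
    {
        // Each worker shares the same ModelApp (and thus the same Session);
        // Session::Run is thread-safe, so no external locking is needed.
        workers.emplace_back([&app, &action, logger]() {
            std::vector<float> scores;
            app.Predict(action, &scores, logger);
        });
    }
    for (auto& w : workers)
        w.join();
}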
(5) Runtime issues
Problem 1: runtime warnings
2017-08-16 14:11:14.393295: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 14:11:14.393324: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 14:11:14.393331: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 14:11:14.393338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
These warnings appear because the TensorFlow .so was built without these CPU acceleration instructions. Rebuild with the corresponding flags enabled; without a GPU, enabling them sped up CPU computation by roughly 10% in our tests.
bazel build -c opt --copt=-mavx --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 -k //tensorflow:libtensorflow_cc.so
Note that not every CPU supports these instructions; always test on the actual target machine, or the process may abort.
Problem 2: mixing the C++ libtensorflow with Python TensorFlow
To verify that loading and running the model from C++ produced correct results, we wrapped the C++ API into a Python module with SWIG. Doing import tensorflow as tf and importing the SWIG-wrapped module in the same process caused a core dump.
The TensorFlow team does not plan to fix this issue.