TensorFlow for Neural Networks
  1. TensorFlow neural network algorithms:
  2. Implementing convolutional neural networks in TensorFlow:
  3. Natural language processing with TensorFlow:
  4. Implementing recurrent neural networks in TensorFlow:
  5. TensorFlow in production and visualization:

1. TensorFlow neural network algorithms:

The idea of neural networks has been around for decades, but only in recent years, as computing power made it possible to train large networks, has the field taken off, repeatedly breaking records in image recognition, handwriting recognition, text understanding, image segmentation, dialogue systems, autonomous driving, and more.
A neural network applies a sequence of basic operations to an input data matrix, typically additions and multiplications followed by nonlinear functions; arbitrary combinations of basic operations and nonlinearities, such as absolute value, maximum, and minimum, are allowed. The core computation is backpropagation, a procedure that updates the model variables based on the learning rate and the value returned by the loss function. The other key ingredient is the nonlinear activation function, which lets neural networks solve most nonlinear problems.
Neural networks are sensitive to the chosen hyperparameters: different learning rates, loss functions, and optimization procedures all have a large effect on training.
Related resources:
Efficient BackProp by Yann LeCun et al. (PDF): http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
Stanford course CS231n: http://cs231n.stanford.edu/
Stanford course CS224d: http://cs224d.stanford.edu/
Deep Learning (MIT Press), by Goodfellow et al.: http://www.deeplearningbook.org
Online book by Michael Nielsen: http://neuralnetworksanddeeplearning.com/
Andrej Karpathy's JavaScript introduction to neural networks: http://karpathy.github.io/neuralnets/
Notes on Goodfellow et al.'s Deep Learning book: http://randomekek.github.io/deep/deeplearning.html

1) Implementing gates:

One of the basic concepts of neural networks is the gate operation.
⑴ First implement f(x)=ax:
Declare a as a variable and x as a placeholder, so that TensorFlow will change the value of a. Create a loss function that measures the distance between the output and the target value 50.
  import tensorflow as tf
  sess=tf.Session()

Declare the model variable, the input value, and the placeholder:
  a=tf.Variable(tf.constant(4.))
  x_val=5.
  x_data=tf.placeholder(dtype=tf.float32)

Add the operation to the computational graph:
  multiplication=tf.multiply(a, x_data)
Declare the loss function as the L2 distance between the output and the target value 50:
  loss=tf.square(tf.subtract(multiplication, 50.))
Initialize the model variable and declare the standard gradient descent optimizer:
  init=tf.global_variables_initializer()
  sess.run(init)
  my_opt=tf.train.GradientDescentOptimizer(0.01)
  train_step=my_opt.minimize(loss)

Optimize the model output toward the expected value: repeatedly feed in the input value 5 and backpropagate the loss to update the model variable toward 10. The code is:
  for i in range(10):
     sess.run(train_step, feed_dict={x_data: x_val})
     a_val=sess.run(a)
     mult_output=sess.run(multiplication, feed_dict={x_data: x_val})
     print(str(a_val)+'*'+str(x_val)+'='+str(mult_output))

The printed results show a moving closer and closer to 10 while the output approaches 50.
⑵ Implement the nested gate f(x)=ax+b:
Declare a and b as variables and x as a placeholder. The optimization target is again 50, and many combinations of the model variables can produce an output of 50.
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  sess=tf.Session()
  a=tf.Variable(tf.constant(1.))
  b=tf.Variable(tf.constant(1.))
  x_val=5.
  x_data=tf.placeholder(dtype=tf.float32)
  two_gate=tf.add(tf.multiply(a, x_data), b)
  loss=tf.square(tf.subtract(two_gate, 50.))
  my_opt=tf.train.GradientDescentOptimizer(0.01)
  train_step=my_opt.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

Optimize the model variables, training the output toward the target value 50:
  for i in range(10):
     sess.run(train_step, feed_dict={x_data: x_val})
     a_val, b_val=(sess.run(a), sess.run(b))
     two_gate_output=sess.run(two_gate, feed_dict={x_data: x_val})
     print(str(a_val)+'*'+str(x_val)+'+'+str(b_val)+'='+str(two_gate_output))

The result here is not unique, and that does not matter: all parameters are adjusted to reduce the loss, so the final values depend on the initialization of a and b. If a and b were initialized randomly, each run would end with different model variables.

2) Combining gates and activation functions:

This recipe compares two activation functions, sigmoid and ReLU, by creating two single-layer neural networks with identical structure but different activations, using the L2 distance as the loss. Batches are drawn at random from a normal distribution, and the outputs are optimized toward an expected value.
  import numpy as np
  import tensorflow as tf
  import matplotlib.pyplot as plt
  sess=tf.Session()
  tf.set_random_seed(5)
  np.random.seed(42)

Both the TensorFlow and NumPy random number generators are seeded here, so results should be reproducible for the same seeds. Next declare the batch size, model variables, data, and placeholder, so that normally distributed data can be fed into the two similar models in the graph. The code is:
  batch_size=50
  a1=tf.Variable(tf.random_normal(shape=[1, 1]))
  b1=tf.Variable(tf.random_normal(shape=[1, 1]))
  a2=tf.Variable(tf.random_normal(shape=[1, 1]))
  b2=tf.Variable(tf.random_normal(shape=[1, 1]))
  x=np.random.normal(2, 0.1, 500)
  x_data=tf.placeholder(shape=[None, 1], dtype=tf.float32)

Declare the two training models, one with a sigmoid activation and one with ReLU:
  sigmoid_activation=tf.sigmoid(tf.add(tf.matmul(x_data, a1), b1))
  relu_activation=tf.nn.relu(tf.add(tf.matmul(x_data, a2), b2))

Both loss functions are the mean L2 distance between the model output and the expected value 0.75:
  loss1=tf.reduce_mean(tf.square(tf.subtract(sigmoid_activation, 0.75)))
  loss2=tf.reduce_mean(tf.square(tf.subtract(relu_activation, 0.75)))

Declare the optimizer and initialize the variables:
  my_opt=tf.train.GradientDescentOptimizer(0.01)
  train_step_sigmoid=my_opt.minimize(loss1)
  train_step_relu=my_opt.minimize(loss2)
  init=tf.global_variables_initializer()
  sess.run(init)

Run the training loop for 750 generations per model, saving the loss values and the mean activation outputs:
  loss_vec_sigmoid=[]
  loss_vec_relu=[]
  activation_sigmoid=[]
  activation_relu=[]
  for i in range(750):
     rand_indices=np.random.choice(len(x), size=batch_size)
     x_vals=np.transpose([x[rand_indices]])
     sess.run(train_step_sigmoid, feed_dict={x_data: x_vals})
     sess.run(train_step_relu, feed_dict={x_data: x_vals})
     loss_vec_sigmoid.append(sess.run(loss1,feed_dict={x_data: x_vals}))
     loss_vec_relu.append(sess.run(loss2,feed_dict={x_data: x_vals}))
     activation_sigmoid.append(np.mean(sess.run(sigmoid_activation, feed_dict={x_data: x_vals})))
     activation_relu.append(np.mean(sess.run(relu_activation, feed_dict={x_data: x_vals})))

Plot the activation outputs and the loss functions:
  plt.plot(activation_sigmoid, 'k-', label='Sigmoid Activation')
  plt.plot(activation_relu, 'r--', label='ReLU Activation')
  plt.ylim([0, 1.0])
  plt.title('Activation Outputs')
  plt.xlabel('Generation')
  plt.ylabel('Outputs')
  plt.legend(loc='upper right')
  plt.show()
  plt.plot(loss_vec_sigmoid, 'k-', label='Sigmoid Loss')
  plt.plot(loss_vec_relu, 'r--', label='ReLU Loss')
  plt.ylim([0, 1.0])
  plt.title('Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Loss')
  plt.legend(loc='upper right')
  plt.show()

The two networks have the same structure and target value and differ only in the activation function, sigmoid versus ReLU. The plots show that the network with the ReLU activation converges faster than the one with the sigmoid activation.
Because of its shape, ReLU returns many more zero values than sigmoid; this sparsity speeds up convergence, but at the cost of some control over the gradient, so the outputs can contain extreme values. The sigmoid activation has well-controlled gradients and rarely produces extreme values.
Many other activation functions are in common use, falling into two families: sigmoid-shaped functions, including arctangent, hyperbolic tangent, and the step function; and ReLU-shaped functions, including softplus and leaky ReLU. A sketch of both families follows.

3) Implementing a one-layer neural network:

This recipe implements a neural network with one hidden layer. The point is to see that fully connected networks are mostly matrix multiplication, so the dimensions of the dataset and the matrices are critical for the model to run correctly. Since this is a regression problem, the mean squared error is used as the loss function, and the model is trained on the iris dataset.
  import matplotlib.pyplot as plt
  import numpy as np
  import tensorflow as tf
  from sklearn import datasets

Load the iris dataset, store the petal width as the target value, and start a graph session:
  iris=datasets.load_iris()
  x_vals=np.array([x[0:3] for x in iris.data])
  y_vals=np.array([x[3] for x in iris.data])
  sess=tf.Session()

Because the dataset is small, set a seed so that the results are reproducible:
  seed=2
  tf.set_random_seed(seed)
  np.random.seed(seed)

Create an 80/20 train/test split and normalize the x features to the range 0 to 1 with min-max scaling:
  train_indices=np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
  test_indices=np.array(list(set(range(len(x_vals)))-set(train_indices)))
  x_vals_train=x_vals[train_indices]
  x_vals_test=x_vals[test_indices]
  y_vals_train=y_vals[train_indices]
  y_vals_test=y_vals[test_indices]
  def normalize_cols(m):
     col_max=m.max(axis=0)
     col_min=m.min(axis=0)
     return (m-col_min)/(col_max-col_min)
  x_vals_train=np.nan_to_num(normalize_cols(x_vals_train))
  x_vals_test=np.nan_to_num(normalize_cols(x_vals_test))

Declare the batch size and the placeholders for the data and target:
  batch_size=50
  x_data=tf.placeholder(shape=[None, 3], dtype=tf.float32)
  y_target=tf.placeholder(shape=[None, 1], dtype=tf.float32)

Declare model variables with the appropriate shapes. The hidden layer can be any size; here it has 5 hidden nodes:
  hidden_layer_nodes=5
  A1=tf.Variable(tf.random_normal(shape=[3, hidden_layer_nodes]))
  b1=tf.Variable(tf.random_normal(shape=[hidden_layer_nodes]))
  A2=tf.Variable(tf.random_normal(shape=[hidden_layer_nodes, 1]))
  b2=tf.Variable(tf.random_normal(shape=[1]))

Build the model in two steps: first create the hidden layer output, then the final output of the model:
  hidden_output=tf.nn.relu(tf.add(tf.matmul(x_data, A1), b1))
  final_output=tf.nn.relu(tf.add(tf.matmul(hidden_output, A2), b2))

The model has three input features, five hidden nodes, and one output.
Define the mean squared error as the loss function:
  loss=tf.reduce_mean(tf.square(y_target-final_output))
Declare the optimizer and initialize the model variables:
  my_opt=tf.train.GradientDescentOptimizer(0.005)
  train_step=my_opt.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

Run the training loop, storing both the train and test loss and fitting the model on a randomly chosen batch of training data at each step:
  loss_vec=[]
  test_loss=[]
  for i in range(500):
     rand_index=np.random.choice(len(x_vals_train), size=batch_size)
     rand_x=x_vals_train[rand_index]
     rand_y=np.transpose([y_vals_train[rand_index]])
     sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
     temp_loss=sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
     loss_vec.append(np.sqrt(temp_loss))
     test_temp_loss=sess.run(loss, feed_dict={x_data: x_vals_test, y_target: np.transpose([y_vals_test])})
     test_loss.append(np.sqrt(test_temp_loss))
     if (i+1)%50==0:
       print('Generation: '+str(i+1)+'.Loss='+str(temp_loss))

Plot the loss with matplotlib:
  plt.plot(loss_vec, 'k-', label='Train loss')
  plt.plot(test_loss, 'r--', label='Test loss')
  plt.title('Loss (MSE) per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Loss')
  plt.legend(loc='upper right')
  plt.show()

The plot shows slight overfitting after about 200 generations: the test-set MSE stops improving while the training-set MSE continues to drop. The test loss curve is also smoother, because it is evaluated on the whole fixed test set rather than on small random batches; the model is trained only on the training set, so the test set never influences the model variables.
Visualizing the model, the hidden layer contains 5 nodes. Three values are fed in (sepal length, sepal width, petal length) and the target is the petal width, giving 26 model variables in total.

4) Implementing common neural network layers:

The previous recipe implemented a fully connected layer; TensorFlow has many built-in functions for other layer types, the most popular being the convolutional layer and the max-pool layer. Layers can be combined in any form: a common approach is to create features with convolutional and fully connected layers, to condense many features with pooling layers, and to follow these layers with activation functions.
⑴ Neural network layers on one-dimensional data:
  import numpy as np
  import tensorflow as tf
  sess=tf.Session()

Initialize the data as a NumPy array of length 25 and create a placeholder for it:
  data_size=25
  data_1d=np.random.normal(size=data_size)
  x_input_1d=tf.placeholder(dtype=tf.float32, shape=[data_size])

Define a function for the convolutional layer, then declare a random filter and create the layer:
  def conv_layer_1d(input_1d, my_filter):
     # Make 1d input into 4d
     input_2d=tf.expand_dims(input_1d, 0)
     input_3d=tf.expand_dims(input_2d, 0)
     input_4d=tf.expand_dims(input_3d, 3)
     # perform convolution
     convolution_output=tf.nn.conv2d(input_4d, filter=my_filter, strides=[1,1,1,1], padding="VALID")
     conv_output_1d=tf.squeeze(convolution_output)
     return(conv_output_1d)
  my_filter=tf.Variable(tf.random_normal(shape=[1,5,1,1]))
  my_convolution_output=conv_layer_1d(x_input_1d, my_filter)

Many of TensorFlow's layer functions are designed for four-dimensional data ([batch size, width, height, channels]), so the input and output must be reshaped by expanding and dropping dimensions. Here the data has batch size 1, width 1, height 25, and 1 color channel; expand_dims() adds dimensions and squeeze() drops them. The output size of a convolutional layer is given by output_size=(W-F+2P)/S+1, where W is the input size, F the filter size, P the padding, and S the stride.
TensorFlow's activation functions operate element-wise by default, and some layers need an activation afterward. The code to create and initialize the activation:
  def activation(input_1d):
     return tf.nn.relu(input_1d)
  my_activation_output=activation(my_convolution_output)

Declare a max-pool layer function that pools over a moving window of width 5 on the one-dimensional vector. The code:
  def max_pool(input_1d, width):
     # make 1d input into 4d
     input_2d=tf.expand_dims(input_1d, 0)
     input_3d=tf.expand_dims(input_2d, 0)
     input_4d=tf.expand_dims(input_3d, 3)
     # perform max pool operation
     pool_output=tf.nn.max_pool(input_4d, ksize=[1, 1, width, 1], strides=[1,1,1,1], padding='VALID')
     pool_output_1d=tf.squeeze(pool_output)
     return pool_output_1d
  my_maxpool_output=max_pool(my_activation_output, width=5)

TensorFlow's max-pool function takes parameters very similar to the convolution function, except that there is no filter, only a window shape, strides, and a padding option. Because the window width is 5 and VALID (no) padding is used, the output array is shorter by 2·floor(5/2)=4 entries, i.e. 21-4=17.
The last layer is fully connected. Create a function that takes the 1D data and outputs the requested number of values; a 1D array must be expanded to 2D before matrix multiplication. The code is:
  def fully_connected(input_layer, num_outputs):
     # Create weights
     weight_shape=tf.squeeze(tf.stack([tf.shape(input_layer), [num_outputs]]))
     weight=tf.random_normal(weight_shape, stddev=0.1)
     bias=tf.random_normal(shape=[num_outputs])
     # Make input to 2d
     input_layer_2d=tf.expand_dims(input_layer, 0)
     # Perform fully connected operations
     full_output=tf.add(tf.matmul(input_layer_2d, weight), bias)
     # Drop extra dimensions
     full_output_1d=tf.squeeze(full_output)
     return full_output_1d
  my_full_output=fully_connected(my_maxpool_output, 5)

Initialize all variables and run the graph, printing the output of each layer:
  init=tf.global_variables_initializer()
  sess.run(init)
  feed_dict={x_input_1d: data_1d}
  # Convolution Output
  print('Input=array of length 25')
  print('Convolution w/filter, length=5, stride size=1, results in an array of length 21: ')
  print(sess.run(my_convolution_output, feed_dict=feed_dict))
  # Activation Output
  print('Input=the above array of length 21')
  print('ReLU element wise returns the array of length 21: ')
  print(sess.run(my_activation_output, feed_dict=feed_dict))
  # Maxpool Output
  print('Input=the above array of length 21')
  print('MaxPool, window length=5, stride size=1, result in the array of length 17: ')
  print(sess.run(my_maxpool_output, feed_dict=feed_dict))
  # Fully Connected Output
  print('Input=the above array of length 17')
  print('Fully connected layer with five outputs: ')
  print(sess.run(my_full_output, feed_dict=feed_dict))

One-dimensional data matters a great deal for neural networks: time series, signal processing, and some text-embedding datasets are all one-dimensional and use these algorithms frequently.
⑵ Neural network layers on two-dimensional data:
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  sess=tf.Session()

Initialize the input as a 10×10 matrix and then create its placeholder in the graph:
  data_size=[10, 10]
  data_2d=np.random.normal(size=data_size)
  x_input_2d=tf.placeholder(dtype=tf.float32, shape=data_size)

Declare a convolutional layer function. Since the data already has height and width, only two extra dimensions are needed (batch size 1, 1 color channel) before the built-in conv2d() function can be used. Here a random 2×2 filter is used, with a stride of 2 in both directions and VALID padding, so the 10×10 input yields a 5×5 convolution output. The code is:
  def conv_layer_2d(input_2d, my_filter):
     # Make 2d input into 4d
     input_3d=tf.expand_dims(input_2d, 0)
     input_4d=tf.expand_dims(input_3d, 3)
     # perform convolution
     convolution_output=tf.nn.conv2d(input_4d, filter=my_filter, strides=[1,2,2,1], padding="VALID")
     conv_output_2d=tf.squeeze(convolution_output)
     return(conv_output_2d)
  my_filter=tf.Variable(tf.random_normal(shape=[2,2,1,1]))
  my_convolution_output=conv_layer_2d(x_input_2d, my_filter)

Create and initialize the activation function:
  def activation(input_2d):
     return tf.nn.relu(input_2d)
  my_activation_output=activation(my_convolution_output)

The max-pool layer needs the width and height of the moving window, expanded here to operate on 2D input. The code is:
  def max_pool(input_2d, width, height):
     # make 2d input into 4d
     input_3d=tf.expand_dims(input_2d, 0)
     input_4d=tf.expand_dims(input_3d, 3)
     # perform max pool operation
     pool_output=tf.nn.max_pool(input_4d, ksize=[1, height, width, 1], strides=[1,1,1,1], padding='VALID')
     pool_output_2d=tf.squeeze(pool_output)
     return pool_output_2d
  my_maxpool_output=max_pool(my_activation_output, width=2, height=2)

Treat the 2D input to the fully connected layer as one object. To connect every entry to every output, flatten the 2D matrix, then expand the dimensions again for the matrix multiplication. The code is:
  def fully_connected(input_layer, num_outputs):
     # Flatten into 1d
     flat_input=tf.reshape(input_layer, [-1])
     # Create weights
     weight_shape=tf.squeeze(tf.stack([tf.shape(flat_input), [num_outputs]]))
     weight=tf.random_normal(weight_shape, stddev=0.1)
     bias=tf.random_normal(shape=[num_outputs])
     # Change into 2d
     input_2d=tf.expand_dims(flat_input, 0)
     # Perform fully connected operations
     full_output=tf.add(tf.matmul(input_2d, weight), bias)
     # Drop extra dimensions
     full_output_2d=tf.squeeze(full_output)
     return full_output_2d
  my_full_output=fully_connected(my_maxpool_output, 5)

Initialize the variables and create a feed dictionary:
  init=tf.global_variables_initializer()
  sess.run(init)
  feed_dict={x_input_2d: data_2d}

Print the output of each layer:
  # Convolution Output
  print('Input=[10 x 10] array')
  print('2x2 Convolution, stride size=[2x2], results in [5x5] array: ')
  print(sess.run(my_convolution_output, feed_dict=feed_dict))
  # Activation Output
  print('Input=the above [5x5] array')
  print('ReLU element wise returns the [5x5] array: ')
  print(sess.run(my_activation_output, feed_dict=feed_dict))
  # Maxpool Output
  print('Input=the above [5x5] array')
  print('MaxPool, stride size=[1x1], result in the [4x4] array: ')
  print(sess.run(my_maxpool_output, feed_dict=feed_dict))
  # Fully Connected Output
  print('Input=the above [4x4] array')
  print('Fully Connected layer on all four rows with five outputs: ')
  print(sess.run(my_full_output, feed_dict=feed_dict))

Whether TensorFlow's convolution and pooling layers are applied to one- or two-dimensional data, the outputs come out with the corresponding shapes, which shows the flexibility of these layers.

5) Implementing a multilayer neural network:

Create a neural network with three hidden layers and apply it to the low-birthweight dataset to predict birth weight.
  import matplotlib.pyplot as plt
  import numpy as np
  import tensorflow as tf
  import os
  import csv
  import requests
  sess=tf.Session()

Use the requests module to load the dataset from the web, then separate out the features of interest and the target values. The code is:
  birth_weight_file='birth_weight.csv'
  birthdata_url='https://github.com/nfmcclure/tensorflow_cookbook/raw/master'\
     '/01_Introduction/7_Working_with_Data_Sources/birthweight_data/birthweight.dat'
  if not os.path.exists(birth_weight_file):
     birth_file=requests.get(birthdata_url)
     birth_data=birth_file.text.split('\r\n')
     birth_header=birth_data[0].split('\t')
     birth_data=[[float(x) for x in y.split('\t') if len(x)>=1] for y in birth_data[1:] if len(y)>=1]
     with open(birth_weight_file, 'w') as f:
       writer=csv.writer(f)
       writer.writerow(birth_header)
       writer.writerows(birth_data)
  # Read birth weight data into memory
  birth_data=[]
  with open(birth_weight_file, newline='') as csvfile:
     csv_reader=csv.reader(csvfile)
     birth_header=next(csv_reader)
     for row in csv_reader:
       birth_data.append(row)
     birth_data=[[float(x) for x in row] for row in birth_data]
  # Pull out target variable (actual birth weight, the last column)
  y_vals=np.array([x[8] for x in birth_data])
  # Pull out predictor variables
  x_vals=np.array([x[1:8] for x in birth_data])

Set random seeds for NumPy and TensorFlow so the run can be reproduced later, then declare the batch size:
  seed=4
  tf.set_random_seed(seed)
  np.random.seed(seed)
  batch_size=100

Split the data 80/20 into train and test sets, then min-max normalize the input features to the range 0 to 1:
  train_indices=np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
  test_indices=np.array(list(set(range(len(x_vals)))-set(train_indices)))
  x_vals_train=x_vals[train_indices]
  x_vals_test=x_vals[test_indices]
  y_vals_train=y_vals[train_indices]
  y_vals_test=y_vals[test_indices]
  # Normalize by column (min-max norm)
  def normalize_cols(m, col_min=np.array([None]), col_max=np.array([None])):
     if not col_min[0]:
       col_min=m.min(axis=0)
     if not col_max[0]:
       col_max=m.max(axis=0)
     return (m-col_min)/(col_max-col_min), col_min, col_max
  x_vals_train, train_min, train_max=normalize_cols(x_vals_train)
  x_vals_train=np.nan_to_num(x_vals_train)
  x_vals_test, _, _=normalize_cols(x_vals_test, train_min, train_max)
  x_vals_test=np.nan_to_num(x_vals_test)

Normalizing the input features is a common transformation and is particularly helpful for neural networks: activation functions converge better when the data lies between 0 and 1.
Because several layers need similar variable initialization, create functions that initialize the weights and biases. The code is:
  def init_weight(shape, st_dev):
     weight=tf.Variable(tf.random_normal(shape, stddev=st_dev))
     return weight
  def init_bias(shape, st_dev):
     bias=tf.Variable(tf.random_normal(shape, stddev=st_dev))
     return bias

Initialize the placeholders; there are 7 input features and 1 output value (the birth weight):
  x_data=tf.placeholder(shape=[None, 7], dtype=tf.float32)
  y_target=tf.placeholder(shape=[None, 1], dtype=tf.float32)

The fully connected layer will be used three times across the hidden layers; to avoid repeating code, create a layer function for the model. The code is:
  def fully_connected(input_layer, weights, biases):
     layer=tf.add(tf.matmul(input_layer, weights), biases)
     return tf.nn.relu(layer)

Create the model. For each layer, including the output layer, initialize a weight matrix and a bias and build the fully connected layer; the three hidden layers have 25, 10, and 3 nodes. The code is:
  # Create first layer (25 hidden nodes)
  weight_1=init_weight(shape=[7, 25], st_dev=10.0)
  bias_1=init_bias(shape=[25], st_dev=10.0)
  layer_1=fully_connected(x_data, weight_1, bias_1)
  # Create second layer (10 hidden nodes)
  weight_2=init_weight(shape=[25, 10], st_dev=10.0)
  bias_2=init_bias(shape=[10], st_dev=10.0)
  layer_2=fully_connected(layer_1, weight_2, bias_2)
  # Create third layer (3 hidden nodes)
  weight_3=init_weight(shape=[10, 3], st_dev=10.0)
  bias_3=init_bias(shape=[3], st_dev=10.0)
  layer_3=fully_connected(layer_2, weight_3, bias_3)
  # Create output layer (1 output value)
  weight_4=init_weight(shape=[3, 1], st_dev=10.0)
  bias_4=init_bias(shape=[1], st_dev=10.0)
  final_output=fully_connected(layer_3, weight_4, bias_4)

The model just created has 497 variables to fit: 200 between the input and the first hidden layer (7×25+25), and counted the same way the remaining layers add 260, 33, and 4, for 200+260+33+4=497 in total, as the check below confirms.
Use the L1 norm (absolute value) as the loss function, declare the Adam optimizer, and initialize the variables:
  loss=tf.reduce_mean(tf.abs(y_target-final_output))
  my_opt=tf.train.AdamOptimizer(0.05)
  train_step=my_opt.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

The learning rate chosen for Adam here is 0.05; a lower learning rate would give better results, but a larger one is used for consistency across examples and fast convergence.
Train for 200 generations, storing the train and test loss, drawing a random batch at each step, and printing the status every 25 generations. The code is:
  loss_vec=[]
  test_loss=[]
  for i in range(200):
     rand_index=np.random.choice(len(x_vals_train), size=batch_size)
     rand_x=x_vals_train[rand_index]
     rand_y=np.transpose([y_vals_train[rand_index]])
     sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
     temp_loss=sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
     loss_vec.append(temp_loss)
     test_temp_loss=sess.run(loss, feed_dict={x_data: x_vals_test, y_target: np.transpose([y_vals_test])})
     test_loss.append(test_temp_loss)
     if (i+1)%25==0:
       print('Generation: '+str(i+1)+'.Loss='+str(temp_loss))

Plot the train and test loss with matplotlib:
  plt.plot(loss_vec, 'k-', label='Train loss')
  plt.plot(test_loss, 'r--', label='Test loss')
  plt.title('Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Loss')
  plt.legend(loc='upper right')
  plt.show()

An earlier logistic regression model reached about 60% accuracy after thousands of iterations. For comparison, output the regression predictions on the train/test sets and pass them through an indicator function (is the weight below 2500 grams?) to turn the regression results into a classification. The code to evaluate this model's accuracy:
  actuals=np.array([x[0] for x in birth_data])
  test_actuals=actuals[test_indices]
  train_actuals=actuals[train_indices]
  test_preds=[x[0] for x in sess.run(final_output, feed_dict={x_data: x_vals_test})]
  train_preds=[x[0] for x in sess.run(final_output, feed_dict={x_data: x_vals_train})]
  test_preds=np.array([1.0 if x<2500.0 else 0.0 for x in test_preds])
  train_preds=np.array([1.0 if x<2500.0 else 0.0 for x in train_preds])
  # print out accuracies
  test_acc=np.mean([x==y for x,y in zip(test_preds, test_actuals)])
  train_acc=np.mean([x==y for x,y in zip(train_preds, train_actuals)])
  print('On predicting the category of low birthweight from regression output (<2500g): ')
  print('Test Accuracy: {}'.format(test_acc))   # Test Accuracy: 0.631578947368421
  print('Train Accuracy: {}'.format(train_acc))   # Train Accuracy: 0.7019867549668874

The plot shows that a fairly good model is obtained after roughly 30 generations of training.

6) Improving the predictions of linear models:

Load the low-birthweight dataset and fit it with a neural network of fully connected layers, with two hidden layers and sigmoid activations.
  import matplotlib.pyplot as plt
  import numpy as np
  import tensorflow as tf
  import requests
  import os
  import csv
  sess=tf.Session()

Load the birthweight dataset and extract and normalize it as before, except that here the low-birthweight indicator variable is used as the target instead of the actual birth weight. The code is:
  birth_weight_file='birth_weight.csv'
  birthdata_url='https://github.com/nfmcclure/tensorflow_cookbook/raw/master'\
     '/01_Introduction/7_Working_with_Data_Sources/birthweight_data/birthweight.dat'
  if not os.path.exists(birth_weight_file):
     birth_file=requests.get(birthdata_url)
     birth_data=birth_file.text.split('\r\n')
     birth_header=birth_data[0].split('\t')
     birth_data=[[float(x) for x in y.split('\t') if len(x)>=1] for y in birth_data[1:] if len(y)>=1]
     with open(birth_weight_file, 'w') as f:
       writer=csv.writer(f)
       writer.writerow(birth_header)
       writer.writerows(birth_data)
  # Read birth weight data into memory
  birth_data=[]
  with open(birth_weight_file, newline='') as csvfile:
     csv_reader=csv.reader(csvfile)
     birth_header=next(csv_reader)
     for row in csv_reader:
       birth_data.append(row)
     birth_data=[[float(x) for x in row] for row in birth_data]
  # Pull out target variable
  y_vals=np.array([x[0] for x in birth_data])
  # Pull out predictor variables
  x_vals=np.array([x[1:8] for x in birth_data])
  seed=4
  tf.set_random_seed(seed)
  np.random.seed(seed)
  batch_size=100
  train_indices=np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
  test_indices=np.array(list(set(range(len(x_vals)))-set(train_indices)))
  x_vals_train=x_vals[train_indices]
  x_vals_test=x_vals[test_indices]
  y_vals_train=y_vals[train_indices]
  y_vals_test=y_vals[test_indices]
  # Normalize by column (min-max norm)
  def normalize_cols(m, col_min=np.array([None]), col_max=np.array([None])):
     if not col_min[0]:
       col_min=m.min(axis=0)
     if not col_max[0]:
       col_max=m.max(axis=0)
     return (m-col_min)/(col_max-col_min), col_min, col_max
  x_vals_train, train_min, train_max=normalize_cols(x_vals_train)
  x_vals_train=np.nan_to_num(x_vals_train)
  x_vals_test, _, _=normalize_cols(x_vals_test, train_min, train_max)
  x_vals_test=np.nan_to_num(x_vals_test)

Declare the batch size and the placeholders:
  batch_size=90
  x_data=tf.placeholder(shape=[None, 7], dtype=tf.float32)
  y_target=tf.placeholder(shape=[None, 1], dtype=tf.float32)

Declare functions that initialize the model's variables and layers. To build the logistic layers, create a function that applies a fully connected layer to its input and returns the sigmoid of the result. Since the loss function has the sigmoid built in, the final layer skips the sigmoid on its output. The code is:
  def init_variable(shape):
     return tf.Variable(tf.random_normal(shape=shape))
  # Create a logistic layer definition
  def logistic(input_layer, multiplication_weight, bias_weight, activation=True):
     linear_layer=tf.add(tf.matmul(input_layer, multiplication_weight), bias_weight)
     if activation:
       return tf.nn.sigmoid(linear_layer)
      else:
       return linear_layer

Declare the network's two hidden layers and the output layer, initializing a weight matrix and a bias for each and defining each layer's operation:
  # First logistic layer (7 inputs to 14 hidden nodes)
  A1=init_variable(shape=[7, 14])
  b1=init_variable(shape=[14])
  logistic_layer1=logistic(x_data, A1, b1)
  # Second logistic layer (14 hidden inputs to 5 hidden nodes)
  A2=init_variable(shape=[14, 5])
  b2=init_variable(shape=[5])
  logistic_layer2=logistic(logistic_layer1, A2, b2)
  # Final output layer (5 hidden nodes to 1 output)
  A3=init_variable(shape=[5, 1])
  b3=init_variable(shape=[1])
  final_output=logistic(logistic_layer2, A3, b3, activation=False)

Declare the loss function and the optimizer, and initialize the variables. Cross entropy is used as the loss here; the code:
  loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=final_output, labels=y_target))
  my_opt=tf.train.AdamOptimizer(learning_rate=0.002)
  train_step=my_opt.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

Cross entropy measures the distance between probabilities; here it measures the difference between the true label (0 or 1) and the model's predicted probability (between 0 and 1). TensorFlow's implementation builds the sigmoid into the cross entropy, as the sketch below illustrates. Hyperparameter tuning matters a great deal for finding the best loss function, learning rate, and optimizer, but it is omitted here for brevity.
To evaluate and compare models, create prediction and accuracy operations on the graph, so that the test set can be passed in to compute accuracy. The code is:
  prediction=tf.round(tf.nn.sigmoid(final_output))
  predictions_correct=tf.cast(tf.equal(prediction, y_target), tf.float32)
  accuracy=tf.reduce_mean(predictions_correct)

Run the training loop for 1500 generations, saving the loss and the train/test accuracy for plotting later:
  loss_vec=[]
  train_acc=[]
  test_acc=[]
  for i in range(1500):
     rand_index=np.random.choice(len(x_vals_train), size=batch_size)
     rand_x=x_vals_train[rand_index]
     rand_y=np.transpose([y_vals_train[rand_index]])
     sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
     temp_loss=sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
     loss_vec.append(temp_loss)
     temp_acc_train=sess.run(accuracy, feed_dict={x_data: x_vals_train, y_target: np.transpose([y_vals_train])})
     train_acc.append(temp_acc_train)
     temp_acc_test=sess.run(accuracy, feed_dict={x_data: x_vals_test, y_target: np.transpose([y_vals_test])})
     test_acc.append(temp_acc_test)
     if (i+1)%50==0:
       print('Generation: '+str(i+1)+'.Loss='+str(temp_loss))

Plot the cross-entropy loss and the train/test accuracy with matplotlib:
  plt.plot(loss_vec, 'k-', label='Train loss')
  plt.title('Cross Entropy Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Cross Entropy Loss')
  plt.show()
  plt.plot(train_acc, 'k-', label='Train Set Accuracy')
  plt.plot(test_acc, 'r--', label='Test Set Accuracy')
  plt.title('Train and Test Accuracy')
  plt.xlabel('Generation')
  plt.ylabel('Accuracy')
  plt.legend(loc='lower right')
  plt.show()

The plots show that a reasonably good model is reached after about 50 generations, and further training brings little additional improvement.
Neural networks converge faster than the earlier models and in some settings are more accurate, but they have many more variables to fit and are very prone to overfitting: the training accuracy keeps creeping up while the test accuracy sometimes rises slightly and sometimes falls. Overfitting becomes more likely as model depth or training iterations increase; to address it, add more data or use regularization.
Neural network variables are also not as interpretable as those of linear models; it is harder to read the meaning of the features off the coefficients than in a linear model.

2. Implementing convolutional neural networks in TensorFlow:

In recent years, convolutional neural networks (CNNs) have made major breakthroughs in image recognition.
Mathematically, convolution is an operation of one function over another. For images, think of the image as a matrix of numbers representing pixels or image properties; the convolution operation slides a fixed-size filter window over the image, multiplying element-wise at each position and summing to produce the result, as sketched below.
CNNs also involve a few other essential operations, such as introducing nonlinearity (the ReLU function) or aggregating parameters (the max-pool function). The trainable variables of a convolution are the filter weights; max pooling takes the maximum value over a specified region. Although custom CNNs can be built, existing architectures are usually adopted, often starting from a pretrained network and retraining the final fully connected layers on the new dataset.

1) Implementing a simple CNN:

This recipe develops a four-layer convolutional network to improve accuracy on MNIST digit prediction: the first two layers are Convolution-ReLU-maxpool blocks and the last two are fully connected layers.
To access MNIST, import the data-loading helper from TensorFlow's examples.tutorials package. After the dataset loads, set the model variables, create the model, train it in batches, and visualize the loss, the accuracy, and some sample digits.
  import matplotlib.pyplot as plt
  import numpy as np
  import tensorflow as tf
  from tensorflow.examples.tutorials.mnist import input_data
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  sess=tf.Session()

Load the dataset and reshape the images into 28×28 arrays:
  data_dir='temp'
  mnist=input_data.read_data_sets("MNIST_data")
  train_xdata=np.array([np.reshape(x, (28,28)) for x in mnist.train.images])
  test_xdata=np.array([np.reshape(x, (28,28)) for x in mnist.test.images])
  train_labels=mnist.train.labels
  test_labels=mnist.test.labels

The downloaded MNIST data includes a validation set of the same size as the test set; load it as well when doing hyperparameter tuning or model selection.
Set the model parameters. The images are grayscale, so the image depth, i.e. the number of color channels, is 1. The code is:
  batch_size=100
  learning_rate=0.005
  evaluation_size=500
  image_width=train_xdata[0].shape[0]
  image_height=train_xdata[0].shape[1]
  target_size=max(train_labels)+1
  num_channels=1
  generations=500
  eval_every=5
  conv1_features=25
  conv2_features=50
  max_pool_size1=2
  max_pool_size2=2
  fully_connected_size1=100

Declare placeholders for the data, for both the training and the evaluation sets:
  x_input_shape=(batch_size, image_width, image_height, num_channels)
  x_input=tf.placeholder(tf.float32, shape=x_input_shape)
  y_target=tf.placeholder(tf.int32, shape=(batch_size))
  eval_input_shape=(evaluation_size, image_width, image_height, num_channels)
  eval_input=tf.placeholder(tf.float32, shape=eval_input_shape)
  eval_target=tf.placeholder(tf.int32, shape=(evaluation_size))

Declare the weights and biases of the convolutional layers:
  conv1_weight=tf.Variable(tf.truncated_normal([4, 4, num_channels, conv1_features], stddev=0.1, dtype=tf.float32))
  conv1_bias=tf.Variable(tf.zeros([conv1_features], dtype=tf.float32))
  conv2_weight=tf.Variable(tf.truncated_normal([4, 4, conv1_features, conv2_features], stddev=0.1, dtype=tf.float32))
  conv2_bias=tf.Variable(tf.zeros([conv2_features], dtype=tf.float32))

Declare the weights and biases of the fully connected layers:
  resulting_width=image_width//(max_pool_size1*max_pool_size2)
  resulting_height=image_height//(max_pool_size1*max_pool_size2)
  full1_input_size=resulting_width*resulting_height*conv2_features
  full1_weight=tf.Variable(tf.truncated_normal([full1_input_size, fully_connected_size1], stddev=0.1, dtype=tf.float32))
  full1_bias=tf.Variable(tf.truncated_normal([fully_connected_size1], stddev=0.1, dtype=tf.float32))
  full2_weight=tf.Variable(tf.truncated_normal([fully_connected_size1, target_size], stddev=0.1, dtype=tf.float32))
  full2_bias=tf.Variable(tf.truncated_normal([target_size], stddev=0.1, dtype=tf.float32))

Declare the model. First create a model function my_conv_net() that has access to all the layer weights and biases. The output of the second convolutional block is flattened so that the final two fully connected layers can operate on it. The code is:
  def my_conv_net(input_data):
     # First Conv_ReLU_MaxPool Layer
     conv1=tf.nn.conv2d(input_data, conv1_weight, strides=[1, 1, 1, 1], padding='SAME')
     relu1=tf.nn.relu(tf.nn.bias_add(conv1, conv1_bias))
     max_pool1=tf.nn.max_pool(relu1, ksize=[1, max_pool_size1, max_pool_size1, 1], strides=[1, max_pool_size1, max_pool_size1, 1], padding='SAME')
     # Second Conv_ReLU_MaxPool Layer
     conv2=tf.nn.conv2d(max_pool1, conv2_weight, strides=[1, 1, 1, 1], padding='SAME')
     relu2=tf.nn.relu(tf.nn.bias_add(conv2, conv2_bias))
     max_pool2=tf.nn.max_pool(relu2, ksize=[1, max_pool_size2, max_pool_size2, 1], strides=[1, max_pool_size2, max_pool_size2, 1], padding='SAME')
     # Transform Output into a 1xN layer for next fully connected layer
     final_conv_shape=max_pool2.get_shape().as_list()
     final_shape=final_conv_shape[1]*final_conv_shape[2]*final_conv_shape[3]
     flat_output=tf.reshape(max_pool2, [final_conv_shape[0], final_shape])
     # First Fully Connected Layer
     fully_connected1=tf.nn.relu(tf.add(tf.matmul(flat_output, full1_weight), full1_bias))
     # Second Fully Connected Layer
     final_model_output=tf.add(tf.matmul(fully_connected1, full2_weight), full2_bias)
     return final_model_output

Declare the training and test models:
  model_output=my_conv_net(x_input)
  test_model_output=my_conv_net(eval_input)

Because each target is a single class index rather than a one-hot vector, use the sparse softmax cross entropy as the loss function:
  loss=tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=model_output, labels=y_target))
Create prediction functions for the training and test sets, and an accuracy function to evaluate the model on each batch. The code is:
  prediction=tf.nn.softmax(model_output)
  test_prediction=tf.nn.softmax(test_model_output)
  def get_accuracy(logits, targets):
     batch_predictions=np.argmax(logits, axis=1)
     num_correct=np.sum(np.equal(batch_predictions, targets))
     return 100.*num_correct/batch_predictions.shape[0]

Create the optimizer, declare the training step, and initialize all the model variables:
  my_optimizer=tf.train.MomentumOptimizer(learning_rate, 0.9)
  train_step=my_optimizer.minimize(loss)
  # Initialize Variables
  init=tf.global_variables_initializer()
  sess.run(init)

Train the model, selecting a random batch of data at each step, periodically evaluating the model on training and test batches, and saving the loss and accuracy:
  train_loss=[]
  train_acc=[]
  test_acc=[]
  for i in range(generations):
     rand_index=np.random.choice(len(train_xdata), size=batch_size)
     rand_x=train_xdata[rand_index]
     rand_x=np.expand_dims(rand_x, 3)
     rand_y=train_labels[rand_index]
     train_dict={x_input: rand_x, y_target: rand_y}
     sess.run(train_step, feed_dict=train_dict)
     temp_train_loss, temp_train_preds=sess.run([loss, prediction], feed_dict=train_dict)
     temp_train_acc=get_accuracy(temp_train_preds, rand_y)
     if (i+1)%eval_every==0:
       eval_index=np.random.choice(len(test_xdata), size=evaluation_size)
       eval_x=test_xdata[eval_index]
       eval_x=np.expand_dims(eval_x, 3)
       eval_y=test_labels[eval_index]
       test_dict={eval_input: eval_x, eval_target: eval_y}
       test_preds=sess.run(test_prediction, feed_dict=test_dict)
       temp_test_acc=get_accuracy(test_preds, eval_y)
       # Record and print results
       train_loss.append(temp_train_loss)
       train_acc.append(temp_train_acc)
       test_acc.append(temp_test_acc)
       acc_and_loss=[(i+1), temp_train_loss, temp_train_acc, temp_test_acc]
       acc_and_loss=[np.round(x, 2) for x in acc_and_loss]
       print('Generation # {}. Train Loss: {:.2f}. Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

Partial output:
  Generation # 490. Train Loss: 0.06. Train Acc (Test Acc): 98.00 (97.40)
  Generation # 500. Train Loss: 0.14. Train Acc (Test Acc): 98.00 (96.00)

After 500 generations, accuracy on the test set reaches about 96-97%.
Plot the loss and accuracy with matplotlib:
  eval_indices=range(0, generations, eval_every)
  # Plot loss over time
  plt.plot(eval_indices, train_loss, 'k-')
  plt.title('Softmax Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Softmax Loss')
  plt.show()
  # Plot train and test accuracy
  plt.plot(eval_indices, train_acc, 'k-', label='Train Set Accuracy')
  plt.plot(eval_indices, test_acc, 'r--', label='Test Set Accuracy')
  plt.title('Train and Test Accuracy')
  plt.xlabel('Generation')
  plt.ylabel('Accuracy')
  plt.legend(loc='lower right')
  plt.show()

Show six sample images from the latest batch of results:
  actuals=rand_y[0:6]
  predictions=np.argmax(temp_train_preds, axis=1)[0:6]
  images=np.squeeze(rand_x[0:6])
  Nrows=2
  Ncols=3
  for i in range(6):
     plt.subplot(Nrows, Ncols, i+1)
     plt.imshow(np.reshape(images[i], [28,28]), cmap='Greys_r')
     plt.title('Actual: '+str(actuals[i])+' Pred: '+str(predictions[i]), fontsize=10)
     frame=plt.gca()
     frame.axes.get_xaxis().set_visible(False)
     frame.axes.get_yaxis().set_visible(False)

CNNs work very well for image recognition: training from raw data here quickly reached about 97% accuracy, partly because the convolutional layers condense the important parts of a picture into low-dimensional features, which the rest of the network then uses for prediction.
CNN research for image recognition moves quickly, with new ideas and architectures appearing constantly; Arxiv.org (https://arxiv.org/), created and maintained by Cornell University, collects many of the papers in this area.

2) Implementing an advanced CNN:

This recipe extends the CNN approach by increasing the network depth. Repeating convolution, max-pool, and ReLU blocks is the standard way to deepen a network, and with a large enough dataset this improves prediction accuracy; many high-accuracy image recognition networks follow the pattern.
The CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) is used for recognition here: 60,000 images of 32×32 pixels in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Most image datasets are too large to fit in memory all at once, so TensorFlow sets up an image pipeline that reads batches from files; this requires building an image data reader and then a queue of batched data.
As usual for image recognition, the pictures are randomly distorted before training, here with random crops, flips, and brightness changes. This recipe is adapted from the official TensorFlow CIFAR-10 example.
  import os
  import sys
  import tarfile
  import matplotlib.pyplot as plt
  import numpy as np
  import tensorflow as tf
  from six.moves import urllib
  sess=tf.Session()

Declare the model parameters. The batch size is 128 for both training and test; training runs for 20,000 generations, printing the status every 50 generations and evaluating on a batch of test data every 500. Set the image height and width and the random-crop size, 3 color channels and 10 target classes, and finally declare where the data and batched images are stored. The code is:
  batch_size=128
  generations=20000
  output_every=50
  eval_every=500
  image_height=32
  image_width=32
  crop_height=24
  crop_width=24
  num_channels=3
  num_targets=10
  data_dir='temp'
  extract_folder='cifar-10-batches-bin'

To train a better model, decay the learning rate: start at 0.1 and decrease it exponentially by 10% every 250 generations, i.e. the learning rate is 0.1·0.9^floor(x/250), where x is the current generation (see the sketch after the parameters below). TensorFlow decays continuously by default but also accepts this stepped (staircase) schedule. Set the parameters:
  learning_rate=0.1
  lr_decay=0.9
  num_gens_to_wait=250

Set the parameters for reading the binary CIFAR-10 images:
  image_vec_length=image_height*image_width*num_channels
  record_length=1+image_vec_length

Set the download URL and the data directory for the CIFAR image dataset:
  data_dir='temp'
  if not os.path.exists(data_dir):
     os.makedirs(data_dir)
  cifar10_url='http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz'
  data_file=os.path.join(data_dir, 'cifar-10-binary.tar.gz')
  if not os.path.isfile(data_file):
     # Download file
     filepath, _=urllib.request.urlretrieve(cifar10_url, data_file)
     # Extract file
     tarfile.open(filepath, 'r:gz').extractall(data_dir)

Define a read_cifar_files() function that builds the image reader and returns a randomly distorted image:
  def read_cifar_files(filename_queue, distort_images=True):
     reader=tf.FixedLengthRecordReader(record_bytes=record_length)
     key, record_string=reader.read(filename_queue)
     record_bytes=tf.decode_raw(record_string, tf.uint8)
     # Extract label
     image_label=tf.cast(tf.slice(record_bytes, [0], [1]), tf.int32)
     # Extract image
     image_extracted=tf.reshape(tf.slice(record_bytes, [1], [image_vec_length]), [num_channels, image_height, image_width])
     # Reshape image
     image_uint8image=tf.transpose(image_extracted, [1, 2, 0])
     reshaped_image=tf.cast(image_uint8image, tf.float32)
     # Randomly Crop image
     final_image=tf.image.resize_image_with_crop_or_pad(reshaped_image, crop_width, crop_height)
     if distort_images:
       # Randomly flip the image horizontally, change the brightness and contrast
       final_image=tf.image.random_flip_left_right(final_image)
       final_image=tf.image.random_brightness(final_image, max_delta=63)
       final_image=tf.image.random_contrast(final_image, lower=0.2, upper=1.8)
     # Normalize whitening
     final_image=tf.image.per_image_standardization(final_image)
     return final_image, image_label

The function first declares a reader for fixed-length records, then reads an image from the file queue, extracts the image and its label, and finally distorts the image with TensorFlow's built-in image functions.
声明批量处理使用的图像管道填充函数:
  def input_pipeline(batch_size, train_logical=True):
     if train_logical:
       files=[os.path.join(data_dir, extract_folder, 'data_batch_{}.bin'.format(i)) for i in range(1,6)]
     else:
       files=[os.path.join(data_dir, extract_folder, 'test_batch.bin')]
     filename_queue=tf.train.string_input_producer(files)
     image, label=read_cifar_files(filename_queue)
     min_after_dequeue=1000
     capacity=min_after_dequeue+3*batch_size
     example_batch, label_batch=tf.train.shuffle_batch([image, label], batch_size, capacity, min_after_dequeue)
     return example_batch, label_batch

The function first builds the list of image files, then creates an input producer with TensorFlow's built-in function to read that list, passes the producer into the read_cifar_files() function created above, and finally creates the batched shuffling reader shuffle_batch() over the image queue.
Choosing a good min_after_dequeue value is important: it sets the minimum buffer of sampled images, and the TensorFlow documentation recommends (#threads + error margin)·batch_size. Setting it too large causes more shuffling, and shuffling a large image dataset through the queue takes more memory.
Declare the model function:
  def cifar_cnn_model(input_images, batch_size, train_logical=True):
     def truncated_normal_var(name, shape, dtype):
       return tf.get_variable(name=name, shape=shape, dtype=dtype, initializer=tf.truncated_normal_initializer(stddev=0.05))
     def zero_var(name, shape, dtype):
       return tf.get_variable(name=name, shape=shape, dtype=dtype, initializer=tf.constant_initializer(0.0))
     # First Convolutional Layer
     with tf.variable_scope('conv1') as scope:
       # Conv_kernel is 5x5 for all 3 colors and create 64 features
       conv1_kernel=truncated_normal_var(name='conv_kernel1', shape=[5, 5, 3, 64], dtype=tf.float32)
       # convolve across the image with a stride size of 1
       conv1=tf.nn.conv2d(input_images, conv1_kernel, [1, 1, 1, 1], padding='SAME')
        # Initialize and add the bias term
       conv1_bias=zero_var(name='conv_bias1', shape=[64], dtype=tf.float32)
       conv1_add_bias=tf.nn.bias_add(conv1, conv1_bias)
       # ReLU element wise
       relu_conv1=tf.nn.relu(conv1_add_bias)
     # Max Pooling
     pool1=tf.nn.max_pool(relu_conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool_layer1')
     # Local Response Normalization
     norm1=tf.nn.lrn(pool1, depth_radius=5, bias=2.0, alpha=1e-3, beta=0.75, name='norm1')
     # Second Convolutional Layer
     with tf.variable_scope('conv2') as scope:
       # Conv_kernel is 5x5, across all prior 64 features and create 64 more features
       conv2_kernel=truncated_normal_var(name='conv_kernel2', shape=[5, 5, 64, 64], dtype=tf.float32)
       # convolve filter across prior output with stride size of 1
       conv2=tf.nn.conv2d(norm1, conv2_kernel, [1, 1, 1, 1], padding='SAME')
        # Initialize and add the bias term
       conv2_bias=zero_var(name='conv_bias2', shape=[64], dtype=tf.float32)
       conv2_add_bias=tf.nn.bias_add(conv2, conv2_bias)
       # ReLU element wise
       relu_conv2=tf.nn.relu(conv2_add_bias)
     # Max Pooling
     pool2=tf.nn.max_pool(relu_conv2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool_layer2')
     # Local Response Normalization
     norm2=tf.nn.lrn(pool2, depth_radius=5, bias=2.0, alpha=1e-3, beta=0.75, name='norm2')
     # Reshape output into a single matrix for multiplication for the fully connected layers
     reshaped_output=tf.reshape(norm2, [batch_size, -1])
     reshaped_dim=reshaped_output.get_shape()[1].value
     # First Fully Connected Layer
     with tf.variable_scope('full1') as scope:
       # Fully connected layer will have 384 outputs
       full_weight1=truncated_normal_var(name='full_mult1', shape=[reshaped_dim, 384], dtype=tf.float32)
        full_bias1=zero_var(name='full_bias1', shape=[384], dtype=tf.float32)
       full_layer1=tf.nn.relu(tf.add(tf.matmul(reshaped_output, full_weight1), full_bias1))
     # Second Fully Connected Layer
     with tf.variable_scope('full2') as scope:
       # Second fully connected layer will have 192 outputs
       full_weight2=truncated_normal_var(name='full_mult2', shape=[384, 192], dtype=tf.float32)
        full_bias2=zero_var(name='full_bias2', shape=[192], dtype=tf.float32)
       full_layer2=tf.nn.relu(tf.add(tf.matmul(full_layer1, full_weight2), full_bias2))
     # Final Fully Connected Layer -> 10 categories for output (num_targets)
     with tf.variable_scope('full3') as scope:
       # Final fully connected layer will have 10 (num_targets) outputs
        full_weight3=truncated_normal_var(name='full_mult3', shape=[192, num_targets], dtype=tf.float32)
        full_bias3=zero_var(name='full_bias3', shape=[num_targets], dtype=tf.float32)
       final_output=tf.add(tf.matmul(full_layer2, full_weight3), full_bias3)
     return final_output

The model has two convolutional layers followed by three fully connected layers, plus two helper functions for creating variables. Each of the two convolutional layers creates 64 features. The first fully connected layer takes the output of the second convolutional block into 384 hidden nodes; the second maps those 384 nodes to 192; and the final layer maps the 192 nodes to the 10 predicted output classes. The tensor shapes through the model are sketched below.
Create the loss function. The softmax cross entropy is used here because the output is a probability distribution over the 10 classes:
  def cifar_loss(logits, targets):
     # Get rid of extra dimensions and cast targets into integers
     targets=tf.squeeze(tf.cast(targets, tf.int32))
     # Calculate cross entropy from logits and targets
     cross_entropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets)
     # Take the average loss across batch size
     cross_entropy_mean=tf.reduce_mean(cross_entropy)
     return cross_entropy_mean

Define the training-step function, in which the learning rate decays exponentially:
  def train_step(loss_value, generation_num):
     # Our learning rate is an exponential decay (stepped down)
     model_learning_rate=tf.train.exponential_decay(learning_rate, generation_num, num_gens_to_wait, lr_decay, staircase=True)
     # Create optimizer
     my_optimizer=tf.train.GradientDescentOptimizer(model_learning_rate)
     # Initialize train step
     train_step=my_optimizer.minimize(loss_value)
     return train_step

Create the batch accuracy function, which takes logits and a target vector and returns the mean accuracy:
  def accuracy_of_batch(logits, targets):
     # Make sure targets are integers and drop extra dimensions
     targets=tf.squeeze(tf.cast(targets, tf.int32))
     # Get predicted values by finding which logit is the greatest
     batch_predictions=tf.cast(tf.argmax(logits, 1), tf.int32)
     # Check if they are equal across the batch
     predicted_correctly=tf.equal(batch_predictions, targets)
     # Average the 1's and 0's (True's and False's) across the batch size
     accuracy=tf.reduce_mean(tf.cast(predicted_correctly, tf.float32))
     return accuracy

With the image pipeline function input_pipeline() in place, initialize both the training and the test image pipelines:
  images, targets=input_pipeline(batch_size, train_logical=True)
  test_images, test_targets=input_pipeline(batch_size, train_logical=False)

Initialize the training and test model outputs. Declaring scope.reuse_variables() after creating the training model lets the test model reuse the same model parameters. The code is:
  with tf.variable_scope('model_definition') as scope:
     # Declare the training network model
     model_output=cifar_cnn_model(images, batch_size)
     # Use same variables within scope
     scope.reuse_variables()
     # Declare test model output
     test_output=cifar_cnn_model(test_images, batch_size)

Initialize the loss and the test accuracy functions, then declare the generation variable. It must be declared as non-trainable and is passed into the training function to compute the exponentially decayed learning rate:
  loss=cifar_loss(model_output, targets)
  accuracy=accuracy_of_batch(test_output, test_targets)
  generation_num=tf.Variable(0, trainable=False)
  train_op=train_step(loss, generation_num)

Initialize all model variables, then call TensorFlow's start_queue_runners() function to start the image pipeline, which feeds batches of images to the training and test model outputs. The code is:
  init=tf.global_variables_initializer()
  sess.run(init)
  tf.train.start_queue_runners(sess=sess)

Now run the training loop, saving the training loss and the test accuracy:
  train_loss=[]
  test_accuracy=[]
  for i in range(generations):
     _, loss_value=sess.run([train_op, loss])
     if (i+1)%output_every==0:
       train_loss.append(loss_value)
       output='Generation {}: Loss={:.5f}'.format((i+1), loss_value)
       print(output)
     if (i+1)%eval_every==0:
       [temp_accuracy]=sess.run([accuracy])
       test_accuracy.append(temp_accuracy)
       acc_output='Test Accuracy= {:.2f}%'.format(100.*temp_accuracy)
       print(acc_output)

Partial output:
  Generation 19900: Loss=0.11342
  Generation 20000: Loss=0.02228
  Test Accuracy=83.59%

Plot the loss and accuracy from training with matplotlib:
  eval_indices=range(0, generations, eval_every)
  output_indices=range(0, generations, output_every)
  # Plot loss over time
  plt.plot(output_indices, train_loss, 'k-')
  plt.title('Softmax Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Softmax Loss')
  plt.show()
  # Plot accuracy over time
  plt.plot(eval_indices, test_accuracy, 'k-')
  plt.title('Test Accuracy')
  plt.xlabel('Generation')
  plt.ylabel('Accuracy')
  plt.show()

After training, the model reached 83.59% accuracy on the test set in the run above.

3) Retraining existing CNN models:

Training a brand-new image recognition model from raw data costs a great deal of time and compute; reusing a pretrained network shortens it. TensorFlow provides an official example of retraining an existing CNN; this recipe uses the popular Inception architecture to recognize the CIFAR-10 images.
  import os
  import tarfile
  import _pickle as cPickle
  import numpy as np
  import tensorflow as tf
  import urllib.request
  import scipy.misc
  from imageio import imwrite

Declare the CIFAR-10 data link, create a temporary folder to store the data, and declare the ten image categories:
  cifar_link='https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
  data_dir='temp'
  if not os.path.isdir(data_dir):
     os.makedirs(data_dir)
  objects=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

Download the CIFAR-10 .tar data file and extract it:
  target_file=os.path.join(data_dir, 'cifar-10-python.tar.gz')
  if not os.path.isfile(target_file):
     print('downloading CIFAR data (Size=163MB)')
     filename, headers=urllib.request.urlretrieve(cifar_link, target_file)
  # Extract into memory
  tar=tarfile.open(target_file)
  tar.extractall(path=data_dir)
  tar.close()

Create the folder structure needed for training: under the temp directory, a train_dir and a validation_dir folder, each with 10 subfolders, one per target class:
  # Create train image folders
  train_folder='train_dir'
  if not os.path.isdir(os.path.join(data_dir, train_folder)):
     for i in range(10):
       folder=os.path.join(data_dir, train_folder, objects[i])
       os.makedirs(folder)
  # Create test image folders
  test_folder='validation_dir'
  if not os.path.isdir(os.path.join(data_dir, test_folder)):
     for i in range(10):
       folder=os.path.join(data_dir, test_folder, objects[i])
       os.makedirs(folder)

To save the images, create a function that loads a batch file from disk into an image dictionary:
  def load_batch_from_file(file):
     file_conn=open(file, 'rb')
     image_dictionary=cPickle.load(file_conn, encoding='latin1')
     file_conn.close()
     return(image_dictionary)

Save each file from the dictionary to the correct location:
  def save_images_from_dict(image_dict, folder='data_dir'):
     for ix,label in enumerate(image_dict['labels']):
       folder_path=os.path.join(data_dir, folder, objects[label])
       filename=image_dict['filenames'][ix]
       # Transform image data
       image_array=image_dict['data'][ix]
       image_array.resize([3, 32, 32])
       # Save image
       output_location=os.path.join(folder_path, filename)
       imwrite(output_location, image_array.transpose())

Loop over the downloaded data files, saving every image in the right place:
  data_location=os.path.join(data_dir, 'cifar-10-batches-py')
  train_names=['data_batch_'+str(x) for x in range(1,6)]
  test_names=['test_batch']
  # Sort train images
  for file in train_names:
     print('Saving images from file: {}'.format(file))
     file_location=os.path.join(data_dir, 'cifar-10-batches-py', file)
     image_dict=load_batch_from_file(file_location)
     save_images_from_dict(image_dict, folder=train_folder)
  # Sort test images
  for file in test_names:
     print('Saving images from file: {}'.format(file))
     file_location=os.path.join(data_dir, 'cifar-10-batches-py', file)
     image_dict=load_batch_from_file(file_location)
     save_images_from_dict(image_dict, folder=test_folder)

Create a labels file, so that outputs can be interpreted with labels instead of numeric indices:
  cifar_labels_file=os.path.join(data_dir, 'cifar10_labels.txt')
  print('Writing labels file, {}'.format(cifar_labels_file))
  with open(cifar_labels_file, 'w') as labels_file:
     for item in objects:
       labels_file.write("{}\n".format(item))

After the script above runs, the image data is downloaded and sorted into categories.
Next, fetch the model source from the official TensorFlow example (the Inception example lives under research/inception in the models repository):
  git clone https://github.com/tensorflow/models.git
To reuse the pretrained model, download the network weights and apply them to the new model: visit https://github.com/tensorflow/models/tree/master/research/slim and follow the instructions to download and install the cifar10 model architecture and weights, which also provides a data directory containing build, train, and test scripts.
Convert the images to TFRecords objects:
  python3 data/build_image_data.py --train_directory="temp/train_dir/" --validation_directory="temp/validation_dir/" --output_directory="temp/" --labels_file="temp/cifar10_labels.txt"
Train the model with bazel, setting the fine_tune parameter to true. The script prints the loss every 10 generations and can be stopped at any time; the model output is saved in the temp/training_results folder, from which it can be loaded for evaluation.
  bazel-bin/inception/flowers_train --train_directory="temp/training_results/" --data_dir="temp/data_dir" --pretrained_model_checkpoint_path="model.ckpt-157585" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1
The official example trains on top of an already trained CNN: it builds a folder structure from the CIFAR-10 images, converts them into the TFRecords format for training, and fine-tunes the existing network by retraining the fully connected layers to fit the 10 target classes.

4) Implementing image style transfer (Stylenet):

The Stylenet procedure tries to learn the style of one image and apply it to a second image while keeping the second image's structure and content. This is possible if intermediate CNN layers can be found that correlate with the image's style but not with its content.
Stylenet takes two images and applies the style of one to the content of the other: it trains the style layers on the style picture and the content layers on the original picture, backpropagates the resulting losses, and thereby makes the original picture more like the style picture. The recommended network to download is imagenet-vgg-19 (imagenet-vgg-16 also works well).
Download the pretrained network, stored in .mat format, a MATLAB object that Python's scipy module can read, from: http://www.vlfeat.org/matconvnet/models/beta16/imagenet-vgg-verydeep-19.mat.
  import os
  import scipy.io
  import scipy.misc
  import imageio
  from skimage.transform import resize
  from operator import mul
  from functools import reduce
  import numpy as np
  import tensorflow as tf
  from tensorflow.python.framework import ops
  ops.reset_default_graph()

Declare the locations of the two images; Van Gogh's Starry Night is the style image:
  original_image_file='temp/book_cover.jpg'
  style_image_file='temp/starry_night.jpg'

Set the model parameters: the location of the .mat file, the weights, the learning rate, the number of generations, and how often to output intermediate images. The style weight can be increased to apply more of the style to the original image, and all of these parameters can be adjusted as needed. The code is:
  vgg_path='imagenet-vgg-verydeep-19.mat'
  original_image_weight=5.0
  style_image_weight=500.0
  regularization_weight=100
  learning_rate=10
  generations=100
  output_generations=25
  beta1=0.9
  beta2=0.999

Load the two images and resize the style image to the dimensions of the original image:
  original_image=imageio.imread(original_image_file)
  style_image=imageio.imread(style_image_file)
  # Get shape of target and make the style image the same
  target_shape=original_image.shape
  style_image=resize(style_image, target_shape)

Define the order in which the layers appear:
  vgg_layers=[
  'conv1_1', 'relu1_1',
  'conv1_2', 'relu1_2', 'pool1',
  'conv2_1', 'relu2_1',
  'conv2_2', 'relu2_2', 'pool2',
  'conv3_1', 'relu3_1',
  'conv3_2', 'relu3_2',
  'conv3_3', 'relu3_3',
  'conv3_4', 'relu3_4', 'pool3',
  'conv4_1', 'relu4_1',
  'conv4_2', 'relu4_2',
  'conv4_3', 'relu4_3',
  'conv4_4', 'relu4_4', 'pool4',
  'conv5_1', 'relu5_1',
  'conv5_2', 'relu5_2',
  'conv5_3', 'relu5_3',
  'conv5_4', 'relu5_4']

Define a function that extracts the parameters from the .mat file:
  def extract_net_info(path_to_params):
     vgg_data=scipy.io.loadmat(path_to_params)
     normalization_matrix=vgg_data['normalization'][0][0][0]
     mat_mean=np.mean(normalization_matrix, axis=(0, 1))
     network_weights=vgg_data['layers'][0]
     return mat_mean, network_weights

From the loaded weights and the layer definitions, create the network with the following function, which walks through the layers and assigns the appropriate weights and biases. The code is:
  def vgg_network(network_weights, init_image):
     network={}
     image=init_image
     for i,layer in enumerate(vgg_layers):
       if layer[0]=='c':
         weights, bias=network_weights[i][0][0][0][0]
         weights=np.transpose(weights, (1, 0, 2, 3))
         bias=bias.reshape(-1)
         conv_layer=tf.nn.conv2d(image, tf.constant(weights), (1, 1, 1, 1), 'SAME')
         image=tf.nn.bias_add(conv_layer, bias)
       elif layer[0]=='r':
         image=tf.nn.relu(image)
       else:
         image=tf.nn.max_pool(image, (1, 2, 2, 1), (1, 2, 2, 1), 'SAME')
       network[layer]=image
     return(network)

Here the original (content) image uses the relu4_2 and relu5_2 layers, while the style image uses a combination of reluX_1 layers (the loss code below weights both content layers, so both are listed):
  original_layers=['relu4_2', 'relu5_2']
  style_layers=['relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1']

Next, run the extract_net_info() function to get the network weights and the normalization mean. The VGG-19 style-layer weights must also be set; they can be chosen freely, and here each layer gets the uniform weight 1/len(style_layers). The code is:
  # Get network parameters
  normalization_mean, network_weights=extract_net_info(vgg_path)
  shape=(1,)+original_image.shape
  style_shape=(1,)+style_image.shape
  original_features={}
  style_features={}
  # Set style weights
  style_weights={l: 1./(len(style_layers)) for l in style_layers}

To stay faithful to the original image's content, build the content loss: first declare the image placeholder, create the network on that placeholder, and normalize the image matrix. The code is:
  g_original=tf.Graph()
  with g_original.as_default(),tf.Session() as sess1:
     image=tf.placeholder('float', shape=shape)
     vgg_net=vgg_network(network_weights, image)
     original_minus_mean=original_image-normalization_mean
     original_norm=np.array([original_minus_mean])
     for layer in original_layers:
       original_features[layer]=vgg_net[layer].eval(feed_dict={image: original_norm})

To give the original image the style of the style image, build the style features the same way, using the Gram matrices of the style layers:
  g_style=tf.Graph()
  with g_style.as_default(),tf.Session() as sess2:
     image=tf.placeholder('float', shape=style_shape)
     vgg_net=vgg_network(network_weights, image)
     style_minus_mean=style_image-normalization_mean
     style_norm=np.array([style_minus_mean])
     for layer in style_layers:
       features=vgg_net[layer].eval(feed_dict={image: style_norm})
       features=np.reshape(features, (-1, features.shape[3]))
       gram=np.matmul(features.T, features)/features.size
       style_features[layer]=gram

Compute the loss functions and run the training, first initializing an image of random noise as the target image:
  with tf.Graph().as_default():
     # Get network parameters
     initial=tf.random_normal(shape)*0.256
     init_image=tf.Variable(initial)
     vgg_net=vgg_network(network_weights, init_image)

Compute the content loss, which keeps as much of the original image's content as possible:
  original_layers_w={'relu4_2': 0.5, 'relu5_2': 0.5}
  original_loss=0
  for o_layer in original_layers:
     temp_original_loss=original_layers_w[o_layer]*original_image_weight*(2*tf.nn.l2_loss(vgg_net[o_layer]-original_features[o_layer]))
     original_loss+=(temp_original_loss/original_features[o_layer].size)

Compute the style loss by comparing the target image's style features against those of the style image:
  style_loss=0
  style_losses=[]
  for style_layer in style_layers:
     layer=vgg_net[style_layer]
     feats, height, width, channels=[x.value for x in layer.get_shape()]
     size=height*width*channels
     features=tf.reshape(layer, (-1, channels))
     style_gram_matrix=tf.matmul(tf.transpose(features), features)/size
     style_expected=style_features[style_layer]
     style_losses.append(style_weights[style_layer]*2*tf.nn.l2_loss(style_gram_matrix-style_expected)/style_expected.size)
  style_loss+=style_image_weight*tf.reduce_sum(style_losses)

The last loss is the total variation loss, which penalizes neighboring pixels that differ sharply, smoothing the result:
  # Requires: from functools import reduce; from operator import mul
  total_var_x=reduce(mul, init_image[:, 1:, :, :].get_shape().as_list(), 1)
  total_var_y=reduce(mul, init_image[:, :, 1:, :].get_shape().as_list(), 1)
  first_term=regularization_weight*2
  second_term_numerator=tf.nn.l2_loss(init_image[:, 1:, :, :]-init_image[:, :shape[1]-1, :, :])
  second_term=second_term_numerator/total_var_y
  third_term=(tf.nn.l2_loss(init_image[:, :, 1:, :]-init_image[:, :, :shape[2]-1, :])/total_var_x)
  total_variation_loss=first_term*(second_term+third_term)

Combine the three losses into a total loss, then declare the optimizer and the training step:
  loss=original_loss+style_loss+total_variation_loss
  optimizer=tf.train.AdamOptimizer(learning_rate, beta1, beta2)
  train_step=optimizer.minimize(loss)

Start training, saving intermediate images along the way and the final output image at the end:
  with tf.Session() as sess:
     tf.global_variables_initializer().run()
     for i in range(generations):
       train_step.run()
       # Print update and save temporary output
       if (i+1) % output_generations==0:
         print('Generation {} out of {}, loss: {}'.format(i+1, generations, sess.run(loss)))
         image_eval=init_image.eval()
         best_image_add_mean=image_eval.reshape(shape[1:])+normalization_mean
         output_file='temp_output_{}.jpg'.format(i)
         imageio.imwrite(output_file, best_image_add_mean.astype(np.uint8))
     # Save final image
     image_eval=init_image.eval()
     best_image_add_mean=image_eval.reshape(shape[1:])+normalization_mean
     output_file='final_output.jpg'
     imageio.imwrite(output_file, best_image_add_mean.astype(np.uint8))

In the code above, two images are loaded, then the pretrained network weights are loaded and network layers are assigned to the original and style images. Three losses are computed: a content loss, a style loss, and a total variation loss. A random-noise image is then trained using the content of the original image and the style of the style image.

5) Implementing DeepDream:

Another application of reusing a trained CNN model exploits the way some intermediate nodes detect label features; this can be used to transform any image so that it reflects the features of whichever node is selected. The official TensorFlow tutorial shows how to implement DeepDream with a script (https://www.tensorflow.org/tutorials/images/image_recognition); here a GoogLeNet model is used (https://arxiv.org/abs/1512.00567).
Download GoogLeNet, a CNN model pretrained on the ImageNet image dataset:
  wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
  unzip inception5h.zip

Import the necessary libraries and create a graph session:
  import os
  import matplotlib.pyplot as plt
  import numpy as np
  import PIL.Image
  import tensorflow as tf
  from io import BytesIO
  graph=tf.Graph()
  sess=tf.InteractiveSession(graph=graph)

Declare the location of the unzipped model parameters and load them into the TensorFlow graph:
  model_fn='tensorflow_inception_graph.pb'
  with tf.gfile.FastGFile(model_fn, 'rb') as f:
     graph_def=tf.GraphDef()
     graph_def.ParseFromString(f.read())

Create a placeholder for the input data, set imagenet_mean to 117.0, then import the graph definition, passing in the normalized placeholder:
  t_input=tf.placeholder(np.float32, name='input')
  imagenet_mean=117.0
  t_preprocessed=tf.expand_dims(t_input-imagenet_mean, 0)
  tf.import_graph_def(graph_def, {'input':t_preprocessed})

Collect the convolutional layers so they can be visualized and used later in the DeepDream processing:
  layers=[op.name for op in graph.get_operations() if op.type=='Conv2D' and 'import/' in op.name]
  feature_nums=[int(graph.get_tensor_by_name(name+':0').get_shape()[-1]) for name in layers]

Pick a layer to visualize. Layers can be selected by name; here feature channel 139 is examined, and a noise image is prepared:
  layer='mixed4d_3x3_bottleneck_pre_relu'
  channel=139
  img_noise=np.random.uniform(size=(224, 224, 3))+100.0

Declare a function that plots an image array:
  def showarray(a, fmt='jpeg'):
     a=np.uint8(np.clip(a, 0, 1)*255)
     f=BytesIO()
     PIL.Image.fromarray(a).save(f, fmt)
     plt.imshow(a)

Create a helper that fetches a layer's tensor from the graph by name, to avoid repeating code:
  def T(layer):
     return graph.get_tensor_by_name("import/%s:0"%layer)

Build a wrapper function that creates placeholders from the specified argument types and returns a function that evaluates against them:
  def tffunc(*argtypes):
     placeholders=list(map(tf.placeholder, argtypes))
     def wrap(f):
       out=f(*placeholders)
       def wrapper(*args, **kw):
         return out.eval(dict(zip(placeholders, args)), session=kw.get('session'))
       return wrapper
     return wrap

Create an image-resizing function for a given target size, using TensorFlow's built-in bilinear interpolation, tf.image.resize_bilinear():
  def resize(img, size):
     img=tf.expand_dims(img, 0)
     return tf.image.resize_bilinear(img, size)[0,:,:,:]

Update the source image by computing its gradient with respect to the chosen feature, making the image look more like that feature. To keep the computation fast, the gradient is computed over sub-regions (tiles) of the image, and the image is randomly shifted (rolled) in the x and y directions, which smooths out tile boundary artifacts. The code is:
  def calc_grad_tiled(img, t_grad, tile_size=512):
     # Pick a subregion square size
     sz=tile_size
     # Get the image height and width
     h, w=img.shape[:2]
     # Get a random shift amount in the x and y direction
     sx, sy=np.random.randint(sz, size=2)
     # Randomly shift the image (roll image) in the x and y directions
     img_shift=np.roll(np.roll(img, sx, 1), sy, 0)
     # Initialize the whole image gradient as zeros
     grad=np.zeros_like(img)
     # Now we loop through all the sub-tiles in the image
     for y in range(0, max(h-sz//2, sz), sz):
       for x in range(0, max(w-sz//2, sz), sz):
         # Select the sub image tile
         sub=img_shift[y:y+sz, x:x+sz]
         # Calculate the gradient for the tile
         g=sess.run(t_grad, {t_input:sub})
         # Apply the gradient of the tile to the whole image gradient
         grad[y:y+sz, x:x+sz]=g
     # Return the gradient, undoing the roll operation
     return np.roll(np.roll(grad, -sx, 1), -sy, 0)

Declare the DeepDream function. The objective is the mean of the selected feature, and the loss is gradient-based, depending on the distance between the input image and the selected feature. The strategy is to split the image into a high-frequency and a low-frequency part: the gradient is computed on the low-frequency part, while the high-frequency part is split again into high and low frequencies, repeating the process. The original and low-frequency images are called octaves; for each octave passed in, the gradient is computed and applied to the image. The code is:
  def render_deepdream(t_obj, img0=img_noise, iter_n=10, step=1.5, octave_n=4, octave_scale=1.4):
     # Defining the optimization objective
     t_score=tf.reduce_mean(t_obj)
     # Gradients will be defined as changing the t_input to closer t_score
     t_grad=tf.gradients(t_score, t_input)[0]
     # Store the image
     img=img0
     # Initialize the image octave list
     octaves=[]
     # Calculate n-1 octaves
     for i in range(octave_n-1):
       # Extract the image shape
       hw=img.shape[:2]
       # Resize the image
       lo=resize(img, np.int32(np.float32(hw)/octave_scale))
       # Residual is hi.
       hi=img-resize(lo, hw)
       # Save the lo image for re-iterating
       img=lo
       # Save the extracted hi-image
       octaves.append(hi)
     # Generate details octave by octave
     for octave in range(octave_n):
       if octave>0:
         # Start with the last octave
         hi=octaves[-octave]
          img=resize(img, hi.shape[:2])+hi
       for i in range(iter_n):
         # Calculate gradient of the image
         g=calc_grad_tiled(img, t_grad)
         img+=g*(step/(np.abs(g).mean()+1e-7))
         print('.', end=' ')
     showarray(img/255.0)

With all functions in place, run the DeepDream algorithm:
  if __name__=="__main__":
     # Create resize function
     resize=tffunc(np.float32, np.int32)(resize)
     # Open image
     img0=PIL.Image.open('book_cover.jpg')
     img0=np.float32(img0)
     # Show Original Image
     showarray(img0/255.0)
     # Create deep dream
     render_deepdream(T(layer)[:,:,:,139], img0, iter_n=15)
     sess.close()

More on this can be found in the official DeepDream post at https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html, with example code at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/deepdream.

6) Using Keras as a TensorFlow API:

TensorFlow offers flexibility and power, but prototyping models and iterating over experiments can be cumbersome. Keras is a wrapper over deep learning libraries that makes it easier to handle every aspect of a model and simplifies the programming. Here Keras, as shipped inside TensorFlow, is used to build a fully connected neural network and a simple CNN image model on the MNIST dataset.
  import tensorflow as tf
  from sklearn.preprocessing import MultiLabelBinarizer
  from keras.utils import to_categorical
  from tensorflow import keras
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  # Load MNIST data
  from tensorflow.examples.tutorials.mnist import input_data

Load the MNIST data with TensorFlow's import function. MNIST images are 28×28 pixels, so each observation is 784 grayscale values between 0 and 1, and the y labels are imported as integers 0 through 9:
  mnist=input_data.read_data_sets("MNIST_data")
  x_train=mnist.train.images
  x_test=mnist.test.images
  y_train=mnist.train.labels
  y_test=mnist.test.labels
  y_train=[[i] for i in y_train]
  y_test=[[i] for i in y_test]

Use scikit-learn's MultiLabelBinarizer() to convert the target integers into one-hot encoded vectors:
  one_hot=MultiLabelBinarizer()
  y_train=one_hot.fit_transform(y_train)
  y_test=one_hot.transform(y_test)

Create a three-layer fully connected network with 32, 16, and 10 units, where the final 10-unit softmax layer is the output:
  # Start with 'sequential' model type
  model=keras.Sequential()
  # Adds a densely-connected layer with 32 unit
  model.add(keras.layers.Dense(32, activation='relu'))
  # Adds a densely-connected layer with 16 unit
  model.add(keras.layers.Dense(16, activation='relu'))
  # Adds a softmax layer with 10 output unit
  model.add(keras.layers.Dense(10, activation='softmax'))

To train the model, call compile() with the appropriate arguments. The required arguments are an optimizer and a loss function; accuracy should also be tracked, so the metrics list includes 'accuracy'. The code is:
  model.compile(optimizer=tf.train.AdamOptimizer(0.001), loss='categorical_crossentropy', metrics=['accuracy'])
To configure a regression model with a mean-squared-error loss instead, use:
  model.compile(optimizer=tf.train.AdamOptimizer(0.01), loss='mse', metrics=['mae'])
Next, implement a CNN model with two convolution layers, each followed by max pooling, and a fully connected layer at the end. First, reshape the flat images into 2D images and convert the y targets to numpy arrays. The code is:
  x_train=x_train.reshape(x_train.shape[0], 28, 28, 1)
  x_test=x_test.reshape(x_test.shape[0], 28, 28, 1)
  input_shape=(28, 28, 1)
  num_classes=10
  # Categorize y targets
  y_test=to_categorical(mnist.test.labels)
  y_train=to_categorical(mnist.train.labels)

Use Keras's Conv2D(), MaxPooling2D(), and Dense() functions to create the CNN model:
  cnn_model=keras.Sequential()
  # First convolution layer
  cnn_model.add(keras.layers.Conv2D(25, kernel_size=(4,4), strides=(1,1), activation='relu', input_shape=input_shape))
  # Max pooling
  cnn_model.add(keras.layers.MaxPooling2D(pool_size=(2,2), strides=(2,2)))
  # Second convolution layer
  cnn_model.add(keras.layers.Conv2D(50, kernel_size=(5,5), strides=(1,1), activation='relu'))
  # Max pooling
  cnn_model.add(keras.layers.MaxPooling2D(pool_size=(2,2), strides=(2,2)))
  # Flatten for dense (fully connected) layer
  cnn_model.add(keras.layers.Flatten())
  # Add dense (fully connected) layer
  cnn_model.add(keras.layers.Dense(num_classes, activation='softmax'))

Compile the model, setting the optimizer and loss function:
  cnn_model.compile(optimizer=tf.train.AdamOptimizer(0.001), loss='categorical_crossentropy', metrics=['accuracy'])
Keras also allows functions to be hooked into training via Callbacks. Callbacks are executed at certain points during training and can perform various tasks. There are many predefined callbacks that can save models, stop training under given conditions, log values, and more; see https://keras.io/callbacks/. Here a custom Keras Callback class named RecordAccuracy() is created to store the accuracy at the end of each epoch. The code is:
  class RecordAccuracy(keras.callbacks.Callback):
     def on_train_begin(self, logs={}):
       self.acc=[]
     def on_epoch_end(self, batch, logs={}):
       self.acc.append(logs.get('acc'))
  accuracy=RecordAccuracy()

Train the CNN model with fit(), supplying validation_data and callbacks:
  cnn_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test), callbacks=[accuracy])
  print(accuracy.acc)

Keras makes it easy to create and train models, automatically handling many of the fiddly details of variable types, dimensions, and data feeding. The cost is that it hides many model details, making it possible to implement the wrong model without noticing.

3. TensorFlow for Natural Language Processing:

To apply machine learning to text, the text must first be turned into numbers, and there are many ways to do this. One approach assigns numbers to the words of a sentence in order; when a new sentence is seen, the same words map to the same numbers, and 0 marks unrecognized words. For a long text, the highest-frequency words can be kept while every other word is labeled 0. These numbers are purely categorical and carry no numeric relationship.
Most models expect inputs of equal length, so sentences are converted into sparse vectors: if a sentence contains a given vocabulary word, the value at that word's index is 1, and 0 means the word is absent. This conversion loses word-order information but yields vectors of equal length. Since the chosen vocabulary is usually very large, the resulting sentence vectors are very sparse. This embedding method is called bag-of-words.
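For instance, a toy bag-of-words encoding might look like the following sketch (the five-word vocabulary and the sentences are made up for illustration and are not part of the recipe below):
  # Toy bag-of-words encoding with a made-up vocabulary
  vocab={'free': 0, 'win': 1, 'cash': 2, 'meeting': 3, 'tomorrow': 4}
  def bag_of_words(sentence, vocab):
     vec=[0]*len(vocab)
     for word in sentence.lower().split():
       if word in vocab:   # words outside the vocabulary are simply dropped
         vec[vocab[word]]=1
     return vec
  print(bag_of_words('Win free cash', vocab))    # [1, 1, 1, 0, 0]
  print(bag_of_words('Meeting tomorrow', vocab)) # [0, 0, 0, 1, 1]
Both vectors have the same length regardless of sentence length, which is exactly what the models below require.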

1) Working with bag-of-words:

This example uses the SMS spam text dataset from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection), which contains both normal (ham) and spam messages. After downloading and storing it, the bag-of-words method is used to predict whether a text is spam. The bag-of-words model is trained with logistic regression without a hidden layer, using stochastic training with a batch size of 1, and accuracy is computed on a held-out test set.
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import os
  import numpy as np
  import csv
  import string
  import requests
  import io
  from zipfile import ZipFile
  from tensorflow.contrib import learn
  sess=tf.Session()

Download the zip file from the UCI repository and store it, then extract the input and target data, relabeling spam as 1 and ham as 0. The code is:
  save_file_name=os.path.join('temp', 'temp_spam_data.csv')
  if os.path.isfile(save_file_name):
     text_data=[]
     with open(save_file_name, 'r') as temp_output_file:
       reader=csv.reader(temp_output_file)
       for row in reader:
         text_data.append(row)
  else:
     zip_url='http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
     r=requests.get(zip_url)
     z=ZipFile(io.BytesIO(r.content))
     file=z.read('SMSSpamCollection')
     # Format Data
     text_data=file.decode()
     text_data=text_data.encode('ascii', errors='ignore')
     text_data=text_data.decode().split('\n')
     text_data=[x.split('\t') for x in text_data if len(x)>=1]
     # And write to csv
     with open(save_file_name, 'w') as temp_output_file:
       writer=csv.writer(temp_output_file)
       writer.writerows(text_data)
  texts=[x[1] for x in text_data]
  target=[x[0] for x in text_data]
  # Relabel 'spam' as 1, 'ham' as 0
  target=[1 if x=='spam' else 0 for x in target]

To reduce the vocabulary size, normalize the text by removing the effects of case, punctuation, and digits:
  # Convert to lower case
  texts=[x.lower() for x in texts]
  # Remove punctuation
  texts=[''.join(c for c in x if c not in string.punctuation) for x in texts]
  # Remove numbers
  texts=[''.join(c for c in x if c not in '0123456789') for x in texts]
  # Trim extra whitespace
  texts=[' '.join(x.split()) for x in texts]

Determine the maximum sentence length by plotting a histogram of text lengths and choosing a sensible cutoff:
  # Plot histogram of text lengths
  text_lengths=[len(x.split()) for x in texts]
  text_lengths=[x for x in text_lengths if x<50]
  plt.hist(text_lengths, bins=25)
  plt.title('Histogram of # of Words in Texts')
  sentence_size=25
  min_word_freq=3

Based on the histogram, the maximum text length is set to 25 words; larger values could also be used.
TensorFlow has a built-in conversion function, VocabularyProcessor(), in the learn.preprocessing library. Note that this function may raise deprecation warnings.
  vocab_processor=learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
  vocab_processor.fit_transform(texts)
  transformed_texts=np.array([x for x in vocab_processor.transform(texts)])
  embedding_size=len(np.unique(transformed_texts))
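Since VocabularyProcessor() is deprecated, a rough equivalent can be sketched with the Keras preprocessing utilities; this is an alternative under stated assumptions (the Tokenizer settings here are illustrative, and low-frequency filtering differs), not the recipe's code:
  # Possible VocabularyProcessor replacement (sketch)
  from tensorflow.keras.preprocessing.text import Tokenizer
  from tensorflow.keras.preprocessing.sequence import pad_sequences
  tokenizer=Tokenizer(oov_token='<UNK>')
  tokenizer.fit_on_texts(texts)
  sequences=tokenizer.texts_to_sequences(texts)
  padded_texts=pad_sequences(sequences, maxlen=sentence_size, padding='post')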

Split the data into training and test sets:
  train_indices=np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
  test_indices=np.array(list(set(range(len(texts)))-set(train_indices)))
  texts_train=[x for ix,x in enumerate(texts) if ix in train_indices]
  texts_test=[x for ix,x in enumerate(texts) if ix in test_indices]
  target_train=[x for ix,x in enumerate(target) if ix in train_indices]
  target_test=[x for ix,x in enumerate(target) if ix in test_indices]

Declare the word-embedding matrix. Sentence words are converted to indices, and each index is then converted into a one-hot vector created from an identity matrix. This matrix is used to look up the sparse vector for each word, which is added into the sentence's sparse vector. The code is:
  identity_mat=tf.diag(tf.ones(shape=[embedding_size]))
Because logistic regression will be used to predict the probability of spam, declare the logistic regression variables:
  A=tf.Variable(tf.random_normal(shape=[embedding_size, 1]))
  b=tf.Variable(tf.random_normal(shape=[1, 1]))

Declare the placeholders. The x_data input placeholder is integer-typed because it is used to look up row indices of the identity matrix:
  x_data=tf.placeholder(shape=[sentence_size], dtype=tf.int32)
  y_target=tf.placeholder(shape=[1, 1], dtype=tf.float32)

Use TensorFlow's embedding lookup to map the word indices of a sentence to one-hot rows of the identity matrix, then sum the word vectors to get the sentence vector:
  x_embed=tf.nn.embedding_lookup(identity_mat, x_data)
  x_col_sums=tf.reduce_sum(x_embed, 0)

With fixed-length sentence vectors in hand, logistic regression can be trained. Declare the model: since training is stochastic over one data point at a time, the input dimensions must be expanded before applying the linear model operations:
  x_col_sums_2D=tf.expand_dims(x_col_sums, 0)
  model_output=tf.add(tf.matmul(x_col_sums_2D, A), b)

Declare the loss function, the prediction operation, and the optimizer:
  loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))
  # Prediction operation
  prediction=tf.sigmoid(model_output)
  # Declare optimizer
  my_opt=tf.train.GradientDescentOptimizer(0.001)
  train_step=my_opt.minimize(loss)

Initialize the variables in the graph:
  init=tf.global_variables_initializer()
  sess.run(init)

Start training. TensorFlow's vocab_processor.fit_transform() is a generator, which suits stochastic training of the logistic regression model. To see the accuracy trend, a trailing average of the last 50 iterations is kept; plotting only the current value would yield 1 or 0 depending on whether the single training point was predicted correctly. The code is:
  loss_vec=[]
  train_acc_all=[]
  train_acc_avg=[]
  for ix,t in enumerate(vocab_processor.fit_transform(texts_train)):
     y_data=[[target_train[ix]]]
     sess.run(train_step, feed_dict={x_data: t, y_target: y_data})
     temp_loss=sess.run(loss, feed_dict={x_data: t, y_target: y_data})
     loss_vec.append(temp_loss)
     if (ix+1)%10==0:
       print('Training Observation #{}: Loss={}'.format(ix+1, temp_loss))
     # Keep trailing average of past 50 observations accuracy
     # Get prediction of single observation
     [[temp_pred]]=sess.run(prediction, feed_dict={x_data: t, y_target: y_data})
     # Get True/False if prediction is accurate
     train_acc_temp=target_train[ix]==np.round(temp_pred)
     train_acc_all.append(train_acc_temp)
     if len(train_acc_all)>=50:
       train_acc_avg.append(np.mean(train_acc_all[-50:]))

Partial output:
  Training Observation #4430: Loss=1.84636
  Training Observation #4450: Loss=0.045941

To get the test-set accuracy, repeat the process on the test texts, running only the prediction operation, not the training operation:
  test_acc_all=[]
  for ix,t in enumerate(vocab_processor.transform(texts_test)):
     y_data=[[target_test[ix]]]
     if (ix+1)%50==0:
       print('Test Observation #{}'.format(ix+1))
     # Keep trailing average of past 50 observations accuracy
     # Get prediction of single observation
     [[temp_pred]]=sess.run(prediction, feed_dict={x_data: t, y_target: y_data})
     # Get True/False if prediction is accurate
     test_acc_temp=target_test[ix]==np.round(temp_pred)
     test_acc_all.append(test_acc_temp)
  print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))

Partial output:
  Test Observation #1100
  Overall Test Accuracy: 0.8035874439461883

The text length was capped at 25 words here because length influences the prediction: a word that appears once may indicate ham, while repeated occurrences of the same word may push the prediction toward spam, an effect most common with imbalanced data.

2) Implementing TF-IDF:

TF-IDF (Term Frequency - Inverse Document Frequency) is the product of term frequency and inverse document frequency.
Bag-of-words assigns a value of 1 for every word occurrence, but words like "the" and "and" carry relatively little information, so different words should receive different weights.
The term frequency (TF) is how often a word occurs within a document, and the goal is to weight each word appropriately. Words such as "the" and "and" have high frequency in every document, so their weight should be reduced; multiplying TF by the inverse document frequency (the reciprocal of the fraction of documents containing the word) surfaces the important words. Because corpora tend to be huge, the document frequency is usually taken inside a logarithm, giving the TF-IDF formula for a word in a document:

  w_tfidf = w_tf · log(1 / w_df)

where w_tf is the word's frequency in the document and w_df is the fraction of all documents that contain the word. A high TF-IDF value indicates that the word is important in the document.
Creating TF-IDF vectors requires loading all the text into memory and counting every word's occurrences before training begins. Here scikit-learn creates the TF-IDF vectors and TensorFlow fits the logistic regression model.
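As a toy illustration of the formula (the counts below are made up):
  import numpy as np
  # A word occurs 3 times in a 50-word document...
  tf_weight=3/50
  # ...and appears in 10 of the 100 documents in the corpus
  w_df=10/100
  idf=np.log(1/w_df)      # inverse document frequency
  print(tf_weight*idf)    # TF-IDF weight, about 0.138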
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import os
  import numpy as np
  import csv
  import string
  import requests
  import io
  import nltk
  from zipfile import ZipFile
  from sklearn.feature_extraction.text import TfidfVectorizer

Create a graph session, and set the batch size and the maximum number of features in the vocabulary:
  sess=tf.Session()
  batch_size=200
  max_features=1000

Load the text dataset:
  save_file_name=os.path.join('temp', 'temp_spam_data.csv')
  if os.path.isfile(save_file_name):
     text_data=[]
     with open(save_file_name, 'r') as temp_output_file:
       reader=csv.reader(temp_output_file)
       for row in reader:
         text_data.append(row)
  else:
     zip_url='http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
     r=requests.get(zip_url)
     z=ZipFile(io.BytesIO(r.content))
     file=z.read('SMSSpamCollection')
     # Format Data
     text_data=file.decode()
     text_data=text_data.encode('ascii', errors='ignore')
     text_data=text_data.decode().split('\n')
     text_data=[x.split('\t') for x in text_data if len(x)>=1]
     # And write to csv
     with open(save_file_name, 'w') as temp_output_file:
       writer=csv.writer(temp_output_file)
       writer.writerows(text_data)
  texts=[x[1] for x in text_data]
  target=[x[0] for x in text_data]
  # Relabel 'spam' as 1, 'ham' as 0
  target=[1. if x=='spam' else 0. for x in target]

Reduce the vocabulary size by lowercasing all characters and stripping punctuation and digits:
  # Convert to lower case
  texts=[x.lower() for x in texts]
  # Remove punctuation
  texts=[''.join(c for c in x if c not in string.punctuation) for x in texts]
  # Remove numbers
  texts=[''.join(c for c in x if c not in '0123456789') for x in texts]
  # Trim extra whitespace
  texts=[' '.join(x.split()) for x in texts]

To use scikit-learn's TF-IDF processing, the sentences must be tokenized, that is, split into their constituent words; the nltk package provides a tokenizer for this:
  def tokenizer(text):
     words=nltk.word_tokenize(text)
     return words
  # Create TF-IDF of texts
  tfidf=TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
  sparse_tfidf_texts=tfidf.fit_transform(texts)

Split the data into training and test sets:
  train_indices=np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
  test_indices=np.array(list(set(range(sparse_tfidf_texts.shape[0]))-set(train_indices)))
  texts_train=sparse_tfidf_texts[train_indices]
  texts_test=sparse_tfidf_texts[test_indices]
  target_train=np.array([x for ix,x in enumerate(target) if ix in train_indices])
  target_test=np.array([x for ix,x in enumerate(target) if ix in test_indices])

Declare the logistic regression variables and the dataset placeholders:
  A=tf.Variable(tf.random_normal(shape=[max_features, 1]))
  b=tf.Variable(tf.random_normal(shape=[1, 1]))
  x_data=tf.placeholder(shape=[None, max_features], dtype=tf.float32)
  y_target=tf.placeholder(shape=[None, 1], dtype=tf.float32)

Declare the model operations and the loss function:
  model_output=tf.add(tf.matmul(x_data, A), b)
  loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

Add prediction and accuracy functions to the graph so that train and test accuracy can be tracked during training:
  prediction=tf.round(tf.sigmoid(model_output))
  prediction_correct=tf.cast(tf.equal(prediction, y_target), tf.float32)
  accuracy=tf.reduce_mean(prediction_correct)

Declare the optimizer:
  my_opt=tf.train.GradientDescentOptimizer(0.0025)
  train_step=my_opt.minimize(loss)

Initialize the variables in the graph:
  init=tf.global_variables_initializer()
  sess.run(init)

Train the model for 10,000 iterations, recording the train and test loss and accuracy every 100 iterations and printing a status line every 500:
  train_loss=[]
  test_loss=[]
  train_acc=[]
  test_acc=[]
  i_data=[]
  for i in range(10000):
     rand_index=np.random.choice(texts_train.shape[0], size=batch_size)
     rand_x=texts_train[rand_index].todense()
     rand_y=np.transpose([target_train[rand_index]])
     sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
     # Record loss and accuracy every 100 generations
     if (i+1)%100==0:
       i_data.append(i+1)
       train_loss_temp=sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
       train_loss.append(train_loss_temp)
       test_loss_temp=sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
       test_loss.append(test_loss_temp)
       train_acc_temp=sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
       train_acc.append(train_acc_temp)
       test_acc_temp=sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
       test_acc.append(test_acc_temp)
     if (i+1)%500==0:
       acc_and_loss=[i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
       acc_and_loss=[np.round(x,2) for x in acc_and_loss]
       print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

Partial output:
  Generation # 9500. Train Loss (Test Loss): 0.39 (0.45). Train Acc (Test Acc): 0.89 (0.85)
  Generation # 10000. Train Loss (Test Loss): 0.48 (0.45). Train Acc (Test Acc): 0.84 (0.85)

Training on TF-IDF values raises the prediction accuracy from about 80% with bag-of-words to close to 90%. This approach addresses word importance but still ignores word order.

3) Implementing the skip-gram model:

The word vectors created so far ignore the ordering of related words. Tomas Mikolov and other researchers at Google published a paper introducing a word-vector method that addresses this, named Word2Vec (https://arxiv.org/abs/1301.3781). The basic idea is to create word vectors that capture a word's contextual relationships.
A neural network can be trained either to predict a word's context words from the input word, or to predict the target word from its context words; both are variants of Word2Vec. Predicting context words from the target word is called the skip-gram model; predicting the target word from the set of context words is the continuous bag of words (CBOW) method. Here the skip-gram model is implemented on the Cornell movie review dataset.
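To see the difference between the two variants, here are the training pairs each one generates from a single toy sentence with a window of 1 (illustrative only; the recipe's batch generator appears later):
  # Toy (input, label) pairs for skip-gram and CBOW, window=1
  sentence=['the', 'movie', 'was', 'great']
  window=1
  # skip-gram: target word -> each surrounding word
  skip_pairs=[(sentence[i], sentence[j])
              for i in range(len(sentence))
              for j in range(max(0, i-window), min(len(sentence), i+window+1))
              if i!=j]
  print(skip_pairs)
  # [('the','movie'), ('movie','the'), ('movie','was'), ('was','movie'), ('was','great'), ('great','was')]
  # CBOW: surrounding words -> target word
  cbow_pairs=[([sentence[j] for j in range(max(0, i-window), min(len(sentence), i+window+1)) if j!=i], sentence[i])
              for i in range(len(sentence))]
  print(cbow_pairs)
  # [(['movie'],'the'), (['the','was'],'movie'), (['movie','great'],'was'), (['was'],'great')]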
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import os
  import random
  import numpy as np
  import collections
  import string
  import requests
  import io
  import tarfile
  import urllib.request
  from nltk.corpus import stopwords
  sess=tf.Session()

Declare some model parameters:
  batch_size=50
  embedding_size=200
  vocabulary_size=10000
  generations=50000
  print_loss_every=500
  num_sampled=int(batch_size/2)
  window_size=2
  stops=stopwords.words('english')
  print_valid_every=2000
  valid_words=['cliche', 'love', 'hate', 'silly', 'sad']

The model looks up 50 pairs of word embeddings at a time (the batch size); each word is embedded in a vector of length 200, and only the 10,000 most frequent words are considered (all other words are classed as unknown). Training runs for 50,000 iterations, printing the loss every 500. The num_sampled variable declared here is used later in the loss function; the skip-gram context window size is set to 2, meaning two context words on each side of the target are examined. Stop words are removed with Python's nltk package, and to check the quality of the word vectors, a handful of common movie-review words is chosen whose nearest neighbors are printed every 2,000 training iterations.
Declare the data-loading function, which checks whether the dataset has already been downloaded; if so, it loads the data from disk instead of downloading it again. The code is:
  def load_movie_data():
     save_folder_name='temp'
     pos_file=os.path.join(save_folder_name, 'rt-polarity.pos')
     neg_file=os.path.join(save_folder_name, 'rt-polarity.neg')
     # Check if files are already downloaded
     if os.path.exists(save_folder_name):
       pos_data=[]
       with open(pos_file, 'r') as temp_pos_file:
         for row in temp_pos_file:
           pos_data.append(row)
       neg_data=[]
       with open(neg_file, 'r') as temp_neg_file:
         for row in temp_neg_file:
           neg_data.append(row)
     else:   # If not download and save
        movie_data_url='http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
       stream_data=urllib.request.urlopen(movie_data_url)
       tmp=io.BytesIO()
       while True:
         s=stream_data.read(16384)
         if not s:
           break
         tmp.write(s)
        stream_data.close()
        tmp.seek(0)
       tar_file=tarfile.open(fileobj=tmp, mode='r:gz')
       pos=tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
       neg=tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
       # Save pos/neg reviews
       pos_data=[]
       for line in pos:
         pos_data.append(line.decode('ISO-8859-1').encode('ascii', errors='ignore').decode())
       neg_data=[]
       for line in neg:
         neg_data.append(line.decode('ISO-8859-1').encode('ascii', errors='ignore').decode())
       tar_file.close()
       # Write to file
       if not os.path.exists(save_folder_name):
         os.makedirs(save_folder_name)
       # Save files
       with open(pos_file, 'w') as pos_file_handler:
          pos_file_handler.write(''.join(pos_data))
       with open(neg_file, 'w') as neg_file_handler:
          neg_file_handler.write(''.join(neg_data))
     texts=pos_data+neg_data
     target=[1]*len(pos_data)+[0]*len(neg_data)
     return(texts, target)
  texts, target=load_movie_data()

Create a text normalization function. It takes a list of strings, lowercases them, removes punctuation and digits, trims extra whitespace, and removes stop words. The code is:
  def normalize_text(texts, stops):
     texts=[x.lower() for x in texts]
     texts=[''.join(c for c in x if c not in string.punctuation) for x in texts]
     texts=[''.join(c for c in x if c not in '0123456789') for x in texts]
     texts=[' '.join([word for word in x.split() if word not in stops]) for x in texts]
     texts=[' '.join(x.split()) for x in texts]
     return(texts)
  texts=normalize_text(texts, stops)

To make sure all the movie reviews carry information, check their lengths and require each review to contain at least three words:
  target=[target[ix] for ix,x in enumerate(texts) if len(x.split())>2]
  texts=[x for x in texts if len(x.split())>2]

Build the vocabulary with a function that creates a word dictionary of (word, count) pairs; words that are not frequent enough (those that would be marked unknown) are labeled RARE. The code is:
  def build_dictionary(sentences, vocabulary_size):
     # Turn sentences (list of strings) into lists of words
     split_sentences=[s.split() for s in sentences]
     words=[x for sublist in split_sentences for x in sublist]
     # Initialize list of [word, word_count] for each word, starting with unknown
     count=[['RARE', -1]]
     # Now add most frequent words, limited to the N-most frequent
     count.extend(collections.Counter(words).most_common(vocabulary_size-1))
     # Now create the dictionary
     word_dict={}
     for word,word_count in count:
       word_dict[word]=len(word_dict)
     return(word_dict)

Create a function that converts a list of sentences into lists of word indices, which can then be fed to the embedding lookup:
  def text_to_numbers(sentences, word_dict):
     # Initialize the returned data
     data=[]
     for sentence in sentences:
       sentence_data=[]
       # For each word, either use selected index or rare word index
       for word in sentence.split():
         if word in word_dict:
           word_ix=word_dict[word]
         else:
           word_ix=0
         sentence_data.append(word_ix)
       data.append(sentence_data)
     return data

Create the word dictionary and convert the list of sentences into lists of word indices:
  word_dictionary=build_dictionary(texts, vocabulary_size)
  word_dictionary_rev=dict(zip(word_dictionary.values(), word_dictionary.keys()))
  text_data=text_to_numbers(texts, word_dictionary)

Look up, in the word dictionary just built, the indices of the validation words chosen earlier:
  valid_examples=[word_dictionary[x] for x in valid_words]
Create a function that returns batch data for the skip-gram model. The model is trained on word pairs in which the first word is the training input (the target word at the center of the window) and the second is a word chosen from the window.
  def generate_batch_data(sentences, batch_size, window_size, method='skip-gram'):
     # Fill up data batch
     batch_data=[]
     label_data=[]
     while len(batch_data)<batch_size:
       # Select random sentence to start
       rand_sentence=np.random.choice(sentences)
       # Generate consecutive windows to look at
       window_sequences=[rand_sentence[max((ix-window_size), 0):(ix+window_size+1)] for ix,x in enumerate(rand_sentence)]
       # Denote which element of each window is the center word of interest
       label_indices=[ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
       # Pull out center word of interest for each window and create a tuple for each window
       if method=='skip-gram':
         batch_and_labels=[(x[y], x[:y]+x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
         # Make it into a big list of tuples (target word, surrounding word)
         tuple_data=[(x, y_) for x,y in batch_and_labels for y_ in y]
       else:
         raise ValueError('Method {} not implemented yet.'.format(method))
       # Extract batch and labels
       batch, labels=[list(x) for x in zip(*tuple_data)]
       batch_data.extend(batch[:batch_size])
       label_data.extend(labels[:batch_size])
     # Trim batch and label at the end
     batch_data=batch_data[:batch_size]
     label_data=label_data[:batch_size]
     # Convert to numpy array
     batch_data=np.array(batch_data)
     label_data=np.transpose(np.array([label_data]))
     return batch_data, label_data

Initialize the embedding matrix and declare the placeholders and the embedding lookup:
  embeddings=tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  # Create data/target placeholders
  x_inputs=tf.placeholder(tf.int32, shape=[batch_size])
  y_target=tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset=tf.constant(valid_examples, dtype=tf.int32)
  # Lookup the word embedding
  embed=tf.nn.embedding_lookup(embeddings, x_inputs)

Softmax loss is the most common loss for multiclass problems, here measuring the error of predicting the wrong word class. But with 10,000 distinct classes the targets are extremely sparse, which can prevent the model from fitting or converging. To address this, noise-contrastive estimation (NCE) loss is used instead. NCE converts the problem into binary predictions that distinguish the true word class from random noise, with the num_sampled parameter controlling how many noise samples are drawn per batch. The code is:
  nce_weights=tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0/np.sqrt(embedding_size)))
  nce_biases=tf.Variable(tf.zeros([vocabulary_size]))
  loss=tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=embed, labels=y_target, num_sampled=num_sampled, num_classes=vocabulary_size))

Create operations that find the words near the validation words by computing the cosine similarity between the validation word set and all word embeddings, so the closest words to each validation word can be printed:
  norm=tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings=embeddings/norm
  valid_embeddings=tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
  similarity=tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

Declare the optimizer and initialize the model variables:
  optimizer=tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

Train the word embeddings, printing the loss and the words closest to the validation words as training proceeds:
  loss_vec=[]
  loss_x_vec=[]
  for i in range(generations):
     batch_inputs, batch_labels=generate_batch_data(text_data, batch_size, window_size)
     feed_dict={x_inputs: batch_inputs, y_target: batch_labels}
     # Run the train step
     sess.run(optimizer, feed_dict=feed_dict)
     # Return the loss
     if (i+1)%print_loss_every==0:
       loss_val=sess.run(loss, feed_dict=feed_dict)
       loss_vec.append(loss_val)
       loss_x_vec.append(i+1)
       print("Loss at step {} : {}".format(i+1, loss_val))
     # Validation: Print some random words and top 5 related words
     if (i+1)%print_valid_every==0:
       sim=sess.run(similarity, feed_dict=feed_dict)
       for j in range(len(valid_words)):
         valid_word=word_dictionary_rev[valid_examples[j]]
          top_k=5 # number of nearest neighbors
         nearest=(-sim[j, :]).argsort()[1:top_k+1]
         log_str="Nearest to {}:".format(valid_word)
         for k in range(top_k):
           close_word=word_dictionary_rev[nearest[k]]
           log_str="%s %s, " % (log_str, close_word)
          print(log_str)

In the code above, the argsort method sorts so that indices are retrieved in order of decreasing similarity, not the reverse.
Partial output:
  Loss at step 49500 : 0.9395825862884521
  Loss at step 50000 : 0.30323168635368347
  Nearest to hate: style, throws, players, fearlessness, astringent,
  Nearest to love: plight, fiction, complete, lady, bartledy,

4) Implementing the CBOW embedding model:

In the CBOW model, the embeddings of the words in the context window are combined to predict the target word. What differs from the previous example is how the word embeddings are combined and how training data is generated from the sentences.
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import os
  import random
  import pickle
  import numpy as np
  import collections
  import string
  import requests
  import io
  import tarfile
  import urllib.request
  import text_helpers
  from nltk.corpus import stopwords
  sess=tf.Session()

The helper functions from the previous example are moved into a separate text_helpers.py file, containing the data-loading, text-normalization, dictionary-building, and batch-generation functions; a sketch of its interface follows.
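For reference, the module's interface might look like the following skeleton; the signatures are inferred from the functions defined in the previous example, and the bodies are the ones already shown there:
  # text_helpers.py - skeleton of the shared helper module (signatures inferred)
  def load_movie_data(save_folder_name='temp'):
     '''Download (or load cached) rt-polarity reviews; return (texts, target).'''
  def normalize_text(texts, stops):
     '''Lowercase, strip punctuation/digits/stop words; return cleaned texts.'''
  def build_dictionary(sentences, vocabulary_size):
     '''Map the most frequent words to indices, with RARE at index 0.'''
  def text_to_numbers(sentences, word_dict):
     '''Convert each sentence into a list of word indices (0 for rare words).'''
  def generate_batch_data(sentences, batch_size, window_size, method='skip-gram'):
     '''Return (batch, labels) arrays for skip-gram, cbow, or doc2vec.'''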
Make sure a folder exists for temporary data and checkpoints:
  data_folder_name='temp'
  if not os.path.exists(data_folder_name):
     os.makedirs(data_folder_name)

Declare the model parameters, which are similar to those of the skip-gram model:
  batch_size=500
  embedding_size=200
  vocabulary_size=2000
  generations=50000
  model_learning_rate=0.001
  num_sampled=int(batch_size/2)
  window_size=3
  save_embeddings_every=5000
  print_valid_every=5000
  print_loss_every=100
  stops=stopwords.words('english')
  valid_words=['love', 'hate', 'happy', 'sad', 'man', 'woman']

Load and normalize the movie review data, keeping only reviews of at least three words:
  texts, target=text_helpers.load_movie_data(data_folder_name)
  texts=text_helpers.normalize_text(texts, stops)
  target=[target[ix] for ix,x in enumerate(texts) if len(x.split())>2]
  texts=[x for x in texts if len(x.split())>2]

Create the word dictionary for looking up words. A reversed dictionary is also needed so that indices can be mapped back to words; it is used when printing the words closest to the validation set. The code is:
  word_dictionary=text_helpers.build_dictionary(texts, vocabulary_size)
  word_dictionary_rev=dict(zip(word_dictionary.values(), word_dictionary.keys()))
  text_data=text_helpers.text_to_numbers(texts, word_dictionary)
  valid_examples=[word_dictionary[x] for x in valid_words]

Initialize the word embeddings to be fitted and declare the model's data placeholders:
  embeddings=tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  # Create data/target placeholders
  x_inputs=tf.placeholder(tf.int32, shape=[batch_size, 2*window_size])
  y_target=tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset=tf.constant(valid_examples, dtype=tf.int32)

Handle the word embeddings. Since the CBOW model adds up the embeddings of the context window, create a loop that sums all the window's embeddings:
  # Lookup the word embeddings and add together window embeddings
  embed=tf.zeros([batch_size, embedding_size])
  for element in range(2*window_size):
     embed+=tf.nn.embedding_lookup(embeddings, x_inputs[:, element])

Use TensorFlow's built-in NCE loss function:
  nce_weights=tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0/np.sqrt(embedding_size)))
  nce_biases=tf.Variable(tf.zeros([vocabulary_size]))
  loss=tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=embed, labels=y_target, num_sampled=num_sampled, num_classes=vocabulary_size))

Use cosine similarity to report the words closest to the validation words, as a check on embedding quality:
  norm=tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings=embeddings/norm
  valid_embeddings=tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
  similarity=tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

To save the word vectors, load TensorFlow's train.Saver() method. By default it would save the whole graph session, but here it is told to save only the embedding variable, under a chosen name:
  saver=tf.train.Saver({"embeddings": embeddings})
Declare the optimizer and initialize the model variables:
  optimizer=tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

Run the training loop, printing the loss and saving the embeddings and dictionary to the target folder along the way:
  loss_vec=[]
  loss_x_vec=[]
  for i in range(generations):
     batch_inputs, batch_labels=text_helpers.generate_batch_data(text_data, batch_size, window_size, method='cbow')
     feed_dict={x_inputs: batch_inputs, y_target: batch_labels}
     # Run the train step
     sess.run(optimizer, feed_dict=feed_dict)
     # Return the loss
     if (i+1)%print_loss_every==0:
       loss_val=sess.run(loss, feed_dict=feed_dict)
       loss_vec.append(loss_val)
       loss_x_vec.append(i+1)
       print("Loss at step {} : {}".format(i+1, loss_val))
     # Validation: Print some random words and top 5 related words
     if (i+1)%print_valid_every==0:
       sim=sess.run(similarity, feed_dict=feed_dict)
       for j in range(len(valid_words)):
         valid_word=word_dictionary_rev[valid_examples[j]]
         top_k=5   # number of nearest neighbors
         nearest=(-sim[j, :]).argsort()[1:top_k+1]
         log_str="Nearest to {}:".format(valid_word)
         for k in range(top_k):
           close_word=word_dictionary_rev[nearest[k]]
            log_str='{} {}, '.format(log_str, close_word)
          print(log_str)
     # Save dictionary + embeddings
     if (i+1)%save_embeddings_every==0:
       # Save vocabulary dictionary
       with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
         pickle.dump(word_dictionary, f)
       # Save embeddings
       model_checkpoint_path=os.path.join(os.getcwd(), data_folder_name, 'cbow_movie_embeddings.ckpt')
       save_path=saver.save(sess, model_checkpoint_path)
       print('Model saved in file: {}'.format(save_path))

Partial output:
  Loss at step 49900 : 1.6794960498809814
  Loss at step 50000 : 1.5071022510528564
  Nearest to hate: bringing, gifted, almost, next, wish,
  Nearest to love: clarity, cult, cliched, literary, memory,

The generate_batch_data() function in text_helpers.py needs to be modified to add a cbow method:
  elif method=='cbow':
     batch_and_labels=[(x[:y]+x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
     # Only keep windows with consistent 2*window_size
     batch_and_labels=[(x,y) for x,y in batch_and_labels if len(x)==2*window_size]
     batch, labels=[list(x) for x in zip(*batch_and_labels)]

The CBOW method trains on the summed embeddings of the context window to predict the target word, which makes it smoother than skip-gram and better suited to small text datasets.

5) Making predictions with Word2Vec:

Using the saved CBOW embeddings, perform sentiment analysis on the movie review dataset. A linear logistic regression model is trained on the reviews to extract information beyond the CBOW embeddings themselves.
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import numpy as np
  import os
  import random
  import pickle
  import string
  import collections
  import requests
  import io
  import tarfile
  import urllib.request
  import text_helpers
  from nltk.corpus import stopwords
  sess=tf.Session()

Declare the model parameters:
  batch_size=100
  embedding_size=200
  vocabulary_size=2000
  max_words=100
  stops=stopwords.words('english')

Load and transform the text data with the text_helpers.py script:
  texts, target=text_helpers.load_movie_data(data_folder_name)
  texts=text_helpers.normalize_text(texts, stops)
  target=[target[ix] for ix,x in enumerate(texts) if len(x.split())>2]
  texts=[x for x in texts if len(x.split())>2]
  train_indices=np.random.choice(len(target), round(0.8*len(target)), replace=False)
  test_indices=np.array(list(set(range(len(target)))-set(train_indices)))
  texts_train=[x for ix,x in enumerate(texts) if ix in train_indices]
  texts_test=[x for ix,x in enumerate(texts) if ix in test_indices]
  target_train=np.array([x for ix,x in enumerate(target) if ix in train_indices])
  target_test=np.array([x for ix,x in enumerate(target) if ix in test_indices])

Load the word dictionary saved during CBOW embedding training:
  dict_file=os.path.join(data_folder_name, 'movie_vocab.pkl')
  word_dictionary=pickle.load(open(dict_file, 'rb'))

Convert the loaded sentences into numeric numpy arrays via the word dictionary:
  text_data_train=np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))
  text_data_test=np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))

Movie reviews vary in length, so standardize them to a uniform 100 words, padding with 0 when a review is shorter:
  text_data_train=np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])
  text_data_test=np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])

Declare the logistic regression model variables and placeholders:
  A=tf.Variable(tf.random_normal(shape=[embedding_size, 1]))
  b=tf.Variable(tf.random_normal(shape=[1, 1]))
  x_data=tf.placeholder(shape=[None, max_words], dtype=tf.int32)
  y_target=tf.placeholder(shape=[None, 1], dtype=tf.float32)

For TensorFlow to reuse the trained word vectors, a variable must exist for the restore step to fill. Here an embedding variable is created with the same shape as the embeddings to be loaded. The code is:
  embeddings=tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
Add an embedding_lookup operation to the graph and average the embeddings of all the words in each sentence:
  embed=tf.nn.embedding_lookup(embeddings, x_data)
  embed_avg=tf.reduce_mean(embed, 1)

Declare the model operations and the loss function; the sigmoid is built into the loss:
  model_output=tf.add(tf.matmul(embed_avg, A), b)
  loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

Add prediction and accuracy functions to the graph so the model's accuracy can be evaluated during training:
  prediction=tf.round(tf.sigmoid(model_output))
  predictions_correct=tf.cast(tf.equal(prediction, y_target), tf.float32)
  accuracy=tf.reduce_mean(predictions_correct)

Declare the optimizer function and initialize the model variables:
  my_opt=tf.train.AdagradOptimizer(0.005)
  train_step=my_opt.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

With the embeddings randomly initialized, call the Saver method to restore the saved CBOW embeddings into the embedding variable:
  model_checkpoint_path=os.path.join(data_folder_name, 'cbow_movie_embeddings.ckpt')
  saver=tf.train.Saver({"embeddings": embeddings})
  saver.restore(sess, model_checkpoint_path)

Start training, saving the train and test loss and accuracy every 100 iterations and printing the model status every 500:
  train_loss=[]
  test_loss=[]
  train_acc=[]
  test_acc=[]
  i_data=[]
  for i in range(10000):
     rand_index=np.random.choice(text_data_train.shape[0], size=batch_size)
     rand_x=text_data_train[rand_index]
     rand_y=np.transpose([target_train[rand_index]])
     sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
     if (i+1)%100==0:
       i_data.append(i+1)
       train_loss_temp=sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
       train_loss.append(train_loss_temp)
       test_loss_temp=sess.run(loss, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})
       test_loss.append(test_loss_temp)
       train_acc_temp=sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
       train_acc.append(train_acc_temp)
       test_acc_temp=sess.run(accuracy, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})
       test_acc.append(test_acc_temp)
     if (i+1)%500==0:
       acc_and_loss=[i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
       acc_and_loss=[np.round(x, 2) for x in acc_and_loss]
       print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

Partial output:
  Generation # 9500. Train Loss (Test Loss): 0.69 (0.70). Train Acc (Test Acc): 0.57 (0.55)
  Generation # 10000. Train Loss (Test Loss): 0.70 (0.70). Train Acc (Test Acc): 0.59 (0.55)

Plot the train and test loss and accuracy recorded every 100 iterations:
  plt.plot(i_data, train_loss, 'k-', label='Train Loss')
  plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
  plt.title('Cross Entropy Loss per Generation')
  plt.xlabel('Generation')
  plt.ylabel('Cross Entropy Loss')
  plt.legend(loc='upper right')
  plt.show()
  plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
  plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
  plt.title('Train and Test Accuracy')
  plt.xlabel('Generation')
  plt.ylabel('Accuracy')
  plt.legend(loc='lower right')
  plt.show()

The model performs poorly, reaching about 60% sentiment-prediction accuracy, only slightly better than chance.

6) Sentiment analysis with Doc2Vec:

The Word2Vec methods used above capture positional relationships between words, but not the relationship between words and the document (or review) they appear in. One extension that accounts for the document's influence is the Doc2Vec method.
The basic idea of Doc2Vec is to introduce a document embedding that, together with the word embeddings, helps capture the tone of the document. For example, if a movie review is long enough and contains several negative words, the overall tone of the document can help predict the nearby words.
Doc2Vec simply adds a document embedding matrix and uses windows of words together with a document index to predict the next word; all word windows within a document share the same document index. How document and word embeddings are combined is the crux of the method. Here the word embeddings within the window are summed, and there are two main ways to bring in the document embedding: adding it to the word embedding sum, or concatenating it onto the end. Adding requires the document embedding size to equal the word embedding size, while concatenating increases the number of logistic regression variables. For small datasets, addition is the better choice. Both options are sketched below.
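A shape-level sketch of the two options (the sizes are taken from the parameters declared below; this is illustrative, not part of the recipe's code):
  import tensorflow as tf
  # Two ways to combine summed word embeddings with a document embedding
  batch_size, embedding_size, doc_embedding_size=100, 200, 100
  word_embed=tf.zeros([batch_size, embedding_size])      # summed window embeddings
  doc_embed=tf.zeros([batch_size, doc_embedding_size])   # looked-up document embeddings
  # Option 1: addition - requires doc_embedding_size==embedding_size
  # combined=word_embed+doc_embed
  # Option 2: concatenation - enlarges the downstream weight matrix
  combined=tf.concat(axis=1, values=[word_embed, doc_embed])
  print(combined.get_shape())   # (100, 300)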
Here, document and word embeddings are first trained on the movie reviews; the data is then split into train and test sets, a logistic regression model is trained, and the method is evaluated on whether it improves review sentiment prediction accuracy.
  import tensorflow as tf
  import matplotlib.pyplot as plt
  import numpy as np
  import os
  import random
  import pickle
  import string
  import collections
  import requests
  import io
  import tarfile
  import urllib.request
  import text_helpers
  from nltk.corpus import stopwords
  sess=tf.Session()

Load the movie review dataset:
  texts, target=text_helpers.load_movie_data(data_folder_name)
Declare the model parameters:
  batch_size=100
  vocabulary_size=7500
  generations=100000
  model_learning_rate=0.001
  embedding_size=200
  doc_embedding_size=100
  concatenated_size=embedding_size+doc_embedding_size
  num_sampled=int(batch_size/2)
  window_size=3
  # Add checkpoints to training
  save_embeddings_every=5000
  print_valid_every=5000
  print_loss_every=100
  # Declare stop words
  stops=stopwords.words('english')
  # Pick a few test words
  valid_words=['love', 'hate', 'happy', 'sad', 'man', 'woman']

Normalize the movie reviews and make sure every review is longer than the chosen window size:
  texts=text_helpers.normalize_text(texts, stops)
  target=[target[ix] for ix,x in enumerate(texts) if len(x.split())>window_size]
  texts=[x for x in texts if len(x.split())>window_size]
  assert(len(target)==len(texts))

Create the word dictionary. Note that no document dictionary is needed: document indices are simply the document's position, a unique value per document. The code is:
  word_dictionary=text_helpers.build_dictionary(texts, vocabulary_size)
  word_dictionary_rev=dict(zip(word_dictionary.values(), word_dictionary.keys()))
  text_data=text_helpers.text_to_numbers(texts, word_dictionary)
  valid_examples=[word_dictionary[x] for x in valid_words]

Define the word embeddings and document embeddings, then declare the NCE loss parameters:
  embeddings=tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  doc_embeddings=tf.Variable(tf.random_uniform([len(texts), doc_embedding_size], -1.0, 1.0))
  # NCE loss parameters
  nce_weights=tf.Variable(tf.truncated_normal([vocabulary_size, concatenated_size], stddev=1.0/np.sqrt(concatenated_size)))
  nce_biases=tf.Variable(tf.zeros([vocabulary_size]))

Declare the placeholders for the Doc2Vec indices and the target word indices. Each input is the window size plus 1, because every generated data window carries one extra document index. The code is:
  x_inputs=tf.placeholder(tf.int32, shape=[None, window_size+1])
  y_target=tf.placeholder(tf.int32, shape=[None, 1])
  valid_dataset=tf.constant(valid_examples, dtype=tf.int32)

Create the embedding function, which sums the word embeddings and then concatenates the document embedding:
  embed=tf.zeros([batch_size, embedding_size])
  for element in range(window_size):
     embed+=tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
  doc_indices=tf.slice(x_inputs, [0, window_size], [batch_size, 1])
  doc_embed=tf.nn.embedding_lookup(doc_embeddings, doc_indices)
  final_embed=tf.concat(axis=1, values=[embed, tf.squeeze(doc_embed)])

Declare the loss function and create the optimizer:
  loss=tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=final_embed, labels=y_target, num_sampled=num_sampled, num_classes=vocabulary_size))
  optimizer=tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate)
  train_step=optimizer.minimize(loss)

Declare the cosine distance for the validation word set:
  norm=tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings=embeddings/norm
  valid_embeddings=tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
  similarity=tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

To save the embeddings, create the model's saver function, then initialize the variables:
  saver=tf.train.Saver({"embeddings": embeddings, "doc_embeddings": doc_embeddings})
  init=tf.global_variables_initializer()
  sess.run(init)

Run the training loop:
  loss_vec=[]
  loss_x_vec=[]
  for i in range(generations):
     batch_inputs, batch_labels=text_helpers.generate_batch_data(text_data, batch_size, window_size, method='doc2vec')
     feed_dict={x_inputs: batch_inputs, y_target: batch_labels}
     sess.run(train_step, feed_dict=feed_dict)
     # Return the loss
     if (i+1)%print_loss_every==0:
       loss_val=sess.run(loss, feed_dict=feed_dict)
       loss_vec.append(loss_val)
       loss_x_vec.append(i+1)
       print('Loss at step {} : {}'.format(i+1, loss_val))
     # Validation : Print some random words and top5 related words
     if (i+1)%print_valid_every==0:
       sim=sess.run(similarity, feed_dict=feed_dict)
       for j in range(len(valid_words)):
         valid_word=word_dictionary_rev[valid_examples[j]]
         top_k=5   # number of nearest neighbors
         nearest=(-sim[j, :]).argsort()[1:top_k+1]
         log_str="Nearest to {}:".format(valid_word)
         for k in range(top_k):
           close_word=word_dictionary_rev[nearest[k]]
            log_str='{} {}, '.format(log_str, close_word)
          print(log_str)
     # Save dictionary + embeddings
     if (i+1)%save_embeddings_every==0:
       # Save vocabulary dictionary
       with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
         pickle.dump(word_dictionary, f)
       # Save embeddings
       model_checkpoint_path=os.path.join(os.getcwd(), data_folder_name, 'doc2vec_movie_embeddings.ckpt')
       save_path=saver.save(sess, model_checkpoint_path)
       print('Model saved in file: {}'.format(save_path))

Partial output:
  Loss at step 99900 : 17.733346939086914
  Loss at step 100000 : 17.384489059448242
  Nearest to hate: redunant, snapshot, from, performances, extravagant,
  Nearest to love: ride, with, by, its, start,

Use these embeddings to train a logistic regression model that predicts review sentiment. First set some parameters for the logistic regression model:
  max_words=20   # maximum review word length
  logistic_batch_size=500   # training batch size

Split the dataset into train and test sets:
  train_indices=np.sort(np.random.choice(len(target), round(0.8*len(target)), replace=False))
  test_indices=np.sort(np.array(list(set(range(len(target)))-set(train_indices))))
  texts_train=[x for ix,x in enumerate(texts) if ix in train_indices]
  texts_test=[x for ix,x in enumerate(texts) if ix in test_indices]
  target_train=np.array([x for ix,x in enumerate(target) if ix in train_indices])
  target_test=np.array([x for ix,x in enumerate(target) if ix in test_indices])

Convert the movie reviews into numeric word indices, and pad or crop each review to 20 words:
  text_data_train=np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))
  text_data_test=np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))
  # Pad/crop movie reviews to specific length
  text_data_train=np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])
  text_data_test=np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])

Declare the data placeholders and model variables for the logistic regression model:
  # Define logistic placeholders
  log_x_inputs=tf.placeholder(tf.int32, shape=[None, max_words+1])
  log_y_target=tf.placeholder(tf.int32, shape=[None, 1])
  A=tf.Variable(tf.random_normal(shape=[concatenated_size, 1]))
  b=tf.Variable(tf.random_normal(shape=[1, 1]))

Create another embedding function. The embedding function in the first half of this recipe trained on windows of three words plus a document index to predict the next word; this one does the same, except on the 20-word reviews. The code is:
  # Add together element embeddings in window
  log_embed=tf.zeros([logistic_batch_size, embedding_size])
  for element in range(max_words):
     log_embed+=tf.nn.embedding_lookup(embeddings, log_x_inputs[:, element])
  log_doc_indices=tf.slice(log_x_inputs, [0, max_words], [logistic_batch_size, 1])
  log_doc_embed=tf.nn.embedding_lookup(doc_embeddings, log_doc_indices)
  # Concatenate embeddings
  log_final_embed=tf.concat(axis=1, values=[log_embed, tf.squeeze(log_doc_embed)])

Declare the model operations and the loss function; the sigmoid is built into the loss:
  # Declare logistic model (sigmoid in loss function)
  model_output=tf.add(tf.matmul(log_final_embed, A), b)
  # Declare loss function (Cross Entropy loss)
  logistic_loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=tf.cast(log_y_target, tf.float32)))

Create the prediction and accuracy functions to evaluate the model during training, then declare the optimizer and initialize the variables:
  prediction=tf.round(tf.sigmoid(model_output))
  predictions_correct=tf.cast(tf.equal(prediction, tf.cast(log_y_target, tf.float32)), tf.float32)
  accuracy=tf.reduce_mean(predictions_correct)
  logistic_opt=tf.train.GradientDescentOptimizer(learning_rate=0.01)
  logistic_train_step=logistic_opt.minimize(logistic_loss, var_list=[A, b])
  init=tf.global_variables_initializer()
  sess.run(init)

Start training the logistic regression model:
  train_loss=[]
  test_loss=[]
  train_acc=[]
  test_acc=[]
  i_data=[]
  for i in range(10000):
     rand_index=np.random.choice(text_data_train.shape[0], size=logistic_batch_size)
     rand_x=text_data_train[rand_index]
     # Append review index at the end of text data
     rand_x_doc_indices=train_indices[rand_index]
     rand_x=np.hstack((rand_x, np.transpose([rand_x_doc_indices])))
     rand_y=np.transpose([target_train[rand_index]])
     feed_dict={log_x_inputs: rand_x, log_y_target: rand_y}
     sess.run(logistic_train_step, feed_dict=feed_dict)
     # Only record loss and accuracy every 100 generations
     if (i+1)%100==0:
       rand_index_test=np.random.choice(text_data_test.shape[0], size=logistic_batch_size)
       rand_x_test=text_data_test[rand_index_test]
       # Append review index at the end of text data
       rand_x_doc_indices_test=test_indices[rand_index_test]
       rand_x_test=np.hstack((rand_x_test, np.transpose([rand_x_doc_indices_test])))
       rand_y_test=np.transpose([target_test[rand_index_test]])
       test_feed_dict={log_x_inputs: rand_x_test, log_y_target: rand_y_test}
       i_data.append(i+1)
       train_loss_temp=sess.run(logistic_loss, feed_dict=feed_dict)
       train_loss.append(train_loss_temp)
       test_loss_temp=sess.run(logistic_loss, feed_dict=test_feed_dict)
       test_loss.append(test_loss_temp)
       train_acc_temp=sess.run(accuracy, feed_dict=feed_dict)
       train_acc.append(train_acc_temp)
       test_acc_temp=sess.run(accuracy, feed_dict=test_feed_dict)
       test_acc.append(test_acc_temp)
     if (i+1)%500==0:
       acc_and_loss=[i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
       acc_and_loss=[np.round(x, 2) for x in acc_and_loss]
       print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

Partial output:
  Generation # 500. Train Loss (Test Loss): 5.62 (7.45). Train Acc (Test Acc): 0.52 (0.48)
  Generation # 10000. Train Loss (Test Loss): 2.35 (2.51). Train Acc (Test Acc): 0.59 (0.58)

Add the doc2vec branch to the shared batch generator, text_helpers.generate_batch_data(). The final code is:
  def generate_batch_data(sentences, batch_size, window_size, method='skip-gram'):
     # Fill up data batch
     batch_data=[]
     label_data=[]
     while len(batch_data)<batch_size:
       # Select random sentence to start
       rand_sentence_ix=int(np.random.choice(len(sentences), size=1))
       rand_sentence=sentences[rand_sentence_ix]
       # Generate consecutive windows to look at
       window_sequences=[rand_sentence[max((ix-window_size), 0):(ix+window_size+1)] for ix,x in enumerate(rand_sentence)]
       # Denote which element of each window is the center word of interest
       label_indices=[ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
       # Pull out center word of interest for each window and create a tuple for each window
       if method=='skip-gram':
         batch_and_labels=[(x[y], x[:y]+x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
         # Make it into a big list of tuples (target word, surrounding word)
         tuple_data=[(x, y_) for x,y in batch_and_labels for y_ in y]
         batch, labels=[list(x) for x in zip(*tuple_data)]
       elif method=='cbow':
         batch_and_labels=[(x[:y]+x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
         # Only keep windows with consistent 2*window_size
         batch_and_labels=[(x,y) for x,y in batch_and_labels if len(x)==2*window_size]
         batch, labels=[list(x) for x in zip(*batch_and_labels)]
       elif method=='doc2vec':
         # For doc2vec keep LHS window only to predict target word
         batch_and_labels=[(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)]
         batch, labels=[list(x) for x in zip(*batch_and_labels)]
         # Add document index to batch
         batch=[x+[rand_sentence_ix] for x in batch]
       else:
         raise ValueError('Method {} not implemented yet.'.format(method))
       # Extract batch and labels
       batch_data.extend(batch[:batch_size])
       label_data.extend(labels[:batch_size])
     # Trim batch and label at the end
     batch_data=batch_data[:batch_size]
     label_data=label_data[:batch_size]
     # Convert to numpy array
     batch_data=np.array(batch_data)
     label_data=np.transpose(np.array([label_data]))
     return batch_data, label_data

The code above runs two training loops: the first trains the Doc2Vec embeddings, and the second trains a logistic regression model on the movie reviews to predict sentiment. Sentiment accuracy does not improve much, remaining around 60%, but the exercise does implement the concatenated Doc2Vec variant on the movie review dataset. Since logistic regression cannot capture the nonlinear behavior of natural language, prediction accuracy might be improved by training the Doc2Vec embeddings with different parameters or by using a more sophisticated model.

4. Implementing Recurrent Neural Networks in TensorFlow:

To handle sequential data, neural networks can be extended to store the output of the previous iteration; such networks are called recurrent neural networks (RNNs).
A fully connected network computes

  y = σ(A·x)

where A is the weight matrix, x the input, σ the activation function, and y the output.
With sequential input data x_1, x_2, x_3, ..., the fully connected layer is modified to take prior inputs into account through a state that is fed back at each step, with U and W as the input and recurrent weight matrices:

  s_t = σ(U·x_t + W·s_{t-1})

Each step's state thus carries the earlier inputs forward to the next input.
A probability distribution over outputs is then obtained through the softmax function, with V as the output weight matrix:

  y_t = softmax(V·s_t)
Given the outputs for the whole sequence {s_1, s_2, s_3, ...}, one can use only the last output as the target value or class, or treat the output sequence itself as the result, as in a Seq2Seq model.
For sequences of arbitrary length, training with backpropagation creates long-range gradients that can vanish or explode; an extension of the RNN cell, the Long Short-Term Memory (LSTM) unit, addresses this by introducing gate operations.
When an RNN model processes natural language, encoding converts the data into numeric RNN features, and decoding converts numeric RNN features back into output words or characters.
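A minimal NumPy sketch of the recurrence may make the state update concrete (the sizes are made up; U, W, and V are the matrices from the formulas above):
  import numpy as np
  # s_t = tanh(U*x_t + W*s_{t-1}); y_t = softmax(V*s_t)
  input_dim, state_dim, output_dim=4, 3, 2
  U=np.random.randn(state_dim, input_dim)
  W=np.random.randn(state_dim, state_dim)
  V=np.random.randn(output_dim, state_dim)
  def softmax(z):
     e=np.exp(z-z.max())
     return e/e.sum()
  s=np.zeros(state_dim)                      # initial state
  for x_t in np.random.randn(5, input_dim):  # a length-5 input sequence
     s=np.tanh(U @ x_t + W @ s)              # the state carries prior inputs forward
     y=softmax(V @ s)                        # probability distribution at this step
  print(y)                                   # output after the final step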

1) Predicting spam with an RNN model:

This example uses the SMS spam dataset from the UCI machine learning repository.
  import os
  import re
  import io
  import requests
  import numpy as np
  import matplotlib.pyplot as plt
  import tensorflow as tf
  from zipfile import ZipFile

Start a graph session and set the RNN model parameters:
  sess=tf.Session()
  epochs=20
  batch_size=250
  max_sequence_length=25
  rnn_size=10
  embedding_size=50
  min_word_frequency=10
  learning_rate=0.0005
  dropout_keep_prob=tf.placeholder(tf.float32)

Training runs for 20 epochs with a batch size of 250. The maximum message length is 25 words; longer texts are truncated and shorter ones padded with 0. The RNN model has 10 units and processes only words with frequency above 10, embedding each word in a vector of length 50. The dropout keep probability is a placeholder, set to 0.5 during training and 1.0 during evaluation.
Get the SMS text data:
  data_dir='temp'
  data_file='text_data.txt'
  if not os.path.exists(data_dir):
     os.makedirs(data_dir)
  if not os.path.isfile(os.path.join(data_dir, data_file)):
     zip_url='http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
     r=requests.get(zip_url)
     z=ZipFile(io.BytesIO(r.content))
     file=z.read('SMSSpamCollection')
     # Format Data
     text_data=file.decode()
     text_data=text_data.encode('ascii', errors='ignore')
     text_data=text_data.decode().split('\n')
     # Save data to text file
     with open(os.path.join(data_dir, data_file), 'w') as file_conn:
       for text in text_data:
         file_conn.write("{}\n".format(text))
  else:
     # Open data from text file
     text_data=[]
     with open(os.path.join(data_dir, data_file), 'r') as file_conn:
       for row in file_conn:
         text_data.append(row)
       text_data=text_data[:-1]
  text_data=[x.split('\t') for x in text_data if len(x)>=1]
  [text_data_target, text_data_train]=[list(x) for x in zip(*text_data)]

先检查数据集是否已经下载过,如果下载过则直接从文件中读取,否则下载数据集。
为降低词汇量,清洗文本数据集,移除特殊字符和空格,将所有文本转为小写:
  def clean_text(text_string):
     text_string=re.sub(r'([^\s\w]|_|[0-9])+', ' ', text_string)
     text_string=" ".join(text_string.split())
     text_string=text_string.lower()
     return text_string
  text_data_train=[clean_text(x) for x in text_data_train]

从文本数据中移除特殊字符时,可以使用空格替换该特殊字符,也可以根据数据集的格式选择合适的处理方法。
使用TensorFlow内建的词汇处理器处理文本,将文本转换为索引列表:
  vocab_processor=tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=min_word_frequency)
  text_processed=np.array(list(vocab_processor.fit_transform(text_data_train)))

注意: contrib.learn.preprocessing中的VocabularyProcessor在TensorFlow 1.10前后已被标记为弃用，后续版本可能移除，届时可改用其他文本预处理方式，如下面的示意。
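一个可替代的做法(仅为示意，基于tf.keras自带的Tokenizer与pad_sequences；注意其过滤低频词的方式与VocabularyProcessor的min_frequency并不完全相同):
  from tensorflow.keras.preprocessing.text import Tokenizer
  from tensorflow.keras.preprocessing.sequence import pad_sequences
  tokenizer=Tokenizer()   # num_words can cap the vocabulary size
  tokenizer.fit_on_texts(text_data_train)
  sequences=tokenizer.texts_to_sequences(text_data_train)
  text_processed=pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')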
随机shuffle文本数据集:
  text_processed=np.array(text_processed)
  text_data_target=np.array([1 if x=='ham' else 0 for x in text_data_target])
  shuffled_ix=np.random.permutation(np.arange(len(text_data_target)))
  x_shuffled=text_processed[shuffled_ix]
  y_shuffled=text_data_target[shuffled_ix]

分割数据集为80-20的训练-测试数据集:
  ix_cutoff=int(len(y_shuffled)*0.8)
  x_train, x_test=x_shuffled[:ix_cutoff], x_shuffled[ix_cutoff:]
  y_train, y_test=y_shuffled[:ix_cutoff], y_shuffled[ix_cutoff:]
  vocab_size=len(vocab_processor.vocabulary_)
  print("vocabulary Size: {:d}".format(vocab_size))
  print("80-20 Train Test split: {:d} -- {:d}".format(len(y_train), len(y_test)))

如果做超参数调优,需要将数据集分割为训练集-测试集-验证集。Scikit-learn的model_selection.train_test_split()函数可以随机分割训练集和测试集。
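用Scikit-learn做随机分割的示意代码(假设已安装scikit-learn，test_size=0.2对应80-20分割):
  from sklearn.model_selection import train_test_split
  x_train, x_test, y_train, y_test=train_test_split(text_processed, text_data_target, test_size=0.2, random_state=42)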
声明计算图的占位符。输入数据x_data是形状为[None, max_sequence_length]的占位符，其中第1维None是批量大小，第2维是文本允许的最大长度；输出结果y_output的占位符为整数0或1，代表正常邮件或者垃圾邮件:
  x_data=tf.placeholder(tf.int32, [None, max_sequence_length])
  y_output=tf.placeholder(tf.int32, [None])

创建输入数据x_data的嵌入矩阵和嵌入查找操作:
  embedding_mat=tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
  embedding_output=tf.nn.embedding_lookup(embedding_mat, x_data)

声明算法模型,先初始化RNN单元的类型,大小为10,然后通过动态RNN函数tf.nn.dynamic_rnn()创建rnn序列,接着增加dropout操作:
  cell=tf.nn.rnn_cell.BasicRNNCell(num_units=rnn_size)
  output, state=tf.nn.dynamic_rnn(cell, embedding_output, dtype=tf.float32)
  output=tf.nn.dropout(output, dropout_keep_prob)

动态RNN可以处理变长序列。即使这里使用的是固定长度的序列，也推荐使用tf.nn.dynamic_rnn()函数，实践证明它计算更快，并且允许在同一个RNN中运行不同长度的序列。
为了进行预测,转置并重新排列RNN的输出结果,剪切最后的输出结果:
  output=tf.transpose(output, [1, 0, 2])
  last=tf.gather(output, int(output.get_shape()[0])-1)
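
如果不做transpose，也可以直接对时间维切片取最后一步的输出，效果等价(仅为示意):
  # output shape: [batch_size, max_sequence_length, rnn_size]
  last=output[:, -1, :]   # equivalent to the transpose + gather above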

为了进行RNN预测，通过全连接层将rnn_size大小的输出转换为二分类的logits。注意这里不要对结果再做softmax，因为后面的sparse_softmax_cross_entropy_with_logits要求传入未归一化的logits:
  weight=tf.Variable(tf.truncated_normal([rnn_size, 2], stddev=0.1))
  bias=tf.Variable(tf.constant(0.1, shape=[2]))
  logits_out=tf.matmul(last, weight)+bias

声明损失函数,这里使用sparse_softmax函数,目标值是int型索引,logits是float型:
  losses=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits_out, labels=y_output)
  loss=tf.reduce_mean(losses)

创建准确度函数,比较训练集和测试集的结果:
  accuracy=tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits_out, 1), tf.cast(y_output, tf.int64)), tf.float32))
创建优化器函数,初始化模型变量:
  optimizer=tf.train.RMSPropOptimizer(learning_rate)
  train_step=optimizer.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

开始遍历迭代训练模型,遍历数据集多次。为了避免过拟合,每个epoch都需随机shuffle数据。代码为:
  train_loss=[]
  test_loss=[]
  train_accuracy=[]
  test_accuracy=[]
  for epoch in range(epochs):
     # Shuffle training data
     shuffled_ix=np.random.permutation(np.arange(len(x_train)))
     x_train=x_train[shuffled_ix]
     y_train=y_train[shuffled_ix]
     num_batches=int(len(x_train)/batch_size)+1
     for i in range(num_batches):
       # Select train data
       min_ix=i*batch_size
       max_ix=np.min([len(x_train), ((i+1)*batch_size)])
       x_train_batch=x_train[min_ix:max_ix]
       y_train_batch=y_train[min_ix:max_ix]
       # Run train step
       train_dict={x_data: x_train_batch, y_output: y_train_batch, dropout_keep_prob:0.5}
       sess.run(train_step, feed_dict=train_dict)
     # Run loss and accuracy for training
     temp_train_loss, temp_train_acc=sess.run([loss, accuracy], feed_dict=train_dict)
     train_loss.append(temp_train_loss)
     train_accuracy.append(temp_train_acc)
     # Run Eval Step
     test_dict={x_data: x_test, y_output: y_test, dropout_keep_prob:1.0}
     temp_test_loss, temp_test_acc=sess.run([loss, accuracy], feed_dict=test_dict)
     test_loss.append(temp_test_loss)
     test_accuracy.append(temp_test_acc)
     print('Epoch: {}, Test Loss: {:.2}, Test Acc: {:.2}'.format(epoch+1, temp_test_loss, temp_test_acc))

部分输出结果:
  Vocabulary Size: 933
  80-20 Train Test split: 4459 -- 1115
  Epoch: 19, Test Loss: 0.46, Test Acc: 0.86
  Epoch: 20, Test Loss: 0.46, Test Acc: 0.86

绘制训练集、测试集损失和准确度:
  epoch_seq=np.arange(1, epochs+1)
  plt.plot(epoch_seq, train_loss, 'k--', label='Train Set')
  plt.plot(epoch_seq, test_loss, 'r-', label='Test Set')
  plt.title('Softmax Loss')
  plt.xlabel('Epochs')
  plt.ylabel('Softmax Loss')
  plt.legend(loc='upper left')
  plt.show()
  plt.plot(epoch_seq, train_accuracy, 'k--', label='Train Set')
  plt.plot(epoch_seq, test_accuracy, 'r-', label='Test Set')
  plt.title('Test Accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend(loc='upper left')
  plt.show()

上面创建的RNN分类模型用于预测SMS文本是否为垃圾邮件，在测试集上的准确度约为86%。
对于序列数据，推荐对训练集进行多次遍历训练。全部数据过一遍称为一个epoch，并且推荐在每个epoch之前随机shuffle数据。

2)实现LSTM模型:

LSTM是传统循环神经网络的变种，用来解决RNN模型在长序列上的梯度消失或者梯度爆炸问题。LSTM单元引入内部忘记门(forget gate)，可以控制信息从一个单元到下一个单元的流转。
类似常规RNN模型，先由当前输入与上一步的隐藏状态计算候选记忆(以下按标准LSTM公式重构，W、U为权重矩阵，b为偏置，⊙表示逐元素相乘):
  g_t=tanh(W_g·x_t+U_g·h_(t-1)+b_g)
通过输入门评估候选的哪些信息可以通过:
  i_t=σ(W_i·x_t+U_i·h_(t-1)+b_i)
通过忘记门(forget gate)决定先前的记忆单元有多少可以保留:
  f_t=σ(W_f·x_t+U_f·h_(t-1)+b_f)
将前面的记忆信息与忘记门结合，然后与可选的记忆单元相加，得到新的记忆信息:
  c_t=f_t⊙c_(t-1)+i_t⊙g_t
结合前面所有项，通过输出门获得记忆单元的输出:
  o_t=σ(W_o·x_t+U_o·h_(t-1)+b_o)
之后，通过迭代更新隐藏状态h，计算公式:
  h_t=o_t⊙tanh(c_t)
LSTM是通过记忆单元忘记或修改输入信息。
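按上面的标准公式，用NumPy可以写出单步LSTM计算的示意(权重为随机初始化的假设值，σ取sigmoid，为简洁省略偏置项):
  import numpy as np
  def sigmoid(z):
     return 1./(1.+np.exp(-z))
  n_in, n_hid=3, 4
  rng=np.random.RandomState(0)
  Wg, Wi, Wf, Wo=[rng.randn(n_hid, n_in) for _ in range(4)]    # input weights per gate
  Ug, Ui, Uf, Uo=[rng.randn(n_hid, n_hid) for _ in range(4)]   # recurrent weights per gate
  x_t=rng.randn(n_in)                                          # current input
  h_prev, c_prev=np.zeros(n_hid), np.zeros(n_hid)              # previous hidden state and memory
  g=np.tanh(Wg.dot(x_t)+Ug.dot(h_prev))    # candidate memory
  i=sigmoid(Wi.dot(x_t)+Ui.dot(h_prev))    # input gate
  f=sigmoid(Wf.dot(x_t)+Uf.dot(h_prev))    # forget gate
  c=f*c_prev+i*g                           # new memory: c_t=f_t⊙c_(t-1)+i_t⊙g_t
  o=sigmoid(Wo.dot(x_t)+Uo.dot(h_prev))    # output gate
  h=o*np.tanh(c)                           # new hidden state: h_t=o_t⊙tanh(c_t)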
使用TensorFlow可以维护相关的状态信息,并根据模型的损失函数、优化器函数和学习率计算梯度来自动更新模型的变量值。这里将使用带有LSTM结构的序列RNN模型,在莎士比亚文本数据集上训练,预测下一个单词。为该模型传入短语,看看训练的模型是否可以预测出短语接下来的单词。
  import os
  import re
  import string
  import requests
  import numpy as np
  import collections
  import random
  import pickle
  import matplotlib.pyplot as plt
  import tensorflow as tf

开始计算图会话,并设置RNN参数:
  sess=tf.Session()
  min_word_frequency=5
  rnn_size=128
  epochs=10
  batch_size=100
  learning_rate=0.001
  training_seq_len=50
  embedding_size=rnn_size
  save_every=500
  eval_every=50
  prime_texts=['thou art more', 'to be or not to', 'wherefore art thou']

定义数据和模型的文件夹和文件名。将保留连字符和撇号(apostrophe)，因为莎士比亚频繁使用这些字符来组合单词和音节:
  data_dir='temp'
  data_file='shakespeare.txt'
  model_path='shakespeare_model'
  full_model_dir=os.path.join(data_dir, model_path)
  # Declare punctuation to remove, except hyphens and apostrophes
  punctuation=string.punctuation
  punctuation=''.join([x for x in punctuation if x not in ['-', "'"]])

如果数据集存在直接加载数据,不存在将下载文本数据集并保存:
  if not os.path.exists(full_model_dir):
     os.makedirs(full_model_dir)
  if not os.path.exists(data_dir):
     os.makedirs(data_dir)
  if not os.path.isfile(os.path.join(data_dir, data_file)):
     print('Downloading Shakespeare texts from www.gutenberg.org')
     shakespeare_url='http://www.gutenberg.org/cache/epub/100/pg100.txt'
     # Get Shakespeare text
     r=requests.get(shakespeare_url)
     shakespeare_file=r.content
     # Decode binary into string
     s_text=shakespeare_file.decode('utf-8')
     # Drop first few descriptive paragraphs
     s_text=s_text[7576:]
     # Remove newlines
     s_text=s_text.replace('\r\n', '')
     s_text=s_text.replace('\n', '')
     # Write to file
     with open(os.path.join(data_dir, data_file), 'w') as out_conn:
       out_conn.write(s_text)
  else:
     # If file has been saved, load from that file
     with open(os.path.join(data_dir, data_file), 'r') as file_conn:
       s_text=file_conn.read().replace('\n', '')

清洗莎士比亚文本,移除标点符号和多余空格:
  s_text=re.sub(r'[{}]'.format(punctuation), ' ', s_text)
  s_text=re.sub(r'\s+', ' ', s_text).strip().lower()

创建莎士比亚词汇表,创建build_vocab()返回2个单词字典,包括单词到索引的映射和索引到单词的映射,其中出现的单词要符合频次要求:
  def build_vocab(text, min_word_freq):
     word_counts=collections.Counter(text.split(' '))
     # Limit word counts to those more frequent than cutoff
     word_counts={key:val for key, val in word_counts.items() if val>min_word_freq}
     # Create vocab --> index mapping
     words=word_counts.keys()
     vocab_to_ix_dict={key:(ix+1) for ix,key in enumerate(words)}
     # Add unknown key --> 0 index
     vocab_to_ix_dict['unknown']=0
     # Create index --> vocab mapping
     ix_to_vocab_dict={val:key for key,val in vocab_to_ix_dict.items()}
     return ix_to_vocab_dict,vocab_to_ix_dict
  ix2vocab, vocab2ix=build_vocab(s_text, min_word_frequency)
  vocab_size=len(ix2vocab)+1

处理文本时,需要注意单词索引为0的值,将其保存并填充;对于未知单词也采取相同的方法处理。
有了单词词汇表,将莎士比亚文本转换成索引数组:
  s_text_words=s_text.split(' ')
  s_text_ix=[]
  for ix,x in enumerate(s_text_words):
     try:
       s_text_ix.append(vocab2ix[x])
     except KeyError:
       s_text_ix.append(0)
  s_text_ix=np.array(s_text_ix)

使用class对象创建算法模型:
  class LSTM_Model():
     def __init__(self, rnn_size, batch_size, learning_rate, training_seq_len, vocab_size, infer=False):
       self.rnn_size=rnn_size
       self.vocab_size=vocab_size
       self.infer=infer
       self.learning_rate=learning_rate
       if infer:
         self.batch_size=1
         self.training_seq_len=1
       else:
         self.batch_size=batch_size
         self.training_seq_len=training_seq_len
       self.lstm_cell=tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
       self.initial_state=self.lstm_cell.zero_state(self.batch_size, tf.float32)
       self.x_data=tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])
       self.y_output=tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])
       with tf.variable_scope('lstm_vars'):
         # Softmax Output Weight
         W=tf.get_variable('W', [self.rnn_size, self.vocab_size], tf.float32, tf.random_normal_initializer())
         b=tf.get_variable('b', [self.vocab_size], tf.float32, tf.constant_initializer(0.0))
         # Define Embedding
         embedding_mat=tf.get_variable('embedding_mat', [self.vocab_size, self.rnn_size], tf.float32, tf.random_normal_initializer())
         embedding_output=tf.nn.embedding_lookup(embedding_mat, self.x_data)
         rnn_inputs=tf.split(embedding_output, num_or_size_splits=self.training_seq_len, axis=1)
         rnn_inputs_trimmed=[tf.squeeze(x, [1]) for x in rnn_inputs]
       # If inferring (generating text), add a 'loop' function to get i+1 input from i output
       def inferred_loop(prev, count):
         prev_transformed=tf.matmul(prev, W)+b
         prev_symbol=tf.stop_gradient(tf.argmax(prev_transformed, 1))
         output=tf.nn.embedding_lookup(embedding_mat, prev_symbol)
         return output
       decoder=tf.contrib.legacy_seq2seq.rnn_decoder
       outputs, last_state=decoder(rnn_inputs_trimmed, self.initial_state, self.lstm_cell, loop_function=inferred_loop if infer else None)
       # Non inferred outputs
       output=tf.reshape(tf.concat(outputs, 1), [-1, self.rnn_size])
       # Logits and output
       self.logit_output=tf.matmul(output, W)+b
       self.model_output=tf.nn.softmax(self.logit_output)
       loss_fun=tf.contrib.legacy_seq2seq.sequence_loss_by_example
       loss=loss_fun([self.logit_output], [tf.reshape(self.y_output, [-1])], [tf.ones([self.batch_size*self.training_seq_len])], self.vocab_size)
       self.cost=tf.reduce_sum(loss)/(self.batch_size*self.training_seq_len)
       self.final_state=last_state
       gradients, _=tf.clip_by_global_norm(tf.gradients(self.cost, tf.trainable_variables()), 4.5)
       optimizer=tf.train.AdamOptimizer(self.learning_rate)
       self.train_op=optimizer.apply_gradients(zip(gradients, tf.trainable_variables()))
     def sample(self, sess, words=ix2vocab, vocab=vocab2ix, num=10, prime_text='thou art'):
       state=sess.run(self.lstm_cell.zero_state(1, tf.float32))
       word_list=prime_text.split()
       for word in word_list[:-1]:
         x=np.zeros((1, 1))
         x[0, 0]=vocab[word]
         feed_dict={self.x_data: x, self.initial_state: state}
         [state]=sess.run([self.final_state], feed_dict=feed_dict)
       out_sentence=prime_text
       word=word_list[-1]
       for n in range(num):
         x=np.zeros((1, 1))
         x[0, 0]=vocab[word]
         feed_dict={self.x_data: x, self.initial_state: state}
         [model_output, state]=sess.run([self.model_output, self.final_state], feed_dict=feed_dict)
         sample=np.argmax(model_output[0])
         if sample==0:
           break
         word=words[sample]
         out_sentence=out_sentence+' '+word
       return out_sentence

将该class代码单独保存在一个Python文件中,可以在脚本起始位置导入。
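例如，假设将该类保存为lstm_model.py(文件名仅为示意)，则可以这样导入:
  from lstm_model import LSTM_Model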
声明LSTM模型和测试模型,使用tf.variable_scope管理模型变量,使得测试LSTM模型可以复用训练LSTM模型相同的参数:
  with tf.variable_scope('lstm_model', reuse=tf.AUTO_REUSE) as scope:
     # Define LSTM Model
     lstm_model=LSTM_Model(rnn_size, batch_size, learning_rate, training_seq_len, vocab_size)
     scope.reuse_variables()
     test_lstm_model=LSTM_Model(rnn_size, batch_size, learning_rate, training_seq_len, vocab_size, infer=True)

创建Saver操作,并分割输入文本为相同批量大小的块,然后初始化模型变量:
  saver=tf.train.Saver()
  # Create batches for each epoch
  num_batches=int(len(s_text_ix)/(batch_size*training_seq_len))+1
  # Split up text indices into subarrays of equal size
  batches=np.array_split(s_text_ix, num_batches)
  # Reshape each split into [batch_size, training_seq_len]
  batches=[np.resize(x, [batch_size, training_seq_len]) for x in batches]
  # Initialize all variables
  init=tf.global_variables_initializer()
  sess.run(init)

通过epoch迭代训练,并在每个epoch之前将数据shuffle:
  train_loss=[]
  iteration_count=1
  for epoch in range(epochs):
     # Shuffle word indices
     random.shuffle(batches)
     # Create targets from shuffled batches
     targets=[np.roll(x, -1, axis=1) for x in batches]
     # Run through one epoch
     print('Starting Epoch #{} of {}.'.format(epoch+1, epochs))
     # Reset initial LSTM state every epoch
     state=sess.run(lstm_model.initial_state)
     for ix,batch in enumerate(batches):
       training_dict={lstm_model.x_data: batch, lstm_model.y_output: targets[ix]}
       c, h=lstm_model.initial_state
       training_dict[c]=state.c
       training_dict[h]=state.h
       temp_loss, state, _=sess.run([lstm_model.cost, lstm_model.final_state, lstm_model.train_op], feed_dict=training_dict)
       train_loss.append(temp_loss)
       # Print status every 10 gens
       if iteration_count%10==0:
         summary_nums=(iteration_count, epoch+1, ix+1, num_batches+1, temp_loss)
         print('Iteration: {}, Epoch: {}, Batch: {} out of {}, Loss: {:.2f}'.format(*summary_nums))
       # Save model and vocab
       if iteration_count%save_every==0:
         model_file_name=os.path.join(full_model_dir, 'model')
         saver.save(sess, model_file_name, global_step=iteration_count)
         print('Model Saved To: {}'.format(model_file_name))
         dictionary_file=os.path.join(full_model_dir, 'vocab.pkl')
         with open(dictionary_file, 'wb') as dict_file_conn:
           pickle.dump([vocab2ix, ix2vocab], dict_file_conn)
       if iteration_count%eval_every==0:
         for sample in prime_texts:
           print(test_lstm_model.sample(sess, ix2vocab, vocab2ix, num=10, prime_text=sample))
       iteration_count+=1

部分输出结果:
  Vocabulary Length=8009
  Iteration: 1790, Epoch: 10, Batch: 161 out of 182, Loss: 5.68
  Iteration: 1800, Epoch: 10, Batch: 171 out of 182, Loss: 6.05
  thou art more than i am a
  to be or not to the man i have
  wherefore art thou art of the long
  Iteration: 1810, Epoch: 10, Batch: 181 out of 182, Loss: 5.99

绘制训练损失随epoch的趋势图:
  plt.plot(train_loss, 'k-')
  plt.title('Sequence to Sequence Loss')
  plt.xlabel('Iteration')
  plt.ylabel('Loss')
  plt.show()

上述代码基于莎士比亚词汇构建了带有LSTM结构的RNN模型来预测下一个单词。通过增大序列长度、降低学习率或增加训练的epoch数，可能会提高模型的预测效果。这里也实现了随机抽样单词的方法，即基于输出结果的logits或概率分布进行加权抽样。

3)堆叠多层LSTM:

可以通过堆叠多个LSTM层以增加RNN的深度，必要时也可以将目标输出作为输入传给另外一个网络。TensorFlow使用MultiRNNCell()函数轻松实现多层组合，该函数的输入参数为RNN单元列表。注意列表中的每一层都应是独立创建的单元实例(直接写MultiRNNCell([rnn_cell]*num_layers)会让各层复用同一个单元对象)，如下面的示意。
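多层单元构造方式的示意代码(假设rnn_size与num_layers已定义):
  # Each layer gets its own cell instance so the layers do not share the same variables
  cells=[tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(num_layers)]
  multi_cell=tf.contrib.rnn.MultiRNNCell(cells)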
这里还是使用莎士比亚文本数据集,采用堆叠3层LSTM替代前面的1层网络,以字符级别的预测替代单词级别的预测,字符级别的预测将极大减小词汇表,表中仅有40个字符,包括26个字母、10个数字、1个空格和3个特殊字符。下面代码主要展示与前面示例不同的部分。
设置RNN模型的层数等参数:
  num_layers=3
  min_word_freq=5
  rnn_size=128
  epochs=10

以字符来加载、处理和传入文本,因此在清洗文本数据后通过list()函数分割整个文本:
  s_text=re.sub(r'[{}]'.format(punctuation), ' ', s_text)
  s_text=re.sub(r'\s+', ' ', s_text).strip().lower()
  char_list=list(s_text)

改变LSTM模型为多层,接收num_layers变量:
  class LSTM_MODEL():
     def __init__(self, rnn_size, num_layers, batch_size, learning_rate, training_seq_len, vocab_size, infer=False):
       self.rnn_size=rnn_size
       self.num_layers=num_layers
       self.vocab_size=vocab_size
       self.infer=infer
       self.learning_rate=learning_rate
       ......
       self.lstm_cell=tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(self.num_layers)])
       self.initial_state=self.lstm_cell.zero_state(self.batch_size, tf.float32)
       self.x_data=tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])
       self.y_output=tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])

TensorFlow的MultiRNNCell()函数的输入参数为RNN单元列表。这里各层使用相同类型的RNN单元，也可以采用任意RNN层的堆叠组合。
训练模型部分输出:
  Vocabulary Length=40
  Iteration: 9440, Epoch: 10, Batch: 899 out of 950, Loss: 1.46
  Iteration: 9450, Epoch: 10, Batch: 909 out of 950, Loss: 1.49
  thou art more than the
  to be or not to the serva
  wherefore art thou dost thou
  Iteration: 9460, Epoch: 10, Batch: 919 out of 950, Loss: 1.41

最后的输出文本抽样:
  thou art more fancy with to be or not to be for be wherefore art thou art thou
训练模型已经可以生成古英语的单词。
绘制迭代训练损失函数:
  plt.plot(train_loss, 'k-')
  plt.title('Sequence to Sequence Loss')
  plt.xlabel('Iteration')
  plt.ylabel('Loss')
  plt.show()

4)实现Seq2Seq翻译模型:

由于每个RNN单元都有输出，可以训练RNN序列来预测变长的序列，这里以此创建英语到德语的翻译模型。
TensorFlow自带相关模型类来进行Seq2Seq翻译模型训练,使用了官方提供的机器翻译模型,网址https://github.com/tensorflow/nmt。而使用的数据来自http://www.manythings.org/的zip文件,该文件汇编了Tatoeba库的数据,这些语句数据是tab键分割的英语-德语语句对,并且该语句数据包含了数千种不同长度的句子。
  import os
  import re
  import sys
  import json
  import math
  import time
  import string
  import requests
  import io
  import numpy as np
  import collections
  import random
  import pickle
  import matplotlib.pyplot as plt
  import tensorflow as tf
  from zipfile import ZipFile
  from collections import Counter
  from tensorflow.python.ops import lookup_ops
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  local_repository='temp/seq2seq'

将整个NMT模型导入temp文件夹:
  if not os.path.exists(local_repository):
     from git import Repo
     tf_model_repository='https://github.com/tensorflow/nmt/'
     Repo.clone_from(tf_model_repository, local_repository)
     sys.path.insert(0, 'temp/seq2seq/nmt/')
  # May also try to use 'attention model'
  from temp.seq2seq.nmt import model as model
  from temp.seq2seq.nmt.utils import vocab_utils as vocab_utils
  import temp.seq2seq.nmt.model_helper as model_helper
  import temp.seq2seq.nmt.utils.iterator_utils as iterator_utils
  import temp.seq2seq.nmt.utils.misc_utils as utils
  import temp.seq2seq.nmt.train as train

设置词汇量大小的参数，声明要删除的标点符号，并设置数据的存储位置:
  vocab_size=10000
  punct=string.punctuation
  data_dir='temp'
  data_file='eng_ger.txt'
  model_path='seq2seq_model'
  full_model_dir=os.path.join(data_dir, model_path)

这里将使用TensorFlow提供的超参数格式，这种超参数以json或xml格式文件存储，允许在不同类型的架构中重复使用。这里使用wmt16.json并进行一些修改:
  # Load hyper-parameters for translation model
  hparams=tf.contrib.training.HParams()
  param_file='temp/seq2seq/nmt/standard_hparams/wmt16.json'
  # Can also try for different architectures
  # 'temp/seq2seq/nmt/standard_hparams/iwsit15.json'
  # 'temp/seq2seq/nmt/standard_hparams/wmt16_gnmt_4_layer.json'
  # 'temp/seq2seq/nmt/standard_hparams/wmt16_gnmt_8_layer.json'
  with open(param_file, 'r') as f:
     params_json=json.loads(f.read())
  for key,value in params_json.items():
     hparams.add_hparam(key, value)
  hparams.add_hparam('num_gpus', 0)
  hparams.add_hparam('num_encoder_layers', hparams.num_layers)
  hparams.add_hparam('num_decoder_layers', hparams.num_layers)
  hparams.add_hparam('num_encoder_residual_layers', 0)
  hparams.add_hparam('num_decoder_residual_layers', 0)
  hparams.add_hparam('init_op', 'uniform')
  hparams.add_hparam('random_seed', None)
  hparams.add_hparam('num_embeddings_partitions', 0)
  hparams.add_hparam('warmup_steps', 0)
  hparams.add_hparam('length_penalty_weight', 0)
  hparams.add_hparam('sampling_temperature', 0.0)
  hparams.add_hparam('num_translations_per_input', 1)
  hparams.add_hparam('warmup_scheme', 't2t')
  hparams.add_hparam('epoch_step', 0)
  hparams.num_train_steps=5000
  # Not use any pretrained embeddings
  hparams.add_hparam('src_embed_file', '')
  hparams.add_hparam('tgt_embed_file', '')
  hparams.add_hparam('num_keep_ckpts', 5)
  hparams.add_hparam('avg_ckpts', False)
  # Remove attention
  hparams.attention=None

如果模型和数据目录不存在,则创建:
  if not os.path.exists(full_model_dir):
     os.makedirs(full_model_dir)
  if not os.path.exists(data_dir):
     os.makedirs(data_dir)

下载数据集,并将数据拆分为英语和德语句子的单词列表:
  print('Loading English-German Data')
  # Check for data
  if not os.path.isfile(os.path.join(data_dir, data_file)):
     print('Data not found, downloading Eng-Ger sentences')
     sentence_url='http://www.manythings.org/anki/deu-eng.zip'
     r=requests.get(sentence_url)
     z=ZipFile(io.BytesIO(r.content))
     file=z.read('deu.txt')
     # Format data
     eng_ger_data=file.decode('utf-8')
     eng_ger_data=eng_ger_data.encode('ascii', errors='ignore')
     eng_ger_data=eng_ger_data.decode().split('\n')
     # Write to file
     with open(os.path.join(data_dir, data_file), 'w') as out_conn:
       for sentence in eng_ger_data:
         out_conn.write(sentence+'\n')
  else:
     eng_ger_data=[]
     with open(os.path.join(data_dir, data_file), 'r') as in_conn:
       for row in in_conn:
         eng_ger_data.append(row[:-1])
  print('Done')

删除句子中的标点符号:
  eng_ger_data=[''.join(char for char in sent if char not in punct) for sent in eng_ger_data]
  # Split each sentence by tabs
  eng_ger_data=[x.split('\t') for x in eng_ger_data if len(x)>=1]
  [english_sentence, german_sentence]=[list(x) for x in zip(*eng_ger_data)]
  english_sentence=[x.lower().split() for x in english_sentence]
  german_sentence=[x.lower().split() for x in german_sentence]

为了使用较快的TensorFlow数据管道功能，需要以适当的格式将数据写入磁盘。翻译模型期望的文件命名格式为"前缀.后缀"，例如:
  train_prefix+'.'+source_suffix -> train.en
  train_prefix+'.'+target_suffix -> train.de

后缀确定语言,而前缀确定数据集类型(测试集或训练集):
  train_prefix='train'
  src_suffix='en'   # English
  tgt_suffix='de'   # Deutsch (German)
  source_txt_file=train_prefix+'.'+src_suffix
  hparams.add_hparam('src_file', source_txt_file)
  target_txt_file=train_prefix+'.'+tgt_suffix
  hparams.add_hparam('tgt_file', target_txt_file)
  with open(source_txt_file, 'w') as f:
     for sent in english_sentence:
       f.write(' '.join(sent)+'\n')
  with open(target_txt_file, 'w') as f:
     for sent in german_sentence:
       f.write(' '.join(sent)+'\n')

解析一些测试句子,这里任意选择大约100句,然后也写入文件中:
  # Partition some sentences off for testing files
  test_prefix='test_sent'
  hparams.add_hparam('dev_prefix', test_prefix)
  hparams.add_hparam('train_prefix', train_prefix)
  hparams.add_hparam('test_prefix', test_prefix)
  hparams.add_hparam('src', src_suffix)
  hparams.add_hparam('tgt', tgt_suffix)
  num_samples=100
  total_samples=len(english_sentence)
  # Get around 'num_sample's every so often in the src/tgt sentences
  ix_sample=[x for x in range(total_samples) if x%(total_samples//num_samples)==0]
  test_src=[' '.join(english_sentence[x]) for x in ix_sample]
  test_tgt=[' '.join(german_sentence[x]) for x in ix_sample]
  # Write test sentences to file
  with open(test_prefix+'.'+src_suffix, 'w') as f:
     for eng_test in test_src:
       f.write(eng_test+'\n')
  with open(test_prefix+'.'+tgt_suffix, 'w') as f:
     for ger_test in test_tgt:
       f.write(ger_test+'\n')

处理英语和德语句子的词汇表,然后将词汇列表保存到文件中:
  # Process the English Vocabulary
  all_english_words=[word for sentence in english_sentence for word in sentence]
  all_english_counts=Counter(all_english_words)
  eng_word_keys=[x[0] for x in all_english_counts.most_common(vocab_size-3)]   # because UNK, S, /S
  eng_vocab2ix=dict(zip(eng_word_keys, range(1, vocab_size)))
  eng_ix2vocab={val: key for key, val in eng_vocab2ix.items()}
  english_processed=[]
  for sent in english_sentence:
     temp_sentence=[]
     for word in sent:
       try:
         temp_sentence.append(eng_vocab2ix[word])
       except KeyError:
         temp_sentence.append(0)
     english_processed.append(temp_sentence)
  # Process the German Vocabulary
  all_german_words=[word for sentence in german_sentence for word in sentence]
  all_german_counts=Counter(all_german_words)
  ger_word_keys=[x[0] for x in all_german_counts.most_common(vocab_size-3)]   # because UNK, S, /S
  ger_vocab2ix=dict(zip(ger_word_keys, range(1, vocab_size)))
  ger_ix2vocab={val: key for key, val in ger_vocab2ix.items()}
  german_processed=[]
  for sent in german_sentence:
     temp_sentence=[]
     for word in sent:
       try:
         temp_sentence.append(ger_vocab2ix[word])
       except KeyError:
         temp_sentence.append(0)
     german_processed.append(temp_sentence)
  # Save vocab files for data processing
  source_vocab_file='vocab'+'.'+src_suffix
  hparams.add_hparam('src_vocab_file', source_vocab_file)
  eng_word_keys=['<unk>', '<s>', '</s>']+eng_word_keys
  target_vocab_file='vocab'+'.'+tgt_suffix
  hparams.add_hparam('tgt_vocab_file', target_vocab_file)
  ger_word_keys=['<unk>', '<s>', '</s>']+ger_word_keys
  # Write out all unique english words
  with open(source_vocab_file, 'w') as f:
     for eng_word in eng_word_keys:
       f.write(eng_word+'\n')
  # Write out all unique german words
  with open(target_vocab_file, 'w') as f:
     for ger_word in ger_word_keys:
       f.write(ger_word+'\n')
  # Add vocab size to hyper parameters
  hparams.add_hparam('src_vocab_size', vocab_size)
  hparams.add_hparam('tgt_vocab_size', vocab_size)
  # Add out-directory
  out_dir='temp/seq2seq/nmt_out'
  hparams.add_hparam('out_dir', out_dir)
  if not tf.gfile.Exists(out_dir):
     tf.gfile.MakeDirs(out_dir)

下面分别创建训练、评价和推理计算图。先创建训练计算图,用类实现,并将参数集成到namedtuple()中。代码来自TensorFlow的NMT库,参考其中的model_helper.py文件。
  class TrainGraph(collections.namedtuple("TrainGraph", ("graph", "model", "iterator", "skip_count_placeholder"))):
     pass
  def create_train_graph(scope=None):
     graph=tf.Graph()
     with graph.as_default():
       src_vocab_table, tgt_vocab_table=vocab_utils.create_vocab_tables(hparams.src_vocab_file, hparams.tgt_vocab_file, share_vocab=False)
       src_dataset=tf.data.TextLineDataset(hparams.src_file)
       tgt_dataset=tf.data.TextLineDataset(hparams.tgt_file)
       skip_count_placeholder=tf.placeholder(shape=(), dtype=tf.int64)
       iterator=iterator_utils.get_iterator(src_dataset, tgt_dataset, src_vocab_table, tgt_vocab_table, batch_size=hparams.batch_size, sos=hparams.sos, eos=hparams.eos, random_seed=None, num_buckets=hparams.num_buckets, src_max_len=hparams.src_max_len, tgt_max_len=hparams.tgt_max_len, skip_count=skip_count_placeholder)
        final_model=model.Model(hparams, iterator=iterator, mode=tf.contrib.learn.ModeKeys.TRAIN, source_vocab_table=src_vocab_table, target_vocab_table=tgt_vocab_table, scope=scope)
     return TrainGraph(graph=graph, model=final_model, iterator=iterator, skip_count_placeholder=skip_count_placeholder)
  train_graph=create_train_graph()

创建评价图Evaluation graph:
  class EvalGraph(collections.namedtuple("EvalGraph", ("graph", "model", "src_file_placeholder", "tgt_file_placeholder", "iterator"))):
     pass
  def create_eval_graph(scope=None):
     graph=tf.Graph()
     with graph.as_default():
       src_vocab_table, tgt_vocab_table=vocab_utils.create_vocab_tables(hparams.src_vocab_file, hparams.tgt_vocab_file, hparams.share_vocab)
       src_file_placeholder=tf.placeholder(shape=(), dtype=tf.string)
       tgt_file_placeholder=tf.placeholder(shape=(), dtype=tf.string)
       src_dataset=tf.data.TextLineDataset(src_file_placeholder)
       tgt_dataset=tf.data.TextLineDataset(tgt_file_placeholder)
       skip_count_placeholder=tf.placeholder(shape=(), dtype=tf.int64)
       iterator=iterator_utils.get_iterator(src_dataset, tgt_dataset, src_vocab_table, tgt_vocab_table, hparams.batch_size, sos=hparams.sos, eos=hparams.eos, random_seed=hparams.random_seed, num_buckets=hparams.num_buckets, src_max_len=hparams.src_max_len_infer, tgt_max_len=hparams.tgt_max_len_infer)
       final_model=model.Model(hparams, iterator=iterator, mode=tf.contrib.learn.ModeKeys.EVAL, source_vocab_table=src_vocab_table, target_vocab_table=tgt_vocab_table, scope=scope)
     return EvalGraph(graph=graph, model=final_model, src_file_placeholder=src_file_placeholder, tgt_file_placeholder=tgt_file_placeholder, iterator=iterator)
  eval_graph=create_eval_graph()

创建推理图Inference graph:
  class InferGraph(collections.namedtuple("InferGraph", ("graph", "model", "src_placeholder", "batch_size_placeholder", "iterator"))):
     pass
  def create_infer_graph(scope=None):
     graph=tf.Graph()
     with graph.as_default():
       src_vocab_table, tgt_vocab_table=vocab_utils.create_vocab_tables(hparams.src_vocab_file, hparams.tgt_vocab_file, hparams.share_vocab)
        reverse_tgt_vocab_table=lookup_ops.index_to_string_table_from_file(hparams.tgt_vocab_file, default_value=vocab_utils.UNK)
       src_placeholder=tf.placeholder(shape=[None], dtype=tf.string)
       batch_size_placeholder=tf.placeholder(shape=[], dtype=tf.int64)
       src_dataset=tf.data.Dataset.from_tensor_slices(src_placeholder)
       iterator=iterator_utils.get_infer_iterator(src_dataset, src_vocab_table, batch_size=batch_size_placeholder, eos=hparams.eos, src_max_len=hparams.src_max_len_infer)
       final_model=model.Model(hparams, iterator=iterator, mode=tf.contrib.learn.ModeKeys.INFER, source_vocab_table=src_vocab_table, target_vocab_table=tgt_vocab_table, reverse_target_vocab_table=reverse_tgt_vocab_table, scope=scope)
     return InferGraph(graph=graph, model=final_model, src_placeholder=src_placeholder, batch_size_placeholder=batch_size_placeholder, iterator=iterator)
  infer_graph=create_infer_graph()

为了在训练期间提供更有解释性的输出,提供一些在训练期间能作为输出的源语言/目标语言对:
  sample_ix=[25, 125, 240, 450]
  sample_src_data=[' '.join(english_sentence[x]) for x in sample_ix]
  sample_tgt_data=[' '.join(german_sentence[x]) for x in sample_ix]
  print([x for x in zip(sample_src_data, sample_tgt_data)])

加载训练图:
  config_proto=utils.get_config_proto()
  train_sess=tf.Session(config=config_proto, graph=train_graph.graph)
  eval_sess=tf.Session(config=config_proto, graph=eval_graph.graph)
  infer_sess=tf.Session(config=config_proto, graph=infer_graph.graph)
  with train_graph.graph.as_default():
     loaded_train_model, global_step=model_helper.create_or_load_model(train_graph.model, hparams.out_dir, train_sess, "train")
  summary_writer=tf.summary.FileWriter(os.path.join(hparams.out_dir, 'Training'), train_graph.graph)

将评价操作添加到计算图中:
  for metric in hparams.metrics:
     hparams.add_hparam("best_"+metric, 0)
     best_metric_dir=os.path.join(hparams.out_dir, "best_"+metric)
     hparams.add_hparam("best_"+metric+"_dir", best_metric_dir)
     tf.gfile.MakeDirs(best_metric_dir)
  eval_output=train.run_full_eval(hparams.out_dir, infer_graph, infer_sess, eval_graph, eval_sess, hparams, summary_writer, sample_src_data, sample_tgt_data)
  eval_results, _, acc_blue_scores=eval_output

初始化计算图,初始化每次迭代都会更新的参数,包括时间、全局步长和epoch数:
  last_stats_step=global_step
  last_eval_step=global_step
  last_external_eval_step=global_step
  steps_per_eval=10*hparams.steps_per_stats
  steps_per_external_eval=5*steps_per_eval
  avg_step_time=0.0
  step_time, checkpoint_loss, checkpoint_predict_count=0.0, 0.0, 0.0
  checkpoint_total_count=0.0
  speed, train_ppl=0.0, 0.0
  utils.print_out("# Start step %d, lr %g, %s" %(global_step, loaded_train_model.learning_rate.eval(session=train_sess), time.ctime()))
  skip_count=hparams.batch_size*hparams.epoch_step
  utils.print_out("# Init train iterator, skipping %d elements" %skip_count)
  train_sess.run(train_graph.iterator.initializer, feed_dict={train_graph.skip_count_placeholder: skip_count})

默认每1000次迭代训练将保存一次模型,如果需要也可以改变这个超参数,训练此模型并保存最新的5个模型占用大约2GB硬盘空间。
进行模型训练和评价。训练最重要的部分是在循环的最开始,其余代码用于评价、样本推理和保存模型。代码为:
  while global_step<hparams.num_train_steps:
     start_time=time.time()
     try:
       step_result=loaded_train_model.train(train_sess)
       (_, step_loss, step_predict_count, step_summary, global_step, step_word_count, batch_size, _, _)=step_result
       hparams.epoch_step+=1
     except tf.errors.OutOfRangeError:
       # Next Epoch
       hparams.epoch_step=0
       utils.print_out("# Finished an epoch, step %d. Perform external evaluation" %global_step)
       train.run_sample_decode(infer_graph, infer_sess, hparams.out_dir, hparams, summary_writer, sample_src_data, sample_tgt_data)
       dev_scores, test_scores, _=train.run_external_eval(infer_graph, infer_sess, hparams.out_dir, hparams, summary_writer)
       train_sess.run(train_graph.iterator.initializer, feed_dict={train_graph.skip_count_placeholder: 0})
       continue
     summary_writer.add_summary(step_summary, global_step)
     # Statistics
     step_time+=(time.time()-start_time)
     checkpoint_loss+=(step_loss*batch_size)
     checkpoint_predict_count+=step_predict_count
     checkpoint_total_count+=float(step_word_count)
     # print statistics
     if global_step-last_stats_step>=hparams.steps_per_stats:
       last_stats_step=global_step
       avg_step_time=step_time/hparams.steps_per_stats
       train_ppl=utils.safe_exp(checkpoint_loss/checkpoint_predict_count)
       speed=checkpoint_total_count/(1000*step_time)
       utils.print_out(" global step %d lr %g " "step-time %.2fs wps %.2fK ppl %.2f %s" %(global_step, loaded_train_model.learning_rate.eval(session=train_sess), avg_step_time, speed, train_ppl, train._get_best_results(hparams)))
       if math.isnan(train_ppl):
         break
       # Reset timer and loss
       step_time, checkpoint_loss, checkpoint_predict_count=0.0, 0.0, 0.0
       checkpoint_total_count=0.0
     if global_step-last_eval_step>=hparams.steps_per_eval:
       last_eval_step=global_step
       utils.print_out("# Save eval, global_step %d"%global_step)
       utils.add_summary(summary_writer, global_step, "train_ppl", train_ppl)
       # Save checkpoint
       loaded_train_model.saver.save(train_sess, os.path.join(hparams.out_dir, "translate.ckpt"), global_step=global_step)
       # Evaluate on dev/test
       train.run_sample_decode(infer_graph, infer_sess, out_dir, hparams, summary_writer, sample_src_data, sample_tgt_data)
       dev_ppl, test_ppl=train.run_internal_eval(eval_graph, eval_sess, out_dir, hparams, summary_writer)
     if global_step-last_external_eval_step>=steps_per_external_eval:
       last_external_eval_step=global_step
       # Save checkpoint
       loaded_train_model.saver.save(train_sess, os.path.join(hparams.out_dir, "translate.ckpt"), global_step=global_step)
       train.run_sample_decode(infer_graph, infer_sess, out_dir, hparams, summary_writer, sample_src_data, sample_tgt_data)
       dev_scores, test_scores, _=train.run_external_eval(infer_graph, infer_sess, out_dir, hparams, summary_writer)

上面代码使用TensorFlow内建的Seq2Seq模型将英语翻译为德语。目前测试句子并没有得到很好的翻译效果，仍有提升空间: 如果延长模型的训练时间，并调整分桶(bucket)设置以容纳更多的训练数据，就可能提高翻译水平。
在manythings.org网站上，也有其他语言的双语语料库，可以下载使用。

5)训练孪生RNN度量相似度:

RNN模型的一大优点是可以处理变长的序列数据。利用该特性，既可以生成全新的序列数据，也可以创建方法来度量输入序列与其他序列之间的相似度，这里用来度量地址之间的相似度。
将构建一个孪生(Siamese)双向RNN模型: 向其输入地址，将RNN输出传入一个全连接层，由全连接层输出固定长度为100的数值型向量；然后比较两个输出向量的余弦相似度，其值在-1和1之间，利用该网络进行记录匹配。
  import os
  import random
  import string
  import numpy as np
  import matplotlib.pyplot as plt
  import tensorflow as tf
  sess=tf.Session()

设置模型参数:
  batch_size=200
  n_batches=300
  max_address_len=20
  margin=0.25
  num_features=50
  dropout_keep_prob=0.8

创建一个孪生RNN相似度模型:
  def snn(address1, address2, dropout_keep_prob, vocab_size, num_features, input_length):
     # Define the Siamese double RNN with a fully connected layer at the end
     def Siamese_nn(input_vector, num_hidden):
       cell_unit=tf.nn.rnn_cell.BasicLSTMCell
       # Forward direction cell
       lstm_forward_cell=cell_unit(num_hidden, forget_bias=1.0)
       lstm_forward_cell=tf.nn.rnn_cell.DropoutWrapper(lstm_forward_cell, output_keep_prob=dropout_keep_prob)
       # Backward direction cell
       lstm_backward_cell=cell_unit(num_hidden, forget_bias=1.0)
       lstm_backward_cell=tf.nn.rnn_cell.DropoutWrapper(lstm_backward_cell, output_keep_prob=dropout_keep_prob)
        # Split the address into a character sequence
        input_embed_split=tf.split(input_vector, num_or_size_splits=input_length, axis=1)
        input_embed_split=[tf.squeeze(x, [1]) for x in input_embed_split]
        # Create bidirectional layer
        outputs, _, _=tf.nn.static_bidirectional_rnn(lstm_forward_cell, lstm_backward_cell, input_embed_split, dtype=tf.float32)
       # Average the output over the sequence
       temporal_mean=tf.add_n(outputs)/input_length
       # Fully connected layer
       output_size=10
       A=tf.get_variable(name="A", shape=[2*num_hidden, output_size], dtype=tf.float32, initializer=tf.random_normal_initializer(stddev=0.1))
       b=tf.get_variable(name="b", shape=[output_size], dtype=tf.float32, initializer=tf.random_normal_initializer(stddev=0.1))
       final_output=tf.matmul(temporal_mean, A)+b
       final_output=tf.nn.dropout(final_output, dropout_keep_prob)
       return(final_output)
     with tf.variable_scope("Siamese") as scope:
       output1=Siamese_nn(address1, num_features)
       # Use the same variables on the second string
       scope.reuse_variables()
       output2=Siamese_nn(address2, num_features)
     # Unit normalize the output
     output1=tf.nn.l2_normalize(output1, 1)
     output2=tf.nn.l2_normalize(output2, 1)
     # Return cosine distance
     dot_prod=tf.reduce_sum(tf.multiply(output1, output2), 1)
     return dot_prod

使用tf.variable_scope可在Siamese网络的两个部分共享变量参数。余弦距离是归一化向量的点积。
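余弦相似度即归一化向量的点积，可以用NumPy简单验证(仅为示意):
  import numpy as np
  v1=np.array([1., 2., 3.])
  v2=np.array([2., 4., 6.5])
  v1_n=v1/np.linalg.norm(v1)
  v2_n=v2/np.linalg.norm(v2)
  print(np.dot(v1_n, v2_n))   # close to 1.0: the two vectors point in nearly the same direction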
声明预测函数,该函数是余弦距离的符号值:
  def get_predictions(scores):
     predictions=tf.sign(scores, name="predictions")
     return predictions

声明损失函数,包括正损失和负损失,并为error预留一个margin(类似于SVM模型):
  def loss(scores, y_target, margin):
     # Calculate positive losses
     pos_loss_term=0.25*tf.square(tf.subtract(1., scores))
     pos_mult=tf.cast(y_target, tf.float32)
     # Make sure positive losses are on similar strings
     positive_loss=tf.multiply(pos_mult, pos_loss_term)
     # Calculate negative losses, then make sure they are on dissimilar strings
     neg_mult=tf.subtract(1., tf.cast(y_target, tf.float32))
     negative_loss=neg_mult*tf.square(scores)
     # Combine similar and dissimilar losses
     loss=tf.add(positive_loss, negative_loss)
     # Create margin term, for when the targets are 0 and scores are less than m
     # Check if target is zero (dissimilar strings)
     target_zero=tf.equal(tf.cast(y_target, tf.float32), 0.)
     # Check if cosine output is smaller than margin
     less_than_margin=tf.less(scores, margin)
     # Check if both are true
     both_logical=tf.logical_and(target_zero, less_than_margin)
     both_logical=tf.cast(both_logical, tf.float32)
     # If both are true, then multiply by (1-1)=0
     multiplicative_factor=tf.cast(1.-both_logical, tf.float32)
     total_loss=tf.multiply(loss, multiplicative_factor)
     # Average loss over batch
     avg_loss=tf.reduce_mean(total_loss)
     return avg_loss

声明准确度函数:
  def accuracy(scores, y_target):
     predictions=get_predictions(scores)
     correct_predictions=tf.equal(predictions, y_target)
     accuracy=tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
     return accuracy

使用基准地址创建带有"拼写错误"的相似地址，并标注基准地址和错误地址:
  def create_typo(s):
     rand_ind=random.choice(range(len(s)))
     s_list=list(s)
     s_list[rand_ind]=random.choice(string.ascii_lowercase+'0123456789')
     s=''.join(s_list)
     return s

将街道号、街道名和街道后缀随机组合生成数据。街道名和街道后缀列表为:
  street_names=['abbey', 'baker', 'canal', 'donner', 'elm', 'fifth', 'grandvia', 'hollywood', 'interstate', 'jay', 'kings']
  street_types=['rd', 'st', 'ln', 'pass', 'ave', 'hwy', 'cir', 'dr', 'jct']

生成的测试查询地址和基准地址:
  test_queries=['111 abbey ln', '271 doner cicle', '314 king avenue', 'tensorflow is fun']
  test_references=['123 abbey ln', '217 donner cir', '314 kings ave', '404 hollywood st', 'tensorflow is so fun']

定义如何生成批量数据,这里的批量数据是一半相似的地址和一半不相似的地址。不相似的地址是通过读取地址列表的后半部分,并使用numpy.roll()函数将其向右循环移动1位获取的。代码为:
  def get_batch(n):
     # Generate a list of reference addresses with similar addresses
     numbers=[random.randint(1, 9999) for i in range(n)]
     streets=[random.choice(street_names) for i in range(n)]
     street_suffs=[random.choice(street_types) for i in range(n)]
     full_streets=[str(w)+' '+x+' '+y for w,x,y in zip(numbers, streets, street_suffs)]
     typo_streets=[create_typo(x) for x in full_streets]
     reference=[list(x) for x in zip(full_streets, typo_streets)]
     # Shuffle last half of them for training on dissimilar addresses
     half_ix=int(n/2)
     bottom_half=reference[half_ix:]
     true_address=[x[0] for x in bottom_half]
     typo_address=[x[1] for x in bottom_half]
     typo_address=list(np.roll(typo_address, 1))
     bottom_half=[[x,y] for x,y in zip(true_address, typo_address)]
     reference[half_ix:]=bottom_half
     # Get target similarities (1's for similar, -1's for non-similar)
     target=[1]*(n-half_ix)+[-1]*half_ix
     reference=[[x,y] for x,y in zip(reference, target)]
     return reference

定义词汇表,以及如何将地址one-hot编码为索引:
  vocab_chars=string.ascii_lowercase+'0123456789 '
  vocab2ix_dict={char:(ix+1) for ix,char in enumerate(vocab_chars)}
  vocab_length=len(vocab_chars)+1
  # Define vocab one-hot encoding
  def address2onehot(address, vocab2ix_dict=vocab2ix_dict, max_address_len=max_address_len):
     # translate address string into indices
     address_ix=[vocab2ix_dict[x] for x in list(address)]
     # Pad or crop to max_address_len
     address_ix=(address_ix+[0]*max_address_len)[0:max_address_len]
     return address_ix
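
按上面的字符映射('a'→1...'z'→26，'0'→27...'9'→36，空格→37，0作填充)，一个编码示例:
  print(address2onehot('111 abbey ln'))
  # [28, 28, 28, 37, 1, 2, 2, 5, 25, 37, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0]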

处理好词汇表,声明模型占位符和嵌入查找函数,使用one-hot编码嵌入,使用单位矩阵作为查找矩阵。代码为:
  address1_ph=tf.placeholder(tf.int32, [None, max_address_len], name="address1_ph")
  address2_ph=tf.placeholder(tf.int32, [None, max_address_len], name="address2_ph")
  y_target_ph=tf.placeholder(tf.int32, [None], name="y_target_ph")
  dropout_keep_prob_ph=tf.placeholder(tf.float32, name="dropout_keep_prob")
  # Create embedding lookup
  identity_mat=tf.diag(tf.ones(shape=[vocab_length]))
  address1_embed=tf.nn.embedding_lookup(identity_mat, address1_ph)
  address2_embed=tf.nn.embedding_lookup(identity_mat, address2_ph)

声明算法模型、准确度、损失函数和预测操作:
  text_snn=snn(address1_embed, address2_embed, dropout_keep_prob_ph, vocab_length, num_features, max_address_len)
  batch_accuracy=accuracy(text_snn, y_target_ph)
  batch_loss=loss(text_snn, y_target_ph, margin)
  predictions=get_predictions(text_snn)

增加优化器函数,初始化变量:
  optimizer=tf.train.AdamOptimizer(0.01)
  train_op=optimizer.minimize(batch_loss)
  init=tf.global_variables_initializer()
  sess.run(init)

遍历迭代训练,记录损失函数和准确度:
  train_loss_vec=[]
  train_acc_vec=[]
  for b in range(n_batches):
     # Get a batch of data
     batch_data=get_batch(batch_size)
     # Shuffle data
     np.random.shuffle(batch_data)
     # Parse addresses and targets
     input_addresses=[x[0] for x in batch_data]
     target_similarity=np.array([x[1] for x in batch_data])
     address1=np.array([address2onehot(x[0]) for x in input_addresses])
     address2=np.array([address2onehot(x[1]) for x in input_addresses])
     train_feed_dict={address1_ph: address1, address2_ph: address2, y_target_ph: target_similarity, dropout_keep_prob_ph: dropout_keep_prob}
     _, train_loss, train_acc=sess.run([train_op, batch_loss, batch_accuracy], feed_dict=train_feed_dict)
     # Save train loss and accuracy
     train_loss_vec.append(train_loss)
     train_acc_vec.append(train_acc)

处理测试查询和基准地址来查看模型效果:
  test_queries_ix=np.array([address2onehot(x) for x in test_queries])
  test_references_ix=np.array([address2onehot(x) for x in test_references])
  num_refs=test_references_ix.shape[0]
  best_fit_refs=[]
  for query in test_queries_ix:
     test_query=np.repeat(np.array([query]), num_refs, axis=0)
     test_feed_dict={address1_ph: test_query, address2_ph: test_references_ix, dropout_keep_prob_ph: 1.0}
     test_out=sess.run(text_snn, feed_dict=test_feed_dict)
     best_fit=test_references[np.argmax(test_out)]
     best_fit_refs.append(best_fit)
  print('Query Addresses: {}'.format(test_queries))
  print('Model Found Matches: {}'.format(best_fit_refs))

输出结果:
  Query Addresses: ['111 abbey ln', '271 doner cicle', '314 king avenue', 'tensorflow is fun']
  Model Found Matches: ['123 abbey ln', '217 donner cicle', '314 king ave', 'tensorflow is so fun']

从测试查询和基准地址的匹配结果可以看出，训练的模型不仅能找到正确的基准地址，对非地址的词语也能匹配到相近的句子。
这里没有单独划分测试集，因为数据是即时生成的: 批量函数每次都创建新的批量数据，模型输入的始终是新数据，因此可以用批量损失和准确度来近似测试损失和准确度。但对于实际的有限数据集，这种做法并不一定成立，仍需要划分训练集和测试集来评估模型效果。

5. TensorFlow产品化与可视化:

产品级代码的定义有很多，通常包括: 带有单元测试的代码、训练与评估代码分离、有效地保存和加载数据管道所需的各个部分，以及规范地创建计算图会话。

1)单元测试:

测试代码使得原型设计更快、调试更有效、调整更快、代码分享更容易。
当编写一个TensorFlow模型时,单元测试有助于测试代码功能,想调整代码单元时,测试将确保这些改变不会破坏模型。这里将在MNIST数据集上创建一个简单的CNN网络,并实现3种不同的单元测试,以介绍如何在TensorFlow中编写单元测试。
Python有一个名为Nose的测试库，TensorFlow也有内建的测试函数，可以用来测试张量对象的值。
  import sys
  import numpy as np
  import tensorflow as tf
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  sess=tf.Session()

加载并格式化数据集:
  data_dir='temp'
  mnist=tf.keras.datasets.mnist
  (train_xdata, train_labels), (test_xdata, test_labels)=mnist.load_data()
  train_xdata=train_xdata/255.0
  test_xdata=test_xdata/255.0

设置模型参数:
  batch_size=100
  learning_rate=0.005
  evaluation_size=100
  image_width=train_xdata[0].shape[0]
  image_height=train_xdata[0].shape[1]
  target_size=max(train_labels)+1
  num_channels=1   # greyscale = 1 channel
  generations=100
  eval_every=5
  conv1_features=25
  conv2_features=50
  max_pool_size1=2   # NxN window for 1st max pool layer
  max_pool_size2=2   # NxN window for 2nd max pool layer
  fully_connected_size1=100
  dropout_prob=0.75

声明模型的占位符:
  x_input_shape=(batch_size, image_width, image_height, num_channels)
  x_input=tf.placeholder(tf.float32, shape=x_input_shape)
  y_target=tf.placeholder(tf.int32, shape=(batch_size))
  eval_input_shape=(evaluation_size, image_width, image_height, num_channels)
  eval_input=tf.placeholder(tf.float32, shape=eval_input_shape)
  eval_target=tf.placeholder(tf.int32, shape=(evaluation_size))
  dropout=tf.placeholder(tf.float32, shape=())

声明模型变量:
  conv1_weight=tf.Variable(tf.truncated_normal([4, 4, num_channels, conv1_features], stddev=0.1, dtype=tf.float32))
  conv1_bias=tf.Variable(tf.zeros([conv1_features], dtype=tf.float32))
  conv2_weight=tf.Variable(tf.truncated_normal([4, 4, conv1_features, conv2_features], stddev=0.1, dtype=tf.float32))
  conv2_bias=tf.Variable(tf.zeros([conv2_features], dtype=tf.float32))
  # Fully connected variables
  resulting_width=image_width//(max_pool_size1*max_pool_size2)
  resulting_height=image_height//(max_pool_size1*max_pool_size2)
  full1_input_size=resulting_width*resulting_height*conv2_features
  full1_weight=tf.Variable(tf.truncated_normal([full1_input_size, fully_connected_size1], stddev=0.1, dtype=tf.float32))
  full1_bias=tf.Variable(tf.truncated_normal([fully_connected_size1], stddev=0.1, dtype=tf.float32))
  full2_weight=tf.Variable(tf.truncated_normal([fully_connected_size1, target_size], stddev=0.1, dtype=tf.float32))
  full2_bias=tf.Variable(tf.truncated_normal([target_size], stddev=0.1, dtype=tf.float32))

声明模型表达式:
  def my_conv_net(input_data):
     # First Conv-ReLU-MaxPool Layer
     conv1=tf.nn.conv2d(input_data, conv1_weight, strides=[1,1,1,1], padding='SAME')
     relu1=tf.nn.relu(tf.nn.bias_add(conv1, conv1_bias))
     max_pool1=tf.nn.max_pool(relu1, ksize=[1, max_pool_size1, max_pool_size1, 1], strides=[1, max_pool_size1, max_pool_size1, 1], padding='SAME')
     # Second Conv-ReLU-MaxPool Layer
     conv2=tf.nn.conv2d(max_pool1, conv2_weight, strides=[1,1,1,1], padding='SAME')
     relu2=tf.nn.relu(tf.nn.bias_add(conv2, conv2_bias))
     max_pool2=tf.nn.max_pool(relu2, ksize=[1, max_pool_size2, max_pool_size2, 1], strides=[1, max_pool_size2, max_pool_size2, 1], padding='SAME')
     # Transform Output into a 1xN layer for next fully connected layer
     final_conv_shape=max_pool2.get_shape().as_list()
     final_shape=final_conv_shape[1]*final_conv_shape[2]*final_conv_shape[3]
     flat_output=tf.reshape(max_pool2, [final_conv_shape[0], final_shape])
     # First Fully Connected Layer
     fully_connected1=tf.nn.relu(tf.add(tf.matmul(flat_output, full1_weight), full1_bias))
     # Second Fully Connected Layer
     final_model_output=tf.add(tf.matmul(fully_connected1, full2_weight), full2_bias)
     # Add Dropout
     final_model_output=tf.nn.dropout(final_model_output, dropout)
     return final_model_output
  model_output=my_conv_net(x_input)
  test_model_output=my_conv_net(eval_input)

创建损失函数:
  loss=tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=model_output, labels=y_target))
创建预测函数和准确度函数:
  prediction=tf.nn.softmax(model_output)
  test_prediction=tf.nn.softmax(test_model_output)
  def get_accuracy(logits, targets):
     batch_predictions=np.argmax(logits, axis=1)
     num_correct=np.sum(np.equal(batch_predictions, targets))
     return 100.*num_correct/batch_predictions.shape[0]

创建优化器,并初始化变量:
  my_optimizer=tf.train.MomentumOptimizer(learning_rate, 0.9)
  train_step=my_optimizer.minimize(loss)
  init=tf.global_variables_initializer()
  sess.run(init)

使用tf.test.TestCase()类来测试占位符或者变量的值。这里要确保dropout的保留概率大于0.25，即模型训练时丢弃的比例不会超过75%。代码为:
  class DropOutTest(tf.test.TestCase):
     def dropout_greaterthan(self):
       with self.test_session():
         self.assertGreater(dropout.eval(), 0.25)

测试准确度函数的功能正确。按预期创建一个样本数组,确保测试返回100%的准确度。代码为:
  class AccuracyTest(tf.test.TestCase):
     def accuracy_exact_test(self):
       with self.test_session():
         test_preds=[[0.9, 0.1], [0.01, 0.99]]
         test_targets=[0, 1]
         test_acc=get_accuracy(test_preds, test_targets)
         self.assertEqual(test_acc, 100.)

测试张量的形状符合预期,测试模型输出结果是预期的[batch_size, target_size]形状:
  class ShapeTest(tf.test.TestCase):
     def output_shape_test(self):
       with self.test_session():
         numpy_array=np.ones([batch_size, target_size])
         self.assertShapeEqual(numpy_array, model_output)

创建main()函数来测试正在运行的TensorFlow:
  def main(argv):
     train_loss=[]
     train_acc=[]
     test_acc=[]
     for i in range(generations):
       rand_index=np.random.choice(len(train_xdata), size=batch_size)
        rand_x=train_xdata[rand_index]
        rand_x=np.expand_dims(rand_x, 3)
       rand_y=train_labels[rand_index]
       train_dict={x_input: rand_x, y_target: rand_y, dropout: dropout_prob}
       sess.run(train_step, feed_dict=train_dict)
       temp_train_loss, temp_train_preds=sess.run([loss, prediction], feed_dict=train_dict)
       temp_train_acc=get_accuracy(temp_train_preds, rand_y)
       if (i+1)%eval_every==0:
         eval_index=np.random.choice(len(test_xdata), size=evaluation_size)
         eval_x=test_xdata[eval_index]
         eval_x=np.expand_dims(eval_x, 3)
         eval_y=test_labels[eval_index]
         test_dict={eval_input: eval_x, eval_target: eval_y, dropout: 1.0}
         test_preds=sess.run(test_prediction, feed_dict=test_dict)
         temp_test_acc=get_accuracy(test_preds, eval_y)
         # Record and print result
         train_loss.append(temp_train_loss)
         train_acc.append(temp_train_acc)
          test_acc.append(temp_test_acc)
         acc_and_loss=[(i+1), temp_train_loss, temp_train_acc, temp_test_acc]
         acc_and_loss=[np.round(x, 2) for x in acc_and_loss]
         print('Generation # {}. Train Loss: {:.2f}. Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

为了让程序实现训练或测试,需要使用命令行调用。下面为主程序代码,如果输入参数test就执行测试,否则执行训练:
  if __name__=='__main__':
     cmd_args=sys.argv
     if len(cmd_args)>1 and cmd_args[1]=='test':
       tf.test.main(argv=cmd_args[1:])
     else:
       tf.app.run(main=None, argv=cmd_args)

上面代码实现了3种单元测试:张量值、操作输出结果和张量形状。其他TensorFlow单元测试函数可浏览网址https://www.tensorflow.org/versions/master/api_docs/python/test.html。
单元测试有助于确保代码的功能符合预期,提供分享代码的信心,并使代码可重用。

2)多设备使用:

TensorFlow的计算图非常适合并行计算,就像处理不同批量一样,计算图可以由不同的处理器来处理。
一台主机的CPU经常配备一个或多个可分担计算的GPU。如果TensorFlow可以访问这些设备，它将通过贪婪策略自动把计算分布到多个设备上，也允许程序通过设备名指定哪个操作在哪个设备上执行。
为了访问GPU设备,需要安装TensorFlow的GPU版,并要求安装CUDA来使用GPU,并根据使用的操作系统选择安装步骤。
为了能够确定TensorFlow的什么操作正在使用什么设备,在计算图会话中传入config参数,将log_device_placement设为True,这样在命令行运行脚本时就会看到指定设备布局。
  import tensorflow as tf
  sess=tf.Session(config=tf.ConfigProto(log_device_placement=True))
  a=tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b=tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c=tf.matmul(a, b)
  print(sess.run(c))

从控制台运行下面的命令:
  python3 using_multiple_devices.py
然后会显示相关设备等信息。
TensorFlow会自动在计算设备上分配计算，但有时希望了解并控制设备的使用。如果加载先前保存过的模型，而该模型在计算图中绑定了本机没有的固定设备，可以在config中设置allow_soft_placement，让TensorFlow把操作软性放置到可用的设备上。代码为:
  config=tf.ConfigProto()
  config.allow_soft_placement=True
  sess_soft=tf.Session(config=config)

使用GPU时，TensorFlow默认会一次性占据大部分GPU内存。由于TensorFlow通常不会主动释放已占用的GPU内存，如有必要，可以设置GPU内存增长选项，让GPU内存分配随需求缓慢增长到最大限制。代码为:
  config.gpu_options.allow_growth=True
  sess_grow=tf.Session(config=config)

如果希望限制TensorFlow使用GPU内存的比例，可以通过config将per_process_gpu_memory_fraction设置为0到1之间的小数。代码为:
  config.gpu_options.per_process_gpu_memory_fraction=0.4
  sess_limited=tf.Session(config=config)

有时希望代码健壮到可以确定GPU是否可用，TensorFlow的内建函数可以实现此功能。如果期望代码在GPU可用时利用其计算能力，并把指定操作分配给GPU，这个功能会很有用。代码为:
  if tf.test.is_built_with_cuda():
     <Run GPU specific code here>

做一些简单计算，并将这些计算分配给CPU和两个GPU:
  with tf.device('/cpu:0'):
     a=tf.constant([1.0, 3.0, 5.0], shape=[1, 3])
     b=tf.constant([2.0, 4.0, 6.0], shape=[3, 1])
     with tf.device('/gpu:0'):
       c=tf.matmul(a, b)
       c=tf.reshape(c, [-1])
     with tf.device('/gpu:1'):
       d=tf.matmul(b, a)
       flat_d=tf.reshape(d, [-1])
     combined=tf.multiply(c, flat_d)
  print(sess.run(combined))

当希望为TensorFlow操作指定特定设备时，约定俗成的命名为:

  设备      设备名
  主CPU     /cpu:0
  副CPU     /cpu:1
  主GPU     /gpu:0
  副GPU     /gpu:1
  第三GPU   /gpu:2
TensorFlow在云服务上运行越来越容易，许多云服务提供商提供带有CPU和GPU的实例，比如亚马逊AWS(Amazon Web Services)的G实例和P2实例提供GPU计算能力来加速TensorFlow的处理，还有免费提供的AWS机器学习镜像AMI(Amazon Machine Image)，可以直接启动已预装TensorFlow GPU版的实例。

3)分布式TensorFlow实践:

为了扩展TensorFlow的并行能力,可将分开的计算图操作以分布式方式运行在不同的机器上。分布式TensorFlow允许构建TensorFlow集群,共享相同的训练和评估模型的计算任务,只需要简单地设置worker节点参数,然后为不同的worker节点分配不同的作业。
这里建立两个本地worker并分配不同的作业。
  import tensorflow as tf
  # Cluster for 2 local workers (task 0 and task1)
  cluster=tf.train.ClusterSpec({'local': ['localhost:2222', 'localhost:2223']})

将两个worker加入到集群中,并标记task数字:
  server=tf.train.Server(cluster, job_name="local", task_index=0)
  server=tf.train.Server(cluster, job_name="local", task_index=1)

为每个worker计算分配一个task,第1个worker将初始化两个矩阵,第2个worker计算每个矩阵所有元素的和,然后自动分配将两个和求和的任务,并打印结果:
  mat_dim=25
  matrix_list={}
  with tf.device('/job:local/task:0'):
     for i in range(0, 2):
       m_label='m_{}'.format(i)
       matrix_list[m_label]=tf.random_normal([mat_dim, mat_dim])
  # Have each worker calculate the sum
  sum_outs={}
  with tf.device('/job:local/task:1'):
     for i in range(0, 2):
       A=matrix_list['m_{}'.format(i)]
       sum_outs['m_{}'.format(i)]=tf.reduce_sum(A)
     # Sum all the sums
     summed_out=tf.add_n(list(sum_outs.values()))
  with tf.Session(server.target) as sess:
     result=sess.run(summed_out)
     print('Summed Values: {}'.format(result))

运行下面的命令:
  python3 parallelizing_tensorflow.py
就会出现相关显示。
使用分布式TensorFlow相当容易,只需在集群服务器中为worker节点分配带名字的IP,然后就可以手动或者自动为worker节点分配操作任务。

4)产品化开发提示:

如果想在产品中使用机器学习的脚本,有一些实践的注意点,包括如何最有效地保存和加载词汇表、计算图、模型变量和检查点,还有使用TensorFlow命令行参数解析器和日志级别的方法。
当运行TensorFlow程序时,希望确保内存中没有其他计算图会话,或者每次调试程序时重置计算图会话:
  from tensorflow.python.framework import ops
  ops.reset_default_graph()

当处理文本或者任意数据管道时,需要确保保存处理过的数据,以便随后用相同的方式处理评估数据。如果处理文本数据,需要确定能够保存和加载词汇字典。下面是保存JSON格式的词汇字典的代码:
  import json
  word_list=['to', 'be', 'or', 'not', 'to', 'be']
  vocab_list=list(set(word_list))
  vocab2ix_dict=dict(zip(vocab_list, range(len(vocab_list))))
  ix2vocab_dict={val: key for key,val in vocab2ix_dict.items()}
  # Save vocabulary
  with open('vocab2ix_dict.json', 'w') as file_conn:
     json.dump(vocab2ix_dict, file_conn)
  # Load vocabulary
  with open('vocab2ix_dict.json', 'r') as file_conn:
     vocab2ix_dict=json.load(file_conn)

上面使用JSON格式存储词汇字典,也可以将其存储为txt、csv或者二进制文件。如果词汇字典太大,建议使用二进制文件,可以使用pickle库创建pkl二进制文件,但是这种格式在不同的Python版本间兼容性不好。
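用pickle保存为二进制文件的示意代码(注意pkl文件在不同Python版本间的兼容性):
  import pickle
  with open('vocab2ix_dict.pkl', 'wb') as file_conn:
     pickle.dump(vocab2ix_dict, file_conn)
  with open('vocab2ix_dict.pkl', 'rb') as file_conn:
     vocab2ix_dict=pickle.load(file_conn)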
To save the model's computational graph and variables, create a Saver() op in the graph; it is recommended to save the model on a regular schedule during training. The code is:
  saver=tf.train.Saver()
  for i in range(generations):
     ......
     if i%save_every==0:
       saver.save(sess, 'my_model', global_step=i)
  # Can also save only specific variables:
  saver=tf.train.Saver({"my_var": my_variable})

The Saver() op also accepts arguments: a dictionary of variables or tensors to save only specific elements, and a keep_checkpoint_every_n_hours argument to save on a time schedule. By default only the 5 most recent checkpoints are kept, but this can be changed with the max_to_keep option, as sketched below.
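For example (the values here are illustrative):
  # Keep the 10 most recent checkpoints, plus one for every 2 hours of training
  saver=tf.train.Saver(max_to_keep=10, keep_checkpoint_every_n_hours=2)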
Before saving a model, be sure to name the operations that matter; without names assigned up front, it is hard to load a specific placeholder, operation, or variable from TensorFlow later. Most TensorFlow operations and functions accept a name argument:
  conv_weights=tf.Variable(tf.random_normal([4, 4]), name='conv_weights')  # shape here is illustrative
  loss=tf.reduce_mean(..., name='loss')

TensorFlow's tf.app.flags module makes command-line argument parsing easy; string, float, integer, or boolean flags can be defined, and calling tf.app.run() runs the main() function with the flags parsed:
  tf.flags.DEFINE_string("worker_locations", "", "List of worker addresses.")
  tf.flags.DEFINE_float("learning_rate", 0.01, "Initial learning rate.")
  tf.flags.DEFINE_integer("generations", 10001, "Number of training generations.")
  tf.flags.DEFINE_boolean("run_unit_tests", False, "If true, run tests.")
  FLAGS=tf.flags.FLAGS
  # Need to define a 'main' function for the app to run
  def main(argv=None):
     worker_ips=FLAGS.worker_locations.split(",")
     learning_rate=FLAGS.learning_rate
     generations=FLAGS.generations
     run_unit_tests=FLAGS.run_unit_tests
  # Run the TensorFlow app
  if __name__=="__main__":
     # The following looks for a main() function to run
     tf.app.run()
     # Or, with a custom entry point and arguments:
     # tf.app.run(main=my_main_function, argv=my_arguments)

TensorFlow has built-in logging with configurable levels: DEBUG, INFO, WARN, ERROR, and FATAL; the default level is WARN. The code is:
  tf.logging.set_verbosity(tf.logging.WARN)
  # INFO or DEBUG
  tf.logging.set_verbosity(tf.logging.DEBUG)

It is worth getting into the habit of writing code with these tools.

5) A production example:

The best practice for production machine-learning models is to separate the training and evaluation code. The example below covers only the evaluation script, including unit tests, model saving and loading, and model evaluation. It again uses the RNN model that predicts whether a text message is spam, assuming the RNN model has already been trained and its vocabulary saved.
  import os
  import re
  import numpy as np
  import tensorflow as tf
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  # Define App Flags
  tf.flags.DEFINE_string("storage_folder", "temp", "Whrer to store model and data.")
  tf.flags.DEFINE_float("learning_rate", 0.0005, "Initial learning rate.")
  tf.flags.DEFINE_float("dropout_prob", 0.5, "Per to keep probability for dropout.")
  tf.flags.DEFINE_integer("epochs", 20, "Number of epochs for training.")
  tf.flags.DEFINE_integer("batch_size", 250, "Batch size for training.")
  tf.flags.DEFINE_integer("rnn_size", 15, "RNN feature size.")
  tf.flags.DEFINE_integer("embedding_size", 25, "Word embedding size.")
  tf.flags.DEFINE_integer("min_word_frequency", 20, "Word frequency cutoff.")
  tf.flags.DEFINE_boolean("run_unit_tests", False, "If true, run tests.")
  FLAGS=tf.flags.FLAGS

Declare the text-cleaning function; the training script uses the same cleaning function:
  def clean_text(text_string):
     text_string=re.sub(r'([^\s\w]|_|[0-9])+', '', text_string)
     text_string=" ".join(text_string.split())
     text_string=text_string.lower()
     return text_string
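A quick sanity check of what the cleaning function does (illustrative input):
  print(clean_text('Hello, World! 123'))  # prints: hello world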

Load the vocabulary processor:
  def load_vocab():
     vocab_path=os.path.join(FLAGS.storage_folder, 'vocab')
     vocab_processor=tf.contrib.learn.preprocessing.VocabularyProcessor.restore(vocab_path)
     return vocab_processor

Create the data-processing pipeline:
  def process_data(input_data, vocab_processor):
     input_data=clean_text(input_data)
     input_data=input_data.split()
     processed_input=np.array(list(vocab_processor.transform(input_data)))
     return processed_input

Ask the user for input text, then process it and return the processed text for evaluating the model:
  def get_input_data():
     input_text=input("Please enter a text message to evaluate.")
     vocab_processor=load_vocab()
     return process_data(input_text, vocab_processor)

Many applications instead receive data from files or an API; adapt the input function as needed.
For unit testing, make sure the text-processing function behaves as expected:
  class clean_test(tf.test.TestCase):
     def clean_string_test(self):
       with self.test_session():
         test_input="--TensorFlow's so Great! Don't you think so? "
         test_expected='tensorflows so great dont you think so'
         test_out=clean_text(test_input)
         self.assertEqual(test_expected, test_out)

Run the main function, which fetches the dataset, builds the graph, loads the model variables, feeds in the processed data, and prints the output:
  def main(args):
     # Get flags
     storage_folder=FLAGS.storage_folder
     # Get user input text
     x_data=get_input_data()
     # Load model
     graph=tf.Graph()
     with graph.as_default():
       sess=tf.Session()
       with sess.as_default():
         # Load the saved meta graph and restore variables
          saver=tf.train.import_meta_graph("{}.meta".format(os.path.join(storage_folder, "model.ckpt")))
          saver.restore(sess, os.path.join(storage_folder, "model.ckpt"))
         # Get the placeholders from the graph by name
         x_data_ph=graph.get_operation_by_name("x_data_ph").outputs[0]
         dropout_keep_prob=graph.get_operation_by_name("dropout_keep_prob").outputs[0]
         probability_outputs=graph.get_operation_by_name("probability_outputs").outputs[0]
         # Make the prediction
         eval_feed_dict={x_data_ph: x_data, dropout_keep_prob: 1.0}
         probability_prediction=sess.run(tf.reduce_mean(probability_outputs, 0), eval_feed_dict)
         # Print output (Or save to file or DB connection?)
         print('Probability of Spam: {:.4}'.format(probability_prediction[1]))

How the unit tests or the main() function get run:
  if __name__=="__main__":
     if FLAGS.run_unit_tests:
       # Perform unit tests
       tf.test.main()
     else:
       # Run evaluation
       tf.app.run()

For model evaluation, the script loads its command-line arguments through the TensorFlow app flags, loads the model and the vocabulary processor, and then runs the processed data through the trained model to make predictions.
Run the script above from the command line, and check that the training script has already created the model and the vocabulary dictionary.
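For example (the script name here is hypothetical; any flag defined above can be overridden):
  python3 evaluate_spam.py --storage_folder=temp --run_unit_tests=False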

6) Deploying with TensorFlow Serving:

Here we load the TensorFlow RNN model that classifies spam versus ham into a local model server that accepts input on port 9000. The tutorial at https://www.tensorflow.org/serving/serving_basic covers the details. We modify the earlier RNN model so that it is saved in protobuf format; the related commands are executed at a bash prompt.
Install the model server on Linux with:
  sudo apt install tensorflow-model-server
First import the libraries and set the TensorFlow flags:
  import os
  import re
  import io
  import sys
  import requests
  import numpy as np
  import tensorflow as tf
  from zipfile import ZipFile
  from tensorflow.python.framework import ops
  ops.reset_default_graph()
  # Define App Flags
  tf.flags.DEFINE_string("storage_folder", "temp", "Whrer to store model and data.")
  tf.flags.DEFINE_float("learning_rate", 0.0005, "Initial learning rate.")
  tf.flags.DEFINE_float("dropout_prob", 0.5, "Per to keep probability for dropout.")
  tf.flags.DEFINE_integer("epochs", 20, "Number of epochs for training.")
  tf.flags.DEFINE_integer("batch_size", 250, "Batch size for training.")
  tf.flags.DEFINE_integer("rnn_size", 15, "RNN feature size.")
  tf.flags.DEFINE_integer("embedding_size", 25, "Word embedding size.")
  tf.flags.DEFINE_integer("min_word_frequency", 20, "Word frequency cutoff.")
  tf.flags.DEFINE_boolean("run_unit_tests", False, "If true, run tests.")
  FLAGS=tf.flags.FLAGS

Only the parts of the script that differ are shown below:
  out_path=os.path.join(tf.compat.as_bytes(os.path.join(storage_folder, '1')))
  print('Exporting finished model to : {}'.format(out_path))
  builder=tf.saved_model.builder.SavedModelBuilder(out_path)
  # Build the signature_def_map
  classification_inputs=tf.saved_model.utils.build_tensor_info(x_data_ph)
  classification_outputs_classes=tf.saved_model.utils.build_tensor_info(rnn_model_outputs)
  classification_signature=(tf.saved_model.signature_def_utils.build_signature_def(inputs={tf.saved_model.signature_constants.CLASSIFY_INPUTS: classification_inputs}, outputs={tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES: classification_outputs_classes}, method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME))
  tensor_info_x=tf.saved_model.utils.build_tensor_info(x_data_ph)
  tensor_info_y=tf.saved_model.utils.build_tensor_info(y_output_ph)
  prediction_signature=(tf.saved_model.signature_def_utils.build_signature_def(inputs={'texts': tensor_info_x}, outputs={'scores': tensor_info_y}, method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
  legacy_init_op=tf.group(tf.tables_initializer(), name='legacy_init_op')
  builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING], signature_def_map={'predict_spam': prediction_signature, tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature}, legacy_init_op=legacy_init_op)
  builder.save()
  print('Done exporting')

TensorFlow Serving expects a specific file and folder structure when loading a model. Here the data directory temp is created with model version number 1; inside that version subdirectory, the protobuf model and a variables folder are saved.
Within the data directory, TensorFlow Serving looks for integer-named folders and automatically starts serving the model under the largest integer. To deploy a new model, label it version 2 and place it in a folder named 2; TensorFlow Serving then picks it up automatically. The resulting layout, standard for SavedModel, is sketched below.
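  temp/
    1/
      saved_model.pb
      variables/
        variables.data-00000-of-00001
        variables.index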
To start the server, run the tensorflow_model_server command with port, model_name, and model_base_path arguments; the server scans the version folders, loads the model with the highest version number, and serves it on the given port of the machine where the command runs. To run on the local machine, accepting the default port 9000:
  tensorflow_model_server --port=9000 --model_name=spam_ham --model_base_path=<directory of our code>/tensorflow_cookbook/10_Taking_TensorFlow_to_Production/06_Using_TensorFlow_Serving/temp/
The terminal shows the relevant log output.
Binary data can then be submitted to <host>:9000, which returns a JSON response with the results.
The code above deploys a model server on the host that answers incoming requests, expecting the input data as binary numpy arrays. The same service could also be built with other stacks, such as Docker, Kubernetes, Luigi, Django/Flask, Celery, AWS, or Azure. A client sketch follows.
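This sketch assumes the tensorflow-serving-api package is installed and speaks gRPC to the server; the model name 'spam_ham' and signature 'predict_spam' come from the export code above, while the input shape is hypothetical:
  import grpc
  import numpy as np
  import tensorflow as tf
  from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
  # Connect to the local model server started above
  channel=grpc.insecure_channel('localhost:9000')
  stub=prediction_service_pb2_grpc.PredictionServiceStub(channel)
  # Build a PredictRequest against the 'predict_spam' signature
  request=predict_pb2.PredictRequest()
  request.model_spec.name='spam_ham'
  request.model_spec.signature_name='predict_spam'
  texts=np.zeros((1, 25), dtype=np.int64)  # hypothetical processed word-index input
  request.inputs['texts'].CopyFrom(tf.make_tensor_proto(texts))
  result=stub.Predict(request, timeout=10.0)
  print(result.outputs['scores'])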

7) Visualization with TensorBoard:

Monitoring and troubleshooting machine-learning algorithms is hard, especially when training a model takes a long time and the outcome is unknown until it finishes. TensorBoard, TensorFlow's visualization module, can render the computational graph and plot important quantities from training such as loss, accuracy, and batch training time; it visualizes summary statistics, graphs, and images produced while a model trains. This section shows how to monitor scalar values and dataset histograms and how to create images in TensorBoard.
  import os
  import io
  import time
  import numpy as np
  import matplotlib.pyplot as plt
  import tensorflow as tf

Initialize the computational graph and create a summary_writer that writes the TensorBoard summaries into a tensorboard folder:
  sess=tf.Session()
  summary_writer=tf.summary.FileWriter('tensorboard', sess.graph)

Make sure the tensorboard folder that summary_writer targets exists, so the TensorBoard logs can be written:
  if not os.path.exists('tensorboard'):
     os.makedirs('tensorboard')

Set the model parameters and generate a dataset for the model:
  batch_size=50
  generations=100
  # Create sample input data
  x_data=np.arange(1000)/10.
  true_slope=2.
  y_data=x_data*true_slope+np.random.normal(loc=0.0, scale=25, size=1000)

The code sets the true slope true_slope to 2; during training we visualize how the slope estimate changes over time and approaches this true value.
Split the dataset into training and test sets:
  train_ix=np.random.choice(len(x_data), size=int(len(x_data)*0.9), replace=False)
  test_ix=np.setdiff1d(np.arange(1000), train_ix)
  x_data_train, y_data_train=x_data[train_ix], y_data[train_ix]
  x_data_test, y_data_test=x_data[test_ix], y_data[test_ix]

Create the placeholders, variables, model operation, loss, and optimizer:
  x_graph_input=tf.placeholder(tf.float32, [None])
  y_graph_input=tf.placeholder(tf.float32, [None])
  # Declare model variable
  m=tf.Variable(tf.random_normal([1], dtype=tf.float32), name='Slope')
  # Declare model
  output=tf.multiply(m, x_graph_input, name='Batch_Multiplication')
  # Declare loss function (L1)
  residuals=output-y_graph_input
  l1_loss=tf.reduce_mean(tf.abs(residuals), name="L1_Loss")
  # Declare optimization function
  my_optim=tf.train.GradientDescentOptimizer(0.01)
  train_step=my_optim.minimize(l1_loss)

Create a TensorBoard op that summarizes one scalar value, the model's slope estimate:
  with tf.name_scope('Slope_Estimate'):
     tf.summary.scalar('Slope_Estimate', tf.squeeze(m))

Another kind of summary to add to TensorBoard is the histogram; a histogram summary takes a tensor and records plots of its distribution:
  with tf.name_scope('Loss_and_Residuals'):
     tf.summary.histogram('Histogram_Errors', tf.squeeze(l1_loss))
     tf.summary.histogram('Histogram_Residuals', tf.squeeze(residuals))

With the summary ops created, merge them all into a single op that collects the summary data, then initialize the model variables:
  summary_op=tf.summary.merge_all()
  # Initialize Variables
  init=tf.global_variables_initializer()
  sess.run(init)

Train the linear regression model and write the summary data on every training generation:
  for i in range(generations):
     batch_indices=np.random.choice(len(x_data_train), size=batch_size)
     x_batch=x_data_train[batch_indices]
     y_batch=y_data_train[batch_indices]
     _, train_loss, summary=sess.run([train_step, l1_loss, summary_op], feed_dict={x_graph_input: x_batch, y_graph_input: y_batch})
     test_loss, test_resids=sess.run([l1_loss, residuals], feed_dict={x_graph_input: x_data_test, y_graph_input: y_data_test})
     if (i+1)%10==0:
       print('Generation {} of {}. Train Loss: {:.3}, Test Loss: {:.3}.'.format(i+1, generations, train_loss, test_loss))
     summary_writer.add_summary(summary, i)

To visualize how the fitted regression line matches the data points, draw the plot and save it as a PNG in a memory buffer:
  def gen_linear_plot(slope):
     linear_prediction=x_data*slope
     plt.plot(x_data, y_data, 'b.', label='data')
     plt.plot(x_data, linear_prediction, 'r-', linewidth=3, label='predicted line')
     plt.legend(loc='upper left')
     buf=io.BytesIO()
     plt.savefig(buf, format='png')
     buf.seek(0)
     return(buf)

Create the image summary from the PNG buffer and add it to TensorBoard:
  slope=sess.run(m)
  # Generate the linear plot in buffer
  plot_buf=gen_linear_plot(slope[0])
  # Convert PNG buffer to TF image
  image=tf.image.decode_png(plot_buf.getvalue(), channels=4)
  # Add the batch dimension
  image=tf.expand_dims(image, 0)
  # Add image summary
  image_summary_op=tf.summary.image("Linear_Plot", image)
  image_summary=sess.run(image_summary_op)
  summary_writer.add_summary(image_summary, i)
  summary_writer.close()

Do not write image summaries to TensorBoard too often: with many training generations, writing one on every generation quickly consumes large amounts of disk space. A guard like the one sketched below avoids this.
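For instance, if the image summary were generated inside the training loop, a modulus check keeps the writes sparse (image_summary_every is a hypothetical interval):
  image_summary_every=50
  if (i+1)%image_summary_every==0:
     summary_writer.add_summary(image_summary, i)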
Run the script from the command line:
  python3 using_tensorboard.py
Then start TensorBoard with:
  tensorboard --logdir="tensorboard"
A message like the following appears:
  Starting tensorboard b'29' on port 6006 (You can navigate to http://127.0.0.1:6006)
Open http://127.0.0.1:6006 in a browser to see the sample graphics: the slope-estimate trend over 100 generations, the summary histograms, and the final fitted line over the data points, inserted into the TensorBoard image summary from the protobuf-format plot.
