I was recently tasked with teaching a lecture on TensorFlow to ~200 MIT students in 6.864 (Deep Learning for NLP). This post contains everything I taught them – and more.
“Literally the best presentation, ever.”
- random student during my presentation to 6.864 (before they fainted)
Everything in TensorFlow comes down to building a computation graph. What is a computation graph? Its just a series of math operations that occur in some order. Here is an example of a simple computation graph:
This graph takes 2 inputs, (a, b) and computes an output (e). Each node in the graph is an operation that takes some input, does some computation, and passes its output to another node.
We could make this computation graph in TensorFlow in the following way:
tf.placeholder to handle inputs to the model. This is like making a reservation at a restaurant. The restaurant reserves a spot for 5 people, but you are free to fill those seats with any set of friends you want to.
tf.placeholder lets you specify that some input will be coming in, of some shape and some type. Only when you run the computation graph do you actually provide the values of this input data. You would run this simple computation graph like this:
feed_dict to pass in the actual input data into the graph. We use
session.run to get the output from the c operation in the graph. Since e is at the end of the graph, this ends up running the entire graph and returning the number 45 - cool!
Note that the computation graph we have defined is fixed + static. It can’t really learn anything. We’re now going to make a computation graph that can do some learning (hint hint: a neural network).
Our task: Sentiment Analysis
For this tutorial, we’re going to be learning a model to perform sentiment analysis on tweets. Given some tweet, we want our network to determine if the tweet is positive, negative, or neutral. This data can be downloaded here.
Building the Model
In this model, we’ll be representing tweets as bag-of-words (BOW) representations. BOW representations are vectors where each element index represents a different word and its value represents the number of times this word appears in our sentence. This means that each sentence will be represented by a vector that is
vocab_size long. Our output labels will be represented as a vector of size
n_classes (3). We get this data with some utility functions:
So we have our data loaded as numpy arrays. But remember, TensorFlow graphs begin with generic placeholder inputs, not actual data. We feed the actual data in later once the full graph has been defined. We define our placeholders like this:
A note about ‘None’ and fluid-sized dimensions:
You may notice that the first dimension of shape of
data_placeholder is ‘None’.
data_placeholder should have shape (num_tweets, vocab_size). However, we don’t know how many tweets we are going to be passing in at a time. Its possible that we only want to pass in 1 tweet at a time, or 30, or 1,000. Thankfully, TensorFlow allows us to specify placeholders with fluid-sized dimensions. We can use None to specify some fluid dimension of our shape. When our data eventually gets passed in as a numpy array, TensorFlow can figure out what the value of the fluid-size dimension should be.
Let’s now define and initialize our network parameters:
We have defined our model parameters using tf.Variable. When you create a tf.Variable you pass a Tensor as its initial value to the Variable() constructor. A Tensor is a term for any N-dimensional matrix. There are a ton of different initial Tensor value functions you can use (full list). All these functions take a list argument that determines their shape. Here we use
tf.truncated_normal for our weights, and
tf.zeros for our biases. Its important that the shape of these parameters are compatible. We’ll be matrix-multiplying the weights, so the last dimension of the previous weight must equal the first dimension of the next weight. Notice this pattern in the Tensor initialization code. Lastly, notice the size of the Tensor for our last weights. We are predicting a vector of size
n_classes so our network needs to end with
Now let’s define our computation graph.
Lets break this down. So our first operation in our graph is a
tf.matmul of our data input and our first set of weights. tf.matmul performs a matrix multiplication of two tensors. This is why it is so important that the dimensions of
h0_weights align (dimension 1 of
data_placeholder must equal dimension 0 of
h0_weights). We then just add the
h0_bias variable and perform a nonlinearity transformation (we use
tf.relu for a rectified linear unit (relu). We do this again for the next hidden layer, and the final output logits.
We have defined our network, but we need a way to train it. Training a network comes down to optimizing our network to minimize a loss function. Here we define our loss function and optimization strategy.
This is a classification problem, so our loss for each example is just the cross entropy of our prediction and the label. We train this one minibatch at a time, so we average the loss of all the examples in our minibatch to get the minibatch loss which we use to update the network parameters. We use Gradient Descent via
tf.train.GradientDescentOptimizer and minimize this loss. We also define a way to compute accuracy for later logging of results.
Quick Conceptual Note:
Nearly everything we do in TensorFlow is an operation with inputs and outputs. Our
loss variable is an operation, that takes the output of the last layer of the net, which takes the output of the 2nd-to-last layer of the net, and so on. Our loss can be traced back to the input as a single function. This is our full computation graph. We pass this to our optimizer which is able to compute the gradient for this computation graph and adjust all the weights in it to minimize the loss.
Training our Net
We have our network, our loss function, and our optimizer ready, now we just need to pass in the data to train it. We pass in the data in chunks called
Running this code, you’ll see the network train and output its performance as it learns. I was able to get it to 65.5% accuracy. This is not great considering random guessing gets you 33.3% accuracy. We’ll cover numerous ways to improve upon this in later tutorials.
This was a brief introduction into TensorFlow. There is so, so much more to learn and explore, but hopefully this has given you some base knowledge to expand upon. As an exercise: see what you can do with this code to improve the performance. Ideas include: randomizing mini-batches, making the network deeper, using RNN’s on word tokens instead of fully connected layers on bag-of-words, trying different optimizers (like Adam), different weight initializations. We’ll explore some of these in future tutorials.
HackerNews discussion: here.