Beginning ML: What next?

Machine learning is a branch of artificial intelligence, and it deals with lots and lots of data. In our neural network models we used the MNIST dataset, which was pre-processed and can be used directly by TensorFlow. We also used a raw dataset that we had to pre-process ourselves before our model could use it. Clearly, one needs to understand and analyse data (in fact, large amounts of data).

Data analysis is the branch that deals with data, helps us understand it, and lets us draw conclusions from it. Since data is a critical part of machine learning, we have to learn to analyse it.

In the next posts we will look at how we can use Python to perform data analysis. I'm looking up a few courses online to learn data analysis.

Let's catch up in the next one!


Beginning ML: Movie Review Sentiment Analyser cont. : KnowDev

In the last post we used pandas to extract raw data from .csv files and used the bag of words model to pre-process our data into feature sets.

In this post we will train the model. It's the simplest part. We will use a random forest to predict. A random forest is a collection of decision trees.

First we initialize the forest with 100 decision trees.

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)

We will use the fit function of the forest object to build a forest of trees from the training set.

forest = forest.fit(train_cleaned_data, train['sentiment'])

train_cleaned_data is the pre-processed data from our last post, and train['sentiment'] holds the label for each review. And we are done with training our model.

Now, we can test and predict using our model.

To test, we first have to transform the raw test data into the required format. We use transform (rather than fit_transform) at test time so that the vectorizer keeps the vocabulary it learned from the training data and nothing from the test set leaks into the features.

    test_data_features = vectorizer.transform(clean_test_reviews)

Then we simply predict using the predict function of the forest object.

    result = forest.predict(test_data_features)

We will finish off our testing by writing all the predictions to a file for permanent storage (a quick sketch of this step is below). And that's it: we have used a new model and a new technique to build a sentiment analyzer. This model is not ready for commercial use because, one, we did not use a large dataset, and two, we did not use a more sophisticated model. In upcoming posts we will see what those "sophisticated" techniques or models are. I'm sure those concepts will be much more interesting. With that, I'll see you soon!
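As a quick sketch of that file-writing step, pandas makes it a one-liner. The 'id' column and the output file name here are assumptions for illustration, not taken from the original code:

import pandas as pd

# Hypothetical column names and file name, for illustration only
output = pd.DataFrame({'id': test['id'], 'sentiment': result})
output.to_csv('sentiment_predictions.csv', index=False)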

Complete source code here

Beginning ML – Movie Review Analysis: KnowDev

Up to the last post we have seen how to build a sentiment analyzer using a multi-layer feedforward neural network. In this post we will also build a sentiment analyzer, one that can predict how positive or negative a movie review is. We can consider this one of the use cases of what we have learned so far.

This concept is divided into 2 parts: one, pre-processing our data; two, using the random forest technique to predict.

Pre-processing:

We will use the pandas module to extract data from a csv file. As we did before, we will use the bag of words model to create feature sets. But first we have to clear out a little dirt: html tags (using the beautifulsoup module), punctuation, and stopwords. Stopwords are words like the, and, an, is, which do not add any specific emotion to the sentence. We remove punctuation as well just to reduce complexity; once we get familiar with what we are doing we can add more complexity to our model. We will implement all of this functionality in the function clean_text.
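To make this concrete, here is a minimal sketch of what clean_text might look like, assuming the nltk English stopword list has been downloaded; the actual implementation in the source code may differ in details:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def clean_text(raw_review):
    # Strip html tags left over in the scraped reviews
    text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # Keep letters only, dropping punctuation and numbers
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase and split into individual words
    words = letters_only.lower().split()
    # Drop stopwords such as "the", "and", "an", "is"
    stops = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w not in stops]
    # Rejoin into a single space-separated string
    return ' '.join(meaningful_words)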

Now we have to apply these modifications to all the reviews in our file. We call that function create_clean_train. It might take a couple of minutes because there are almost 25000 reviews altogether.
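A sketch of that step, assuming the reviews live in a 'review' column of the pandas DataFrame (the column name is an assumption):

def create_clean_train(train):
    # Run clean_text over every review; with ~25000 reviews this takes a while
    return [clean_text(review) for review in train['review']]

clean_train_reviews = create_clean_train(train)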

We will create feature sets using CountVectorizer from scikit-learn.
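Roughly, the feature-set creation could look like this sketch; the max_features value is an assumption:

from sklearn.feature_extraction.text import CountVectorizer

# Learn a bag-of-words vocabulary from the cleaned training reviews and
# turn each review into a vector of word counts
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
train_cleaned_data = vectorizer.fit_transform(clean_train_reviews).toarray()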

In the next post, we will finish building our movie review sentiment analyser. See you next time!

Complete source code: here

Beginning ML: Sentiment Analysis Using Textblob : KnowDev

In the last post we built a neural network for sentiment analysis. We used our own dataset, which was not really big enough, so we were only able to achieve an accuracy of 54%. Today we shall use a Python module for sentiment analysis. We shall be building a Twitter sentiment analyzer! Believe me, you'll be amazed by how easily we can achieve it!

First we need to install 2 modules. The first is tweepy, which allows us to make API calls to Twitter; we have to create an app on the Twitter developer site to actually authenticate ourselves. Next, we need textblob, which can perform sentiment analysis. Textblob can actually perform many more operations apart from sentiment analysis; if you are curious you can check it out here.

Let’s import our dependencies

import tweepy
from textblob import TextBlob

We have to declare 4 variables: consumer_key, consumer_secret, access_token, access_token_secret. All of these can be found after we create an app on the Twitter developer site.
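The declarations might look like this, with placeholder strings you replace with your own keys:

# Placeholder values: replace these with the keys from your own Twitter app
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'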

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

The above 2 lines authenticate us. We are almost done with the authentication.

api = tweepy.API(auth)

Through the api variable we can use the search operation to find public tweets.

public_tweet = api.search('search')

search is the keyword we will be searching for. Now we can iterate through public_tweet and use textblob to perform sentiment analysis on each tweet.

for tweet in public_tweet:
    T = tweet.text
    analysis = TextBlob(tweet.text)
    sentiment = analysis.sentiment.polarity
    print(T, sentiment)

And that's it! We have successfully used the tweepy and textblob modules to build a Twitter sentiment analyzer in less than 25 lines. In fact there are many more sources whose APIs we could use in the same way.

This is a relatively small post, and you know why! Now you can use a sentiment analyzer for a wide range of use cases, and I'll see you in the next one!

Complete source code

Beginning ML – Sentiment Analysis Using Neural Network cont. : KnowDev

This post is a continuation of this.

I hope you have a good understanding of why we have to pre-process. In this post we shall train our model and also feed it our own sentences.

First of all, we get the feature sets we created earlier, either by loading them from pickle or by calling the function and storing the result in variables.

from create_sentiment_featuresets import create_feature_sets_and_labels
train_x, train_y, test_x, test_y = create_feature_sets_and_labels('pos.txt', 'neg.txt')

We will be using the same neural network model that we used here. First we have to define our placeholders for the features and labels.

x = tf.placeholder('float', [None, len(train_x[0])])
y = tf.placeholder('float')

len(train_x[0]) returns the length of the feature vector.

The neural network model is defined in the neural_network_model function. After the neural network is defined, it's time to train our model.
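For reference, here is a minimal sketch of what a neural_network_model function of this kind usually looks like, assuming two hidden layers of 500 nodes and two output classes (positive / negative); the exact sizes in the source code may differ:

n_nodes_hl1 = 500  # assumed hidden layer sizes
n_nodes_hl2 = 500
n_classes = 2      # positive / negative

def neural_network_model(data):
    # Each layer is a dict holding a weight matrix and a bias vector
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([len(train_x[0]), n_nodes_hl1])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                    'biases': tf.Variable(tf.random_normal([n_classes]))}

    # (input * weights) + biases, followed by a ReLU activation
    l1 = tf.nn.relu(tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases']))
    l2 = tf.nn.relu(tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases']))

    # No activation on the output; softmax is applied inside the cost function
    return tf.add(tf.matmul(l2, output_layer['weights']), output_layer['biases'])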

First we'll capture the prediction / output of the neural network using

prediction = neural_network_model(x)

Then, we have to find the cross entropy of the prediction made by our model. We are using softmax regression.

#1
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))

After finding the cross entropy, it is time to back propagate and try to reduce it.

#2
optimizer = tf.train.AdadeltaOptimizer(0.5).minimize(cross_entropy)

Together, #1 and #2 make up the training step. We'll start a session and train for 10 epochs.
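Here is a hedged sketch of what that training loop could look like; the batch size of 100 is an assumption, and the variable names follow the ones defined above:

batch_size = 100  # assumed batch size
hm_epochs = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(hm_epochs):
        epoch_loss = 0
        i = 0
        while i < len(train_x):
            # Feed the training data in batches
            batch_x = train_x[i:i + batch_size]
            batch_y = train_y[i:i + batch_size]
            _, c = sess.run([optimizer, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            epoch_loss += c
            i += batch_size
        print('Epoch', epoch + 1, 'completed out of', hm_epochs, 'loss:', epoch_loss)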

The accuracy we could achieve was 55.44

[Screenshot: training run output showing the model accuracy]

The trained model is saved into 'sentiment_model.ckpt'; later we can use that file to restore our variables (i.e. the weights and biases) for reuse.
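Saving is typically done with a tf.train.Saver; a minimal sketch, assuming the save happens inside the training session once training has finished:

saver = tf.train.Saver()

# Inside the training session, once training has finished
saver.save(sess, 'sentiment_model.ckpt')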

Making predictions:

To make predictions with the model we have just trained, we have to pre-process our input sentence so that it can be passed to the model as features. After we pre-process the input sentence, we predict.
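A hedged sketch of that pre-processing step, assuming the same lexicon and lemmatizer that were used when the feature sets were built:

import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Turn the input sentence into the same bag-of-words vector the model was trained on
words = [lemmatizer.lemmatize(w) for w in word_tokenize(input_data.lower())]
features = np.zeros(len(lexicon))
for word in words:
    if word in lexicon:
        features[lexicon.index(word)] += 1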

result = sess.run(tf.argmax(prediction.eval(
    feed_dict={x: [features[:423]]}), 1))

We print out whether the output is positive or negative using

if result[0] == 0:
    print('Positive:', input_data)
elif result[0] == 1:
    print('Negative:', input_data)

[Screenshot: sample sentences with their predicted sentiment]

As you can see, our model makes pretty good predictions even though the accuracy is only 54%.

In this post we have seen how to train on our own data as well as how to use the trained model. In less than a week we were able to make a machine that can predict the sentiment of any sentence, pretty interesting right? In the next post I will introduce you to a more sophisticated version of sentiment analysis. See you in the next one!

link to complete source code :  here

Beginning ML – Sentiment Analysis Using Deep Neural Network: KnowDev

In the last post we implemented our first neural network, which can classify a set of images. In fact, that experiment can be considered a HELLO WORLD program. There is a lot more to consider while implementing our model, mainly data! A lot of the time data is very raw and we have to perform some kind of preprocessing so that it is in a format that TensorFlow objects can accept. A sentiment analyser is a program that can tell whether a given sentence is positive or negative. We will be using the same neural network model that we built in the last post. For better understanding, the whole process of building the sentiment analyzer is divided into parts.

In this post, we shall look at how to get raw data and convert it into the required format. Both the positive and the negative dataset are available at the GIT link. First we'll download the datasets into our directory. Each dataset contains 5000 sentences. Yes! The data we have is not really enough for practical purposes.

Once we have our datasets ready in our directory (and have imported tensorflow, duh!), we shall create feature sets from the data.

First of all we have to create our vocabulary of words. The model we will use is bag of words. We will call this collection of words a lexicon. We will use the nltk library to extract the words that are most relevant; the techniques we are using are stemming and lemmatizing.
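Here is a hedged sketch of how the raw lexicon could be gathered before lemmatizing; the helper below is hypothetical and the one in the source code may look different:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def create_lexicon(pos, neg):
    lexicon = []
    for fi in [pos, neg]:
        with open(fi, 'r') as f:
            for line in f:
                # Tokenize each sentence and collect every word
                lexicon += list(word_tokenize(line.lower()))
    return lexicon

lexicon = create_lexicon('pos.txt', 'neg.txt')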

lexicon = [lemmatizer.lemmatize(i) for i in lexicon]

The lexicon on the right-hand side just contains all the words from pos.txt and neg.txt (our datasets); the left-hand side is the lemmatized version. We could also employ other techniques such as removing stop words (like the, an, a...), which have no particular effect on the sentiment of a sentence. We roughly remove those words anyway by keeping only words whose frequency lies between 50 and 1000: extremely common words (the stop words) appear far more than 1000 times, and very rare words add little.

for w in w_counts:
    if 1000 > w_counts[w] > 50:
        l2.append(w) # l2- final lexicon list

Now that we have created our vocabulary we can create our features. Here our lexicon size is 423. A tensor accepts an object of floats, but our sentences are strings. Hence we use the lexicon we created earlier to make a vector that contains the frequency of each lexicon word in the sentence.

For example, if lexicon = ['dog', 'cat', 'eat', 'fight', 'food'] and the given sentence is "dog fights with cat for food", then the feature set is [1, 1, 0, 1, 1].
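In code, building that vector for a single sentence could look like this sketch (a hypothetical helper, not taken from the source):

import numpy as np
from nltk.tokenize import word_tokenize

def sentence_to_features(sentence, lexicon):
    # Tokenize and lemmatize the sentence the same way the lexicon was built
    words = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence.lower())]
    features = np.zeros(len(lexicon))
    for word in words:
        if word in lexicon:
            # Count how often each lexicon word appears in the sentence
            features[lexicon.index(word)] += 1
    return features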

We create a list of [features, classification] pairs. Positive is denoted as [1, 0] and negative as [0, 1].

features = list(features)
featureset.append([features, classification])

Finally we'll create our combined collection of feature sets from both the positive and negative files. The list is shuffled so that the neural network does not see all the positive examples followed by all the negative ones, which helps it converge.

features += sample_handling('pos.txt', lexicon, [1, 0])
features += sample_handling('neg.txt', lexicon, [0, 1])
random.shuffle(features)

Now the whole set is divided into training data and testing data.

features = np.array(features)  # convert to a numpy array so we can slice columns

train_x = list(features[:, 0][:-testing_size])
train_y = list(features[:, 1][:-testing_size])

test_x = list(features[:, 0][-testing_size:])
test_y = list(features[:, 1][-testing_size:])

train_x and test_x are the features, and train_y and test_y are the labels. We will use the pickle module for permanent storage of these values so that they can be reused later for training our neural network.
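A sketch of that pickle step; the file name is an assumption:

import pickle

# Persist the split data so the training script can load it later
with open('sentiment_set.pickle', 'wb') as f:
    pickle.dump([train_x, train_y, test_x, test_y], f)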

In this post we downloaded our own data, cleaned it to our requirements, and divided the cleaned data into training data and testing data.

In the next post we will use this data to train our model, test it to find its accuracy, and also run the model against our own inputs! Awesome right? I'm excited too…

See you in the next one!

link to complete source code :  https://github.com/makaravind/SentimentAnalyzer-54

next post : next

 

Beginning ML – First Neural Net : KnowDev

We have gone through some of the important topics in TensorFlow, and believe me there are a ton of others! But no worries, we'll catch up! I have always believed in project-based learning, so we'll do the same this time as well. We shall build a feed-forward deep neural network that can classify handwritten digits. Sounds interesting, right? Let's get into it.

Open up any text editor or IDE. I personally prefer coding in the PyCharm IDE; it's a wonderful piece of software for writing your Python scripts.

So, what is the most critical part of any neural network? Data! Right! Neural networks shine when there is lots and lots of data to train them. We will use the MNIST dataset provided in the tensorflow.org tutorials. We can get the data by simply importing it and loading it into a Python variable.

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

In MNIST, every data point has 2 parts: 1) an image and 2) a label. Every image is 28 x 28 pixels.

Let’s start building our graph by creating a placeholder variable.

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32) # for labels

x isn't a specific value; we'll supply one when we ask TensorFlow to run a computation. Here x represents a MNIST image flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating point numbers.

Now we need variables for the weights and biases. We represent these with TensorFlow Variables, since they can be modified by operations in the computation.

W = tf.Variable(tf.random_normal([784, 10]))
b = tf.Variable(tf.random_normal([10]))

Notice that W and b are initialized with random values, but the exact initial values don't matter much. W is a tensor of shape [784, 10] because we want to generate a classification over 10 classes, i.e. the digits 0, 1, 2, ..., 9. As we are building a deep net, we need one or more hidden layers.

hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1])),
                  'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}

n_nodes_hl1 is declared as 500; it is the number of nodes in the single hidden layer. We can tweak this number to see how the accuracy changes. Moving on..

Now we can complete the neural network model by implementing the layers.

l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)  # activation function

There are a few more steps before we start training our model: actually defining the classification cost and employing an optimizer for back propagation. We are using the softmax_cross_entropy_with_logits function.

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
# comparing the difference between the prediction and the original labels

optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)

There are many kinds of optimizers available in TensorFlow, each suited to particular use cases. The 0.5 is the learning rate of the neural network.

Let's train our neural network. Each cycle of feed forward and back propagation over the data is called an epoch. I have tried the number of epochs as both 5 and 10. We have to start a session and initialize all variables. We run both the optimizer and the cost for the chosen number of epochs; this is our training step.

_, c = sess.run( [optimizer, cost], feed_dict = { x:epoch_x, y:epoch_y } )

We feed the values of x and y in batches. c holds the cost of each batch; summing it over the batches gives the loss for the epoch.
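Put together, the training loop might look like the following sketch, assuming a batch size of 100 and 10 epochs:

hm_epochs = 10
batch_size = 100  # assumed batch size

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(hm_epochs):
        epoch_loss = 0
        # One epoch sweeps over the whole training set once, in batches
        for _ in range(int(mnist.train.num_examples / batch_size)):
            epoch_x, epoch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
            epoch_loss += c
        print('Epoch', epoch + 1, 'completed out of', hm_epochs, 'loss:', epoch_loss)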

It's time to test the neural network and check the accuracy of our model.

correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

print('Accuracy:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

The accuracy I could achieve was 97.999.

This is an implementation of a simple deep neural network. The code is inspired by pythonprogramming.net and tensorflow.org.

complete source code : https://github.com/makaravind/ImageClassifier

We shall use this model for a couple more use cases before we move on to another model. I'll catch you in the next post.