In the last post we implemented our first neural network, which can classify a set of images. In fact, that experiment can be considered a HELLO WORLD program. There is a lot more we have to consider while implementing our model. Mainly data! A lot of the time the data is very raw, and we have to perform some kind of preprocessing so that it is in a format that TensorFlow objects can accept. A sentiment analyzer is a program that can tell whether a given sentence is positive or negative. We will be using the same neural network model that we built in the last post. For better understanding, the whole process of building the sentiment analyzer is divided into parts.
In this post, we shall look at how to get the raw data and convert it into the required format. Both the positive dataset and the negative dataset are available at the GIT link. First we'll download the datasets into our directory. Each dataset contains 5000 sentences. Yes! The data we have is not really enough for practical purposes.
Once we have our datasets ready in our directory, we import tensorflow (duh!) and start creating feature sets from the data.
First of all we have to create our vocabulary of words. The model we will use is bag of words. We will call this collection of words the lexicon. We will be using the nltk library to extract the words which are most relevant. The technique we are using is stemming and lemmatizing.
lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
The lexicon on the right-hand side just contains all the words from pos.txt and neg.txt (our datasets); the lexicon on the left-hand side is its lemmatized version. In fact we can employ other techniques, such as removing stop words (like the, an, a..) which have no particular effect on the sentiment of the sentence. We are kind of removing those words already by keeping only words whose frequency is below 1000 (and above 50, which also drops very rare words).
for w in w_counts:
    if 1000 > w_counts[w] > 50:
        l2.append(w)  # l2 - final lexicon list
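The frequency filter can be tried on its own with a minimal sketch like the one below. It assumes the words have already been tokenized and lemmatized, and the names `build_lexicon`, `low` and `high` are made up for this illustration (they are not from the original source code). The toy call uses tiny thresholds only so that the effect is visible on a handful of words.

```python
from collections import Counter


def build_lexicon(words, low=50, high=1000):
    """Keep words whose count lies strictly between low and high.

    Hypothetical helper: mirrors the `1000 > w_counts[w] > 50` check,
    with the two bounds exposed as parameters.
    """
    w_counts = Counter(words)
    return [w for w in w_counts if high > w_counts[w] > low]


# Toy data: "the" is too frequent, "zebra" is too rare.
words = ["the"] * 6 + ["good"] * 3 + ["bad"] * 3 + ["zebra"]
small_lexicon = build_lexicon(words, low=1, high=5)
print(small_lexicon)
```

The upper bound is what throws out stop-word-like terms, while the lower bound drops words that are too rare to carry a signal.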
Now that we have created our vocabulary we can create our features. Here our lexicon size is 423. A tensor accepts an object of floats, but the sentences we have are strings. Hence we have to use the lexicon we created earlier to turn each sentence into a vector which contains the frequency of each lexicon word in that sentence.
For example, lexicon = ['dog', 'cat', 'eat', 'fight', 'food'] and the given sentence is "dog fights with cat for food". After lemmatizing, 'fights' matches 'fight', so the feature set is [1, 1, 0, 1, 1].
We create a list of [features, classification] pairs. Positive is denoted as [1, 0] and negative as [0, 1].
features = list(features)
featureset.append([features, classification])
Finally we'll create our collection of feature sets from both the positive and the negative data. The list is shuffled so that positive and negative samples are mixed, which helps the neural network converge.
features += sample_handling('pos.txt', lexicon, [1, 0])
features += sample_handling('neg.txt', lexicon, [0, 1])
random.shuffle(features)
Now the whole set is divided into training data and testing data.
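# [:, 0] slicing needs a NumPy array, not a plain list (assumes: import numpy as np)
features = np.array(features, dtype=object)
# testing_size: number of samples held out for testing
train_x = list(features[:, 0][:-testing_size])
train_y = list(features[:, 1][:-testing_size])
test_x = list(features[:, 0][-testing_size:])
test_y = list(features[:, 1][-testing_size:])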
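If you would rather not depend on NumPy slicing, the same shuffle and split can be sketched with plain Python lists. This is only an illustration under assumptions: `split_featureset`, `test_fraction` and `seed` are hypothetical names, and the 10% default is an arbitrary choice, not a value taken from the original code.

```python
import random


def split_featureset(featureset, test_fraction=0.1, seed=None):
    """Shuffle [features, classification] pairs and split them into train/test."""
    data = list(featureset)
    random.Random(seed).shuffle(data)
    testing_size = int(test_fraction * len(data))
    split = len(data) - testing_size  # avoids the data[:-0] pitfall when testing_size is 0
    train, test = data[:split], data[split:]
    train_x = [f for f, _ in train]
    train_y = [c for _, c in train]
    test_x = [f for f, _ in test]
    test_y = [c for _, c in test]
    return train_x, train_y, test_x, test_y


# Toy featureset: ten samples with alternating labels.
featureset = [[[i, i + 1], [1, 0] if i % 2 == 0 else [0, 1]] for i in range(10)]
train_x, train_y, test_x, test_y = split_featureset(featureset, test_fraction=0.2, seed=0)
print(len(train_x), len(test_x))
```

Passing a fixed `seed` makes the shuffle repeatable, which is handy when comparing runs.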
train_x and test_x are the features, and train_y and test_y are the labels. We will use the pickle module to store these values permanently so that they can be used later for training our neural network.
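A minimal sketch of the pickle step could look like this. The helper names `save_dataset` and `load_dataset` and the file name `sentiment_set.pickle` are assumptions for the example; the tiny lists stand in for the real feature sets.

```python
import os
import pickle
import tempfile


def save_dataset(path, train_x, train_y, test_x, test_y):
    # Store all four lists in a single pickle file.
    with open(path, "wb") as f:
        pickle.dump([train_x, train_y, test_x, test_y], f)


def load_dataset(path):
    # Returns [train_x, train_y, test_x, test_y] in the order they were saved.
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical file name, written to the temp directory for this demo.
path = os.path.join(tempfile.gettempdir(), "sentiment_set.pickle")
save_dataset(path, [[1, 0, 1]], [[1, 0]], [[0, 1, 0]], [[0, 1]])
train_x, train_y, test_x, test_y = load_dataset(path)
print(train_x, train_y)
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.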
In this post we downloaded our own data, cleaned it to our requirements, and divided the cleaned data into training data and testing data.
In the next post we will use this data to train our model, test it to find its accuracy, and also run the model against our own inputs! Awesome, right? I'm excited too…
See you in the next one!
Link to the complete source code: https://github.com/makaravind/SentimentAnalyzer-54