Till the last post we have seen methods of building a sentiment analyzer using multi-layer feedforward neural network. In fact in this post also we will build sentiment analyzer which can predict positiveness or negativeness of a movie review, We can consider this as one of the user case of what we learned so far.
This particular concept is divided into 2 parts. One, Pre processing our data. Two, Using random forest technique to predict.
We will use pandas module to extract data from a csv file. As we did before we will use bag of words model to create feature sets. But before we have to clear little dirt like html tags (using beautifulsoup module), removing punctuations, and removing stopwords . StopWords are the words like the, and, an, is etc which do not add any specific emotion to the sentence. We are removing punctuations as well to just remove the complexity, once we get quite familiar with what we are doing we add more complexities to our model. We will implement all this functionality in function clean_text.
Now we have to apply these modifications to all the reviews in our file. We call that function as create_clean_train. This function might take couple of minutes because there are almost 25000 reviews all together.
We will create feature sets using CountVectorizer from scikit learn.
In next, we will complete building our movie review sentiment analyser. See you next!
Complete source code: here