In the last post, we used pandas to extract raw data from .csv files and used a bag-of-words model to preprocess our data into feature sets.
In this post, we will train the model. It is the simplest part. We will use a random forest to make the predictions. A random forest is a collection of decision trees that combines their individual predictions.
First, we initialize the forest with 100 decision trees.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
We will use the fit function of the forest variable to build a forest of trees from the training set.
forest = forest.fit(train_cleaned_data, train['sentiment'])
train_cleaned_data is the preprocessed data from our last post. train['sentiment'] holds the labels, one for each row of train_cleaned_data. And with that, we are done training our model.
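To make the training step concrete, here is a minimal sketch on a tiny made-up dataset. The toy reviews, the toy labels, and the names prefixed with toy_ are all hypothetical, and it assumes the bag-of-words features come from scikit-learn's CountVectorizer; the real data from the last post would take their place.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews and labels standing in for the real training set.
toy_reviews = [
    "great movie loved it",
    "terrible plot awful acting",
    "loved the acting",
    "awful boring movie",
]
toy_labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# Bag-of-words features (assumed to mirror the preprocessing of the last post).
toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_reviews)

# random_state is fixed only so the sketch is repeatable.
toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest = toy_forest.fit(toy_features, toy_labels)

# The fitted forest keeps one decision tree per estimator.
print(len(toy_forest.estimators_))
```

After fit returns, the individual trees are available in estimators_, which is a quick way to check that the forest really was built with the number of trees we asked for.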
Now, we can test and predict using our model.
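To test, we first have to transform the raw test data into the required format. We call only transform on the test set, not fit_transform, so that the vocabulary learned from the training data is reused. Fitting the vectorizer again on the test data would produce features that no longer line up with the ones the forest was trained on, and it would let information from the test set leak into the model, which is a form of over-fitting.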
test_data_features = vectorizer.transform(clean_test_reviews)
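The difference between fitting and transforming is easier to see on a small example. This is only a sketch, assuming the vectorizer is scikit-learn's CountVectorizer; the documents and the names demo_vectorizer, train_docs and test_docs are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, not the real review data.
train_docs = ["good movie", "bad movie"]
test_docs = ["good film", "bad bad movie"]

demo_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents.
train_matrix = demo_vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are ignored.
test_matrix = demo_vectorizer.transform(test_docs)

print(sorted(demo_vectorizer.vocabulary_))    # words learned from training
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.toarray())
```

The word "film" appears only in the test documents, so it gets no column of its own. Both matrices end up with the same columns, which is what the forest needs in order to make predictions.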
Then we simply predict using the predict function of the forest variable.
result = forest.predict(test_data_features)
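Predicting and then storing the predictions could look like the sketch below. Everything here is a stand-in: the toy data, the "id" column, and the file name demo_predictions.csv are assumptions made for illustration, not the layout of the actual dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test data.
demo_train_reviews = ["loved it", "hated it", "really loved it", "really hated it"]
demo_train_labels = [1, 0, 1, 0]
demo_test_ids = ["review_1", "review_2"]
demo_test_reviews = ["loved every minute", "hated every minute"]

demo_vectorizer = CountVectorizer()
demo_train_features = demo_vectorizer.fit_transform(demo_train_reviews)
demo_test_features = demo_vectorizer.transform(demo_test_reviews)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(demo_train_features, demo_train_labels)

# One predicted label per test review.
demo_result = demo_forest.predict(demo_test_features)

# Pair each prediction with its (assumed) id and write it to a CSV file.
output = pd.DataFrame({"id": demo_test_ids, "sentiment": demo_result})
output_path = "demo_predictions.csv"  # hypothetical file name
output.to_csv(output_path, index=False)
```

Passing index=False keeps pandas from writing its own row numbers as an extra column, so the file contains only the id and the predicted sentiment.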
We finish off our testing by writing all the predictions to a file for permanent storage. And that's it: we have used a new model and a new technique to build a sentiment analyzer. This model is not good enough for commercial use, for two reasons: we did not use a large dataset, and we did not use a more sophisticated model. In upcoming posts we will see what those "sophisticated" techniques and models are. I'm sure those concepts will be much more interesting. With that, I'll see you soon!
Complete source code here