Sentiment analysis example using Spark, Scala and Stanford NLP library
Introduction
Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data (user opinion as written text) is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs or trends.
Stanford CoreNLP is a Java library capable of text processing and annotation. CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations.
Adding CoreNLP is easy: just add this line to your build.sbt file (under libraryDependencies):
"edu.stanford.nlp" % "stanford-corenlp" % "4.5.1"
The Stanford CoreNLP library needs a trained language model in order to work. You can download that model (as a jar file) here:
You can put the language model jar in the lib folder of your project (e.g. lib/stanford-english-corenlp-2018-02-27-models.jar). That file is kind of big (1.04 GB).
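Alternatively, you can let sbt fetch the models for you. Recent CoreNLP releases publish their models to Maven Central under a separate "models" classifier (I'm assuming this holds for the version you picked; check the artifact listing if in doubt), so instead of dropping the jar into lib you can add this to build.sbt:
// Let sbt pull the models jar (assumes the "models" classifier is published for this version)
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "4.5.1" classifier "models"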
We need some kind of text sample to analyze. In this example I'm using the Stanford sentiment test data. It can be downloaded here:
Download that file, unzip it, and rename the CSV file to stanfordSentimentData.csv. I put my copy in the /home/user/Downloads folder.
Apache Spark & Scala code
We're going to need a couple of imports for this job to run:
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import edu.stanford.nlp.pipeline.{ Annotation, StanfordCoreNLP }
I'm going to assume you know how to create a Spark session; there are many examples available online. Use this code to set up the Spark session and context:
val session: SparkSession = createSparkSession("sentimentAnalysis")
val context = session.sparkContext
context.setLogLevel("ERROR") // mute Spark logging
val sqlContext = session.sqlContext
import sqlContext.implicits._
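If you don't already have a createSparkSession helper like the one used above, a minimal sketch could look like this (assuming a local master for experimentation; on a real cluster you'd supply the master through spark-submit instead):
def createSparkSession(name: String): SparkSession =
  SparkSession.builder()
    .appName(name)
    .master("local[*]") // run locally on all cores; drop this when submitting to a cluster
    .getOrCreate()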
Now we're going to read input (text) data. It's a good idea to print a couple of metrics when working with big data, just to make sure things are kinda sane.
val input = session.read.option("header", false)
  .csv("file:///home/user/Downloads/stanfordSentimentData.csv")
  .persist(StorageLevel.MEMORY_AND_DISK)
input.printSchema()
input.show(10, false)
println("row count: " + input.count())
The Stanford test dataset has 6 columns:
- polarity (sentiment score, from 0 to 4, five possible values); index = 0
- record id; index = 1
- timestamp; index = 2
- query topic; index = 3
- user id; index = 4
- text; index = 5
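Since the job below addresses columns by index, it can help readability to give them names up front. This is optional and not part of the original job; a quick sketch using the indices listed above:
// Optional: name the columns to match the list above
val named = input.toDF("polarity", "id", "timestamp", "query", "user", "text")
named.select("polarity", "text").show(5, false)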
If you're curious about this line:
.persist(StorageLevel.MEMORY_AND_DISK)
We're caching data to avoid repeating Spark transformations. We don't really have any transformations here, we're just loading data, but we do have two consecutive actions called on the same dataset:
input.show(10, false)
...
input.count() // inside println
It's a good habit to cache your data if you're going to reuse it.
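The flip side of that habit: at the end of the job, once the cached dataset is no longer needed, you can release the storage explicitly:
// Release the cached blocks when the data is no longer needed
input.unpersist()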
Now we can create constants that CoreNLP needs:
val props = new Properties()
props.put("annotators", "tokenize, ssplit, parse, sentiment")
val sentimentScore = Array("Very Negative", "Negative", "Neutral", "Positive", "Very Positive")
val bSentimentScore = context.broadcast(sentimentScore)
The props variable contains the annotators we want to run on each text block (tokenize, split into sentences, parse grammar and compute a sentiment score). See this page for a full list of available annotators:
The sentimentScore variable contains the textual representations of the sentiment score (on the same 0 to 4 scale as the polarity field mentioned above; Very Negative = 0, Negative = 1, etc.). We're broadcasting that variable to all executors in our Spark cluster, so each executor keeps a single read-only copy instead of receiving one per task.
We're now ready to compute sentiment score for our input data:
val scoredRecords: Dataset[(String, String)] = input.mapPartitions { rows =>
  val pipeline = new StanfordCoreNLP(props)
  rows.flatMap { row =>
    if (row.isNullAt(5)) Nil else {
      val text = row.getString(5)
      val annotation = new Annotation(text)
      pipeline.annotate(annotation)
      val sentenceList = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala.toList
      sentenceList.map { sentence =>
        val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
        val score = RNNCoreAnnotations.getPredictedClass(tree)
        bSentimentScore.value(score) -> text
      }
    }
  }
}
scoredRecords.show(10, false)
We're using a couple of interesting functions:
- mapPartitions - a good function to know when you need to initialize an expensive context of some kind (a database connection, a service client or a text processing pipeline backed by a huge language model); Spark creates that expensive resource only once per partition instead of once per row in the dataset, which is what would happen if we used the map function here (see the sketch after this list)
- pipeline (new StanfordCoreNLP(props)) - this is the workhorse: an NLP pipeline that performs all the annotations we requested on a block of text; note that the pipeline works through side effects: we create an annotation object (new Annotation(text)) and the pipeline sets all computed annotations on that object (mutable state)
- getPredictedClass - after running all annotations we get the list of sentences contained in each block of text; for each sentence we get a sentiment-annotated tree (SentimentAnnotatedTree); for each tree we call RNNCoreAnnotations.getPredictedClass, which reads the classifier's prediction off the tree and returns the category index (from 0 to 4)
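To make the mapPartitions point concrete, here is a small self-contained sketch; the resource and all names in it are illustrative, not part of CoreNLP:
// Hypothetical expensive resource: in local mode, watch how often the init line prints
def buildExpensiveResource(): String => Int = {
  println("expensive init") // appears once per partition with mapPartitions
  text => text.length
}

val lengths = input.mapPartitions { rows =>
  val f = buildExpensiveResource() // one init per partition...
  rows.flatMap(r => if (r.isNullAt(5)) None else Some(f(r.getString(5))))
} // ...instead of one init per row, which is what map would give us
lengths.show(5)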
The final output of this computation is a dataset containing pairs of Strings, where the first string is the textual representation of the sentiment score and the second string is the input text (included for comparison).
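If you'd rather see a summary than individual rows, one option (not part of the original job) is to count rows per label:
// Distribution of predicted labels across all sentences
scoredRecords.toDF("sentiment", "text")
  .groupBy("sentiment")
  .count()
  .orderBy(desc("count"))
  .show()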
Job output
Let's have a look at what is being printed to stdout:
Negative | Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
Positive | Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
Neutral | Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
Negative | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
Neutral | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
Negative | I dived many times for the ball. Managed to save 50% The rest go out of bounds
Neutral | I dived many times for the ball. Managed to save 50% The rest go out of bounds
Neutral | my whole body feels itchy and like its on fire
Negative | no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
Neutral | no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
We can see that a single text block can receive multiple sentiment scores. That happens because one text block can be split into multiple sentences, and each sentence becomes a separate sentiment annotation input.
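If you'd prefer a single score per text block, one crude option (a sketch, not what the original job does; averaging class indices is a heuristic, not a principled aggregate) is to average the per-sentence class indices and map the rounded mean back to a label:
val perText = input.mapPartitions { rows =>
  val pipeline = new StanfordCoreNLP(props)
  rows.flatMap { row =>
    if (row.isNullAt(5)) Nil else {
      val text = row.getString(5)
      val annotation = new Annotation(text)
      pipeline.annotate(annotation)
      val scores = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
        .map(s => RNNCoreAnnotations.getPredictedClass(
          s.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])))
      if (scores.isEmpty) Nil
      else List(bSentimentScore.value(math.round(scores.sum.toFloat / scores.size)) -> text)
    }
  }
}
perText.show(10, false)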
We can also see that sentence detection doesn't always work correctly, and that the sentiment assessment doesn't always match the content.
See this article on Wikipedia for more information on sentiment analysis and some of the challenges it presents:
Conclusion
This is just an example, but it highlights real issues with sentiment analysis. As in other areas of machine learning, having good training data and pre-processing it correctly is half the success. NLP is a tough problem in general, as it deals with abstract thought. We're entering the realm of artificial intelligence, and that is not an easy problem to solve. Still, I hope this example is useful to you and that you can learn from it. Happy ML coding!