ABSA: basic tutorial

In the following, we show some common aggregation operations on the output of aspect-based sentiment analysis (ABSA). Specifically, we perform an in-depth sentiment analysis of product reviews, finding out which product features users care about and what they think of them.

1. Data structure and preliminaries

Our dataset contains product reviews of the BMW X3 series collected from auto.qq.com, the automotive section of qq.com. You can download the data from this link.

>>> import json
>>> from pprint import pprint
>>> from collections import Counter, defaultdict
>>> import itertools
>>> import operator

>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> import requests
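
If you prefer to fetch the dataset programmatically rather than via the link above, here is a minimal sketch (DATA_URL is a hypothetical placeholder for that link):

>>> DATA_URL = '<URL of the data package>'  # hypothetical placeholder; use the download link above
>>> with open('bmw_x3_reviews_autoqq.json', 'wb') as fp:
>>>     fp.write(requests.get(DATA_URL).content)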

We now use the built-in json module to load the dataset and inspect the structure of the datapoints:

>>> with open('bmw_x3_reviews_autoqq.json') as fp:
>>>     raw_data = json.load(fp)

>>> len(raw_data)
1045

>>> pprint(raw_data[0])
{'author': '三三的爹',
 'date': '2012-10-23',
 'text': '优点:2.0T,10月16日提车,比预计提前2个多月,杭州骏宝行守信,赞一个! '
     '虽中档车,却有驾豪车感觉,操控性强,转弯不减油无失向、被甩感;看上去较省油,就冲低油耗才选这款的;棕色很大气,稀饭。 '
     '缺点:隧道不开灯看不清仪表读数;雨天倒车镜无风吹功能,安全有忧;导航无手写;CD/DVD只有单碟,且行进时屏蔽图像;后厢空间比想象的小,放不了多少东西 '
     '综述:瑕不掩瑜,这个价位值得!原驾一辆05年的本田CRV。 ',
 'title': ' 宝马 宝马X3'}

The body of the review is contained in the ‘text’ field; in this example it is in Chinese and lists the car’s pros (e.g. handling, low fuel consumption, the brown color), its cons (e.g. instrument lighting in tunnels, the single-disc CD/DVD player, trunk space) and a short summary. We collect all review texts into a single list that we can pass to the API for analysis:

>>> text_data = [item['text'] for item in raw_data]
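
A quick check confirms that we have one text per review:

>>> len(text_data)
1045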

2. Linguistic analysis

To linguistically analyze the data, we define the API URL, provide our user credentials, call the analysis and save the output to a JSON file. Please note that this step can be skipped, since the analysis output is already included in your downloaded data package; see the loading snippet at the end of this section.

>>> response = requests.post('https://api.anacode.de/analyze/',
>>>                          json={'texts': text_data, 'analyses': ['absa']},
>>>                          headers={'Authorization': 'Token <your token>'})

>>> analyzed_data = response.json()['absa']

>>> print('Analyzed {} data items.'.format(len(analyzed_data)))

>>> OUTPUT_FILE = "bmw_x3_reviews_autoqq_analysis.json"
>>> with open(OUTPUT_FILE, "w") as out_stream:
>>>     json.dump(analyzed_data, out_stream)

>>> print('Saved data to {}.'.format(OUTPUT_FILE))
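
If you skipped the API call, you can instead load the precomputed analysis that ships with the data package (assuming its file name matches OUTPUT_FILE above):

>>> with open('bmw_x3_reviews_autoqq_analysis.json') as fp:
>>>     analyzed_data = json.load(fp)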

3. Aggregation and plotting

3.1 Frequent features

In this section, we show how to find the most frequent product features mentioned in the dataset.

In the absa output, features are contained under the “semantics” attribute of “entities”. All feature types start with the prefix “feature_”. We use this as our condition to filter the relevant feature values as we loop through the analysis result and increment their frequency. Once the frequency counter is built, we print out the 20 most frequent items.
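
To get a feel for this structure before aggregating, you can inspect the entities of the first analyzed review (output omitted here; the exact fields depend on the analysis output you received):

>>> pprint(analyzed_data[0]['entities'])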

>>> feature_freqs = Counter()

>>> for item in analyzed_data:
>>>     for entity in item['entities']:
>>>         semantics = [s['value'] for s in entity['semantics']
>>>                      if s['type'].startswith('feature_')]
>>>         feature_freqs.update(semantics)

>>> frequent_features = feature_freqs.most_common(20)

>>> for label, count in frequent_features:
>>>     print('{:<25} {}'.format(label, count))
Power                     86
Control                   81
FuelConsumption           78
VisualAppearance          64
Interior                  61
SteeringWheel             56
Space                     56
Engine                    53
Price                     42
Seats                     39
Configuration             39
AcceleratorPedal          38
Body                      27
OperationQuality          27
Speed                     25
Design                    24
AverageFuelConsumption    23
BackSeat                  23
Chassis                   21
HeadLight                 19

We now plot the result:

>>> labels, counts = zip(*reversed(frequent_features))
>>> fig = plt.figure()
>>> plt.barh(list(range(len(labels))), counts)
>>> plt.title('Frequently Mentioned Features', fontsize=14)
>>> plt.xlabel('Count')
>>> plt.ylabel('Feature')
>>> axes = plt.gca()
>>> axes.set_yticks(list(range(len(labels))))
>>> text_labels = axes.set_yticklabels(labels)
>>> for label in text_labels:
>>>     label.set_verticalalignment('bottom')  # anchor each label's bottom edge at its tick
[Image: bmw3_frequent_features.png, horizontal bar chart of the 20 most frequently mentioned features]

If we are interested in specific feature types, we only need to further restrict the filtering condition. For instance, the following code picks up only components, i.e. semantics whose type is “feature_component”:

>>> component_freqs = Counter()

>>> for item in analyzed_data:
>>>     for entity in item['entities']:
>>>         components = [s['value'] for s in entity['semantics']
>>>                       if s['type'] == 'feature_component']
>>>         component_freqs.update(components)

>>> frequent_components = component_freqs.most_common(20)

>>> labels, counts = zip(*reversed(frequent_components))

>>> fig = plt.figure()
>>> plt.barh(list(range(len(labels))), counts)
>>> plt.title('Frequently Mentioned Components', fontsize=16)
>>> plt.xlabel('Count')
>>> plt.ylabel('Component Name')
>>> axes = plt.gca()
>>> axes.set_yticks(list(range(len(labels))))
>>> text_labels = axes.set_yticklabels(labels)
>>> for label in text_labels:
>>>     label.set_verticalalignment('bottom')
[Image: bmw3_frequent_components.png, horizontal bar chart of the 20 most frequently mentioned components]

3.2 Feature sentiments and polarities

In this section, we show how to extract and aggregate sentiment values for specific entities. We retrieve all sentiment values per feature and store the results in a map whose keys are feature labels and whose values are the lists of individual ratings for that feature.

>>> feature_evals = defaultdict(list)

>>> for item in analyzed_data:
>>>     relations = item["relations"]
>>>     for relation in relations:
>>>         relation_semantics = relation["semantics"]
>>>         for entity in relation_semantics["entity"]:
>>>             if entity["type"].startswith("feature_"):
>>>                 feature_evals[entity["value"]].append(relation_semantics["sentiment_value"])

Let’s check the evaluations for the X3 engine:

>>> print(feature_evals["Engine"])
[1.0, 0.5, -0.5, 0.875, -0.5, 0.875, 0.875, 0.625, -0.6875, 0.875,
0.5, 0.6875, 0.875, 0.875, 0.875, 0.875, 0.5, 0.5, 0.5, 0.5]

We now calculate the average sentiment value for each feature. To control the reliability of the result, we can filter out features with ‘too few’ evaluations; in the following, we drop those for which we found only one evaluation:

>>> avg_feature_evals = {feature: sum(evals) / len(evals)
>>>                      for feature, evals in feature_evals.items()
>>>                      if len(evals) > 1}
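
As a quick sanity check, the engine’s average can be computed by hand from the 20 ratings listed above: they sum to 10.625, so the mean is 10.625 / 20 ≈ 0.53.

>>> print('{:.2f}'.format(avg_feature_evals['Engine']))
0.53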

Now, we can inspect the average sentiment values and print out the features with the most positive or negative evaluations. We also print out the number of evaluations we found for each feature.

>>> positive_features = sorted(avg_feature_evals.items(),
>>>                            key=operator.itemgetter(1),
>>>                            reverse=True)[:20]
>>> template = '{:<25} {:5.2f}    {}'
>>> for label, sentiment in positive_features:
>>>     eval_count = len(feature_evals[label])
>>>     print(template.format(label, sentiment, eval_count))
LightEmittingDiode         0.88    2
Sunroof                    0.88    3
Lighting                   0.88    2
Transmission               0.88    4
RunningInPeriod            0.88    2
DaytimeRunningLamp         0.88    2
Workmanship                0.88    2
BeigeColor                 0.84    2
HeadLight                  0.83    8
Sensitivity                0.82    5
Tires                      0.81    2
HandleFeeling              0.78    8
RateOfEyeCatching          0.78    2
PanoramicSunroof           0.76    5
RearLight                  0.75    2
Grip                       0.75    2
SportMode                  0.75    2
Suspension                 0.71    8
CruiseControl              0.71    3
Acceleration               0.70    7

>>> negative_features = sorted(avg_feature_evals.items(),
>>>                            key=operator.itemgetter(1))[:20]
>>> for label, sentiment in negative_features:
>>>     eval_count = len(feature_evals[label])
>>>     print(template.format(label, sentiment, eval_count))
TrunkSpace                -0.53    4
Weight                    -0.32    25
PriceQualityRatio         -0.27    14
ShockAbsorbers            -0.27    3
Firmness                  -0.19    2
Keys                      -0.17    3
Concordance               -0.17    3
FrontSeat                 -0.06    3
AcceleratorPedal          -0.06    9
MaterialProcessing        -0.05    7
Frontage                  -0.03    4
Screen                     0.00    4
Bumpers                    0.00    2
Differential               0.00    2
FrontEnd                   0.00    2
Flavor                     0.03    4
Exquisiteness              0.04    12
Design                     0.04    3
Luxuriousness              0.05    11
ShockAbsorption            0.07    5

Finally, we plot the result:

>>> top_good = positive_features[:10]   # ten most positive features, most positive first
>>> top_bad = negative_features[:10]    # ten most negative features, most negative first

>>> labels_good, sentiments_good = zip(*reversed(top_good))  # most positive ends up at the top
>>> labels_bad, sentiments_bad = zip(*top_bad)               # most negative stays at the bottom
>>> positions = list(range(20))

>>> fig = plt.figure()
>>> plt.barh(positions[-10:], sentiments_good)
>>> plt.barh(positions[:10], sentiments_bad, color='indianred')
>>> plt.title('Average Feature Sentiment', fontsize=16)
>>> plt.xlabel('Average Sentiment')
>>> plt.ylabel('Feature Name')
>>> axes = plt.gca()
>>> axes.set_xlim([-1.0, 1.0])
>>> axes.set_yticks(positions)
>>> text_labels = axes.set_yticklabels(labels_bad + labels_good)
>>> for label in text_labels:
>>>     label.set_verticalalignment('bottom')
[Image: output_14_0.png, horizontal bar chart of average sentiment for the 10 most positive and 10 most negative features]

3.3 Feature associations

In this section, we show how to find out which features frequently co-occur in the data. This information gives us a fine-grained picture of the relevant features for a given product or user group: for example, when talking about the car seats, are users more focused on the front or the back seats? And which aspects are relevant, such as the look, comfort or material of the seats?

To find the major feature associations, we first count the co-occurrences of features over the whole data collection. We do this by building all possible feature pairs within each entity of the analysis output:

>>> feature_asso_freqdict = Counter()

>>> for item in analyzed_data:
>>>     for entity in item['entities']:
>>>         features = [se['value'] for se in entity['semantics']
>>>                     if se['type'].startswith('feature_')]
>>>         for f1, f2 in itertools.combinations(features, 2):
>>>             f1, f2 = sorted([f1, f2])
>>>             key_str = '{} / {}'.format(f1, f2)
>>>             feature_asso_freqdict[key_str] += 1

>>> for item in feature_asso_freqdict.most_common(20):
>>>     print("%s: %s" % (item[0], item[1]))
Interior / VisualAppearance: 4
BeigeColor / Interior: 3
Temperament / VisualAppearance: 3
Control / OperationQuality: 3
Control / Power: 3
CarLight / Temperament: 2
CarLight / LightEmittingDiode: 2
Control / Steadiness: 2
HandleFeeling / SteeringWheel: 2
VisualAppearance / WheelHub: 2
Body / WhiteColor: 2
Height / Seats: 2
Size / Trunk: 2
CarLight / VisualAppearance: 2
AirOuttake / BackSeat: 2
Frontage / VisualAppearance: 2
Design / GearLevers: 2
DrivingMirrors / Size: 2
BodyShell / Lateral: 2
Comfort / Seats: 2
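
Because each key is built from the alphabetically sorted feature pair, a specific association can be looked up directly:

>>> feature_asso_freqdict['Interior / VisualAppearance']
4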

Now, let’s assume we want to “zoom in” on a specific feature, for example the seats, and see which aspects are associated with it. To cover related features such as FrontSeat, BackSeat etc., we select all entities with a semantic value containing the string “Seat”. As in the examples above, we collect all co-occurring features into a frequency map and print the top 10 items with their frequencies.

>>> seats_asso_freqdict = Counter()

>>> for item in analyzed_data:
>>>     for entity in item['entities']:
>>>         seat_related = any('Seat' in s['value'] for s in entity['semantics'])
>>>         if not seat_related:
>>>             continue
>>>         semantics = [s['value'] for s in entity['semantics']
>>>                      if s['type'].startswith('feature_')]
>>>         for f in semantics:
>>>             seats_asso_freqdict[f] += 1
>>> # remove the generic 'Seats' feature itself from the association counts
>>> seats_asso_freqdict.pop('Seats')

>>> seats_most_asso = seats_asso_freqdict.most_common(10)
>>> for item in seats_most_asso:
>>>     print("%s: %s" % (item[0], item[1]))
BackSeat: 26
BackSeatSpace: 17
FrontSeat: 14
DrivingSeat: 5
CoDriversSeat: 3
FrontSeatSpace: 3
Comfort: 3
Height: 2
AirOuttake: 2
Visibility: 1

Finally, we plot the result:

>>> labels, counts = zip(*reversed(seats_most_asso))

>>> fig = plt.figure()
>>> plt.barh(list(range(len(labels))), counts)
>>> plt.title('Features associated with seats', fontsize=16)
>>> plt.xlabel('Co-occurrence Count')
>>> plt.ylabel('Feature')
>>> axes = plt.gca()
>>> axes.set_yticks(list(range(len(labels))))
>>> text_labels = axes.set_yticklabels(labels)
>>> for label in text_labels:
>>>     label.set_verticalalignment('bottom')
[Image: output_9_0.png, horizontal bar chart of the features most frequently associated with seats]