ABSA: a comparative use case

In this tutorial, we analyze product reviews for two comparable car series - the BMW X3 and X5 - and compare them in terms of the features being discussed, as well as the sentiments expressed toward these features.

1. Data structure and preliminaries

Our dataset contains product reviews for BMW’s series X3 and X5 collected from auto.qq.com, the automotive section of qq.com. You can download the data here:

We make the necessary imports:

>>> import json
>>> from pprint import pprint
>>> from operator import itemgetter
>>> from collections import defaultdict, Counter

>>> import requests
>>> import numpy as np
>>> import seaborn as sns
>>> import matplotlib.pyplot as plt

We also define a helper function features to keep the rest of the code concise:

>>> def features(entity):
>>>     """
>>>     Returns list of features for one entity from
>>>     'entities' part of absa call
>>>     """
>>>     result = []
>>>     for se in entity['semantics']:
>>>         if not se['type'].startswith('feature_'):
>>>             continue
>>>         result.append(se['value'])
>>>     return result
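To make the expected input concrete, here is the helper run on a minimal hand-built entity. The dictionary layout mirrors the 'entities' part of the ABSA response as accessed above, but the concrete type and value strings are made-up placeholders, not actual API output:

```python
# the features() helper from above, repeated so the snippet runs standalone
def features(entity):
    result = []
    for se in entity['semantics']:
        if not se['type'].startswith('feature_'):
            continue
        result.append(se['value'])
    return result

# hand-built entity in the assumed response shape (placeholder values)
entity = {
    'surface': {'surface_string': '后备箱空间'},
    'semantics': [
        {'type': 'feature_subjective', 'value': 'TrunkSpace'},
        {'type': 'brand', 'value': 'BMW'},  # non-feature entry, skipped
    ],
}

print(features(entity))  # ['TrunkSpace']
```

Only semantics entries whose type starts with 'feature_' survive; everything else (brands, models, and so on) is ignored.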

We now use the built-in json module to load the datasets and inspect the structure of the datapoints:

>>> with open('bmw_x3_reviews_autoqq.json') as x3_fp, open('bmw_x5_reviews_autoqq.json') as x5_fp:
>>>     x3_raw_data = json.load(x3_fp)
>>>     x5_raw_data = json.load(x5_fp)

>>> len(x3_raw_data)
1045

>>> pprint(x3_raw_data[0])
{'author': '三三的爹',
 'date': '2012-10-23',
 'text': '优点:2.0T,10月16日提车,比预计提前2个多月,杭州骏宝行守信,赞一个! '
     '虽中档车,却有驾豪车感觉,操控性强,转弯不减油无失向、被甩感;看上去较省油,就冲低油耗才选这款的;棕色很大气,稀饭。 '
     '缺点:隧道不开灯看不清仪表读数;雨天倒车镜无风吹功能,安全有忧;导航无手写;CD/DVD只有单碟,且行进时屏蔽图像;后厢空间比想象的小,放不了多少东西 '
     '综述:瑕不掩瑜,这个价位值得!原驾一辆05年的本田CRV。 ',
 'title': ' 宝马 宝马X3'}

In English, the review reads roughly: "Pros: 2.0T; picked up the car on October 16, more than two months earlier than expected; Hangzhou Junbaohang kept its word, thumbs up! Although a mid-range car, it feels like driving a luxury one: strong handling, no loss of direction or being-thrown feeling when turning without easing off the throttle; it looks fuel-efficient, and the low fuel consumption is exactly why I chose this model; the brown color is classy, love it. Cons: the instrument readings are hard to see in tunnels without the lights on; the side mirrors have no rain-blowing function, a safety concern; the navigation has no handwriting input; the CD/DVD player takes only a single disc and blocks video while driving; the trunk is smaller than expected and doesn't hold much. Summary: the flaws don't outweigh the merits, worth it at this price! I previously drove a 2005 Honda CRV."

The body of the review is contained in the 'text' field. For each series, we collect all review texts into one list that we can pass to the API for analysis:

>>> text_data = {
>>>     "x3": [item['text'] for item in x3_raw_data],
>>>     "x5": [item['text'] for item in x5_raw_data],
>>> }

2. Linguistic analysis

To analyze the data linguistically, we define the API URL, provide our user credentials, call the absa analysis for both series, and save the analyzed data to JSON files.

You can skip this step, since the analyzed data is already included in the data packages you downloaded.

>>> for series in text_data:
>>>     response = requests.post('https://api.anacode.de/analyze/',
>>>                              json={'texts': text_data[series],
>>>                                    'analyses': ['absa']},
>>>                              headers={'Authorization': 'Token 5e53579467c3cddb288608b4d1f9944669f0ae9a'})
>>>
>>>     analyzed_data = response.json()['absa']
>>>
>>>     print('Analyzed {} data items.'.format(len(analyzed_data)))
>>>
>>>     output_file = "bmw_{}_reviews_autoqq_analysis.json".format(series)
>>>     with open(output_file, "w") as out_stream:
>>>         json.dump(analyzed_data, out_stream)

3. Aggregation and plotting

3.1 Frequent features

Let’s find out the most frequent features for X3 and see whether they are comparably popular for the X5 series.

>>> with open('bmw_x3_reviews_autoqq_analysis.json') as x3_fp, open('bmw_x5_reviews_autoqq_analysis.json') as x5_fp:
>>>     x3_analysis = json.load(x3_fp)
>>>     x5_analysis = json.load(x5_fp)


>>> def count_feature_frequencies(absa_data):
>>>     feature_freqs = Counter()
>>>     for item in absa_data:
>>>         for entity in item['entities']:
>>>             sub_entities = features(entity)
>>>             feature_freqs.update(sub_entities)
>>>     return feature_freqs
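As a quick sanity check of the counting logic, here it is on two hand-built items. The nesting mirrors the loops above; the feature types and values are made up for illustration:

```python
from collections import Counter

def features(entity):
    # as defined earlier, repeated so the snippet runs standalone
    return [se['value'] for se in entity['semantics']
            if se['type'].startswith('feature_')]

def count_feature_frequencies(absa_data):
    feature_freqs = Counter()
    for item in absa_data:
        for entity in item['entities']:
            feature_freqs.update(features(entity))
    return feature_freqs

# two hand-built analysis items (placeholder values)
toy_data = [
    {'entities': [
        {'semantics': [{'type': 'feature_subjective', 'value': 'Space'}]},
        {'semantics': [{'type': 'feature_subjective', 'value': 'Price'}]},
    ]},
    {'entities': [
        {'semantics': [{'type': 'feature_subjective', 'value': 'Space'}]},
    ]},
]

print(count_feature_frequencies(toy_data))  # Counter({'Space': 2, 'Price': 1})
```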


>>> x3_feature_freqs = count_feature_frequencies(x3_analysis)
>>> x5_feature_freqs = count_feature_frequencies(x5_analysis)


>>> for feature, count in x3_feature_freqs.most_common(10):
>>>     print('{:<20} {} {}'.format(feature, count, x5_feature_freqs[feature]))
Power                91 117
Control              88 94
VisualAppearance     81 70
FuelConsumption      79 80
Interior             64 62
Space                60 37
SteeringWheel        57 95
Engine               54 72
Price                51 25
Configuration        43 34

We plot the result to get a clearer view of the data:

>>> N = 10
>>> ind = np.arange(N)
>>> width = 0.35

>>> data = [(f, c, x5_feature_freqs[f]) for f, c in x3_feature_freqs.most_common(10)]
>>> x3_labels, x3_feature_frequencies, x5_feature_frequencies = zip(*reversed(data))
>>>
>>> fig = plt.figure(figsize=(10,7))
>>> plt.barh(ind + width, x3_feature_frequencies, width, label='BMW X3')
>>> plt.barh(ind, x5_feature_frequencies, width, color='indianred', label='BMW X5')

>>> # set title, labels and ticks
>>> plt.title('Feature frequencies for BMW X3 and X5', fontsize=16)
>>> plt.ylabel('Features')
>>> plt.yticks(ind + width)
>>> ax = plt.gca()
>>> ax.set_yticklabels(x3_labels)

>>> plt.legend(loc='lower right')

>>> plt.show()
_images/output_10_0.png

We see that the frequencies are very similar for some features, for instance control, fuel consumption, and interior. By contrast, other features show strong divergences, for example the steering wheel and space. Let's dig deeper into the steering wheel and see which associated concepts account for its higher frequency for the BMW X5. To get a better feel for the data, we also collect and print the associated surface strings:

>>> def steering_wheel_concepts(absa_data):
>>>     concepts = defaultdict(list)
>>>     for item in absa_data:
>>>         for entity in item['entities']:
>>>             all_features = features(entity)
>>>             no_wheel_features = list(filter(lambda f: 'SteeringWheel' not in f, all_features))

>>>             # if we did not filter out any mention of steering wheel it means there is none
>>>             if len(all_features) == len(no_wheel_features):
>>>                 continue

>>>             for feature in no_wheel_features:
>>>                 concepts[feature].append(entity['surface']['surface_string'])
>>>     return concepts


>>> x5_steering_wheel = steering_wheel_concepts(x5_analysis)
>>> for feature, strings in sorted(x5_steering_wheel.items(), key=lambda i: (len(i[1]), i[0]), reverse=True):
>>>     print('{:<17}\t{}\t{}'.format(feature, len(strings), ' '.join(strings)))
Boosting            2       X5的助力方向盘 X5的助力方向盘
Angle               2       方向角度 方向角度
YellowColor         1       黄色的方向盘符号
Steadiness          1       沉稳的方向盘
Size                1       方向盘大小
Sensitivity         1       灵敏的方向盘
RoofPanel           1       方向盘、车顶
Range               1       感觉X5的方向幅度
HandleFeeling       1       方向盘手感
Design              1       方向盘设计
Control             1       方向盘路感
CarLight            1       方向灯

For comparison, let's do the same for the X3:

>>> x3_steering_wheel = steering_wheel_concepts(x3_analysis)
>>> for feature, strings in sorted(x3_steering_wheel.items(), key=lambda i: (len(i[1]), i[0]), reverse=True):
>>>     print('{:<17}\t{}\t{}'.format(feature, len(strings), ' '.join(strings)))
HandleFeeling       2       方向盘手感 方向盘手感
Size                1       方向盘大小
Power               1       方向盘转向力度
ElectronicBoosting  1       电子助力方向
Control             1       方向指向
Angle               1       方向盘角度

Looking at the results, we see that X5 reviews associate the steering wheel with a noticeably wider range of concepts - power boosting, steering angle, sensitivity, steadiness, and more - which accounts for its higher frequency. By contrast, X3 reviews only touch on a handful of aspects, such as the handle feeling and the size.

3.2 Feature sentiments

We now want to find features with significant divergences in related sentiment values. First, we build the sentiment dictionaries for both car series by accumulating sentiment values for each feature and then averaging them.

>>> def feature_evals(absa_data):
>>>     evals = defaultdict(list)
>>>     for item in absa_data:
>>>         for relation in item['relations']:
>>>             for entity_part in relation['semantics']['entity']:
>>>                 if entity_part['type'].startswith('feature_'):
>>>                     evals[entity_part['value']].append(relation['semantics']['value'])
>>>     return evals


>>> def average_evals(evals):
>>>     return {feature: (sum(vals) / len(vals))
>>>             for feature, vals in evals.items()
>>>             if len(vals) > 1}
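To see both helpers in action before applying them to the real analyses, here they are on a single hand-built item. The nesting mirrors the 'relations' structure accessed above; the sentiment numbers are made up. Note that average_evals deliberately drops features with only one recorded value:

```python
from collections import defaultdict

# both helpers from above, repeated so the snippet runs standalone
def feature_evals(absa_data):
    evals = defaultdict(list)
    for item in absa_data:
        for relation in item['relations']:
            for entity_part in relation['semantics']['entity']:
                if entity_part['type'].startswith('feature_'):
                    evals[entity_part['value']].append(relation['semantics']['value'])
    return evals

def average_evals(evals):
    return {feature: sum(vals) / len(vals)
            for feature, vals in evals.items()
            if len(vals) > 1}

# one hand-built analysis item (placeholder sentiment values)
toy_data = [
    {'relations': [
        {'semantics': {'value': 0.8,
                       'entity': [{'type': 'feature_subjective', 'value': 'Space'}]}},
        {'semantics': {'value': -0.4,
                       'entity': [{'type': 'feature_subjective', 'value': 'Space'}]}},
        {'semantics': {'value': 0.5,
                       'entity': [{'type': 'feature_subjective', 'value': 'Price'}]}},
    ]},
]

print(average_evals(feature_evals(toy_data)))
# 'Space' averages to 0.2; 'Price' is dropped since it has only one value
```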


>>> x3_feature_evals = feature_evals(x3_analysis)
>>> x3_avg_feature_evals = average_evals(x3_feature_evals)
>>> x5_feature_evals = feature_evals(x5_analysis)
>>> x5_avg_feature_evals = average_evals(x5_feature_evals)

Since we are interested in those features that have a "significant" divergence in sentiment, let's fix a threshold of 0.3 as the minimum divergence. We print all features for which the average sentiment values diverge by more than 0.3, sorted by evaluation difference.

>>> x3_feature_set, x5_feature_set = set(x3_avg_feature_evals.keys()), set(x5_avg_feature_evals.keys())
>>> common_evals = [
>>>     (f, x3_avg_feature_evals[f], x5_avg_feature_evals[f])
>>>     for f in x3_feature_set.intersection(x5_feature_set)
>>> ]
>>> common_evals = [(f, x3, x5, x3 - x5) for f, x3, x5 in common_evals]

>>> print('X3 is strongly favored for the following features:')
>>> for feature, x3_eval, x5_eval, diff in sorted(common_evals, key=itemgetter(3), reverse=True):
>>>     if diff > 0.3:
>>>         print('{:<17}\t{:.2f}\t{:.2f}'.format(feature, x3_eval, x5_eval))

>>> print()
>>> print('X5 is strongly favored for the following features:')
>>> for feature, x3_eval, x5_eval, diff in sorted(common_evals, key=itemgetter(3)):
>>>     if diff < -0.3:
>>>         print('{:<17}\t{:.2f}\t{:.2f}'.format(feature, x3_eval, x5_eval))
X3 is strongly favored for the following features:
Transmission        0.88    0.50
Sensitivity         0.82    0.52

X5 is strongly favored for the following features:
TrunkSpace          -0.53   0.53
ShockAbsorbers      -0.27   0.61
Convenience         0.10    0.74
ShockAbsorption     0.07    0.63
Frontage            -0.03   0.50
Screen              0.00    0.50
Color               0.17    0.62
RearEnd             0.19    0.61
InsideOfCar         0.45    0.84
Fashionability      0.33    0.69
Luxuriousness       0.05    0.36

We now want to plot the sentiment of the features for which X5 is favored. We sort the features by decreasing sentiment divergence and build the chart:

>>> x5_favored_features = [(f, x5, x3) for f, x3, x5, d in common_evals if d < -0.3]
>>> sorted_fe_tuples = sorted(x5_favored_features, key=lambda v: abs(v[1] - v[2]))
>>> N = len(sorted_fe_tuples)
>>> ind = np.arange(N)
>>> width = 0.35

>>> feature_labels, x5_evaluations, x3_evaluations = zip(*sorted_fe_tuples)  # renamed to avoid shadowing the features() helper

>>> fig = plt.figure(figsize=(10, 7))
>>> plt.barh(ind + width, x5_evaluations, width, label='BMW X5')
>>> plt.barh(ind, x3_evaluations, width, color='indianred', label='BMW X3')
>>> plt.xlim([-1.0, 1.0])
>>> plt.ylim([-0.1, 10.8])

>>> # set title, labels and ticks
>>> plt.title('Diverging polarities for BMW X3 and X5')
>>> plt.ylabel('Features')
>>> plt.yticks(ind + width)
>>> ax = plt.gca()
>>> ax.set_yticklabels(feature_labels)

>>> ax.legend(loc="lower left")

>>> plt.show()
_images/output_22_0.png

3.3 Inspecting original texts

If we are curious about what users say exactly about some of these features, we can print out the associated texts. For instance, let’s print the evaluation texts associated with the trunk space:

>>> print("Trunk space for X3: ")
>>> for item in x3_analysis:
>>>     relations = item["relations"]
>>>     for relation in relations:
>>>         entity = relation["semantics"]["entity"]
>>>         for sub_entity in entity:
>>>             if sub_entity["value"] == "TrunkSpace":
>>>                 print(relation["surface"]["surface_string"])

>>> print("\n")
>>> print("Trunk space for X5: ")
>>> for item in x5_analysis:
>>>     relations = item["relations"]
>>>     for relation in relations:
>>>         entity = relation["semantics"]["entity"]
>>>         for sub_entity in entity:
>>>             if sub_entity["value"] == "TrunkSpace":
>>>                 print(relation["surface"]["surface_string"])
Trunk space for X3:
储物空间不够
储物空间少
储物空间少
储物空间太小太少
储物空间太小太少
储物空间太小太少
车内储物空间偏少


Trunk space for X5:
储物空间丰富
后备箱空间都很大
后备箱空间不小
后备箱空间不小
还可以增加储物空间
后备箱空间不小
还可以增加储物空间
后备箱空间也还不错

For your convenience, here are the above statements translated into English:

Trunk space for X3:

储物空间不够 - The trunk space is not sufficient.

储物空间少 - The trunk space is small.

储物空间少 - The trunk space is small.

储物空间太小太少 - The trunk space is too small.

储物空间太小太少 - The trunk space is too small.

储物空间太小太少 - The trunk space is too small.

车内储物空间偏少 - The trunk space is relatively small.

Trunk space for X5:

储物空间丰富 - The trunk space is abundant.

后备箱空间都很大 - The trunk space is very large.

后备箱空间不小 - The trunk space is not small.

后备箱空间不小 - The trunk space is not small.

还可以增加储物空间 - You can even enlarge the trunk space.

后备箱空间不小 - The trunk space is not small.

还可以增加储物空间 - You can even enlarge the trunk space.

后备箱空间也还不错 - The trunk space is not bad.

Thus, we can see that the evaluations for the X3 criticize an insufficient trunk space; by contrast, the size of the X5 trunk seems to satisfy the users.