Anacode Toolkit: Getting a semantic overview of your dataset

Using the API to get a linguistic analysis of your data is nice, but what you ultimately want - especially when analyzing larger text collections - is an aggregated and visualized result. In this tutorial, we show some basic usage scenarios covering the whole pipeline of analyzing, saving, aggregating and plotting your data, which will give you a broad overview of the topics and concepts in a text collection.

The tutorial makes use of the Anacode Toolkit, a Python library that you can use to write your data to various formats and to aggregate and plot them. You can download the toolkit from PyPI.

1. Data structure and preliminaries

Our dataset contains 100 articles which were collected from auto.sohu.com, the automotive section of sohu.com. You can download the data from this link.

First, we make the necessary imports, including anacode, the Anacode Toolkit:

>>> import os
>>> import json
>>> import tarfile
>>> from pprint import pprint
>>> from itertools import chain
>>> from collections import Counter
>>> from multiprocessing.dummy import Pool as ThreadPool

>>> import pandas as pd
>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> import requests

>>> import anacode
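
If the data package you downloaded comes as a tar archive, you can extract it with the tarfile module before loading the JSON file. This is only a sketch: the archive name below is a placeholder, so adjust it to the file you actually downloaded.

>>> # Placeholder archive name - replace with the file you downloaded
>>> archive = tarfile.open('sohu_auto_100.tar.gz')
>>> archive.extractall()
>>> archive.close()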

We load the json data and inspect its size and structure:

>>> with open('sohu_auto_100/sohu_auto_100.json') as fp:
...     original_data = json.load(fp)

>>> print(len(original_data))
100
>>> pprint(original_data[0])
{'author': '熊飞',
 'content': ['[搜狐汽车新车]又到了春暖花开的季节,动物们迎来了......不对应该是车企们又到了发布新车的季节。今天小编和大家一起看看即将在3月18号上市的雷诺紧凑级SUV“新”作——东风雷诺科雷嘉。虽然这样说,但是大家都知道科雷嘉与逍客都是基于CMF平台打造的紧凑级SUV。',
             '同平台节约成本外,还有它意?',
             '雷诺和日产在2001年10月成立雷诺-日产联盟,卡洛斯·戈恩任联席CEO。雷诺作为法国第二大汽车制造商,拥有强大的造车实力,但是在中国,雷诺的车确实卖不了几台。既然你们对雷诺不感冒,我就给你们熟悉的日产车做个拉皮升级版,这回你们买吗?科雷嘉能否成功将逍客的潜在用户拉拢过来,我想这是东风雷诺首先面对的问题,毕竟一个造车大厂是不会满足靠打游击得来的微薄销量。',
             '逍客去年卖6万台,我来个升级版。',
             '科雷嘉相比逍客提升了不少实用配置,车身长度也稍占优势。相同的平台通过不同的工程师调教,会赋予车子不同的性格特点,并非换个壳这么简单而已。',
             '倘若科雷嘉以之前16万至23万的预售价格上市销售,将会与大部分合资紧凑级SUV形成价格上的重叠。除了要和同门兄弟逍客争饭碗,还要面对奇骏和CR-V这些销量轻松达到十几万的车型。在这些大哥们的打压下,科雷嘉想要开疆扩土就不单单是加个配置改个外观就能做到的,所以千万要找准自己的定位。',
             '面对韩国朋友怎么办,科雷嘉与KX5你选谁?',
             '科雷嘉连乞丐版都能卖16万,韩国朋友表示不服。我堂堂起亚KX5,论各方面都不比你差,你凭什么比我贵?',
             '两车都定位较为年轻的消费群体,要说科雷嘉的优势在哪里,我只能说我喜欢科雷嘉中控台上的碳纤维面板。',
             '科雷嘉虽然算的上是升级版的逍客,但是其产品定位更加小众。个性的外观加上偏向赛车设计的内饰风格好像吸引的都是一帮爱玩的小伙子,还真不容易像逍客一样被大众人群接受。而且同样定位年轻市场的KX5在各方面至少不输给科雷嘉,但是起亚品牌在中国影响力却强于雷诺。科雷嘉想要打开中国的市场,会面临严峻的挑战。小编我只是希望科雷嘉能尽可能的降低产品定位,给我们这些爱车却又没钱的小伙伴一些希望。越早建立品牌拥护者,才能越快立足市场。至于今后科雷嘉将以何种身份立足中国车市,敬请期待3月18号科雷嘉上市直播。'],
 'date': '2016-03-17',
 'domain': 'auto.sohu.com',
 'source': '搜狐汽车',
 'title': '每日一车:东风雷诺科雷嘉等于换壳逍客?',
 'url': 'http://auto.sohu.com/20160317/n440640970.shtml'}
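
Before sending anything to the API, it can be useful to confirm that all 100 documents share the structure of the sample shown above. The following quick checks are only a sketch; they assume that every document carries the same fields as the first one:

>>> # Fields observed in the sample document above
>>> expected_keys = {'author', 'content', 'date', 'domain', 'source', 'title', 'url'}
>>> assert all(expected_keys <= set(doc) for doc in original_data)
>>> # How many articles each source contributed and how many paragraphs there are in total
>>> print(Counter(doc['source'] for doc in original_data).most_common(3))
>>> print(sum(len(doc['content']) for doc in original_data))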

The sohu_auto_100 dataset was scraped using the web scraping functionality of the Web&Text API and thus has the same output structure as described for the scrape call. The linguistic analysis mostly works with the ‘title’ string and the list of paragraphs under ‘content’. We concatenate these fields for each document to get one string per document, since document-level granularity is sufficient to get an overview of the data:

>>> texts = [" ".join([doc["title"]]+doc["content"]) for doc in original_data]

2. Linguistic analysis

Two analyses are used in this tutorial: Text Categorization (categories) and Concept Extraction (concepts). We use the analyze method of anacode.api.client.AnacodeClient to analyze our text data. If you run this code, please don’t forget to specify your token:

>>> api = anacode.api.client.AnacodeClient('<your token>')
>>> json_analysis = api.analyze(texts, ['concepts', 'categories'])
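
With only 100 documents a single call is fine, but for larger collections you may prefer to split the texts into smaller batches and send them concurrently. The following sketch reuses the analyze call from above with the ThreadPool imported earlier; the batch size and thread count are arbitrary, and it assumes the client can be shared across threads:

>>> def analyze_batch(batch):
...     # Same call as above, applied to a slice of the texts
...     return api.analyze(batch, ['concepts', 'categories'])
>>> batches = [texts[i:i + 10] for i in range(0, len(texts), 10)]
>>> pool = ThreadPool(4)
>>> batch_analyses = pool.map(analyze_batch, batches)
>>> pool.close()
>>> pool.join()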

Optionally, we save the analyzed data to the ling/ directory using the anacode.api.writers.CSVWriter class in the toolkit:

>>> writer = anacode.api.writers.CSVWriter("ling")
>>> writer.init()
>>> writer.write_analysis(json_analysis)
>>> writer.close()

Please note that you do not have to run this step yourself since the analyzed data is already provided inside the data package you downloaded.
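
Either way, you can quickly check that the CSV files are in place before moving on; we simply list the directory without assuming specific file names:

>>> print(os.listdir('ling'))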

3. Aggregating and plotting

We now use the anacode.agg.DatasetLoader class to load the data into a Dataset object. If the data was stored to disk using the CSVWriter, it can be loaded with the from_path method of DatasetLoader:

>>> dataset = anacode.agg.DatasetLoader.from_path("ling")

To get a broad overview of the data, we find the 5 main categories that were identified by the categories analysis:

>>> categories = dataset.categories
>>> anacode.agg.plotting.piechart(categories.categories(),
...                               category_count=5)
>>> plt.show()
[Image: pie chart of the 5 main categories (output_23_1.png)]

Now, we go into more detail about concepts. We plot a concept cloud, which puts all concept types into one bucket and plots them with sizes reflecting their relative frequencies:

>>> concepts = dataset.concepts
>>> anacode.agg.plotting.concept_cloud(concepts.concept_frequencies())
>>> plt.show()
[Image: concept cloud of all extracted concepts (output_25_1.png)]
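
Since the plots are displayed via matplotlib (note the plt.show() call above), you can also keep a copy on disk with the standard plt.savefig call before showing the figure; the file name below is just an example:

>>> anacode.agg.plotting.concept_cloud(concepts.concept_frequencies())
>>> plt.savefig('concept_cloud.png')  # example output file name
>>> plt.show()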

Finally, we want to get information specific to the brands and products discussed in the data. We use the most_common_concepts aggregation of ConceptsDataset and apply it to the two different concept types:

>>> frequent_brands = concepts.most_common_concepts(n=10, concept_type="brand")
>>> anacode.agg.plotting.barhchart(frequent_brands)
>>> plt.show()
[Image: horizontal bar chart of the 10 most frequent brands (output_27_1.png)]

>>> frequent_products = concepts.most_common_concepts(n=10, concept_type="product")
>>> anacode.agg.plotting.barhchart(frequent_products)
>>> plt.show()
[Image: horizontal bar chart of the 10 most frequent products (output_28_1.png)]