Anacode Toolkit: Getting a semantic overview of your dataset

Using the API to get a linguistic analysis of your data is useful, but what you ultimately want, especially when analyzing larger text collections, is an aggregated and visualized result. In this tutorial, we walk through some basic usage scenarios covering the whole pipeline of analyzing, saving, aggregating and plotting your data, which will give you a broad overview of the topics and concepts in a text collection.

The tutorial makes use of the Anacode Toolkit, a Python library that you can use for writing your data to various formats, aggregating it and plotting it. You can download the toolkit from PyPI.

1. Data structure and preliminaries

Our dataset contains 100 articles collected from the automotive section of Sohu (搜狐). You can download the data from this link.
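If you downloaded the data as a compressed archive, a small helper along these lines unpacks it (the archive name and layout are assumptions based on the dataset name; adjust them to the file you actually downloaded):

```python
import tarfile

def extract_dataset(archive_path, target_dir="."):
    """Unpack a gzipped tar archive of the dataset into target_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)
```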

First, we make the necessary imports, also importing anacode, the Anacode Toolkit:

>>> import os
>>> import json
>>> import tarfile
>>> from pprint import pprint
>>> from itertools import chain
>>> from collections import Counter
>>> from multiprocessing.dummy import Pool as ThreadPool

>>> import pandas as pd
>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> import requests

>>> import anacode

We load the json data and inspect its size and structure:

>>> with open('sohu_auto_100/sohu_auto_100.jsonl') as fp:
>>>     original_data = [json.loads(line.strip()) for line in fp]

>>> print(len(original_data))
>>> pprint(original_data[0])
{'author': '熊飞',
 'content': ['[搜狐汽车新车]又到了春暖花开的季节,动物们迎来了......不对应该是车企们又到了发布新车的季节。今天小编和大家一起看看即将在3月18号上市的雷诺紧凑级SUV“新”作——东风雷诺科雷嘉。虽然这样说,但是大家都知道科雷嘉与逍客都是基于CMF平台打造的紧凑级SUV。', ...],
 'date': '2016-03-17',
 'domain': '',
 'source': '搜狐汽车',
 'title': '每日一车:东风雷诺科雷嘉等于换壳逍客?',
 'url': ''}
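Beyond inspecting a single record, the Counter imported above is handy for a quick look at how the documents distribute over metadata fields. A sketch on toy records with the same keys (with the real dataset, pass the loaded documents instead):

```python
from collections import Counter

# Toy stand-in for original_data; substitute the loaded documents.
docs = [{"source": "搜狐汽车", "date": "2016-03-17"},
        {"source": "搜狐汽车", "date": "2016-03-18"}]

by_source = Counter(doc["source"] for doc in docs)
by_date = Counter(doc["date"] for doc in docs)
print(by_source.most_common())
print(by_date.most_common())
```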

The sohu_auto_100 dataset was scraped using the Web scraping functionality of the Web&Text API and thus has the same output structure as described for the scrape call. The linguistic analysis mostly works with the ‘title’ string and the list of paragraphs under ‘content’. We concatenate these for each document to get one string per document, since document-level granularity is sufficient for an overview of the data:

>>> texts = [" ".join([doc["title"]]+doc["content"]) for doc in original_data]
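To sanity-check the concatenation on a toy document before running it over the whole collection:

```python
# A minimal document with the same keys as the dataset records.
doc = {"title": "Title", "content": ["Paragraph one.", "Paragraph two."]}

# Same join as above: title first, then the paragraphs, space-separated.
text = " ".join([doc["title"]] + doc["content"])
print(text)  # Title Paragraph one. Paragraph two.
```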

2. Linguistic analysis

Two analyses are used in this tutorial: Text Categorization (categories) and Concept Extraction (concepts). We use the analyze function of anacode.api.client.AnacodeClient to analyze our text data. If you run this code, please don’t forget to specify your token:

>>> api = anacode.api.client.AnacodeClient('<your token>')
>>> json_analysis = api.analyze(texts, ['concepts', 'categories'])
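Here analyze is called on the whole list at once. For larger collections you may want to split the texts into batches and issue the requests concurrently, for instance with the ThreadPool imported earlier. A sketch (the batch size and worker count are arbitrary choices; `api.analyze` is the call shown above):

```python
from multiprocessing.dummy import Pool as ThreadPool

def analyze_in_batches(api, texts, analyses, batch_size=10, workers=4):
    """Analyze texts batch-by-batch using a pool of worker threads."""
    batches = [texts[i:i + batch_size]
               for i in range(0, len(texts), batch_size)]
    with ThreadPool(workers) as pool:
        # Each worker sends one batch to the API; results keep batch order.
        return pool.map(lambda batch: api.analyze(batch, analyses), batches)
```

Each element of the returned list is the API response for one batch.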

Optionally, we save the analyzed data to the ling/ directory using the anacode.api.writers.CSVWriter class in the toolkit:

>>> writer = anacode.api.writers.CSVWriter("ling")
>>> writer.init()
>>> writer.write_analysis(json_analysis)
>>> writer.close()

Please note that you do not have to run this step yourself since the analyzed data is already provided inside the data package you downloaded.
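If you want to verify what the writer produced before aggregating, a quick pandas sketch lists each CSV in the output directory together with its shape (the exact file names depend on the analyses you ran):

```python
import glob
import os

import pandas as pd

def summarize_csvs(directory):
    """Return {file name: (rows, columns)} for every CSV in directory."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        frame = pd.read_csv(path)
        summary[os.path.basename(path)] = frame.shape
    return summary
```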

3. Aggregating and plotting

We now use the anacode.agg.DatasetLoader class to load the data into a Dataset object. If the data was stored to disk using the CSVWriter, it can be loaded with the from_path method of DatasetLoader:

>>> dataset = anacode.agg.DatasetLoader.from_path("ling")

To get a broad overview of the data, we find the 5 main categories that were identified by the categories analysis:

>>> categories = dataset.categories
>>> anacode.agg.plotting.piechart(categories.categories(),
>>>                               category_count=5)

Now, we go into more detail about concepts. We plot a concept cloud, which puts all concept types into one bucket and plots them with sizes reflecting their relative frequencies:

>>> concepts = dataset.concepts
>>> anacode.agg.plotting.concept_cloud(concepts.concept_frequencies())

Finally, we want to get information specific to the brands and products discussed in the data. We use the most_common_concepts aggregation of ConceptsDataset and apply it to the two different concept types:

>>> frequent_brands = concepts.most_common_concepts(n=10, concept_type="brand")
>>> anacode.agg.plotting.barhchart(frequent_brands)
>>> frequent_products = concepts.most_common_concepts(n=10, concept_type="product")
>>> anacode.agg.plotting.barhchart(frequent_products)
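The plotting helpers draw with matplotlib, so the figures can be persisted with the usual plt.savefig. A minimal sketch with stand-in frequencies (the Agg backend makes it runnable without a display; with the real data, call plt.savefig right after the anacode plotting call instead):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe for scripted runs
import matplotlib.pyplot as plt

# Stand-in brand frequencies for illustration only.
plt.figure()
plt.barh(["brand A", "brand B"], [12, 7])
plt.savefig("frequent_brands.png", bbox_inches="tight")
plt.close()
```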