Concepts: cooking recipe analysis

This tutorial demonstrates how to use the concepts analysis. We analyze rice recipes and identify the ingredients that, according to the numeric user ratings on the website, make a rice dish tastier.

1. Data structure and preliminaries

Our dataset contains 17,940 cooking recipes collected from xiachufang, a major Chinese food and cooking website. You can download the data from this link.

First, we make the necessary imports:

>>> import json
>>> import tarfile
>>> from pprint import pprint
>>> from itertools import chain
>>> from collections import Counter, defaultdict
>>> from multiprocessing.dummy import Pool as ThreadPool

>>> import pandas as pd
>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> import requests
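
If you prefer to fetch and unpack the archive programmatically (this is what the tarfile import is for), here is a minimal sketch; the archive URL and file name are placeholders, so substitute the actual download link from above:

>>> archive_url = 'https://example.com/xiachufang.tar.gz'  # placeholder URL
>>> with open('xiachufang.tar.gz', 'wb') as fp:
>>>     fp.write(requests.get(archive_url).content)
>>> with tarfile.open('xiachufang.tar.gz') as tar:
>>>     tar.extractall('.')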

We now use the built-in json module to load the dataset and inspect the structure of the datapoints:

>>> with open('xiachufang/xiachufang.json') as fp:
>>>     original_data = json.load(fp)

>>> len(original_data)
17940

>>> pprint(original_data[0])
{'author': '栗子家的喵',
 'content': {'info': '夏日清...',
             'ingredients': [{'name': '西柚', 'quantity': '2个'},
                             {'name': '蜂蜜', 'quantity': '2Tbsp'},
                             {'name': '水', 'quantity': '1/2 cup'}],
             'texts': ['西柚去...',
                       '果肉+...',
                       '倒入冰...',
                       '脱模时...',
                       '小贴士',
                       '*尝试...',
                       '*如果...']},
 'date': '2015-08-03',
 'scores': {'cooked': 4},
 'title': '西柚蜜冰棍的做法',
 'url': 'http://www.xiachufang.com/recipe/100537177/'}

We see that the actual text strings are stored in the fields ‘ingredients’ and ‘texts’, whereas the other fields contain metadata. For the purposes of this tutorial, we join the ‘texts’ list under ‘content’ into a single string, since we do not need paragraph granularity. We also extract the ingredient names into a flat list for easier manipulation:

>>> for recipe in original_data:
>>>     recipe['content']['texts'] = ''.join(recipe['content']['texts'])
>>>     easy_ings = [i['name'] for i in recipe['content']['ingredients']]
>>>     recipe['content']['ingredients-easy'] = easy_ings
>>> pprint(original_data[0])
{'author': '栗子家的喵',
 'content': {'info': '夏日清凉冰棍...',
             'ingredients': [{'name': '西柚', 'quantity': '2个'},
                             {'name': '蜂蜜', 'quantity': '2Tbsp'},
                             {'name': '水', 'quantity': '1/2 cup'}],
             'ingredients-easy': ['西柚', '蜂蜜', '水'],
             'texts': '西柚去皮,去白筋...'},
 'date': '2015-08-03',
 'scores': {'cooked': 4},
 'title': '西柚蜜冰棍的做法',
 'url': 'http://www.xiachufang.com/recipe/100537177/'}
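
Before moving on, it is worth checking how many recipes actually carry a numeric ‘rating’ score, since we will rely on it later (output omitted, as it depends on your copy of the data):

>>> sum(1 for r in original_data if 'rating' in r.get('scores', {}))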

2. Linguistic analysis

To linguistically analyze the data, we define the API URL we will use to extract concepts from Chinese text, provide our user credentials, and create a dataset of valid recipes by filtering out documents that contain neither text nor ingredients:

>>> url = 'https://api.anacode.de/analyze/'
>>> auth = {'Authorization': 'Token <token>'}
>>> recipes = [recipe for recipe in original_data
>>>            if recipe['content']['texts'] or
>>>                recipe['content']['ingredients-easy']]
>>> len(recipes)
14629

Since the downloaded data already contain the analysis results, the next step can be skipped if you wish to save your API calls.

We now perform the analysis using multithreading for increased speed:

>>> prepared_data = []
>>> for recipe in recipes:
>>>     info = recipe['content']['info']
>>>     texts = recipe['content']['texts']
>>>     ings = ''.join(recipe['content']['ingredients-easy'])
>>>     prepared_data.append(info + texts + ings)

>>> def extract_concepts(text):
>>>     res = requests.post(url, headers=auth,
>>>                         json={'texts': [text], 'analysis': ['concepts']})
>>>     return res.json()

>>> pool = ThreadPool(3)
>>> concepts = pool.map(extract_concepts, prepared_data)
>>> pool.close()
>>> pool.join()
>>> del pool
>>> concepts = [c[0] for c in concepts]  # one text per call, so unwrap the singleton result
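
Since the API calls are the slow part, it pays to cache the results locally so that re-running the tutorial does not repeat them. A minimal sketch (the cache file name is our own choice):

>>> with open('concepts-cache.json', 'w') as fp:  # arbitrary cache file name
>>>     json.dump(concepts, fp)
>>> # In later sessions, reload the cache instead of calling the API:
>>> # with open('concepts-cache.json') as fp:
>>> #     concepts = json.load(fp)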

>>> len(concepts)
14629
>>> concepts[0]
[{'concept': 'ColdTexture',
  'surface': [{'surface_string': '冷'}],
  'freq': 1,
  'relevance_score': 0.06627054773953846,
  'type': 'food_features'},
 {'concept': 'SourFlavorTaste',
  'surface': [{'surface_string': '酸'}],
  'freq': 1,
  'relevance_score': None,
  'type': 'food_features'},
  ...]

3. Aggregation and plotting

Let’s find all the rice dishes in the dataset by checking each recipe’s concept names for the substring ‘rice’...

>>> rice_dishes = []
>>> for index, recipe_concepts in enumerate(concepts):
>>>     if any('rice' in c['concept'].lower() for c in recipe_concepts):
>>>         rice_dishes.append(index)

... and find ingredients that are used in each dish:

>>> ingredients = []
>>> for recipe in concepts:
>>>     ings = [c for c in recipe if c['type'] == 'food_ingredient']
>>>     ingredients.append(ings)
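
As a quick sanity check, you can peek at the titles of a few matched recipes; note that the indices refer to the filtered recipes list, not to original_data (output omitted):

>>> [recipes[i]['title'] for i in rice_dishes[:3]]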

We will also compute the “popularity” of ingredients used in rice dishes, i.e., how often each ingredient occurs in dishes that contain rice.

>>> ing_counter = Counter()
>>> for dish in rice_dishes:
>>>     for ing in concepts[dish]:
>>>         if ing['type'] == 'food_ingredient':
>>>             ing_counter[ing['concept']] += 1
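
Counter makes it easy to peek at the most frequent co-ingredients (the exact list depends on the analysis results):

>>> ing_counter.most_common(10)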

Finally, we relate ingredients to the recipes’ ratings. For this, we use the ‘rating’ score from the recipe metadata, which holds the numeric user rating. (The metadata also contains a ‘cooked’ score, indicating how many users actually cooked the dish.)

>>> rice_scores = [recipes[dish]['scores']  # indices refer to the filtered recipes list
>>>                for dish in rice_dishes]

>>> rice_ing_scores = defaultdict(list)
>>> for index, dish in enumerate(rice_dishes):
>>>     if 'rating' not in rice_scores[index]:
>>>         continue
>>>     score = rice_scores[index]['rating']
>>>     dish_concepts = concepts[dish]
>>>     ings = {ing['concept'] for ing in dish_concepts
>>>             if ing['type'] == 'food_ingredient'}
>>>     for ing in ings:
>>>         rice_ing_scores[ing].append(score)

We now create a pandas DataFrame holding the mean rating and usage count per ingredient, for easier manipulation and plotting.

>>> result_mean = [(ing, (sum(scores) / len(scores)))
>>>                for ing, scores in rice_ing_scores.items()]
>>> scores_len = [(ing, len(scores))
>>>               for ing, scores in rice_ing_scores.items()]
>>> labels, scores = zip(*result_mean)
>>> _, counts = zip(*scores_len)

>>> r = pd.Series(scores, index=labels)
>>> s = pd.Series(counts, index=labels)

>>> data = pd.concat([r, s], axis=1)
>>> data.columns = ['score', 'count']
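
Equivalently, if you prefer, the same frame can be built in a single step straight from the score lists:

>>> data = pd.DataFrame({
>>>     'score': {ing: sum(s) / len(s) for ing, s in rice_ing_scores.items()},
>>>     'count': {ing: len(s) for ing, s in rice_ing_scores.items()},
>>> })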

To improve the statistical relevance of the result, we filter out ingredients that occur fewer than 5 times in the data. We also calculate the overall mean recipe rating as a baseline for comparison.

>>> relevant = data[data['count'] >= 5]
>>> relevant = relevant.sort_values(by='score').reset_index()
>>> relevant.columns = ['Ingredient', 'Score', 'Count']

>>> all_ratings = [d['scores']['rating'] for d in original_data
>>>                if 'scores' in d and 'rating' in d['scores']]
>>> total_mean = sum(all_ratings) / len(all_ratings)

Finally, let’s plot the best and the worst rice “co-ingredients”. Since relevant is sorted by ascending score, the last ten rows are the best ones:

>>> g = sns.barplot(x='Ingredient', y='Score',
>>>                 data=relevant[-10:],
>>>                 color=sns.xkcd_rgb["denim blue"])
>>> ylim = g.set(ylim=(7.5, 9.0))
>>> labels = g.set_xticklabels(g.get_xticklabels(), rotation=20)
>>> plt.axhline(total_mean, color='black', linestyle='dashed',
>>>             label='Overall mean recipe score')
>>> plt.tight_layout()
>>> plt.legend()
[Image: ingredients-good.png, the ten best-rated rice co-ingredients]

And the ten worst-rated co-ingredients, i.e. the first ten rows:
>>> plt.figure()  # start a fresh figure so the two plots do not overlap
>>> g = sns.barplot(x='Ingredient', y='Score',
>>>                 data=relevant[:10],
>>>                 color=sns.xkcd_rgb["denim blue"])
>>> ylim = g.set(ylim=(7.5, 9.0))
>>> labels = g.set_xticklabels(g.get_xticklabels(), rotation=20)
>>> plt.axhline(total_mean, color='black', linestyle='dashed',
>>>             label='Overall mean recipe score')
>>> plt.legend()
>>> plt.tight_layout()
[Image: ingredients-bad.png, the ten worst-rated rice co-ingredients]
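
If you are running this as a script rather than in a notebook, render or save each figure explicitly, for example:

>>> plt.savefig('ingredients-bad.png')  # write the current figure to disk
>>> plt.show()                          # or display it interactively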