Scrape: continuous data collection

In this tutorial, we demonstrate a simple way to collect web data continuously using the scrape call. The tutorial follows the usual logic of crawling algorithms: all URLs identified on previously visited pages are saved to a URL frontier. At each iteration, we take a new URL from this frontier, add its scraped content to the collected data, and add the links found on the page back to the frontier.
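Before filling in the details, here is the skeleton of this loop as a generic Python function. It is only a conceptual sketch: scrape_page is a placeholder for any function that fetches a page and returns its content together with its outgoing links; the rest of the tutorial implements exactly this loop with the scrape call.

>>> # Conceptual sketch of frontier-based crawling; `scrape_page` is a placeholder.
>>> def crawl(seed_url, scrape_page, max_pages):
>>>     frontier, visited, collected = {seed_url}, set(), []
>>>     while frontier and len(collected) < max_pages:
>>>         url = frontier.pop()
>>>         visited.add(url)             # never process the same URL twice
>>>         content, links = scrape_page(url)
>>>         if content:                  # keep only pages that have content
>>>             collected.append(content)
>>>         frontier.update(link for link in links if link not in visited)
>>>     return collected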

We keep track of our scraping progress in two sets of URLs: one set of already visited URLs and another set of URLs that are yet to be visited. At the beginning, we initialize urls_to_visit with a single seed URL, while the set of visited URLs (visited_urls) starts out empty:

>>> SEED_URL = "http://sports.sohu.com/20170305/n482403989.shtml"

>>> urls_to_visit = {SEED_URL}
>>> visited_urls = set()

We now start crawling. To control the size of the collected data, the while loop stops once the desired number of pages (10 in our example) has been scraped. In each iteration, we pop a new link from the frontier, mark it as visited, and try to scrape it using the scrape call. All newly discovered URLs are added to the set of links to visit. If the scrape was successful (which, in our example, means that the page has some content), the scrape result is also appended to the scraped data.

>>> import requests

>>> scraped_data = []
>>> SCRAPED_DATA_MAX_LENGTH = 10

>>> while len(urls_to_visit) > 0 and len(scraped_data) < SCRAPED_DATA_MAX_LENGTH:
>>>     url = urls_to_visit.pop()
>>>     # mark the URL as visited right away, so that it cannot be re-added
>>>     # to the frontier if the page happens to link to itself
>>>     visited_urls.add(url)
>>>     res = requests.post('https://api.anacode.de/scrape',
>>>                         json={"url": url},
>>>                         headers={"Authorization": "Token 5e53579467c3cddb288608b4d1f9944669f0ae9a"})
>>>     if res.status_code == 200:
>>>         scraped_item = res.json()
>>>         # extend the frontier with all links we have not visited yet
>>>         urls_to_visit.update(href for href in scraped_item.get("hrefs", [])
>>>                              if href not in visited_urls)
>>>         # keep the result only if the page actually has content
>>>         if scraped_item["content"]:
>>>             scraped_data.append(scraped_item)

>>> print("Successfully scraped data.")
Successfully scraped data.
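At this point you can also inspect the two URL sets to get a feel for the crawl's progress; the exact numbers depend on which pages were visited, but the frontier typically grows much faster than it is consumed, since every page contributes many new links:

>>> print(len(visited_urls), "URLs visited,", len(urls_to_visit), "URLs left in the frontier")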

Let’s check that we have collected the right amount of data and that the content is well-formed:

>>> print(len(scraped_data))
10
>>> print(scraped_data[1]["content"])
['1、本网所有内容,凡注明"来源:搜狐××(频道)"的所有文字、图片和音视频资料,版权均属搜狐公司所有,任何媒体、网站或个人未经本网协议授权不得转载、链接、转贴或以其他方式复制发布/发表。已经本网协议授权的媒体、网站,在下载使用时必须注明"稿件来源:搜狐网",违者本网将依法追究责任。凡本网注明"来源:XXX "的文/图等稿件,本网转载出于传递更多信息之目的,并不意味着赞同其观点或证实其内容的真实性。', '2、除注明"来源:搜狐××(频道)"的内容外,本网以下内容亦不可任意转载:a.本网所指向的非本网内容的相关链接内容;b.已作出不得转载或未经许可不得转载声明的内容;c.未由本网署名或本网引用、转载的他人作品等非本网版权内容;d.本网中特有的图形、标志、页面风格、编排方式、程序等;e.本网中必须具有特别授权或具有注册用户资格方可知晓的内容;f.其他法律不允许或本网认为不适合转载的内容。', '3、转载或引用本网内容必须是以新闻性或资料性公共免费信息为使用目的的合理、善意引用,不得对本网内容原意进行曲解、修改,同时必须保留本网注明的"稿件来源",并自负版权等法律责任。', '4、转载或引用本网内容不得进行如下活动:a. 损害本网或他人利益;b. 任何违法行为;c. 任何可能破坏公秩良俗的行为;d. 擅自同意他人继续转载、引用本网内容;', '5、转载或引用本网版权所有之内容须注明“转自(或引自)搜狐网”字样,并标明本网网址www.sohu.com。', '6、转载或引用本网中的署名文章,请按规定向作者支付稿酬。', '7、对于不当转载或引用本网内容而引起的民事纷争、行政处理或其他损失,本网不承担责任。', '8、本网以“法定许可”方式使用作品,已与知识产权所有者签署合作协议并支付报酬。如有未尽事宜请相关权利人直接与本网媒体合作部联系,联系电话为:010-56603441。', '9、对不遵守本声明或其他违法、恶意使用本网内容者,本网保留追究其法律责任的权利。']
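If you want to keep the collected sample for later analysis, one minimal option is to dump it to a JSON file. This is just a sketch: the scraped items are plain dictionaries as returned by res.json(), so they serialize without further processing, and the file name is arbitrary.

>>> import json

>>> # persist the collected items so they can be reloaded later
>>> with open("scraped_data.json", "w", encoding="utf-8") as f:
>>>     json.dump(scraped_data, f, ensure_ascii=False)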

Optionally, you can further constrain which URLs are scraped. For example, you might want to scrape only URLs from the same base domain (http://sports.sohu.com). To do this, add a condition to the update of the set of URLs to visit:

>>> urls_to_visit.update(href for href in scraped_item.get("hrefs", [])
>>>                      if href not in visited_urls
>>>                      and href.startswith('http://sports.sohu.com'))
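The startswith check is the simplest option. If you prefer to compare hosts explicitly (for example, to avoid matching an unrelated URL such as http://sports.sohu.com.example.org), a slightly more robust variant of the same condition, sketched here with the standard library's urllib.parse, could look like this:

>>> from urllib.parse import urlsplit

>>> BASE_HOST = urlsplit(SEED_URL).netloc   # 'sports.sohu.com'

>>> urls_to_visit.update(href for href in scraped_item.get("hrefs", [])
>>>                      if href not in visited_urls
>>>                      and urlsplit(href).netloc == BASE_HOST)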

The method illustrated in this tutorial is recommended for collecting smaller data samples. If you need larger datasets, please consider downloading our ready-to-use web data or submitting a crawl request, which we will carry out for you.