For every data science project, you have to get your data from somewhere. In some cases you get preprocessed data (think of Kaggle competitions), in other cases you search for existing data sources in formats like CSV, JSON or Excel (think of data.gov and similar portals), and sometimes you have to collect the data yourself.
From my point of view, the last situation is the most interesting one, because you can work on a topic that nobody else has analysed before. In many of these situations you do not want to scrape all content from one data source at once; instead, you have to monitor a system over a long time. Examples are systems where you want to display the latest data up to the current point in time, such as prices or market trends.
To set up such systems reliably, you need to consider several aspects:
- You should not crawl the website too fast, otherwise you might be blocked
- You should always include as much information as possible, in case you later want to analyse more fields than you thought about beforehand
- You need to constantly monitor if your system still works
Crawling slowly enough to avoid being blocked is not a very difficult problem for a long-term data collection project. It is much more challenging if you want to retrieve several months or years of historical data at once.
If you only crawl the data that gets added each day, your access pattern might not differ much from that of a regular visitor. However, you should still take care not to crawl the website too fast. When the total number of requests is small, I like to have a delay of one or more seconds between requests.
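A fixed delay between requests can be sketched in a few lines of plain Python. The `Throttle` class and its parameter names are made up for this illustration; Scrapy users can get a similar effect with the `DOWNLOAD_DELAY` setting instead.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative helper)."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Block until min_delay_seconds have passed since the previous call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

Calling `throttle.wait()` right before each request is enough; the first call returns immediately and later calls only sleep for the part of the delay that has not already passed.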
Save Source Code
You should always store the complete raw data on your hard disk, because later you will have more ideas about what information you could extract from it. Imagine you are crawling newspaper articles and you parsed the publish date, the author and the category of each article. Now you realize that it would be interesting to count word occurrences in titles, but you did not store the titles. This would mean you have to re-crawl the whole website - unless you stored the source code of each crawled article.
You should only refrain from storing the complete raw data if you are not allowed to store it (e.g. because you would be illegally copying copyright-protected data).
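As a minimal sketch of this idea, the page source can be written to disk unparsed, one compressed file per URL. The function names and the hashed file naming scheme are assumptions made for this example, not a description of any particular tool.

```python
import gzip
import hashlib
from pathlib import Path


def save_raw_page(url, html, root_dir):
    """Store the unparsed page source under a file name derived from its URL."""
    root = Path(root_dir)
    root.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the file name is filesystem-safe and stable across runs.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html.gz"
    path = root / name
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(html)
    return path


def load_raw_page(path):
    """Read a stored page back, e.g. to extract a field you skipped earlier."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

With the raw pages on disk, adding a new field such as the title only requires re-parsing the stored files, not another crawl.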
Monitor the Availability of your Process
You always want to know whether your scraping process still works. It will fail at some point, and you want to know about it immediately. Thus, you have to plan for this situation beforehand: you need some system that monitors whether your process works or has failed.
One way to achieve this is to check the number of scraped items per time interval (e.g. per day, if you are OK with learning about an issue only after a day) and to send an alert if that number falls below a certain limit.
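The item-count check could look roughly like the sketch below. It assumes you record a timestamp for every scraped item; the function name and the threshold are hypothetical, and sending the actual alert is left out.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def days_below_threshold(scraped_at, first_day, last_day, minimum_per_day):
    """Return the days in [first_day, last_day] with too few scraped items."""
    per_day = Counter(ts.date() for ts in scraped_at)
    low_days = []
    day = first_day
    while day <= last_day:
        # A Counter yields 0 for missing keys, so days without any item are caught too.
        if per_day[day] < minimum_per_day:
            low_days.append(day)
        day += timedelta(days=1)
    return low_days


# Example input: two items on 1 January, none on the 2nd, one on the 3rd.
example_timestamps = [
    datetime(2024, 1, 1, 8),
    datetime(2024, 1, 1, 9),
    datetime(2024, 1, 3, 10),
]
low = days_below_threshold(example_timestamps, date(2024, 1, 1), date(2024, 1, 3), 2)
```

Any day returned by the function would be a reason to send an alert.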
Another possibility is to monitor the run times of your scraper. If you expect your scraper to run once per hour, you should check that it really does. If it has not run for two hours, you should get an alert immediately. This method is not a full replacement for checking the number of scraped items, however, because your scraper might run without being able to gather any data (e.g. because the website has blocked it).
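The run-time check boils down to comparing the time of the last recorded run with the expected interval. A minimal sketch, with a hypothetical function name and an optional grace period:

```python
from datetime import datetime, timedelta


def is_overdue(last_run, now, expected_interval, grace=timedelta(0)):
    """Return True if no run was recorded within expected_interval plus grace."""
    return now - last_run > expected_interval + grace


# An hourly scraper that last ran at 10:00 is checked at 12:30.
overdue = is_overdue(
    last_run=datetime(2024, 1, 1, 10, 0),
    now=datetime(2024, 1, 1, 12, 30),
    expected_interval=timedelta(hours=1),
    grace=timedelta(hours=1),
)
```

The check itself has to run somewhere other than inside the scraper, otherwise it stops together with the process it is supposed to watch.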
A System for Collecting Newspaper Information
Inspired by SpiegelMining, I wanted to do something similar for other newspapers and other countries. So I started a scraping process for Austrian and Croatian newspapers. This project only gets interesting once you have enough data, because only then can you start to see trends. Thus, it is a perfect project for the points mentioned above.
I implemented my scraping process with scrapy. A few times per hour I visit the front pages of several newspapers and check for new articles. New articles get crawled and I store the website content in an S3 bucket. Articles that have already been crawled are not crawled again, because I also keep a list of crawled articles in a DynamoDB database.
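The deduplication step can be illustrated without any AWS code. In the sketch below a plain Python set stands in for the DynamoDB table, and the function name is made up for this example.

```python
def select_new_articles(front_page_urls, seen_urls):
    """Return the URLs that were not crawled before and record them as seen.

    seen_urls is a set here; in the real system this lookup would go
    against the table of already crawled articles.
    """
    new_urls = []
    for url in front_page_urls:
        if url not in seen_urls:
            seen_urls.add(url)
            new_urls.append(url)
    return new_urls
```

Note that this version marks a URL as seen before it has actually been downloaded. If a download can fail, it is safer to record the URL only after the content has been stored.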
Each month I compress all articles of that month from all newspapers into one gzip archive. This reduces storage costs a bit (they are not high for my volumes, but I like to think in terms of scalability) and it is vital for the performance of the analysis step that follows.
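As a rough local illustration of the monthly archiving step, the following sketch packs all files of a directory into one gzip-compressed tar archive. Using a tar container and local directories instead of S3 are assumptions for this example, and the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def archive_month(source_dir, archive_path):
    """Pack every file in source_dir into one gzip-compressed tar archive."""
    source = Path(source_dir)
    files = sorted(p for p in source.iterdir() if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # arcname drops the local directory prefix inside the archive.
            archive.add(path, arcname=path.name)
    return len(files)
```

The return value is the number of archived files, which is handy for a sanity check before the originals are deleted.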
Only then do I start to actually parse the data and extract the features I am interested in. For example, newspapers maintain a list of categories, and we can extract the category of each article. Each article also has a publish date, and most articles have an author (either a person or an agency). I extract this information and re-combine it into one file per newspaper with a map-reduce job.
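The re-combination into one file per newspaper can be sketched as a small group-by in plain Python. This is not the actual map-reduce job; the record fields (`newspaper`, `category` and so on) and the JSON Lines output format are assumptions for this example.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_newspaper(records):
    """Group extracted article records by their 'newspaper' field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["newspaper"]].append(record)
    return dict(grouped)


def write_per_newspaper(records, output_dir):
    """Write one JSON Lines file per newspaper and return the created paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for newspaper, rows in group_by_newspaper(records).items():
        path = out / f"{newspaper}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        written[newspaper] = path
    return written
```

Each resulting file holds one JSON object per line, which is easy to load in a notebook later on.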
All the steps up to compressing the original files happen automatically; only the data extraction is triggered by me on demand. I then use the resulting files for further analysis in Jupyter Notebooks.
To keep track of the crawler's health I currently use the service healthchecks.io, which works quite well. I sometimes hit the limits of my server, which means that my docker containers get into trouble. In these cases I have always received mails from healthchecks.io.
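For illustration, reporting a run to healthchecks.io could look roughly like the sketch below. It assumes the service's ping URL scheme (`https://hc-ping.com/<uuid>`, with `/fail` appended to signal a failure); the check UUID is a placeholder you get when creating a check, and the helper names are made up for this example.

```python
import urllib.request

PING_BASE = "https://hc-ping.com"


def ping_url(check_uuid, failed=False):
    """Build the ping URL for a check; '/fail' reports a failed run."""
    url = f"{PING_BASE}/{check_uuid}"
    return url + "/fail" if failed else url


def report(check_uuid, failed=False, opener=urllib.request.urlopen):
    """Send the ping and return the HTTP status code.

    opener is injectable so the function can be tested without network access.
    """
    with opener(ping_url(check_uuid, failed), timeout=10) as response:
        return response.status
```

The scraper would call `report(...)` at the end of every successful run. If the pings stop arriving, the service notices the missing signal and sends the alert.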
I currently also plan to extend this system to other situations where I’d like to collect some data over a longer period of time.