-
Analyzing the Common Crawl using Map-Reduce
Let’s analyze some real data using Map-Reduce. Common Crawl is a web crawl of the entire web run by a non-profit organization (though they seem to have sponsors paying for resources, and they’re even hiring employees). Their datasets are provided free of charge in a public S3 bucket. We will analyze the data using Hadoop (in my case on Amazon’s EMR). At first I tried to use Disco, but it took a lot of effort, and eventually I got stuck on a problem too hard to justify investing more time.
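To get a feel for where the data lives, here is a minimal sketch that lists a few crawl files anonymously with boto3; the bucket name commoncrawl and the crawl-data/ prefix are assumptions based on Common Crawl's public layout, so check their site for the current structure.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: the Common Crawl bucket is public, no credentials needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under the assumed crawl-data/ prefix.
response = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```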
-
Training your custom classifier with TensorFlow Inception image recognition
Just a few months ago, Google released code for classifying images using neural networks. Some time later, they also released code to train your own custom models, either from scratch or by improving a baseline model. The baseline model in that case is usually one trained on the ImageNet dataset.
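As an illustration of the underlying transfer-learning idea (not Google's retraining code itself), here is a hedged tf.keras sketch that reuses an ImageNet-trained InceptionV3 as a frozen feature extractor and trains only a new classification head; num_classes and train_ds are placeholders.

```python
import tensorflow as tf

num_classes = 5  # hypothetical: number of categories in your own dataset

# Baseline model trained on ImageNet, without its original classification layer.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # keep the pre-trained features fixed

# New classification head for the custom classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# train_ds is assumed to yield (image, label) batches with images resized to 299x299.
# model.fit(train_ds, epochs=5)
```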
-
Using Map-Reduce on Graphs
Map-Reduce seems to be the standard technology for working with large amounts of data these days. It is best known in combination with simple flat, table-like structures, perhaps because most beginner tutorials focus on those.
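To illustrate that graph problems fit the same model, here is a minimal, framework-free sketch in plain Python that computes node degrees from an edge list with a map step and a reduce step; the Hadoop or Disco versions differ only in the plumbing around these two functions.

```python
from collections import defaultdict

edges = [("a", "b"), ("a", "c"), ("b", "c")]  # toy undirected edge list

def map_edge(edge):
    u, v = edge
    # Emit one (node, 1) pair per endpoint, just as a mapper would.
    yield u, 1
    yield v, 1

def reduce_degrees(pairs):
    # Group by key and sum the counts, like a reducer after the shuffle phase.
    degrees = defaultdict(int)
    for node, count in pairs:
        degrees[node] += count
    return dict(degrees)

pairs = (pair for edge in edges for pair in map_edge(edge))
print(reduce_degrees(pairs))  # {'a': 2, 'b': 2, 'c': 2}
```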
-
Using CodeCommit with the Ubuntu AMI
Sometimes you might have to fetch your own Git repository from an EC2 instance running the Ubuntu AMI. To achieve this, you need a role that allows your EC2 instance to access the Git repository. So, in IAM, create a new role with the AWSCodeCommitReadOnly policy attached and a trust relationship for EC2.
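A minimal boto3 sketch of that setup, assuming a hypothetical role name CodeCommitReadRole; the attached policy is the AWS-managed AWSCodeCommitReadOnly policy, and the trust relationship lets EC2 assume the role.

```python
import json
import boto3

iam = boto3.client("iam")
role_name = "CodeCommitReadRole"  # hypothetical role name

# Trust relationship: allow EC2 instances to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the AWS-managed read-only CodeCommit policy.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AWSCodeCommitReadOnly",
)

# An instance profile is needed so the role can be attached to an EC2 instance.
iam.create_instance_profile(InstanceProfileName=role_name)
iam.add_role_to_instance_profile(InstanceProfileName=role_name, RoleName=role_name)
```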
-
Disco: An Alternative to Hadoop
Administering Hadoop is fairly cumbersome for a hobby user. Given what I have seen of Hadoop in my professional life so far, I had no desire to set up a test system at home that goes beyond a ready-made distribution.