A look at AirBnB demographics

Once in a while I use AirBnB. There are a couple of features that I (intuitively) use to judge whether an apartment is safe to book: ratings, images of the flat and the user avatar. Apparently, these avatars play an important part in the overall service and usage of AirBnB. A recent study finds that “Attractive Airbnb hosts are more likely to get bookings, even with bad reviews”.

Read More

Is Online AD Space a Commodity?

I just started reading Alvin Roth’s book “Who gets What - And why?” and it already got me thinking. The book discusses the principles of markets and market design using various examples. One starting point is the transition of markets into commodity markets. Simplified: in a commodity market all products sold are equal (think: stock markets), hence price is the only relevant criterion for the buyer/seller. Roth illustrates the transition of a product into a commodity with the (Ethiopian) coffee bean market.

Read More

Arbitrage in Euro'16 soccer odds?

With the Euro’16 coming up in 2 weeks, I thought it would be great to look up the odds for each team. Using a small R script, I got the data from this site. As previously discussed, I cleaned the data (calculated the probability from the odds and then normalized the probabilities to account for the bookmaker’s margin).
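The cleaning step can be sketched roughly as follows. This is a minimal Python sketch, not the R script used for the post. It assumes decimal odds and a simple proportional normalization; the team names and odds are hypothetical and only illustrate the arithmetic.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Assumes decimal odds (payout per unit staked) and removes the
    bookmaker's margin by proportional scaling, one common convention.
    """
    # Raw implied probability of each outcome is 1 / odds.
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    # The raw values sum to more than 1; the excess is the margin.
    overround = sum(raw.values())
    normalized = {team: p / overround for team, p in raw.items()}
    return normalized, overround


# Hypothetical odds for illustration only, not real Euro'16 quotes.
example_odds = {"Team A": 2.0, "Team B": 4.0, "Team C": 5.0, "Team D": 10.0}
probs, overround = implied_probabilities(example_odds)

for team, p in probs.items():
    print(f"{team}: {p:.3f}")
print(f"sum of raw probabilities: {overround:.3f}")
```

In this made-up example the raw probabilities sum to 1.05, so each one is divided by 1.05 to obtain shares that add up to 1.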

Read More

RoogleVision released - a Package for Image Recognition

First, the naming: it basically is an arbitrary condensation of “R + Google Cloud Vision API”. I wonder why Google chooses to mix Google with vision. In my opinion it sounds pretty much like “to goggle with vision”, which makes limited sense. As for the functionality: the package enables convenient image recognition, object detection, and OCR using Google’s Cloud Vision API. More precisely, the user can pick between the following image recognition modes: FACE_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, LABEL_DETECTION, TEXT_DETECTION. Without further ado, here is how you get started:

Read More

Who is going down? Bundesliga Betting Odds - updated

A while ago, I wrote about soccer odds in Germany. Specifically, I wrote about the odds of relegation for two local teams: SV Darmstadt and Eintracht Frankfurt.

As the season has progressed rather negatively, the question is still relevant. Let’s have a quick look at the current league table.

Read More

Image Recognition and Object Detection with R/Shiny and Google Vision

Image recognition and object detection have been around for some years. However, usage and adoption were limited due to quality and ease of development. With the release of Microsoft’s Project Oxford and Google’s Vision API, accessibility and applicability have massively improved. Both APIs use REST access and provide an excellent opportunity for the average developer to augment their apps with fancy, state-of-the-art machine learning features. In a previous post I discussed Microsoft’s offering. In this post I give Google’s Vision API a shot, especially the object detection functionality.

Read More

From Image Recognition to Brand Logo Detection

I previously did a short review of Microsoft’s image recognition and face detection API. A couple of weeks ago Google announced their Vision API, which provides some similar features. Even though there is no R package or code to dive into this API, and their API documentation is rather sparse, I thought it could be fun and inspiring to give it a try.

Read More

Web-Scraping JavaScript rendered Sites

Gathering data from the web is one of the key tasks in generating easy data-driven insights into various topics. Thanks to the fantastic rvest R package, web scraping is pretty straightforward. It basically works like this: go to a website, find the right items using the selector gadget and plug the element path into your R code. There are various great tutorials on how to do that (e.g. 1, 2).

Read More

Revisiting Data-driven Marketing, part III

In the last two posts (1, 2), I tried to discuss how measurement and false metrics drive optimization towards low-hanging fruit and in the end degrade ad effectiveness. I would like to follow up with a short example of how the issue extends into the paid search (e.g. Google Adwords) channel.

Read More

Revisiting Data-driven Marketing, part II

In the last post, I discussed how the current digital measurement approach is biased towards targeted ad buying. The key reason is that ad effectiveness is calculated on a cost per order/conversion basis. As particular user segments, which are addressed with digital targeting, have a high base purchase probability, these segments look more responsive. Due to insufficient experimentation and a lack of proper measurement, the budget optimization tries to reach users who are likely to order/“convert” anyway. Hence, it ignores incremental effects and instead focuses on overall purchase probabilities, mixing base purchase probability and ad-driven incremental effects.
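A small worked example may make the difference concrete. The following Python sketch uses entirely hypothetical numbers: two segments with equal ad spend, where the orders without ads are assumed to come from a holdout group that was not shown the ad. None of the figures are taken from the posts.

```python
# Hypothetical figures for illustration only.
# "orders_without_ad" is assumed to be measured in a holdout group.
segments = {
    "targeted": {"spend": 100.0, "orders_with_ad": 110, "orders_without_ad": 100},
    "broad": {"spend": 100.0, "orders_with_ad": 30, "orders_without_ad": 10},
}


def cost_per_order(segment):
    """Spend divided by all orders, the metric criticized in the post."""
    return segment["spend"] / segment["orders_with_ad"]


def cost_per_incremental_order(segment):
    """Spend divided by the orders that the ad actually added."""
    incremental = segment["orders_with_ad"] - segment["orders_without_ad"]
    return segment["spend"] / incremental


for name, segment in segments.items():
    print(
        f"{name}: cost per order {cost_per_order(segment):.2f}, "
        f"cost per incremental order {cost_per_incremental_order(segment):.2f}"
    )
```

With these made-up numbers the targeted segment looks cheaper per order, yet the broad segment is cheaper per incremental order, which is the bias described above.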

Read More

Revisiting Data-driven Marketing

One of the key trends in the advertising industry is (digital) data-driven marketing. The whole thing starts with massive, passive data collection. No matter which website we visit or which app we use: we leave a digital footprint. These footprints are compiled for individual users and form so-called user profiles. A set of similar profiles is then aggregated into user segments, which aim at describing a homogeneous set of users with similar preferences and interests. In order to improve marketing activities, companies use these user segments to show advertisements that fit the users’ interests. To be more concrete: users visiting car sites are more likely to see BMW or Daimler-Benz ads.

Read More

On Panel Sizes

In the face of upcoming elections in the US and in Germany, polling is big news. One thing that strikes me as sorely missing in the debate is how inaccurate a single poll is. Moreover, one never reads about the uncertainty around a single poll. What is the range of expected outcomes, or at least a confidence interval? I am by no means an expert on polling, but I thought some simulations would help me interpret the headline figures more accurately. In Germany, polling institutes usually ask around 1000 individuals to get a representative view. Using exemplary vote shares of (40, 30, 15, 10, 5)%, I simulated 100 polls and potential outcomes.
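A simulation along these lines could look roughly like the sketch below. It is written in Python rather than taken from the post, and it assumes simple random sampling, where each respondent independently picks a party with the stated probabilities. The party labels and the random seed are arbitrary.

```python
import random
from collections import Counter


def simulate_poll(shares, n_respondents, rng):
    """Draw one poll and return the observed vote shares in percent."""
    parties = list(shares)
    weights = [shares[party] for party in parties]
    # Each respondent picks one party with probability proportional to its share.
    votes = rng.choices(parties, weights=weights, k=n_respondents)
    counts = Counter(votes)
    return {party: 100.0 * counts[party] / n_respondents for party in parties}


# Vote shares from the text; the party labels A to E are placeholders.
true_shares = {"A": 40, "B": 30, "C": 15, "D": 10, "E": 5}
rng = random.Random(42)  # fixed seed so that repeated runs match
polls = [simulate_poll(true_shares, 1000, rng) for _ in range(100)]

for party, share in true_shares.items():
    results = [poll[party] for poll in polls]
    print(f"{party}: true {share}%, simulated {min(results):.1f}% to {max(results):.1f}%")
```

As a rough cross-check under the same sampling assumption, the normal approximation gives a standard error of about 1.5 percentage points for a party at 40% with 1000 respondents, so a 95% interval spans roughly 3 points on either side.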

Read More

Artificial Intelligence in the News?

In a previous post, I expressed the feeling that artificial intelligence, data science and big data are currently “hot” in the news. One of my favorite news outlets (“Die Zeit”) has an open data policy in the sense that they have a public API. I thought it would be worth checking this feeling. I should mention that “Die Zeit” is not very technical or economics/finance oriented. Without further ado, here it is:

Read More

Who is going down? Bundesliga Betting Odds

An essential part of the typical office talk in Germany is about soccer and the Bundesliga. One of the current key questions is: which team will be relegated? The two local teams (SV Darmstadt and Eintracht Frankfurt) are (hot) candidates.

While I love the banter, let’s be data-driven and have a close look at the current odds. I wrote a small R script to get the bookmakers’ data on this topic from this site.

Read More

Image Recognition and Face Detection

Image recognition and face detection have been around for some years. However, usage and adoption were limited due to quality and ease of development. With the release of Microsoft’s Project Oxford, access to such tools has massively improved. Its simple-to-use REST API provides an excellent opportunity for the average developer to augment their apps with fancy, state-of-the-art machine learning features.

Read More

Analyzing Job Postings; A Cross Country Comparison

Writing my first post about the German data scientist job market, I was surprised to find such a low number of open positions. The scripts are easy to extend to a cross-country comparison. To simplify the analysis, this time I focused on the Indeed website only.

As a baseline, I scanned the websites for the number of open positions which contain the term “Excel”.

Read More

Data related Job Requirements

Previously, I wrote about the German data science job market. The scripts are easy to extend to a data-related job requirement analysis. As before, I scraped the websites Monster, Indeed and Stepstone for certain keywords. Compared to the previous analysis, this time I just look at the number of listed positions. To give an example, the keyword search for “SAS” returned

Read More

The German Data Scientist Job Market

There are plenty of news stories about the increased importance of (big) data, data analysis, and machine learning for business success. Is this all cheap talk, or how big is the job market for data scientists in Germany?

To get some facts, I scraped all available job openings from the sites: Monster, Stepstone and Indeed searching for the term “Data Scientist”.

Altogether I found 137 open positions from 58 companies.

Read More

Analyzing "Brand Ticker" data

The Brand Ticker provides brand-specific marketing data. On a daily basis they list the brand value and top 3 associations gathered from news mentions and social data. They provide no documentation on how they process the data or which sources they use.

Here is a short look at their service. I pulled half a year’s data at irregular intervals from their API using a small R script. See this Gist.

Because of some recent attention, we start the analysis by looking at the Volkswagen (VW) data.

Read More

A Note on Data Leakage


We live in a connected world. No matter which website we visit, which app we use and which people we interact with: we leave a digital footprint.
Day by day, more behavioral data is created, and it often makes using the internet more comfortable.

Read More

Hello World

Well, starting with good intentions, this place hopefully helps me to document some minor projects and thoughts on prediction markets, data analysis and R programming.

Read More