Predominantly, data science projects deal with descriptive statistics. The common theme (especially on this blog) is to gather a dataset, then visualize and describe it. The toolset consists of a combination of machine learning, descriptive statistics and (gg-)plots. This time I want to go a step further: from descriptive to prescriptive analytics. The goal is to optimize a fantasy football team. To be more precise, the task at hand is to select a set of players while keeping within a budget (i.e. a typical knapsack problem). For that, I first gathered some fantasy football data from comunio.de.
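To illustrate the knapsack structure of the selection problem, here is a minimal brute-force sketch with made-up players, prices and expected points (the real data comes from comunio.de and the real line-up is larger): pick exactly 3 players, stay within the budget, and maximize expected points.

```r
# Hypothetical players: price in million Euro, expected fantasy points
players <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  price  = c(8, 6, 5, 4, 3),
  points = c(90, 75, 60, 55, 40)
)
budget <- 15

best <- NULL
best_points <- -Inf
# Enumerate all 3-player combinations and keep the best feasible one
for (idx in combn(nrow(players), 3, simplify = FALSE)) {
  team <- players[idx, ]
  if (sum(team$price) <= budget && sum(team$points) > best_points) {
    best <- team
    best_points <- sum(team$points)
  }
}
best$name   # optimal line-up under the toy budget
```

For a realistically sized squad, full enumeration explodes; an integer-programming package such as lpSolve handles the same constraints efficiently.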
Currently the news is filled with articles about the rise of machine intelligence, artificial intelligence and deep learning. To the average reader it may seem that a single technical breakthrough made AI possible. While I strongly believe in the fascinating opportunities around deep learning for image recognition, natural language processing and even end-to-end “intelligent” systems (e.g. chat bots), I wanted to get a better feeling for the recent technological progress.
A while back, I created the small package Roxford to access Microsoft’s Cognitive Services API in order to easily recognize objects in images. Back then, Microsoft called the service “Project Oxford”, hence the name “Roxford”. Since then, Microsoft has extended the API to include image tagging, description and celebrity detection. In the following post, I will try to illustrate the functionality and how it is called through the package. To install the package, just follow the guide.
Themes are a convenient way to give ggplot charts an individualized, sometimes stylish look. Most of the time, I rely on the ggthemes package and its Economist style. Last week, colleagues asked me to change the look of my charts. We joked around, and I agreed to create a unicorn ggplot theme. I want to use the challenge to detail a) how to create custom ggplot themes and b) to take a look at unicorn startup data.
In the last post, I mapped gas stations and gas prices in Germany. After posting it, I started to look at the dataset from a different angle. The starting questions were: “How can I model gas prices? What are the influencing factors?” One well-known fact is that certain gas station brands demand higher prices.
One of the most appealing forms of data visualisation is the map. I love maps, as they combine incredible information density with intuitive readability. I also feel that most people prefer maps over other visualisations. (Is there research on this?) So it is time to get R-map-ready.
As a toy example, I downloaded all German gas stations next to the “Autobahn”. Along with the names, I got the exact locations (in the form of latitude/longitude) and the price of gasoline at each station. (Prices are in Euro and were taken on a Friday night within a time span of roughly 30 minutes.) For starters, we just plot all gas stations on a map and color them depending on their price for (super e5) gasoline.
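The plotting step can be sketched with ggplot2 as follows; the coordinates and prices here are made up for illustration, while the real post uses the scraped station data.

```r
library(ggplot2)

# Hypothetical sample of stations (lon/lat/price invented for this sketch)
stations <- data.frame(
  lon   = c(8.68, 11.58, 13.40, 9.99),
  lat   = c(50.11, 48.14, 52.52, 53.55),
  price = c(1.39, 1.45, 1.36, 1.42)   # Euro per litre, super e5
)

# Points on a lon/lat plane, colored by price; coord_quickmap keeps
# the aspect ratio roughly map-like without a full projection
p <- ggplot(stations, aes(x = lon, y = lat, colour = price)) +
  geom_point(size = 3) +
  scale_colour_gradient(low = "darkgreen", high = "red") +
  coord_quickmap() +
  labs(title = "Gas stations by e5 price", colour = "EUR/l")
p
```

For an actual country outline underneath, one would add a basemap layer (e.g. via the maps or ggmap packages) before the points.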
Digital transformation, or digital business transformation, is apparently one of the current hot topics in the German business world. What puzzles me slightly is: why just now? The digitization trend has been around for ~20 years. Established business models have been destroyed or massively changed by this trend over the last 15 years, e.g. the music industry around 2000 with the start of Napster, and the camera sector twice: first with the introduction of digital cameras and now with mobile phones taking their place. Other often-named categories are the retail (Amazon), taxi (Uber) and movie (Netflix) businesses.
Youtube is one of the channels the candidates in the US election use extensively to promote themselves. Using the public Youtube API and the R package tuber, it is pretty straightforward to create a snapshot of the online discussion and sentiment.
Once in a while I use AirBnB. There are a couple of features that I (intuitively) use to judge whether an apartment is safe to book: ratings, images of the flat and the user avatar. Apparently, these avatars play an important part in the overall service and usage of AirBnB. A recent study finds that “Attractive Airbnb hosts are more likely to get bookings, even with bad reviews”.
I just started reading Alvin Roth’s book “Who Gets What - and Why?” and it already got me thinking. The book discusses the principles of markets and market design using various examples. One starting point is the transition of markets into commodity markets. Simplified: in a commodity market, all products sold are equal (think: stock markets), hence the price is the only relevant criterion for the buyer/seller. Roth exemplifies the transition of products into commodities with the (Ethiopian) coffee bean market.
With the Euro ’16 coming up in 2 weeks, I thought it would be great to look up the odds for each team. Using a small R script, I got the data from this site. As previously discussed, I cleaned the data (calculating the implied probability from each team’s odds and then normalizing the probabilities to account for the bookmaker’s margin).
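The odds-to-probability step is simple enough to sketch in a few lines of base R; the odds below are invented placeholders, not the scraped values.

```r
# Toy decimal betting odds (made up); real data came from the bookmaker site
odds <- c(Germany = 4.5, France = 5.0, Spain = 6.0, England = 8.0)

# Implied probabilities: over the full set of outcomes these sum to more
# than 1, which is the bookmaker's built-in margin
p_raw <- 1 / odds

# Normalize so the probabilities sum to 1
p <- p_raw / sum(p_raw)

round(p, 3)
```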
First, to the naming: it is basically an arbitrary condensation of “R + Google Cloud Vision API”. I wonder why Google chose to mix “Google” with “Vision”; in my opinion, it sounds pretty much like “to goggle with vision”, which makes limited sense. As for the functionality: the package enables convenient image recognition, object detection and OCR using Google’s Cloud Vision API. More precisely, the user can pick between the following image recognition modes: FACE_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, LABEL_DETECTION, TEXT_DETECTION. Without further ado, here is how you get started:
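A getting-started sketch along these lines (the function name reflects the RoogleVision package as published on GitHub; the client id/secret are placeholders for your own Google Cloud OAuth credentials, and the image path is hypothetical):

```r
# Install from GitHub and authenticate; requires a Google Cloud project
# with the Vision API enabled
# devtools::install_github("flovv/RoogleVision")
library(RoogleVision)
library(googleAuthR)

# Placeholder credentials -- replace with your own OAuth client
options(googleAuthR.client_id     = "xxx.apps.googleusercontent.com",
        googleAuthR.client_secret = "yyy")
gar_auth()

# Run label detection on a local image (path is illustrative)
res <- getGoogleVisionResponse("image.jpg", feature = "LABEL_DETECTION")
res
```

Swapping the `feature` argument between the modes listed above switches the recognition task.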
A while ago, I wrote about soccer odds in Germany. Specifically I wrote about the odds of relegation for two local teams; SV Darmstadt and Eintracht Frankfurt.
As the season has progressed in a rather negative sense, the question is still relevant. Let’s have a quick look at the current table.
Image recognition and object detection have been around for some years. However, usage and adoption were limited due to quality and the ease of development. With the release of Microsoft’s Project Oxford and Google’s Vision API, accessibility and applicability have massively improved. Both are accessed via REST APIs and provide an excellent opportunity for the average developer to augment their apps with fancy, state-of-the-art machine learning features. In a previous post, I discussed Microsoft’s offering. In this post, I give Google’s Vision API a shot, looking especially at the object detection functionality.
I previously did a short review of Microsoft’s image recognition and face detection API. A couple of weeks ago, Google announced their Vision API, providing some similar features. Even though there is no R package or code to dive into this API, and the API documentation is rather sparse, I thought it could be fun and inspiring to give it a try.
Gathering data from the web is one of the key tasks for generating easy data-driven insights into various topics. Thanks to the fantastic rvest R package, web scraping is pretty straightforward. It basically works like this: go to a website, find the right items using the SelectorGadget, and plug the element path into your R code. There are various great tutorials on how to do that (e.g. 1, 2).
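The three steps above can be sketched as follows. To keep the example self-contained, the HTML is inlined via `minimal_html()`; in a real scrape, `read_html()` would fetch a live URL and the CSS selectors (here `.station` and `.price`, invented for this sketch) would come from the SelectorGadget.

```r
library(rvest)

# Stand-in for read_html("https://example.com"): a tiny inline page
page <- minimal_html('
  <table>
    <tr><td class="station">Aral</td><td class="price">1.39</td></tr>
    <tr><td class="station">Shell</td><td class="price">1.42</td></tr>
  </table>')

# Pick elements by CSS selector and extract their text
stations <- page %>% html_elements(".station") %>% html_text2()
prices   <- page %>% html_elements(".price")   %>% html_text2() %>% as.numeric()

data.frame(stations, prices)
```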
In the last two posts (1, 2), I discussed how measurement and false metrics drive optimization towards low-hanging fruit and, in the end, degrade ad effectiveness. I would like to follow up with a short example of how the issue extends to the paid search (e.g. Google AdWords) channel.
In the last post, I discussed how the current digital measurement approach is biased towards targeted ad buying. The key reason is that ad effectiveness is calculated on a cost-per-order/conversion basis. As particular user segments (which are addressed with digital targeting) have a high base purchase probability, these segments look more responsive. Due to insufficient experimentation and a lack of proper measurement, budget optimization tries to reach users who are likely to order/“convert” anyway. Hence, it ignores incremental effects and instead focuses on overall purchase probabilities, mixing base purchase probability with ad-driven incremental effects.
One of the key trends in the advertising industry is (digital) data-driven marketing. The whole thing starts with massive, passive data collection. No matter which website we visit or which app we use: We leave a digital footprint. These footprints are compiled for individual users and form so-called user profiles. A set of similar profiles is then aggregated into user segments, which aim to describe a homogeneous set of users with similar preferences and interests. In order to improve marketing activities, companies use these user segments to show them advertisements fitting the users’ interests. To be more concrete: users visiting car sites are more likely to see BMW or Daimler-Benz ads.
In the face of upcoming elections in the US and in Germany, polling is big news. One thing that strikes me as conspicuously missing from the debate is how inaccurate a single poll is. Moreover, one never reads about the uncertainty around a single poll: what is the range of expected outcomes, or at least a confidence interval? I am by no means an expert on polling, but I thought some simulations would help me interpret the headline figures more accurately. In Germany, polling institutes typically ask around 1000 individuals to get a representative view. Using exemplary vote shares of (40, 30, 15, 10, 5)%, I simulated 100 polls and their potential outcomes.
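The simulation itself fits in a few lines of base R: each poll is a multinomial draw of 1000 respondents from the assumed “true” vote shares.

```r
set.seed(42)

true_share <- c(0.40, 0.30, 0.15, 0.10, 0.05)  # exemplary vote shares
n_voters   <- 1000                              # typical German poll size
n_polls    <- 100

# Each column is one simulated poll: respondent counts per party,
# converted to measured vote shares
sims <- rmultinom(n_polls, size = n_voters, prob = true_share) / n_voters

# Spread of the largest party's measured share across the 100 polls
range(sims[1, ])

# Per-party sampling noise, roughly sqrt(p * (1 - p) / n)
apply(sims, 1, sd)
```

Even with an honest 40% true share, individual polls easily land a couple of points above or below it, which is exactly the uncertainty the headlines omit.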
In a previous post, I expressed the feeling that artificial intelligence, data science and big data are currently “hot” in the news. One of my favorite news outlets (“Die Zeit”) has an open data policy in the sense that they provide a public API. I thought it would be worth checking my feeling. I should mention that “Die Zeit” is not very technical or economics/finance oriented. Without further ado, here it is:
An essential part of typical office talk in Germany is soccer and the Bundesliga. One of the current key questions is: which teams will be relegated? The two local teams (SV Darmstadt and Eintracht Frankfurt) are (hot) candidates.
Image recognition and face detection have been around for some years. However, usage and adoption were limited due to quality and the ease of development. With the release of Microsoft’s Project Oxford, the accessibility of such tools has massively improved. Their simple-to-use REST API provides an excellent opportunity for the average developer to augment their apps with fancy, state-of-the-art machine learning features.
Writing my first post about the German data scientist job market, I was surprised to find such a low number of open positions. The scripts are easy to extend to a cross-country comparison. To simplify the analysis, this time I focused on the Indeed website only.
As a baseline, I scanned the websites for the number of open positions which contain the term “Excel”.
Previously, I wrote about the German data science job market. The scripts are easy to extend to an analysis of data-related job requirements. As before, I scraped the websites Monster, Indeed and Stepstone for certain keywords. Compared to the previous analysis, this time I just look at the number of listed positions. To give an example, the keyword search for “SAS” returned
There are plenty of news stories about the increased importance of (big) data, data analysis, and machine learning for business success. Is this all cheap talk, or how big is the job market for data scientists in Germany?
To get some facts, I scraped all available job openings from the sites Monster, Stepstone and Indeed, searching for the term “Data Scientist”.
Altogether I found 137 open positions from
The Brand Ticker provides brand-specific marketing data. On a day-to-day basis, they list the brand value and the top 3 associations gathered from news mentions and social data. They provide no documentation on how they process the data or which sources they use.
Here is a short look at their service. I pulled half a year’s worth of data at irregular intervals from their API using a small R script. See this GIST.
Due to some recent attention we start the analysis by looking at the Volkswagen (VW) data.
We live in a connected world. No matter which website we visit, which app we use and which people we interact with: We leave a digital footprint.
Day by day, more behavioral data is created, and it often makes using the internet more comfortable.
Call a Bike is a service of the “Deutsche Bahn”, providing rental bikes for short trips, similar to citibikeNYC. I used it extensively for some time. Recently I found out that they provide individual trip data through their API. I pulled last year’s data from the “CallaBike” SOAP API.
So it looks like I did 403 trips using a Call a Bike in 2014.
Well, starting with good intentions, this place hopefully helps me to document some minor projects and thoughts on prediction markets, data analysis and R programming.