What is probabilistic truth?


I am currently working on a validation metric for binary prediction models, that is, models that make predictions about outcomes that can take either of two possible states (e.g., dead/not dead, heads/tails, cat in picture/no cat in picture). The most commonly used metric for this class of models is AUC, which summarizes the trade-off between error rates (false positives and false negatives) across the whole range of possible decision thresholds. Plotting these rates at every threshold gives a curve that looks something like this:


The curve itself is the Receiver Operating Characteristic (ROC) curve, and the area under it is some value between 0 and 1. The higher this value, the better your model is said to perform. The problem with this metric, as many authors have pointed out, is that a model can perform very well in terms of AUC, but be completely miscalibrated in terms of the actual probabilities placed on…
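To make the AUC-versus-calibration point concrete, here is a minimal Python sketch. It is not the metric the post goes on to develop: the toy labels and scores are invented for illustration, and the Brier score (mean squared error of the predicted probabilities) is used only as an assumed stand-in for a calibration-sensitive measure. AUC depends only on how the predictions are ranked, so a distortion that preserves the ranking leaves it unchanged even though the probabilities themselves become much worse.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Only the ordering of the scores matters.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical toy data: 1 = event happened, 0 = it did not.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
original = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# Rank-preserving distortion: every probability is pushed toward 0,
# but the ordering of the predictions stays exactly the same.
distorted = [p ** 4 for p in original]

auc_original = auc(labels, original)
auc_distorted = auc(labels, distorted)
brier_original = brier(labels, original)
brier_distorted = brier(labels, distorted)

print("AUC   original vs distorted:", auc_original, auc_distorted)
print("Brier original vs distorted:", brier_original, brier_distorted)
```

The two AUC values come out identical, while the Brier score of the distorted predictions is clearly worse. That gap is the kind of miscalibration a ranking-only metric cannot see.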



Sharing my R notes

TRinker's R Blog

I started working with R two and a half years ago. I remember opening R, closing it, and thinking it was the dumbest thing ever (a command line is not inviting to a non-programmer). Now it’s my constant friend. From the beginning I took notes to remind myself of all the things I learned and relearned. They’ve been invaluable to me in learning. They are not particularly well arranged, nor do they credit sources properly. There are likely bad or outdated practices in there, but I figured they may be helpful to others learning the language, so I’m sharing.

Note that:

1) they are poorly arranged
2) they may have mistakes
3) they don’t credit others’ work properly or at all

They were for me, but now I think others may find them useful, so here they are:

*Note that the file is large (~7000KB) and…


How analyzing Wikipedia page views could help you make money


Plenty of companies have been looking at software for analyzing large private data sets and combining them with external streams such as tweets to make predictions that could boost revenue or cut expenses. Walmart, for instance, has come up with a way for company buyers to cross-reference sales data with tweets about products and categories and thereby determine which products to stock. Here’s another possible data source to consider checking: Wikipedia.

No, this doesn’t mean a company that wants to predict the future should take a guess based on what a person or company’s Wikipedia page says. However, researchers have found value in page views on certain English-language Wikipedia pages. The results were published Wednesday in the online journal Scientific Reports.

The researchers looked at page views and edits for Wikipedia entries on public companies that are part of the Dow Jones Industrial Average, such…


Predicting Twitter popularity is all about probability


Tweets have the power to decimate markets, but they also have users and companies seeing dollar signs. With huge marketing, political, and social mobilization potential, how can you predict which tweets will get more views, and which retweets will go viral? A new study developed a statistical model that attempts to estimate the popularity of tweets, and thus how memes spread.

Starting with 52 “root” tweets from users both famous and obscure, the researchers first analyzed the dynamics of retweeting, like the speed and spread of a tweet from a user to followers and then their followers. The researchers, from the University of Washington, MIT, and Penn, used the Twitter API to collect all the retweet information and found that most retweets occurred within one hour of the original tweet. Not surprisingly, they also found that root tweets are retweeted more than the retweets themselves.

They then plugged the important…


Visualization startup Datahero opens its doors and delivers data analysis for the masses


When I first met Datahero Co-founder Chris Neumann a year ago, I was pretty excited about what he claimed his new company was going to do. Essentially, he told me, it was going to offer a simple, cloud-based data analysis and visualization service that anyone could use. About a month later, in late May, I got a demo of a very-early-stage Datahero and was impressed with the vision. On Tuesday, the company is officially opening its service to a public beta, and the more-finished product still strikes the right chord.

Before evaluating Datahero, though, it’s important to know what it’s not. Namely: it’s not enterprise software, it’s not even business intelligence software and it’s not designed for people who hope to run complex analyses. Neumann nicely summed up what Datahero is during a recent call: “We’re gonna make it usable by the masses,” he said, which means there are going…


Data Science for Social Good and Humanitarian Action


My (new) colleagues at the University of Chicago recently launched a new and exciting program called “Data Science for Social Good”. The program, which launches this summer, will bring together dozens of top-notch data scientists, computer scientists and social scientists to address major social challenges. Advisors for this initiative include Eric Schmidt (Google), Rayid Ghani (Obama Administration) and my very likable colleague Jake Porway (DataKind). Think of “Data Science for Social Good” as a “Code for America” but broader in scope and application. I’m excited to announce that QCRI is looking to collaborate with this important new program given the strong overlap with our Social Innovation Vision, Strategy and Projects.

My team and I at QCRI are hoping to mentor and engage fellows throughout the summer on key humanitarian & development projects we are working on in partnership with the United Nations, Red Cross, World Bank and…


How energy harvesting tech could power wearables and the internet of things


It’s all very well talking about the evolution of wearable computing and the internet of things, but something has to power these thin and/or tiny devices. For that reason, it’s a good thing that so many ideas are popping up in the field of energy harvesting and storage.

Some of these ideas were on display this week at the Printed Electronics Europe 2013 event in Berlin, which took in a variety of sub-events including the Energy Harvesting & Storage Europe show. The concepts ranged from the practical to the experimental, so let’s start with the practical.

Here’s Perpetuum’s Vibration Energy Harvester (VEH), being carried around (appropriately) on a model train.

Perpetuum train sensor

The VEH is a wireless sensor that gets attached to rotating components, such as wheel bearings, on trains. Cleverly, the device both measures and is powered by mechanical vibration. It also measures temperature, and it wirelessly transmits the results…


On big data, the Boston Marathon and civil liberties


For all the concerns over mobile phone logs, video footage and other data collection that could potentially be used to surveil American citizens, it’s times like this that I think we see their real value.

According to a Los Angeles Times article about Monday’s bomb attack at the Boston Marathon, the FBI has collected 10 terabytes of data that it’s sifting through for clues about what exactly happened and who did it. Maybe I’m just a techno-optimist, but I find this very reassuring.

According to the Times, “The data include call logs collected by cellphone towers along the marathon route and surveillance footage collected by city cameras, local businesses, gas stations, media outlets and spectators who volunteered to provide their videos and snap shots.”

Lots of data means lots of potential value

It’s reassuring because I’ve spoken with so many smart people over the years who can do amazing…


Blab predicts what people will tweet, blog and report on


It’s one thing to monitor social statements on Twitter and other social networks as they happen. It’s another thing to predict what will happen over the next three days.

Blab, a Seattle-based company, has emerged with a tool that lets companies do just that, with visualizations of where conversations will pop up across more than 50,000 sources, including Facebook, Tumblr, Twitter, YouTube, blogs and news outlets. It does this by paying close attention to where a conversation is now and then predicting where it will go based on which earlier conversations it resembles. For example, if people started talking about a previous Amazon Web Services outage on Twitter and then the conversation moved to blogs and then to mainstream media outlets, that same pattern could happen in the case of another AWS outage. That’s why measuring the trajectory of each conversation and storing it for…
