Archive | May 2013

Big Data Industry Atlas

What's The Big Data?



Source:; Forbes

Note: Actian has recently acquired ParAccel and Pervasive Software

See also Top Ten Big Data Pure-Plays and Top 10 Most Funded Big Data Startups


View original post


Test Drive of Parallel Computing with R

Yet Another Blog in Statistical Computing

Today, I did a test run of parallel computing with snow and multicore packages in R and compared the parallelism with the single-thread lapply() function.

In the test code below, a data.frame with 20M rows is simulated in a Ubuntu VM with 8-core CPU and 10-G memory. As the baseline, lapply() function is employed to calculate the aggregation by groups. For the comparison purpose, parLapply() function in snow package and mclapply() in multicore package are also used to generate the identical aggregated data.

Below is the benchmark output. As shown, the parallel solution, e.g. SNOW or MULTICORE, is 3 times more efficient than the baseline solution, e.g. LAPPLY, in terms of user time.

In order to illustrate the CPU usage, multiple screenshots have also been taken to show the difference between parallelism and single-thread.

In the first screenshot, it is shown that only 1 out of 8 CPUs is used…

View original post 56 more words

What is probabilistic truth?


I am currently working on a validation metric for binary prediction models. That is, models which make predictions about outcomes that can take on either of two possible states (eg Dead/not dead, heads/tails, cat in picture/no cat in picture, etc.) The most commonly used metric for this class of models is AUC, which assesses the relative error rates (false positive, false negative) across the whole range of possible decision thresholds. The result is a curve that looks something like this:


Where the area under the curve (the curve itself is the Receiver Operator Curve (ROC)) is some value between 0 and 1. The higher this value, the better your model is said to perform. The problem with this metric, as many authors have pointed out, is that a model can perform very well in terms of AUC, but be completely miscalibrated in terms of the actual probabilities placed on…

View original post 273 more words

Sharing my R notes

TRinker's R Blog

I started working with R 2 1/2 years ago. I remember opening R closing it and thinking it was the dumbest thing ever (command line to a non programmer is not inviting). Now it’s my constant friend. From the beginning I took notes to remind myself all of the things I learned and relearned. They’ve been invaluable to me in learning. They are not particularly well arranged nor do they credit sources properly. There are likely bad or outdated practices in there but I figured they may be helpful to others learning the language and so I’m sharing.

Note that :

1) they are poorly arranged
2) they may have mistakes
3) they don’t credit others work properly or at all

They were for me but now I think maybe others will find them useful so here they are:

*Note that the file is larger ~7000KB and…

View original post 3 more words

How analzying Wikipedia page views could help you make money


Plenty of companies have been looking at software for analyzing private large data sets and combining it with external streams such as tweets to make predictions that could boost revenue or cut expenses. Walmart, (s wmt) for instance, has come up with a way for company buyers to cross sales data with tweets on products and categories on Twitter and thereby determine which products to stock. Here’s another possible data source to consider checking: Wikipedia.

No, this doesn’t mean a company that wants to predict the future should take a guess based on what a person or company’s Wikipedia page says. However, researchers have found value in page views on certain English-language Wikipedia pages. The results were published Wednesday in the online journal Scientific Reports.

The researchers looked at page views and edits for Wikipedia entries on public companies that are part of the Dow Jones Industrial Average, such…

View original post 510 more words