Posts Tagged ‘data’

BBC Business Daily – interview with Arthur Laffer

July 12, 2012

In other news, Art Laffer has become a parody of himself.

  • Deliberately misrepresenting the flat tax. Making it sound like 12% is a tax cut for most Americans, when in fact the effective tax rate for everyone making under $100k is already under 12%.
    The average effective tax rate on almost all Americans is already under 12%. 
  • If what you really want to do is raise taxes on the poor and cut taxes on the rich, why don’t you just come out and say it, Art?
  • Conflating simplicity of tax returns with elimination of tax brackets.
  • Poverty is caused by high tax rates and welfare. Quick, tell Somalia!
    Somalian GDP is $600 per person. Better tell them to stop paying out so much welfare because that is what's making them the #222 richest country. 
  • Brings up sin taxes as a distraction, like a magician’s trick.
  • No estimates of how much income is being “accelerated” from 2013 into 2012, only statements that the number is huge.
  • Of course when the tax rates rise on the rich, they’re all going to flee the US. Because ceteris paribus it’s in their interest to do so. All other things considered, the first thing I do every morning is ask what tax rates are on various activities I could engage in and various countries I could move to.
  • The sign is negative, therefore the magnitude is large.
  • Also austerity is equivalent to growth, although later on he contradicts that and says a lot of government spending is necessary. (Wonder whether he wants to cut health benefits, elderly benefits, military protection, or government jobs?)
  • Some government spending is wasteful (again no discussion of magnitude—.00001% or 10%?) therefore austerity.
  • Ricardian Equivalence, because I say so.
  • Twisted reasoning like “Poor people are poor because of disincentives to work, but rich people would pay all of their tax if only you didn’t ask so much of them.” However not going to prove any of that, it’s just “common sense”.
  • This passes for argument: “Come on, you know that.” (I count four.)
  • Because, incentives. Because, markets.

If you laugh and squeal while you say it, you’re right.


I would love to have a less cynical view of the world than that any yahoo who claims taxes can be vastly reduced on the rich with no negative consequences to anyone gets banquets in his honour, funding from “think tanks”, and handed the reins of policy. But this sh*t tests me.

Searching for something positive to say…at least he said the goal of government is to get poor people to be prosperous.

June 24, 2012


  • It’s easier for me to grok statistical significance (p’s and t’s) from a scatterplot than magnitude (β’s).
  • Even though magnitude can be the most important thing, it’s “hidden” off to the left. Note to self: look off to the left more, and for longer.
  • But I’m set up to understand the correlativeness in a sub_i, sub_j sense — which particular countries fit the pattern as well as how closely.


  • Minute __:__ Do each of the dimensions of social problems correlate individually, or is this only a mass effect of the combination?

If it’s true that raising marginal tax rates on the rich lowers crime rates without paying for any anti-crime programmes, that’s almost a free lunch.

UPDATE: Oh, hey, six months after I watch this and 3 days after I put up the story, I see Harvard Business Review has a story corroborating the same effect, instead pointing out how economists don’t look at the p’s and t’s on a regression table. I feel like I “mentally cross out” any lines with a low t value and then wonder about the F value on a regression with the “worthless” line removed.
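A minimal R sketch of the p's-and-t's versus β's distinction, on made-up data. The variables `x1`, `junk` and `y`, and every number below, are assumptions for illustration; none of it comes from the video or the HBR story. It fits a regression where the real effect is small but tightly estimated, then drops the low-t line and refits so the two F statistics can be compared.

```r
set.seed(42)
n    <- 50
x1   <- rnorm(n)                      # a regressor with a real (but small) effect
junk <- rnorm(n)                      # a regressor with no effect at all
y    <- 0.1 * x1 + rnorm(n, sd = 0.1) # true beta on x1 is only 0.1

full <- lm(y ~ x1 + junk)
coef(summary(full))                   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# "mentally cross out" the low-t line, then look at F again
reduced   <- lm(y ~ x1)
f.full    <- summary(full)$fstatistic["value"]
f.reduced <- summary(reduced)$fstatistic["value"]
```

In the coefficient table the Estimate column is the magnitude (the part hidden "off to the left") and the t value column is the part a scatterplot makes easy to see. Here the t on `x1` is large even though the β is small.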

June 19, 2012

Opportunity is fleeting. Experience is fallacious. Judgment is difficult.


May 23, 2012

It takes ~20 observations to verify your first significant digit of the mean with confidence.

Do you know how many observations it takes to verify your first sig-fig of the variance? More like 1000. And that’s just to get one digit of accuracy! Higher moments (skew, kurtosis) are even worse.

That’s why I often laugh out loud when I read in the newspaper claims that rely on a certain value of the variance. Even in serious, published papers!—I often see tables with estimates of standard deviation that go out to three decimal places, just because the software spat the numbers out that way. It gives a false sense of accuracy. It’s ridiculous.

Karen Kafadar
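A rough simulation of the point in the quote, under loud assumptions: normally distributed data with mean 10 and standard deviation 1, and "accuracy" read as the relative spread of the estimate across repeated samples. The helper `rel.spread` is a name invented here. The exact observation counts depend on the distribution and on what you mean by verifying a digit, so this only shows the direction and rough size of the gap.

```r
set.seed(1)

# relative spread (sd of the estimate / true value) of the sample mean and variance
rel.spread <- function(n, reps = 2000, mu = 10, sigma = 1) {
  sims <- replicate(reps, {
    x <- rnorm(n, mean = mu, sd = sigma)
    c(mean = mean(x), var = var(x))
  })
  c(mean = sd(sims["mean", ]) / mu,
    var  = sd(sims["var", ])  / sigma^2)
}

spread.20   <- rel.spread(20)
spread.1000 <- rel.spread(1000)

# for normal data the theoretical relative spread of the variance is sqrt(2 / (n - 1))
theory.20 <- sqrt(2 / 19)
```

With 20 observations the variance estimate wobbles by roughly a third of its true value in this setup, which is why its first digit is not yet pinned down.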

Infinite Data

May 17, 2012

Since people liked my last opinion piece on #big data, here’s another one.

Imagine there was a technology that allowed me to record the position of every atom in a small room, thereby generating some ridiculous amount of data (Avogadro’s number is 𝒪(10²³), so some prefix around that order of magnitude — eg yottabytes). And also imagine that there was a way for other scientists to decode and view all of that. (Maybe the latency and bandwidth can still be restricted even though neither capacity nor resolution nor fidelity nor coverage of the measurement are restricted — although that won’t be relevant to my thought experiment, it would seem “like today” where MapReduce is required.)

Let’s say I am running some behavioural economics experiment, because I like those. What fraction of the data am I going to make use of in building my model? I submit that the psychometric model might be exactly the same size as it is today. If I’m interested in decision theory then I’m going to be looking to verify/falsify some high-level hypothesis like “Expected utility” or “Hebbian learning”. The evidence for/against that idea is going to be so far above the atomic level, so far above the neuron level, I will basically still be looking at what I look at now:

  • Did the decisions they ended up making (measured by maybe 𝒪(100), maybe even 𝒪(1) numbers in a table) correspond to the theory?
  • For example if I draw out their assessment of the probability and some utility ranking then did I get them to violate that?

If I’ve recorded every atom in the room, then with some work I can get up to a coarser resolution and make myself an MRI. (Imagine working with tick-level stock data when you really are only interested in monthly price movements—but in 3-D.) (I guess I wrote myself into even more of a corner here, if we have atomic level data then it’s quantum, meaning you really have to do some work to get it to the fMRI scale!) But say I’ve gotten to fMRI level data, then what am I going to do with them? I don’t know how brains work. I could look up some theories of what lighting-up in different areas of the brain means (and what about 16-way dynamical correlations of messages passing between brain areas? I don’t think anatomy books have gotten there yet). So I would have all this fMRI data and basically not know what to do with it. I could start my next research project to look at numerically / mathematically obvious properties of this dataset, but that doesn’t seem like it would yield up a Master Answer of the Experiment because there’s no interplay between theories of the brain and trying different experiments to test it out — I’m just looking at “one single cross section” which is my one behavioural econ experiment. Might squeeze some juice but who knows.

Then let’s talk about people critiquing my research paper. I would post all the atomic-level data online of course, because that’s what Jesus would do. But would the people arguing against my paper be able to use that granular data effectively?

I don’t really think so. I think they would look at the very high level of 𝒪(100) or 𝒪(1) data that I mentioned before, where I would be looking.

  • They might argue about my interpretation of the numbers or statistical methods.
  • They might say that what I count as evidence doesn’t really count as evidence because my reasoning was bad.
  • They couldn’t argue that the experiment isn’t replicable because I imagined a perfect-fidelity machine here.
  • They could go one or two levels deeper and find that my experimental setup was imperfect—the administrator of the questions didn’t speak the questions in exactly the same tone of voice each time; her face was at a slightly different angle; she wore a different coloured shirt on the other day. But in my imaginary world with perfect instruments, those kinds of errors would be so easy to see everywhere that nobody would take such a criticism seriously. (And of course because I am the author of this fantasy, there actually aren’t significant implementation errors in the experiment.)

Now think about either the scientists 100 years after that or if we had such perfect-fidelity recordings of some famous historical experiment. Let’s say it’s Michelson & Morley. Then it would be interesting to just watch the video from all angles (full resolution still not necessary) and learn a bit about the characters we’ve talked so much about.

But even here I don’t think what you would do is run an exploratory algorithm on the atomic level and see what it finds — even if you had a bajillion processing power so it didn’t take so long. There’s just way too much to throw away. If you had a perfect-fidelity-10²⁵-zoom-full-capacity replica of something worth observing, that resolution and fidelity would be useful to make sure you have the one key thing worth observing, not because you want to look at everything and “do an algo” to find what’s going on. Imagine you have a videotape of a murder scene, the benefit is that you’ve recorded every angle and every second, and then you zoom in on the murder weapon or the grisly act being committed or the face of the person or the tiny piece of hair they left and that one little sliver of the data space is what counts.

What would you do with infinite data? I submit that, for analysis, you’d throw most of the 10²⁵ bytes away.

May 14, 2012

Gauging the frothiness of the webby/techy/san-fran VC market.

Source: Mark Suster. Propagated via one of tumblr’s owners, who added:

Based on the NVCA statistics on the venture capital industry, there are [approximately] 1,000 early stage financings every year….

And somewhere around 50 – 100 of them exit for more than $100mm every year. So 5-10% of the companies financed by VCs end up exiting for more than $100mm.

Mathematical PS: These are value-at-risk numbers, just upside-down.
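A small sketch of the arithmetic in the quote and of the "upside-down VaR" remark. The counts are the ones quoted above; the simulated exit values are a made-up lognormal sample, an assumption purely for illustration and not NVCA data.

```r
financings <- 1000
big.exits  <- c(low = 50, high = 100)
share      <- big.exits / financings        # fraction exiting above $100mm

set.seed(7)
exits <- rlnorm(financings, meanlog = 2, sdlog = 2)   # hypothetical exit values, in $mm

# value-at-risk reads a threshold off the bad tail of a distribution;
# the "exits above $100mm" statistic reads one off the good tail instead
upside   <- quantile(exits, probs = 1 - share)   # upper-tail thresholds
downside <- quantile(exits, probs = share)       # VaR-style lower-tail thresholds
```

In other words, the quoted figures place $100mm somewhere between the 90th and 95th percentile of exit outcomes.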

May 12, 2012

Lawrence Krauss, author of A Universe from Nothing lecturing on cosmology.

  • Don’t really agree with or like his monolithic straw-man representation of “religion” versus “science” at minute 6. “Religion pretends to know all the answers”.

    Sub-i, sub-j, larry. There are many religions and many sciences.

  • Minute 14. Edwin Hubble’s original data! Straight-line plot through a bunch of dispersed points. “That’s why we know he was a great scientist” — nobody laughed in the tape, but I did — “he knew that he should draw a straight line through a cloud of points”. I also love it when people take the time to go through an old paper, pull things out, and present them anew.
  • I have never understood the business of standard candles. To me it seems like you have two degrees of freedom (distance and brightness), only one of which can be knocked out by the measurement of apparent brightness.

    So say we figure out a “standard candle” — a star with a particular colour signature that tells us “The star is at X phase of its life, is made up of Z, and such stars always shine at a constant brightness of 1 for Q million years.”

    But still — how do we know that our theory is right? How do we know, know, know that it’s really brightness of 1? It’s not like we can triangulate. And it’s certainly not like we’ve been there and seen it first-hand.

  • I had the same problem in a discussion with a geologist a few months ago. I sometimes get the sense that working scientists are so immersed in the practical fact that, yes, for all intents and purposes we know X to be true, that they’re not willing to step back to an abstract, philosophical level and say: “Well, if you really keep pulling on the threads, there are assumptions at the bottom of everything, so yes, we really don’t absolutely know X to be the case. However, Philosophical Prig, we don’t really know we’re not living in The Matrix either! So hush up and get back to doing something relevant.” But that’s the kind of answer I really want to hear: no, we don’t know know know, but for all practical purposes, yes we know.
  • Minute 15. How old is the universe? So Hubble got the answer wrong in 1929, and it was obviously wrong. “Scientists don’t know what they’re doing”

    But I had the same reaction to people talking about dark matter in the 90’s. “What is this stuff we call dark matter? Or dark energy?” As I understood it at the time, “dark matter” just represented a 90% fudge factor in astronomical measurements. It could be that gravity or quarks or anything else about the laws of physics is simply different in other parts of the universe. And how would we rule out that hypothesis? We just rule it out by assuming that the laws of Nature are the same everywhere, because that’s what we’ve assumed for the last few hundred years and it’s always worked out. Straight-line extrapolation to “That assumption must be true now and everywhere” despite that we’re now talking about multiple galaxies so unimaginably far away.

  • Minute 18:30 “This is a Hubble plot, much better than Hubble’s plot. It was made after the discovery that on a log-log plot, everything is a straight line.” Again, no laughs, but I thought that was hilarious.
  • Calculations that estimate the total energy in all vacuums add up to 10^28 times the observed mass of the universe. Whoops.
  • Dark matter here on Earth? Let’s go down into the mines and measure it. (By the way, where would the physicists be if those evil resource-extraction companies in Lead, South Dakota hadn’t negotiated with the legal entities that be and drilled into the Earth’s crust? Way to play it as it lies, Sandia Labs. #scruples)
  • Flat, closed, or open universe? (also why are these the only three options?) Well, we only observe 30% of the mass that would be required to make the universe flat.
  • A gigantic, gigantic, um, really gigantic triangle — to measure the curvature of the universe.
  • That’s what those microwave-background radiation detecting balloons in Antarctica have been doing.
  • There’s always something there, even when there’s nothing. (see this video of the quantum fields flickering about in empty space)
  • 90% of the mass of a proton is due to the vacuum. (not delta spikes, more like 1/x or exp(−x) integrals.) Therefore your mass is 90% due to quantum fluctuations around the zero point energy.
  • The universe also has a net total energy of 0. Hence the possibility of “a universe from nothing” (our universe needn’t have a Creator since there is enough mass/energy in the physical vacuum that those virtual fluctuations could have acted as a Prime Mover).
  • 70% + 30% = 100%
  • Making our place in the Universe even less special. “Regular” matter—the stuff we observe—is only a 1% pollution in the uniform dark-energy / dark-matter background of the universe.
  • Deep-future scientists (like in a few billion years) won’t be able to observe other galaxies. Measuring the universe, they will observe (correctly) that their galaxy is the only one around, and that there is nothing but empty, eternal space around them.
  • So they will be “Lonely and ignorant, but dominant. Of course those of us who live in the United States are already used to that.”
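Going back to the standard-candle bullet: the method leans on the inverse-square law, flux = luminosity / (4π · distance²). A rough sketch of how the inferred distance hangs entirely on the luminosity you assume. The function name `candle.distance` and the unit-free numbers are assumptions for illustration, not anything from the lecture.

```r
# distance implied by an assumed intrinsic luminosity and an observed flux
candle.distance <- function(luminosity, flux) {
  sqrt(luminosity / (4 * pi * flux))
}

flux <- 1 / (4 * pi)                       # observed apparent brightness, arbitrary units

d.theory   <- candle.distance(1, flux)     # if the star really has brightness 1
d.if.wrong <- candle.distance(4, flux)     # if it is actually 4 times brighter
```

Same observed flux, and the second star sits twice as far away. Apparent brightness alone cannot tell the two apart, which is the two-degrees-of-freedom worry: everything rests on the theory that fixes the luminosity.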

Big Data vs Quality Data

May 3, 2012

theLoneFuturist: I’m not certain why learning Hadoop isn’t more attractive to you. If you are fine with R, doesn’t having lots of data interest you?
theLoneFuturist: Don’t get me wrong, there are probably unexciting tasks associated with big data, but you’d then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
isomorphisms: The BIG DATA fad seems to be based on “let’s turn a generic algorithm loose on exabytes!”
isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what’s left out.

isomorphisms: For example twitter text analysis. At a high level I might ask “How are attitudes changing?” “How do people talk about women differently than men?” “Do attitudes toward Barack Obama depend on the state of the US economy?” Questions whose answers aren’t easy to turn into just a few numbers.

isomorphisms: My parody of a big-data faddist’s response would be all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm

isomorphisms: In the ‘modelling vs scavenging’ war (cf Leo Breiman) I’m more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don’t think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible — more clickstreams being recorded makes more analysis doable.

isomorphisms: Anyway, to answer your question, no, having a lot of data doesn’t interest me.
isomorphisms: I’d rather have interesting data than lots of it.
theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define/separate interesting and you can get it.
isomorphisms: Definitely not, think about historical data.
isomorphisms: For example Angus Maddison’s estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells, are you going to OCR an illuminated manuscript? You would miss the Celtic knots)
isomorphisms: Even if stuff were OCR scanned properly and no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than the petabytes of clickstream logs.
isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
isomorphisms: Also, we talk about text being “unstructured data”, how about “I smell sulphur coming from over there” (during an archaeological dig) or “This kind of quartz shouldn’t be at this depth in this part of the world” or, you know, “Hey look are those dinosaur footprints?”
isomorphisms: The kind of stuff a fisherman might notice. THAT’S unstructured data.

theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.

April 2, 2012

An unabashedly narcissistic data analysis of my own tweets.

The unequivocally lovely Jeff Gentry (@geoffjentry) has contributed an R package with easy-to-read documentation that works, which I’ll walk through here so that you, too, can gaze at your own face mirrored in the beauty of a woodland pond—er, sea of electrons.

Here’s the basic flow for grabbing stuff. You can do more with ROAuth but that’s a bit of a pain.

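require(twitteR)
# the variable name on the next line was lost from the original post; 'rt.of.me' is a placeholder
rt.of.me <- searchTwitter("RT @isomorphisms", n=100)
news <- getTrends(n=50)
firehose <- publicTimeline(n=999)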

my.tweets <- userTimeline('isomorphisms', n=3500)
# userTimeline() returns a list of status objects, so convert it before using $text
head( twListToDF(my.tweets)$text )

Consider: Donkey Kong is neither a donkey, nor a kong.
William Thurston, geometrizer of manifolds
When I invent a single-letter language, it's going to be called Ж.
@theLoneFuturist True. If $GOOG were only an ad network, with no search facility, how much would it be worth?
Do your arms hang down by your side in zero gravity? Because then I bet astronauts have less smelly armpits.
Can't log into Hacker News with #w3m! Unexpected.
Salt and sugar are opposites. Therefore if i eat too much salty food I must balance it with candy. #logic
@leighblue Do you know any behavioural econ studies on utility vs bite-size / package-size?

Those are some of the ways you can grab data — twitteR hooks into RCurl and then, like, the info is just there. Run twListToDF( tweets ) to split the raw info into 10 subfields—text-of-tweet, to-whom-was-the-reply-threaded, timestamp, and more.

To pull out just one of those fields—like “source of tweet”, for example, use sapply:

my.tweets <- userTimeline('isomorphisms', n=3000)
whence.i.tweet <- sapply( my.tweets, function(x) x$statusSource )
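If you want to see what that sapply is doing without touching the Twitter API, here is a minimal sketch on stand-in data. The `mock.tweets` list and its contents are hypothetical: plain lists that merely share the `statusSource` field name with twitteR's status objects.

```r
# hypothetical stand-ins for status objects: plain lists with the same field name
mock.tweets <- list(
  list(text = "first",  statusSource = "TTYtter"),
  list(text = "second", statusSource = "tumblr"),
  list(text = "third",  statusSource = "TTYtter")
)

# pull one field out of every element, giving a plain character vector
whence <- sapply(mock.tweets, function(x) x$statusSource)

# counts per client, which is the kind of thing the bar and radial charts display
source.counts <- table(whence)
```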

You can see from plots 1, 2, 3, and 4 that I use @floodgap’s TTYtter client (tweeting from the command line; no installation). In fact this is why I’ve started tweeting so much the last few months: I run TTYtter in a virtual terminal, mutt (command-line gmail) in another virtual terminal, and therefore it becomes quite easy to flick my virtual newsfeed/conversation stream on for a minute or two here and there whenever I’m at the computer. It feels like The Matrix or Neuromancer or something.

Here’s how I created the ggplot radial chart #4 — this was the longest command I had to use to generate any of them. For some reason qplot didn’t like scale_y_log10() so I did:

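require(ggplot2)
# each + has to end a line; a line that starts with + is parsed as a new expression
ggplot( data = data.frame(whence.i.tweet),  aes( x=factor(whence.i.tweet),  fill=factor(whence.i.tweet) )   ) +
 scale_y_log10() +
 geom_bar() +
 coord_polar() +
 # opts() and theme_blank() are the 2012-era ggplot2 API; later versions use theme() and element_blank()
 opts(  title="whence @isomorphisms tweets",   axis.title.x=theme_blank(),   legend.title=theme_blank()   )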

In the words of @jeffreybreen, twitteR almost makes this too easy. A few months ago — before I knew about this package — I was analysing tweets for a client who wanted to gauge the effectiveness of “customer service tweets”. I wrote an ugly, hacky perl script that told me whether the tweet had an @ in it, whether the @ was a RT @, whom the tweet was @, and so on. Dealing with people using @ in another sense besides “Hey @cmastication, what’s up?” or different numbers of spaces between RT/MT; multiple RT’s in the same message; and so on — was an icky mess. I probably spent half a week changing my regexes around to deal with more cases I hadn’t thought of. Like most statisticians, I hate data munging—swimming around in the data is the fun part, not patching up the kiddie pool. Besides that, my client wanted the results in an Excel file — and Excel can’t handle multidimensional arrays (whereas a tweet mentioning @a @b @c should have just one “mentions” slot with three things in it).

That twitteR package is so hot right now.


But as much fun as it was to display my love of TTYtter in four different plots, that’s not the only R-based egotainment you can compute on a Friday night.

How wordy am I?

I know I am wordy. I often adopt a telegraphic SMS-like typing style (“Sntrm wd b gr8 prez, like Ahmedinejad”) rather than hold back my trenchant remarks about astronauts’ armpits. Tumblr’s auto-tweets don’t help my average, either—the default is long, and I’m usually too lazy to change it.

With the magic of kernel density estimates—which are definitely not overkill for the analysis of my appropriately-florid and highly-important charstreams—and my usual base::plot params, the length of my tweets is made art in the form of chart #5.

I got a vector of tweet-lengths using @hadleywickham’s stringr package:

my.tweets <- userTimeline('isomorphisms', n=3500)
my.tweets <- twListToDF( my.tweets )
iso <- my.tweets$text
require(stringr)
iso.len <- str_length(iso)   #vectorised! No for loops necessary
hist( iso.len, col="cyan" )   # hist() takes col= for the bar colour, not fill=

Proving once again that all real-world distributions fit a bell curv—…um.

You can of course use subset( my.tweets ) to plot tweets that were made under certain conditions—I might look only at my tumblr auto-posts using subset( my.tweets, statusSource=="tumblr"). Or only at short tweets using subset( my.tweets, str_length(my.tweets$text)<100 ). And so on.
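A base-R sketch of the same steps (lengths, a kernel density estimate, a subset) that runs without twitteR or stringr. The three texts are tweets quoted earlier in this post; the `statusSource` values and the 50-character cutoff are made up for the example.

```r
# a few of the tweets quoted earlier, typed in by hand
iso <- c("Consider: Donkey Kong is neither a donkey, nor a kong.",
         "William Thurston, geometrizer of manifolds",
         "Can't log into Hacker News with #w3m! Unexpected.")

tw <- data.frame(text = iso,
                 statusSource = c("web", "tumblr", "TTYtter"),   # sources are invented
                 stringsAsFactors = FALSE)

tw$len <- nchar(tw$text)                        # base-R counterpart of str_length()
dens   <- density(tw$len, from = 0, to = 140)   # kernel density estimate of tweet length
short  <- subset(tw, len < 50)                  # only the short ones
```

`plot(dens)` then draws the curve. With three points it is not much of a distribution, but the same calls work on a few thousand.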


Lastly, I wanted to plot my tweeple—the people I talk to on twitter (most of whom I don’t actually know in real life … I like to keep friends and mathematical geekery separate). As you can see from the final chart, it was largely a sh_tshow. Or so I thought, until I considered attacking the problem with ggplot.

One of ggplot’s strengths—in my opinion its greatest strength—is the facet_grid( attribute.1 ~ attribute.2) function. In combination with base::cut — which assigns discrete “levels” to the data — facetting is especially powerful. I cut my data into four subsets, based on how many times I’ve tweeted @ someone:

my.tweets <- userTimeline( 'isomorphisms', n=3000 )

# only tweets that are @ someone
# (the test inside subset() was lost from the original post; is.na(replyToSN) is
#  reconstructed from the comment below, and twListToDF() is added because subset() needs a data frame)
talkback <- subset( twListToDF(my.tweets), is.na(replyToSN) == FALSE )
#the value would be NA iff I tweeted into the vast nothingness, apropos of no-one
# just the names, not the rest of the tweet's text or meta-information
tweeps <- talkback$replyToSN
#make a new data frame for ggplot to facet_wrap.
tweep.count <- table(tweeps)
tweep.levels <- cbind( tweep.count,
cut( tweep.count, c(0,1,2,5,100) ),
names(tweep.count)   # third column lost from the original; reconstructed from the "name" label below
)
tweeps <- data.frame(tweep.levels)
names(tweeps) <- c("number", "category", "name")
class(tweeps$number) <- "numeric"
#all the above stuff only came clear after a few attempts
#and likewise the plot didn't work out perfect at first, either!
#but here's a decent plot that works:
ggplot( data = tweeps, aes(x=number) ) +
 facet_wrap(~ category, scales="free_x") +
 geom_text( aes(label=name, y=30-order(name), size=sqrt(log(number)),    col=number+(as.numeric(category))^2 ), position="jitter" ) +
 opts( legend.title = theme_blank(), legend.text = theme_blank() )

This made for a much more readable image. Not perfect, but definitely displaying info now.
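For anyone who hasn't met base::cut, here is a tiny sketch of what it does with the same breaks. The handles and counts are hypothetical.

```r
# hypothetical reply counts per handle
tweep.count <- c(alice = 1, bob = 2, carol = 4, dave = 30)

# bin the counts into the four intervals used for the facets
category <- cut(tweep.count, c(0, 1, 2, 5, 100))

levels(category)                       # "(0,1]" "(1,2]" "(2,5]" "(5,100]"
by.bin <- split(names(tweep.count), category)
```

Each interval is open on the left and closed on the right, so someone I've tweeted at exactly 5 times lands in the third bin. Each level then becomes one facet.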


OK, I do love talking about my twistory a little too much — but I’d like to see your histograms as well! If you run some stats on your own account, please post some pics below. I believe images can be directly embedded in the Disqus comments with <img src="">.

(To save your R plots to a file rather than to the screen, do png("a plot named Sue.png"); plot( laa dee daa ); dev.off() where ; could be replaced by a newline. The dev.off() closes the file so the image actually gets written.)

March 25, 2012

Cost of university attendance in California as a fraction of the median family’s yearly income, 1975-2010.

by Sean Mulcahy