Continuous > Discrete

I'm just trying to recover, so I'm reading some lifeless material on the side.

--------------------End of rambling---------------------

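I really admire people like Andrew Gelman, who has kept blogging for so many years and touches on a bit of everything; whenever I read him I feel I have learned something (I hope that means I'm making progress....). His blog often raises questions that are "not very big" yet quite fundamental.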

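In a gap while my code was running, I skimmed this post, "Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable", and found it quite interesting. The gist: if a continuous variable is available, don't use the discrete variables carved out of it. He gives several examples; I'm not familiar with the baseball ones, but the econ one at the end is naturally the eye-catcher: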

Even in recent years, with all the sophistication in economic statistics, you’ll still see people fitting logistic models for binary outcomes even when the continuous variable is readily available. (See, for example, the second-to-last paragraph here, which is actually an economist doing political science, but I’m pretty sure there are lots of examples of this sort of thing in econ too.)

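Then I went back to the linked post, "Estimating the incumbent-party advantage and the incumbency advantage in House elections", and after skimming it I understood that Andrew recommends predicting the number of votes directly rather than predicting win or lose. Otherwise the information lost along the way is a real pity: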

The key is that vote differential is available, and a simply performing a logit model for wins alone is implicitly taking this differential as latent or missing data, thus throwing away information.

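In addition, some people feel that using a binary outcome is more robust, because it requires no further assumptions about the distribution. Andrew's response is the same as in another post of his that I read earlier: "Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses". Once you pool scattered information from so many times and places into one regression, you are already straining the robustness of the estimator. So using the continuous variable actually lets you pool less of that data and still get a reasonably good estimate.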
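To convince myself of the information-loss point, here is a small simulation sketch. It is not from Andrew's posts: the function names (`normal_cdf`, `compare_estimators`) and all parameter values are made up for illustration, and it assumes vote margins are normally distributed, which is exactly the kind of distributional assumption the binary camp wants to avoid. Under that assumption, it estimates the probability of winning in two ways, once from the margins and once from win/lose alone, and compares the root-mean-square errors.

```python
import math
import random
import statistics


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def compare_estimators(mu=0.5, sigma=1.0, n=50, reps=2000, seed=1):
    """Return (rmse_continuous, rmse_binary) for estimating P(margin > 0).

    All defaults are arbitrary illustration values, not real election data.
    """
    rng = random.Random(seed)
    true_p = normal_cdf(mu / sigma)
    sq_err_cont = 0.0
    sq_err_bin = 0.0
    for _ in range(reps):
        margins = [rng.gauss(mu, sigma) for _ in range(n)]
        # Continuous route: fit a normal to the margins, then read off P(win).
        m_hat = statistics.mean(margins)
        s_hat = statistics.stdev(margins)
        p_cont = normal_cdf(m_hat / s_hat)
        # Binary route: keep only win/lose and take the share of wins.
        p_bin = sum(m > 0 for m in margins) / n
        sq_err_cont += (p_cont - true_p) ** 2
        sq_err_bin += (p_bin - true_p) ** 2
    return math.sqrt(sq_err_cont / reps), math.sqrt(sq_err_bin / reps)


if __name__ == "__main__":
    rmse_cont, rmse_bin = compare_estimators()
    print(f"RMSE using margins:  {rmse_cont:.4f}")
    print(f"RMSE using win/lose: {rmse_bin:.4f}")
```

The comparison is only fair while the normal assumption holds; if the margins were far from normal, the continuous route would pay for that in bias, which is the bias-for-variance trade mentioned above.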

----------------Self-criticism begins--------------

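1. The cut() function in R should be used with caution.

2. Just now I was actually trying to split a continuous variable into several segments... I quietly deleted the pile of SQL case when statements I had written, sigh. All that typing for nothing.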
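As a rough check on what binning costs, here is a sketch in Python rather than R or SQL. The data are simulated, the names (`fit_linear`, `fit_binned`, `compare_fits`) and the break points are hypothetical, and the true relationship is assumed to be linear. It compares the R² of a regression on the continuous predictor with the R² obtained after cutting that predictor into three bins.

```python
import random
import statistics


def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    y_bar = statistics.mean(y)
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_res / ss_tot


def fit_linear(x, y):
    """Ordinary least squares of y on continuous x; returns fitted values."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return [y_bar + slope * (a - x_bar) for a in x]


def fit_binned(x, y, breaks):
    """Cut x at `breaks` and predict y by the mean of its bin,
    which is what regressing on the bin dummies would give."""
    labels = [sum(a >= b for b in breaks) for a in x]
    groups = {}
    for lab, v in zip(labels, y):
        groups.setdefault(lab, []).append(v)
    means = {lab: statistics.mean(vs) for lab, vs in groups.items()}
    return [means[lab] for lab in labels]


def compare_fits(n=1000, seed=1):
    """Simulate y = 2x + noise (arbitrary values) and return both R² values."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2.0 * a + rng.gauss(0, 3) for a in x]
    r2_cont = r_squared(y, fit_linear(x, y))
    r2_bin = r_squared(y, fit_binned(x, y, breaks=[10 / 3, 20 / 3]))
    return r2_cont, r2_bin


if __name__ == "__main__":
    r2_cont, r2_bin = compare_fits()
    print(f"R^2 with continuous x: {r2_cont:.3f}")
    print(f"R^2 with binned x:     {r2_bin:.3f}")
```

If the true relationship really were a step function at those break points, binning would lose nothing, so this only illustrates the case where the underlying variable matters continuously.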
