Categories
New Finds Online

Some recent trends in economics research...

Today, with some time to kill, I scraped NBER's working paper data from the past year to take a look. As everyone knows, publication cycles in econ keep getting longer: one or two years counts as short, and three to five is pretty common. Tracking AER and the like is a decent compass, but it's still a bit "stale."

NBER covers a fairly broad range of research, and most published papers have a working paper version there, so offhand I couldn't think of a better source to scrape:

Aging (AG)
Asset Pricing (AP)
Children (CH)
Corporate Finance (CF)
Development Economics (DEV)
Development of the American Economy (DAE)
Economics of Education (ED)
Economic Fluctuations and Growth (EFG)
Environmental and Energy Economics (EEE)
Health Care (HC)
Health Economics (HE)
Industrial Organization (IO)
International Finance and Macroeconomics (IFM)
International Trade and Investment (ITI)
Labor Studies (LS)
Law and Economics (LE)
Monetary Economics (ME)
Political Economy (POL)
Productivity, Innovation, and Entrepreneurship Program (PR)
Public Economics (PE)

After a round of scraping, the rough keyword frequencies came out as below. The ones too meaningless to matter I dimmed to transparent. (I personally loathe word clouds, so I went with a bar chart instead.)

[3/19 update] I matched the words against Bing's academic keywords. It seems to add a bit more information.

[Figure: keyword frequency bar chart]

Counts don't equal quality, but they at least show how many people are tilling each field. Health stands out by far (there's a lot of money there). Trade and growth also get plenty of attention, risk and finance look well represented, and crisis comes up a lot too. Labor and IO stay perennially warm. On the methods side, randomized experiments shine brightest.

I haven't gone further into which authors are most prolific; maybe next time I'll chart the "coattail-riding" trend.

The code is here:

grab_url <- c("http://www.nber.org/new_archive/mar14.html",
              "http://www.nber.org/new_archive/dec13.html",
              "http://www.nber.org/new_archive/sep13.html",
              "http://www.nber.org/new_archive/jun13.html",
              "http://www.nber.org/new_archive/mar13.html")

library(RCurl)
library(XML)

# scrape one archive page: titles sit in the first <a> of each <li>,
# authors in the text node that follows
grab_paper <- function(grab) {
  webpage <- getURLContent(grab)
  web_content <- htmlParse(webpage, asText = TRUE)
  paper_title <- sapply(getNodeSet(web_content, path = "//li/a[1]"), xmlValue)
  author <- sapply(getNodeSet(web_content, path = "//li/text()[1]"), xmlValue)
  data.frame(paper_title = paper_title, author = author)
}

library(plyr)
paper_all <- ldply(grab_url, grab_paper)  # stack all pages into one data frame

# tokenize titles on whitespace and punctuation
titles <- strsplit(as.character(paper_all$paper_title), split = "[[:space:][:punct:]]")
titles <- unlist(titles)

library(tm)
library(SnowballC)
titles_short <- wordStem(titles)  # stem tokens so e.g. "Effects"/"Effect" collapse
Freq2 <- data.frame(table(titles_short))
Freq2 <- arrange(Freq2, desc(Freq))
Freq2 <- Freq2[nchar(as.character(Freq2$titles_short)) > 3, ]    # drop very short tokens
Freq2 <- subset(Freq2, !titles_short %in% stopwords("SMART"))    # drop stopwords
Freq2$word <- reorder(Freq2$titles_short, X = nrow(Freq2) - 1:nrow(Freq2))  # keep frequency order in the plot
# flag generic stems so they can be dimmed in the chart
Freq2$common <- Freq2$word %in% c("Evidenc", "Effect", "Econom", "Impact", "Experiment",
                                  "Model", "Measur", "Rate", "Economi", "High", "Data",
                                  "Long", "Chang", "Great", "Estimat", "Outcom", "Program",
                                  "Analysi", "Busi", "Learn", "More", "What")
library(ggplot2)
# geom_bar needs stat = "identity" here since y is supplied explicitly
ggplot(Freq2[1:100, ]) +
  geom_bar(aes(x = word, y = Freq, fill = common, alpha = !common), stat = "identity") +
  coord_flip()

### get some keywords from Bing academic
start_id_Set <- (0:5) * 100 + 1  # six pages of 100 keywords each
require(RCurl)
require(XML)

# grab one page (100 rows) of the keyword ranking table
get_keywords_table <- function(start_id) {
  end_id <- start_id + 100 - 1
  keyword_url <- paste0("http://academic.research.microsoft.com/RankList?entitytype=8&topDomainID=7&subDomainID=0&last=0&start=", start_id, "&end=", end_id)
  keyword_page <- getURLContent(keyword_url)
  keyword_page <- htmlParse(keyword_page, asText = TRUE)
  keyword_table <- getNodeSet(keyword_page, path = "id('ctl00_MainContent_divRankList')//table")
  table_df <- readHTMLTable(keyword_table[[1]])
  names(table_df) <- c("rowid", "Keywords", "Publications", "Citations")
  return(table_df)
}

require(plyr)
keywords_set <- ldply(start_id_Set,get_keywords_table)

save(keywords_set, file="keywords_set.rdata")

The code for the latest update is below. It's not very efficient; bear with me.

### map keywords
load("keywords_set.rdata")
load("NBER.rdata")
keys <- strsplit(as.character(keywords_set$Keywords), split = " ")
require(SnowballC)
keys_Stem <- lapply(keys, wordStem)

# get edges: one row per (keyword, title-stem) pair whenever the stemmed
# keyword contains that stem; a brute-force double loop
edge_Set <- data.frame()
for (word in Freq2$word) {
  for (key_id in 1:length(keys_Stem)) {
    if (word %in% keys_Stem[[key_id]]) {
      edge <- data.frame(keywords = keywords_set[key_id, ]$Keywords, kid = word)
      edge_Set <- rbind(edge_Set, edge)
    }
  }
}
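(Side note: the double loop above is the inefficient part I apologized for, since it also grows edge_Set row by row with rbind. A vectorized sketch along the following lines should do the same matching in one pass; keys_long and edge_Set2 are names I'm making up here, and I haven't benchmarked it against the original data.)

# flatten the stemmed keyword tokens into one long table: one row per token,
# repeating each keyword once per token it contains
keys_long <- data.frame(
  keywords = rep(keywords_set$Keywords, sapply(keys_Stem, length)),
  kid      = unlist(keys_Stem)
)
# keep tokens that appear among the title stems; unique() collapses
# keywords that happen to contain the same stem twice
edge_Set2 <- unique(subset(keys_long, kid %in% Freq2$word))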

require(ggplot2)
kid_freq <- as.data.frame(table(edge_Set$kid))
require(plyr)
kid_freq <- arrange(kid_freq, desc(Freq))

edge_Set_sub <- subset(edge_Set, kid %in% Freq2[1:100,]$word)
edge_Set_sub$keywords <- as.character(edge_Set_sub$keywords)

# collapse all matched keywords into one comma-separated string per stem
link_keys <- function(x) paste(x$keywords, collapse = ", ")

linked <- ddply(edge_Set_sub, .(kid), link_keys)

show_keys <- merge(Freq2[1:100, ], linked, by.x = "word", by.y = "kid", all.x = TRUE)
names(show_keys)[5] <- "linked"  # ddply named the collapsed column "V1"

ggplot(show_keys[!is.na(show_keys$linked), ], aes(x = word, y = Freq)) +
  geom_bar(aes(fill = common, alpha = !common), stat = "identity") +
  coord_flip() +
  geom_text(aes(label = substr(linked, 1, 200)), size = 1, hjust = 0)

 

Categories
Reading Reflections

What do we actually learn in college?

Lately I keep coming back to this question: after spending all that time in school, what did I actually learn?

Knowledge, as long as you're willing to put in the hours, can mostly be acquired; passing exams is even less of a feat.

Yet the transcript is packed, top to bottom, with knowledge and more knowledge. Just looking at it is exhausting.

Beyond knowledge, what else did school teach? Mostly temperament, perhaps: cultivating curiosity, an appetite for exploring things, an openness to the intellectual shock of many different fields. Once you start working, so much can be learned on the spot, as needed; little of it turns out to be all that hard.

A while back I was reading about the educational model of American LACs (Liberal Arts Colleges), which aims to cultivate the temperament of an elite. Having been lucky enough to meet some graduates of top LACs, I can say their bearing really is a cut above.

A "liberal arts" institution can be defined as a "college or university curriculum aimed at imparting broad general knowledge and developing general intellectual capacities, in contrast to a professional, vocational, or technical curriculum."

The further you go, the more this kind of accumulated depth outweighs knowledge and coursework, carrying you forward. My own college years genuinely lacked that kind of time: force-fed like a duck, driven by GPA to compete over points, missing so much breadth and depth of thinking. And of all that knowledge, once the exams were over, how much still serves me today? Precious little.

Back to languages. When I was learning Spanish, many people said that once you've learned two or more Romance languages, the rest come easily. I now believe it completely, and programming languages are the same. With R and Matlab well worn in, plus some basics in C and PHP, reading Python poses no real difficulty. I suspect Java wouldn't take much effort either.

I have tried to persuade countless people around me that mathematics is also a language (statistics is not; it's a way of thinking, expressible in many languages): all those formulas really express our pursuit of logical rigor taken to its extreme. Courses that look complex and profound mostly yield to the old saying: read it a hundred times and its meaning reveals itself.

That's as far as this train of thought goes. Yes, I am a little sorry about those hastily squandered years.

Categories
Reading Reflections

A quick try at Python

Today, thoroughly bored, I decided to give Python a try. I found a problem that goes roughly like this:

  • Given an input string, find the most beautiful duplicate-free subsequence.
  • Subsequence: anything obtainable by deleting some characters from the original string.
  • Duplicate-free string: one with no repeated characters.
  • A is more beautiful than B: A is longer than B, or (at equal length) A comes later in lexicographic order.

Since every candidate is duplicate-free, the maximal length is fixed at the number of distinct characters, so length never needs to break a tie; the task reduces to finding, among all duplicate-free subsequences of that same maximal length, the lexicographically largest one.

I wrote it in R first, just to work out a correct algorithm. The basic idea is brute-force greedy recursion, layer by layer: at each layer, keep the largest character whose preceding prefix can be dropped without losing any of the remaining distinct characters.

x = 'nlhthgrfdnnlprjtecpdrthigjoqdejsfkasoctjijaoebqlrgaiakfsbljmpibkidjsrtkgrdnqsknbarpabgokbsrfhmeklrle'

x_split = strsplit(x, split = "")[[1]]
unique_x = unique(x_split)
unique_x_order = sort(unique_x, decreasing = TRUE)
x_remain = character()

# find, layer by layer, the largest character that can be kept

# initialize
current_string = x_split
current_unique = unique_x
current_order = unique_x_order
while (length(x_remain) < length(unique_x)) {
  for (i in 1:length(current_order)) {
    character = current_order[i]
    index = which(current_string == character)
    sub_string = current_string[min(index):length(current_string)]
    if (length(setdiff(unique(current_string), unique(sub_string))) == 0) { # no loss of characters
      x_remain = c(x_remain, character)
      current_string = current_string[-c(1:min(index), index)]
      current_unique = unique(current_string)
      current_order = sort(current_unique, decreasing = TRUE)
      break
    }
  }
}

paste(x_remain, collapse = "")
# answer is 'tsocrpkijgdqnbafhmle'

Afterwards I rewrote it in Python, basically a one-for-one substitution of equivalent functions... Am I wasting Python's potential? I really don't get that "program on the fly" feeling at all...

x = 'nlhthgrfdnnlprjtecpdrthigjoqdejsfkasoctjijaoebqlrgaiakfsbljmpibkidjsrtkgrdnqsknbarpabgokbsrfhmeklrle'
x_split = list(x)
unique_x = sorted(set(x_split), reverse=True)
x_remain = []

# initialize
current_string = x_split
current_order = unique_x
while len(x_remain) < len(unique_x):
    for character in current_order:
        index = current_string.index(character)
        sub_string = current_string[index:]
        if len(set(current_string) - set(sub_string)) == 0:  # no loss of characters
            x_remain.append(character)
            # drop every occurrence of the chosen character from the remainder
            for _ in range(sub_string.count(character)):
                sub_string.remove(character)
            current_string = sub_string
            current_order = sorted(set(current_string), reverse=True)
            break

print(''.join(x_remain))  # 'tsocrpkijgdqnbafhmle'

And after finally finishing the Python version, I discovered the network was down... so I couldn't submit it online. By the time it came back, time was up, sigh. Oh well, a little weekend exercise.

Categories
Reading Reflections

Continuous > discrete

I'm just trying to recover, so I'm leafing through some inanimate things on the side.

--------------------end of rambling---------------------

I really admire someone like Andrew Gelman, who has kept a blog going for so many years, touches a little on everything, and whose posts teach me something whenever I read them (I hope that means I'm making progress...). You often find questions there that are "not that big" yet very fundamental.

In a gap while my code was running, I skimmed his post Econometrics, political science, epidemiology, etc.: Don't model the probability of a discrete outcome, model the underlying continuous variable, which is quite interesting. The gist: when a continuous variable is available, don't use the discrete variable chopped out of it. He gives a few examples; the baseball ones are lost on me, but the econ one at the end naturally catches the eye —

Even in recent years, with all the sophistication in economic statistics, you’ll still see people fitting logistic models for binary outcomes even when the continuous variable is readily available. (See, for example, the second-to-last paragraph here, which is actually an economist doing political science, but I’m pretty sure there are lots of examples of this sort of thing in econ too.)

I then flipped back to Estimating the incumbent-party advantage and the incumbency advantage in House elections and skimmed it; it turns out Andrew's advice is to predict the number of votes directly rather than win or lose, since the information lost in between is a real pity —

The key is that vote differential is available, and a simply performing a logit model for wins alone is implicitly taking this differential as latent or missing data, thus throwing away information.

Beyond that, some people feel the binary version is more robust, since it makes fewer distributional assumptions. Andrew's response matches another post of his I'd read before — everyone's trading bias for variance at some point, it's just done at different places in the analyses. Once you pool scattered information from all those times and places into one regression, you're already straining the estimator's robustness. Using the continuous variable actually lets you pool the data less aggressively and still come away with a decent estimate.
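To see the information loss in miniature, here is a toy simulation I put together myself (not from either of Andrew's posts): generate a continuous vote share, dichotomize it into win/lose, and fit both models. The dichotomized version typically reports a clearly noisier estimate of the same underlying effect.

# toy simulation (my own sketch): continuous outcome vs. its dichotomized twin
set.seed(42)
n <- 500
x <- rnorm(n)                                  # some covariate, e.g. spending
vote_share <- 0.5 + 0.05 * x + rnorm(n, sd = 0.1)
win <- as.numeric(vote_share > 0.5)            # throw away the differential

m_cont <- lm(vote_share ~ x)                   # model the continuous variable
m_bin  <- glm(win ~ x, family = binomial)      # logit on wins alone

# compare the t/z statistics on x: the logit one is usually much smaller,
# i.e. the binary model extracts less information from the same data
summary(m_cont)$coefficients["x", ]
summary(m_bin)$coefficients["x", ]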

----------------self-criticism begins--------------

1. R's cut() function deserves caution (see the little illustration after this list).

2. I was literally just about to chop a continuous variable into buckets... I quietly deleted the pile of SQL CASE WHENs I had already written, sigh. All that typing for nothing.
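The cut() trap I have in mind is the default open left endpoint: the lowest break itself silently becomes NA. A tiny example of my own:

x <- c(0, 1, 5, 10)
cut(x, breaks = c(0, 5, 10))
# [1] <NA>  (0,5] (0,5] (5,10]   -- the 0 is silently dropped to NA
cut(x, breaks = c(0, 5, 10), include.lowest = TRUE)
# [1] [0,5] [0,5] [0,5] (5,10]   -- include.lowest keeps it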

Categories
Reading Reflections

Constitutional Law by Yale: Lecture Notes (II)

Just jotting down a few things.

Anti-Federalists and the Federalists

Basically, the two camps disagreed over how much power the federal government versus the state governments should have. Copying a summary:

The Anti-Federalists opposed the new U.S. Constitution for numerous reasons.

  • They distrusted large, powerful national governments and believed liberty could only be protected in small republics in which the rulers were closely checked by the public.
  • They believed a large nation could best be governed by a confederation, with local governments having the most control. A strong national government would be distant from the people and not capable of protecting the rights of the citizens. Congress would tax too heavily and the Supreme Court would overrule state courts.
  • They distrusted the president having too much power, including a standing army under his control.
  • They also favored the addition of a Bill of Rights to protect the citizens from the national government. They wanted the House of Representatives increased in size so it would reflect a greater variety of popular interests.
  • They wanted a council created to check the actions of the president.
  • They also favored leaving military affairs in the hands of the state militias.

Federalists favored a strong national government with supreme power over state governments.

  • The rights of citizens would be protected from the government via legislation, the courts, and the Bill of Rights.
  • Federalists distrusted the masses to select the best candidates so they made only the House of Representatives directly elected by the people. Checks and Balances within the Constitution would make sure no one branch became too powerful.
  • The President would have control over the military, necessary for national defense, but could not violate the laws. The Secretary of War would advise the President.
  • The national government needed the power to tax and enforce the laws, or the ills of the Articles would hamper the development, agriculture and industry, of the new nation.

Put plainly, the Anti-Federalists wanted the state governments to be more independent, with the federal government meddling less in the states.