Building a Stock-Pick Rating System (Part 2): Scraping

Published: 2015-11-10

Tags: Node.js, web scraping

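When I scraped data with Python in the past, I was delighted with using PhantomJS to render data that JavaScript loads dynamically, so I originally planned to write the scraping part in Python as well. But needing both Node.js and Python for such a small project felt pointless. A long time ago, on a forum thread about ways to scrape dynamic data, someone suggested simply finding the underlying endpoint and calling it directly.

I only started learning front-end development recently, so after one pass through the basics of AJAX I felt ready to go looking for the endpoint... If the site does no validation, the endpoint can be used freely; if it does basic validation, you can read the validation logic in the JS files and assemble the request yourself. Fortunately, the site I wanted to scrape requires no login and imposes no restrictions at all...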


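The Eastmoney (东方财富网) page for the stock 视觉中国 shows all kinds of data; the items I care about are the previous close and the real-time price.

I use Chrome: right-click, then Inspect Element.

The previous close has the id "gt8". Since the data is loaded dynamically by JS, we need to find the JS file that handles gt8, ideally one that is not obfuscated. Right-click and view the page source. I used the brute-force approach: there are only ten or so JS files in total, so I opened each one and searched for "gt8" with Ctrl+F.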
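The manual search above can also be scripted. The following is a minimal sketch, not what I actually did: `listScriptUrls` and `filesMentioning` are hypothetical helper names, and the tiny HTML page and script bodies are made up for illustration. In practice you would download each script first and pass its text in.

```javascript
// Extract the src of every external <script> tag from a page's HTML.
function listScriptUrls(html) {
  const urls = [];
  const re = /<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

// Given a map of url -> file contents, return the urls whose contents mention the needle.
function filesMentioning(sources, needle) {
  return Object.keys(sources).filter((url) => sources[url].includes(needle));
}

// Hypothetical miniature page and script bodies, for illustration only.
const sampleHtml =
  '<script src="/js/a.js"></script>\n' +
  '<script>var inline = 1;</script>\n' +
  '<script type="text/javascript" src="/js/quote-min.js"></script>';
const sampleSources = {
  '/js/a.js': 'function noop(){}',
  '/js/quote-min.js': 'document.getElementById("gt8").innerHTML=v;',
};

console.log(listScriptUrls(sampleHtml));
console.log(filesMentioning(sampleSources, 'gt8'));
```

A regular expression is good enough here because we only need the `src` attributes, not a full HTML parse.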

In the end I found that this JS file is the one that fills in the data:

http://hqres.eastmoney.com/emag14/js/quote-min.js

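The code is not obfuscated, but it is all minified into one lump, so run it through a "JavaScript formatter" tool first to make it readable.

Copy the result into a text editor for easier reading and find where gt8 appears.

Look a little further up and you can see the endpoint being requested.

It is this one: http://nuff.eastmoney.com/EM_Finance2015TradeInterface/JS.ashx?id=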

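Now switch to Chrome's Console tab, where you will see a lot of output.

Type the URL we just found into the box at the top left, click clear, then click filter; after a moment the data shows up.

But the URL is too long and gets truncated, for example: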

http://nuff.eastmoney.com/EM_Finance2015TradeInterface/JS.ashx?id=0006812&t…callback0701420895056799&callback=callback0701420895056799&_=1447126053134

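To see the full URL, you can use the developer version of Chrome, which can export the log, or use Firefox, which also shows the full URL. The CSI 300 (沪深300) endpoint later in this post was found by viewing the full address in Firefox.

In this example that is not necessary, because taking just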

http://nuff.eastmoney.com/EM_Finance2015TradeInterface/JS.ashx?id=0006812

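and opening it in a browser already returns the data. Thanks to Eastmoney's back-end programmers for handling the missing parameters with sensible defaults; just paste this endpoint into the browser.

It returns the following data: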

callback({"Comment":[],"Value":["2","000681","视觉中国","31.86","31.85","31.83","31.82","31.80","31.88","31.92","31.93","31.99","32.00","67","90","25","38","110","12","2","7","105","84","33.24","27.20","31.86","31.67","1.64","30.19","5.43","32.78","86277","29.78","2","30.22","2.73亿","2.12","4.39","179.39","52987","33290","22.27","120","11.09","2","6266080658","22320397539","0|0|0|0|0","0|0|0|0|0","2015-11-10 11:34:51"]})

The previous close, 30.22, is in there.
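As a rough sketch of how this could be consumed from Node.js: the helper names below are hypothetical, the id format (six-digit code plus a one-digit market flag) is only inferred from the URLs in this post, and the position of the previous close (index 34 of `Value`) is read off the single sample response above rather than from any documentation. Whether the endpoint still behaves this way is not something this sketch can guarantee.

```javascript
const http = require('http');

// Assumption: the id is the 6-digit code followed by a one-digit market flag
// (the sample uses "000681" + "2"); inferred from the URLs in this post only.
function buildQuoteUrl(code, marketFlag) {
  return 'http://nuff.eastmoney.com/EM_Finance2015TradeInterface/JS.ashx?id=' + code + marketFlag;
}

// Download a URL as text with Node's built-in http module.
// Assumption: the response body is UTF-8.
function fetchText(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Strip the JSONP wrapper "callback( ... )" and parse the JSON inside it.
function parseJsonp(text) {
  const start = text.indexOf('(');
  const end = text.lastIndexOf(')');
  if (start === -1 || end <= start) {
    throw new Error('not a JSONP response');
  }
  return JSON.parse(text.slice(start + 1, end));
}

// Assumption: index 34 holds the previous close, based on the one sample response.
const PREV_CLOSE_INDEX = 34;

function previousClose(text) {
  return parseFloat(parseJsonp(text).Value[PREV_CLOSE_INDEX]);
}

// The sample response quoted in this post.
const sampleResponse = 'callback({"Comment":[],"Value":["2","000681","视觉中国","31.86","31.85","31.83","31.82","31.80","31.88","31.92","31.93","31.99","32.00","67","90","25","38","110","12","2","7","105","84","33.24","27.20","31.86","31.67","1.64","30.19","5.43","32.78","86277","29.78","2","30.22","2.73亿","2.12","4.39","179.39","52987","33290","22.27","120","11.09","2","6266080658","22320397539","0|0|0|0|0","0|0|0|0|0","2015-11-10 11:34:51"]})';

console.log(previousClose(sampleResponse)); // 30.22
```

Against the live endpoint the same parser would be used as `fetchText(buildQuoteUrl('000681', '2')).then(previousClose)`.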

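In the same way, we can obtain everything we need: the stock, the stock's previous close, the stock's opening price today, the CSI 300's previous close, the CSI 300's opening price today, the real-time stock price, and the real-time CSI 300 price.

The trimmed-down endpoint for the CSI 300 is: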

http://nufm2.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=0003001&&sty=AMIC&st=z&sr=1&p=1&ps=1000&cb=&js=var%20jszs={list:[%28x%29]};&token=beb0a0047196124721f56b0f0ff5a27c

The returned data is:

var jszs={list:["沪深300,000300,3868.87,28.51,0.74%,3876.49,3798.82,3806.67,3840.35,2.02%,53.92,209018753024,15250815744,2015-11-10 12:05:40,1"]};
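This response is a JavaScript assignment rather than JSON, so it needs slightly different handling. Below is a minimal sketch with a hypothetical `parseIndexList` helper. The meaning of each field is my guess: field 2 looks like the latest value, field 3 the change, and field 8 the previous close, because latest minus change agrees with field 8 to within rounding.

```javascript
// Pull the string entries out of a response shaped like: var jszs={list:["...","..."]};
// The payload is not JSON, so we match the quoted strings instead of parsing the whole body.
function parseIndexList(text) {
  const rows = text.match(/"[^"]*"/g) || [];
  return rows.map((row) => row.slice(1, -1).split(','));
}

// The sample response quoted in this post.
const sample = 'var jszs={list:["沪深300,000300,3868.87,28.51,0.74%,3876.49,3798.82,3806.67,3840.35,2.02%,53.92,209018753024,15250815744,2015-11-10 12:05:40,1"]};';

const [fields] = parseIndexList(sample);
const latest = parseFloat(fields[2]);    // assumption: latest value
const change = parseFloat(fields[3]);    // assumption: change versus previous close
const prevClose = parseFloat(fields[8]); // assumption: previous close

console.log(fields[0], fields[1], latest, prevClose);
// Sanity check of the guess: latest - change should be close to prevClose.
console.log(Math.abs(latest - change - prevClose) < 0.02);
```

The same helper works on the longer multi-index response, since each quoted string becomes one row.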

Tips:

Very often the JS code looks like this:

eval(function(p,a,c,k,e,d){e=function(c){return(c<a?"":e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode...

Paste it into http://tool.lu/js/ to unpack it and recover the original content.
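If you would rather not use an online tool, the general idea behind unpacking can be sketched locally: code of this shape calls `eval` on a generated string, so handing it a fake `eval` that records the string reveals the source. This is only an illustration, and I am not claiming it is how tool.lu works. `unpack` is a hypothetical helper and the `packed` string is a made-up stand-in, not real packer output. Note that the unpacking function itself still runs, so only try this on code you are willing to execute.

```javascript
// Run a script whose only job is to call eval(...) on generated source,
// but give it a fake "eval" that records the source instead of executing it.
function unpack(packedSource) {
  let captured = null;
  // Function-constructor bodies are non-strict, so a parameter may be named "eval".
  const run = new Function('eval', packedSource);
  run((source) => { captured = source; });
  return captured;
}

// Hypothetical stand-in for a packed script: eval() wrapped around a function
// call that returns the real source as a string.
const packed = 'eval(function(){return "var answer = " + (40 + 2) + ";";}())';

console.log(unpack(packed)); // var answer = 42;
```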

Sometimes the URL you find carries a lot of parameters. For example, the full endpoint for fetching the CSI 300 is:

http://nufm2.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=0000011,0000161,0000021,0000031,3990012,3991062,3990022,3990032,0003001,3990052,0000091,0000121,0000111,3990062,3991022,0003001,0000011&&sty=AMIC&st=z&sr=1&p=1&ps=1000&cb=&js=var%20jszs={list:[(x)]};&token=beb0a0047196124721f56b0f0ff5a27c&v=0.3181131611638608

But much of the data it returns is redundant:

var jszs={list:["上证指数,000001,3662.36,15.48,0.42%,3669.53,3607.89,3617.40,3646.88,1.69%,0.00,305327190016,23000007424,2015-11-10 11:55:57,1","上证50,000016,2554.04,19.96,0.79%,2559.83,2498.46,2504.25,2534.08,2.42%,12.15,65187216128,5341526272,2015-11-10 11:55:24,1","A股指数,000002,3835.67,16.22,0.42%,3843.21,3778.52,3788.46,3819.44,1.69%,0.00,304917356544,22944505344,2015-11-10 11:55:24,1","B股指数,000003,375.55,1.36,0.36%,376.15,372.19,373.70,374.19,1.06%,1.00,409832640,55502100,2015-11-10 11:55:57,1","深证成指,399001,12542.93,89.69,0.72%,12597.16,12356.41,12390.53,12453.24,1.93%,0.00,391959244800,21762556928,2015-11-10 11:55:57,2","深证综指,399106,2211.41,19.80,0.90%,2222.39,2175.07,2181.28,2191.60,2.16%,0.00,391959244800,21762556928,2015-11-10 11:55:24,2","深成指R,399002,14661.18,104.84,0.72%,14724.57,14443.16,14483.04,14556.34,1.93%,27.21,117359394816,9144839424,2015-11-10 11:55:24,2","成份B指,399003,6952.38,39.93,0.58%,6976.71,6886.78,6900.09,6912.45,1.30%,0.87,248977141,26268882,2015-11-10 11:55:57,2","沪深300,000300,3868.87,28.51,0.74%,3876.49,3798.82,3806.67,3840.35,2.02%,52.61,209018753024,15250815744,2015-11-10 11:55:57,1","中小板指,399005,8459.94,53.28,0.63%,8508.31,8345.93,8364.10,8406.66,1.93%,627.19,165624295424,8692043264,2015-11-10 11:55:24,2","上证380,000009,6714.58,27.07,0.40%,6744.51,6632.00,6650.85,6687.51,1.68%,0.00,91367333888,6811353600,2015-11-10 11:55:24,1","国债指数,000012,152.72,0.00,0.00%,152.74,152.71,152.73,152.72,0.02%,0.84,156293357,1490040,2015-11-10 11:55:24,1","基金指数,000011,5951.91,6.93,0.12%,5956.11,5923.57,5933.35,5944.98,0.55%,1.69,24044699904,904206704,2015-11-10 11:55:24,1","创业板指,399006,2748.59,23.97,0.88%,2783.12,2700.29,2710.90,2724.62,3.04%,0.00,108726575104,3899405456,2015-11-10 11:55:57,2","创业板综,399102,3102.25,27.08,0.88%,3136.05,3052.52,3062.59,3075.16,2.72%,0.00,108726575104,3899405456,2015-11-10 11:55:24,2","沪深300,000300,3868.87,28.51,0.74%,3876.49,3798.82,3806.67,3840.35,2.02%,52.61,209018753024,15250815744,2015-11-10 11:55:57,1","上证指数,000001,3662.36,15.48,0.42%,3669.53,3607.89,3617.40,3646.88,1.69%,0.00,305327190016,23000007424,2015-11-10 11:55:57,1"]};

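Looking at the URL, we can tell that 0003001 is the CSI 300, so deleting the other ids gives the endpoint shown earlier. Further experiments might trim it even more, but there is little point: the data is already concise enough.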
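The trimming step can be expressed as a small string operation. This is a sketch with a hypothetical `keepOnlyCmd` helper. It deliberately uses plain string replacement rather than a URL parser, so that the unusual `js=var%20jszs={list:[(x)]};` parameter is left exactly as it was.

```javascript
// Replace the comma-separated cmd list in the query string with a single id.
// A replacer function is used so digits in the id cannot be misread as "$10"-style
// group references in a replacement string.
function keepOnlyCmd(url, id) {
  return url.replace(/([?&]cmd=)[^&]*/, (match, prefix) => prefix + id);
}

// The full endpoint quoted in this post.
const fullUrl = 'http://nufm2.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=0000011,0000161,0000021,0000031,3990012,3991062,3990022,3990032,0003001,3990052,0000091,0000121,0000111,3990062,3991022,0003001,0000011&&sty=AMIC&st=z&sr=1&p=1&ps=1000&cb=&js=var%20jszs={list:[(x)]};&token=beb0a0047196124721f56b0f0ff5a27c&v=0.3181131611638608';

console.log(keepOnlyCmd(fullUrl, '0003001'));
```

Only the `cmd` value changes; every other parameter, including the trailing `v`, is passed through untouched.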


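Getting data straight from the endpoint works really well. Before this, whether I wrote a scraper in Python or in Node.js, it always came down to parsing the page to extract the data. For dynamically loaded pages, rendering with PhantomJS takes time and also demands a good network connection; for details, see my earlier write-up 《使用Python+selenium+Phantom.js爬取js加载数据的网页》. After this experiment, I have found that going through the endpoint is the better way.

Summary: when scraping a dynamic page, look at the endpoints first, and fall back to brute-force simulated browsing only when you really cannot reproduce the endpoint calls...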