Suzuken Memo

Technical notes.

Is computation faster when fetching directly from S3?

Run Spark and Shark on Amazon Elastic MapReduce : Articles & Tutorials : Amazon Web Services http://aws.amazon.com/articles/4926593393724923

I was trying out this tutorial.

hadoop@ip-10-123-50-206:~$ SPARK_MEM="2g" /home/hadoop/shark/bin/shark

Starting the Shark Command Line Client
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0-bin/lib/hive-common-0.9.0-shark-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201403250828_452147288.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/spark-0.8.1-emr/jars/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/shark-0.8.1-bin-hadoop1/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
shark> set mpred.reduce.tasks=10;
shark> set mapred.reduce.tasks=10;
shark> create table wikistat (projectcode string, pagename string, pageviews int, pagesize int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' location 's3://bigdatademo/sample/wiki/';
OK
Time taken: 3.046 seconds
shark> create table wikistats_cached as select * from wikistat;
Moving data to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
Failed with exception Unable to rename: hdfs://10.123.50.206:9000/tmp/hive-hadoop/hive_2014-03-25_08-30-04_457_6892188994042884524/-ext-10004 to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

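The wikistats_cached table is supposed to materialize the data on HDFS first, but the initial attempt apparently failed with a rename error. For now, I ran the aggregation by fetching directly from S3.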

shark> select pagename, sum(pageviews) c from wikistat group by pagename order by c desc limit 10;
OK
Special:Search  328476
Main_Page       217924
Special:Random  73900
404_error/      65047
%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8  55814
Special:Export/Where_Is_My_Mind 21521
Wikipedia:Portada       19722
%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2   18312
Pagina_principale       17080
Alexander_McQueen       17067
Time taken: 37.087 seconds

Next, to get the cache working, I dropped the table wikistats_cached and tried again.

shark> drop table wikistats_cached;
OK
Time taken: 4.939 seconds
shark> create table wikistats_cached as select * from wikistat;
Moving data to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
OK
Time taken: 37.926 seconds

This time it seems to have worked. I ran the same aggregation against the cached table.

shark> select pagename, sum(pageviews) c from wikistats_cached group by pagename order by c desc limit 10;
OK
Special:Search  328476
Main_Page       217924
Special:Random  73900
404_error/      65047
%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8  55814
Special:Export/Where_Is_My_Mind 21521
Wikipedia:Portada       19722
%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2   18312
Pagina_principale       17080
Alexander_McQueen       17067
Time taken: 86.962 seconds

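For some reason, the query against the cached table turned out to be slower: 86.962 seconds, versus 37.087 seconds when reading directly from S3.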
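One way to dig into this would be to check how the copy is actually stored, then time the same query more than once. The following is only a sketch and was not part of the session above. It assumes, as I understand the Shark documentation, that a table is held in memory when its name ends in _cached or when the shark.cache table property is set. The table name wikistat_mem is hypothetical. Whether any of this explains the slowdown is unconfirmed.

```sql
-- Inspect the location and storage details of the copy created earlier.
DESCRIBE FORMATTED wikistats_cached;

-- Assumption: setting the shark.cache table property asks Shark to keep
-- the table in memory. wikistat_mem is a hypothetical name for this sketch.
CREATE TABLE wikistat_mem TBLPROPERTIES ("shark.cache" = "true") AS
SELECT * FROM wikistat;

-- Run the same aggregation twice and compare the reported "Time taken".
-- A single measurement cannot separate first-access cost from steady state.
SELECT pagename, SUM(pageviews) c FROM wikistat_mem GROUP BY pagename ORDER BY c DESC LIMIT 10;
SELECT pagename, SUM(pageviews) c FROM wikistat_mem GROUP BY pagename ORDER BY c DESC LIMIT 10;
```

If the second run is much faster than the first, the 86.962 seconds measured above may mostly reflect a one-time cost rather than steady-state speed. With SPARK_MEM set to 2g, it would also be worth checking whether the whole table fits in memory.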