Is the computation faster when fetching directly from S3?
Run Spark and Shark on Amazon Elastic MapReduce : Articles & Tutorials : Amazon Web Services http://aws.amazon.com/articles/4926593393724923
I was working through this tutorial.
hadoop@ip-10-123-50-206:~$ SPARK_MEM="2g" /home/hadoop/shark/bin/shark
Starting the Shark Command Line Client
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0-bin/lib/hive-common-0.9.0-shark-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201403250828_452147288.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/spark-0.8.1-emr/jars/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/shark-0.8.1-bin-hadoop1/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
shark> set mpred.reduce.tasks=10;
shark> set mapred.reduce.tasks=10;
shark> create table wikistat (projectcode string, pagename string, pageviews int, pagesize int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' location 's3://bigdatademo/sample/wiki/';
OK
Time taken: 3.046 seconds
shark> create table wikistats_cached as select * from wikistat;
Moving data to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
Failed with exception Unable to rename: hdfs://10.123.50.206:9000/tmp/hive-hadoop/hive_2014-03-25_08-30-04_457_6892188994042884524/-ext-10004 to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
The wikistats_cached table is meant to materialize the data on HDFS first, but the rename step failed on this attempt. For now, I ran the query by fetching directly from S3.
shark> select pagename, sum(pageviews) c from wikistat group by pagename order by c desc limit 10;
OK
Special:Search 328476
Main_Page 217924
Special:Random 73900
404_error/ 65047
%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 55814
Special:Export/Where_Is_My_Mind 21521
Wikipedia:Portada 19722
%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2 18312
Pagina_principale 17080
Alexander_McQueen 17067
Time taken: 37.087 seconds
Next, to cache the data, I dropped the wikistats_cached table and tried again.
shark> drop table wikistats_cached;
OK
Time taken: 4.939 seconds
shark> create table wikistats_cached as select * from wikistat;
Moving data to: hdfs://10.123.50.206:9000/user/hive/warehouse/wikistats_cached
OK
Time taken: 37.926 seconds
This time it seems to have worked. I then ran the same computation against the cached table.
shark> select pagename, sum(pageviews) c from wikistats_cached group by pagename order by c desc limit 10;
OK
Special:Search 328476
Main_Page 217924
Special:Random 73900
404_error/ 65047
%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 55814
Special:Export/Where_Is_My_Mind 21521
Wikipedia:Portada 19722
%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2 18312
Pagina_principale 17080
Alexander_McQueen 17067
Time taken: 86.962 seconds
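For some reason, the query against the cached table turned out to be slower: 86.962 seconds, versus 37.087 seconds when reading directly from S3.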
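Each query was only run once here, so the numbers are a thin basis for comparison. As a rough aid for repeating the measurement, here is a minimal Python sketch that pulls the "Time taken" values out of saved Shark CLI output and computes the ratio between two runs. The function names `time_taken_seconds` and `slowdown` are my own, and the sketch assumes the output has been captured as plain text; it is not part of Shark or the AWS tutorial.

```python
import re

# Matches the "Time taken: 37.087 seconds" line printed by the Shark CLI
# after each statement (format taken from the transcripts in this post).
TIME_TAKEN = re.compile(r"Time taken: ([0-9]+(?:\.[0-9]+)?) seconds")


def time_taken_seconds(cli_output):
    """Return every 'Time taken' value found in the CLI output, in order."""
    return [float(value) for value in TIME_TAKEN.findall(cli_output)]


def slowdown(baseline_seconds, candidate_seconds):
    """Ratio of candidate to baseline; above 1.0 means the candidate is slower."""
    return candidate_seconds / baseline_seconds


# Tail ends of the two query transcripts from this post (shortened by hand).
s3_output = "Alexander_McQueen 17067\nTime taken: 37.087 seconds"
cached_output = "Alexander_McQueen 17067\nTime taken: 86.962 seconds"

s3_time = time_taken_seconds(s3_output)[0]
cached_time = time_taken_seconds(cached_output)[0]
ratio = slowdown(s3_time, cached_time)
print(f"S3: {s3_time} s, wikistats_cached: {cached_time} s, ratio: {ratio:.2f}x")
```

With the two timings above, the wikistats_cached query comes out at roughly 2.3 times the duration of the direct S3 query. Feeding in the output of several repeated runs would show whether that gap is stable or just noise from a single measurement.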