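The previous posts covered setting up a MapReduce environment. Now let's do something more practical: write a distributed program in Python. Python is quick to write and easy to debug, which makes it a practical choice.
In my experience, MapReduce is a good fit for processing text files and for data mining. On each machine, install Python: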
su - hadoop
wget http://www.python.org/ftp/python/3.0.1/Python-3.0.1.tar.bz2
tar jxvf Python-3.0.1.tar.bz2
cd Python-3.0.1
./configure --prefix=/home/hadoop/python; make; make install
vi /home/hadoop/mapper.py
#!/home/hadoop/python/bin/python3.0
import sys

# Emit one "word<TAB>1" record for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
vi /home/hadoop/reduce.py
#!/home/hadoop/python/bin/python3.0
from operator import itemgetter
import sys

# Maps each word to its running total.
word2count = {}

for line in sys.stdin:
    line = line.strip()
    # Each input line is a "word<TAB>count" record written by mapper.py.
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # Skip records whose count is not a number.
        pass

# Print the totals sorted by word.
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
for word, count in sorted_word2count:
    print("%s\t%s" % (word, count))
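The reducer above keeps every word in a dictionary until the input ends. Since Hadoop Streaming sorts the mapper output by key before handing it to the reducer, identical words arrive on adjacent lines, so the totals can also be computed one word at a time. The following is only a minimal sketch of that idea, not part of the original post; the function names parse and reduce_sorted and the sample list are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a word-count reducer that relies on sorted input."""
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab: not a mapper record
        try:
            yield word, int(count)
        except ValueError:
            continue  # the count was not a number


def reduce_sorted(lines):
    """Sum counts per word; assumes equal words arrive on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(count for _, count in group))


if __name__ == "__main__":
    # Hypothetical sample standing in for sorted mapper output.
    sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n", "quux\t1\n"]
    for out_line in reduce_sorted(sample):
        print(out_line)
```

To try it as an actual reducer, the sample list would be replaced with sys.stdin. Only one word's counts are held in memory at a time, at the price of giving wrong totals if the input is not sorted.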
Test that the scripts work:
echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1
echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reduce.py
bar	1
foo	3
labs	1
quux	2
Make sure both files are in place on every node!
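The shell pipeline used for the test (map, sort, reduce) can also be mimicked inside a single Python process, which is handy for unit-testing the logic before involving Hadoop. This is a rough sketch under that assumption; map_words, reduce_counts and run_local are hypothetical helper names, not part of mapper.py or reduce.py.

```python
"""Sketch: simulate the map | sort | reduce pipeline in one process."""
from collections import Counter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per word and return them sorted by word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def run_local(lines):
    """Run map, then a sort standing in for the shuffle, then reduce."""
    shuffled = sorted(map_words(lines))
    return reduce_counts(shuffled)


if __name__ == "__main__":
    for word, count in run_local(["foo foo quux labs foo bar quux"]):
        print("%s\t%d" % (word, count))
```

With the same input line as the shell test, it prints the same four totals.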
On the master node, run:
# Copy the conf directory into HDFS
$ cd /home/hadoop/hadoop-0.19.1
$ bin/hadoop dfs -copyFromLocal conf 111
# Check that the copy succeeded
$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2009-05-18 15:27 /user/hadoop/111
# Run the distributed job
$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reduce.py -input 111/* -output 111-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar29198/] [] /tmp/streamjob29199.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 12
[...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200905191453_0001
[...] INFO streaming.StreamJob: To kill this job, run:...
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200905191453_0001
[...] INFO streaming.StreamJob: Output: 111-output
[hadoop@wangyin4 hadoop-0.19.1]$
$ bin/hadoop dfs -ls 111-output
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2009-05-19 14:54 /user/hadoop/111-output/_logs
-rw-r--r-- 2 hadoop supergroup 30504 2009-05-19 16:26 /user/hadoop/111-output/part-00000
$ bin/hadoop dfs -cat 111-output/part-00000
you	3
you've	1
your	1
zero	3
zero,	1
That's it, done. You can extend this example to write your own applications.
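One possible extension: the job output lists zero and zero, as separate keys, because split() only splits on whitespace and leaves punctuation attached to the word. The sketch below shows a mapper variant that trims punctuation from both ends of each token and lowercases it before emitting. It is an illustration rather than the original author's code; normalize and map_line are hypothetical names, and what counts as punctuation here is simply Python's string.punctuation.

```python
#!/usr/bin/env python3
"""Sketch of a mapper that normalizes words before counting them."""
import string


def normalize(token):
    """Trim leading/trailing ASCII punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


def map_line(line):
    """Yield one 'word<TAB>1' record per non-empty normalized word."""
    for token in line.split():
        word = normalize(token)
        if word:  # tokens made only of punctuation become empty; skip them
            yield "%s\t%d" % (word, 1)


if __name__ == "__main__":
    # Hypothetical sample line; a real mapper would iterate over sys.stdin.
    for record in map_line("Zero, zero you've -- Your"):
        print(record)
```

Apostrophes inside a word, as in you've, are kept because only the ends of each token are trimmed. The reducer does not need to change.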
Reposted from: http://uztxi.baihongyu.com/