Preparation

System

The official Hadoop documentation states the following operating system requirements:
Operating Systems: The community SHOULD maintain the same minimum OS requirements (OS kernel versions) within a minor release. Currently GNU/Linux and Microsoft Windows are the OSes officially supported by the community, while Apache Hadoop is known to work reasonably well on other OSes such as Apple MacOSX, Solaris, etc. Support for any OS SHOULD NOT be dropped without first being documented as deprecated for a full major release and MUST NOT be dropped without first being deprecated for at least a full minor release.
OS used in this article: ubuntukylin-14.04.5
Hadoop installation package

Hadoop version used in this article: Hadoop-2.6.0
Hadoop environment variables

Install Java

hadoop@ubuntu:~$ sudo apt-get install default-jre default-jdk
Edit the environment variables:

hadoop@ubuntu:~$ vim ~/.bashrc
Add the following lines:

export JAVA_HOME=/usr/lib/jvm/default-java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=.:${JAVA_HOME}/bin:$PATH
hadoop@ubuntu:~$ source ~/.bashrc  # apply the new environment variables
Check that the variable took effect:
hadoop@ubuntu:~$ echo $JAVA_HOME
Install Hadoop

hadoop@ubuntu:~$ sudo tar -zxf hadoop-2.6.0.tar.gz -C /usr/local  # extract the Hadoop tarball into /usr/local
hadoop@ubuntu:~$ cd /usr/local
hadoop@ubuntu:/usr/local$ sudo mv hadoop-2.6.0 hadoop  # rename the directory
hadoop@ubuntu:/usr/local$ sudo chown -R hadoop:hadoop ./hadoop  # change the directory's owner to the hadoop user
Searching for a string

hadoop@ubuntu:/usr/local/hadoop$ mkdir ./input  # create the input directory
hadoop@ubuntu:/usr/local/hadoop$ cp ./etc/hadoop/*.xml ./input  # copy the config files in as sample input
hadoop@ubuntu:/usr/local/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'  # count occurrences in input of strings beginning with dfs; [a-z.]+ matches the rest of each dfs-prefixed word
hadoop@ubuntu:/usr/local/hadoop$ cat output/*  # view the results

Note that the output directory must not already exist; delete it before rerunning the job.
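The pattern 'dfs[a-z.]+' matches the literal "dfs" followed by at least one lowercase letter or dot, so "dfsadmin" and "dfs.replication" match while a bare "dfs" does not. As a quick sanity check of what the grep job counts, here is a minimal plain-Java sketch (the class and method names are mine, for illustration only) applying the same pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepPatternDemo {
    // Return every substring of text matched by the tutorial's pattern
    public static List<String> findMatches(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile("dfs[a-z.]+").matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // bare "dfs" is skipped because [a-z.]+ requires one more character
        System.out.println(findMatches("dfsadmin -report uses dfs and dfs.replication"));
        // prints [dfsadmin, dfs.replication]
    }
}
```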
Word count

Java word-count program

This step uses a Java word-count program; the contents of WordCount.java follow.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
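The mapper above splits each input line with StringTokenizer (whitespace-delimited) and emits a (word, 1) pair per token. A minimal plain-Java sketch of that map step outside Hadoop (class and method names are hypothetical, and the pairs are rendered as "word:1" strings just for display):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapStepDemo {
    // Mimic TokenizerMapper: split a line on whitespace, emit one (word, 1) pair per token
    public static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(itr.nextToken() + ":1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("hello hadoop hello"));
        // prints [hello:1, hadoop:1, hello:1]
    }
}
```

The shuffle phase then groups these pairs by word, so IntSumReducer receives each word with the list of its 1s and sums them.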
Running the word count

hadoop@ubuntu:/usr/local/hadoop$ mkdir wordspace/
hadoop@ubuntu:/usr/local/hadoop$ mv WordCount.java wordspace/  # put the Java file in wordspace under the Hadoop directory
hadoop@ubuntu:/usr/local/hadoop$ export PATH=$JAVA_HOME/bin:$PATH  # add to the environment
hadoop@ubuntu:/usr/local/hadoop$ export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar  # add to the environment
hadoop@ubuntu:/usr/local/hadoop$ cd wordspace
hadoop@ubuntu:/usr/local/hadoop/wordspace$ ../bin/hadoop com.sun.tools.javac.Main WordCount.java  # compile the Java file; this produces 3 class files
hadoop@ubuntu:/usr/local/hadoop/wordspace$ jar cf WordCount.jar WordCount*.class  # package the class files into a jar
hadoop@ubuntu:/usr/local/hadoop/wordspace$ cd ..
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar wordspace/WordCount.jar WordCount input output  # input and output play the same roles as in the string search above
hadoop@ubuntu:/usr/local/hadoop$ cat output/*
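To cross-check the job's result without Hadoop, the same counts can be computed locally in a few lines; this is a sketch of the combined map and reduce logic (a helper of my own, not part of the tutorial's code), using a sorted map since the job's output is sorted by key:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Compute word frequencies the same way the MapReduce job does:
    // tokenize on whitespace, then sum a count of 1 per occurrence.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // keys sorted, like the job output
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello hadoop hello world"));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```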