Hadoop Deployment and Usage

  • These are personal study notes, for reference only.

Prerequisites

Operating System

The Hadoop official documentation states the following about operating system requirements:

Operating Systems: The community SHOULD maintain the same minimum OS requirements (OS kernel versions) within a minor release. Currently GNU/Linux and Microsoft Windows are the OSes officially supported by the community, while Apache Hadoop is known to work reasonably well on other OSes such as Apple MacOSX, Solaris, etc. Support for any OS SHOULD NOT be dropped without first being documented as deprecated for a full major release and MUST NOT be dropped without first being deprecated for at least a full minor release.

System used in this article: ubuntukylin-14.04.5

Hadoop Installation Package

Hadoop version used in this article: Hadoop 2.6.0

Hadoop Environment Variables

Install Java

hadoop@ubuntu:~$ sudo apt-get install default-jre default-jdk

Edit the environment variables

hadoop@ubuntu:~$ vim ~/.bashrc

Add:
export JAVA_HOME=/usr/lib/jvm/default-java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=.:${JAVA_HOME}/bin:$PATH

hadoop@ubuntu:~$ source ~/.bashrc      # apply the environment variables

Verify that the environment variables took effect

hadoop@ubuntu:~$ echo $JAVA_HOME

Hadoop Installation

hadoop@ubuntu:~$ sudo tar -zxf hadoop-2.6.0.tar.gz -C /usr/local		# extract the Hadoop archive to /usr/local
hadoop@ubuntu:~$ cd /usr/local
hadoop@ubuntu:/usr/local$ sudo mv hadoop-2.6.0 hadoop # rename
hadoop@ubuntu:/usr/local$ sudo chown -R hadoop:hadoop ./hadoop # change the directory's ownership

String Search

hadoop@ubuntu:/usr/local/hadoop$ mkdir ./input		# create the input directory
hadoop@ubuntu:/usr/local/hadoop$ cp ./etc/hadoop/*.xml ./input # copy some files into input as sample data
hadoop@ubuntu:/usr/local/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+' # count occurrences of strings matching dfs[a-z.]+ in the input; [a-z.]+ matches the rest of each word that begins with dfs
hadoop@ubuntu:/usr/local/hadoop$ cat output/* # view the results
The output directory must not already exist; delete it before re-running the job.
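The regular expression `dfs[a-z.]+` requires at least one lowercase letter or dot after `dfs`, so a bare `dfs` does not match. A small plain-Java sketch (standard `java.util.regex`, no Hadoop required; the sample string is made up for illustration) shows which substrings it picks out:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepDemo {
    public static void main(String[] args) {
        // The same pattern the example job uses: "dfs" followed by
        // one or more lowercase letters or dots.
        Pattern p = Pattern.compile("dfs[a-z.]+");
        String sample = "dfs.replication dfsadmin dfs only";
        Matcher m = p.matcher(sample);
        while (m.find()) {
            // Matches "dfs.replication" and "dfsadmin"; the bare
            // "dfs" token is skipped because [a-z.]+ needs at least
            // one more character.
            System.out.println(m.group());
        }
    }
}
```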

Word Count

The Java word-count program

This step uses a Java word-count program; the contents of WordCount.java are as follows:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
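Conceptually, the job above tokenizes each line into words, emits a (word, 1) pair from the mapper, and sums the values per key in the reducer. The same counting logic can be sketched locally in plain Java without Hadoop (illustration only; the class name `LocalWordCount` is invented here):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Same counting logic as TokenizerMapper + IntSumReducer,
    // but accumulated in a HashMap instead of a MapReduce job.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // map step: emit (word, 1); reduce step: sum per key
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello hadoop hello world"));
    }
}
```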

Compiling and running the word count

hadoop@ubuntu:/usr/local/hadoop$ mkdir workspace/
hadoop@ubuntu:/usr/local/hadoop$ mv WordCount.java workspace/ # move the Java file into workspace under the Hadoop directory
hadoop@ubuntu:/usr/local/hadoop$ export PATH=${JAVA_HOME}/bin:$PATH # add to the environment
hadoop@ubuntu:/usr/local/hadoop$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar # add to the environment
hadoop@ubuntu:/usr/local/hadoop$ cd workspace
hadoop@ubuntu:/usr/local/hadoop/workspace$ ../bin/hadoop com.sun.tools.javac.Main WordCount.java # compile the Java source; this produces 3 class files
hadoop@ubuntu:/usr/local/hadoop/workspace$ jar cf WordCount.jar WordCount*.class # package the class files into a jar
hadoop@ubuntu:/usr/local/hadoop/workspace$ cd ..
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar workspace/WordCount.jar WordCount input output # input and output serve the same roles as in the string search
hadoop@ubuntu:/usr/local/hadoop$ cat output/*
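Each output line is a word and its count separated by a tab. To view the most frequent words, the output can be piped through `sort`; the sketch below uses made-up sample data via `printf` in place of the real `cat output/*`, since the actual counts depend on your input:

```shell
# Sort word-count pairs by the count column, descending, and show
# the top entries. Replace the printf with `cat output/*` on a
# real run; the sample data here is invented for illustration.
printf 'hello\t2\nhadoop\t1\nworld\t3\n' | sort -k2 -nr | head -n 5
```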

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when republishing!