Abstract
This guide walks through installing and configuring the core components of the Hadoop ecosystem (Hadoop, Hive, Zookeeper, YARN, DataX), covering both a single-node test environment and a production-grade cluster. By the end you will have a working big-data platform supporting data warehousing, data synchronization, and distributed computation.
1. Overview
1.1 Components
- Hadoop: distributed storage (HDFS) and computation (MapReduce) framework
- Hive: data warehouse tool on top of Hadoop, offering a SQL-like query language
- Zookeeper: distributed coordination service for cluster management and configuration synchronization
- YARN: Hadoop's resource manager, responsible for allocating cluster resources
- DataX: Alibaba's open-source data synchronization tool, supporting many data sources
1.2 Deployment Modes Compared
| Mode | Use case | Nodes | Trade-offs |
|---|---|---|---|
| Single-node | Learning, testing, development | 1 node | Simple to deploy, low resource needs, but no high availability |
| Cluster | Production, load testing | 3+ nodes | Highly available and performant, but more complex to deploy |
2. Single-Node Installation
2.1 Environment Preparation
2.1.1 System Requirements
# OS: Ubuntu 20.04 / CentOS 8
# Memory: >= 8 GB
# Disk: >= 50 GB
# Java: JDK 8 or 11
# Create a dedicated user (with a home directory)
sudo useradd -m hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop   # CentOS
sudo usermod -aG sudo hadoop    # Ubuntu
# Switch to the hadoop user
su - hadoop
2.1.2 Directory Layout
mkdir -p ~/bigdata/{software,data,logs,conf}
2.2 Hadoop Single-Node Installation
2.2.1 Download and Extract
cd ~/bigdata/software
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz -C ~/
mv ~/hadoop-3.3.6 ~/hadoop
2.2.2 Environment Variables
# Append to ~/.bashrc (or ~/.bash_profile)
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME=$HOME/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  # adjust to your JDK path
EOF
source ~/.bashrc
2.2.3 Configuration Files
1. core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/bigdata/data/tmp</value>
</property>
</configuration>
2. hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/bigdata/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/bigdata/data/datanode</value>
</property>
</configuration>
3. mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
4. yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
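All four files above share the same `<property>`/`<name>`/`<value>` shape, so when you template many of them it can help to render the XML from a plain dict. A minimal sketch; the helper name `render_site_xml` is ours, not part of any Hadoop tooling:

```python
# Render a Hadoop *-site.xml body from a dict of property names to values.
# render_site_xml is a hypothetical helper, not shipped with Hadoop.
from xml.sax.saxutils import escape

def render_site_xml(props: dict) -> str:
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"<name>{escape(name)}</name>")
        lines.append(f"<value>{escape(str(value))}</value>")
        lines.append("</property>")
    lines.append("</configuration>")
    return "\n".join(lines)

xml = render_site_xml({
    "fs.defaultFS": "hdfs://localhost:9000",
    "hadoop.tmp.dir": "/home/hadoop/bigdata/data/tmp",
})
print(xml)
```

Escaping the values guards against characters like `&` in JDBC URLs, which would otherwise produce invalid XML.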
2.2.4 Initialize and Start
# Format HDFS (first run only)
hdfs namenode -format
# Start the daemons
start-dfs.sh
start-yarn.sh
# Verify
jps
# Expect: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
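If you script this check, you can diff the `jps` output against the daemons you expect. A small sketch; the expected set reflects this single-node layout and `check_daemons` is our own name:

```python
# Compare `jps` output against the daemons a single-node setup should run.
# check_daemons is a hypothetical helper; feed it the raw text of `jps`.
EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

def check_daemons(jps_output: str, expected=EXPECTED) -> set:
    """Return the set of expected daemons missing from the jps output."""
    running = {parts[1]
               for line in jps_output.strip().splitlines()
               if len(parts := line.split(maxsplit=1)) == 2}
    return expected - running

sample = "2101 NameNode\n2310 DataNode\n2544 ResourceManager\n2718 Jps\n"
print(check_daemons(sample))  # NodeManager did not come up in this sample
```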
2.3 Zookeeper Single-Node Installation
2.3.1 Download and Install
cd ~/bigdata/software
wget https://downloads.apache.org/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
tar -xzf apache-zookeeper-3.8.3-bin.tar.gz -C ~/
mv ~/apache-zookeeper-3.8.3-bin ~/zookeeper
2.3.2 Configure and Start
cd ~/zookeeper/conf
# Write the configuration (the shipped zoo_sample.cfg remains as a reference)
cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/bigdata/data/zookeeper
clientPort=2181
EOF
# Start Zookeeper
~/zookeeper/bin/zkServer.sh start
2.4 Hive Single-Node Installation
2.4.1 Download and Install
cd ~/bigdata/software
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xzf apache-hive-3.1.3-bin.tar.gz -C ~/
mv ~/apache-hive-3.1.3-bin ~/hive
2.4.2 Configure the Metastore (embedded Derby)
cd ~/hive/conf
cat > hive-site.xml << EOF
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/hadoop/bigdata/data/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
EOF
2.4.3 Initialize and Start
# Create the warehouse directory in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
# Initialize the metastore schema
# (if schematool fails with a Guava NoSuchMethodError, replace the guava jar
# in ~/hive/lib with the newer one from $HADOOP_HOME/share/hadoop/common/lib;
# Hive 3.1.x and Hadoop 3.3.x are known to conflict here)
schematool -dbType derby -initSchema
# Start the Hive CLI
hive
2.5 DataX Single-Node Installation
2.5.1 Download and Install
cd ~/bigdata/software
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -xzf datax.tar.gz -C ~/
2.5.2 Verify the Installation
cd ~/datax
python bin/datax.py job/job.json
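DataX jobs are plain JSON with a reader half and a writer half; generating the built-in stream-to-stream smoke-test job programmatically looks like this. A sketch: `streamreader`/`streamwriter` are the plugins DataX ships with, while the helper name and defaults here are our own:

```python
# Build a minimal DataX stream-to-stream job description as JSON,
# following the shape of the bundled job/job.json sample.
import json

def stream_job(record_count: int = 10) -> str:
    job = {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [{"value": "hello", "type": "string"}],
                        "sliceRecordCount": record_count,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": True},
                },
            }],
        }
    }
    return json.dumps(job, indent=2)

print(stream_job(5))
```

Write the output to a file and pass it to `bin/datax.py` the same way as `job/job.json` above.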
2.6 Single-Node Verification
#!/bin/bash
echo "=== Single-node verification ==="
echo "1. Hadoop:"
hdfs dfs -mkdir -p /test
echo "Hello Hadoop" > test.txt
hdfs dfs -put test.txt /test/
hdfs dfs -cat /test/test.txt
echo -e "\n2. Hive:"
hive -e "CREATE DATABASE IF NOT EXISTS testdb;"
hive -e "SHOW DATABASES;"
echo -e "\n3. Zookeeper:"
echo stat | nc localhost 2181 | head -10
echo -e "\n4. DataX:"
python ~/datax/bin/datax.py --version
echo -e "\n=== Verification complete ==="
3. Cluster Installation
3.1 Cluster Planning
3.1.1 Node Layout (3-node example)
| Node | Hostname | IP address | Services |
|---|---|---|---|
| Master | master | 192.168.1.100 | NameNode, ResourceManager, Zookeeper, Hive Metastore |
| Worker 1 | slave1 | 192.168.1.101 | DataNode, NodeManager, Zookeeper |
| Worker 2 | slave2 | 192.168.1.102 | DataNode, NodeManager, Zookeeper |
3.1.2 Prerequisites
# On every node:
# 1. Set the hostname
sudo hostnamectl set-hostname master   # on master
sudo hostnamectl set-hostname slave1   # on slave1
sudo hostnamectl set-hostname slave2   # on slave2
# 2. Add host entries
sudo tee -a /etc/hosts << EOF
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2
EOF
# 3. Set up passwordless SSH (run on master)
ssh-keygen -t rsa
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
3.2 Hadoop Cluster Configuration
3.2.1 Core Configuration Files
1. core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/bigdata/data/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>master:2181,slave1:2181,slave2:2181</value>
</property>
</configuration>
2. hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/bigdata/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/bigdata/data/datanode</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>master:9870</value>
</property>
</configuration>
3. The workers file
# Per the node layout above, only the slaves run DataNode/NodeManager
cat > $HADOOP_HOME/etc/hadoop/workers << EOF
slave1
slave2
EOF
3.2.2 Distribute the Configuration to All Nodes
# Run on master
for node in slave1 slave2; do
rsync -av $HADOOP_HOME/ hadoop@$node:$HADOOP_HOME/
rsync -av ~/.bashrc hadoop@$node:~/
done
3.2.3 Start the Hadoop Cluster
# 1. Format the NameNode (on master, first start only)
hdfs namenode -format
# 2. Start HDFS
start-dfs.sh
# 3. Start YARN
start-yarn.sh
# 4. Verify the cluster
hdfs dfsadmin -report
yarn node -list
3.3 Zookeeper Ensemble Configuration
3.3.1 Configure zoo.cfg
# On every node:
cd ~/zookeeper/conf
cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/bigdata/data/zookeeper
clientPort=2181
maxClientCnxns=60
# ensemble members
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
EOF
# Create the myid file (a different id on each node)
mkdir -p ~/bigdata/data/zookeeper
# on master: echo 1 > ~/bigdata/data/zookeeper/myid
# on slave1: echo 2 > ~/bigdata/data/zookeeper/myid
# on slave2: echo 3 > ~/bigdata/data/zookeeper/myid
3.3.2 Start the Ensemble
# Start on every node
~/zookeeper/bin/zkServer.sh start
# Check status (expect one leader and two followers)
~/zookeeper/bin/zkServer.sh status
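The `server.N` lines in zoo.cfg and the per-node myid files must agree, which is easy to get wrong by hand. A tiny sketch that derives both from a single host list, so they cannot drift apart (the helper name is ours):

```python
# Derive zoo.cfg server lines and per-host myid values from one host list.
# ensemble_config is a hypothetical helper, not part of Zookeeper.
def ensemble_config(hosts):
    server_lines = [f"server.{i}={h}:2888:3888"
                    for i, h in enumerate(hosts, start=1)]
    myids = {h: i for i, h in enumerate(hosts, start=1)}
    return server_lines, myids

lines, myids = ensemble_config(["master", "slave1", "slave2"])
print("\n".join(lines))
print(myids)
```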
3.4 Hive Cluster Configuration (MySQL metastore)
3.4.1 Install and Configure MySQL
# Install MySQL on the master node
sudo yum install mysql-server -y   # CentOS
sudo apt install mysql-server -y   # Ubuntu
# Start and secure MySQL (the service is mysqld on CentOS, mysql on Ubuntu)
sudo systemctl start mysqld
sudo mysql_secure_installation
# Create the Hive metastore database and user
mysql -u root -p
CREATE DATABASE hive_metastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
GRANT ALL ON hive_metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
EXIT;
3.4.2 Point Hive at MySQL
# 1. Install the MySQL JDBC driver (Connector/J was renamed from
#    mysql-connector-java to mysql-connector-j as of 8.0.31)
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar -xzf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ~/hive/lib/
# 2. Write hive-site.xml
cat > ~/hive/conf/hive-site.xml << EOF
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
</configuration>
EOF
3.4.3 Initialize and Start the Hive Services
# Initialize the metastore schema
schematool -dbType mysql -initSchema
# Start the Metastore service
nohup hive --service metastore > ~/bigdata/logs/metastore.log 2>&1 &
# Start HiveServer2
nohup hive --service hiveserver2 > ~/bigdata/logs/hiveserver2.log 2>&1 &
3.5 DataX Cluster Setup
3.5.1 Distribute DataX to All Nodes
# Package on master and push to each slave
cd ~
tar -czf datax.tar.gz datax
for node in slave1 slave2; do
scp datax.tar.gz hadoop@$node:~/
ssh hadoop@$node "tar -xzf datax.tar.gz"
done
3.5.2 A Job Distribution Script
mkdir -p ~/bigdata/scripts
cat > ~/bigdata/scripts/distribute_datax.sh << 'EOF'
#!/bin/bash
# Distribute a DataX job file to every node in the cluster
JOB_FILE=$1
NODES="master slave1 slave2"
if [ -z "$JOB_FILE" ]; then
  echo "Usage: $0 <datax_job_json>"
  exit 1
fi
for node in $NODES; do
  echo "Copying $JOB_FILE to $node..."
  scp "$JOB_FILE" hadoop@$node:~/datax/job/
done
echo "Job file distributed"
EOF
chmod +x ~/bigdata/scripts/distribute_datax.sh
3.6 High Availability (HDFS HA)
3.6.1 NameNode and JournalNode Settings
# Add the following properties inside the <configuration> element of
# hdfs-site.xml (appending with `cat >>` would land after </configuration>
# and break the XML)
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>master:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>slave1:9000</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://master:8485;slave1:8485;slave2:8485/mycluster</value>
</property>
# A complete HA setup additionally needs dfs.client.failover.proxy.provider.mycluster,
# fencing (dfs.ha.fencing.methods), dfs.ha.automatic-failover.enabled, and
# fs.defaultFS in core-site.xml changed to hdfs://mycluster
3.6.2 Start the HA Cluster
# 1. Start a JournalNode on each node (must be up before formatting)
hdfs --daemon start journalnode
# 2. Format the active NameNode (on master) and the ZKFC state in Zookeeper
hdfs namenode -format
hdfs zkfc -formatZK
# 3. On the standby NameNode (slave1), run: hdfs namenode -bootstrapStandby
start-dfs.sh
3.7 Cluster Management Scripts
3.7.1 Startup Script
# Start Zookeeper before Hadoop, and Hive last (see section 5.1)
cat > ~/bigdata/scripts/start_cluster.sh << 'EOF'
#!/bin/bash
echo "========== Starting Zookeeper ensemble =========="
for node in master slave1 slave2; do
  ssh hadoop@$node "~/zookeeper/bin/zkServer.sh start"
done
echo "========== Starting Hadoop cluster =========="
start-dfs.sh
start-yarn.sh
echo "========== Starting Hive services =========="
nohup hive --service metastore > /dev/null 2>&1 &
nohup hive --service hiveserver2 > /dev/null 2>&1 &
echo "========== Cluster status =========="
hdfs dfsadmin -report
echo ""
yarn node -list
EOF
chmod +x ~/bigdata/scripts/start_cluster.sh
3.7.2 Monitoring Script
cat > ~/bigdata/scripts/cluster_monitor.sh << 'EOF'
#!/bin/bash
echo "===== Cluster monitoring report ====="
date
echo -e "\n1. Node reachability:"
for node in master slave1 slave2; do
  if ping -c 1 $node > /dev/null 2>&1; then
    status="✓"
  else
    status="✗"
  fi
  echo "  $status $node"
done
echo -e "\n2. Running services:"
for node in master slave1 slave2; do
  echo "  $node:"
  ssh hadoop@$node "jps | grep -v Jps"
done
echo -e "\n3. HDFS status:"
hdfs dfsadmin -report | grep -E "Live|Configured"
echo -e "\n4. Disk usage:"
for node in master slave1 slave2; do
  echo "  $node: $(ssh hadoop@$node "df -h /home | tail -1")"
done
EOF
chmod +x ~/bigdata/scripts/cluster_monitor.sh
3.8 Performance Tuning
3.8.1 Hadoop Tuning
# Add inside the <configuration> element of yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
# Add inside the <configuration> element of mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value>
</property>
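These numbers scale with node RAM: the usual rule of thumb is to leave headroom for the OS and daemons, hand the rest to NodeManager, and set the container JVM heap (-Xmx) to roughly 80% of the container size, as the 2048m/1638m pairing above does. The arithmetic can be sketched as follows (the function and default ratios are our own simplification, not an official formula):

```python
# Rough YARN memory sizing from node RAM: reserve some RAM for the OS and
# daemons, give the rest to NodeManager containers, and size the JVM heap
# at ~80% of the container. yarn_memory is a hypothetical helper.
def yarn_memory(node_ram_mb: int, reserved_mb: int = 2048,
                map_container_mb: int = 2048, heap_ratio: float = 0.8):
    nm_mb = node_ram_mb - reserved_mb
    return {
        "yarn.nodemanager.resource.memory-mb": nm_mb,
        "yarn.scheduler.maximum-allocation-mb": nm_mb,
        "mapreduce.map.memory.mb": map_container_mb,
        "mapreduce.map.java.opts": f"-Xmx{int(map_container_mb * heap_ratio)}m",
    }

# A 10 GB node with 2 GB reserved reproduces the 8192/1638m values above
print(yarn_memory(10240))
```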
3.8.2 Hive Tuning
# Add inside the <configuration> element of hive-site.xml:
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
</property>
3.9 Cluster Validation
#!/bin/bash
echo "========== Cluster validation =========="
echo "1. HDFS distributed test:"
hdfs dfs -mkdir -p /cluster_test
echo "Cluster Test Data" > cluster_test.txt
hdfs dfs -put cluster_test.txt /cluster_test/
hdfs dfs -setrep 3 /cluster_test/cluster_test.txt
hdfs dfs -ls /cluster_test
echo "Replication: $(hdfs fsck /cluster_test/cluster_test.txt -files -blocks -locations | grep 'replica')"
echo -e "\n2. MapReduce cluster test:"
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
wordcount /cluster_test /cluster_test_output
hdfs dfs -cat /cluster_test_output/part-r-00000 | head -5
echo -e "\n3. Hive cluster test:"
beeline -u jdbc:hive2://master:10000 -n hadoop -e "
CREATE DATABASE IF NOT EXISTS cluster_db;
USE cluster_db;
CREATE TABLE cluster_test (id INT, name STRING);
INSERT INTO cluster_test VALUES (1, 'Cluster Test');
SELECT * FROM cluster_test;
"
echo -e "\n4. DataX cluster test:"
~/bigdata/scripts/distribute_datax.sh ~/datax/job/stream2stream.json
ssh hadoop@slave1 "cd ~/datax && python bin/datax.py job/stream2stream.json"
echo -e "\n========== Validation complete =========="
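The fsck grep in step 1 prints a raw line; pulling the per-block replication counts out of it makes the check scriptable. A sketch against the usual `repl=N` phrasing in block listings (the sample line below is illustrative only, and real fsck output varies by Hadoop version):

```python
# Pull per-block replication counts out of `hdfs fsck ... -blocks` output.
# The sample line is illustrative; fsck formatting differs across versions.
import re

def block_replication(fsck_output: str) -> list:
    """Return the repl=N counts found in fsck block listings."""
    return [int(m) for m in re.findall(r"\brepl=(\d+)", fsck_output)]

sample = "0. BP-523423:blk_1073741825_1001 len=18 repl=3 [dn1, dn2, dn3]"
print(block_replication(sample))
```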
4. Troubleshooting and Maintenance
4.1 Common Problems
4.1.1 A Service Fails to Start
# Check the logs first
tail -f ~/hadoop/logs/hadoop-*-namenode-*.log
tail -f ~/hive/logs/hive-*.log
# Check for port conflicts
netstat -tlnp | grep -E '9000|9870|8088|2181'
# Check the firewall
sudo firewall-cmd --list-all   # CentOS
sudo ufw status                # Ubuntu
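A quick way to confirm the key ports are actually accepting connections, independent of netstat, is a plain TCP connect probe. A sketch; the port list matches the services above and `ports_open` is our own name:

```python
# Probe the stack's well-known ports with a TCP connect.
# 9000=HDFS RPC, 9870=NameNode web UI, 8088=YARN web UI, 2181=Zookeeper.
import socket

def ports_open(host: str, ports, timeout: float = 1.0) -> dict:
    """Map each port to True if a TCP connection succeeds."""
    result = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                result[port] = True
        except OSError:
            result[port] = False
    return result

print(ports_open("localhost", [9000, 9870, 8088, 2181]))
```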
4.1.2 Nodes Cannot Communicate
# Check basic connectivity
ping slave1
telnet slave1 22
# Check SSH access
ssh hadoop@slave1 "hostname"
# Check /etc/hosts
cat /etc/hosts
4.2 Backup and Recovery
4.2.1 HDFS Snapshots
# Allow snapshots on a directory
hdfs dfsadmin -allowSnapshot /user
# Create a snapshot
hdfs dfs -createSnapshot /user my_snapshot
# Restore: there is no restoreSnapshot subcommand; copy files back out of
# the read-only .snapshot directory instead
hdfs dfs -cp /user/.snapshot/my_snapshot/<file> /user/
4.2.2 Hive Metastore Backup
# Dump the Hive metastore from MySQL
mysqldump -u hive -p hive_metastore > hive_metastore_backup.sql
# Restore it
mysql -u hive -p hive_metastore < hive_metastore_backup.sql
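In practice you will want date-stamped dump files and a retention window rather than one overwritten backup. The naming and pruning logic can be sketched as follows (both helpers are our own, not part of any tool):

```python
# Generate a date-stamped dump filename and decide which old dumps to prune,
# keeping only the newest `keep` files. Both helpers are a hypothetical sketch.
from datetime import date

def backup_name(db: str, day: date) -> str:
    return f"{db}_backup_{day:%Y%m%d}.sql"

def to_prune(existing: list, keep: int = 7) -> list:
    """Lexicographic sort works because the date is zero-padded YYYYMMDD."""
    return sorted(existing)[:-keep] if len(existing) > keep else []

name = backup_name("hive_metastore", date(2024, 1, 5))
print(name)
dumps = [backup_name("hive_metastore", date(2024, 1, d)) for d in range(1, 10)]
print(to_prune(dumps, keep=7))  # the two oldest of nine dumps
```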
5. Summary
5.1 Key Points
- Environment prep: make sure clocks are synchronized, passwordless SSH works, and hostnames resolve on every node
- Consistent configuration: configuration files must be identical across nodes
- Startup order: Zookeeper first, then Hadoop, then Hive
- Resource sizing: tune memory and CPU parameters to the actual hardware
5.2 Production Recommendations
- Hardware:
  - NameNode: 32 GB+ RAM, SSD storage
  - DataNode: 16 GB+ RAM, multiple HDDs
  - Network: gigabit Ethernet or InfiniBand
- Security:
  - Enable Kerberos authentication
  - Configure SSL encryption
  - Apply strict firewall rules
- Monitoring:
  - Deploy Prometheus + Grafana
  - Aggregate logs (ELK stack)
  - Set up alerting
5.3 Where to Go Next
- Additional components:
  - Spark: in-memory compute framework
  - Flink: stream processing engine
  - HBase: NoSQL database
  - Presto/Trino: distributed SQL query engines
- Containerization:
  - Package each component as a Docker image
  - Orchestrate with Kubernetes
- Managed cloud offerings:
  - AWS EMR
  - Azure HDInsight
  - Google Cloud Dataproc
This guide covered the full path from a single node to a cluster: installation, configuration, tuning, and operations. For a real deployment, adjust the parameters to your workload and validate thoroughly before going live.
Appendix
A. Command Quick Reference
# Hadoop
hdfs dfsadmin -report              # HDFS status report
hdfs dfs -ls /                     # list an HDFS directory
hdfs dfs -du -h /                  # HDFS disk usage
# YARN
yarn application -list             # list YARN applications
yarn logs -applicationId <appId>   # fetch application logs
# Hive
hive --service metastore           # start the Metastore
beeline -u jdbc:hive2://...        # connect to HiveServer2
# Zookeeper
zkCli.sh -server localhost:2181    # connect to Zookeeper
zkServer.sh status                 # Zookeeper status