Hadoop Ecosystem Deployment Guide

Abstract

This article walks through installing and configuring the core components of the Hadoop ecosystem (Hadoop, Hive, Zookeeper, YARN, DataX), covering both a single-node test environment and a production-grade cluster. Following this guide, you can quickly stand up a big-data platform that provides a complete foundation for data warehousing, data synchronization, and distributed computing.

1. Overview

1.1 Components

  • Hadoop: distributed storage (HDFS) and compute (MapReduce) framework
  • Hive: data warehouse tool on top of Hadoop, offering SQL-like queries
  • Zookeeper: distributed coordination service for cluster management and configuration synchronization
  • YARN: Hadoop's resource manager, responsible for cluster resource allocation
  • DataX: Alibaba's open-source data synchronization tool, supporting many data sources

1.2 Deployment Modes Compared

| Mode        | Use cases                      | Nodes | Trade-offs                                                      |
|-------------|--------------------------------|-------|-----------------------------------------------------------------|
| Single-node | Learning, testing, development | 1     | Simple to deploy, low resource needs, but no high availability  |
| Cluster     | Production, load testing       | 3+    | Highly available and performant, but more complex to deploy     |

2. Single-Node Installation

2.1 Environment Preparation

2.1.1 System Requirements

# OS: Ubuntu 20.04 / CentOS 8
# RAM: ≥ 8 GB
# Disk: ≥ 50 GB
# Java: JDK 8 or 11

# Create a dedicated user
sudo useradd hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop  # CentOS
sudo usermod -aG sudo hadoop   # Ubuntu

# Switch to the hadoop user
su - hadoop

2.1.2 Directory Layout

mkdir -p ~/bigdata/{software,data,logs,conf}

2.2 Hadoop Single-Node Installation

2.2.1 Download and Unpack

cd ~/bigdata/software
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz -C ~/
mv ~/hadoop-3.3.6 ~/hadoop

2.2.2 Environment Variables

# Append to ~/.bashrc or ~/.bash_profile (adjust JAVA_HOME to your JDK location)
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME=$HOME/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
EOF

source ~/.bashrc

2.2.3 Configuration Files

Edit the following files under $HADOOP_HOME/etc/hadoop.

1. core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/bigdata/data/tmp</value>
    </property>
</configuration>

2. hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/bigdata/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/bigdata/data/datanode</value>
    </property>
</configuration>

3. mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4. yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
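
Two gotchas are worth handling before the first start: the Hadoop start scripts launch daemons over ssh and do not inherit JAVA_HOME from your shell, and even single-node mode needs passwordless ssh to localhost. A minimal sketch, assuming the JDK path from section 2.2.2:

# Set JAVA_HOME explicitly for the daemons (adjust the path to your JDK)
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Passwordless ssh to localhost, required by start-dfs.sh even on one node
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys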

2.2.4 Initialize and Start

# Format HDFS
hdfs namenode -format

# Start the services
start-dfs.sh
start-yarn.sh

# Verify
jps
# Expected: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
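
Besides jps, the web UIs give a quick health check; 9870 (NameNode) and 8088 (ResourceManager) are the Hadoop 3.x defaults:

curl -s http://localhost:9870 > /dev/null && echo "HDFS web UI up"
curl -s http://localhost:8088 > /dev/null && echo "YARN web UI up"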

2.3 Zookeeper Single-Node Installation

2.3.1 Download and Install

cd ~/bigdata/software
wget https://downloads.apache.org/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
tar -xzf apache-zookeeper-3.8.3-bin.tar.gz -C ~/
mv ~/apache-zookeeper-3.8.3-bin ~/zookeeper

2.3.2 Configure and Start

cd ~/zookeeper/conf
cp zoo_sample.cfg zoo.cfg

# Write the configuration
cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/bigdata/data/zookeeper
clientPort=2181
# allow the four-letter-word status commands (stat/ruok) used for verification;
# ZooKeeper 3.5+ whitelists only srvr by default
4lw.commands.whitelist=stat,ruok,srvr
EOF

# Create the data directory and start Zookeeper
mkdir -p ~/bigdata/data/zookeeper
~/zookeeper/bin/zkServer.sh start
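
To confirm Zookeeper is serving, check its mode and send a liveness probe (ruok works here because of the 4lw whitelist set above):

~/zookeeper/bin/zkServer.sh status
echo ruok | nc localhost 2181   # should answer "imok"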

2.4 Hive Single-Node Installation

2.4.1 Download and Install

cd ~/bigdata/software
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xzf apache-hive-3.1.3-bin.tar.gz -C ~/
mv ~/apache-hive-3.1.3-bin ~/hive
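
One known snag with this version pairing: Hive 3.1.3 bundles an older Guava than Hadoop 3.3.x, so schematool and the CLI typically fail with a NoSuchMethodError on com.google.common.base.Preconditions.checkArgument. The usual workaround is to replace Hive's copy with Hadoop's:

# Swap Hive's bundled Guava for the newer one shipped with Hadoop
rm ~/hive/lib/guava-*.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-*.jar ~/hive/lib/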

2.4.2 Configure Metadata Storage (Derby)

Embedded Derby allows only one active session, which is fine for this test setup; section 3.4 switches to a MySQL-backed metastore for real use.

cd ~/hive/conf
cat > hive-site.xml << EOF
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/home/hadoop/bigdata/data/metastore_db;create=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>
EOF

2.4.3 Initialize and Start

# Create the warehouse directories in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

# Initialize the metastore schema
schematool -dbType derby -initSchema

# Start the Hive CLI
hive

2.5 DataX Single-Node Installation

2.5.1 Download and Install

cd ~/bigdata/software
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -xzf datax.tar.gz -C ~/   # the archive already unpacks to ~/datax

2.5.2 Verify the Installation

cd ~/datax
# note: classic DataX releases require Python 2; newer builds also ship Python 3 support
python bin/datax.py job/job.json
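
The bundled job/job.json is a stream-to-stream self-test that needs no external systems. Writing a minimal job by hand also shows the shape of a DataX job file; the sketch below (the filename stream2stream.json is our own choice, reused by the cluster test in section 3.9) prints ten generated records to stdout:

cat > ~/datax/job/stream2stream.json << 'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "sliceRecordCount": 10,
          "column": [
            { "type": "string", "value": "hello" },
            { "type": "long", "value": "1" }
          ]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": { "print": true }
      }
    }]
  }
}
EOF

python bin/datax.py job/stream2stream.json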

2.6 Single-Node Environment Checks

#!/bin/bash
echo "=== Single-node environment checks ==="

echo "1. Hadoop:"
hdfs dfs -mkdir -p /test
echo "Hello Hadoop" > test.txt
hdfs dfs -put test.txt /test/
hdfs dfs -cat /test/test.txt

echo -e "\n2. Hive:"
hive -e "CREATE DATABASE IF NOT EXISTS testdb;"
hive -e "SHOW DATABASES;"

echo -e "\n3. Zookeeper:"
echo stat | nc localhost 2181 | head -10

echo -e "\n4. DataX:"
python ~/datax/bin/datax.py --version

echo -e "\n=== Checks complete ==="

3. Cluster Installation

3.1 Cluster Planning

3.1.1 Node Plan (3-node example)

| Node     | Hostname | IP address    | Services                                               |
|----------|----------|---------------|--------------------------------------------------------|
| Master   | master   | 192.168.1.100 | NameNode, ResourceManager, Zookeeper, Hive Metastore   |
| Worker 1 | slave1   | 192.168.1.101 | DataNode, NodeManager, Zookeeper                       |
| Worker 2 | slave2   | 192.168.1.102 | DataNode, NodeManager, Zookeeper                       |

3.1.2 Prerequisites

# Run on all nodes
# 1. Set hostnames
sudo hostnamectl set-hostname master  # on master
sudo hostnamectl set-hostname slave1  # on slave1
sudo hostnamectl set-hostname slave2  # on slave2

# 2. Configure /etc/hosts
sudo tee -a /etc/hosts << EOF
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2
EOF

# 3. Set up passwordless SSH (run on master)
ssh-keygen -t rsa
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
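
A quick loop confirms passwordless login works before any cluster script depends on it (BatchMode makes ssh fail instead of prompting for a password):

for node in master slave1 slave2; do
    ssh -o BatchMode=yes hadoop@$node hostname
done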

3.2 Hadoop Cluster Configuration

3.2.1 Core Configuration Files

1. core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/bigdata/data/tmp</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>
</configuration>

2. hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/bigdata/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/bigdata/data/datanode</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:9870</value>
    </property>
</configuration>

3. The workers file

Listing master here means it also runs a DataNode and NodeManager; with only three machines, that is what makes dfs.replication=3 satisfiable.

cat > $HADOOP_HOME/etc/hadoop/workers << EOF
master
slave1
slave2
EOF

3.2.2 Distribute the Configuration to All Nodes

# Run on master
for node in slave1 slave2; do
    rsync -av $HADOOP_HOME/ hadoop@$node:$HADOOP_HOME/
    rsync -av ~/.bashrc hadoop@$node:~/
done

3.2.3 Start the Hadoop Cluster

# 1. Format the NameNode (first start only; run on master)
hdfs namenode -format

# 2. Start HDFS
start-dfs.sh

# 3. Start YARN
start-yarn.sh

# 4. Verify the cluster
hdfs dfsadmin -report
yarn node -list

3.3 Zookeeper Cluster Configuration

3.3.1 Configure zoo.cfg

# Run on all nodes
cd ~/zookeeper/conf

cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/bigdata/data/zookeeper
clientPort=2181
maxClientCnxns=60
4lw.commands.whitelist=stat,ruok,srvr

# Cluster membership
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
EOF

# Create the myid file (a different id on each node)
# on master: echo 1 > ~/bigdata/data/zookeeper/myid
# on slave1: echo 2 > ~/bigdata/data/zookeeper/myid
# on slave2: echo 3 > ~/bigdata/data/zookeeper/myid
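
Rather than remembering which id belongs where, the id can be derived from the hostname; run this on every node (the mapping must match the server.N lines above):

mkdir -p ~/bigdata/data/zookeeper
case $(hostname) in
    master) echo 1 > ~/bigdata/data/zookeeper/myid ;;
    slave1) echo 2 > ~/bigdata/data/zookeeper/myid ;;
    slave2) echo 3 > ~/bigdata/data/zookeeper/myid ;;
    *)      echo "unknown host $(hostname), set myid manually" >&2 ;;
esac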

3.3.2 Start the Zookeeper Cluster

# Start on every node
~/zookeeper/bin/zkServer.sh start

# Check status (one node should report "leader", the others "follower")
~/zookeeper/bin/zkServer.sh status

3.4 Hive Cluster Configuration (MySQL Metastore)

3.4.1 Install and Configure MySQL

# Install MySQL on master
sudo yum install mysql-server -y  # CentOS
sudo apt install mysql-server -y  # Ubuntu

# Start and secure MySQL (the service is named mysql on Ubuntu)
sudo systemctl start mysqld
sudo mysql_secure_installation

# Create the Hive database
mysql -u root -p
CREATE DATABASE hive_metastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
GRANT ALL ON hive_metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
EXIT;

3.4.2 Point Hive at MySQL

# 1. Install the MySQL JDBC driver (since 8.0.31 the artifact is named mysql-connector-j)
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar -xzf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ~/hive/lib/

# 2. Write hive-site.xml
cat > ~/hive/conf/hive-site.xml << EOF
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://master:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
</configuration>
EOF

3.4.3 Initialize and Start the Hive Services

# Initialize the metastore schema
schematool -dbType mysql -initSchema

# Start the Metastore service
nohup hive --service metastore > ~/bigdata/logs/metastore.log 2>&1 &

# Start HiveServer2
nohup hive --service hiveserver2 > ~/bigdata/logs/hiveserver2.log 2>&1 &
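
Both services take a few seconds to come up; before pointing clients at them, it is worth confirming that the Metastore (port 9083) and HiveServer2 (port 10000) are listening and answering:

sleep 10
netstat -tln | grep -E '9083|10000'
beeline -u jdbc:hive2://master:10000 -n hadoop -e "SHOW DATABASES;"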

3.5 DataX Cluster Configuration

3.5.1 Distribute DataX to All Nodes

# Package and distribute from master
cd ~
cd ~
tar -czf datax.tar.gz datax

for node in slave1 slave2; do
    scp datax.tar.gz hadoop@$node:~/
    ssh hadoop@$node "tar -xzf datax.tar.gz"
done

3.5.2 A Job Distribution Script

cat > ~/bigdata/scripts/distribute_datax.sh << 'EOF'
#!/bin/bash
# Distribute a DataX job file to every node

JOB_FILE=$1
NODES="master slave1 slave2"

if [ -z "$JOB_FILE" ]; then
    echo "Usage: $0 <datax_job_json>"
    exit 1
fi

for node in $NODES; do
    echo "Copying $JOB_FILE to $node..."
    scp $JOB_FILE hadoop@$node:~/datax/job/
done

echo "任务分发完成"
EOF

chmod +x ~/bigdata/scripts/distribute_datax.sh

3.6 High Availability (HDFS HA)

3.6.1 Configure the Nameservice and JournalNodes

Add the following properties inside the <configuration> element of hdfs-site.xml on every node. (Do not simply append them with cat >>: that would land after the closing </configuration> tag and produce invalid XML.)

<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>master:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>slave1:9000</value>
</property>
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://master:8485;slave1:8485;slave2:8485/mycluster</value>
</property>
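
The snippet above only defines the nameservice layout. A working HA setup needs at least the following properties as well, shown here with the standard values from the Hadoop HA documentation (and with HA enabled, fs.defaultFS in core-site.xml should point at hdfs://mycluster rather than a single host):

<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/hadoop/bigdata/data/journalnode</value>
</property>
<property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
</property>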

3.6.2 Start the HA Cluster

# Start a JournalNode on each node
hdfs --daemon start journalnode

# Format and start HA (see the standby bootstrap note below)
hdfs namenode -format
hdfs zkfc -formatZK
start-dfs.sh
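
Between formatting and start-dfs.sh, the standby NameNode's metadata must be seeded from the active one, a step that is easy to miss. A minimal sketch:

# On master: bring up only the freshly formatted NameNode
hdfs --daemon start namenode

# On slave1 (nn2): copy the namespace from the active NameNode
hdfs namenode -bootstrapStandby

# Back on master: now start-dfs.sh brings up the remaining daemons
start-dfs.sh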

3.7 Cluster Management Scripts

3.7.1 Startup Script

Following the startup order recapped in section 5.1 (Zookeeper first, then Hadoop, then Hive):

cat > ~/bigdata/scripts/start_cluster.sh << 'EOF'
#!/bin/bash
echo "========== Starting Zookeeper cluster =========="
for node in master slave1 slave2; do
    ssh hadoop@$node "~/zookeeper/bin/zkServer.sh start"
done

echo "========== Starting Hadoop cluster =========="
start-dfs.sh
start-yarn.sh

echo "========== Starting Hive services =========="
nohup hive --service metastore > /dev/null 2>&1 &
nohup hive --service hiveserver2 > /dev/null 2>&1 &

echo "========== Cluster status =========="
hdfs dfsadmin -report
echo ""
yarn node -list
EOF

chmod +x ~/bigdata/scripts/start_cluster.sh

3.7.2 Monitoring Script

cat > ~/bigdata/scripts/cluster_monitor.sh << 'EOF'
#!/bin/bash
echo "===== Cluster monitoring report ====="
date

echo -e "\n1. Node reachability:"
for node in master slave1 slave2; do
    if ping -c 1 $node > /dev/null 2>&1; then
        status="✓"
    else
        status="✗"
    fi
    echo "  $status $node"
done

echo -e "\n2. Running Java services:"
for node in master slave1 slave2; do
    echo "  $node:"
    ssh hadoop@$node "jps | grep -v Jps"
done

echo -e "\n3. HDFS status:"
hdfs dfsadmin -report | grep -E "Live|Configured"

echo -e "\n4. Disk usage:"
for node in master slave1 slave2; do
    echo "  $node: $(ssh hadoop@$node "df -h /home | tail -1")"
done
EOF

chmod +x ~/bigdata/scripts/cluster_monitor.sh

3.8 Performance Tuning

3.8.1 Hadoop Tuning

Add the following inside the <configuration> element of yarn-site.xml (as in section 3.6.1, do not append with cat >>, which would land after the closing tag):

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>

And inside <configuration> in mapred-site.xml; a common rule of thumb sets the JVM heap (*.java.opts) to roughly 80% of the container size, hence 2048 MB × 0.8 ≈ 1638 MB below:

<property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1638m</value>
</property>
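
Config changes only take effect once the files are synced to every node and the affected daemons restarted; a sketch reusing the rsync loop from section 3.2.2:

for node in slave1 slave2; do
    rsync -av $HADOOP_HOME/etc/hadoop/ hadoop@$node:$HADOOP_HOME/etc/hadoop/
done
stop-yarn.sh && start-yarn.sh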

3.8.2 Hive Tuning

Likewise, add these properties inside <configuration> in hive-site.xml:

<property>
    <name>hive.exec.parallel</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
</property>
<property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
</property>
<property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
</property>
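
Whether the settings took effect can be checked from any client session, since SET with a name and no value prints the current value:

beeline -u jdbc:hive2://master:10000 -n hadoop -e "SET hive.exec.parallel; SET hive.vectorized.execution.enabled;"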

3.9 Cluster Validation Tests

#!/bin/bash
echo "========== Cluster validation tests =========="

echo "1. HDFS distributed test:"
hdfs dfs -mkdir -p /cluster_test
echo "Cluster Test Data" > cluster_test.txt
hdfs dfs -put cluster_test.txt /cluster_test/
hdfs dfs -setrep 3 /cluster_test/cluster_test.txt
hdfs dfs -ls /cluster_test
echo "Replication: $(hdfs fsck /cluster_test/cluster_test.txt -files -blocks -locations | grep 'replica')"

echo -e "\n2. MapReduce test:"
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /cluster_test /cluster_test_output
hdfs dfs -cat /cluster_test_output/part-r-00000 | head -5

echo -e "\n3. Hive test:"
beeline -u jdbc:hive2://master:10000 -n hadoop -e "
  CREATE DATABASE IF NOT EXISTS cluster_db;
  USE cluster_db;
  CREATE TABLE cluster_test (id INT, name STRING);
  INSERT INTO cluster_test VALUES (1, 'Cluster Test');
  SELECT * FROM cluster_test;
"

echo -e "\n4. DataX test (uses the stream2stream job from section 2.5.2):"
~/bigdata/scripts/distribute_datax.sh ~/datax/job/stream2stream.json
ssh hadoop@slave1 "cd ~/datax && python bin/datax.py job/stream2stream.json"

echo -e "\n========== Validation complete =========="

4. Troubleshooting and Maintenance

4.1 Common Problems

4.1.1 A Service Fails to Start

# Inspect the logs to locate the problem
tail -f ~/hadoop/logs/hadoop-*-namenode-*.log
tail -f ~/hive/logs/hive-*.log

# Check for port conflicts
netstat -tlnp | grep -E '9000|9870|8088|2181'

# Check the firewall
sudo firewall-cmd --list-all  # CentOS
sudo ufw status              # Ubuntu
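
If the firewall turns out to be the culprit, either open the specific service ports or, on a trusted internal network, allow the peer subnet wholesale; for example:

# CentOS
sudo firewall-cmd --permanent --add-port={9000,9870,8088,2181,9083,10000}/tcp
sudo firewall-cmd --reload

# Ubuntu
sudo ufw allow from 192.168.1.0/24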

4.1.2 Nodes Cannot Communicate

# Check network connectivity
ping slave1
telnet slave1 22

# Check SSH
ssh hadoop@slave1 "hostname"

# Check /etc/hosts
cat /etc/hosts

4.2 Backup and Recovery

4.2.1 HDFS Snapshots

# Allow snapshots on a directory
hdfs dfsadmin -allowSnapshot /user

# Create a snapshot
hdfs dfs -createSnapshot /user my_snapshot

# Restore: there is no restoreSnapshot shell command; copy what you need
# back out of the read-only .snapshot directory
hdfs dfs -cp /user/.snapshot/my_snapshot/<path> /user/<path>

4.2.2 Hive Metastore Backup

# Dump the Hive metastore from MySQL
mysqldump -u hive -p hive_metastore > hive_metastore_backup.sql

# Restore it
mysql -u hive -p hive_metastore < hive_metastore_backup.sql
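
Routine dumps are easy to schedule; a sketch that keeps dated backups under ~/bigdata/backup (a path of our choosing; note the % escaping that cron requires):

mkdir -p ~/bigdata/backup
( crontab -l 2>/dev/null; echo '0 2 * * * mysqldump -u hive -phivepassword hive_metastore > ~/bigdata/backup/hive_metastore_$(date +\%F).sql' ) | crontab -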

5. Summary

5.1 Deployment Essentials

  1. Environment preparation: make sure all nodes have synchronized clocks, passwordless SSH, and correct hostnames
  2. Configuration consistency: configuration files must be identical across all nodes
  3. Startup order: Zookeeper first, then Hadoop, then Hive
  4. Resource allocation: tune memory, CPU, and related parameters to the actual hardware

5.2 Production Recommendations

  1. Hardware:
    • NameNode: 32 GB+ RAM, SSD storage
    • DataNode: 16 GB+ RAM, multiple HDDs
    • Network: Gigabit Ethernet or InfiniBand
  2. Security:
    • Enable Kerberos authentication
    • Configure SSL encryption
    • Enforce strict firewall rules
  3. Monitoring:
    • Deploy Prometheus + Grafana
    • Aggregate logs (ELK Stack)
    • Set up alerting

5.3 Where to Go Next

  1. Additional components:
    • Spark: in-memory compute framework
    • Flink: stream processing engine
    • HBase: NoSQL database
    • Presto/Trino: distributed SQL query engine
  2. Containerized deployment:
    • Package each component as a Docker container
    • Orchestrate with Kubernetes
  3. Managed cloud offerings:
    • AWS EMR
    • Azure HDInsight
    • Google Cloud Dataproc

This article has walked through deployment from a single node to a full cluster, covering installation, configuration, tuning, and operations. For a real deployment, adjust the parameters to your workload and validate thoroughly before going live.

Appendix

A. Command Quick Reference

# Hadoop
hdfs dfsadmin -report              # HDFS status report
hdfs dfs -ls /                     # list an HDFS directory
hdfs dfs -du -h /                  # HDFS disk usage

# YARN
yarn application -list             # list YARN applications
yarn logs -applicationId <appId>   # fetch application logs

# Hive
hive --service metastore           # start the Metastore
beeline -u jdbc:hive2://...       # connect to HiveServer2

# Zookeeper
zkCli.sh -server localhost:2181    # connect to Zookeeper
zkServer.sh status                 # Zookeeper status

B. References

  1. Apache Hadoop official documentation
  2. Apache Hive official documentation
  3. Apache Zookeeper official documentation
  4. DataX GitHub repository