Spark Cluster Deployment
Xiao Feng

SSH & JDK Setup

Network Interconnection

With a friend's help (heh), I got hold of four Alibaba Cloud servers, albeit with pretty modest specs:

No.  IP              User  Password  OS / Spec                    Role
1    8.219.xx0.46    root  —-        Ubuntu 22.04 | 2vCPU/4GiB    Master node
2    4x.236.2x.161   root  —-        Ubuntu 22.04 | 2vCPU/1GiB    Worker node
3    4x.236.15x.1x2  root  —-        Ubuntu 22.04 | 2vCPU/2GiB    Worker node
4    47.2x6.x15.x57  root  —-        Ubuntu 22.04 | 2vCPU/1GiB    Client

I wrote a shell script to create the user dase-dis on all four servers (note: this only works if the login username and password are identical on all four servers):

Install sshpass with sudo apt install sshpass, then run the script on Linux:

#!/bin/bash

# List of server IP addresses
ip_list=("xxxx" "xxxx" "xxxx" "xxxx")

# Root SSH password (assumed identical on all four servers)
ssh_password="xxxx"

# Password to set for the new dase-dis user
password="admin"

# Loop over the IP addresses
for ip in "${ip_list[@]}"
do
echo "Connecting to $ip..."

# Connect to the server and create the user
sshpass -p "$ssh_password" ssh -o StrictHostKeyChecking=no root@$ip << EOF
# Create the user and set its password
useradd -m -s /bin/bash dase-dis
echo "dase-dis:$password" | chpasswd
# Add the user to the sudo group
usermod -aG sudo dase-dis
EOF

echo "User dase-dis created on $ip with password $password"
done

Execution result:

image
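To double-check that the new account actually works everywhere, a quick loop like the following can be used; this is a minimal sketch that reuses the same ip_list placeholder and assumes password logins are still allowed:

# Log in as dase-dis on each server and print its identity and hostname
ip_list=("xxxx" "xxxx" "xxxx" "xxxx")
for ip in "${ip_list[@]}"
do
  sshpass -p "admin" ssh -o StrictHostKeyChecking=no dase-dis@$ip 'id && hostname'
done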


Passwordless SSH between the four servers

Install OpenSSH

Run on all four servers:

  • sudo apt-get install openssh-server to install the OpenSSH server

Change the hostnames

On machine 1 (the master node, No. 1 in the table above), run:

  • sudo hostnamectl set-hostname ecnu01 to change the hostname

On machine 2 (a worker node, No. 2 in the table above), run:

  • sudo hostnamectl set-hostname ecnu02

…and likewise for machines 3 and 4:

  • sudo hostnamectl set-hostname ecnu03

  • sudo hostnamectl set-hostname ecnu04

Once all four servers are done, disconnect and reconnect over SSH; you will see that the hostnames have been changed successfully.

image

Edit the hosts file

Background:

The hosts file is a plain-text file on the local machine that maps hostnames to IP addresses, bypassing DNS resolution. On Linux its format is usually:

IP-address hostname [aliases...]

It lives at /etc/hosts, and each line is one hostname-to-IP mapping. For example:

127.0.0.1   localhost
::1 localhost
192.168.1.2 example.com

Here 127.0.0.1 and ::1 map to localhost, and 192.168.1.2 maps to example.com. The hosts file lets you specify hostname-to-IP mappings by hand, which is useful for particular network setups and for testing.

Now let's edit it.

On all four machines, do the following:

  • sudo vim /etc/hosts

Append the following to the hosts file (replace the IPs with your own):

# IP-address hostname
8.219.108.46 ecnu01
47.236.20.161 ecnu02
47.236.157.142 ecnu03
47.236.115.157 ecnu04

!!! IMPORTANT !!!

When configuring cloud servers, use the machine's own internal (private) IP for itself and the public IPs for the other machines.

To find the internal IP:

image
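For reference, the internal IP can also be read directly from the command line on Ubuntu:

# Print the IPv4 addresses assigned to this machine
hostname -I
# Or inspect the interfaces in detail
ip addr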

Example hosts entries:

image
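After saving /etc/hosts, it is worth sanity-checking the name resolution from each machine; a minimal check, assuming the hostnames above:

# Should print the IP you configured for ecnu02
getent hosts ecnu02
# Should get a reply from the other node (ICMP may be blocked by the security group, so a timeout here is not necessarily fatal)
ping -c 1 ecnu02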

Copy the SSH public keys

Run the following commands on every machine in turn.

This collects the machines' SSH public keys on the master, so that the other three machines can SSH into the master without a password:

  • ssh-keygen -t rsa to generate an SSH key pair

  • ssh dase-dis@ecnu01 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/id_rsa.pub to append the public key to the master's authorized_keys

  • sudo service ssh restart && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys to restart the local SSH service and fix the permissions of the .ssh directory

image


On the master, run:

This distributes the master's authorized_keys to the other three machines, enabling passwordless SSH among all of them:

  • scp ~/.ssh/authorized_keys dase-dis@ecnu02:/home/dase-dis/.ssh/authorized_keys

  • scp ~/.ssh/authorized_keys dase-dis@ecnu03:/home/dase-dis/.ssh/authorized_keys

  • scp ~/.ssh/authorized_keys dase-dis@ecnu04:/home/dase-dis/.ssh/authorized_keys

The three commands above are equivalent to: for host in ecnu02 ecnu03 ecnu04; do scp ~/.ssh/authorized_keys dase-dis@$host:/home/dase-dis/.ssh/; done

Then, still on the master, run:

sudo service ssh restart && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys

Result:

image


Verification:

SSH back and forth between the machines and check that no password is requested:

ssh dase-dis@ecnu01
exit
ssh dase-dis@ecnu02
exit
ssh dase-dis@ecnu03
exit
ssh dase-dis@ecnu04
exit
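The same check can be done in one pass from any machine; a small sketch, assuming the four hostnames above (BatchMode makes ssh fail immediately instead of prompting if key authentication is not working):

for host in ecnu01 ecnu02 ecnu03 ecnu04
do
  # Prints the remote hostname if passwordless login works, errors out otherwise
  ssh -o BatchMode=yes dase-dis@$host hostname
done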

Disable the firewall

If you are running local virtual machines:

  • systemctl stop firewalld.service

  • systemctl disable firewalld.service

If you are on cloud servers:

Make sure you know what you are doing; disabling the firewall (opening all ports) may get your server compromised.

image

Set up the Java environment

Configure this on all four machines.

You may need to log in on the Oracle website, find the actual download link manually, and then download it with wget.

  • Download: wget https://download.oracle.com/otn/java/jdk/8u202-b08/1961070e4c9b4e26a04e7f5a083f551e/jdk-8u202-linux-x64.tar.gz

  • Extract: tar -zxvf jdk-8u202-linux-x64.tar.gz

  • Configure the environment variables: sudo vi /etc/profile

  • Add the following:

# Adjust the paths to your own installation
export JAVA_HOME=/home/dase-dis/jdk1.8.0_202
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

  • Reload it: source /etc/profile

  • Verify: java -version

image
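Once all four machines are configured, the Java setup can be checked from one machine in a single loop; a minimal sketch, assuming passwordless SSH from the previous section and the same /etc/profile edits on every node:

for host in ecnu01 ecnu02 ecnu03 ecnu04
do
  echo "== $host =="
  # /etc/profile is not read by non-interactive shells, so source it explicitly
  ssh dase-dis@$host 'source /etc/profile && java -version'
done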


Hadoop 2.x Deployment

Download

URL: https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz

On the master node:

  • Download: wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz

  • Extract: tar -zxvf hadoop-2.10.1.tar.gz

image

  • Enter the directory: cd ~/hadoop-2.10.1/

  • Check the downloaded version: ./bin/hadoop version

image

Edit the configuration

Edit slaves

On the master node:

  • Edit the slaves file: vim ~/hadoop-2.10.1/etc/hadoop/slaves

Change it to:

ecnu02
ecnu03

Edit core-site.xml

  • Edit core-site.xml: vim ~/hadoop-2.10.1/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/dase-dis/hadoop-2.10.1/tmp</value>
  </property>
  <!-- use the master node's hostname below -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ecnu01:9000</value>
  </property>
</configuration>

image

Edit hdfs-site.xml

  • Edit hdfs-site.xml: vim ~/hadoop-2.10.1/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/dase-dis/hadoop-2.10.1/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/dase-dis/hadoop-2.10.1/tmp/dfs/data</value>
  </property>
</configuration>

Edit hadoop-env.sh

  • Edit hadoop-env.sh: vim ~/hadoop-2.10.1/etc/hadoop/hadoop-env.sh

  • Change JAVA_HOME to:

export JAVA_HOME=/home/dase-dis/jdk1.8.0_202

image

Copy the installation to the other nodes

All right, the edits are finally done. Next, copy this configured Hadoop directory to the other three machines:

  • Copy to worker node 1: scp -r /home/dase-dis/hadoop-2.10.1 dase-dis@ecnu02:/home/dase-dis/

  • Copy to worker node 2: scp -r /home/dase-dis/hadoop-2.10.1 dase-dis@ecnu03:/home/dase-dis/

  • Copy to the client: scp -r /home/dase-dis/hadoop-2.10.1 dase-dis@ecnu04:/home/dase-dis/

Packing it into an archive first and copying that would actually be better; I was lazy here (a sketch follows the screenshot below).

image
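A rough sketch of the tar-based variant mentioned above (run on the master; the archive name is arbitrary):

# Pack the configured Hadoop directory once
cd ~ && tar -zcf hadoop-configured.tar.gz hadoop-2.10.1

# Copy the single archive to each node and unpack it there
for host in ecnu02 ecnu03 ecnu04
do
  scp ~/hadoop-configured.tar.gz dase-dis@$host:~/
  ssh dase-dis@$host 'tar -zxf ~/hadoop-configured.tar.gz -C ~/'
done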

Start the HDFS service

Format the NameNode

Note: the NameNode only needs to be formatted the first time HDFS is started. If you are merely restarting HDFS, skip this step and go straight to the next one.
Also, before formatting the NameNode, delete the ~/hadoop-2.10.1/tmp/ directory if it already exists, then run the format command below.

If startup blows up, you hit CTRL+C, or the power goes out, see the troubleshooting section later in this article; you may still need to reformat.

  • Format command: ~/hadoop-2.10.1/bin/hdfs namenode -format

image


Start

  • Start HDFS: ~/hadoop-2.10.1/sbin/start-dfs.sh

image


Verify

  • Verify: jps

  • Master node

image

  • Worker nodes

image


Open http://<master-node-IP>:50070/ in a browser (if the master is a cloud server, remember to open the port in the security group).

Opening the port:

image

The cluster is working normally:

image

Node information:

image

Troubleshooting cluster problems

  1. If for some reason the cluster fails to come up the first time, do the following on the master and worker nodes (a script sketch collecting these steps appears after this list):
  • On the master, stop the cluster: ~/hadoop-2.10.1/sbin/stop-dfs.sh

  • Delete the runtime files: cd ~/hadoop-2.10.1/tmp/dfs && rm -rf *

  • Delete the logs: cd ~/hadoop-2.10.1/logs && rm -rf *

  • Free up occupied ports: sudo reboot

  • On the master, re-run the format command: ~/hadoop-2.10.1/bin/hdfs namenode -format


  2. Errors that may appear on cloud servers
  • Error log:

image

  • A bind error is reported, such as 2024-04-30 16:06:52,547 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.net.BindException: Problem binding to [ecnu01:9000] java.net.BindException: Cannot assign requested address; For more details see: http://wiki.apache.org/hadoop/BindException

  • Check that the hosts settings described in the SSH & JDK Setup section (article spark-1) are correct; with them set up properly this error does not occur

  • References:

  1. https://blog.csdn.net/xiaosa5211234554321/article/details/119627974

  2. https://cwiki.apache.org/confluence/display/HADOOP2/BindException
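For convenience, the recovery steps from item 1 can be collected into one small script run from the master; a rough sketch, assuming the paths above and passwordless SSH to the workers (warning: it wipes all HDFS data):

#!/bin/bash
# Stop the cluster
~/hadoop-2.10.1/sbin/stop-dfs.sh

# Remove runtime files and logs on the master and both workers
for host in ecnu01 ecnu02 ecnu03
do
  ssh dase-dis@$host 'rm -rf ~/hadoop-2.10.1/tmp/dfs/* ~/hadoop-2.10.1/logs/*'
done

# If ports are still occupied, reboot the machines (sudo reboot) first, then:
~/hadoop-2.10.1/bin/hdfs namenode -format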

HDFS Shell

Note: the first time you use HDFS, you need to create a user directory in HDFS first.

  • Go to the working directory: cd ~/hadoop-2.10.1

  • Create a home directory for the current dase-dis user: ./bin/hdfs dfs -mkdir -p /user/dase-dis

Examples of HDFS shell directory operations (a small file example follows the list):

  • List the files under hdfs:///user/dase-dis: ./bin/hdfs dfs -ls /user/dase-dis

  • Create the hdfs:///user/dase-dis/input directory: ./bin/hdfs dfs -mkdir /user/dase-dis/input

  • Delete the hdfs:///user/dase-dis/input directory: ./bin/hdfs dfs -rm -r /user/dase-dis/input

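A small end-to-end file example using the same commands (run from ~/hadoop-2.10.1; the uploaded file is just an arbitrary local config file):

# Upload a local file into HDFS, list it, and print it back
./bin/hdfs dfs -mkdir -p /user/dase-dis/input
./bin/hdfs dfs -put etc/hadoop/core-site.xml /user/dase-dis/input/
./bin/hdfs dfs -ls /user/dase-dis/input
./bin/hdfs dfs -cat /user/dase-dis/input/core-site.xml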

Spark Deployment

Edit the configuration files

Edit .bashrc

On the client, run:

  • vi ~/.bashrc

  • Press i to enter insert mode, move to the last line of the file with the arrow keys, and add export TERM=xterm-color

image

  • Press Esc to leave insert mode, then type :wq to save and quit

  • Apply the .bashrc changes: source ~/.bashrc

Download Spark

On the master node:

  • Start the HDFS service (skip this if it is already running): ~/hadoop-2.10.1/sbin/start-dfs.sh

  • Download the Spark package: wget http://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-without-hadoop.tgz

  • Extract it: tar -zxvf spark-2.4.7-bin-without-hadoop.tgz

  • Rename it: mv ~/spark-2.4.7-bin-without-hadoop ~/spark-2.4.7

After these steps:
image

Edit the configuration

Make the following changes on the master node:

spark-env

  • cp /home/dase-dis/spark-2.4.7/conf/spark-env.sh.template /home/dase-dis/spark-2.4.7/conf/spark-env.sh

  • vi /home/dase-dis/spark-2.4.7/conf/spark-env.sh

Set it to:

# This is the "Hadoop free" build of Spark, so the Hadoop path has to be configured
export HADOOP_HOME=/home/dase-dis/hadoop-2.10.1
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

export SPARK_MASTER_HOST=ecnu01 # master node hostname
export SPARK_MASTER_PORT=7077   # master port

image

slaves

  • cp spark-2.4.7/conf/slaves.template spark-2.4.7/conf/slaves

  • vi spark-2.4.7/conf/slaves

Set it to:

# localhost
ecnu02
ecnu03

image

spark-defaults

  • cp spark-2.4.7/conf/spark-defaults.conf.template spark-2.4.7/conf/spark-defaults.conf

  • vi spark-2.4.7/conf/spark-defaults.conf

Set it to:

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://ecnu01:9000/tmp/spark_history
spark.history.fs.logDirectory=hdfs://ecnu01:9000/tmp/spark_history

image

spark-config

  • vi spark-2.4.7/sbin/spark-config.sh

Append:

export JAVA_HOME=/home/dase-dis/jdk1.8.0_202

image

Install Spark on the other nodes

Copy

This step copies the configured Spark directory to the other three machines:

  • scp -r spark-2.4.7 dase-dis@ecnu02:~/

  • scp -r spark-2.4.7 dase-dis@ecnu03:~/

  • scp -r spark-2.4.7 dase-dis@ecnu04:~/

Create the log directory in HDFS

  • ~/hadoop-2.10.1/bin/hdfs dfs -mkdir -p /tmp/spark_history

Start Spark

After all that work, we finally get to start it up.
image

On the master node, run:

  • Start Spark: ~/spark-2.4.7/sbin/start-all.sh

  • Start the history server: ~/spark-2.4.7/sbin/start-history-server.sh

  • Master node:

image

  • Worker nodes:

image

Error handling:

dase-dis@ecnu01:~$ ~/spark-2.4.7/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/dase-dis/spark-2.4.7/logs/spark-dase-dis-org.apache.spark.deploy.master.Master-1-ecnu01.out
ecnu02: starting org.apache.spark.deploy.worker.Worker, logging to /home/dase-dis/spark-2.4.7/logs/spark-dase-dis-org.apache.spark.deploy.worker.Worker-1-ecnu02.out
ecnu03: starting org.apache.spark.deploy.worker.Worker, logging to /home/dase-dis/spark-2.4.7/logs/spark-dase-dis-org.apache.spark.deploy.worker.Worker-1-ecnu03.out
ecnu02: failed to launch: nice -n 0 /home/dase-dis/spark-2.4.7/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://172.19.39.254:7077
ecnu02: at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
ecnu02: at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
ecnu02: at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
ecnu02: at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
ecnu02: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
ecnu02: at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
ecnu02: at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
ecnu02: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
ecnu02: at java.lang.Thread.run(Thread.java:748)
ecnu02: 24/05/02 10:23:10 INFO util.ShutdownHookManager: Shutdown hook called
ecnu02: full log in /home/dase-dis/spark-2.4.7/logs/spark-dase-dis-org.apache.spark.deploy.worker.Worker-1-ecnu02.out

If this happens, check the hosts file settings; see the SSH & JDK Setup section above (article [Big Data] Spark-1).

Verify

Open http://<master-node-IP>:8080/ in a browser (if the master is a cloud server, remember to open the port in the security group).

image

You can see two workers online. Done!

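Before moving on to a custom application, the bundled SparkPi example makes a quick smoke test; a hedged sketch (the examples jar ships with the binary distribution, but its exact filename may differ in your copy):

# Should print a line like "Pi is roughly 3.14..." near the end of the driver output
~/spark-2.4.7/bin/spark-submit \
  --master spark://ecnu01:7077 \
  --class org.apache.spark.examples.SparkPi \
  ~/spark-2.4.7/examples/jars/spark-examples_2.11-2.4.7.jar 100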
Run a Spark application

Now that it is finally working, let's play with it:

Create a directory & upload a file

  • Create the spark_input directory: ~/hadoop-2.10.1/bin/hdfs dfs -mkdir -p spark_input

  • Upload the file RELEASE to spark_input: ~/hadoop-2.10.1/bin/hdfs dfs -put ~/spark-2.4.7/RELEASE spark_input/

On the Hadoop web UI you can see that the file RELEASE is stored on two nodes:

image

Start the Spark Shell

  • Start spark-shell: ~/spark-2.4.7/bin/spark-shell --master spark://ecnu01:7077

image

  • Type in the following Scala code:
val sc = spark.sparkContext
val textFile = sc.textFile("hdfs://ecnu01:9000/user/dase-dis/spark_input/RELEASE")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)

Shell output:

image

The running job's information can also be seen in the web UI:

image

At this point the Spark cluster has been set up successfully.

Stop the cluster

If you want to stop the cluster:

  • Stop Spark: ~/spark-2.4.7/sbin/stop-all.sh

  • Stop the Spark history server: ~/spark-2.4.7/sbin/stop-history-server.sh

  • Stop the HDFS service: ~/hadoop-2.10.1/sbin/stop-dfs.sh


Test Run

The source code of the classic WordCount program is as follows:

package cn.edu.ecnu.spark.example.java.wordcount;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class WordCount {
  public static void run(String[] args) {
    /* Step 1: set the configuration via SparkConf and create the SparkContext */
    SparkConf conf = new SparkConf();
    conf.setAppName("WordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    /* Step 2: build the DAG with operators according to the application logic,
       including RDD creation, transformations and actions */
    // Read the text data and create an RDD named lines
    JavaRDD<String> lines = sc.textFile(args[0]);

    // Split each text line in lines into individual words by spaces
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterator<String> call(String line) throws Exception {
        return Arrays.asList(line.split(" ")).iterator();
      }
    });
    // Set the count of each word to 1, i.e. map each word to [word, 1]
    JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, 1);
      }
    });
    // Group by word and sum up the counts of identical words
    JavaPairRDD<String, Integer> wordCounts = pairs.groupByKey().mapToPair(
        new PairFunction<Tuple2<String, Iterable<Integer>>, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(Tuple2<String, Iterable<Integer>> t) throws Exception {
            Integer sum = Integer.valueOf(0);
            for (Integer i : t._2) {
              sum += i;
            }
            return new Tuple2<String, Integer>(t._1, sum);
          }
        });
    // Alternative: combine the counts with reduceByKey
    /*
    JavaPairRDD<String, Integer> wordCounts =
        pairs.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
              @Override
              public Integer call(Integer t1, Integer t2) throws Exception {
                return t1 + t2;
              }
            });
    */

    // Write out the word count results
    wordCounts.saveAsTextFile(args[1]);

    /* Step 3: stop the SparkContext */
    sc.stop();
  }

  public static void main(String[] args) {
    run(args);
  }
}

Create a Maven project

  • Create a new project in IDEA:

image

  • pom.xml contents:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>spark-wordcount</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <spark.version>2.4.7</spark.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>

</project>
  • Reload the Maven dependencies

Create the Java code

  • Create the package cn.edu.ecnu.spark.example.java.wordcount and the class WordCount:

image

Package

  • Package it as a .jar:

image

image
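If you prefer the command line to IDEA's artifact builder, the jar can also be built with Maven; a sketch assuming mvn is installed (the jar then lands under target/ with a name derived from the artifactId and version, so adjust the paths below accordingly):

# Run in the project root, next to pom.xml
mvn clean package -DskipTests
# Produces target/spark-wordcount-1.0-SNAPSHOT.jar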

Transfer to the client

  • Transfer the packaged .jar (located at <project path>\out\artifacts\spark_wordcount_jar) to /home/dase-dis/spark-2.4.7/myapp on the client

image

Download test data

  • Download: wget https://github.com/ymcui/Chinese-Cloze-RC/archive/master.zip

  • Unzip: unzip master.zip

  • Unzip again: unzip ~/Chinese-Cloze-RC-master/people_daily/pd.zip

  • Copy it into the cluster: ~/hadoop-2.10.1/bin/hdfs dfs -put ~/Chinese-Cloze-RC-master/people_daily/pd/pd.test spark_input/pd.test

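Before submitting, you can confirm that both files made it into HDFS:

# Should list RELEASE and pd.test
~/hadoop-2.10.1/bin/hdfs dfs -ls spark_input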
Submit the jar job

  • Delete the output directory if it exists: ~/hadoop-2.10.1/bin/hdfs dfs -rm -r spark_output

  • Submit the job:

~/spark-2.4.7/bin/spark-submit \
  --master spark://ecnu01:7077 \
  --class cn.edu.ecnu.spark.example.java.wordcount.WordCount \
  /home/dase-dis/spark-2.4.7/myapp/spark-wordcount.jar \
  hdfs://ecnu01:9000/user/dase-dis/spark_input \
  hdfs://ecnu01:9000/user/dase-dis/spark_output

Running:

image

Finished successfully.

SSH:

image

Web UI:

image

View the output directory:

image

View the results in part-00001:

  • ~/hadoop-2.10.1/bin/hdfs dfs -cat /user/dase-dis/spark_output/part-00001

image
