Building Kylin Cubes with Spark: A Pitfall Diary

Preface

Over the past few days I hit quite a few pitfalls while building Cubes with Spark in Kylin, mostly because I am not very familiar with Spark and YARN. Everything ran almost immediately in the local sandbox, but in the test environment (which had no Spark on YARN) things kept going wrong.

Environment

  • Hadoop 2.6.0-cdh5.4.4 (JDK 1.7)
  • Spark 2.3.2 (bundled with Kylin, JDK 1.8)
  • Kylin 2.6.0

Spark Build Configuration Steps

Configure Spark

Configure Spark in kylin.properties:

kylin.env.hadoop-conf-dir=/home/kylin/software/hadoop/etc/hadoop
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
#kylin.engine.spark-conf.spark.executor.instances=1
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf.spark.eventLog.dir=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.yarn.archive=hdfs:///kylin/spark/spark-libs.jar
kylin.engine.spark.additional-jars=/home/kylin/software/kylin/lib/kylin-job-2.6.0.jar
kylin.engine.spark-conf.spark.executorEnv.JAVA_HOME=/opt/soft/jdk/jdk1.8.0_66
kylin.engine.spark-conf.spark.yarn.appMasterEnv.JAVA_HOME=/opt/soft/jdk/jdk1.8.0_66
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:MaxTenuringThreshold=15
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-XX:MaxTenuringThreshold=15

Resolving the JDK Version Mismatch Between Hadoop and Spark

Note that the last four lines of the Spark configuration above work around the mismatch between the Hadoop cluster's JDK 1.7 and Spark's JDK 1.8. The spark-2.3.2-yarn-shuffle.jar used later also needs to be a JDK 1.7 build; the one shipped in the spark directory is built for 1.8.
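
A quick way to confirm which JDK a jar was built for is to check its class-file major version (52 = JDK 1.8, 51 = JDK 1.7). A minimal sketch, assuming a JDK's javap is on the PATH and using the shuffle jar as an example:

$ javap -verbose -cp spark-2.3.2-yarn-shuffle.jar org.apache.spark.network.yarn.YarnShuffleService | grep "major version"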

Upload the Spark Jars to HDFS

Package the Spark jars and upload them to HDFS:

$ jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
$ hadoop fs -mkdir -p /kylin/spark/
$ hadoop fs -put spark-libs.jar /kylin/spark/
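
Optionally, verify the upload; the path has to match the spark.yarn.archive setting configured above:

$ hadoop fs -ls /kylin/spark/spark-libs.jar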

(The official tutorial stops here; everything that follows was learned the hard way.)

Copy the Dependency Jars

The engine-spark module in the Kylin source declares some dependencies as provided in its pom.xml, but they are not on the CLASSPATH once the Kylin server starts. The simple fix is to copy the missing jars directly into $KYLIN_HOME/tomcat/lib:

$ cp $KYLIN_HOME/spark/jars/spark-core_2.11-2.1.2.jar $KYLIN_HOME/tomcat/lib 
$ cp $KYLIN_HOME/spark/jars/scala-library-2.11.8.jar $KYLIN_HOME/tomcat/lib

Change HDFS Directory Permissions

Change the permissions on the HDFS directory /kylin/spark-history:

$ hadoop fs -chmod 777 /kylin/spark-history

Spark on YARN Configuration

Edit yarn-site.xml:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

Copy the prepared spark-2.3.2-yarn-shuffle.jar into $HADOOP_HOME/share/hadoop/common and restart YARN.
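
For reference, a sketch of that step (the jar's source path is an example; the copy and restart have to be done on every NodeManager node):

$ cp spark-2.3.2-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/common/
$ $HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$ $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager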

Start Kylin

Start Kylin, and you can now build Cubes with Spark.
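
For reference, Kylin is started (or restarted after configuration changes) with its control script:

$ $KYLIN_HOME/bin/kylin.sh start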

Problems Encountered and Solutions

Problem 1: Error after clicking Build for a Spark Cube build (test environment)

Caused by: java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/Function
at org.apache.kylin.engine.spark.SparkBatchCubingJobBuilder2.<init>(SparkBatchCubingJobBuilder2.java:53) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
at org.apache.kylin.engine.spark.SparkBatchCubingEngine2.createBatchCubingJob(SparkBatchCubingEngine2.java:44) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
at org.apache.kylin.engine.EngineFactory.createBatchCubingJob(EngineFactory.java:60) ~[kylin-core-job-2.6.3.jar:2.6.3]
at org.apache.kylin.rest.service.JobService.submitJobInternal(JobService.java:234) ~[kylin-server-base-2.6.3.jar:2.6.3]
at org.apache.kylin.rest.service.JobService.submitJob(JobService.java:202) ~[kylin-server-base-2.6.3.jar:2.6.3]
at org.apache.kylin.rest.controller.CubeController.buildInternal(CubeController.java:395) ~[kylin-server-base-2.6.3.jar:2.6.3]
... 77 more

As noted in the configuration steps above, the engine-spark module in the Kylin source declares some dependencies as provided in its pom.xml, but they are not on the CLASSPATH once the Kylin server starts, so the simple fix is to copy the missing jars directly into $KYLIN_HOME/tomcat/lib:

$ cp $KYLIN_HOME/spark/jars/spark-core_2.11-2.1.2.jar $KYLIN_HOME/tomcat/lib 
$ cp $KYLIN_HOME/spark/jars/scala-library-2.11.8.jar $KYLIN_HOME/tomcat/lib

Restart Kylin for the change to take effect.

Problem 2: Cube build error (local sandbox)

19/10/14 06:00:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2250 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (4096+1024 MB) is above the max threshold (2250 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Lower spark.executor.memory to a size that fits the cluster's maximum container allocation.
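
For example (values are illustrative only), with the 2250 MB cap above, the executor memory plus the memory overhead has to fit under the container limit:

kylin.engine.spark-conf.spark.executor.memory=1536M
kylin.engine.spark-conf.spark.executor.memoryOverhead=512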

Problem 3: Spark job stuck in Running (local sandbox)

The Spark job could be submitted normally but then stayed in the Running state. In the web UI no executors had started for the job, and the container logs showed:

WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

The warning points to insufficient resources. At first the queue showed vCores stuck at 0, so I assumed the cores were misconfigured and wasted quite a bit of time on that. In the end, increasing yarn.nodemanager.resource.memory-mb got Spark running normally.
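
For example, in yarn-site.xml on the sandbox (the value is illustrative; size it to the host's actual memory, and keep yarn.scheduler.maximum-allocation-mb in line with it):

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>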

Problem 4: Error after submitting the Spark job (test environment)

The Spark job failed whether submitted by Kylin or by hand, and the Spark logs showed nothing obviously wrong.

Check the logs with the yarn command:

$ yarn logs -applicationId APP_ID

The log output:

$ yarn logs -applicationId application_1570798822519_0069 


Container: container_1570798822519_0069_02_000001 on bjm6-14-67.58os.org_40428
================================================================================
LogType:stderr
Log Upload Time:Mon Oct 21 14:56:23 +0800 2019
LogLength:1998
Log Contents:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/network/sasl/SecretKeyHolder : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
at java.lang.Class.getMethod0(Class.java:2764)
at java.lang.Class.getMethod(Class.java:1653)
at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

LogType:stdout
Log Upload Time:Mon Oct 21 14:56:23 +0800 2019
LogLength:0

Checking hadoop-env.sh revealed that JAVA_HOME pointed to JDK 1.7, while Spark 2.3 runs on JDK 1.8.

This can be fixed by setting the following parameters (the last four lines of the Spark configuration shown earlier):

kylin.engine.spark-conf.spark.executorEnv.JAVA_HOME=/opt/soft/jdk/jdk1.8.0_66
kylin.engine.spark-conf.spark.yarn.appMasterEnv.JAVA_HOME=/opt/soft/jdk/jdk1.8.0_66
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:MaxTenuringThreshold=15
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-XX:MaxTenuringThreshold=15

Problem 5: Error while the Spark job was running (test environment)

The Spark job failed at runtime (at first I assumed the error happened while writing files near the end of the run, but in fact the error is what stopped the job):

User class threw exception: java.lang.RuntimeException: error execute org.apache.kylin.engine.spark.SparkFactDistinct. Root cause: Permission denied: user=kylin, access=WRITE, inode="/home/kylin/kylin_2.6.0_metadata/kylin-153e81be-1c4e-0566-e2ba-3c67a2d95fb7/***":***:kylin:drwxr-xr-x

The Spark job is submitted as the user kylin, and the kylin user has no write permission on the cube metadata path in HDFS, so the submitting user has to be changed.

Adding --conf spark.proxy-user=*** did not work; the user was still kylin. Switching to export HADOOP_PROXY_USER=*** did the trick. (Longer term this should be handled in the source code, picking the user from the logged-in user.)
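
A minimal sketch of that workaround (the actual user name is site-specific and kept as *** here): export the variable in the shell that starts Kylin, so the Kylin server process inherits it when it submits Spark jobs.

$ export HADOOP_PROXY_USER=***
$ $KYLIN_HOME/bin/kylin.sh start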

After that change, another error appeared:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=***, access=WRITE, inode="/kylin/spark-history":kylin:hadoop:drwxr-xr-x

This is because the submitting user is now hdp_fin_ba, but other users have no write permission on the /kylin/spark-history directory.

Fixing the permissions on the HDFS directory solves it:

$ hadoop fs -chmod 777 /kylin/spark-history

Problem 6: Spark job stuck in Running (test environment)

After problem 5 the permission errors were finally gone, but the Spark job stayed in the Running state and no executors were started. In the end I had to kill the job from the Spark UI.

Then checking the application's logs with yarn logs -applicationId APP_ID showed:

2019-10-22 15:21:37 ERROR YarnAllocator:91 - Failed to launch executor 1 on container container_1571725794420_0001_03_000002
org.apache.spark.SparkException: Exception while starting container container_1571725794420_0001_03_000002 on host bjm6-14-67.58os.org
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:534)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:205)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:122)
... 5 more

This was because Spark on YARN had not been configured, so the spark_shuffle aux service did not exist.

Edit yarn-site.xml:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

Starting the NodeManager then failed with:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.spark.network.yarn.YarnShuffleService not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2112)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2136)
... 10 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.spark.network.yarn.YarnShuffleService not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
... 11 more

Copy spark-2.3.2-yarn-shuffle.jar from spark/yarn/ into $HADOOP_HOME/share/hadoop/common, then restart again.

Restarting the NodeManager failed with:

2019-10-22 16:57:01,751 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager main: Error starting NodeManager
java.lang.UnsupportedClassVersionError: org/apache/spark/network/yarn/YarnShuffleService : Unsupported major.minor version 52.0

The yarn-shuffle jar under the spark directory was built with a JDK that does not match the Hadoop cluster's. I got a JDK 1.7 build from the Spark team, which solved the problem.

References:

  1. kylin-2.6.3: Building Cubes with Spark

  2. WARN YarnClusterScheduler: not accepted any resources

  3. Week 41 of 2018 - SparkSQL setup and configuration