Spark 2.1 cluster installation (standalone mode)

Machine deployment

  Prepare three Linux servers and install JDK 1.7 on each of them

Download the Spark installation package

  Upload the spark-2.1.0-bin-hadoop2.6.tgz installation package to the Linux server (intsmaze-131)

  Unzip the installation package to the specified location: tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz -C /home/hadoop/app/spark2.0/

Configure Spark

  Go to the Spark installation directory

  cd /home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/conf

  mv spark-env.sh.template spark-env.sh

  vi spark-env.sh

  Add the following configuration to the configuration file

export JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
export SPARK_MASTER_IP=intsmaze-131 (specifies which node acts as the Master in standalone mode)
export SPARK_MASTER_PORT=7077 

  mv slaves.template slaves

  vi slaves

  List the child (Worker) nodes in the file

intsmaze-131
intsmaze-132
intsmaze-134

  Copy the configured Spark directory to the other nodes. Note that the path on every node must be the same as on the master; when the master starts the cluster it launches each Worker from the corresponding directory on that node, and a mismatched path will produce a "No such file or directory" error.

scp -r spark-2.1.0-bin-hadoop2.6/ intsmaze-132:/home/hadoop/app/spark2.0/
scp -r spark-2.1.0-bin-hadoop2.6/ intsmaze-134:/home/hadoop/app/spark2.0/

  After the Spark cluster is configured, it currently has 1 Master and 3 Workers. Start the Spark cluster on intsmaze-131 (the master node):

/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/sbin/start-all.sh (A standalone Spark cluster is independent of Hadoop and does not require the Hadoop cluster to be running. If Spark is configured as Spark on YARN, however, the YARN cluster must be started.)

  Run the jps command after startup: the master node has both Master and Worker processes, and the other child nodes each have a Worker process. Open the Spark web UI to view the cluster status (on the master node): http://intsmaze-131:8080/

Execute the first spark program

/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
 --class org.apache.spark.examples.SparkPi \
 --master spark://intsmaze-131:7077 \
 --executor-memory 1G \
 --total-executor-cores 2 \
 /home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.1.0.jar \
 100

This example uses the Monte Carlo method to estimate the value of Pi.
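
The idea behind the example (a minimal sketch in the spark shell, not the exact source of SparkPi) is to scatter random points over a square and count how many land inside the inscribed circle; the ratio approximates Pi/4:

// Minimal sketch of Monte Carlo Pi estimation; assumes an existing SparkContext `sc`
// (e.g. inside spark-shell). Not the exact SparkPi source.
val slices = 100                       // plays the role of the "100" argument above
val n = 100000 * slices                // total number of random points
val count = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1          // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0     // 1 if the point falls inside the unit circle
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")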

Start Spark Shell

  spark-shell is an interactive shell that comes with Spark and makes interactive programming convenient: users can write Spark programs in Scala directly at the command line.

/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
 --master spark://intsmaze-131:7077 \
 --executor-memory 2g \
 --total-executor-cores 2

Parameter Description:

--master spark://intsmaze-131:7077 specifies the address of the Master

--executor-memory 2g specifies that each worker gets 2 GB of available memory; on the existing cluster a task will not be able to start with this value, so it should be changed to 512m.

--total-executor-cores 2 specifies that the whole task uses 2 CPU cores in total.

  Note: If the resources requested for the task cannot be satisfied, the job will not start. For example, if a server node has only 1 GB of memory and you set 2 GB per worker, the task will not be able to start and the log will show: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
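
The same resource limits can also be set in application code instead of on the command line. A minimal sketch, assuming a standalone application rather than spark-shell (the application name and the values below are illustrative, not from the original):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting the resource limits programmatically instead of via
// --executor-memory / --total-executor-cores. Values are illustrative and
// must fit what the actual workers can provide.
val conf = new SparkConf()
  .setAppName("resource-demo")                 // hypothetical application name
  .setMaster("spark://intsmaze-131:7077")
  .set("spark.executor.memory", "512m")        // per-executor memory, keep below the worker's memory
  .set("spark.cores.max", "2")                 // total cores for the whole application
val sc = new SparkContext(conf)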

  If you start the spark shell without specifying a master address, the shell still starts and runs programs normally, but it is actually running in Spark's local mode: only a single process is started on the local machine, and no connection to the cluster is made. In the Spark shell, the SparkContext has already been initialized as the object sc by default, so user code can simply use sc directly.
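
For example, once the shell is up (whether in local mode or connected to the cluster), sc can be used right away; a small illustrative check:

// Inside spark-shell: sc already exists, no import or new SparkContext is needed.
val rdd = sc.parallelize(1 to 100)
rdd.sum()        // res0: Double = 5050.0
sc.master        // shows which master the shell is connected to, e.g. local[*] or spark://intsmaze-131:7077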

Write WordCount program in spark shell

1. First start HDFS

2. Upload a file to HDFS, e.g. hdfs://intsmaze-131:9000/words.txt

3. Write the Spark program in Scala in the spark shell

sc.textFile("hdfs://192.168.19.131:9000/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://192.168.19.131:9000/out")

4. Use the hdfs command to view the results

hdfs dfs -ls hdfs://intsmaze-131:9000/out/p*

Description:

sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://intsmaze-131:9000/words.txt") reads the data from HDFS
flatMap(_.split(" ")) splits each line into words and flattens the result (a map followed by a flatten)
map((_,1)) turns each word into a (word, 1) tuple
reduceByKey(_+_) groups by key and sums the values
saveAsTextFile("hdfs://intsmaze-131:9000/out") writes the result to HDFS
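
The same pipeline can also be written step by step in the shell to inspect each intermediate RDD (a sketch using the same input path; the /out2 output directory is hypothetical, chosen to avoid overwriting /out):

// Step-by-step version of the one-liner above; run inside spark-shell.
val lines  = sc.textFile("hdfs://intsmaze-131:9000/words.txt")    // one element per line
val words  = lines.flatMap(_.split(" "))                          // split each line into words
val pairs  = words.map((_, 1))                                    // (word, 1) tuples
val counts = pairs.reduceByKey(_ + _)                             // sum the 1s per word
counts.collect().foreach(println)                                 // bring the results to the driver and print them
counts.saveAsTextFile("hdfs://intsmaze-131:9000/out2")            // hypothetical output directory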

Configure Spark's high availability

So far the Spark cluster is installed, but there is a big problem: the Master node is a single point of failure. To solve this, you must use ZooKeeper and start at least two Master nodes to achieve high availability. The configuration is straightforward:

  Spark cluster planning: intsmaze-131 and intsmaze-132 are Masters; intsmaze-131, intsmaze-132, and intsmaze-134 are Workers

  Install, configure, and start the ZooKeeper cluster

  Stop all Spark services, modify the configuration file spark-env.sh, delete SPARK_MASTER_IP from it, and add the following configuration

  export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1,zk2,zk3 -Dspark.deploy.zookeeper.dir=/spark"

  Execute the sbin/start-all.sh script on intsmaze-131, and then execute sbin/start-master.sh on intsmaze-132 to start the second Master