Prepare three Linux servers and install JDK1.7
Upload the spark-2.1.0-bin-hadoop2.6.tgz installation package to Linux (intsmaze-131)
Unzip the installation package to the target location: tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz -C /home/hadoop/app/spark2.0/
Go to the conf directory under the Spark installation directory
mv spark-env.sh.template spark-env.sh
Add the following configuration to the configuration file
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
export SPARK_MASTER_IP=intsmaze-131 (specifies which node is the Master in standalone mode)
export SPARK_MASTER_PORT=7077
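If you prefer to script this step, the rename and the three exports can be applied non-interactively. The sketch below runs against a temporary directory so it can be tried anywhere; point SPARK_CONF at your real conf directory instead (the paths in the heredoc are the ones assumed in this walkthrough):

```shell
# Sketch: append the standalone-mode settings to spark-env.sh.
# Uses a temporary directory as a stand-in; substitute your real conf dir,
# e.g. /home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/conf
SPARK_CONF=$(mktemp -d)
touch "$SPARK_CONF/spark-env.sh.template"    # stands in for the shipped template
cp "$SPARK_CONF/spark-env.sh.template" "$SPARK_CONF/spark-env.sh"
cat >> "$SPARK_CONF/spark-env.sh" <<'EOF'
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
export SPARK_MASTER_IP=intsmaze-131
export SPARK_MASTER_PORT=7077
EOF
grep -c '^export' "$SPARK_CONF/spark-env.sh"   # prints 3
```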
mv slaves.template slaves
Add the Worker (child) nodes to the file, one hostname per line
intsmaze-131
intsmaze-132
intsmaze-134
Copy the configured Spark to the other nodes. Note that the path on each node must be the same as on the master; otherwise, when the master starts the cluster it will try to launch the Worker from the corresponding directory on each node, and a mismatch will report No such file or directory.
scp -r spark-2.1.0-bin-hadoop2.6 intsmaze-132:/home/hadoop/app/spark2.0/
scp -r spark-2.1.0-bin-hadoop2.6 intsmaze-134:/home/hadoop/app/spark2.0/
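With more worker nodes, the two scp commands above generalize to a loop. A minimal sketch, with the hostnames and paths taken from this walkthrough; the echo makes it a dry run, so remove it to actually copy:

```shell
# Dry-run sketch: print the scp command for each worker node.
# Remove 'echo' to perform the copy for real.
SPARK_DIR=spark-2.1.0-bin-hadoop2.6
DEST=/home/hadoop/app/spark2.0/
for host in intsmaze-132 intsmaze-134; do
  echo scp -r "$SPARK_DIR" "$host:$DEST"
done
```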
After the Spark cluster is configured, there is 1 Master and 3 Workers. Start the Spark cluster on intsmaze-131 (the master node):
/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/sbin/start-all.sh
(A standalone Spark cluster is independent of Hadoop and does not require the Hadoop cluster to be started. If Spark is configured to run on YARN, however, the YARN cluster must be started first.)
Execute the jps command after startup. The master node should show both a Master and a Worker process, and the other child nodes should each show a Worker process. Log in to the Spark web UI to view the cluster status (master node): http://intsmaze-131:8080/
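To make the post-start check repeatable, a small helper (hypothetical, not part of Spark) can grep the jps output for the expected daemon names:

```shell
# Hypothetical helper: verify that expected Spark daemons appear in jps output.
check_daemons() {
  out="$1"; shift              # $1 = captured jps output; rest = expected names
  for name in "$@"; do
    echo "$out" | grep -qw "$name" || { echo "missing: $name"; return 1; }
  done
  echo "all daemons present"
}
# On the master node: check_daemons "$(jps)" Master Worker
# On a worker node:   check_daemons "$(jps)" Worker
```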
/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://intsmaze-131:7077 \
  --executor-memory 1G \
  --total-executor-cores 2 \
  /home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.1.0.jar \
  100
This example uses the Monte Carlo method to estimate pi.
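The idea behind SparkPi can be sketched on a single machine: sample random points in the unit square and count the fraction that lands inside the quarter circle, which approaches pi/4. This is only a local illustration of the method, not the Spark job itself:

```shell
# Single-machine sketch of the Monte Carlo pi estimate:
# fraction of random points inside the quarter circle ~= pi/4.
awk 'BEGIN {
  srand(1); n = 200000; hits = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1) hits++
  }
  printf "%.3f\n", 4 * hits / n
}'
```

With 200,000 samples the printed estimate lands close to 3.14; SparkPi does the same sampling, but distributes the batches across executors.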
spark-shell is an interactive shell program that ships with Spark and is convenient for interactive programming: users can write Spark programs in Scala directly at the command line.
/home/hadoop/app/spark2.0/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
  --master spark://intsmaze-131:7077 \
  --executor-memory 2g \
  --total-executor-cores 2
--master spark://intsmaze-131:7077 Specify the address of the Master
--executor-memory 2g specifies 2G of available memory per worker; if the workers in your cluster have less memory than this, the job will not be able to start, so reduce the value (for example to 512m).
--total-executor-cores 2 specifies that the whole job uses 2 CPU cores.
Note: If the resources requested by the job cannot be satisfied, the job will not start. For example, if a server node has only 1G of memory and you request 2G per executor, the job hangs with: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If you start spark-shell without specifying a master address, the shell still starts and runs programs normally: in that case Spark runs in local mode, which starts a single process on the local machine and never contacts the cluster. In spark-shell, the SparkContext class has already been initialized as the object sc by default; user code that needs it can use sc directly.
1. First start hdfs
2. Upload a file to HDFS at hdfs://intsmaze-131:9000/words.txt
3. Write spark program in scala language in spark shell
sc.textFile("hdfs://192.168.19.131:9000/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://192.168.19.131:9000/out")
4. Use hdfs command to view the results
hdfs dfs -ls hdfs://intsmaze-131:9000/out/p*
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://intsmaze-131:9000/words.txt") reads the data from HDFS
flatMap(_.split(" ")) first maps, then flattens
map((_,1)) forms a tuple of each word and the number 1
reduceByKey(_+_) groups by key and sums the values
saveAsTextFile("hdfs://intsmaze-131:9000/out") writes the result to HDFS
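The same split/count/reduce logic can be checked without a cluster: standard shell tools reproduce the word count on a local file. This is only a local analogue of the Spark pipeline, using a throwaway sample file under /tmp:

```shell
# Local analogue of the Spark word count: split lines into words,
# then count occurrences per word (uniq -c plays the role of reduceByKey).
printf 'hello spark\nhello hdfs\n' > /tmp/words_demo.txt
tr -s ' ' '\n' < /tmp/words_demo.txt | sort | uniq -c | sort -rn
```

The output lists each distinct word with its count ("hello" appears twice), mirroring the tuples Spark writes to the out directory.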
So far, the Spark cluster is installed, but there is a big problem: the Master node is a single point of failure. To solve this, use ZooKeeper and start at least two Master nodes to achieve high availability. The configuration is straightforward:
Spark cluster plan: intsmaze-131 and intsmaze-132 are Masters; intsmaze-131, intsmaze-132, and intsmaze-134 are Workers
Install and configure a ZooKeeper cluster, and start it
Stop all Spark services, modify the configuration file spark-env.sh, delete SPARK_MASTER_IP from it, and add the following configuration
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1,zk2,zk3 -Dspark.deploy.zookeeper.dir=/spark"
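Put together, the HA-related part of conf/spark-env.sh would look roughly like the fragment below. zk1, zk2, and zk3 are placeholders for your ZooKeeper hosts (2181 is ZooKeeper's default client port):

```shell
# spark-env.sh for HA mode: SPARK_MASTER_IP removed; master recovery
# is delegated to ZooKeeper, so any configured Master can take over.
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_65
export SPARK_MASTER_PORT=7077
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"
```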
Execute the sbin/start-all.sh script on intsmaze-131, and then execute sbin/start-master.sh on intsmaze-132 to start the second Master