Misunderstandings about Spark: Spark is not the only system that computes in memory; Hadoop computes in memory too

  A common beginner's mistake is to compare Spark and Hadoop and declare that "in-memory computing" is the defining characteristic of Spark. But in the computing field, don't MySQL, Redis, the SSH framework, and virtually everything else also compute in memory? Under the von Neumann architecture, is there any program that does not run in memory, pulling data from the hard disk so the CPU can operate on it? Saying that Spark's feature is in-memory computing therefore says nothing at all. So what is Spark's real characteristic? Setting aside its execution model, it is simply this: data exchanged between multiple tasks does not have to pass through the hard disk but can flow through memory, which greatly improves execution efficiency. Hadoop's own model, by contrast, forces data communication between tasks to go through the hard disk. Note, however, that "data exchange avoids the disk" applies only to communication between tasks: Spark's shuffle still goes to disk, just as Hadoop's does.

Misunderstanding 1: Spark is an in-memory technology

  The biggest misunderstanding about Spark is that it is an in-memory technology. In fact, no Spark developer has officially claimed this; it comes from a misreading of Spark's computation process. Spark does compute in memory, but that is not a distinguishing feature; it is simply that when many experts introduce Spark, they shorten its features to "in-memory computing".

  What would an in-memory technology be? One that lets you persist data in RAM and process it there efficiently. Spark, however, has no option for truly storing data in RAM. We all know data can live in HDFS, HBase, and other systems, but Spark has no built-in persistence layer of its own, whether on disk or in memory. All it can do is cache data, and caching is not persistence: cached data can be evicted at any time and recomputed later from its lineage when it is needed again.
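  A minimal sketch of this distinction, assuming a local Spark session and a hypothetical HDFS input path, might look like the following. The point is that cache() / persist() only mark blocks as cacheable; Spark may evict them, and it simply recomputes from the lineage afterwards:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheIsNotPersistence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-demo")
      .master("local[*]")   // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an RDD and ask Spark to cache it in memory.
    val rdd = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))
      .persist(StorageLevel.MEMORY_ONLY)

    rdd.count()   // first action: computes the RDD and fills the cache
    rdd.count()   // second action: served from cached blocks, if still present

    // The cache is advisory, not durable: blocks may be evicted under
    // memory pressure, and unpersist() drops them explicitly. Either way,
    // Spark recomputes the partitions from the lineage the next time.
    rdd.unpersist()
    rdd.count()   // recomputed from the source, not read back from a "persisted" store

    spark.stop()
  }
}
```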

  Still, some people insist that Spark is a memory-based technology because Spark processes data in memory. That is of course true, but only because there is no other way to process data: operating system APIs only let you load data from a block device into memory and write results back to a block device. We cannot compute directly on an HDD; essentially all processing in modern systems happens in memory.

  Spark does let us use in-memory caching with an LRU eviction policy, but consider today's RDBMS systems, such as Oracle: how do you think they handle data? They use shared memory segments as a buffer pool for table pages, and all reads and writes go through that pool. The pool also uses LRU eviction, and modern databases serve most workloads from it. Yet we do not call Oracle a memory-based solution. Think about operating system IO as well: all IO operations go through an LRU-managed page cache.

  Does Spark really handle everything in memory? Consider the core of Spark: the shuffle, which writes data to disk. A shuffle has a map side and a reduce side. Each map task hashes records by key and writes them into separate files on the local file system, usually one file per reduce-side partition; each reduce task then pulls the relevant files from the map side and merges them into its new partition. So if your RDD has M partitions and you transform it into a PairRDD with N partitions, the shuffle phase can create M*N files, for example 1,000,000 files when M = N = 1000! There are optimization strategies (such as sort-based shuffle) that reduce the number of files created, but they do not change the fact that every shuffle operation writes data to disk!
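  A small sketch of a shuffle-triggering job, using illustrative values M = 4 and N = 3 (the file counts in the comments refer to the old hash-based shuffle described above):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTouchesDisk {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // M = 4 map-side partitions.
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)

    // reduceByKey with N = 3 reduce-side partitions forces a shuffle:
    // each map task hash-partitions its output by key and writes it to
    // local shuffle files; reduce tasks then pull and merge those files.
    // With the old hash-based shuffle that is up to M * N = 12 intermediate
    // files; sort-based shuffle writes fewer files, but still writes to disk.
    val counts = words.map((_, 1)).reduceByKey(_ + _, numPartitions = 3)

    counts.collect().foreach(println)
    spark.stop()
  }
}
```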

So the conclusion is: Spark is not a memory-based technology! It is a technology that makes effective use of in-memory LRU caching.

Misunderstanding 2: Spark is 10x-100x faster than Hadoop

  Everyone has probably seen the chart below from Spark's official website.

  The chart compares the running time of the Logistic Regression machine learning algorithm on Spark and on Hadoop, and it appears to show Spark running about a hundred times faster than Hadoop! But is that really the case? What does the core of most machine learning algorithms look like? It is iterative computation over the same dataset, and that is exactly where Spark's LRU cache shines: when you scan the same dataset many times, you only load it into memory on the first access, and every subsequent access reads it straight from memory. That is a great feature! Unfortunately, when the official benchmark ran logistic regression on Hadoop, it very likely did not use the HDFS cache at all, taking the worst case for Hadoop. Had HDFS caching been enabled, Hadoop's performance would likely have been only 3x-4x worse than Spark's, not the gap shown in the chart.
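  To make the iterative-scan pattern concrete, here is a deliberately simplified gradient-descent sketch (not the benchmark's actual code; the tiny inline dataset and learning rate are made up for illustration). The single cache() call is what turns every iteration after the first into a memory read:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // (label, feature) pairs; a stand-in for a real training set.
    val points: RDD[(Double, Double)] =
      sc.parallelize(Seq((1.0, 2.0), (0.0, -1.0), (1.0, 3.0), (0.0, -2.0)))
        .cache()   // loaded once; later iterations read from memory

    var w = 0.0
    for (_ <- 1 to 10) {
      // Each iteration scans the same cached dataset; without cache(),
      // every pass would re-read and re-parse the input.
      val gradient = points
        .map { case (y, x) => (1.0 / (1.0 + math.exp(-w * x)) - y) * x }
        .sum()
      w -= 0.1 * gradient
    }
    println(s"w = $w")
    spark.stop()
  }
}
```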

From experience, benchmark reports that companies produce about their own products are generally not to be trusted! Independent third-party benchmarks, such as TPC-H, are more credible: their reports tend to cover most scenarios in order to reflect results realistically.

Generally speaking, Spark runs faster than MapReduce for the following reasons:

  • Faster task startup: Spark forks a thread inside an already-running executor, while MR launches a new JVM process for every task;
  • Faster shuffles: Spark touches the disk only when a shuffle requires it, while MR writes intermediate data to disk at every step;
  • Faster workflows: a typical MR workflow is a chain of MR jobs, and each job must persist its output to disk before the next can read it; Spark supports DAGs and pipelining, so intermediate data never has to be written to disk unless a shuffle demands it;
  • Caching: HDFS now supports caching too, but in general Spark's cache is more effective, especially in SparkSQL, where data can be cached in memory in columnar form (see the sketch after this list).
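  A brief sketch of the last two points, assuming a local session; the table name and dummy data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object PipelineAndColumnarCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Narrow transformations (map, filter) are pipelined into a single
    // stage: each record flows through all three functions in memory,
    // with no intermediate materialization to disk.
    val pipelined = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .map(_.toString)
    println(pipelined.count())   // one stage, no shuffle

    // SparkSQL caches tables in a compressed, in-memory columnar format.
    import spark.implicits._
    val df = (1 to 1000).map(i => (i, s"name$i")).toDF("id", "name")
    df.createOrReplaceTempView("people")
    spark.catalog.cacheTable("people")   // columnar in-memory cache
    spark.sql("SELECT count(*) FROM people WHERE id > 500").show()

    spark.stop()
  }
}
```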

  All of this gives Spark better performance than Hadoop: on relatively short jobs it can indeed be up to 100 times faster, but in a real production environment it is generally only about 2.5x ~ 3x faster!

A bit of rambling to close: I have been swamped lately, digging out the landmines buried in the data our competitors handed over, and constantly emailing our Belgian partners to check on the project's progress and solutions. But that won't stop me from publishing an article every month. -_-