In-depth analysis: Are Redis distributed locks safe? After reading this article, you will understand them thoroughly!

Search WeChat for the "Water Drop and Silver Bullet" official account to get high-quality technical content first. Seven years of senior back-end development, showing you a different technical perspective.

Hi everyone, my name is Kaito.

In this article, I want to talk to you about the "security" of Redis distributed locks.

The topic of Redis distributed locks has been written about countless times. Why am I writing yet another article on it?

Because I found that 99% of the articles on the Internet never really clarify the issue. As a result, many readers finish article after article and are still left confused. For example, can you answer these questions clearly?

  • How to implement a distributed lock based on Redis?
  • Is Redis distributed lock really safe?
  • What's wrong with Redlock in Redis? Is it safe?
  • The industry is arguing about Redlock, what are they arguing about? Which view is correct?
  • Should Redis or Zookeeper be used for distributed locks?
  • What are the issues that need to be considered when implementing a "fault-tolerant" distributed lock?

In this article, I will clarify these issues thoroughly.

After reading this article, you will not only have a thorough understanding of distributed locks, but also have a deeper understanding of "distributed systems".

The article is a bit long, but it is packed with substance. I hope you can read it patiently.

Why do we need distributed locks?

Before diving into distributed locks, let me briefly explain why they are needed.

The counterpart of a distributed lock is the "single-machine lock". When writing multi-threaded programs, to avoid data problems caused by multiple threads operating a shared variable at the same time, we usually use a lock for "mutual exclusion" to guarantee the correctness of the shared variable. Its scope is the "same process".

If multiple processes need to operate on a shared resource at the same time, how can they be mutually exclusive?

For example, business applications today usually use a microservice architecture, which means one application is deployed as multiple processes. If these processes need to modify the same row of records in MySQL, then to avoid data errors caused by out-of-order operations, we need to introduce a "distributed lock".

To implement distributed locks, an external system must be used, and all processes go to this system to apply for "locking".

And this external system must realize the ability of "mutual exclusion", that is, two requests coming in at the same time will only return success to one process, and return failure (or wait) to the other process.

This external system can be MySQL, Redis or Zookeeper. But in pursuit of better performance, we usually choose to use Redis or Zookeeper.

Next, taking Redis as the main thread and going from shallow to deep, I will walk you through an in-depth analysis of the various "safety" issues of distributed locks, to help you understand them thoroughly.

How to implement distributed locks?

Let's start with the simplest implementation.

To implement a distributed lock, Redis must provide "mutual exclusion". We can use the SETNX command, which stands for SET if Not eXists: if the key does not exist, its value is set; otherwise, nothing is done.

Two client processes can execute this command to achieve mutual exclusion, and a distributed lock can be realized.

Client 1 applies for locking, and the locking is successful:

127.0.0.1:6379> SETNX lock 1
(integer) 1     // Client 1, locked successfully

Client 2 applies for the lock; because it arrives later, locking fails:

127.0.0.1:6379> SETNX lock 1
(integer) 0     // Client 2, locking failed

At this point, the client that has successfully locked can operate the "shared resource", for example, modify a row of MySQL data, or call an API request.

After the operation is completed, the lock must be released in time to give latecomers the opportunity to operate shared resources. How to release the lock?

It is also very simple, just use the DEL command to delete this key:

127.0.0.1:6379> DEL lock    // release the lock
(integer) 1

The logic is very simple, and the overall flow looks like this:

However, it has a big problem. After client 1 gets the lock, if the following scenario occurs, it will cause a "deadlock":

  1. The program throws an exception while handling business logic, and the lock is never released
  2. The process crashes, and there is no chance to release the lock

At this time, this client will keep occupying this lock, and other clients will "never" get this lock.

How to solve this problem?

How to avoid deadlock?

The solution we can easily think of is to set a "lease period" for the lock when applying for the lock.

When implemented in Redis, it is to set an "expiration time" for this key. Here we assume that the time to operate the shared resource will not exceed 10s, then when locking, set the 10s expiration for this key:

127.0.0.1:6379> SETNX lock 1        // lock
(integer) 1
127.0.0.1:6379> EXPIRE lock 10      // automatically expires after 10s
(integer) 1

In this way, regardless of whether the client is abnormal or not, the lock can be "automatically released" after 10s, and other clients can still get the lock.

But is this really okay?

There are still problems.

In the current scheme, locking and setting the expiration are two separate commands. Could it happen that only the first one executes, and the second one never gets a chance to run? For example:

  1. SETNX was executed successfully, but the execution failed due to network problems when EXPIRE was executed
  2. SETNX is executed successfully, Redis crashes abnormally, EXPIRE has no chance to execute
  3. SETNX is executed successfully, the client crashes abnormally, and EXPIRE has no chance to execute

In short, these two commands are not guaranteed to be atomic (either both succeed or neither does), so there is a risk that setting the expiration time fails, and the "deadlock" problem can still occur.

What should we do?

Before Redis 2.6.12, we need to think of ways to ensure the atomic execution of SETNX and EXPIRE, and consider how to deal with various abnormal situations.

But after Redis 2.6.12, Redis expanded the parameters of the SET command, just use this command:

// One command guarantees atomic execution
127.0.0.1:6379> SET lock 1 EX 10 NX
OK

This solves the deadlock problem and is relatively simple.

Let's keep analyzing: what problems does it still have?

Imagine such a scenario:

  1. Client 1 successfully locked and started operating shared resources
  2. The time for client 1 to operate the shared resource "exceeds" the lock expiration time, and the lock is "automatically released"
  3. Client 2 successfully locked and started operating shared resources
  4. Client 1 completes its operation on the shared resource and releases the lock (but it actually releases client 2's lock)

As you can see, there are two serious problems here:

  1. Lock expiration : Client 1 takes too long to operate the shared resource, so the lock expires and is automatically released, then acquired by client 2
  2. Releasing someone else's lock : Client 1 releases client 2's lock after finishing its operation on the shared resource

What is the cause of these two problems? Let's look at them one by one.

The first problem may be caused by the inaccuracy of the time we evaluate and operate shared resources.

For example, the "slowest" time to operate shared resources may take 15s, but we only set an expiration of 10s, so there is a risk of premature lock expiration.

If the expiration time is too short, can't we just add more margin, say set it to 20s? Would that always work?

This can indeed "alleviate" the problem and reduce the probability of a problem, but it still cannot "completely solve" the problem.

Why?

The reason is that after the client obtains the lock, when operating the shared resource, the scenario encountered may be very complicated, for example, an exception occurs in the program, a network request timeout, and so on.

Since the time is only an "estimate", it can only be roughly calculated, unless you can predict and cover every scenario that makes the operation take longer, which is very hard in practice.

Is there any better solution?

Don't worry, about this problem, I will talk about the corresponding solution in detail later.

We continue to look at the second question.

The second problem is that one client releases locks held by other clients.

Think about it, what is the key point that caused this problem?

The key point is that each client releases the lock "blindly": it never checks whether the lock is still "owned by itself", so there is a risk of releasing someone else's lock. This unlocking process is not rigorous at all!

How to solve this problem?

What should I do if the lock is released by someone else?

The solution is: when the client locks, set a "unique identifier" that only it knows.

For example, it can be your own thread ID or a UUID (random and unique). Here we take UUID as an example:

// The value of the lock is set to a UUID
127.0.0.1:6379> SET lock $uuid EX 20 NX
OK

Here we assume that 20s is more than enough time to operate the shared resource, and set aside the automatic-expiration problem for now.

After that, when releasing the lock, you must first determine whether the lock is still held by yourself. The pseudo code can be written as follows:

// If the lock is still ours, release it
if redis.get("lock") == $uuid:
    redis.del("lock")

The two commands used to release the lock here are GET + DEL. At this time, we will encounter the atomicity problem we mentioned earlier.

  1. Client 1 executes GET and judges that the lock is its own
  2. Client 2 executes the SET command and acquires the lock (the probability is low, but we need to reason rigorously about the lock's safety model)
  3. Client 1 executes DEL, but it releases client 2's lock

It can be seen that these two commands still have to be executed atomically.

How to execute it atomically? Lua script.

We can write this logic as a Lua script and let Redis execute it.

Because Redis processes requests on a single thread, other requests must wait while a Lua script is executing, so no other command can be interleaved between the GET and the DEL.

The Lua script to safely release the lock is as follows:

-- If the lock is still ours, release it
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end

Well, with these optimizations, the whole locking and unlocking process has become much more "rigorous".

Let's summarize. A rigorous distributed lock based on Redis works as follows (a minimal code sketch follows the list):

  1. Locking: SET $lock_key $unique_id EX $expire_time NX
  2. Operating shared resources
  3. Release the lock: Lua script, first GET to determine whether the lock belongs to itself, and then DEL to release the lock
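As a concrete illustration, here is a minimal sketch of this three-step flow in Python using the redis-py client. It assumes a single Redis instance at 127.0.0.1:6379; the key name, expiration, and the with_lock helper are illustrative choices, not code from the original article.

```python
import uuid
import redis

# Lua script: release the lock only if it still carries our unique value
RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
else
    return 0
end
"""

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)
release = r.register_script(RELEASE_LUA)

def with_lock(key, expire_seconds, work):
    token = uuid.uuid4().hex
    # 1. Lock: SET $lock_key $unique_id EX $expire_time NX, one atomic command
    if not r.set(key, token, nx=True, ex=expire_seconds):
        return False            # someone else holds the lock
    try:
        work()                  # 2. operate the shared resource
        return True
    finally:
        # 3. Release: GET + DEL executed atomically inside Redis via Lua
        release(keys=[key], args=[token])
```

The two details that matter are that the SET with nx/ex is a single atomic command, and that the release goes through a Lua script, so another client's SET can never slip in between the GET and the DEL.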

Well, with this complete lock model, let us return to the first problem mentioned earlier.

What should I do if the lock expiration time is not easy to evaluate?

What should I do if the lock expiration time is not easy to evaluate?

As we mentioned earlier, if the expiration time of the lock is not evaluated well, the lock will have the risk of expiring "early".

The compromise given at that time was to add "redundancy" to the expiration time, reducing the probability of the lock expiring early.

In fact, this solution can't solve the problem perfectly, so what should I do?

Can we design a scheme like this: when locking, first set an expiration time, then start a "daemon thread" that periodically checks the lock's remaining lifetime; if the lock is about to expire but the operation on the shared resource has not finished, the thread automatically "renews" the lock by resetting its expiration time.

This is indeed a better solution.
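To make the idea concrete, here is a minimal sketch of such a renewal daemon thread in Python with redis-py. The renewal interval, the check-and-renew Lua script, and the function names are my own illustrative choices, not how Redisson (introduced next) implements its watchdog.

```python
import threading
import redis

# Extend the key's TTL only if we still hold the lock (checked atomically in Lua)
RENEW_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
    return 0
end
"""

def start_watchdog(client: redis.Redis, key: str, token: str, ttl_sec: float) -> threading.Event:
    renew = client.register_script(RENEW_LUA)
    stop = threading.Event()

    def loop():
        # Wake up at 1/3 of the TTL; stop when asked, or when the lock is lost
        while not stop.wait(ttl_sec / 3):
            if renew(keys=[key], args=[token, int(ttl_sec * 1000)]) == 0:
                break   # the lock expired or was taken over; stop renewing

    threading.Thread(target=loop, daemon=True).start()
    return stop   # the caller sets this event after releasing the lock
```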

If you are on the Java stack, fortunately there is already a library that encapsulates all of this: Redisson.

Redisson is a Redis SDK client implemented in Java. When it is used for distributed locks, it applies exactly this "automatic renewal" scheme to avoid premature lock expiration; the daemon thread is commonly called the "watchdog" thread.

In addition, this SDK also encapsulates many easy-to-use functions:

  • Reentrant lock
  • Optimistic lock
  • Fair lock
  • Read-write lock
  • Redlock (red lock, will be described in detail below)

The API provided by this SDK is very friendly. It can operate distributed locks like local locks. If you are a Java technology stack, you can use it directly.

I will not focus on the use of Redisson here. You can see the official Github to learn how to use it, which is relatively simple.

Let's summarize here again, the implementation of distributed locks based on Redis, the problems encountered before, and the corresponding solutions:

  • Deadlock : Set expiration time
  • The expiration time is not well evaluated, and the lock expires early : daemon thread, automatic renewal
  • The lock is released by someone else : the lock is written with a unique ID, and the ID is checked first when the lock is released, and then released

What other problem scenarios will endanger the security of Redis locks?

The previously analyzed scenarios are all problems that may occur when locked in a "single" Redis instance, and do not involve the details of the Redis deployment architecture.

When using Redis, we usually deploy a master-slave cluster plus Sentinel. The benefit is that when the master crashes unexpectedly, Sentinel performs "automatic failover" and promotes a slave to master, so the service keeps running and availability is preserved.

Then when the "master-slave switch occurs", will this distributed lock still be safe?

Imagine this scenario:

  1. Client 1 executes the SET command on the main library, and the lock is successful
  2. At this time, the main library is down abnormally, and the SET command has not been synchronized to the slave library (master-slave replication is asynchronous)
  3. A slave is promoted by Sentinel to be the new master, but the lock does not exist on this new master: the lock is lost!

As you can see, once Redis replication enters the picture, the distributed lock can still be broken.

How do we solve this problem?

For this, the author of Redis proposed a solution: the Redlock algorithm we often hear about.

Can it really solve the above problem?

Is Redlock really safe?

Okay, we have finally reached the highlight of this article. What? All the problems discussed above were just the basics?

Yes. Those were just appetizers; the real main course starts here.

If you haven't understood the content mentioned above, I suggest you read it again and first clarify the basic process of locking and unlocking.

If you already know Redlock, you can follow along as a review. If you don't know Redlock yet, that's fine; I will walk you through it from scratch.

It is worth reminding you that I will not only talk about the principle of Redlock, but also bring up many questions about "distributed systems". You'd better follow my thoughts and analyze the answers to the questions together in your mind.

Now let's look at how the Redlock solution proposed by the Redis author solves the problem of lock failure after the master-slave switch.

Redlock's solution is based on 2 premises:

  1. No slave or Sentinel instances are deployed, only masters
  2. However, multiple main libraries need to be deployed, and at least 5 instances are officially recommended

In other words, to use Redlock you need to deploy at least 5 Redis instances, all of them masters. They have no relationship to each other; they are completely independent instances.

Note: this is not a Redis Cluster, but 5 plain, standalone Redis instances.

How to use Redlock specifically?

The overall flow is divided into 5 steps (a code sketch of the acquire logic follows the list):

  1. The client first obtains the "current timestamp T1"
  2. The client sends lock requests to the 5 Redis instances in turn (using the SET command described above), and sets a short timeout for each request (at the millisecond level, far shorter than the lock's validity time). If one request fails (network timeout, lock already held by someone else, or any other abnormal situation), it immediately moves on to the next Redis instance
  3. If the client locks successfully on >= 3 instances (a majority), it obtains the "current timestamp T2" again. If T2 - T1 is less than the lock's expiration time, the lock is considered acquired; otherwise it is considered failed
  4. If the lock is acquired, operate the shared resource (for example, modify a MySQL row, or make an API request)
  5. If locking failed, send a release request to "all nodes" (using the Lua release script mentioned earlier)
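Here is a minimal sketch of that acquire logic in Python with redis-py, assuming five independent Redis clients have already been created (with short per-node socket timeouts). The quorum calculation and monotonic-clock timing follow the steps above, but the function name and error handling are illustrative only; real implementations such as Redisson handle more edge cases.

```python
import time
import uuid
import redis

RELEASE_LUA = ("if redis.call('GET', KEYS[1]) == ARGV[1] then "
               "return redis.call('DEL', KEYS[1]) else return 0 end")

def redlock_acquire(clients, key, ttl_ms):
    """Try to lock a majority of independent Redis masters; return the token or None."""
    token = uuid.uuid4().hex
    quorum = len(clients) // 2 + 1           # e.g. 3 out of 5
    t1 = time.monotonic()                    # step 1: timestamp T1
    acquired = 0
    for c in clients:                        # step 2: lock each instance in turn
        try:
            if c.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            pass                             # treat errors/timeouts as a failed node
    elapsed_ms = (time.monotonic() - t1) * 1000
    if acquired >= quorum and elapsed_ms < ttl_ms:
        return token                         # step 3: majority + time check passed
    for c in clients:                        # step 5: failed, so release on ALL nodes
        try:
            c.eval(RELEASE_LUA, 1, key, token)
        except redis.RedisError:
            pass
    return None
```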

Let me briefly summarize for you, there are 4 key points:

  1. The client applies for locks on multiple Redis instances
  2. Must ensure that most nodes are successfully locked
  3. The total time it takes to lock most nodes is less than the expiration time of the lock setting
  4. To release the lock, it is necessary to initiate a lock release request to all nodes

It may not be easy to understand the first time you see it. It is recommended that you read the above text several times to deepen your memory.

Then, it is very important to remember these 5 steps. Based on this process, we will analyze various assumptions that may cause the lock to fail.

Well, after understanding Redlock's process, let's see why Redlock does this.

1) Why do we need to add locks on multiple instances?

Essentially for "fault tolerance", some instances are abnormally down, the remaining instances are successfully locked, and the entire lock service is still available.

2) Why is locking considered successful only when a majority of nodes succeed?

Multiple Redis instances are used together to form a "distributed system".

In a distributed system, there will always be "abnormal nodes". Therefore, when discussing distributed system problems, you need to consider how many abnormal nodes there are, and it will still not affect the "correctness" of the entire system.

This is the "fault tolerance" problem of distributed systems. The conclusion is: if the only failures are node "crashes", then as long as a majority of nodes are working normally, the whole system can still provide correct service.

This is modeled by the "Byzantine Generals" problem we often hear about; if you are interested, you can look into the derivation of those algorithms.

3) Why, after a majority of nodes is locked in step 3, do we still need to compute the accumulated locking time?

Because locking involves multiple nodes, it inevitably takes longer than operating a single instance. Moreover, these are network requests, and network conditions are complex: delays, packet loss, timeouts and so on can all happen, and the more network requests there are, the higher the probability that something goes wrong.

So even if a majority of nodes is locked successfully, if the accumulated locking time has already "exceeded" the lock's expiration time, the locks on some instances may have expired by then, and the lock is meaningless.

4) Why release the lock and operate all nodes?

When locking on a particular Redis node, the request may look like a failure to the client because of "network reasons".

For example, the client's SET on a Redis instance actually succeeds, but reading the response fails because of a network problem; from the client's perspective locking failed, yet the lock has in fact been set on that Redis node.

Therefore, when releasing the lock, regardless of whether the lock was successfully locked before, it is necessary to release the locks of "all nodes" to ensure that the "residual" locks on the nodes are cleared.

Okay, now that we understand Redlock's process and the reasoning behind it, it seems that Redlock has indeed solved the problem of lock failure when a Redis node crashes, ensuring the "safety" of the lock.

But is this really the case?

Who is right in the Redlock argument?

As soon as the Redis author put this scheme forward, it was questioned by a leading distributed-systems expert in the industry!

This expert is Martin Kleppmann, a distributed systems researcher at the University of Cambridge in the UK. Before that he was a software engineer and entrepreneur working on large-scale data infrastructure. He also speaks frequently at conferences, writes blogs and books, and is an open source contributor.

He immediately wrote an article arguing that the Redlock algorithm model is flawed, and put forward his own views on how distributed locks should be designed.

Afterwards, Redis author Antirez, not to be outdone in the face of the criticism, wrote an article rebutting those views and analyzing many more design details of the Redlock algorithm.

Moreover, the debate on this issue also caused very intense discussions on the Internet at that time.

Both sides have clear reasoning and solid arguments. This is a clash of masters, and a very valuable collision of ideas in the field of distributed systems! Both parties are experts in distributed systems, yet they reached many opposite conclusions on the same problem. What is going on?

Below I will extract important points from their controversial articles, organize them and present them to you.

Reminder: the information density from here on is high and may be hard to digest at first read; it is best to slow down.

Distributed systems expert Martin questions Redlock

His article mainly makes four points:

1) What is the purpose of distributed locks?

Martin said that you must first understand what is the purpose of using distributed locks?

He thinks there are two purposes.

1. Efficiency.

Using a distributed lock's mutual exclusion merely avoids unnecessarily doing the same work twice (for example, some expensive computation). If the lock occasionally fails, the consequences are not severe, for instance an email gets sent twice, which is harmless.

2. Correctness.

The lock is used to prevent concurrent processes from interfering with each other. If the lock fails, multiple processes operate on the same data at the same time, and the consequences are serious: data corruption, permanent inconsistency, data loss, and other malignant problems, like giving a patient a repeated dose of a drug; the consequences are grave.

He believes that if you only care about efficiency, a standalone Redis is enough. Even if the lock occasionally fails (crash, master-slave switch), there are no serious consequences, and using Redlock is too heavyweight and unnecessary.

And if it is for correctness, Martin believes that Redlock does not meet the security requirements at all, and there is still the problem of lock failure!

2) Problems encountered when locking in a distributed system

Martin said that a distributed system is more like a complex "beast", with all kinds of abnormal situations you can't think of.

These abnormal scenarios fall mainly into three categories, the three big mountains that every distributed system runs into: NPC.

  • N: Network Delay, network delay
  • P: Process Pause, process pause (GC)
  • C: Clock Drift, clock drift

Martin used a process pause (GC) as an example to point out a safety problem with Redlock:

  1. Client 1 requests to lock nodes A, B, C, D, E
  2. After client 1 gets the lock, it enters the GC (longer time)
  3. The locks on all Redis nodes have expired
  4. Client 2 gets the locks on A, B, C, D, E
  5. Client 1 GC ends and thinks that the lock has been successfully acquired
  6. Client 2 also thinks that the lock has been acquired and a "conflict" has occurred

Martin believes that GC may occur at any time in the program, and the execution time is uncontrollable.

Note: Of course, even if a programming language without GC is used, network delays and clock drifts may cause Redlock problems. Here Martin is just taking GC as an example.

3) It is unreasonable to assume that the clock is correct

Alternatively, when the "clocks" of the Redis nodes go wrong, Redlock can also fail:

  1. Client 1 obtains the locks on nodes A, B, C, but cannot access D and E due to network problems
  2. The clock on node C "jumps forward", causing the lock to expire
  3. Client 2 acquires the locks on nodes C, D, E, and cannot access A and B due to network problems
  4. Clients 1 and 2 now believe that they hold the lock (conflict)

Martin argued that Redlock "strongly depends" on the clocks of multiple nodes being synchronized: once a node's clock goes wrong, the algorithm model breaks down.

Even if C is not a clock jump, but "restart immediately after crash", a similar problem will occur.

Martin went on to explain that an error in the machine's clock is very likely to happen:

  • The system administrator "manually modified" the machine clock
  • The machine clock made a big "jump" when synchronizing the NTP time

In short, Martin believes that Redlock's algorithm is based on the "synchronization model". A large amount of data research shows that the assumption of the synchronization model is problematic in distributed systems.

In a chaotic distributed system, you cannot assume that the system clock is correct, so you must be very careful about your assumptions.

4) Propose a fencing token scheme to guarantee correctness

Correspondingly, Martin proposed a scheme called the fencing token to guarantee the correctness of distributed locks.

The model process is as follows:

  1. When the client acquires the lock, the lock service can provide an "incremental" token
  2. The client holds this token to operate shared resources
  3. The shared resource service can reject requests carrying an "old" (smaller) token

In this way, no matter what kind of abnormal situation of the NPC occurs, the security of the distributed lock can be guaranteed, because it is built on the "asynchronous model".

Redlock, however, cannot provide anything like a fencing token, so it cannot guarantee correctness.

He also said that a good distributed lock, no matter what NPC event occurs, may fail to give a result within the specified time, but it must never give a wrong result. In other words, NPC should only affect the lock's "liveness" (performance), never its "correctness".
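To illustrate the idea, here is a toy sketch of a resource server that enforces fencing tokens; the class and its fields are purely illustrative and are not from Martin's article.

```python
import threading

class FencedResource:
    """Toy resource server: accepts a write only if its fencing token is the largest seen so far."""

    def __init__(self):
        self._guard = threading.Lock()   # local lock protecting the resource's own state
        self._max_token = -1
        self.value = None

    def write(self, token: int, new_value) -> bool:
        with self._guard:
            if token <= self._max_token:
                return False             # a latecomer holding a stale token is rejected
            self._max_token = token
            self.value = new_value
            return True
```

A client that went through a long GC pause would come back carrying an old token and simply be rejected, which is exactly the property Martin asks for.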

Martin's conclusion:

1. Redlock is neither fish nor fowl : for efficiency purposes it is too heavyweight and unnecessary; for correctness purposes it is not safe enough.

2. Unreasonable clock assumptions : The algorithm makes dangerous assumptions about the system clock (assuming that the clocks of multiple nodes are the same). If these assumptions are not met, the lock will fail.

3. Correctness cannot be guaranteed : Redlock cannot provide anything like a fencing token, so the correctness problem cannot be solved. If you need correctness, use software built on a "consensus system", such as Zookeeper.

Well, the above is Martin's objection to the use of Redlock, which seems to be justified.

Let's take a look at how Redis author Antirez refuted.

Redis author Antirez's rebuttal

In the Redis author's article, there are three key points:

1) Explain the clock problem

First of all, the Redis author saw through the core of the other side's criticism at a glance: the clock problem.

The author of Redis stated that Redlock does not need a completely consistent clock, but only needs to be roughly consistent, allowing "errors".

For example, when timing 5 seconds, a node might actually measure 4.5s or 5.5s. There is some error, but as long as the error stays well below the lock's validity time, this level of clock accuracy is not demanding, and it matches real-world environments.

Regarding the issue of "clock modification" mentioned by the other party, the Redis author retorted:

  1. Manually modifying the clock : just don't do that. Otherwise, if you directly tamper with a Raft log, Raft stops working too...
  2. Clock jumps : with "proper operations", you can ensure the machine clock never jumps wildly (only small adjustments at a time), and this is realistically achievable

Why do Redis authors give priority to explaining clock issues? Because in the subsequent rebuttal process, you need to rely on this basis for further explanation.

2) Explain network latency and GC issues

The Redis author then also rebutted the claim that network delay and process GC can invalidate Redlock:

Let's revisit Martin's hypothetical scenario:

  1. Client 1 requests to lock nodes A, B, C, D, E
  2. After client 1 gets the lock, it enters the GC
  3. The locks on all Redis nodes have expired
  4. Client 2 obtains the locks on nodes A, B, C, D, and E
  5. Client 1 GC ends and thinks that the lock has been successfully acquired
  6. Client 2 also thinks that the lock has been acquired and a "conflict" has occurred

The Redis author refuted that this assumption is actually problematic, and Redlock can guarantee lock security.

What is going on here?

Remember those 5 steps that introduced the Redlock process earlier? I'll bring it here again for you to review.

  1. The client first obtains the "current timestamp T1"
  2. The client sends lock requests to the 5 Redis instances in turn (using the SET command described above), and sets a short timeout for each request (at the millisecond level, far shorter than the lock's validity time). If one request fails (network timeout, lock already held by someone else, or any other abnormal situation), it immediately moves on to the next Redis instance
  3. If the client locks successfully on >= 3 instances (a majority), it obtains the "current timestamp T2" again. If T2 - T1 is less than the lock's expiration time, the lock is considered acquired; otherwise it is considered failed
  4. The lock is successful, to operate the shared resource (for example, modify a row of MySQL, or initiate an API request)
  5. Failed to lock, initiate a lock release request to "all nodes" (the Lua script mentioned earlier releases the lock)

Note that the focus is on steps 1-3. In step 3, after locking succeeds, why does the client need to obtain the "current timestamp T2" again and compare T2 - T1 with the lock's expiration time?

The Redis author emphasizes: if a time-consuming anomaly such as network delay or process GC occurs during steps 1-3, it will be detected in step 3 by T2 - T1. If that exceeds the lock's expiration time, the acquisition is simply treated as failed, and the locks on all nodes are released!

The Redis author continues: if the other side argues that the network delay or process GC happened after step 3, that is, after the client has confirmed it holds the lock and while it is operating the shared resource, then the resulting lock failure is not just Redlock's problem; any other lock service, Zookeeper included, has the same issue, so it is outside the scope of the discussion.

Here I give an example to explain this problem:

  1. The client successfully obtains the lock through Redlock (it passed the majority-of-nodes check and the accumulated-locking-time check)
  2. The client starts to operate the shared resource, and a long time-consuming situation such as network delay and process GC occurs at this time
  3. At this time, the lock is automatically released after expiration
  4. The client starts to operate MySQL (the lock at this time may be obtained by others and the lock becomes invalid)

The conclusion of the Redis author here is:

  • Before the client gets the lock, no matter what time-consuming problem it experiences, Redlock can detect it in step 3.
  • After the client gets the lock, an NPC occurs, and Redlock and Zookeeper are powerless

Therefore, the Redis author believes that, on the premise that clocks are correct, Redlock can guarantee the safety of the lock.

3) Question the fencing token mechanism

The Redis author also questioned the fencing token mechanism proposed by the other side, on two main points. This is the hardest part to follow, so please stay with my reasoning.

First , this solution must require the "shared resource server" to be operated to have the ability to reject the "old token".

For example, to operate MySQL, the client gets a monotonically increasing token from the lock service and carries it when updating a row; this relies on MySQL's transaction isolation.

-- The two clients rely on transactions to achieve isolation
-- Note the token check in the WHERE condition
UPDATE table T SET val = $new_val, current_token = $token
WHERE id = $id AND current_token < $token

But what if it is not MySQL? For example, if you write a file to the disk or initiate an HTTP request, then this solution is powerless. This places higher requirements on the resource server to be operated.

In other words, most of the resource servers to be operated do not have this mutual exclusion capability.

Furthermore, if the resource server already has this "mutual exclusion" capability, why would we need a distributed lock at all?

Therefore, the Redis author believes that this scheme is untenable.

Second, take a step back: even though Redlock does not provide a fencing token, it already provides a random value (the UUID mentioned above), and this random value can be used to achieve the same effect as a fencing token.

How to do it?

The Redis author only mentioned that a similar effect to the fencing token can be achieved, without expanding on the details. Based on the material I have read, the rough process should be as follows; if I have it wrong, feel free to reach out and discuss~

  1. The client uses Redlock to get the lock
  2. Before operating the shared resource, the client first writes the lock's VALUE onto the shared resource as a mark
  3. The client then runs its business logic, and finally, when modifying the shared resource, checks whether the mark is still the same as before; it modifies only if it is (similar to the CAS idea)

Let's take MySQL as an example. An example is this:

  1. The client uses Redlock to get the lock
  2. Before the client wants to modify a row of data in the MySQL table, first update the VALUE of the lock to a field in this row (here assumed to be the current_token field)
  3. Client processing business logic
  4. The client modifies this row of MySQL data, using the VALUE as part of the WHERE condition:

UPDATE table T SET val = $new_val
WHERE id = $id AND current_token = $redlock_value

As you can see, this scheme relies on MySQL's transaction mechanism and achieves the same effect as the fencing token the other side proposed.

But there is still a small problem here, raised by netizens during the discussion: if two clients use this scheme, first "marking" and then "checking + modifying" the shared resource, can the order of their operations be guaranteed?

With Martin's fencing token, because the token is a monotonically increasing number, the resource server can reject requests carrying a smaller token, which guarantees the "order" of operations!

The Redis author gave a different explanation on this point, which I find reasonable: the essence of a distributed lock is "mutual exclusion". As long as it guarantees that, of two concurrent clients, one succeeds and the other fails, that is enough; there is no need to care about "order".

Throughout his criticism, Martin kept caring about this ordering issue, but the Redis author holds a different view.

In summary, the Redis author's conclusion:

1. The author acknowledges the other side's point about the impact of "clock jumps" on Redlock, but believes clock jumps can be avoided through proper infrastructure and operations.

2. Redlock's design already takes the NPC problems into full account: if NPC happens before step 3 of Redlock, the lock's correctness is still guaranteed; if NPC happens after step 3, then it is not only Redlock that has a problem, every other distributed lock service does too, so it is outside the scope of the discussion.

Is it interesting?

In a distributed system, a small lock may encounter so many problem scenarios, which affect its security!

After reading both sides' views, I wonder which side you agree with?

Don't worry, I will summarize the above arguments later and talk about my own understanding.

Well, having covered the two sides' dispute over Redis distributed locks, you may have also noticed that in his article Martin recommends implementing distributed locks with Zookeeper instead, arguing that it is safer. Is that true?

Are Zookeeper-based locks safe?

If you have known Zookeeper, the distributed lock based on it is like this:

  1. Clients 1 and 2 both try to create a "temporary (ephemeral) node", such as /lock
  2. Suppose client 1 arrives first; it locks successfully, and client 2 fails to lock
  3. Client 1 operates the shared resource
  4. Client 1 deletes the /lock node, releasing the lock

As you can see, Zookeeper does not need to worry about lock expiration the way Redis does. It uses a "temporary node" to ensure that once client 1 gets the lock, it keeps holding the lock as long as its connection stays alive.

Moreover, if client 1 crashes abnormally, this temporary node will be automatically deleted, ensuring that the lock will be released.
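As a rough illustration, here is a minimal sketch of this ephemeral-node lock in Python using the kazoo client, assuming a ZooKeeper server at 127.0.0.1:2181; the node path and identifier are arbitrary examples (production code would normally use kazoo's built-in Lock recipe instead).

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def try_lock(path="/lock"):
    try:
        # An ephemeral node is deleted automatically when the session ends,
        # so a crashed client cannot leave a dangling lock behind.
        zk.create(path, b"client-1", ephemeral=True)
        return True
    except NodeExistsError:
        return False          # someone else already holds the lock

if try_lock():
    try:
        pass                  # operate the shared resource here
    finally:
        zk.delete("/lock")    # release the lock explicitly when done
```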

Yes, there is no worry about lock expiration, and the lock can be automatically released when abnormal. Does it feel perfect?

Actually, no.

Think about it. After client 1 creates a temporary node, how does Zookeeper ensure that the client keeps holding the lock?

The reason is that client 1 will maintain a session with the Zookeeper server at this time, and this session will rely on the client's "timed heartbeat" to maintain the connection.

If Zookeeper does not receive the client's heartbeat for a long time, it considers that the Session has expired, and deletes the temporary node.

Similarly, based on this issue, we also discuss the impact of GC issues on Zookeeper locks:

  1. Client 1 successfully created a temporary node/lock and got the lock
  2. Client 1 has a long GC
  3. Client 1 cannot send heartbeats to Zookeeper, so Zookeeper deletes the temporary node
  4. Client 2 successfully created a temporary node/lock and got the lock
  5. Client 1 GC is over, it still thinks it holds the lock (conflict)

It can be seen that even if Zookeeper is used, the safety of the process GC and network delay abnormal scenarios cannot be guaranteed.

This is exactly what the Redis author pointed out in his rebuttal: if the client has already obtained the lock but then "loses contact" with the lock server (for example, during a GC pause), it is not only Redlock's problem; other lock services have the same issue, and Zookeeper is no exception!

So, here we can come to the conclusion: a distributed lock, in extreme cases, is not necessarily safe.

If your business data is very sensitive, you must pay attention to this issue when using distributed locks. You cannot assume that distributed locks are 100% safe.

Okay, now let's summarize the pros and cons of Zookeeper when using distributed locks:

Advantages of Zookeeper:

  1. No need to consider the expiration time of the lock
  2. Watch mechanism: if locking fails, you can watch the lock node and wait for it to be released, implementing an optimistic lock

But its disadvantages are:

  1. Performance is not as good as Redis
  2. High deployment and operation and maintenance costs
  3. If the client loses its connection to Zookeeper for too long, the lock is released

My understanding of distributed locks

Well, we have now looked in detail at distributed locks implemented with Redis (Redlock) and with Zookeeper, and at their safety under various abnormal situations. Let me share my own views; they are for reference only.

1) Should I use Redlock at all?

As analyzed earlier, Redlock can only work normally if it is built on the premise of "correct clock". If you can guarantee this premise, you can use it.

But I don't think guaranteeing a correct clock is as simple as it sounds.

1. From a hardware perspective, clock drift happens all the time and cannot be avoided.

For example, CPU temperature, machine load, and even chip material can cause clocks to drift.

2. From my own work experience, I have run into both clock errors and heavy-handed ops staff manually changing clocks, both of which affected the correctness of the system. Human error is therefore hard to rule out completely.

So my personal stance on Redlock is to avoid it where possible: its performance is worse than a standalone Redis and its deployment cost is higher. I would still give priority to the master-slave + Sentinel model for implementing distributed locks.

How to ensure correctness? The second point gives you the answer.

2) How to use distributed locks correctly?

When analyzing Martin's views, we discussed the fencing token scheme, which inspired me a lot. Although it has significant limitations, for scenarios that demand "correctness" it is a very good idea.

So, we can combine the two to use:

1. At the upper layer, use the distributed lock for "mutual exclusion". Even though the lock may fail in extreme cases, it blocks the vast majority of concurrent requests at the top and greatly reduces the pressure on the resource layer below.

2. But for business data that must be absolutely correct, you must also protect it at the resource layer itself; the fencing token scheme is a good source of design ideas for that.

Combining the two ideas, I think that for most business scenarios, it can already meet the requirements.

Summary

Well, to sum it up.

In this article, we mainly discuss the issue of whether distributed locks based on Redis are safe.

From the simplest distributed lock implementation, to handling various abnormal scenarios, to introducing Redlock and the debate between two distributed-systems experts, we worked out the scenarios in which Redlock is applicable.

Finally, we also compared the problems that Zookeeper may encounter when doing distributed locks, and the differences with Redis.

Here I have summarized these contents into a mind map for your convenience.

Postscript

This article contains a lot of information; I believe it has clarified the issue of distributed locks thoroughly.

If you did not fully follow it, I suggest reading it a few more times, constructing the various hypothetical scenarios in your mind, and thinking them through.

While writing this article, I re-read the two masters' articles on the Redlock dispute. It was very rewarding, and I would like to share a few takeaways with you.

1. In a distributed environment, a seemingly perfect design may not be as airtight as it looks; examine it carefully and you will find all kinds of problems. So when reasoning about distributed systems, be cautious, and then cautious again.

2. In the Redlock debate, we should not dwell too much on who is right and who is wrong, but rather learn from the masters' way of thinking and their rigorous spirit in scrutinizing a problem.

Finally, end with Martin's thoughts written after arguing about Redlock:

" Predecessors have created many great results for us: standing on the shoulders of giants, we can build better software. In any case, by arguing and checking whether they can withstand scrutiny by others, this is Part of the learning process. But the goal should be to acquire knowledge, not to persuade others to believe that you are right. Sometimes it just means stopping and thinking about it. "

Let us encourage each other.


Want to see more hardcore technical articles? Welcome to follow my official account, "Water Drop and Silver Bullet".

I'm Kaito, a senior back-end programmer who thinks deeply about technology. In my articles, I not only tell you what a technical point is, but also why it is done that way. I also try to distill these thought processes into a general methodology that you can apply to other fields by analogy.

