Data races are a common problem in multithreaded programming. Data races occur when multiple tasks or threads access a shared resource without sufficient protections, leading to undefined or unpredictable behavior.
When you author software to simultaneously handle multiple tasks, you may use multithreaded programming, that is, programs with constructs such as multiple entry points, interleaving of threads, and asynchronous interrupts. However, multithreaded programming can be highly complex and introduce subtle defects such as data races and deadlocks. When such a defect occurs, it can take a long time to reproduce the issue and even longer to identify the root cause and fix the defect.
Example of a Data Race
Let us start with the simplest example of a data race. In the following diagram, Task1 and Task2 write values to the shared resources, sharedVar1 and sharedVar2. The tasks later read the values of the shared resources through the functions do_sth_with_shared_resources1() and do_sth_with_shared_resources2(). Let us begin with a simple situation in which no protection mechanisms are applied to these operations.
You may ask: what value of sharedVar1 does the function do_sth_with_shared_resources1() read? You may expect the value to be 11, since this value was written in Task1 immediately before the function call. However, without any protection mechanisms, the value read may be 21 or, in some situations, even a corrupt random value. Because Task2 executes concurrently, the shared resource sharedVar1 may be rewritten in Task2 before being read again in Task1.
In other words, both sequences can happen:
- Sequence 1:
Task1: sharedVar1 = 11;
- Sequence 2:
Task1: sharedVar1 = 11;
Task2: sharedVar1 = 21;
Without imposing protection mechanisms, any code you write in do_sth_with_shared_resources1() cannot rely on a particular sequence occurring and, therefore, on a particular value of sharedVar1. If your code relies on a particular value of sharedVar1, then the data race becomes a bug.
Data races occur when a shared resource is unpredictably accessed by multiple tasks. Data races may not be easy to understand because the execution of instructions does not follow the sequence in which the instructions are written. Also, the result can change in each test run, making a data race difficult to reproduce and fix.
How to Prevent Data Races with Mutual Exclusion Locks (Mutexes)
A common mechanism to avoid data races is to force a mutual exclusion. In the previous example, you can enforce sequence 1 by:
- Locking a mutex before Task1: sharedVar1 = 11;
- Unlocking the mutex after Task1: do_sth_with_shared_resources1();
Other tasks, such as Task2, have to wait for the mutex to be unlocked before accessing sharedVar1. However, the placement of mutex locks and unlocks is not as simple as it sounds. Here is a C code example that implements the tasks shown in Figure 1 with the POSIX-based pthread_ family of functions. The example attempts to protect against data races by using functions such as pthread_mutex_lock and pthread_mutex_unlock to lock and unlock a mutex.
You can see this full code example to review the details.
The code starts two threads, each with its own temporary variable tmp. The temporary variable reads the value of a shared resource (sharedVar2) immediately after the resource is written. The write and subsequent read operations are protected using mutexes. As a result, the values of the temporary variable and the shared resource are expected to be the same. If the values do not agree, the threads print a message such as thread:1, sharedVar2 = 22 and tmp = 12 differ.
You can look at the code to review the details, or run the above code in a real environment for the following results.
You can see that the message for unintended values, thread:1, sharedVar2 = 22 and tmp = 12 differ, appears several times. Despite the placement of mutexes, the data race continues to occur.
Debugging such a data race in a real application can take several hours because of the non-deterministic nature of the issue. As you can see in Figure 2, the message for unintended values appears only sporadically. Also, once reproduced, the issue can be difficult to fix. It is not sufficient to simply use mutexes: their placement in the code is also critical.
How to Detect and Fix Data Races
A static analysis tool that automatically detects data races and suggests possible fixes can save a lot of debugging effort.
To understand why the data race continues to occur in the above example despite the use of mutexes, we used the data race checkers of a static analysis tool, Polyspace Bug Finder™. This tool can detect the data race that we saw earlier through the program output.
In Figure 4, you can see the program control flow that leads to each operation. The circles marked with ‘t’ show the beginning of two different tasks, one of which is task_main::thread2(). The subsequent circles show how the control flow goes through functions such as thread1_main and thread2_main, and eventually to the write operations. A shield icon on the write operation in the second task indicates that some protection mechanisms are used on this operation. The absence of a similar icon in the first task confirms the earlier suggestion that write operations on sharedVar2 are not protected in this task.
From this suggestion, you can check the function thread1_main and see that the mutex in this function is unlocked prematurely, before all shared resources are accessed. You can move the unlock so that it occurs after sharedVar2 is accessed, and fix the data race.
In the example from the section above, you can spot the data race during a visual inspection, but in real applications of hundreds of files and thousands of lines of code, data races can be difficult to detect because:
- Problems occur sporadically and can be hard to reproduce
- Results can differ for each run. Even setting a breakpoint with a debugger can influence the result.
- Incorrect placement of mutexes may not fix the root cause or may introduce other problems such as deadlocks or double locks
It is important to run a static analysis tool at a regular cadence to identify data races as soon as possible. A static analysis tool creates an abstraction of the concurrency model used in your program, and it can easily detect whether the established protections are sufficient to prevent data races.
Polyspace Bug Finder offers several features to identify concurrency issues such as data races and deadlocks, along with features that ease their review, such as the above textual and graphical representation of conflicting operations. These features help you identify the root cause of a data race more easily.
In the next post, we’ll look at another common concurrency issue known as deadlock.
Written by Yoo Yong-chul and Anirban Gangopadhyay.
Yoo Yong-chul works as an application engineer at MathWorks Korea and is responsible for code verification products.
Anirban Gangopadhyay works as a documentation writer at MathWorks US. He oversees technical documentation of Polyspace® products.
Original post: Naver blog post
MathWorks Korea 2021.4.17