Chrome C++ Lock and ConditionVariable
First, a warning: do you really need Locking or CondVars?
Are you sure you need to use explicit locking and condition variables? In Chrome code, message passing is far more common (via TaskRunner and PostTask) and low-level primitives like locks and condition variables should be used only when necessary.
Some additional motivation:
- Condition variables are nearly impossible to implement correctly on Windows XP or earlier. Our implementation is correct, but very slow. Whenever you use a CV you are disproportionately harming our performance on Windows.
- A lot of times people just want to wait on a boolean. In such cases, if message passing cannot work, please use WaitableEvent instead (a minimal sketch follows this list).
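For example, a hedged sketch of waiting on a boolean with base::WaitableEvent; the setup scenario and the DoSetup()/UseSetupResults() functions are invented for illustration, and the constructor arguments reflect one common form of the base API:

#include "base/synchronization/waitable_event.h"

// One thread waits for another to finish some setup work.
base::WaitableEvent setup_done(
    base::WaitableEvent::ResetPolicy::MANUAL,
    base::WaitableEvent::InitialState::NOT_SIGNALED);

// Thread A:
//   DoSetup();            // hypothetical
//   setup_done.Signal();  // publish "setup finished"

// Thread B:
//   setup_done.Wait();    // blocks until Signal() has been called
//   UseSetupResults();    // hypothetical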
But for the times when you do need to use locks and condition variables, this document will explain the best practices and pitfalls in using them.
(Much of the below was originally written by Mike Burrows.) (TODO: Figure out how to get anchor links to work)
Using Lock and ConditionVariable
Terminology and basics
The Lock class implements a mutual exclusion lock, or mutex for short. A mutex is used to permit only one thread at a time to have exclusive access to some resource, which is typically some variable or data structure. Mutexes are so common that many words have been coined to describe their operation.
Each Lock mu has two basic operations: mu.Acquire() and mu.Release(). Conceptually, it has just a single bit of abstract state: the boolean mu.held. When mu is created, mu.held is false and mu is said to be free or unlocked. mu.Acquire() blocks the caller until some moment when mu is free, and then atomically changes mu.held from false to true; mu is then said to be held or locked by the caller. mu.Release() sets mu.held to false once more.
Calling mu.Acquire() is often referred to as locking or acquiring mu, while calling mu.Release() is referred to as unlocking or releasing mu. An action performed by a thread while holding mu is said to be performed under mu. Data structures manipulated under mu, and their invariants, are said to be protected by mu.
Clients of Lock must obey these rules:
- Each time a thread acquires mu it must later release mu.
- A thread may not attempt to release mu unless it holds mu.
Because mu.Acquire() acts atomically to change the state of mu.held, we are guaranteed that, if these rules are followed, only one thread may hold mu at any given time.
These rules are best followed by bracketing regions of code with matching calls to mu.Acquire() and mu.Release() in the same procedure. Such sections of code are called critical regions or critical sections, or occasionally monitors after Hoare monitors, from which mutexes were derived. (A Hoare monitor is an abstraction that automatically acquires a lock on entry and releases it on exit.) In Chrome C++ code, many use the idiom AutoLock l(mu), which acquires mu and releases it when l goes out of scope. (Less commonly, AutoUnlock l(mu) can also be used.)
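For example, a minimal sketch of the AutoLock idiom; the class and field names are invented for illustration:

class Counter {
 public:
  void Increment() {
    AutoLock l(mu_);  // acquires mu_ here
    count_++;         // count_ is accessed under mu_
  }                   // l goes out of scope: mu_ is released, even on early return

 private:
  Lock mu_;        // protects count_
  int count_ = 0;
};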
Mutexes perform two tasks in concurrent programming. Their primary role is to
protect the invariants of variables and data structures manipulated by multiple
threads (these invariants are often called monitor invariants, again recalling
the work of Hoare). The programmer is required to establish the invariant before
releasing the mutex; he can then assume the invariant holds whenever he acquires
the mutex, even if his updates temporarily invalidate the invariant during the
critical section. One cannot guarantee the invariant is true in a thread that
does not hold the mutex, because the mutex holder may be changing the monitored
state at that moment. For example, suppose Lock mu protects the invariant that a + b == 0. This code is then legal:
mu.Acquire();
assert(a + b == 0);  // invariant assumed to hold
a++;                 // invariant temporarily invalidated
b--;                 // invariant restored before mu is released
mu.Release();
while this code is erroneous:
mu.Acquire();
assert(a + b == 0);  // invariant assumed to hold
a++;                 // invariant invalidated
mu.Release();        // BUG: mu released while invariant invalid
mu.Acquire();
b--;                 // attempt to restore the invariant, but the damage is already done
mu.Release();
The following does not invalidate the invariant, but incorrectly assumes it is true when the lock is not held:
mu.Acquire();
assert(a + b == 0);  // correct: invariant assumed to hold
mu.Release();
assert(a + b == 0);  // BUG: can't assume invariant without lock
The invariant holds only when evaluated on state observed in a single critical section:
mu.Acquire();
assert(a + b == 0);     // correct: invariant assumed to hold
temp = a;               // takes a temporary copy of "a"
mu.Release();
mu.Acquire();
assert(temp + b == 0);  // BUG: can't assume invariant on state
                        // from two distinct critical sections
mu.Release();
A less obvious role of mutexes is to ensure a sequentially-consistent view of such data structures on machines with memory systems that are not sequentially consistent. Mutexes also prevent compiler reordering which could otherwise cause race conditions.
Other properties of Lock
The call mu.Try() either returns false, or acquires mu and returns true. It does not block. If mu is free, it is unlikely to return false.
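A hedged sketch of one way Try() might be used, to skip optional work rather than block; the counter is invented for illustration:

if (mu.Try()) {
  // mu is now held; it is safe to touch state protected by mu.
  optional_stats_counter++;
  mu.Release();
} else {
  // mu was busy; skip this optional update rather than block the caller.
}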
The call mu.AssertAcquired() aborts the process in debug mode if mu is not held by the calling thread.
Lock is able to synchronize its own deletion. For example, if an object *o contains a Lock mu and a correctly-maintained reference count c, this code is safe:
o->mu.Acquire();
bool del = (--(o->c) == 0);
o->mu.Release();
if (del) { delete o; }
Not all lock implementations have this property, so it should not be taken for granted when porting code. (To provide this property, we guarantee that mu.Release() does not touch mu after the atomic moment at which mu becomes free.)
Lock is not re-entrant (also known as not recursive). See below.
Condition variables
Condition variables are a means for blocking a thread until some condition has
been satisfied. Viewed in isolation, a condition variable allows threads to
block and to be woken by other threads. However, condition variables are
designed to be used in a specific way; a condition variable interacts with a
mutex to make it easy to wait for an arbitrary condition on state protected by
the mutex. Chrome's C++ condition variables have type ConditionVariable.
Suppose that a thread is to wait for some boolean expression cond_expr to become true, where the state associated with cond_expr is protected by Lock mu. The programmer would write:
// Waiter
mu.Acquire();
while (!cond_expr) {
  cv.Wait();  // mu was passed to cv's constructor
}
// cond_expr now holds
...
mu.Release();

The Wait() call atomically unlocks mu (which the thread must hold) and blocks on the condition variable cv. When another thread signals the condition variable, the thread reacquires mu and goes around the mandatory while-loop to recheck cond_expr.
Another thread that makes cond_expr true might execute:
// Waker
mu.Acquire();
Make_cond_expr_True();  // cond_expr now holds
cv.Signal();
mu.Release();
The call to Signal() wakes at least one of the threads waiting on cv. Many threads may be blocked on a condition variable at any given time; if it makes sense to wake more than one such thread, Broadcast() can be used. (However, this may lead to contention and poor performance if all waiting threads use the same lock; a possibly better approach to getting a lot of threads out of Wait() is to have each thread, upon exiting Wait(), call Signal() to free up another waiting thread.)
A single condition variable can be used by threads waiting for different conditions. However, in this case, Broadcast() must be used when any of the conditions becomes true, because the ConditionVariable implementation cannot otherwise guarantee to wake the correct thread. It can be more efficient to use one condition variable for each different condition; any number of condition variables can be used with a single mutex.
Both Signal() and Broadcast() are efficient if there are no threads to wake. (TODO: verify this) Clients should call Signal() or Broadcast() inside the critical section that makes the condition true.
The call TimedWait() allows a thread to wait until either a condition is true, or until some time has elapsed. Like Wait(), TimedWait() always reacquires the mutex before returning. Example use:
static const int64 kMSToWait = 1000;                     // we'll wait at most 1000ms
TimeDelta left_to_wait = base::Milliseconds(kMSToWait);  // ms to wait at any given time
Time deadline = Time::Now() + left_to_wait;
mu.Acquire();
while (!cond_expr && left_to_wait.InMilliseconds() > 0) {
  cv.TimedWait(left_to_wait);
  left_to_wait = deadline - Time::Now();
}
if (cond_expr) {
  // cond_expr true
} else {
  // cond_expr false; 1000ms timeout expired
}
mu.Release();
Recommended usage
Variables accessed by more than one thread and written by at least one thread should be protected consistently by a Lock.
An exception to this rule is that fields may be initialized in constructors without a mutex, since no other thread should be able to reference the memory at that time.
Recommended commenting convention
There are two main dangers when using threads and mutexes: deadlocks and data races. These can be avoided using some simple rules and adding small comments to variable and procedure declarations. You are strongly advised to use such comments; they may seem tedious to write, but they help tremendously in avoiding errors. The particular commenting conventions shown below are derived from the work of Nelson and Manasse.
Critical sections should almost always start and end in the same routine.
That is, if a routine acquires a lock, it should release it before it
returns, and it should release no locks that it does not itself acquire.
This is normally achieved by writing Acquire() and Release() calls in pairs, or by using AutoLock l(mu), which automatically releases mu when l goes out of scope.
Every shared variable/field should have a comment indicating which mutex protects it:
int accesses_;    // count of accesses (guarded by mu_)
// or a comment explaining why no mutex is needed:
int table_size_;  // no. of elements in table (readonly after init)
Every mutex should have a comment indicating which variables and also any non-obvious invariants it protects:
// protects accesses_, list_, count_
// invariant: count_ == number of elements in linked-list list_
Lock mu_;
Think of the matching comments on variables and mutexes as analogous to matching types on procedure arguments and parameters; the redundancy can be very helpful to later maintainers of the code.
Whenever a thread can hold two mutexes concurrently, either one (or both) of the mutexes should be commented with acquired before or acquired after to indicate which mutex must be acquired first:

Lock mu0_;  // protects foo_ (acquired before mu1_)
If the mutex acquisition order is not consistent, deadlock may result.
Each routine should have a comment indicating which mutexes must and must not be held on entry. This allows implementors to edit routines without examining their call sites, and allows clients to use routines without reading their bodies.
Never hold locks when invoking a callback, as the callback may call into the module once more, leading to deadlock. (Violations of this rule should be extremely rare and conspicuously commented in the module's interface.) Comments should indicate what threads can or cannot be used for callbacks:
// Call cb.Run() in "ms" milliseconds.
// cb.Run() is called in a private thread; it will not be called
// from the thread calling RunAfter(), even if ms==0.
void RunAfter(Closure cb, int ms);
In rare cases, it may be useful for a routine to acquire a lock and return without releasing it, or to release a lock (perhaps temporarily, using ConditionVariable::Wait) that it did not acquire. Such routines may surprise clients, and should be commented clearly in interfaces. Note that a routine that acquires a lock and returns without releasing it is practically a locking primitive and should be commented as such.
Every condition variable should have a comment indicating when it is signalled:
ConditionVariable non_empty_; // signalled when the queue becomes non-empty
In some cases, exclusive access to data is ensured by referencing it only from one thread. (See the section on message passing.) The thread can be thought of as playing the part of a mutex; you should name the thread and use the name in comments as if it were a lock.
int queue_length_;  // length of receive queue (under net_thread)
...
// Process one packet from the queue.
// L >= net_thread
void ProcessPacket() { ... }
In very rare cases, a variable may be protected by more than one mutex. This means that the variable may be read while holding any of the mutexes, but may be written only when all the mutexes are held. You should document it clearly in the comments.
int bytes_left_; // bytes remaining in queue (under mu_ and net_thread)
If these conventions are followed it is straightforward to tell what locks are held at any point in a routine by reading the routine and its comments. By reading the comments on shared variables and mutexes, it is then possible to tell that all variable accesses are correctly protected by a mutex, and that mutexes are acquired in the correct order.
Without such comments, working with mutexes is significantly harder. We recommend their use.
Performance hints
Critical sections should not be too long
Normally, you should hold mutexes for short periods (nanoseconds to microseconds) at a time, and the mutexes should be free most of the time, often approaching 99%. To achieve this, it's best to avoid doing slow operations such as I/O inside critical sections---assuming it is not the purpose of the mutex to serialize the I/O, of course.
Another, more complex technique is to make the locking more fine-grained by employing more locks, each protecting a subset of the data.
Other transformations may help, such as breaking a critical section in two, or arranging to perform long-running operations on state that is local to a thread.
Lock acquisitions are cheap but not free
A lock acquisition is generally more expensive than a cached variable access, but less expensive than a cache miss. If a mutex is acquired and released too often (say, more than a hundred thousand times per second) the overhead of these operations themselves may start to be significant in CPU profiles.
Frequent acquisition can be avoided by combining critical sections, or by delaying operations on shared state by buffering them in memory local to a single thread. For example, tcmalloc uses a per-thread cache of memory to avoid locking overhead on every allocation.
Pitfalls
Deadlocks
A deadlock (sometimes called a deadly embrace) occurs when an activity attempts to acquire a limited resource that has been exhausted and cannot be replenished unless that activity makes progress.
When considering deadlocks involving only mutexes, each activity is typically represented by a thread, and the resources are mutexes that are exhausted when held, and replenished when released. However, other possibilities exist: In a message passing environment, an activity may be represented by a message, and a resource may be represented by a bounded-length message queue or a bounded-size thread pool. When both mutexes and message passing are used, deadlocks may involve combinations of these activities and resources.
The simplest mutex-related deadlock is the self-deadlock:
mu.Acquire();
mu.Acquire();  // BUG: deadlock: thread already holds mu

Deadlocks involving two resources, such as a mutex and a bounded-sized thread pool, are easily generated too, but deadlocks involving three or more resources are less common. A two-mutex deadlock results when thread T0 attempts to acquire M1 while holding M0 at the same time that thread T1 attempts to acquire M0 while holding M1; each thread will wait indefinitely for the other.
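A minimal sketch of that two-mutex deadlock; the thread bodies are invented for illustration:

Lock m0, m1;

void ThreadT0() {
  m0.Acquire();
  m1.Acquire();  // blocks forever if T1 already holds m1 and is waiting for m0
  // ...
  m1.Release();
  m0.Release();
}

void ThreadT1() {
  m1.Acquire();
  m0.Acquire();  // blocks forever if T0 already holds m0 and is waiting for m1
  // ...
  m0.Release();
  m1.Release();
}

Acquiring the two locks in the same order in both threads (and commenting that order on the declarations) removes the cycle and hence the deadlock.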
Debugging deadlocks
Fortunately, deadlocks are among the easiest bugs to debug and avoid. Debugging is typically easy because the address space stops exactly where the bug occurs. A stack trace of the threads is usually all that is required to see what the threads are waiting for and what resources they hold. (Deadlocks involving messages in queues can be harder to spot.)
Avoiding deadlocks
Deadlocks can be avoided by disallowing cycles in the resources' exhaust-before graph; this can be done by imposing a partial order on the graph. If an activity can exhaust resource R0 and then attempt to use a resource R1 that may be exhausted, then we say that R0 precedes R1 (and R1 follows R0) in the exhaust-before graph. To guarantee no deadlocks, it is sufficient to guarantee that if R0 precedes R1, then R1 never precedes R0. That is, for all pairs of resources R0 and R1, either R0 and R1 are unordered (neither precedes the other), or their ordering is unambiguous (one precedes the other, but not vice versa).
Considering only mutexes, we can avoid deadlocks by ensuring that the acquired-before graph is a partial order and is therefore free of cycles. In practice, we simply pick an order for any two mutexes that can be held simultaneously by the same thread, and comment the code with this choice.
As described above, if a thread does mu0.Acquire(); mu1.Acquire(); then we should comment the declarations of mu0 and mu1 with either acquired before or acquired after (or both). Because we wish our code to be modular, our comments should also indicate what locks a caller must or must not hold on entry to a routine. Combined, these comments allow the programmer to know whether he is about to violate the locking order by acquiring a mutex or calling a routine. Experience shows that if this convention is followed, deadlocks are usually both rare and easy to correct.
A particularly important rule of thumb for deadlock avoidance is never hold a lock while invoking a callback. More generally, try to avoid holding locks for long periods, and across calls to other levels of abstraction that you do not fully understand. (That is, you might hold a lock across an access to a hash table, but you should not hold a lock across a call to a complex subsystem.)
Races
Races occur in three main ways:
A shared variable is accessed without being protected consistently by a mutex. The reasons for the problems are discussed at length below. This error can be avoided with the conventions already described; simply ensure that each shared variable is accessed only when its protecting mutex is known to be held. Such races can be detected automatically by ThreadSanitizer as described below.
A critical section modifies protected state but does not preserve the monitor invariant. Such bugs are rare if invariants are commented correctly.
A critical section reads protected state, which is then encoded in a temporary variable or the program counter. The lock is then released and later reacquired, and the state from the previous critical section is used as though it were still valid:
string name_; // guarded by mu_
size_t NameLen() {
AutoLock l(this->mu_);
return this->name_.size();
}
// Requires 0 <= i < this->name_.size()
int CharFromName(size_t i) {
AutoLock l(this->mu_);
return this->name_[i];
}
...
size_t len = this->NameLen();
int x = 0;
if (len > 0) {
// BUG: temporary len encodes protected state from a previous
// critical section that is used inside another.
// The length of name_ may have changed since len was assigned.
x = this->CharFromName(len - 1);
}
...
This is the most insidious form of race, and the best known way to avoid them is vigilance. Fortunately, they are quite rare. There are algorithms that can detect such races using data flow analysis, but as yet none has been applied to C++.
Debugging
Lock and ConditionVariable have various features to aid debugging.
Assertions
mu.AssertAcquired(): aborts in debug mode if mu is not held by the calling thread.
Race detection
Race detection requires an external tool. One such tool is ThreadSanitizer, which is a dynamic race detector based on the Valgrind binary translation framework. See this page for more details on how to use it with Chrome.
Examples
The following show simple implementations of reader-writer locks and producer-consumer queues using condition variables.
Reader-writer lock
The example below could be improved in various ways, at the cost of clarity. In particular, it allows readers to starve writers.
class RWLock {
 public:
  RWLock() : lockers_(0), lock_free_(&mu_) {}
  ~RWLock() {}

  // Acquire a write lock.
  void WriteAcquire() {
    this->mu_.Acquire();
    while (this->lockers_ != 0) {
      this->lock_free_.Wait();
    }
    this->lockers_ = -1;
    this->mu_.Release();
  }

  // Acquire a read lock.
  void ReadAcquire() {
    this->mu_.Acquire();
    while (this->lockers_ < 0) {
      this->lock_free_.Wait();
    }
    this->lockers_++;
    this->mu_.Release();
  }

  // Release the lock (either mode).
  void Release() {
    this->mu_.Acquire();
    this->lockers_ = (this->lockers_ == -1 ? 0 : this->lockers_ - 1);
    if (this->lockers_ == 0) {
      // The lock became free; wake all waiters.
      this->lock_free_.Broadcast();
    }
    this->mu_.Release();
  }

 private:
  Lock mu_;                      // protects lockers_
  int lockers_;                  // 0 => free, -1 => writer, +ve => reader count
  ConditionVariable lock_free_;  // signalled when lockers_ becomes 0
};
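Usage might look like the following; the shared table and the reader/writer routines are invented for illustration:

RWLock table_lock;                  // protects table_
std::map<std::string, int> table_;

int Lookup(const std::string& key) {
  table_lock.ReadAcquire();         // many readers may hold the lock at once
  auto it = table_.find(key);
  int value = (it == table_.end()) ? 0 : it->second;
  table_lock.Release();
  return value;
}

void Update(const std::string& key, int value) {
  table_lock.WriteAcquire();        // a writer gets exclusive access
  table_[key] = value;
  table_lock.Release();
}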
Producer-consumer queue
// A bounded, blocking FIFO producer-consumer queue.
class PCQ {
 public:
  PCQ(int n) : max_count_(n), non_full_(&mu_), non_empty_(&mu_) {}
  ~PCQ() { CHECK_EQ(this->queue_.size(), 0); }  // error if queue is not empty

  // Waits until the queue is not full, then adds value to its end.
  void Add(void* value) {
    this->mu_.Acquire();
    while (this->queue_.size() == this->max_count_) {
      this->non_full_.Wait();
    }
    // No need to Broadcast: only one remover can consume this item.
    // Could use:
    //   if (this->queue_.size() == 0) { this->non_empty_.Broadcast(); }
    this->non_empty_.Signal();
    this->queue_.push_back(value);
    this->mu_.Release();
  }

  // Waits until this queue is non-empty, then removes and returns the first value.
  void* Remove() {
    this->mu_.Acquire();
    while (this->queue_.size() == 0) {
      this->non_empty_.Wait();
    }
    // No need to Broadcast: only one adder can consume this gap.
    // Could use:
    //   if (this->queue_.size() == this->max_count_) { this->non_full_.Broadcast(); }
    this->non_full_.Signal();
    void* value = this->queue_.front();
    this->queue_.pop_front();
    this->mu_.Release();
    return value;
  }

 private:
  Lock mu_;                      // protects queue_
                                 // invariant: 0 <= queue_.size() <= max_count_
  std::deque<void*> queue_;      // list of queued items
  const int max_count_;          // max number of items in queue_ (readonly after init)
  ConditionVariable non_full_;   // signalled when queue becomes non-full
  ConditionVariable non_empty_;  // signalled when queue becomes non-empty
};
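A sketch of how producer and consumer threads might use the queue; the Work type and the MakeWork()/ProcessWork() helpers are invented for illustration:

PCQ queue(16);  // at most 16 outstanding work items

void ProducerThread() {
  for (;;) {
    Work* work = MakeWork();  // hypothetical
    queue.Add(work);          // blocks while the queue is full
  }
}

void ConsumerThread() {
  for (;;) {
    Work* work = static_cast<Work*>(queue.Remove());  // blocks while the queue is empty
    ProcessWork(work);        // hypothetical
    delete work;
  }
}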
Why are mutexes the way they are?
Why use a mutex when accessing shared data?
It is perilous to access data that another thread may be modifying concurrently.
Consider accesses to a C++ string. A well-formed string may have invariants, such as its length field indicating the true length of the string it represents. Such invariants may be temporarily broken by the string implementation when an update occurs. Clearly, one thread may not be allowed to read a string that is being written by another, as it may not observe a length consistent with the stored bytes. If the string is accessed only under a mutex mu, the string's invariants become mu's monitor invariants, and each thread will see a well-formed string.
It is tempting to assume that mutexes are unnecessary when there is no obvious monitor invariant to protect. Consider a variable or field with type double. One might expect to be able to read and write this variable from multiple threads without the protection of a mutex, but this is not safe:
- Many machines, including the x86, do not guarantee to access a double atomically. (A stack-allocated double need not be naturally aligned by the compiler, potentially leading to two memory operations for a single access.) Thus, there is an invariant we need to protect: that the double is well-formed.
- On some machines, seemingly obvious data-dependency properties do not hold without the cross-thread synchronization provided by a mutex; a thread might read a well-formed double but get a value from an apparently earlier time. This comment applies to all types, including integers, pointers, and even "atomic" types provided by the language or runtime.
- A variable's concrete type may change as a program is maintained, and this change may be hidden by a typedef.
In short, data accessed by multiple threads should be protected with a mutex. To do otherwise is to court disaster.
Despite this advice, some programmers enjoy the intellectual challenge of using lower-level primitives ("atomic" types, compare-and-swap, memory barrier) instead of mutexes. There are many problems with such code:
- it usually does not work.
- it is often no faster overall than using mutexes.
- it may rely on assumptions about the hardware or compiler, or both.
However, the most important reason not to use such code is that it is complicated. Even if the author understands it, the next maintainer of the code may not. Worse, he may think he understands it.
The best way to avoid locking is to avoid shared mutable state. When shared mutable state is needed, use a mutex. If you experience lock contention, consider using more mutexes, each protecting less data (that is, make the locking finer-grained). If you feel you must access a shared mutable variable without a mutex, and have data that shows it is worth the maintenance expense of doing so, ask an expert how to do it.
Why can the holder of a Lock not reacquire it?
Some mutex implementations, notably Java's and the Windows CRITICAL_SECTION, are called reentrant or recursive. They allow the holder of a mutex to reacquire the mutex without deadlocking by maintaining an internal acquisition count and holder identity instead of the held boolean. The mutex is free when the count is zero. The acquisition count keeps track of the number of acquisitions performed by the holder; the holder is required to release the mutex the same number of times it acquired it to make the mutex free. This bookkeeping allows a method of an object to call other methods of the same object while holding the lock, even if those other methods acquire the lock. We do not allow this apparently convenient usage in Lock, not because of the small additional cost of maintaining the counter, but because of two problems.
Recall that a mutex's primary purpose is to allow the programmer to maintain monitor invariants, and that the programmer may assume a monitor invariant just after acquiring the appropriate mutex. Consider a Java method M0 that acquires a mutex mu protecting invariant Inv. The author of M0 is entitled to assume Inv at the moment he acquires mu. Now consider another method M1 that acquires mu, invalidates Inv during an update, calls M0, restores Inv, and releases mu. M0 assumes Inv, but Inv is untrue when M0 is called, so M0 fails. Remember that both M0 and M1 may have multiple implementations written by multiple authors over years, and perhaps multiple implementations in the same program, due to inheritance. The source code of M0 may not be available to the author of M1, or vice versa. Without remarkably good discipline and documentation standards, the programmers may not understand why M0 is not functioning correctly.
If a programmer attempted to do the same thing with a non-reentrant mutex such as Lock, his code would instantly deadlock and a stack trace would show that a thread is attempting to reacquire a lock it already holds. Not only is the error discovered immediately, but the fix is usually trivial: write a small method M0 that acquires mu and calls a private method M0', which does the real work, assuming Inv but without acquiring mu. The specifications of M0 and M0' differ only in their locking behaviour, so the programmer almost always documents this difference, often in the names of M0 and M0'. The presence of the additional method and the corresponding name or comment provides an additional reminder to the author of M1. He realizes that by calling M0' rather than M0, he has the burden of establishing Inv before the call; it is not guaranteed automatically by the monitor invariant. This solution is not a panacea, but disallowing reentrancy at least makes the error apparent, rather than hiding it.
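A minimal sketch of the M0/M0' split in Chrome-style C++; the class, the method names, and the "Locked" suffix convention are invented for illustration:

class Table {
 public:
  // M0: public entry point; acquires mu_ and delegates to the helper.
  void Insert(const std::string& key, int value) {
    AutoLock l(mu_);
    InsertLocked(key, value);
  }

 private:
  // M0': requires mu_ to be held by the caller; assumes the monitor
  // invariant on entry and re-establishes it before returning.
  void InsertLocked(const std::string& key, int value) {
    map_[key] = value;
  }

  Lock mu_;                         // protects map_
  std::map<std::string, int> map_;
};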
The second problem with reentrancy is associated with condition variables. In the example above, imagine that M0 waits on a condition variable and thus effectively contains more than one critical section. Normally M0 will work, but if called from M1 with mu held, it is unclear what happens, and neither outcome is satisfactory. If the Wait() primitive decrements mu's acquisition count by one, mu does not become free, the condition never becomes true, and the thread deadlocks. If instead the acquisition count is set to zero by Wait(), mu becomes free during a critical section initiated by M1. This is likely to cause M1 to malfunction silently. In the non-reentrant case, M0 must be split into M0 and M0' as before. Since M0' waits on a condition variable, it now has an interesting specification: it temporarily releases a mutex that it did not acquire. This is unusual, and usually very dangerous, so one might expect a comment to that effect. This comment then tells the author of M1 that he must be especially careful if he calls M0'.
Why not use monitored modules? (or automatically locked objects, locking pointers, lock-free data structures, ...)
It seems attractive to automate the acquisition and release of mutexes by declaring somehow that a mutex will be acquired on entry to a module and released on exit, as in a Hoare monitor. This can be used for trivial cases, but even quite common examples require more complex locking.
Consider a table *t mapping strings to integer counts. The table might have methods insert(), lookup(), and remove(). If the table provides its own synchronization, perhaps inserted automatically in each method by some mechanism, we eliminate data races within the table itself, but this does not help the client. Consider this code, which increments the count for "foo" in the table *t:
int *v = t->lookup("foo");  // safe because *t is a monitor
if (v != 0) {
  (*v)++;                   // BUG: data race: unlocked increment
} else {
  t->insert("foo", 1);      // safe because *t is a monitor
}
If the client does not use his own mutex, counts may be missed. If he does, the synchronization inside *t is redundant. Thus, monitored modules are rarely helpful.
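The usual fix is for the client to hold his own mutex across the whole read-modify-write sequence, which also makes the table's internal locking redundant; a sketch, assuming the same table interface as above (the Table type name is invented):

Lock mu;  // protects *t and the counts stored in it

void IncrementFoo(Table *t) {
  AutoLock l(mu);
  int *v = t->lookup("foo");
  if (v != 0) {
    (*v)++;               // safe: the whole read-modify-write is under mu
  } else {
    t->insert("foo", 1);
  }
}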
The implementors of SGI STL made the same observation.
A further problem is that the designers of automatic locking mechanisms often desire to reduce the amount of typing needed to implement a monitor, rather than to improve the readability and maintainability of the code. All too often, these two desires are in conflict; some code is more readable if one can tell when a lock is released.
Alternatives to mutexes
There are a number of ways to handle concurrency, and ways to avoid it altogether. Of the various possible models, only two permit high-performance implementations that can use multiple CPUs and sharing of resources, and still allow large programs to be built from smaller components while maintaining abstraction boundaries. These models are "threads + mutexes + condition-variables", and "threads + message-passing". These two can be used together, and often are.
Message passing
One can associate data with threads, so that each thread owns some variables
and data structures; a variable is accessed only by its owning thread. Other
threads wishing to access the data then communicate with the owning thread via
message passing, such as Chrome's TaskRunner
.
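For example, a hedged sketch of this style using Chrome's task runners; the owning sequence, the bound function, and the counter are invented for illustration:

// All access to |counter| happens on the owning sequence, so no Lock is needed.
void IncrementCounter(int* counter) {
  (*counter)++;
}

void RequestIncrement(scoped_refptr<base::SequencedTaskRunner> owner_task_runner,
                      int* counter) {
  // Instead of locking, hand the work to the sequence that owns the data.
  owner_task_runner->PostTask(FROM_HERE,
                              base::BindOnce(&IncrementCounter, counter));
}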
This style is a dual of the mutex-based style (see Lauer and Needham's oft-cited paper on the subject): A message-send corresponds to acquiring a mutex; running in a critical section corresponds to executing code within the owning thread; and receiving a reply corresponds to releasing the mutex. Thus, the most obvious difference between the approaches is that in message-passing all the code that manipulates a particular data item is brought together into one thread, while with mutexes the data accesses can be interleaved with other code.
Message passing and mutexes can be intermixed; often one is preferred either
because the author is comfortable with the style, or because one leads to a
clearer module than the other. The message-passing model tends to work well when
there is a natural resource that already serializes accesses (such as an I/O
device), a linear state machine best expressed as a single routine, or when
critical sections are long. Mutexes work well when critical sections are short
and may be invoked in many places, or when reader-writer locks can be used
effectively. In Chrome code, message passing is much more common (via TaskRunner and PostTask) and low-level primitives like locks and condition variables are used only when necessary.
Both models allow high-throughput implementations, and both can suffer from both races and deadlocks. Deadlocks can often be eliminated in the message-passing model by using unbounded queues and/or threadpools.
Atomic types and atomic operations
Many runtimes and languages (including C++11) provide atomic operations, such as compare-and-swap, and "atomic" types that can be read and written atomically. Atomic operations and types are much harder to use than one might first think, and they should not be used in normal code. Unfortunately, programmers are attracted to them for various reasons:
Programmers enjoy the intellectual puzzle of using these operations. This leads to clever, but ill-advised, and often broken code.
Algorithms involving atomic operations are extremely subtle. For example, a general-purpose, efficient, lock-free, singly-linked list algorithm took significant research to discover, and requires care to implement. Almost all programmers make mistakes when they attempt direct use of atomic operations. Even when they don't make mistakes, the resulting code is hard for others to maintain. Both CPUs and compilers can rearrange reads and writes in ways that lead to subtle race conditions. The simple-sounding pattern of double-checked locking is actually extremely subtle and is usually implemented incorrectly.
Programmers assume that locking is expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a lock is cheaper than a cache miss; attention to cache behaviour is usually a more fruitful way to improve performance. Furthermore, lock-free data structures are often more expensive than using locks. This is because a lock allows arbitrary changes to be made to a complex data structure; if the same changes must be made without a lock, the result is likely to take more atomic read-modify-write instructions, not fewer.
People wish to avoid lock contention when concurrency is high. This is best solved by partitioning locked data structures to avoid lock contention. For example, it is easier, more efficient, and more useful to build a high-concurrency hash table from many normal hash tables, each with its own lock, than to build one lock-free hash table using atomic operations.
Atomic operations should be used in only a handful of low-level data structures, written by a few experts, and then reviewed and tested thoroughly. Unfortunately, many attempt to write lock-free code, and almost always this is a mistake. Please do not fall into this trap: do not use atomic operations or types---if you do, you will make mistakes, and you will cost the company time and money.
A single thread
A process that uses only a single thread requires no mutexes, and this is often the best approach for simple programs that do not require high performance or that are inherently sequential. However, it is usually not the best choice for large programs, or when high performance is required.
A single-threaded application can use only one CPU, which typically makes it far slower than other options, even when the overheads of locking are taken into account. If the application is simple enough, one may be able to run multiple copies of the same program on each machine, but this introduces two inefficiencies: cross-address-space context switches are more expensive than thread context switches because threads share TLB entries while address spaces do not; and the address space separation precludes sharing some resources (caches, ports, etc.).
Some programs, such as network servers, exhibit natural concurrency: they must deal with many concurrent client requests, so some mechanism is needed to allow this.
The fallacy of thread context-switch cost
Some try to argue that it is significantly cheaper to multiplex a single thread than to use multiple threads because a single thread requires no thread context switches. Such an argument stems from confusion about what constitutes a "context switch" and what contributes to its cost. A context switch is simply the act of multiplexing the processor between multiple activities; its dominant costs are similar regardless of whether this is done in kernel-mode or in user-mode:
When a program switches to a new activity, it incurs cache and TLB misses by touching the data and instructions associated with a new activity. This cost is the most important, and is essentially the same regardless of whether the new activity takes place in a different thread or in the same thread. The cost occurs not only when a multithreaded program performs a thread context switch, but also when an event driven program processes a new event, or when a co-operative multithreaded program switches context in user-space. Multithreaded programs rarely context switch due to time-slicing because time-slices are large.
The cost of user-kernel mode switches is sometimes counted as part of the context-switch cost between threads. However, multithreaded programs usually context switch when they have already entered the kernel for other reasons---typically, via some system call or to service an interrupt. A single-threaded program incurs these same mode switches, and thus they are common to all models. One might expect mutex and condition variable calls to add to the number of system calls, but this effect is modest because uncontended mutexes induce no system calls, while contended mutexes and condition-variable operations should be relatively rare.
Context switches between address spaces are more expensive because TLB entries must be replaced.
To summarize, if a single address space is used, the costs of switching between activities are nearly independent of the number of threads used within that address space; the technique that leads to slowest execution is to run multiple copies of a single-threaded program.
The event-driven style
To handle concurrent activities in a single thread, some programmers adopt a
style variously known as event-driven, state-machine or
continuation-passing: One can decompose the actions for a given request into a
graph of handler routines that never block for I/O internally, but rather
specify which handlers should be invoked when pending I/O requests complete.
This approach can be made to work and may even be straightforward in simple
programs, but it has bad effects on readability, modularity, and abstraction, as
well as using only one CPU. To see the problems with the event-driven style,
consider the code ... for (int i = 0; i != 10; i++) { foo(i); } ... Now imagine
that the third-party library routine foo()
is modified at some future time to
improve its functionality or average performance. Imagine that a side-effect of
this improvement is that occasionally foo()
must perform some blocking I/O
that should not be performed within a handler. Neither the author of the
for-loop nor the author of the new implementation of foo()
has done anything
unusual, and yet the program may show poor throughput or even deadlock in an
event-driven environment. Even subtler changes can have undesirable effects.
Imagine that foo()
includes a call to fprintf()
; if one day the output
stream is redirected to a device with high-throughput but high-latency, the
program's throughput will drop precipitously because foo()
's latency cannot be
hidden in the event-driven model without rewriting both fprintf()
and foo()
.
We can fix the performance problem if we change foo()'s signature to include a continuation to be called when foo() completes. However, this is not a local change: the loop must be restructured so that the loop variable is encoded in the continuation state. Worse, every use of foo() must be similarly modified, not only to handle the new signature, but to break the calling procedure into two parts: the prefix before the call to foo(), and the continuation. Thus, a local change in foo() has affected an arbitrary number of modules in a significant way: the event-driven style does not preserve modularity.
Notice how this differs from the multithreaded case. Just as the event-driven style requires that foo() be non-blocking, the multithreaded style requires that foo() be thread-safe. However, this constraint is easier to live with. First, most libraries are thread-safe either naturally or by design, while few are designed for use in event-driven systems. (For example, fprintf() is thread-safe, but provides no callback mechanism.) Second, if foo() were not thread-safe, calls to it can be made safe either by a change to foo() or by wrapping it in a new routine that acquires a lock before calling the old foo(). Either change is local, and does not affect other aspects of the interface, so modularity is preserved.
The problem with the event-driven style is worse for routines like printf() whose signatures cannot be changed lightly. Even more worryingly, some I/O methods cannot be made efficient in a single-threaded event-driven system even with arbitrary restructuring of the entire program. For example, while one can with effort write a continuation-passing equivalent of printf(), memory-mapped I/O and programmed I/O have no such equivalent.
A further problem with the event-driven style is that the resulting code becomes quite difficult to understand, maintain, and debug. This is primarily because it is harder to tell from reading the code which routine caused the current one to be called, and which routine will be called next. Standard debugging and performance tools become less effective too, as they rarely have support for tracing through continuations, and continuation mechanisms rarely maintain call history. In contrast, non-event-driven programs maintain much of their state and history conveniently on the thread stacks; debugging tools can reveal a great deal simply by displaying stack traces.
Co-operative multithreading
An alternative style called co-operative multithreading allows the programmer to use multiple threads on a single CPU. The scheduler guarantees that no two threads can run concurrently, and guarantees never to pre-empt a thread that has not blocked. In theory, this allows mutexes to be omitted: the code a++; b--; will always execute atomically. Unfortunately, reliance on this property makes the code more fragile. For example, because any I/O may block, a++; printf("bar\n"); b--; probably does not execute atomically, and a++; foo(); b--; may or may not execute atomically, depending on the implementation of foo(). Thus, co-operative multithreading without explicit synchronization can lead to code in which a bug may be introduced to one module by adding a debug printf-statement in another module. If explicit synchronization is used, the technique becomes equivalent to the straightforward use of threads.
For these reasons, unless a program is quite simple it usually pays in both performance and maintainability to use multiple threads, and to protect shared variables explicitly with mutexes or to communicate between threads with messages.
Why is cv.Wait() always in a while-loop?
Hoare's original condition variables did not require the while-loop, but modern versions require it for somewhat subtle reasons:
The presence of the while-loop allows one to tell by local inspection that the condition is true when the loop exits. Hoare's original precise semantics required inspection of all code that could potentially call Signal(), which made some errors rather harder to debug.
The while-loop allows wakers to perform spurious wakeups, which gives the programmer the opportunity to trade performance for ease of programming. Suppose he arranges for threads always to signal the condition variable when they modify protected state, rather than only when they make a specific condition true. This allows modularity between waiters and wakers: the wakers don't need to know what conditions the waiters are waiting for, and each waiter can wait for a different condition without affecting the code of the wakers.
The while-loop allows the condition variable implementation more freedom to schedule woken threads in any order. Consider a thread T0 that wakes thread T1 that was waiting for condition C. If the runtime semantics guarantee that T1 will enter the critical section next, T1 can assume C. But context switches have overhead, so it is usually more efficient merely to add T1 to the run queue while continuing to run T0 and perhaps other threads, which may then enter the critical section before T1. If any of those threads falsifies C, T1 must not assume C on entering the critical section; scheduling has made it appear that it has received a spurious wakeup. The while-loop ensures that T1 tests C, and continues only if C is really true. Thus, the while-loop effectively allows more freedom in choosing an efficient scheduling discipline.
Timed waits become less error-prone. A timed wait may cause a thread to wake before its condition C is true. Suppose the programmer forgets to test for a timeout. If he is forced to use a while-loop, his thread will go to sleep again and his program will probably deadlock, allowing easy detection of the bug. Without the while-loop, the thread would falsely assume C, and cause arbitrarily bad behavior.
Why must the condition used with cv.Wait() be a function of state protected by the mutex?
Consider a thread W waiting for a condition cond_expr to become true:

mu.Acquire();
while (!cond_expr) {
  cv.Wait();  // mu was passed to cv's constructor
}
// cond_expr now holds
...
mu.Release();

If cond_expr is not a function of state protected by mu, two bad things can happen:
Suppose that thread W finds cond_expr false, and is about to call cv.Wait(). If the state associated with cond_expr is not protected by mu, another thread can make cond_expr true and call cv.Signal() before W calls cv.Wait(). This means that W may block indefinitely in Wait(), even though cond_expr is true (only a thread currently in Wait() is woken by a call to Signal()).
Suppose that thread W finds cond_expr true, and is about to execute the code labelled "cond_expr now holds". If the state associated with cond_expr is not protected by mu, another thread can make cond_expr false before W runs the rest of the code, so W cannot assume cond_expr. This negates the purpose of the condition variable, which was to give W a guarantee about cond_expr.
Why put Signal() inside the critical section?
In most cases, it is correct to put Signal() after the critical section, but in Chrome code it is always both safe and efficient to put it inside the critical section. (TODO: verify this)
Some texts recommend putting Signal() after the critical section because this makes it more likely that the mutex is free when a thread attempts to reacquire it on return from Wait(). If the Signal() were inside the critical section, a naive implementation might wake the thread, which could then block once more on the mutex held by the very thread that woke it.
Chrome's condition variables (and most reasonable implementations) detect this case, and delay waking the waiting thread until the mutex is free. (TODO: verify this) Hence, there is no performance penalty for calling Signal() inside the critical section.
In rare cases, it is incorrect to call Signal() after the critical section, so we recommend always using it inside the critical section. The following code can attempt to access the condition variable after it has been deleted, but would be safe if Signal() were called inside the critical section.
struct Foo {
Lock mu;
ConditionVariable cv;
bool del;
...
};
...
void Thread1(Foo *foo) {
foo->mu.Acquire();
while (!foo->del) {
foo->cv.Wait();
}
foo->mu.Release();
delete foo;
}
...
void Thread2(Foo *foo) {
foo->mu.Acquire();
foo->del = true;
// Signal() should be called here
foo->mu.Release();
foo->cv.Signal(); // BUG: foo may have been deleted
}
Why should implementors of mutexes pay attention to mutex performance under contention?
Clients should avoid lock contention, because contention necessarily implies less parallelism; some threads are blocked while another executes the critical section. Because clients must avoid contention, some implementors of mutexes pay less attention to the performance of mutexes under contention. However, contention is sometimes encountered despite clients' best efforts. For example:
A network server may become overloaded or see a changed pattern of use, causing a mutex to be used more than it normally would.
A program may be run on an upgraded machine with more CPUs, causing contention on a mutex that was previously lightly loaded.
Software developers encourage abstraction between parts of our programs, so the authors and clients of modules may have different expectations of how the module will be used. In particular, a client may cause contention on a mutex that he is unaware of.
While it is important for clients to fix contention to avoid loss of parallelism, that loss of parallelism should be their main consideration. The performance of the mutex itself should not degrade precipitously, even when heavily contended. That is: an overloaded server should recover from overload if the load drops once more; a machine with more CPUs should run no slower than a machine with fewer CPUs; and calling a module more often should not reduce the amount of work that gets done, even if it doesn't increase it.
Ideally, a critical section should provide approximately the same rate of progress to many contending threads as it can to a single thread. Mutex implementations can approximate this ideal by not providing fairness, and by preventing multiple threads that have already blocked from simultaneously competing for the lock.
Does every Lock operation imply a memory barrier?
Programmers should not use Lock operations as a means for inserting arbitrary memory barriers into their code, or for exerting control over when threads run. Lock operations imply only the ordering necessary for the protection of monitor invariants. In particular, the intent is:

If threads T0 and T1 execute the following code, where some location is modified by one of T0_Inside() and T1_Inside() and read or written by the other:
// thread T0         // thread T1
T0_Before();         T1_Before();
mu.Acquire();        mu.Acquire();
T0_Inside();         T1_Inside();
mu.Release();        mu.Release();
T0_After();          T1_After();
then the memory operations in Tx_Before() and Tx_Inside() all precede the memory operations in Ty_Inside() and Ty_After(), either for x=0, y=1 or for x=1, y=0.
If the predicate does not hold, no memory ordering should be assumed from the Lock operations. This surprises programmers who expect the simplest possible implementation, with no optimizations. Unfortunately, this expectation is reinforced by some API standards.
We discourage such assumptions because they make various transformations more difficult. Examples include:
- Removal of critical sections that are redundant with others.
- Removal of locks used by only one thread, or that protect no data.
- Coalescing and splitting of locks and critical sections.
- Conversion of exclusive locks to shared locks.
- Replacing locks with transactional memory.
Some lock implementations already apply some of these transformations, and are more efficient as a result. Therefore, Lock reserves the right to use such transformations when safe, even if that means removing memory barriers.