Software Geek

March 17, 2008

Memory Model

Filed under: Software


Developing Customer Relationship Management Solutions. Web, e-Commerce, Database Design and Software Development.

One of
the suggestions for a blog entry was the managed memory model.  This is timely, because we’ve just been
revising our overall approach to this confusing topic.  For the most part, I write about product
decisions that have already been made and shipped.  In this note, I’m talking about future
directions.  Be
skeptical.

< ?xml:namespace prefix = o ns =
"urn:schemas-microsoft-com:office:office" /> size=2> 

So what
is a memory model?  It’s the
abstraction that makes the reality of today’s exotic hardware comprehensible to
software developers.

size=2> 

The
reality of hardware is that CPUs are renaming registers, performing speculative
and out-of-order execution, and fixing up the world during retirement.  Memory state is cached at various levels
in the system (L0 thru L3 on modern X86 boxes, presumably with more levels on
the way).  Some levels of cache are
shared between particular CPUs but not others.  For example, L0 is typically per-CPU but
a hyper-threaded CPU may share L0
between the logical CPUs of a single physical CPU.  Or an 8-way box may split the system into two
hemispheres with cache controllers performing an elaborate coherency protocol
between these separate hemispheres. 
If you consider caching effects, at some level all MP (multi-processor)
computers are NUMA (non-uniform memory access).  But there’s enough magic going on that
even a Unisys 32-way can generally be considered as UMA by
developers.


size=2> 

Free Live Chat: Next generation of Live Chat. On-Demand. Easy-to-Use.
Customer Help Solution: jbTop is Jabber/XMPP Live Chat Soulution for your website.

It’s
reasonable for the CLR to know as much as possible about the cache architecture
of your hardware so that it can exploit any imbalances.  For example, the developers on our
performance team have experimented with a scalable rendezvous for phases of the
GC.  The idea was that each CPU
establishes a rendezvous with the CPU that is “closest” to it in distance in the
cache hierarchy, and then one of this pair cascades up a tree to its closest
neighbor until we reach a single root CPU. 
At that point, the rendezvous is complete.  I think the jury is still out on this
particular technique, but they have found some other techniques that really pay
off on the larger systems.

size=2> 

Of
course, it’s absolutely unreasonable for any managed developer (or 99.99% of
unmanaged developers) to ever concern themselves with these imbalances.  Instead, software developers want to
treat all computers as equivalent. 
For managed developers, the CLR is the computer and it better work
consistently regardless of the underlying machine.

Also see: C# 3.0 Lambdas and Type Inference

Customer Help Solution: jbTop is Jabber/XMPP Live Chat Soulution for your website.

size=2> 

size=2>Although managed developers shouldn’t know the difference between a 4-way
AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the
realities of today’s hardware. 
Today, I think the penalty of a CPU cache miss that goes all the way to
main memory is about 1/10th the penalty of a memory miss that goes
all the way to disk.  And the trend
is clear.

size=2> 

If
you wanted good performance on a virtual memory system, you’ve always been
responsible for relieving the paging system by getting good page density and
locality in your data structures and access patterns.

size=2> 

In
a similar vein, if you want good performance on today’s hardware, where
accessing main memory is a small disaster, you must pack your data into cache
lines and limit indirections.  If
you are building shared data structures, consider separating any data that’s
subject to false sharing.

Also see: From C# to Java: Part 3

size=2> 

To
some extent, the CLR can help you here. 
On MP machines, we use lock-free allocators which (statistically)
guarantee locality for each thread’s allocations.  Any compaction will (statistically)
preserve that locality.  Looking
into the very far future – perhaps after our sun explodes – you could imagine a
CLR that can reorganize your data structures to achieve even better
performance.

size=2> 

size=2>This means that if you are writing single-threaded managed code to
process a server request, and if you can avoid writing to any shared state, you
are probably going to be pretty scalable without even trying.

size=2> 

Getting
back to memory models, what is the abstraction that will make sense of current
hardware?  It’s a simplifying model
where all the cache levels disappear. 
We pretend that all the CPUs are attached to a single shared memory.  Now we just need to know whether all the
CPUs see the same state in that memory, or if it’s possible for some of them to
see reordering in the loads and stores that occur on other CPUs.

Also see: A quick update on me.

size=2> 

At one
extreme, we have a world where all the CPUs see a single consistent memory.  All the loads and stores expressed in
programs are performed in a serialized manner and nobody perceives a particular
thread’s loads or stores being reordered. 
That’s a wonderfully sane model which is easy for software developers to
comprehend and program to. 
Unfortunately, it is far too slow and non-scalable.  Nobody builds this.

size=2> 

At the
other extreme, we have a world where CPUs operate almost entirely out of private
cache.  If another CPU ever sees
anything my CPU is doing, it’s a total accident of timing.  Because loads and stores can propagate
to other CPUs in any random order, performance and scaling are great.  But it is impossible for humans to
program to this model.

size=2> 

Also see: The 2 Technology Magazines You Should Read

Live Person Software: Just one single click and your website visitors are getting into instant message chatting with you.

In
between those extremes are a lot of different possibilities.  Those possibilities are explained in
terms of acquire and release semantics:

size=2> 

  • style="MARGIN: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list.5in">A normal load or store can be freely reordered with respect
    to other normal load or store operations.
  • style="MARGIN: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list.5in">A load with acquire semantics creates a downwards
    fence.  This means that normal
    loads and stores can be moved down past the load.acquire, but nothing can be
    moved to above the load.acquire.
  • style="MARGIN: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list.5in">A store with release semantics creates an upwards
    fence.  This means that normal
    loads and stores can be moved above the store.release, but nothing can be
    moved to below the store.release.
  • style="MARGIN: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list.5in">A full fence is effectively an upwards and downwards
    fence.  Nothing can move in either
    direction across a full fence.

Also see: From C# to Java: Part 4

Custom Software Development for Real-Estate, Hosting providers, Workflow and Business Management Systems.

size=2> 

A
super-strong extreme model puts a full fence after every load or store.  A super-weak extreme model uses normal
loads and stores everywhere, with no fencing.

size=2> 

The most
familiar model is X86.  It’s a
relatively strong model.  Stores are
never reordered with respect to other stores.  But, in the absence of data dependence,
loads can be reordered with respect to other loads and stores.  Many X86 developers don’t realize that
this reordering is possible, though it can lead to some nasty failures under
stress on big MP machines.

size=2> 

In terms
of the above, the memory model for X86 can be described as:

size=2> 

  1. style="MARGIN: 0in 0in 0pt; mso-list: l3 level1 lfo2; tab-stops: list.5in">All stores are actually store.release.
  2. style="MARGIN: 0in 0in 0pt; mso-list: l3 level1 lfo2; tab-stops: list.5in">All loads are normal loads.
  3. style="MARGIN: 0in 0in 0pt; mso-list: l3 level1 lfo2; tab-stops: list.5in">Any use of the LOCK prefix (e.g. ‘LOCK CMPXCHG’ or ‘LOCK
    INC’) creates a full fence.

Also see: A Quick Fix for the Validator SetFocusOnError Bug

size=2> 

size=2>Historically, Windows NT has run on Alpha and MIPS computers.

size=2> 

Looking
forwards, Microsoft has announced that Windows will support Intel’s IA64 and
AMD’s AMD64 processors.  Eventually,
we need to port the CLR to wherever Windows runs.  You can draw an obvious conclusion from
these facts.

size=2> 

AMD64
has the same memory model as X86.

size=2> 

IA64
specifies a weaker memory model than X86. 
Specifically, all loads and stores are normal loads and stores.  The application must use special ld.acq
and st.rel instructions to achieve acquire and release semantics.  There’s also a full fence instruction,
though I can’t remember the opcode (mf?).

Also see: From C# to Java: Part 4

Help Desk Software: Next generation of Live Chat. Jabber/XMPP Live Chat Service for your website.

size=2> 

Be
especially skeptical when you read the next paragraph:

size=2> 

There’s
some reason to believe that current IA64 hardware actually implements a stronger
model than is specified.  Based on
informed hearsay and lots of experimental evidence, it looks like normal store
instructions on current IA64 hardware are retired in order with release
semantics.

size=2> 

If this
is indeed the case, why would Intel specify something weaker than what they have
built?  Presumably they would do
this to leave the door open for a weaker (i.e. faster and more scalable)
implementation in the future.

size=2> 

In fact,
the CLR has done exactly the same thing. 
Section 12.6 of Partition I of the ECMA CLI specification explains our
memory model.  This explains the
alignment rules, byte ordering, the atomicity of loads and stores, volatile
semantics, locking behavior, etc. 
According to that specification, an application must use volatile loads
and volatile stores to achieve acquire and release semantics.  Normal loads and stores can be freely
reordered, as seen by other CPUs.

Also see: Should “Membership Stores” Be Permitted in Redmond’s Manufacturing Park Zone?

Multisoft Group: Custom software solutions for your business.
Live Help Server: Jerry Messenger Server is Live Chat with Users on your websites.

size=2> 

What is
the practical implication of this? 
Consider the standard double-locking protocol:

size=2> 

if (a == null)

{

  lock(obj)

  {

    if (a == null) a = new
A();

  }

Also see: Silverlight and WPF Control Developer Huddle at Mix08

Developing Customer Relationship Management Solutions. Web, e-Commerce, Database Design and Software Development.

}

size=2> 

This is
a common technique for avoiding a lock on the read of ‘a’ in the typical
case.  It works just fine on
X86.  But it would be broken by a
legal but weak implementation of the ECMA CLI spec.  It’s true that, according to the ECMA
spec, acquiring a lock has acquire semantics and releasing a lock has release
semantics.

size=2> 

However,
we have to assume that a series of stores have taken place during construction
of ‘a’.  Those stores can be
arbitrarily reordered, including the possibility of delaying them until after
the publishing store which assigns the new object to ‘a’.  At that point, there is a small window
before the store.release implied by leaving the lock.  Inside that window, other CPUs can
navigate through the reference ‘a’ and see a partially constructed
instance.

Also see: Eriskay: a Programming Language Based on Game Semantics

Live Chat Software: Next generation of Live Chat. On-Demand. Easy-to-Use.

size=2> 

We could
fix this code in various ways.  For
example, we could insert a memory barrier of some sort after construction and
before assignment to ‘a’.  Or – if
construction of ‘a’ has no side effects – we could move the assignment outside
the lock, and use an Interlocked.CompareExchange to ensure that assignment only
happens once.  The GC would collect
any extra ‘A’ instances created by this race.

size=2> 

I hope
that this example has convinced you that you don’t want to try writing reliable
code against the documented CLI model.

size=2> 

I wrote
a fair amount of “clever” lock-free thread-safe code in version 1 of the
CLR.  This included techniques like
lock-free synchronization between the class loader, the prestub (which traps
first calls on methods so it can generate code for them), and AppDomain
unloading so that I could back-patch MethodTable slots efficiently.  But I have no desire to write any kind
of code on a system that’s as weak as the ECMA CLI spec.

Also see: From C# to Java: Part 3

Custom Software Solutions. Billing and Invoicing Solutions, eCommerce and Website design.

size=2> 

Even if
I tried to write code that is robust under that memory model, I have no hardware
that I could test it on.  X86, AMD64
and (presumably) IA64 are stronger than what we specified.

size=2> 

In my
opinion, we screwed up when we specified the ECMA memory model.  That model is unreasonable
because:

size=2> 

  • style="MARGIN: 0in 0in 0pt; mso-list: l2 level1 lfo3; tab-stops: list.5in">All stores to shared memory really require a volatile
    prefix.
  • style="MARGIN: 0in 0in 0pt; mso-list: l2 level1 lfo3; tab-stops: list.5in">This is not a productive way to code.
  • style="MARGIN: 0in 0in 0pt; mso-list: l2 level1 lfo3; tab-stops: list.5in">Developers will often make mistakes as they follow this
    onerous discipline.
  • style="MARGIN: 0in 0in 0pt; mso-list: l2 level1 lfo3; tab-stops: list.5in">These mistakes cannot be discovered through testing,
    because the hardware is too strong.

Also see: Web Access for Visual Studio Team System

size=2> 

So what
would make a sensible memory model for the CLR?

size=2> 

Well,
first we would want to have a consistent model across all CLI
implementations.  This would include
the CLR, Rotor, the Compact Frameworks, SPOT, and – ideally – non-Microsoft
implementations like Mono.  So
putting a common memory model into an ECMA spec was definitely a good
idea.

size=2> 

It goes
without saying that this model should be consistent across all possible
CPUs.  We’re in big trouble if
everyone is testing on X86 but then deploying on Alpha (which had a notoriously
weak model).

size=2> 

We would
also want to have a consistent model between the native code generator (JIT or
NGEN) and the CPU.  It doesn’t make
sense to constrain the JIT or NGEN to order stores, but then allow the CPU to
reorder those stores.  Or vice
versa.

Also see: Framework Design Guidelines: LINQ

Customer Help Solution: jbTop is Jabber/XMPP Live Chat Soulution for your website.

size=2> 

Ideally,
the IL generator would also follow the same model.  In other words, your C# compiler should
be allowed to reorder whatever the native code generator and CPU are allowed to
reorder.  There’s some debate
whether the converse is true. 
Arguably, it is okay for an IL generator to apply more aggressive
optimizations than the native code generator and CPU are permitted, because IL
generation occurs on the developer’s box and is subject to testing.

size=2> 

size=2>Ultimately, that last point is a language decision rather than a CLR
decision.  Some IL generators, like
ILASM, will rigorously emit IL in the sequence specified by the source
code.  Other IL generators, like
Managed C++, might pursue aggressive reordering based on their own language
rules and compiler optimization switches. 
If I had to guess, IL generators like the Microsoft compilers for C# and
VB.NET would decide to respect the CLR’s memory model.

Also see: Finally, the Killer App

size=2> 

We’ve
spent a lot of time thinking about what the correct memory model for the CLR
should be.  If I had to guess, we’re
going to switch from the ECMA model to the following model.  I think that we will try to persuade
other CLI implementations to adopt this same model, and that we will try to
change the ECMA specification to reflect this.

size=2> 

  1. style="MARGIN: 0in 0in 0pt; mso-list: l1 level1 lfo4; tab-stops: list.5in">Memory ordering only applies to locations which can be
    globally visible or locations that are marked volatile.  Any locals that are not address
    exposed can be optimized without using memory ordering as a constraint since
    these locations cannot be touched by multiple threads in parallel.
  2. style="MARGIN: 0in 0in 0pt; mso-list: l1 level1 lfo4; tab-stops: list.5in">Non-volatile loads can be reordered freely.
  3. style="MARGIN: 0in 0in 0pt; mso-list: l1 level1 lfo4; tab-stops: list.5in">Every store (regardless of volatile marking) is considered
    a release.
  4. style="MARGIN: 0in 0in 0pt; mso-list: l1 level1 lfo4; tab-stops: list.5in">Volatile loads are considered acquire.
  5. style="MARGIN: 0in 0in 0pt; mso-list: l1 level1 lfo4; tab-stops: list.5in">Device oriented software may need special programmer
    care.  Volatile stores are still
    required for any access of device memory.  This is typically not a concern for
    the managed developer.
Help Desk Software: Next generation of Live Chat. Jabber/XMPP Live Chat Service for your website.

size=2> 

If
you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64,
you are right.  We also think it
hits the sweet spots for compilers. 
Reordering loads is much more important for enabling optimizations than
reordering stores.

size=2> 

So what
happens in 10 years when these architectures are gone and we’re all using
futuristic Starbucks computers with an ultra-weak model?  Well, hopefully I’ll be living the good
life in retirement on < ?xml:namespace prefix = st1 ns =
"urn:schemas-microsoft-com:office:smarttags" />Maui.  But the CLR’s native code generators
will generate whatever instructions are necessary to keep stores ordered when
executing your existing programs. 
Obviously this will sacrifice some performance.

size=2> 

The
trade-off between developer productivity and computer performance is really an
economic one.  If there’s sufficient
incentive to write code to a weak memory model so it can execute efficiently on
future computers, then developers will do so.  At that point, we will allow them to
mark their assemblies (or individual methods) to indicate that they are “weak
model clean”.  This will permit the
native code generator to emit normal stores rather than store.release
instructions.  You’ll be able to
achieve high performance on weak machines, but this will always be “opt
in”.  And we won’t build this
capability until there’s a real demand for it.

Also see: A Quick Fix for the Validator SetFocusOnError Bug

size=2> 

I
personally believe that for mainstream computing, weak memory models will never
catch on with human developers. 
Human productivity and software reliability are more important than the
increment of performance and scaling these models provide.

size=2> 

Finally,
I think the person asking about memory models was really interested in where he
should use volatile and fences in his code.  Here’s my advice:

size=2> 

  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">Use managed locks like Monitor.Enter (C# lock / VB.NET
    synclock) for synchronization, except where performance really requires you to
    be “clever”.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">When you’re being “clever”, assume the relatively strong
    model I described above.  Only
    loads are subject to re-ordering.
  • Multisoft Group: Custom Software Development and Consulting Service.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">If you have more than a few places that you are using
    volatile, you’re probably being too clever.  Consider backing off and using managed
    locks instead.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">Realize that synchronization is expensive.  The full fence implied by
    Interlocked.Increment can be many 100’s of cycles on modern hardware.  That penalty may continue to grow, in
    relative terms.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">Consider locality and caching effects like hot spots due to
    false sharing.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">Stress test for days with the biggest MP box you can get
    your hands on.
  • style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo5; tab-stops: list.5in">Take everything I said with a grain of
    salt.

http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx

Comments »

The URI to TrackBack this entry is: http://annil12.blogsome.com/2008/03/17/memory-model-2/trackback/

No comments yet.

RSS feed for comments on this post.

Leave a comment

Line and paragraph breaks automatic, e-mail address never displayed, HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>



Anti-spam measure: please retype the above text into the box provided.

Get free blog up and running in minutes with Blogsome
Theme designed by Jay of onefinejay.com