§ ¶On reference counting and garbage collection
I find it interesting that both Apple and Microsoft have opted to invest in compiler-assisted garbage collection through reference counting. It's called Automatic Garbage Collection (ARC) on Mac OS X / iOS and it's part of the Windows Runtime environment on the Microsoft side. This is despite reference counting being regarded as the weakest and poorest performing type of garbage collection.
Wikipedia has a much better writeup of reference counting than I could do, but in summary, you find a place to associate a reference count with each object, increment and decrement the reference count whenever references are created or destroyed, and destroy the object when the count reaches zero. This can either be done either intrusively (COM AddRef()/Release()) or non-intrusively (std::shared_ptr). Refcounting is thus a simple and effective mechanism, but the cost of maintaining the reference counts and the inability to handle cycles often make tracing garbage collection a preferable option for language runtimes.
The commonly cited problem with tracing GC is the delay imposed by the mark-and-sweep operation. There are lots of approaches used to mitigating this, such as generational and incremental collection, and for the most part I'd say they're effective on desktop apps, as we're long past the point where a Java program would visibly pause on screen while the GC cleaned house. What I don't see mentioned as often is the cost of getting this in place, namely:
- The need to emit enough metadata to allow for precise garbage collection, where the GC can definitively distinguish references from non-references in memory. This is required for a compaction and still highly desirable for non-compacting GC. I've tried imprecise GC, and trust me, it sucks. You don't realize how bad it is until you've had 10MB of memory leaked by a float that looked like a heap pointer.
- The need to modify code generation to suit the GC, usually to assist in tracking modified references in the heap. This can be done through paging hardware, but that requires OS support and only achieves page granularity (often 4K or 64K depending on the hardware). Doing it in software avoids these problems but requires
tailored code generation and imposes a performance penalty. For example, the .NET CLR has to call a small function every time you modify a reference in order to update a card marking table.
- The interop selfishness imposed by the GC. At a minimum, this is the need for the GC to see all external references to objects in its heap ("GC roots"). For a compacting GC this becomes the need to pin all objects while they could be referenced externally, which is usually also accompanied by documentation stating how you are being unspeakably cruel to the GC by doing so. Then, there's also the "there can only be one" problem, where the GC wants to be the sole overseer of memory and you get cycle-related problems if you tie two GCs together.
These reasons why I don't like attempts to make tracing GC an optional language feature: the GC dependency tends to leak out into the rest of the program and gets its tendrils into everything. This is highly undesirable in C++, which is largely a pay-as-you-go language where you don't pay for a language feature if you don't use it. An imprecise, non-compacting conservative garbage collector is sometimes used to avoid imposing onerous requirements on the C++ compiler and runtime environment, but in my experience this is unsatisfactory: you still have to make sure the GC sees everything and you get memory leaks from the imprecise collection.
This brings us back to reference counting. The often touted advantage to reference counting is that it is deterministic, but I like even more that it has minimal requirements: adding a reference count and AddRef/Release operations to an object is all that is required. Simple and effective, can be used in just about any environment, and used only where needed. I also find that the performance cost of reference counting is overstated since data structures are often read more often than they are modified and there is usually no need to take a strong reference on temporary pointers during read access. Cycles are still a problem but can often be avoided by judicious use of untracked weak pointers, the most common case being parent pointers in a tree structure. Reference counting also doesn't require traversal of live but infrequently accessed objects, which is the main culprit in tracing GC performing badly when a virtual memory system is paging.
Despite this, using reference counting correctly and efficiently does require some manual effort, and I'm skeptical about trying to hide it by building it into the compiler as "automatic." It will definitely be easier and less error prone on average, but there will be cases where the compiler fails to optimize out refcount adjustments on critical paths, and the lower visibility will make it easier to fall into refcounting traps. Looking over the ARC and WinRT docs a bit, I had two thoughts:
- The WinRT docs don't mention the cycle trap. (The ARC docs do.)
- Neither specifies whether it is allowed to temporarily resurrect an object during destruction. This results in a crash with naive Release() implementations that leave the refcount at 0 and attempt a double-destroy when it goes to 1 and back to 0 again.
That's not counting the performance overhead, which I haven't been able to assess not having used either implementation. As such, while I like refcounting, I have to think that compiler-aided refcounting is only suitable as a stop-gap until tracing GC can be further improved enough to be used universally.
While I agree with the general points about pros and cons of both techniques I must pose the question: Is GC in WinRT even viable? It is not, because the GC would have to be able to cover multiple allocation sources written in C++, JS and .NET in a single process.
If GC _was_ viable I am sure it would have been preferred to refcounting because of productivity and reliability enhancements.
tobi - 06 11 11 - 09:07
ARC in practice differs a bit from your characterization. Two counterpoints:
1. In ARC-enabled code, you can't override a class's -release implementation; you're always using the compiler-provided version (sort of like an intrinsic). This means you can't write a broken -release implementation: that privilege is reserved by Apple for themselves.
2. ARC code actually ends up being faster
because it avoids the runtime override implicit in Objective-C's message-based dispatch system. (This relates to point 1.)
Apple introduced ARC specifically as a response to their dissatisfaction with their GC (which was only available on Mac, no iOS, for the collection overhead reasons you mention).
Jon Parise (link) - 06 11 11 - 09:30
Thanks for the clarifications, and good to hear from you again.
When you say ARC code ends up being faster, do you mean compared to Objective C's existing refcounting system? I'm not very familiar with ObjC since I don't do Mac or iOS programming.
WinRT is built on top of COM, so no, I don't think they could. I believe you can build a WinRT component without using the compiler support runtime libraries, so there would be no way to traverse the object graph since internal objects are implementation details not exposed through a standard COM interface. IUnknown doesn't even expose any details about the refcounting itself much less any referenced objects.
Now, if they hadn't used COM, or had required additional traversal interfaces or a common object layout, then perhaps they could have. The way that WinRT is bridged to .NET is interesting in this regard as they have essentially revived all of the same lifetime problems you get when trying to use any COM interface from it, which is that you're bolting a nondeterministic GC on top of refcounting. I remember reading a while back that the Visual Studio team had problems with the .NET GC holding onto COM interfaces too long, so developers started adding Marshal.ReleaseComObject() calls to try to fix those problems, and then those had to be backed out because they were introducing premature release bugs. What might mitigate this problem is the type of objects that the base WinRT system exposes, primarily I/O and UI resources. It's common for those types of objects to export IDisposable on the .NET side anyway, like in WinForms and System.IO, so if it's just some COM objects being held a bit long on the native side with their underlying resources already detached that's not too bad.
Phaeron - 06 11 11 - 10:10
It will be very interesting to see how the interop turns out to be in practice. Could be smooth, could be painful. Thank god I earn my money with ASP.NET.
I wonder how novice programmers will deal with the complexities of "cross-border" lifetime management. .NET developers tend to be less skilled than C++ devs anyway. I see people forgetting to deterministically close files all the time. I have no hope that the "using" pattern will be used enough to make this issue go away for them.
tobi - 06 11 11 - 10:30
Yup. One of the reasons for this is the misconception that garbage collection does memory management for you: all it does is keep allocation consistent with usage, which means it won't help you when the problem is a broken usage pattern. A situation I see in C# code is objects that are accidentally kept alive way too long because they are referenced by a delegate in an event list. If that's a global event list, the result is most likely a permanent memory leak. Then someone tries to add IDisposable to fix the problem, starting an IDisposable cascade throughout the inheritance hierarchy.
The tools for tracking down .NET memory problems are also not as well used. I usually get blank stares when I suggest either the CLR Profiler or SOS.dll.
Phaeron - 06 11 11 - 10:57
>>Reference counting also doesn't require traversal of live but infrequently accessed objects, which is the main culprit in tracing GC performing badly when a virtual memory system is paging.
Conversely, since GC only needs to mark and move live objects, it is unaffected by the size and number of dead objects. Create a million element linked list and then release the last reference to its head. Watch a reference counting system walk the entire list, decreasing each refcount and releasing each node, thrashing cache. A compacting GC would just memset whole pages to 0 when you next need to allocate (there are also talks of being able to request uninitialized memory if your constructor is going to fully initialize every byte anyway, but so far nothing).
Z.T. - 06 11 11 - 11:14
There are pathological cases for either, and yes, refcount traffic is a concern. In this case, I'd make these counter points:
1) If the linked list has been recently accessed, the pages are likely to be in the working set and the only pages that need to be accessed are those in the list. The GC may have to traverse the entire set of live pages with pointers in order to determine that the linked list can be freed, if you hit the dreaded full collection.
2) On the scale of paging I/O, adjusting refcounts and memset-ing a page to 0 are effectively the same as the difference in CPU time is tiny compared to the I/O cost. You're writing the page, which forces it to be paged in and added to working set, and marked as dirty. It may be possible to tell the VM you're doing so and avoid the page-in, but I'm not sure this happens in practice -- VirtualDiscard() on Win32 is a possibility.
In addition, it's a common case for an application to have a large memory footprint relative to its allocation volume: this happens whenever you have a good-sized UI application that's idle waiting for input. This is also a good target for the virtual memory system to evict pages. When you decide to wake up the application, the GC can decide it's time to do a full collection and when it does... you're stuck waiting a minute or more while it traverses the heap, even if you're only using a tiny portion of the application.
Finally, in C/C++ code, it's possible to use memory pools to quickly discard entire data structures similarly to the GC.
Phaeron - 06 11 11 - 11:45
Why are you paging in 2011? I am more concerned with last level cache. Reference counting possibly pollutes cache when decrementing and releasing objects that were not in cache (e.g. deep nodes in data structure that is frequently accessed only breadth first). A GC possibly pollutes cache by chasing pointers through cold live objects, during full collection (RE: paging. A smart GC would have done a full collection before paging out (on receiving memory pressure event from OS) so it would not think to do a collection when paging in). With multi-processing, a GC doing reads
on one core will not slow down application code running on another core, but reference counting does extra writes
, and because of cache coherence protocols this may be more expensive.
I think we agree that memory management is not
a solved problem and a programmer who does not think will run into trouble when correctly using the default tools of any platform. A thinking programmer can make stuff work on any platform (though on some it's harder than on others). Being aware of cache hierarchy and topology, NUMA, etc. is required somewhere in the stack (app code or VM code), and cooperation from the OS scheduler (and memory manager, if possible) is required for good performance.
In the days when nobody had good compilers, good garbage collectors, good relational databases, etc., all those things we now take for ranted, programmers had to work hard to do something that takes half an hour now. So they were used to hard work. Now that you can build toy apps without understanding the complexities of the libraries and frameworks you use, less skilled people can start doing something that appears to work and only when it becomes somewhat successful they run into the problems of needing vectorized code, needing to inline everything and not blow the i-cache, needing static data preallocation, a limited allocation rate, arena allocators, needing reliable storage with high availability, needing scalable networking, etc. etc.
Z.T. - 06 11 11 - 12:56
There are three ways in which ARC'ed code can achieve performance increases:
1. It replaces the existing Objective-C message-based reference counting system with intrinsic reference counting methods. A loose analogy would be replacing virtual AddRef()/Release() methods in C++ with non-virtual, statically compiled function calls.
2. The compiler's static analysis pass can find cases where normally reference-counted call patterns don't actually need to use reference counting. The C++ analogy here is directly new'ing and delete'ing an object type that supports reference counting semantics; as you know, there's often no need to twiddle reference counts for a heap object whose lifetime is limited to a local scope.
3. Objective-C has an 'autorelease' operation that defers an object's -release call until the current autorelease pool is popped. This is how Objective-C deals with returning +1'ed objects out the left-hand side of factory functions, for example. Under ARC, many of these 'autorelease' calls can become no-ops using the same logic behind (2). Not only does that save the additional reference counting overhead, but it also eliminates autorelease pool growth and deferred work.
Jon Parise (link) - 06 11 11 - 13:34
Also, here are two good ARC references:
Jon Parise (link) - 06 11 11 - 13:36
> Why are you paging in 2011?
I am not familiar with other OSes, but even on 2011, desktop systems running Windows page. This is true even if you have 8GB of memory. The reason is that the disk cache is itself considered a process with a working set, and if you do enough in the foreground with disk I/O eventually the system will start hunting around for what it thinks is easy memory to enlarge the disk cache. Doing parallel compiles on an 8 core CPU and then a big link afterward is enough for the OS to enlarge the disk cache. It works out most of the time, until one of the processes that's mostly swapped out decides that it needs all of its pages right *NOW*. Disabling virtual memory on a Windows system is generally risky as programs do not cope well with even a momentary allocation failure.
I've never used memory pressure events or profiled a GC that does, but one problem I see is that a GC responding to such an event could easily make things worse as much as better. If you're guaranteed that the process with the GC won't be paged until low memory events happen, I could see that being effective. If it can be partially paged out beforehand, triggering a GC could actually make the situation worse. That's also assuming the GC can respond in time. In Win32, the low memory event triggers at 32MB-64MB, which seems way too low to me for modern systems.
It sounds like the environments you work in are significantly different than mine. I come from a background where the response to cache pollution trouble from the memory allocator is to use memory pools and allocate/free less; typically I aim to essentially drop the allocator out of the visible part of the profile. This works a bit better in a C/C++ environment which is based on malloc/free and supports multiple allocation paradigms. It doesn't work quite so well in a GC-based environment where pooling actually increases the GC workload, and it definitely assumes a certain level of stability or predictability in the workload.
Phaeron - 06 11 11 - 13:46
The biggest problem refcounting has today is that safely incrementing/decrementing references to objects shared between threads is *expensive* on modern cpus. Minimum of hundreds of cycles, and tens of thousands if the object actually bounces between caches. Python suffers very acutely from this: The language was originally designed with refcounting semantics, which made it's memory model be simple and have well predictable performance. This did, however, doom it to never having more than one workthread running at a time. Many attempts to remove the "global interpreter lock" have failed, and it seems that the only way it can ever run on multiple threads is to ditch the ref counting, which is something some competing implementations have done (JPython).
In languages and/or situations where you are certain that you are only ever going to run in a single thread, ref counting is better today than it used to be, because modern cpus are very wide OoOE machines, where instructions that don't lay on a critical path typically have no execution cost. Mozilla's Rust took this approach -- ref counting is available on objects so long as they have very clear thread ownership semantics. You can pass a refcounted object to another thread only by losing all references you hold to it.
Antti-Ville Tuunainen - 06 11 11 - 23:30
> The biggest problem refcounting has today is that safely incrementing/decrementing references to objects shared between threads is *expensive* on modern cpus.
On the other hand, it is perfectly sensible to decorrelate the counter and the object. Now, your object is counter-free (hurray), and you use thread-local queues to indicate inc/dec + object address. A master GC thread collects the results from the queues and deduces the dead objects. I know of at least one paper on this kind of implementation for the Eiffel language, and it worked pretty well (some work to do to avoid getting decs before the incs, but this can be solved quite easily).
Matthieu M - 07 11 11 - 05:08
If you would like to know more about GC I highly recommend "Garbage Collection Handbook", http://www.crcpress.com/product/isbn/978..
Iouri Goussev - 07 11 11 - 10:59
"2. ARC code actually ends up being faster because it avoids the runtime override implicit in Objective-C's message-based dispatch system. (This relates to point 1.)"
Currently, ARC's retain/release functions still call -retain and -release.
Preston Sumner - 08 11 11 - 09:54
Enjoyable article as always. I wanted to make a couple notes you might find useful thought:
Regarding "Neither specifies whether it is allowed to temporarily resurrect an object during destruction," WinRT's reference counting model starts with COM's IUnknown model. COM implicitly but unequivically forbid's resurrection via reference counting. The full rules for COM can be found here.
As you might have seen in the docs already, WinRT also adds new interfaces IWeakReference and IWeakReferenceSource. The intent is exactly as you describe, "cycles are still a problem but can often be avoided by judicious use of untracked weak pointers".
@Jon - I've been fascinated by Objective-C's "autorelease" mechanism. A common error in COM programming is to accidentally zero out a member pointer on an object that has a call still active on it -- e.g. a form with a button with an event handler, and the event handler nulls out the button pointer. I've seen this come up too often to keep blaming individual developers; it's a tooling problem, and there's probably a pattern that fixes it "right". However, autorelease is an easy quick fix that's close enough to right to get the job done. Honestly, I'm surprised I haven't seen it in more frameworks (e.g. not in COM, Python, etc.)
@antti-ville, you're spot on; refcounts _are_ expensive. X86 isn't too bad (~20 cycles in best case), but I expect to see a small rebellion against them on ARM (~100 cycles I think). I expect to see in the next refcounting framework both (a) thread local refcounts (plain ++ and --, releasing on the real object when it reaches zero) and (b) coallescing refcounts; perhaps in a manner similar to "autorelease", or perhaps only on a basic-block level.
I think there is a lot of runway left for refcounting. It's very tempting after all. As Phaeron alluded, refcounting provides deterministic release, which can be used to compose transaction-like operations (e.g. closing a file when the object that uses it is released). C# has a "using" statement for local variables. Refcounting provides similar deterministic cleanup for class member variables.
In any case, it was an interesting read, especially the comment discussion.
Aaron (link) - 08 11 11 - 10:22
... counting. The full rules for COM can be found here.
IanB - 09 11 11 - 07:14
I did not read all the comments... so please ignore this, if something similar was posted already.
I simply do not understand what a GC should be good for. A "real" language requires you to free your memory when finished. And this very important since it makes you think about what happens on method exit. Have seen a lot of problems caused by this. You wouldn't have them in eg. object pascal.
On the other hand, in java I did not find a way to cleanup automatically when my object is disposed. So there is only the hope the user of my library reads the docs and calls the cleanup method when finished.
And no - for non GC environments there is no need for extra overhead like "try ... finally free" if you do it the http://projectlombok.org/
@Cleanup Type var = new Type();
(well does not work for java, since it is GCed, but the lombok idea would work)
JPT - 08 12 11 - 06:24
@JPT-- GC is used mainly because programmers are lazy. @Cleanup Type/scoped_ptr work, for certain, but they don't allow you to return things from functions. Unique pointers are better for this, but, as implied by the name, you can't have more than one reference to an object at any time, which makes things harder. Quite aside from this, there are perf. advantages to garbage collection. The vast majority of objects don't have finalization semantics (destructors which free resources other than memory), so tracing GC can handle those objects correctly. More importantly, it can handle them all at once. malloc/free and new/delete have a cost--malloc needs to transverse free lists, free to coalesce adjacent blocks--and garbage collection can amortize away the cost. (Allocation in any sort of compacting GC is the cost of an add, a comparison, and a branch which is usually not taken; Baker's treadmill algorithm allows an O(1) construction of a doubly-linked free list after a collection cycle.) There are less direct costs of not using some form of GC, as well. Not using GC (whether tracing or reference-counted) means that copying large immutable data structures costs an allocation and a copy, whereas using GC allows those same structures to be "copied" by simply sharing pointers (and pseudo-atomically incrementing a reference-count field in the case of reference counting).
As for objects which do have destructors, I find to be a severe failing of both Java and C# the complete lack of compiler-enforced deterministic resource management (i.e. stack-allocated objects to implement the C++ smart-pointer classes).
NXTangl - 23 11 12 - 08:53