free calls in the program. For sufficiently complex programs this approach leads to bugs: if the program frees an object that is still needed, later accesses through the resulting dangling reference can cause wrong behaviour or crashes.
If, on the other hand, the program conservatively assumes that the object will be needed when it is actually dead, the result is a memory leak. Depending on the application, this kind of bug is acceptable or disastrous.
The topic of garbage collection regularly comes up in comp.lang.forth, about twice a year. After the last time (in February 1998), I finally sat down and wrote one.
The result is a conservative mark-and-sweep collector. It has an administrative memory overhead of 1.6% (with another 1.6% needed during allocation), which is usually better than the constant per-object overhead of typical implementations of the memory allocation wordset (in addition to this administrative overhead there is also the more significant fragmentation problem, which should be similar for both approaches).
You can allocate memory with alloc ( usize -- addr ior ), just as with allocate. There is no need (and no way) to explicitly free this memory. The address returned is naturally aligned for every Forth data type (including dfaligned; natural alignment is alignment to multiples of the data type size). The memory is initialized with zeroes.
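A minimal use might look like this (a sketch; it assumes gc.fs has already been loaded and initialized as described below):

```forth
\ allocate a zero-filled block of 16 cells; there is no corresponding free
16 cells alloc throw   ( -- addr )  \ throw on allocation failure, as with allocate
42 over !              \ store into the first cell; the rest stays zero
drop                   \ once no visible reference remains, the block is garbage
```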
However, gc.fs needs to know if there is a reference to the object (if there is none, the object is certainly dead, and its storage can be recycled). You have to follow some rules to make sure gc.fs notices the reference:
The reference must be stored where gc.fs scans for references: e.g., in a cell registered as a root-address (see below), or in a naturally aligned cell in a referenced alloced object (a naturally aligned cell is one whose address is divisible by 1 cells). I would have liked to add the return and locals stacks, but unfortunately there is no standard way to access them, so they are out.
The reference must point to the start of the object (the address returned by alloc). I.e., you cannot use addresses pointing into the middle of an object (e.g., a pointer walking an array), unless you also keep a pointer to the start in a place where gc.fs will find it.
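The interior-pointer rule can be sketched like this (bounds is a common word, e.g. in Gforth, with stack effect ( addr u -- addr+u addr )):

```forth
\ sum a 10-cell alloced array; the base address stays on the data
\ stack (where gc.fs can see it) while the loop index walks the array
: sum-array ( base -- base n )
  0 over 10 cells bounds ?do  i @ +  1 cells +loop ;
```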
You can tell the garbage collector with root-address ( a-addr -- ) that a-addr is a memory location that may contain a reference to an alloced object.
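For example, a global variable meant to hold references can be registered once (a sketch; my-root is a made-up name):

```forth
variable my-root           \ a dictionary cell for holding a reference
my-root root-address       \ gc.fs will now scan this cell during marking
100 alloc throw my-root !  \ the object is reachable through my-root
```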
Before you can enjoy these luxuries, you have to load and initialize the garbage collector; there are two ways to do this:
The first way: load gc.fs with the configuration values on the stack: the size of the memory to be managed by alloc, and the flags described above. This memory cannot be extended later, so better make it large enough. It is allocated using allocate; for good Forth/OS combinations (e.g., Gforth on Linux) you can ask for a lot without having to pay for it until the memory is actually used (however, you usually have to pay for the administrative overhead (1/64) immediately, but only in virtual memory). Because of the way included and needs work, loading gc.fs does not consume these values, so you have to remove them yourself. A typical loading line might look like:

64 1024 * 1024 * ( 64MB ) true false s" gc.fs" included drop 2drop
The second way allows you to save an image of the system afterwards (note that memory allocated during initialization does not survive this operation).
For this, you have to load gc4image.fs with one argument, the flag selecting the mark method (see above). This flag is not consumed by loading the code. E.g.,
false s" gc4image.fs" included drop

To initialize the garbage collector, you have to set the size and verbosity (see above), and then call init-gc. E.g.:

64 1024 * 1024 * ( 64MB ) gc-setmem true gc-verbose init-gc
If you store the only reference to an object in a place where gc.fs does not look for it (e.g., in a cell that is not registered as a root-address, or an unaligned cell in an object), then the object can be reclaimed while still in use, and this reference can become a dangling reference.
gc.fs helps you find such bugs in the following way: it writes over the memory as soon as it reclaims it; this makes it easier for you to notice and find the bug. You can force a garbage collection (and its associated memory overwriting) with collect-garbage ( -- ); this can be a good idea just before checking a data structure for consistency during debugging.
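During debugging, a consistency-checking word might force a collection first, so that any prematurely reclaimed objects have already been overwritten (a sketch; list-consistent? is a made-up application word):

```forth
: check-list ( list -- )
  collect-garbage    \ reclaim and overwrite dead objects now
  list-consistent? 0= abort" list corrupted (dangling reference?)" ;
```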
Note that destroying a reference is easier than using free: you just have to decide whether one reference is no longer needed (and you can usually do that with local information), whereas with free you have to decide whether all references are no longer needed.
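Destroying a reference is just a store (a sketch; my-root is assumed to be a cell registered with root-address):

```forth
0 my-root !   \ drop the reference; if it was the last one, the object
              \ becomes garbage and a later collection reclaims it
```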
The Forth system has no type information, so gc.fs conservatively has to assume that everything is an address (at least everything that is aligned like an address and is in the right place). As a consequence, some integer can be interpreted as a reference and may cause a dead object to be kept around. Usually, this is no big problem, but if you use huge circular data structures, it can be a problem: a spurious reference to just one object in the data structure will keep the whole data structure alive; at least that's what gc.fs thinks. Similarly, for long linear lists, one spurious reference will on average keep half of the list from being reclaimed. If you have such data structures, be prepared for irregular memory consumption. For more information on this topic, read [boehm93].
If you write to memory you should not write to (e.g., beyond the bounds of an object, or to memory that was never alloced), you may damage the data structures used by the garbage collector. This may result in mysterious bugs.
To make it easier to reveal such problems, gc.fs includes some consistency checks of its data structures. At assertion level 2 (2 assert-level ! before loading gc.fs), a number of checks are done, including a consistency check of the freelists before (and, to check against collector bugs, after) each garbage collection. At assertion level 3, the collector additionally checks the freelists before each allocation (this causes an extreme slowdown and is mainly useful for overnight regression test runs).
Although gc.fs does not support dynamic resizing directly, you can have it at the cost of an extra indirection, with the following programming technique:
In addition to the object to be resized, allocate (with alloc) a one-cell object: the handle, which contains a pointer to the object. Always access the object indirectly; do not keep direct pointers to it. In this way you know that only the handle points to the object to be resized, so when you want to resize it, you can simply alloc an object of the new size, copy the contents from the old to the new object, and store the address of the new object in the handle.
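The technique might be coded like this (a sketch; the names new-handle and resize-handle are made up, and resize-handle assumes the object grows):

```forth
: new-handle ( usize -- handle )  \ alloc an object and a handle pointing to it
  1 cells alloc throw swap        ( handle usize )
  alloc throw over ! ;            ( handle )

: resize-handle ( handle u-old u-new -- )  \ replace the object, keep the handle
  alloc throw                  ( handle u-old new-addr )
  rot 2dup 2>r                 ( u-old new-addr handle ) ( R: new-addr handle )
  @ swap rot move              \ copy the old contents into the new object
  2r> ! ;                      \ store the new object's address in the handle
```

Access always goes through the handle, e.g. my-handle @ to get the current address of the object.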
This gives you resizable objects, at the price that Amerman was willing to pay, without imposing any additional overhead for non-resizable objects.
The amount of memory actually used is determined in the following way:

memory = (a * livemem / b) + c

where livemem is the amount of memory in live objects after the last garbage collection, and a, b, and c are tunable parameters. By default, a/b is 3 and c is 256KB.
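As a worked example (with assumed numbers): using the default a/b = 3 and c = 256KB, a program with 50KB of live memory gets a limit of 3*50KB + 256KB = 406KB:

```forth
\ memory = a * livemem / b + c, with the default a/b = 3 and c = 256KB
50 1024 *  3 *     ( 3 * livemem )
256 1024 *  +      ( + c )
1024 / .           \ the limit in KB: 406
```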
You can tune these parameters for your application: you can reduce the memory, at the cost of a higher number of garbage collections, or you can increase it to reduce the number of garbage collections. Note that if you reduce memory too much (to less than about 2*livemem), the number of garbage collections rises extremely. And if a garbage collection does not free an appropriate memory chunk, alloc will grab more memory regardless of the limit.
The following code sequence demonstrates the adjustment of parameters by resetting them to their default values:
also garbage-collector 3 1 ( a b ) size-factor 2! 256 1024 * ( c ) grain/ size-offset ! set-current-limit previous

The graph shows the effects of setting the limit to a fixed amount of memory. The benchmark (also described in Section Efficiency) constantly has 500-1000 live objects of size 16-80 bytes (average: 51 bytes). Note that the amount of memory actually used is about 50KB during most of the benchmark unless the limit is higher (so reducing the limit below that just makes the benchmark run slower without having a significant effect on memory consumption). The jaggedness of the timings in the middle area is probably due to the freelist implementation and the random sizes of the allocated objects.
It is widely assumed that garbage collection is slower than explicit deallocation with free. In general, this is wrong: copying collectors make it possible to use a very fast, allot-like allocator, and the collection overhead can be made arbitrarily small by using more memory (techniques like generational garbage collectors make it possible to reduce collection overhead without taking as much memory). In contrast, a memory allocator with explicit deallocation has to manage a freelist, which creates quite a bit of overhead.
However, a conservative non-copying collector also has to manage a freelist, so it will typically be somewhat slower than using explicit deallocation. But if we take the inefficiency of the programming techniques used to handle explicit deallocation into account, and spend the time we saved on programming and debugging correct frees for performance tuning, garbage collection is easily faster.
You may be more interested in real timings than in theoretical discussions, therefore I have converted one of the test programs into a benchmark: bench-gc.fs. It allocates 250,000 objects of size 2 cells+0..63 aus (10,793,040 bytes on a byte-addressed 32-bit machine); 1,000 of these objects (41,800 bytes) are alive at the end.
You can load bench-gc.fs with a parameter between 1 and 6
on the stack, resulting in different storage allocation options. You
can run the whole suite with make bench
on your
system. The following table shows the parameter, the storage
allocation option, and the time used (user time or user+system time)
on Gforth-0.3.0 on a 486-66 running Linux.
1  GC with default tuning           23.33
2  alloc-nocollect and free-chunk   32.63
3  allocate and free                 5.64
4  GC tuned for no collection       10.55+0.66
5  allocate                          4.76+1.02
6  allot                             4.67+0.61

Note that this benchmark does hardly more than allocating and forgetting objects, making these operations the dominant factor in its run-time. For an application that does something useful, the difference between the rows will be smaller.
The first row shows garbage collection as it is normally used, the second uses the freelist management words of the garbage collector to implement explicit deallocation (i.e., without using the garbage collector). Comparing these two numbers is probably the fairest comparison of the two approaches. Surprisingly, explicit deallocation is slower (by a factor of 1.4) than garbage collection here. I have some theories to explain this, but no empirical support for any of them. Anyway, this evidence suggests that the overhead of garbage collection is small compared to the time spent in freelist management.
Of course, for explicit deallocation one would usually use allocate and free. The result of using this combination is shown in the third row. On Gforth it is much faster than the two gc.fs-based solutions. My explanation for this is:

- allocate and free are primitives that invoke C functions that were compiled into native code with an optimizing C compiler. [ertl&maierhofer95] reports that Gforth's threaded code is 4-8 times slower than optimized C code.
- allocate and free have probably been written for speed (it is well-known that slow implementations of these functions result in significant overhead for allocation-intensive programs).
By comparing rows 1, 2, and 4 we see the overhead of garbage collection, freeing, and freelist management (for row 4 the freelist is empty, and therefore fast). The comparison of rows 4 and 5 again shows the performance advantage of optimized C code over Gforth's threaded code. A comparison of rows 4 and 6 shows that even with an empty freelist, alloc has a significant overhead over allot: checking that the freelist is empty, remembering the borders between objects, checking whether a garbage collection should be performed, and zeroing the returned memory.