Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > [Slightly OT] Memory management and custom allocator

[Slightly OT] Memory management and custom allocator

 
 
Markus Wichmann, 01-01-2012
On 31.12.2011 18:52, FtM wrote:
> On Dec 31, 6:24 pm, Kaz Kylheku <(E-Mail Removed)> wrote:
>> On 2011-12-31, FtM <(E-Mail Removed)> wrote:
>>
>>> I would like to implement my custom allocator for the *nix and windows
>>> systems too (of course losing some portability for performance gain),
>>> and maybe implement something like the C++ allocators to improve the
>>> memory management of the different algorithms (like rare large
>>> allocations on vectors / frequent small allocations on lists), but

>>
>> What makes you think that your malloc doesn't already handle
>> such cases?
>>

>
> Simply because it can't know if I'm going to do 200 little allocations
> or 1 big one. Even if it has the most neat heuristic ever, the libc
> (or the kernel) exposing a simple specific function/syscall would do
> its job way better. Part of the question was about this hypothetical
> function.
>


malloc() is already optimized for exactly those cases, because they are by
far the most common ones: either you want a storage area whose size is
determined at run time, in which case you will probably make one big
allocation, or you are using data structures other than dynamic arrays
(including, but not limited to, linked lists, trees and hash tables), in
which case chances are you allocate each node separately and end up with
your n+1 little allocations.

Having read the glibc, eglibc, dietlibc and uclibc allocators, I can
confirm that those are the modes of operation they are tuned for. The
dietlibc allocator is especially easy to describe:

For one, it _never_ uses brk(). That syscall is just plain old and does
nothing mmap() couldn't do better. Then it keeps linked lists of free
objects in power-of-two sizes, starting at 16 and stopping at 2048. It
rounds the sum of the object header size and the requested size up to the
next power of two (using some nifty eye-watering bit-mangling) and serves
the request from the matching list. If a list runs empty, it is refilled
using mmap() or mremap().

If the object is bigger than 2048 bytes, it gets its own page(s)!

>> Say, have you profiled the code to see what fraction of the time is actually
>> spent executing the allocator code?
>>
>> Can you answer the question: how much faster will the program be if
>> a magic allocator is substituted which requires zero cycles to execute?
>>

>
> Let's see... So why do the C++ STLs (or boost, for another example) do
> exactly what I'm asking here?
>


Hmmm... I just had a look at the libstdc++ deployed with g++ 4.5.0 and I
saw that at least there, operator new calls malloc(). I also looked at
STLport and saw the same.

Also, you didn't answer the questions. Those are actually pretty good ones!

>>> I've some questions:
>>> - In these systems, to have decent performances, I was thinking to
>>> allocate page-sized (and page-aligned) chunks to have small
>>> allocations in the same page, but how is it possible to request some
>>> memory with this characteristics? Will it help the OS memory manager?

>>
>> Not knowing these basics is an indicator that you need to study more if you
>> want to write memory allocators that beat malloc.

>
> I already said that I don't want to "beat malloc", just to find a
> (probably platform-specific) way to facilitate its job.
> Ciao!
>


You are not making sense. You don't want to beat malloc(), but you still
want to write a malloc() replacement? You'd willingly choose the inferior
implementation?

> P.S. I'm sorry if I'm looking so.. childish(?).. to you but, believe
> me, I've written some physical/virtual memory allocators for many
> simpler system,


That's an entirely different can of worms. Believe me, memory management
that merely sits on top of an OS is far easier than the memory management
the OS itself has to maintain.

> I know how systems work behind the scenes, and I
> really don't want to start writing something like that: I'm just
> asking how to deal with the two specific environments above while
> masking the differences between the two and maintaining a standard
> interface. I know that it's hard and that someone already did it: I'm
> asking *how* he did that (and maybe where!)


You could look at the libc source codes. glibc, eglibc, dietlibc and
uclibc are a good place to start for Linux. For Windows I'd recommend
cygwin.

If you are interested in the syscalls that allocate memory: For *nix
that's mmap() with MAP_ANONYMOUS or, failing that, /dev/zero. Failing
_that_, you could try to map a newly created and truncated file. Try to
stay away from brk(), it's marked obsolete.

For Windows, have a look at HeapAlloc(), VirtualAlloc() and
LocalAlloc(). BTW: The MSVC implementation somehow manages to make
access to _one_ byte after the allocated buffer fail with a crash. Good
job, as it makes finding off-by-one errors easier.

But I still don't see your point. malloc() is meant to be portable! The
only thing you might want to watch out for is malloc(0), as that's
implementation-defined. Several allocators return NULL for it, but several
others do not: they hand you an object you can't access even the first
byte of.

HTH,
Markus
 
BartC, 01-01-2012
"Markus Wichmann" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On 31.12.2011 18:52, FtM wrote:


>> Simply because it can't know if I'm going to do 200 little allocations
>> or 1 big one.


> For one, it _never_ uses brk(). That syscall is just plain old and does
> nothing mmap() couldn't do better. Then it has linked lists prepared for
> objects of sizes of powers of two, starting at 16 and stopping at 2048.


That's exactly what I do in my own allocator! Except I stop at 1024,
although I probably manage the blocks very differently; for example, once a
small block is allocated at a given size, say 64 bytes, it always stays
that size.

And one thing about malloc() I don't like is the overhead of having to
remember block sizes: if I allocate 16 bytes, malloc() seems to reserve
either 20 or, more often, 24 bytes on the few implementations I've tried.
That's a 50% overhead!

Also, if you're being sensible and trying to keep to power-of-two
allocations, this 4- or 8-byte header is going to screw that up.

My allocators have never stored a block size, and it has never really been
a problem. Either you already know the size (a struct, for example), or you
need to keep track of it yourself, in which case knowing the capacity is
very handy when an allocation can grow: you never have to keep running back
to realloc().

As an illustration, take this simple example of a string growing from one to
ten million characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p;
    int i, len;

    p = malloc(1);
    *p = 0;
    len = 0;

    for (i = 1; i <= 10000000; ++i) {
        ++len;
        p = realloc(p, len+1);
        p[len-1] = '*';
        p[len] = 0;
        // puts(p);
    }
    printf("Length= %zu\n", strlen(p));
}

On three C compilers, this took about 12 seconds (on DMC, any length of
string resulted in a runtime of 0.4 seconds; a bit odd).

Using my strategy, of avoiding any calls to malloc or realloc unless
absolutely necessary, and using that inside an interpreter, this bit of
script took about 0.1 seconds:

string a

a:=""
to 10 million do
a+:="*"
od
println "Length=",a.len

Of course, real C code wouldn't use such a naive strategy; it too would
seek to minimise malloc()/realloc() calls. But that's also an admission
that these calls have overheads which it would be desirable to avoid. Hence
I can understand the OP's attempt to experiment with his own versions.

--
Bartc

 
Eric Sosman, 01-01-2012
On 1/1/2012 7:54 AM, BartC wrote:
> "Markus Wichmann" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>> On 31.12.2011 18:52, FtM wrote:

>
>>> Simply because it can't know if I'm going to do 200 little allocations
>>> or 1 big one.

>
>> For one, it _never_ uses brk(). That syscall is just plain old and does
>> nothing mmap() couldn't do better. Then it has linked lists prepared for
>> objects of sizes of powers of two, starting at 16 and stopping at 2048.

>
> That exactly what I do in my own allocator! Except I stop at 1024.
> Although I probably manage them very differently, for example once a
> small block size is allocated, say 64 bytes, it always stays that size.


Sounds like a recipe for internal fragmentation, but YMMV.

> And one thing about malloc() I don't like, are the overheads in having
> to remember the block sizes; if I allocate 16 bytes, malloc() seems to
> reserve either 20 or, more often, 24 bytes on the few compilers I've
> tried. That's a 50% overhead!


It's only significant if you have many such small allocations.
Even so, a malloc() implementation can find other ways to remember
the size than by storing it explicitly. For example, malloc() might
set aside regions of memory dedicated to "small" allocations: if a
pointer aims at an address anywhere in [A1,A1+N1), [A2,A2+N2), ... it
could simply be "known" to point at a 32-byte block. (The test can't
be done efficiently in portable C, but a malloc() implementation need
not be portable, nor written in C.)

> Also if you're being sensible and trying to keep to power-of-two
> allocations, this 4- or 8-byte overhead is going to screw that up.


What's "sensible" about allocating in powers of two? Why should
you allocate 128 bytes to hold a line of input, rather than 100 or
80 or 250? Programmers are numerologically superstitious.

> My allocators have never stored a block size, and it has never really
> been a problem. Either you will know the size (a struct for example), or
> you need to keep track of it yourself, in which case it is very handy
> when you have an allocation that can grow in never having to keep
> running back to realloc().


If your allocators never store block sizes, it follows that free()
cannot recycle a released block for re-use by a subsequent malloc(),
because it does not know whether the freed block is big enough to
satisfy the malloc() request. (Re-computing the size by examining the
addresses of nearby busy and free blocks counts as "storing the size"
in my book, just encoding it deviously.)

An allocator with a different API -- one whose free()-substitute
takes the block size as a parameter -- could recycle blocks. But now
you're no longer talking about the O.P.'s drop-in replacement for the
API as it stands.

> As an illustration, take this simple example of a string growing from
> one to ten million characters:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
> char *p;
> int i,len;
>
> p=malloc(1);
> *p=0;
> len=0;
>
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);


Side note: Never call realloc() this way in real code, because if
it returns NULL you'll have lost your only pointer to the original
block; you won't even be able to free() it any more.

> p[len-1]='*';
> p[len]=0;
> // puts(p);
> }


With an allocator that does not store the block size, this will
necessarily take a long time, for several reasons:

- Since realloc() does not know how long the block at `p' is, it
cannot tell whether there's any slop at the end that the block
could grow into without moving. Therefore, every realloc() call
must allocate a new block and copy the old contents into it.

- Since realloc() does not know how long the old block is, it
must copy `len+1' bytes from the old block to the new, even
though that's more than the old block holds. (We'll assume a
non-portable implementation can get away with copying the non-
existent bytes.) Total amount copied: 2+...+10000001 bytes;
that's fifty terabytes.

- Since realloc() does not know how long the old block was, it
cannot re-use that memory to satisfy subsequent requests. Thus,
every call allocates a brand-new block and makes the old block
unavailable. Total memory allocated: 1+2+...+10000001 bytes.
50TB to store a 10MB string is an overhead of fifty million
percent -- and you groused about a mere fifty???!

--
Eric Sosman
(E-Mail Removed)d
 
 
BartC, 01-01-2012

"Eric Sosman" <(E-Mail Removed)> wrote in message
news:jdpp3r$8h3$(E-Mail Removed)...
> On 1/1/2012 7:54 AM, BartC wrote:


>> That exactly what I do in my own allocator! Except I stop at 1024.
>> Although I probably manage them very differently, for example once a
>> small block size is allocated, say 64 bytes, it always stays that size.

>
> Sounds like a recipe for internal fragmentation, but YMMV.


That's what I once thought. I've used a similar scheme for years (blocks
could get smaller but not bigger), and originally prepared a defragmentation
algorithm, but I never needed it. (This was for managing many small
allocations in a script language.) It was self-sustaining.

>> Also if you're being sensible and trying to keep to power-of-two
>> allocations, this 4- or 8-byte overhead is going to screw that up.

>
> What's "sensible" about allocating in powers of two? Why should
> you allocate 128 bytes to hold a line of input, rather than 100 or
> 80 or 250? Programmers are numerologically superstitious.


More like habit. But if you allocate 4096 bytes thinking it will fit
neatly into one page of virtual memory, you will really need 4100 or 4104;
not quite as tidy. (This is an issue with my allocator, which does request
power-of-two sizes and is implemented on top of malloc().)

>> My allocators have never stored a block size, and it has never really
>> been a problem.


> An allocator with a different API -- one whose free()-substitute
> takes the block size as a parameter -- could recycle blocks. But now
> you're no longer talking about the O.P.'s drop-in replacement for the
> API as it stands.


Yes, that's what I used; free() takes an extra parameter.

>> p=realloc(p,len+1);


> With an allocator that does not store the block size, this will
> necessarily take a long time, for several reasons:


> - Since realloc() does not know how long the block at `p' is, it

<snip>

You're now talking about a hypothetical realloc() that really has no idea
what the original block size was?

That wouldn't work anyway: it would have no idea whether the block was
expanding or contracting, on top of the other issues you mention.

Such a function *has* to be able to determine the original size, whether by
storing it (as malloc()/realloc() presumably does), by deducing it, or by
simply being told. (Actually my scheme doesn't even have a realloc()-type
function, but with block sizes known, it would be trivial to write.)

--
Bartc

 
FtM, 01-01-2012
On Jan 2, 1:54 pm, "BartC" <(E-Mail Removed)> wrote:
> "Markus Wichmann" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed)...
>
> > On 31.12.2011 18:52, FtM wrote:
> >> Simply because it can't know if I'm going to do 200 little allocations
> >> or 1 big one.

> > For one, it _never_ uses brk(). That syscall is just plain old and does
> > nothing mmap() couldn't do better. Then it has linked lists prepared for
> > objects of sizes of powers of two, starting at 16 and stopping at 2048.

>
> That exactly what I do in my own allocator! Except I stop at 1024. Although
> I probably manage them very differently, for example once a small block size
> is allocated, say 64 bytes, it always stays that size.
>


Ok, thank you all for the suggestions (which basically amount to "use
malloc and don't bother"). I know the allocation functions are great as
they are: I'm using them right now and I have no problems at all! I was
only trying to carry over code from the embedded side, where I already have
some of it written, that's it. What's more, I already have some handlers
that help me debug the algorithms.

Anyway, regarding the allocator descriptions that Markus and BartC gave,
that's what I was thinking and partially already doing; let's see if this
makes more sense to you:
- first of all, I'm dealing with memory allocation only *inside* my
library, so I don't need to build a general-purpose allocator.
- inside, I'll have three different "objects": the first is for little
allocations, the second for the others, the third for the medium/big
allocations that will possibly need realloc(). The third allocator follows
the "normal" strategy (and could simply use the system calls, for that
matter), while the first and second could simply keep some preallocated
blocks and a vector marking whether each one is in use, skipping the
overhead a general-purpose allocator necessarily carries.

What do you think about this approach (despite the fact that in the end it
will probably give no noticeable advantage)?
Ciao!
 
Joe keane, 01-01-2012
In article <(E-Mail Removed)>,
FtM <(E-Mail Removed)> wrote:
>- Does all of this make any sense?


IMHE this comes up when:

a) you are not pleased with the default functions' time or space overhead
b) you like to control things
c) you want the 'memory' in memory-mapped files
d) you want to use shared memory

Joe keane, 01-01-2012
In article <jdpp3r$8h3$(E-Mail Removed)>,
Eric Sosman <(E-Mail Removed)> wrote:
>What's "sensible" about allocating in powers of two?


A) VM pages are powers of two, so with objects that aren't power-of-two
sized you waste space or take more VM pages per object than you would like

B) cache lines are powers of two, so with objects that aren't power-of-two
sized you waste space or take more cache lines per object than you would
like

>Programmers are numerologically superstitious.


I am numerologically superstitious, but you can also file this into the
'maybe someone has figured out something that we don't understand'
category.
 
BartC, 01-01-2012
"BartC" <(E-Mail Removed)> wrote in message
news:LPYLq.355$(E-Mail Removed)2...

[growing a string from empty, to 10 million chars]
> char *p;
> int i,len;
>
> p=malloc(1);
> *p=0;
> len=0;
>
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);
> p[len-1]='*';
> p[len]=0;
> }


> On three C compilers, this took about 12 seconds (on DMC, any length of
> string resulted in a runtime of 0.4 seconds; a bit odd).


Measuring the number of times the value of p changed: with the slow
compilers, it was respectively 2335, 2336, and 2337. With DMC, it was just
46.

Clearly DMC overallocates more than the others (growing allocations by a
factor of sqrt(2), apparently).

(My own code, which just uses doubling, will call a memory allocator only
about 20 times, so doesn't have the overheads of calling a function like
realloc() ten million times, hence it was quite quick even though it was
interpreted (but heavily optimised too..))

So with an application that has certain memory usage patterns, one
implementation can dramatically outperform another. That's one more reason,
even if you're not writing your own allocators, not to just use whatever is
provided without question.

--
Bartc

 
BartC, 01-01-2012
"FtM" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

> - inside, I'll have three different "objects": the first one is for
> little allocations, the second one for the others, the third one for
> the medium/big allocations that possibly will need to realloc(). The
> third allocator follows the "normal" strategy (and could simply use
> the system calls for that matter), while the first and the second one
> could simply have some preallocated blocks and a vector to mark if
> they are used or not, skipping the natural overhead needed by the
> general purpose allocator.
>
> What do you think about this approach (despite the fact that in the
> end it probably gives no noticeable advantages)?


I don't think there's enough information here; what sort of sizes will the
three classes of allocator handle?

Will any of them always be a fixed size? Will they need deallocating?
(Sometimes this isn't necessary, or the pool of memory from which they're
drawn is freed as one block.) What strategy will be used to grow the heap
from which the blocks come; in fact, does this heap need to grow at all?
Where does the memory come from in the first place? Etc.

--
Bartc

 