On Jan 5, 7:51 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> James Kanze <james.ka...@gmail.com> wrote innews:5b17c708-0ca7-40aa-9dfd-:
> > On Jan 4, 5:31 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> >> Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote in
> >> news:fe8f1a4b-1315-
> >> 4755-a30a-3566a52bd...@i17g2000vbq.googlegroups.com:
> >> > I am working on an application to manage Tehran Securities
> >> > Exchange organization. In this program, I try to manipulate
> >> > a lot of institutional investors (about 1,500,000). The
> >> > Investors are a struct like this:
> >> > struct Investor {
> >> > int ID;
> >> > std::wstring NID;
> >> > std::wstring Name;
> >> > std::wstring RegNum;
> >> > std::wstring RegDate;
> >> > std::wstring RegProvince;
> >> > std::wstring RegCity;
> >> > std::wstring Type;
> >> > std::wstring HQAddr;
> >> > // the deblanked or squeezed info
> >> > std::wstring sq_Name;
> >> > std::wstring sq_RegNum;
> >> > std::wstring sq_RegProvince;
> >> > std::wstring sq_RegCity;
> >> > };
> >> Size of this struct is 584 bytes with my compiler. 1500000*584 is ca
> >> 835 MB. You are holding them twice (both in vector and map), which
> >> makes it ca 1.7 GB. This brings you already quite close to the
> >> maximum limits for 32-bit applications (2G in Windows by default),
> >> hence the memory problems are expected.
> > Independantly of sizeof(Investor), all those strings are likely
> > to require additional memory (unless they are empty, or unless
> > the compiler uses the small string optimization, and they are
> > small enough---but things like names, provinces and cities
> > generally aren't). This could easily double the necessary
> > memory (with small string optimization) or multiply it by 5 to
> > 10, or more (without small string optimization, but then,
> > sizeof(Investor) would probably be around 64).
> You are right of course, I was assuming most of these strings
> are short and fit in the small string optimization buffer.
What are usual values for the small string cutoff? I seem to
remember 8 being used somewhere, but if sizeof this structure is
584, and wchar_t is 2 bytes (Windows), then I'd guess more
likely 16 (or 8 if wchar_t is 4 bytes, as is the case on most
Unix systems). With 16, your almost certainly right---most of
these fields would fit into 16 characters (supposing the field
names mean what they seem to mean, and judging from the names I
see on maps, etc.). That does change things.
> If he used UTF-8 encoded
> std::string instead of std::wstring, this assumption might be true more
> often (assuming the texts are mostly ASCII of course).
He said that this was for the Teheran securities exchange. The
local language in Teheran uses the Farsi alphabet, a modified
(or extended) form of the Arabic alphabet. In UTF-8, these are
all two byte characters, in the range D8 80--DB BF. If wchar_t
uses UTF-16, he gains absolutely nothing. (If it uses UTF-32,
on the other hand, he divides his memory use by 2. Maybe.)
As to whether it fits into the small string optimization or not,
all depends on how that is implemented. If the optimization
limit depends on the number of code points (char or wchar_t, and
the sizeof you mention would seem to suggest that), then he will
miss the small string optimization more often using UTF-8 than
using UTF-16 or UTF-32. (If the limit depends on the number of
bytes, then UTF-16 and UTF-8 are roughly equal, supposing most
text written in Farsi.)
> >> The correct solution would be to redesign the application so that you
> >> don't need to load all the data in the memory at once. Normally
> >> database applications are happy to load only the data they need at
> >> the moment.
> > It's hard to say what the correct solution is without more
> > context, but there's a very good chance you're right, at least
> > if the actual data is stored in a relational data base.
> >> Another solution would be to use a 64-bit machine, OS and
> >> compilation, this lifts the limits a bit, but does not help if your
> >> database happens to grow 100 times.
> > If he's near the limits on a 32 bit machine, a 64 bit machine
> > should multiply the number he can handle by significantly more
> > than a 100. Moving to 64 bits is probably the simplest
> > solution, *IF* the program really needs to keep all of the data
> > virtually in memory.
> The address space limits would be gone in 64-bit machine, but the new
> limits will be set by the physical RAM and disk space present. A normal
> desktop PC would have e.g. 8GB of RAM, which is only 4 times larger than
> the 2GB limit for Windows. The disks are maybe 200GB or something, if one
> (ab)uses that all for the pagefile, then one could hold theoretically
> hold ca 100 times more data in process memory than for a 32-bit process
> (assuming one has enough patience to wait for such process to complete
> anything!).
Terabyte disks and larger are readily available, and not
expensive. On a 64 bit system, he should easily be able to get
rid of the bad_alloc. With in return so much swapping that the
system becomes unusable.
> >> Yet another solution is to reduce the size of
> >> Investor, for example putting all strings together in a single
> >> std::wstring, separated by some delimiter like a zero byte. This
> >> would make finding, accessing and changing the individual fields
> >> slower and more cumbersome of course.
> > I'm not sure that that would gain that much.
> There are lots of space overhead associated with each std::wstring. In my
> implementation it seems the size of std::wstring is 32 bytes in 32-bit
> compilations, from which only 16 bytes is used for the small string
> buffer.
That seems like a lot, although... It's not too difficult to
implement the small string optimization with an overhead of only
two pointers. On the other hand, any quality implementation
will have a debugging option, which will require extra pointers
to keep a list of iterators, etc. From what I've seen, I have
the impression that VC++ tries to make it possible to link
debugging and non-debugging versions (although at least in VC8,
this doesn't actually work), and so uses the extra space even
when debugging is turned off.
> For example, if all strings are very short and fit in the small
> string buffers, then one can easily reduce the overall memory
> consumptions by half by packing them all in a single string. If the
> strings are large then this would not help of course.
If the strings are large, you would only have one allocation,
instead of many. And each allocation has significant overhead
as well (maybe 16 bytes), without mentionning the small string
buffer which isn't used.
No, you're right. Using a single string with a separator will
represent a significant gain (unless the strings are really,
really long, which I doubt is the case). Regardless of the
implementation of std::basic_string, and whether the strings fit
into the small string buffer or not.
> > What is probably,
> > on the other hand, is that there are a lot of duplicates in the
> > fields with province and city; using a string pool for these
> > could help. And it may be possible to generate the sq_...
> > values algorithmically from the non sq_ value; in that case,
> > they could be eliminated completely from the structure. But
> > none of these optimizations will last for long, if the table
> > significantly increases in size.
> Right
Even without many duplicates: using string pooling would allow
putting all of the actual character data in a single string,
which as you correctly point out, is a win in itself. Each
string in his structure would have type PooledString, which
would consist of two size_t (indexes to the begin and end of the
string in the pool). On a 32 bit machine, that's 8 bytes,
instead of 584; there's still the overhead for the pool, but it
wouldn't surprise me if this halved the total memory use.
--
James Kanze