Serialization library, request for feedback

 
 
Ulf Åström
      12-14-2012
On Dec 14, 10:47 am, BGB <(E-Mail Removed)> wrote:
> On 12/14/2012 12:38 AM, Ian Collins wrote:
> > Ulf Åström wrote:
> >> On Dec 13, 10:57 pm, Keith Thompson <(E-Mail Removed)> wrote:
> >>> Ulf Åström <(E-Mail Removed)> writes:

>
> >>>> The user needs to set up translators (composed of one or more fields)
> >>>> for each type they wish to serialize. Internally it works by
> >>>> flattening each type into a list of void pointers and replacing them
> >>>> with numeric indices. The output format has a simple "<tag> <value>;"
> >>>> format; it is similar to JSON but it is not a reimplementation of it.

>
> >>> Why not just use JSON? It would make the flattened files accessible by
> >>> other tools.

>
> >> At a glance they seem to be mostly compatible but I don't know how
> >> well it would mesh with my memory offset <-> value layout or how I
> >> would perform pointer substitution. I haven't decided if I want to
> >> pursue a fully compliant JSON implementation, or take the format in
> >> another direction entirely.

>
> > Why make things complex? JSON is an ideal candidate for representing
> > structure and array types. It is after all designed as an object notation.

>
> yep, JSON makes sense.
>
> one possible downside though is that it doesn't normally identify object
> types, which means some means may be needed to identify what sort of
> struct is being serialized, and/or the type of array, ...
>
> it is either that, or deal with data in a dynamically-typed manner,
> rather than directly mapping it to raw C structs.


I can see a couple of problems with JSON. It is really just an
associative array and doesn't specify the type of objects, so I would
either need to guess (from the fields provided) or add a wrapper of
some kind telling what each element is. It also seems JSON has no
built-in way to do object references.

I can imagine something like

{
"some_struct #1" : { "property" : "value" },
"another_struct #2" : { "reference" : "#1" }
}

where another_struct.reference would be defined as a some_struct
pointer, so it seems I can get around this. I'm not sure how useful
the end result would really be, though; even if it is technically
valid as JSON data and other applications can parse it, they still
wouldn't know what to make of it.

Another (minor) problem is that JSON doesn't allow comments, while my
format does.

/Ulf
 
 
 
 
 
Ben Bacarisse
      12-14-2012
Ian Collins <(E-Mail Removed)> writes:

> Ulf Åström wrote:
>> On Dec 13, 10:57 pm, Keith Thompson <(E-Mail Removed)> wrote:
>>> Ulf Åström <(E-Mail Removed)> writes:
>>>
>>>> The user needs to set up translators (composed of one or more fields)
>>>> for each type they wish to serialize. Internally it works by
>>>> flattening each type into a list of void pointers and replacing them
>>>> with numeric indices. The output format has a simple "<tag> <value>;"
>>>> format; it is similar to JSON but it is not a reimplementation of it.
>>>
>>> Why not just use JSON? It would make the flattened files accessible by
>>> other tools.

>>
>> I'm considering it, but I haven't read the JSON specs in detail yet.

>
> There isn't a great deal to read...
>
>> At a glance they seem to be mostly compatible but I don't know how
>> well it would mesh with my memory offset <-> value layout or how I
>> would perform pointer substitution. I haven't decided if I want to
>> pursue a fully compliant JSON implementation, or take the format in
>> another direction entirely.

>
> Why make things complex? JSON is an ideal candidate for representing
> structure and array types. It is after all designed as an object
> notation.


I don't think JSON is a good fit for the OP's task. There are lots of C
data structures that have no obvious JSON representation at all.

<snip>
--
Ben.
 
 
 
 
 
Ulf Åström
      12-14-2012
On Dec 14, 3:53 am, Ben Bacarisse <(E-Mail Removed)> wrote:
> Ulf Åström <(E-Mail Removed)> writes:
> > On Dec 14, 12:11 am, Ben Bacarisse <(E-Mail Removed)> wrote:

> <snip>
> >> The other area that I could not understand is how to serialise multiple
> >> pointers into a single object. For example, how does one serialise

>
> >> struct ring_buffer {
> >>     T buffer[BSIZE];
> >>     T *head, *tail;
> >> };

>
> >> where 'head' and 'tail' both point to an element of 'buffer'? This may
> >> be just because there's no documentation yet -- I had a look at the API
> >> and did not immediately see how this could be done.

>
> > Here is a quick example (assuming int for T):

>
> > struct ring_buffer rbuf;
> > tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
> > field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
> > field->repeat = BSIZE;
> > ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
> > ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);

>
> > The 1s to ser_new_field indicate that it is a pointer.

>
> What about pointers to pointers?


It can be done; have a look at test_single_ptr() in demo.c (I still
need to write up a clean example for this). It requires an extra
translator for "pointer to type", so other translators can reference
the pointer type itself. In this case it also needs the _int wrapper
(with a single int at offset 0) since I haven't found a satisfactory
way yet to translate primitives on their own.

> > This exposed a bug, however; the int pointers will only be repointed
> > correctly (during inflation) if they point to rbuf.buffer[0]; other
> > addresses would be duplicated in memory. This is because it will only
> > check for pointers that exactly match a field offset. It should also
> > check all elements of an array.

>
> Thanks. I think I see what's happening. As I understand it, this
> suggests there must be a "root" structure for all of the data, but maybe
> what you are doing is more subtle than that.
>
> As an example, imagine a binary tree of nodes and a separate linked list
> of node pointers. Can I serialise this? Must I serialise the tree
> first?


This is just the kind of problem I'm hoping to solve.

It needs translators for the tree nodes and the list nodes. It is true
it only works on hierarchical data. The only way with the current API
would be to make a wrapper struct (and a translator for it) that
points to both the first tree node and the first list node; this should
be passed to ser_ialize(), as in the sketch below. I can imagine using a
variadic function to pass an arbitrary number of type/pointer pairs and
omit the wrapper, but another problem would arise at the other end, since
ser_parse() will only return a single pointer.
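
A rough sketch of such a wrapper, reusing the ser_new_tra()/ser_new_field()
pattern from my ring_buffer example (the struct and field names here are
just placeholders):

struct snapshot {
    struct tree_node *tree_root;
    struct list_node *list_head;
};

struct snapshot dummy;
tra = ser_new_tra("snapshot", sizeof(struct snapshot), NULL);
ser_new_field(tra, "tree_node", 1, "tree_root", &dummy, &dummy.tree_root);
ser_new_field(tra, "list_node", 1, "list_head", &dummy, &dummy.list_head);
/* ser_ialize() is then given the wrapper as the single root object */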

> >> As for the design itself, I got stuck on an issue in the first example.
> >> You seem to require a dummy object in order to pass pointers to its
> >> members to the set-up functions, and a comment purports to explain why
> >> this is needed but I just don't see it. The comment includes a
> >> statement that is wrong (if I understand it) that a pointer to the first
> >> member of a field may not equal (when suitably converted) a pointer to
> >> the structure object itself. Even if this were true, I don't see why
> >> that means you can't just pass the offset of the field. At the least,
> >> more explanation is required.

>
> > It's a typo; it should say "pos must be within st".

>
> I think that was a reference to the API documentation ("Note: pos must be
> within pos"). I was talking about the comment in example1.c:
>
> "In this example &thing_dummy and &thing_dummy.a will most probably be
> the same location. &thing_dummy.b on the other hand will typically end
> up 2 to 8 bytes from the base pointer (depending on the computer
> architecture). For this reason we do not pass the offset as a number
> but rely on the compiler to tell us exactly where the members will
> be."
>
> In your example, &thing_dummy.a and &thing_dummy must *always* be the
> same location so the phrase "will most probably be" just looks odd.


True; I was probably overthinking things and imagining a larger
structure.

> That aside, the following bit is what threw me: "For this reason we do
> not pass the offset as a number". Nothing previous to this seems to
> explain why you pass two pointers rather than an offset.
>
> > Using a dummy isn't strictly necessary, you may as well pass zero and
> > an offset. I designed it this way so people won't hardcode offset
> > values, then change their structures (or start porting their program
> > to a different architecture!) but forget to update the translator.

>
> That is a reason but not, to my mind, a strong one. To complicate the
> API because it's possible to so grossly abuse a simpler one seems to be
> contrary to the spirit of C. What's more, if you are serious about
> this, then you should not permit sizeof(myType) in the earlier call,
> since a programmer might hard code the size. Instead, they should be
> obliged to pass two pointers: &thing_dummy and &thing_dummy + 1. If you
> accept a simple size in one case, permit a simple offset in the other.


I agree; I should trust the programmer to know what they are doing. I
will be changing it to take only an offset.
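
With offsetof from <stddef.h> the revised call might look roughly like
this (ser_new_field_off is a hypothetical name for the changed function,
not the current API):

#include <stddef.h>  /* offsetof */

/* hypothetical: pass the member offset directly instead of two pointers */
ser_new_field_off(tra, "int", 1, "head",
                  offsetof(struct ring_buffer, head));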



/Ulf
 
 
Rui Maciel
      12-14-2012
Ulf Åström wrote:

> I'm wondering if there is anything crucial I have missed. It can deal
> with most primitive C types, strings, arrays and almost any
> arrangement of linked structures. There are still a few pieces
> missing; for example I would like to have a clean interface to let the
> user provide custom read/write functions for complex data types. I
> will also work on the type safety, so structures can't be linked
> incorrectly by erroneous or malicious input.


I believe that's a poor approach to a common problem which also has a very
common and simple solution: for the output, a single custom
operator/function that takes in a data type/struct and calls fprintf() to
dump the relevant text following a specific data format, and for the input,
a custom parser. That's it. No need to complicate things.
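
As a minimal sketch of that approach (the struct and the text format here
are invented for illustration):

#include <stdio.h>

struct point { int x, y; };

/* output: one custom function per type, dumping a specific text format */
void point_write(FILE *out, const struct point *p)
{
    fprintf(out, "point { x %d; y %d; }\n", p->x, p->y);
}

/* input: a custom parser that validates while it reads */
int point_read(FILE *in, struct point *p)
{
    if (fscanf(in, " point { x %d ; y %d ; }", &p->x, &p->y) != 2)
        return -1;  /* reject malformed input instead of importing garbage */
    return 0;
}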

The "let's blindly dump everything" approach is fundamentally broken,
because it actually doesn't describe the information. Instead, it dumps the
data as how it was stored in a particular data structure and omits any
semantics associated with it. This causes significant problems, because it
doesn't perform any integrity checks on the information and essentially
eliminates the possibility of adding the necessary sanity checks
to the parsing process, other than testing if a particular primitive data
type is actually valid. Therefore, dumping data makes it impossible to test
if the data which has been imported actually represents valid information.
And this is a bad thing.

The snags you've stumbled on are a natural consequence of trying to dump
data instead of exporting information, which happen to be trivial to fix
with the parser/document dump combo approach. Complex data types are never
a problem because the programmer is responsible for writing the necessary I/O
routines, and all the necessary sanity checks are automatically performed,
not only with regards to type safety but also in the form of semantics
checks, including the ability to infer if a problem is fatal or recoverable.


Rui Maciel
 
 
Rui Maciel
      12-14-2012
Ben Bacarisse wrote:

> I don't think JSON is a good fit for the OP's task. There are lots of C
> data structures that have no obvious JSON representation at all.


What data structures do you have in mind?


Rui Maciel

 
 
Ben Bacarisse
      12-14-2012
Rui Maciel <(E-Mail Removed)> writes:

> Ben Bacarisse wrote:
>
>> I don't think JSON is a good fit for the OP's task. There are lots of C
>> data structures that have no obvious JSON representation at all.

>
> What data structures do you have in mind?


My ring_buffer example is one. In fact, any structure where pointers are
"shared" would seem to be a bad fit with JSON.

Maybe "no obvious JSON representation" is too strong because you can
probably always map pointers to some sort of index or string label, but
it's not a natural fit.

--
Ben.
 
 
BGB
      12-14-2012
On 12/14/2012 4:23 AM, Ian Collins wrote:
> BGB wrote:
>> On 12/14/2012 12:38 AM, Ian Collins wrote:
>>>
>>> Other languages have better support for manipulating JSON objects, but
>>> at least one of them (PHP) uses a C library under the hood.

>>
>> yeah...
>>
>> I use variants of both JSON and S-Expressions, but mostly for
>> dynamically typed data.
>>
>> not depending on the use of type-tags and data mined from headers would
>> require a rather different implementation strategy.
>>
>> most of my code is C, but I make fairly extensive use of dynamic-type
>> tagging.

>
> I originally wrote a JSON library to enable my (C++) server side web
> application to interact with client side JavaScript. I soon found the
> objects extremely useful for building dynamic type objects in general
> programming. I doubt the same would be true in C, not elegantly at least.
>



I actually use a fair amount of dynamically-typed stuff in C.

another big area has to do with my scripting language (largely itself
based on JavaScript and ActionScript).


the main downsides have to do with performance, namely that
dynamically-typed code just doesn't perform anywhere near as well as
statically-typed code.

in many areas, this isn't a big issue, and/or leads to a hybrid
strategy, with some occasional type-checking, but much of the code
remains statically typed (for example, much of my 3D renderer largely
works this way).

typically, this would be using dynamic type-checking more for things
like classifying "what type of object is this?" and dispatching more to
the appropriate part of the renderer (or, IOW: type-checking on the
level of structures). a few places are "optimized" though, mostly to try
to avoid using dynamically-typed operations in tight-loops (such as
scene-graph object lookup operations, ...).


side note (main project):
http://www.youtube.com/user/BGBTech

http://www.youtube.com/watch?v=hhdq87CXL0U

(note, a lot of parts of the video are from much earlier versions of my
3D engine...).


(although the 3D engine has some amount of code written in my
script-language, I generally don't use script-code within the 3D
renderer, as the languages' performance is a bit weak, and the 3D
renderer is a lot more performance-sensitive).

sadly, performance is kind of a problem area sometimes, but much of this
more has to do with getting everything fed through OpenGL.


in my case, as-noted, the type-tagging is largely transparent to C code,
apart from the use of special allocation and freeing functions, and
supplying type-names, ...

these type-names can be retrieved via API calls, like:
BGBDY_API char *dygettype(dyt val);

or checked with functions like:
BGBDY_API int dyTypeP(dyt val, char *ty);


many operations on dynamically typed values may check types using these
mechanisms, but many internal operations essentially bypass this
(usually, because if/else chains of "strcmp()" calls can get expensive...).


elsewhere, for many core operations rather than fetching the type-name
from the object, the code will instead fetch a "type vtable":

BGBDY_API dyt dyCopyValue(dyt p)
{
    BGBDY_ObjVTab *ty;
    if(dyCheckValueNoCopyP(p))
        return(p);
    ty=BGBDY_GetObjVTabFast(p);
    if(ty && ty->copyValue)
        return(ty->copyValue(p));
    return(p);
}

BGBDY_API int dyDropValue(dyt p)
{
    BGBDY_ObjVTab *ty;
    if(dyCheckValueNoCopyP(p))
        return(0);
    ty=BGBDY_GetObjVTabFast(p);
    if(ty && ty->dropValue)
        return(ty->dropValue(p));
    return(0);
}

where BGBDY_GetObjVTabFast basically calls off to code for some
(slightly long/ugly) logic for fetching the object-base, fetching the
"ObjType", and returning its "vtable" member.



in a few places (due mostly to performance issues), some of this has
lead to frequent use of things like type-signatures and untagged union
types:

#ifndef DYTVALUE_T
#define DYTVALUE_T
typedef union
{
    s32 i;          //int
    u32 ui;         //uint
    s64 l;          //long
    u64 ul;         //ulong
    f32 f;          //float
    f64 d;          //double
    nlint a;        //address
    void *pv;       //raw pointer
    dyt r;          //variant reference
    dytf cr;        //variant reference (double)
    dycClass cls;   //class
    dycObject obj;  //object
    dycArray arr;   //array
    struct { int x, y; }k; //lexical env index
}dycValue;
#endif


the main reason being that these allow higher performance.

for the untagged union, it is because one can access values like:
dycValue a, b, c;
....
c.i=a.i+b.i;

and, apparently, the compiler is smart enough to figure this out, at
least (and generate vaguely efficient code).

the reason for type-signatures is that unlike full dynamic typing,
signatures are more readily usable with the "build stuff with function
pointers" strategy.

say, for example:
void DYLL_CopyValueBufI(dycValue *v, void *p)
    { *(s32 *)p=v->i; }
void DYLL_CopyValueBufL(dycValue *v, void *p)
    { *(s64 *)p=v->l; }
....

typedef struct {
    void (*CopyValueBuf)(dycValue *v, void *p);
    void (*CopyBufValue)(void *p, dycValue *v);
    ....
    int offs;
}DYLL_ValueTransfer;

typedef struct {
    DYLL_ValueTransfer *ret;
    DYLL_ValueTransfer *args;
    int nargs;
    void (*CopyValuesBuf)(DYLL_ArgList *ctx, dycValue *args, byte *buf);
    void (*CopyBufValues)(DYLL_ArgList *ctx, byte *buf, dycValue *args);
    ....
}DYLL_ArgList;

....

void DYLL_CopyValuesBuf3(DYLL_ArgList *ctx, dycValue *args, byte *buf)
{
    ctx->args[0].CopyValueBuf(args+0, buf+ctx->args[0].offs);
    ctx->args[1].CopyValueBuf(args+1, buf+ctx->args[1].offs);
    ctx->args[2].CopyValueBuf(args+2, buf+ctx->args[2].offs);
}

BGBDY_API void DYLL_CopyValuesBuf(DYLL_ArgList *ctx,
    dycValue *args, byte *buf)
    { ctx->CopyValuesBuf(ctx, args, buf); }

where the "DYLL_ArgList" structure may be built via a function which
parses the signature.

....

which, although arguably not exactly efficient, can always be a good
deal worse...

(there is a good deal more of this sort of logic for working with
"va_list" argument lists as well, typically for indirect calls
originating within C land; a lot of this stuff has to deal with calls
between natively compiled and interpreted code).

generally the "buffer" is a packed buffer-like representation of an
argument list, and is generally what is used by the ASM parts of the
call-handling machinery (which may in-turn repack these buffer into
ABI-specific representations, and then initiate the call).

partly, this is because it isn't really desirable to have to deal with
the various possible argument-list representations and ASM code at the
same time.


in some places, the function-pointer funkiness serves to bypass older
machinery which often involves lots of sometimes long if/else chains.


a downside though is that a lot of this is, admittedly, a bit more
painful to work with, and the code can get very long/winding and ugly
(so, the spread of this sort of code in many places within the VM core
has both good and bad points).

however, a lot of these sorts of mechanisms end up somewhat abstracted,
and so are generally unseen by "user code" (luckily, it all looks much
nicer from the outside).


or such...

 
 
BGB
      12-14-2012
On 12/14/2012 5:51 AM, Ulf Åström wrote:
> On Dec 14, 10:47 am, BGB <(E-Mail Removed)> wrote:
>> On 12/14/2012 12:38 AM, Ian Collins wrote:
>>> Ulf Åström wrote:
>>>> On Dec 13, 10:57 pm, Keith Thompson <(E-Mail Removed)> wrote:
>>>>> Ulf Åström <(E-Mail Removed)> writes:

>>
>>>>>> The user needs to set up translators (composed of one or more fields)
>>>>>> for each type they wish to serialize. Internally it works by
>>>>>> flattening each type into a list of void pointers and replacing them
>>>>>> with numeric indices. The output format has a simple "<tag> <value>;"
>>>>>> format; it is similar to JSON but it is not a reimplementation of it.

>>
>>>>> Why not just use JSON? It would make the flattened files accessible by
>>>>> other tools.

>>
>>>> At a glance they seem to be mostly compatible but I don't know how
>>>> well it would mesh with my memory offset <-> value layout or how I
>>>> would perform pointer substitution. I haven't decided if I want to
>>>> pursue a fully compliant JSON implementation, or take the format in
>>>> another direction entirely.

>>
>>> Why make things complex? JSON is an ideal candidate for representing
>>> structure and array types. It is after all designed as an object notation.

>>
>> yep, JSON makes sense.
>>
>> one possible downside though is that it doesn't normally identify object
>> types, which means some means may be needed to identify what sort of
>> struct is being serialized, and/or the type of array, ...
>>
>> it is either that, or deal with data in a dynamically-typed manner,
>> rather than directly mapping it to raw C structs.

>
> I can see a couple of problems with JSON. It is really just an
> associative array and doesn't specify the type of objects, so I would
> either need to guess (from the fields provided) or add a wrapper of
> some kind telling what each element is. It also seems JSON has no
> built-in way to do object references.
>


actually, it can do both, it is just maybe not entirely obvious.
it helps to imagine how JavaScript itself works; JSON is more or less a
direct mapping to the corresponding JavaScript objects.


> I can imagine something like
>
> {
> "some_struct #1" : { "property" : "value" },
> "another_struct #2" : { "reference" : "#1" }
> }
>
> where another_struct.reference would be defined as a some_struct
> pointer, so it seems I can get around this. I'm not sure how useful
> the end result would really be, though; even if it is technically
> valid as JSON data and other applications can parse it, they still
> wouldn't know what to make of it.
>
> Another (minor) problem is that JSON doesn't allow comments, while my
> format does.
>


this is all, not exactly right...


however, how it actually works gets a little less trivial to explain.

handling it more "effectively" would involve essentially throwing
together dynamic type facilities on top of C.

a way of imagining this, is if you have some sort of opaque type, which
can hold a value of any type (no, it is *not* a string).

a popular solution (though not the one I am using) is basically to treat
the low bits of a pointer as a "magic tag", for example:
low 3 bits:
0=raw pointer
1/5=integer (30-bit)
2/6=float (30-bit)
3, 4=reserved
7=misc values (grabs another 5 bits as an expanded tag, with a 24-bit
value).

(my system is based instead on "magic address ranges" for some types,
and encoding the object type in the memory-manager's header, for other
types).
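
a minimal sketch of this sort of low-bit tagging (the names here are made
up for illustration, and this is not my actual implementation):

#include <stdint.h>

/* low two bits 01 = integer, matching the 1/5 (30-bit integer) case above */
#define TAG_MASK   3u
#define TAG_FIXINT 1u

static void *tagint(int32_t v)    /* wrap a 30-bit integer */
    { return (void *)(((uintptr_t)(uint32_t)v<<2)|TAG_FIXINT); }

static int tagintp(void *p)       /* type-check via the low bits */
    { return ((uintptr_t)p&TAG_MASK)==TAG_FIXINT; }

static int32_t tagintv(void *p)   /* unwrap; the arithmetic shift restores
                                     the sign on typical compilers */
    { return (int32_t)(uint32_t)(uintptr_t)p>>2; }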


you might have functions, then, to wrap and unwrap values, say (using my
API as an example):
dyt dyint(s64 v); //wrap integer (or "long")
s64 dyintv(dyt v); //unwrap integer (or "long")
dyt dylong(s64 v); //wrap "long"
s64 dylongv(dyt v); //unwrap "long"
dyt dyfloat(float v); //wrap float
float dyfloatv(dyt v); //unwrap float
dyt dydouble(double v); //wrap double
double dydoublev(dyt v); //unwrap double
....

as well, as functions to perform "type-checks":
int dyintp(dyt v); //returns non-zero if value is an integer
int dyfloatp(dyt v);
....


and, so then, a person can write logic like:
if(dyintp(v))
{
    i=dyintv(v);
}else if(dyfloatp(v))
{
    f=dyfloatv(v);
}else if(dystringp(v))
{
    ...
}else if(dyObjectP(v))
{
    ...
}else ...


as for parsing JSON, it will generally involve looking at the syntax for
every token to identify what it is.

part of this is done in the "lexer"/"tokenizer", with the idea being
that the logic for splitting apart the input text into tokens, will
identify what sort of token it is looking at: numbers vs strings vs
braces vs operators vs ...

the parser logic will then build a representation of the data via the
tokens it sees.

a common strategy for hand-written parsers is "recursive descent", where
basically the code looks at the current tokens and sees what to do, and
then may call itself recursively in order to parse sub-items:
if(tokenty==TOK_INTEGER)
{
    value=dyint(atoi(token));
}else if(tokenty==TOK_REAL)
{
    value=dydouble(atof(token));
}else if(!strcmp(token, "{") && (tokenty==TOK_BRACE))
{
    ... parse object ...
}else if(!strcmp(token, "[") && (tokenty==TOK_BRACE))
{
    ... parse array ...
}else ...


a possible representation for a "naive" object, could be something like:
struct Object_s
{
    char **vars;
    dyt *vals;
    int n_vars;
    int m_vars;
};

with vars holding a list of member names, vals holding the values, and
n_vars holding a count for the number of members in an object, and
m_vars holding the max number of members (how many members can be added
before the arrays would need to be reallocated).


there are potentially more efficient ways to handle all this, but this
is a "general idea".


 
 
BGB
      12-14-2012
On 12/14/2012 9:22 AM, Ben Bacarisse wrote:
> Rui Maciel <(E-Mail Removed)> writes:
>
>> Ben Bacarisse wrote:
>>
>>> I don't think JSON is a good fit for the OP's task. There are lots of C
>>> data structures that have no obvious JSON representation at all.

>>
>> What data structures do you have in mind?

>
> My ring_buffer example is one. In fact, any structure where pointers are
> "shared" would seem to be a bad fit with JSON.
>
> Maybe "no obvious JSON representation" is too strong because you can
> probably always map pointers to some sort of index or string label, but
> it's not a natural fit.
>


yeah, these are general problems with data serialization, namely that
there is no good way to deal with "everything" within any reasonable
level of complexity.

it is easier, then, to find a reasonable subset of things that can be
serialized via a given mechanism, and use this.


more careful design of a serialization interface would be needed, for
example, to be able to deal with cyclic data structures, and, in turn,
to deal with preserving the identity of serialized objects (1).

1: if obja==objb, will this be true or false after the data is saved,
and re-loaded?...


some options will not preserve any identity, so if the same object is
reached via multiple paths, each may become separate objects when the
same image is read-in later.

other options may in-turn create false identity, where pretty much any
object which "looks the same" will have the same identity, regardless of
whether or not it was the same object prior to serialization.

identity becomes a particularly ugly issue if it needs to be dealt with
across the same object being saved across multiple images, or if the
network gets involved (currently, I just don't deal with this problem).
(this part gets much more important when trying to serialize things like
program state, say, where a "running program image" may be saved to disk
and reloaded later, and lots of little "subtle details" will need to be
preserved, or else things will often go "critically wrong" when the
program state gets reloaded later on.)
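
within a single image, a typical way to preserve identity is to keep a
table of already-seen pointers and hand out indices, so an object reached
via multiple paths serializes to the same index each time (a rough sketch
with made-up names; real code would use a hash table and grow it):

#define MAX_OBJS 1024
static void *seen_ptr[MAX_OBJS];
static int seen_count;

int obj_index(void *p)   /* stable index for p, adding it if new */
{
    int i;
    for(i=0; i<seen_count; i++)
        if(seen_ptr[i]==p)
            return(i);   /* same pointer => same index => identity kept */
    if(seen_count>=MAX_OBJS)
        return(-1);      /* table full */
    seen_ptr[seen_count]=p;
    return(seen_count++);
}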


or, at least, short of much simpler but very ugly solutions, like doing
a raw memory-dump of the program's heap (I have done this before, but it
"doesn't really scale well"), but more practical solutions to general
program heap serialization are overly complicated, and it is easier just
to deal with smaller scale serialization and reloading (such as
"serialize this particular data structure").

a basic idea for how the heap-dump serialization works is:
imagine, if you will, that on startup the application uses a
memory-mapped file as the main VM heap, and uses special magic to
preserve any "roots" (generally global variables). on exiting and
reloading, any objects in this heap can be preserved, with the "roots"
automatically restored to their prior values.

yeah, long ago, one of my prior VMs actually did this, although the heap
wasn't a single giant file, but rather a bunch of several-MB "heap chunks".

I later generally abandoned this strategy, as it had some particularly
nasty technical issues (every exit/reload cycle leaked memory that could
not be reclaimed, nasty "bit rot" problems, structure-layout problems, ...).

it is IMO better to actually just try to "serialize the contents of the
data", rather than just "stupidly dump the memory to the HDD and try to
reload it later".



but, in general, dealing with these sorts of issues requires more
careful attention to detail, and sadly, this isn't really something JSON
does very well.

not that this makes JSON useless though.


more generally though, I just use plain files for data storage, as they
just work so much better and more reliably.


one possible (partial) solution here for cyclic structures/... though is
an "extended and non-standard" JSON variant, say, where everything is
indexed.

#1 = { name: "foo", link: #2 };
#2 = { name: "bar", link: #3 };
#3 = { name: "baz", link: #4, first: #1 };
....
{ name: "someobject", listHead: #1 }

but, dealing effectively with a more comprehensive type-system would
require further somewhat extending the syntax, say, adding some
mechanism to identify up-front the type of object being serialized, ...


this is partly what my "somewhat hacked-up S-Expressions" do, even if
they are a little out of place (long ago, I had a Scheme implementation,
and even now, vestiges of it still remain, floating around in my
codebase...).


or such...

 
 
BGB
      12-14-2012
On 12/14/2012 8:25 AM, Rui Maciel wrote:
> Ulf Åström wrote:
>
>> I'm wondering if there is anything crucial I have missed. It can deal
>> with most primitive C types, strings, arrays and almost any
>> arrangement of linked structures. There are still a few pieces
>> missing; for example I would like to have a clean interface to let the
>> user provide custom read/write functions for complex data types. I
>> will also work on the type safety, so structures can't be linked
>> incorrectly by erroneous or malicious input.

>
> I believe that's a poor approach to a common problem which also has a very
> common and simple solution: for the output, a single custom
> operator/function that takes in a data type/struct and calls fprintf() to
> dump the relevant text following a specific data format, and for the input,
> a custom parser. That's it. No need to complicate things.
>
> The "let's blindly dump everything" approach is fundamentally broken,
> because it actually doesn't describe the information. Instead, it dumps the
> data as how it was stored in a particular data structure and omits any
> semantics associated with it. This causes significant problems, because it
> doesn't perform any integrity checks on the information and essentially
> eliminates the possibility of adding the necessary sanity checks
> to the parsing process, other than testing if a particular primitive data
> type is actually valid. Therefore, dumping data makes it impossible to test
> if the data which has been imported actually represents valid information.
> And this is a bad thing.
>
> The snags you've stumbled on are a natural consequence of trying to dump
> data instead of exporting information, which happen to be trivial to fix
> with the parser/document dump combo approach. Complex data types are never
> a problem because the programmer is responsible for writing the necessary I/O
> routines, and all the necessary sanity checks are automatically performed,
> not only with regards to type safety but also in the form of semantics
> checks, including the ability to infer if a problem is fatal or recoverable.
>


makes sense IMO...


personally, I far more often use specialized file-formats, rather than
dumping data structures, and usually if serialization is used, it is
usually limited mostly to certain sets of dedicated "safe to serialize"
data-types, which in my case are typically "lists" and
"dynamically-typed objects".

most types provide dedicated handlers for serializing and reading back
in their serialized forms, often using a small "version" tag as a way
to indicate which fields are present in the data (this itself has flaws,
but basically works); sometimes the handlers instead use individual tags
to indicate the various items within the serialized structure.
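
the version-tag idea, as a rough sketch (the Thing type and the
write_int/read_int helpers are invented here, standing in for whatever
sequential I/O functions are actually used):

#include <stdio.h>

typedef struct { int x, y, z; } Thing;

static void write_int(FILE *fd, int v) { fwrite(&v, sizeof(v), 1, fd); }
static int read_int(FILE *fd) { int v=0; fread(&v, sizeof(v), 1, fd); return v; }

void Thing_Write(FILE *fd, Thing *t)
{
    fputc(2, fd);                  /* version 2 of this record */
    write_int(fd, t->x);
    write_int(fd, t->y);
    write_int(fd, t->z);           /* z was added in version 2 */
}

void Thing_Read(FILE *fd, Thing *t)
{
    int ver=fgetc(fd);
    t->x=read_int(fd);
    t->y=read_int(fd);
    t->z=(ver>=2)?read_int(fd):0;  /* older data: field absent, default it */
}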

types which don't bother with version-tags are typically things like
integer values or strings, which are basically "too simple to justify a
version tag".


actually, my frequent use of sequential read/write functions is a major
part of why, very often, I end up using variable-length integer
representations.

for example, the vast majority of integer-values are fairly small
(basically, forming a bell-curve with its center around 0), making it so
that, on-average, small integers may only require 1 byte, and most
others are 2 or 3 bytes, independent of the physical storage-type of the
integer (32 or 64 bits).

like, although a variable-length integer isn't very good for
random-access data, it is pretty good for data that is read/written
sequentially.


 