struct: type registration?

 
 
Giovanni Bajo
 
      06-01-2006
Hello,

given the ongoing work on struct (which I thought was a dead module), I was
wondering if it would be possible to add an API to register custom parsing
codes for struct. Whenever I use it for non-trivial tasks, I always happen to
write small wrapper functions to adjust the values returned by struct.

An example API would be the following:

============================================
def mystring_len():
    return 20

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

def mystring_unpack(s):
    assert len(s) == 20
    s = struct.unpack("20s", s)[0]
    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

struct.register("S", mystring_pack, mystring_unpack, mystring_len)

# then later
foo = struct.unpack("iilS", data)
============================================

This is only an example; any similar API which fits struct's internals
better would do as well.

As shown, the custom packer/unpacker can call the original pack/unpack as a
basis for their work. I guess an issue with this could be the endianness
problem: it would make sense if, when called recursively, struct.pack/unpack
used by default the endianness specified by the external format string.
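
For illustration only, here is a rough sketch of how the registry could be
prototyped in pure Python on top of the current module (my own guess at the
semantics, not a proposal for the real C implementation: it handles only
single-character codes, with no repeat counts, alignment or byte-order
prefixes, and names like _custom are of course hypothetical):

import struct

_custom = {}  # format code -> (packer, unpacker, length)

def register(code, packer, unpacker, length):
    _custom[code] = (packer, unpacker, length)

def pack(fmt, *values):
    out = []
    for code, value in zip(fmt, values):
        if code in _custom:
            packer, _, _ = _custom[code]
            out.append(packer(value))
        else:
            out.append(struct.pack(code, value))
    return "".join(out)

def unpack(fmt, data):
    values = []
    pos = 0
    for code in fmt:
        if code in _custom:
            _, unpacker, length = _custom[code]
            n = length()
            values.append(unpacker(data[pos:pos + n]))
        else:
            n = struct.calcsize(code)
            values.extend(struct.unpack(code, data[pos:pos + n]))
        pos += n
    return tuple(values)

With the mystring functions above registered via register("S", mystring_pack,
mystring_unpack, mystring_len), a call like unpack("iilS", data) would then
behave as in the example.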
--
Giovanni Bajo


 
John Machin
 
      06-01-2006
On 1/06/2006 10:50 AM, Giovanni Bajo wrote:
> Hello,
>
> given the ongoing work on struct (which I thought was a dead module), I was
> wondering if it would be possible to add an API to register custom parsing
> codes for struct. Whenever I use it for non-trivial tasks, I always happen to
> write small wrapper functions to adjust the values returned by struct.
>
> An example API would be the following:
>
> ============================================
> def mystring_len():
>     return 20
>
> def mystring_pack(s):
>     if len(s) > 20:
>         raise ValueError, "a mystring can be at max 20 chars"
>     s = (s + "\0"*20)[:20]


Have you considered s.ljust(20, "\0") ?

>     s = struct.pack("20s", s)
>     return s


I am an idiot, so please be gentle with me: I don't understand why you
are using struct.pack at all:

|>>> import struct
|>>> x = ("abcde" + "\0" * 20)[:20]
|>>> x
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
|>>> len(x)
20
|>>> y = struct.pack("20s", x)
|>>> y == x
True
|>>>

Looks like a big fat no-op to me; you've done all the heavy lifting
yourself.

>
> def mystring_unpack(s):
>     assert len(s) == 20
>     s = struct.unpack("20s", s)[0]


Errrm, g'day, it's that pesky idiot again:

|>>> z = struct.unpack("20s", y)[0]
|>>> z
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
|>>> z == y == x
True

>     idx = s.find("\0")
>     if idx >= 0:
>         s = s[:idx]
>     return s


Have you considered this:

|>>> z.rstrip("\0")
'abcde'
|>>> ("\0" * 20).rstrip("\0")
''
|>>> ("x" * 20).rstrip("\0")
'xxxxxxxxxxxxxxxxxxxx'

>
> struct.register("S", mystring_pack, mystring_unpack, mystring_len)
>
> # then later
> foo = struct.unpack("iilS", data)
> ============================================
>
> This is only an example; any similar API which fits struct's internals
> better would do as well.
>
> As shown, the custom packer/unpacker can call the original pack/unpack as a
> basis for their work. I guess an issue with this could be the endianness
> problem: it would make sense if, when called recursively, struct.pack/unpack
> used by default the endianness specified by the external format string.

 
Giovanni Bajo
 
      06-01-2006
John Machin wrote:

>> given the ongoing work on struct (which I thought was a dead
>> module), I was wondering if it would be possible to add an API to
>> register custom parsing codes for struct. Whenever I use it for
>> non-trivial tasks, I always happen to write small wrapper functions
>> to adjust the values returned by struct.
>>
>> An example API would be the following:
>>
>> ============================================
>> def mystring_len():
>>     return 20
>>
>> def mystring_pack(s):
>>     if len(s) > 20:
>>         raise ValueError, "a mystring can be at max 20 chars"
>>     s = (s + "\0"*20)[:20]

>
> Have you considered s.ljust(20, "\0") ?


Right. This happened to be an example...

>>     s = struct.pack("20s", s)
>>     return s

>
> I am an idiot, so please be gentle with me: I don't understand why you
> are using struct.pack at all:


Because I want to be able to parse large chunks of binary data with custom
formatting. Did you miss the whole point of my message:

struct.unpack("3liiSiiShh", data)

You need struct.unpack() to parse this data, and you need a custom
packer/unpacker to avoid post-processing the output of unpack() just because
it only knows about basic Python types. In binary structs, there happen to be
*types* which neither map 1:1 to Python types nor are just basic C types (like
the ones struct supports). Using a custom formatter is a way to represent
these types more faithfully (instead of mapping them to the "most similar"
type and then post-processing it).

In my example, "S" is a basic type meaning "a 0-terminated 20-byte string",
and expressing it in the struct format with the single letter "S" is more
meaningful in my code than using "20s" and then post-processing the resulting
string each and every time this happens.


>>>>> import struct
>>>>> x = ("abcde" + "\0" * 20)[:20]
>>>>> x

> 'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>>> len(x)

> 20
>>>>> y = struct.pack("20s", x)
>>>>> y == x

> True
>>>>>

>
> Looks like a big fat no-op to me; you've done all the heavy lifting
> yourself.


Looks like you totally misread my message. Your string "x" is what I find in
binary data, and I need to *unpack* it into a regular Python string, which
would be "abcde".


>
>>     idx = s.find("\0")
>>     if idx >= 0:
>>         s = s[:idx]
>>     return s

>
> Have you considered this:
>
>>>>> z.rstrip("\0")

> 'abcde'



This would not work because, in the actual binary data I have to parse, only
the first \0 is meaningful and terminates the string (like in C). There is
absolutely no guarantee that the rest of the padding is made of \0s as well.
--
Giovanni Bajo


 
Giovanni Bajo
 
      06-01-2006
Giovanni Bajo wrote:

> You need struct.unpack() to parse this data, and you need a custom
> packer/unpacker to avoid post-processing the output of unpack() just
> because it only knows about basic Python types. In binary structs, there
> happen to be *types* which neither map 1:1 to Python types nor are
> just basic C types (like the ones struct supports). Using a custom
> formatter is a way to represent these types more faithfully (instead of
> mapping them to the "most similar" type and then post-processing it).
>
> In my example, "S" is a basic type meaning "a 0-terminated 20-byte
> string", and expressing it in the struct format with the single
> letter "S" is more meaningful in my code than using "20s" and then
> post-processing the resulting string each and every time this happens.



Another compelling example is the SSH protocol:
http://www.openssh.com/txt/draft-iet...tecture-12.txt
Go to section 4, "Data Type Representations Used in the SSH Protocols", and it
describes the data types used by the SSH protocol. In a perfect world, I would
write some custom packers/unpackers for those types which struct does not
handle already (like the "mpint" format), so that I could use struct to parse
and compose SSH messages. What I ended up doing was writing a new module
sshstruct.py from scratch, which duplicates struct's work, just because I
couldn't extend struct. Some examples:

client.py:     cookie, server_algorithms, guess, reserverd = sshstruct.unpack("16b10LBu", data[1:])
client.py:     prompts = sshstruct.unpack("sssu" + "sB"*num_prompts, pkt[1:])
connection.py: pkt = sshstruct.pack("busB", SSH_MSG_CHANNEL_REQUEST, self.recipient_number, type, reply) + custom
kex.py:        self.P, self.G = sshstruct.unpack("mm", pkt[1:])

Notice for instance how "s" is an SSH string and unpacks directly to a Python
string, and "m" is an SSH mpint (infinite-precision integer) but unpacks
directly into a Python long. With struct.unpack() this would have been
impossible without a lot of post-processing.

Actually, another thing that struct should support to cover the SSH protocol
(and many other binary protocols) is the ability to parse strings whose size
is not known at import time (variable-length data types). For instance, the
"string" type in the SSH protocol is a string prepended with its size as a
uint32, so its actual size depends on each instance. For this reason, my
sshstruct did not have the equivalent of struct.calcsize(). I guess that if
there is a way to extend struct, it should cover variable-size data types too
(and calcsize() would return -1 or raise an exception for them).
--
Giovanni Bajo


 
John Machin
 
      06-01-2006
On 1/06/2006 9:52 PM, Giovanni Bajo wrote:
> John Machin wrote:
>
>>> given the ongoing work on struct (which I thought was a dead
>>> module), I was wondering if it would be possible to add an API to
>>> register custom parsing codes for struct. Whenever I use it for
>>> non-trivial tasks, I always happen to write small wrapper functions
>>> to adjust the values returned by struct.
>>>
>>> An example API would be the following:
>>>
>>> ============================================
>>> def mystring_len():
>>>     return 20
>>>
>>> def mystring_pack(s):
>>>     if len(s) > 20:
>>>         raise ValueError, "a mystring can be at max 20 chars"
>>>     s = (s + "\0"*20)[:20]

>> Have you considered s.ljust(20, "\0") ?

>
> Right. This happened to be an example...
>
>>>     s = struct.pack("20s", s)
>>>     return s

>> I am an idiot, so please be gentle with me: I don't understand why you
>> are using struct.pack at all:


Given a choice between whether I was referring to the particular
instance of using struct.pack two lines above, or whether I was doubting
the general utility of the struct module, you appear to have chosen the
latter, erroneously.

>
> Because I want to be able to parse large chunks of binary data with custom
> formatting. Did you miss the whole point of my message:


No.

>
> struct.unpack("3liiSiiShh", data)
>
> You need struct.unpack() to parse this data, and you need a custom
> packer/unpacker to avoid post-processing the output of unpack() just because
> it only knows about basic Python types. In binary structs, there happen to be
> *types* which neither map 1:1 to Python types nor are just basic C types
> (like the ones struct supports). Using a custom formatter is a way to
> represent these types more faithfully (instead of mapping them to the "most
> similar" type and then post-processing it).
>
> In my example, "S" is a basic type meaning "a 0-terminated 20-byte string",
> and expressing it in the struct format with the single letter "S" is more
> meaningful in my code than using "20s" and then post-processing the resulting
> string each and every time this happens.
>
>
>>>>>> import struct
>>>>>> x = ("abcde" + "\0" * 20)[:20]
>>>>>> x

>> 'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>>>> len(x)

>> 20
>>>>>> y = struct.pack("20s", x)
>>>>>> y == x

>> True
>> Looks like a big fat no-op to me; you've done all the heavy lifting
>> yourself.

>
> Looks like you totally misread my message.


Not at all.

Your function:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

can be even better replaced (after reading the manual: "For packing,
the string is truncated or padded with null bytes as appropriate to make
it fit.") by:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    return s
    # s = (s + "\0"*20)[:20]  # not needed, according to the manual
    # s = struct.pack("20s", s)
    # As I said, this particular instance of using struct.pack is a
    # big fat no-op.

> Your string "x" is what I find in
> binary data, and I need to *unpack* it into a regular Python string, which
> would be "abcde".
>


And you unpack it with a custom function that also contains a fat no-op:

def mystring_unpack(s):
    assert len(s) == 20
    s = struct.unpack("20s", s)[0]  # does nothing
    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

>
>>>     idx = s.find("\0")
>>>     if idx >= 0:
>>>         s = s[:idx]
>>>     return s

>> Have you considered this:
>>
>>>>>> z.rstrip("\0")

>> 'abcde'

>
>
> This would not work because, in the actual binary data I have to parse, only
> the first \0 is meaningful and terminates the string (like in C). There is
> absolutely no guarantee that the rest of the padding is made of \0s as well.


Point taken.

Cheers,
John


 
Giovanni Bajo
 
      06-01-2006
John Machin wrote:

>> Looks like you totally misread my message.

>
> Not at all.
>
> Your function:
>
> def mystring_pack(s):
>     if len(s) > 20:
>         raise ValueError, "a mystring can be at max 20 chars"
>     s = (s + "\0"*20)[:20]
>     s = struct.pack("20s", s)
>     return s
>
> can be even better replaced (after reading the manual: "For packing,
> the string is truncated or padded with null bytes as appropriate to
> make it fit.") by:
>
> def mystring_pack(s):
>     if len(s) > 20:
>         raise ValueError, "a mystring can be at max 20 chars"
>     return s
>     # s = (s + "\0"*20)[:20]  # not needed, according to the manual
>     # s = struct.pack("20s", s)
>     # As I said, this particular instance of using struct.pack is a
>     # big fat no-op.


John, the point of the example was to show that one could write a custom
packer/unpacker which calls struct.pack/unpack and, after that,
post-processes the results to obtain some custom data type. Now, I apologize
if my example wasn't exactly the shortest, most compact, most pythonic piece
of code. It was not meant to be. It was meant to be very easy to read and
very clear about what is being done. You are nitpicking that part of my code
is a no-op. Fine. Sorry if this confused you. I was just trying to show a
simple pattern:

custom packer: adjust data, call struct.pack(), return
custom unpacker: call struct.unpack(), adjust data, return

I should have probably chosen a more complex example, but I did not want to
confuse readers. It seems I have confused them by choosing too simple an
example.
--
Giovanni Bajo


 
Serge Orlov
 
      06-01-2006
Giovanni Bajo wrote:
> John Machin wrote:
> > I am an idiot, so please be gentle with me: I don't understand why you
> > are using struct.pack at all:

>
> Because I want to be able to parse large chunks of binary data with custom
> formatting. Did you miss the whole point of my message:
>
> struct.unpack("3liiSiiShh", data)


Did you want to write struct.unpack("Sheesh", data)? Seriously, the
main problem of struct is that it uses ad-hoc abbreviations for
relatively rarely[1] used function calls, and that makes it hard to
read.

If you want to parse binary data use pyconstruct
<http://pyconstruct.wikispaces.com/>

[1] Relative to regular expression and string formatting calls.

 
John Machin
 
      06-01-2006
On 2/06/2006 3:44 AM, Giovanni Bajo wrote:
> John Machin wrote:
>
>>> Looks like you totally misread my message.

>> Not at all.
>>
>> Your function:
>>
>> def mystring_pack(s):
>>     if len(s) > 20:
>>         raise ValueError, "a mystring can be at max 20 chars"
>>     s = (s + "\0"*20)[:20]
>>     s = struct.pack("20s", s)
>>     return s
>>
>> can be even better replaced (after reading the manual: "For packing,
>> the string is truncated or padded with null bytes as appropriate to
>> make it fit.") by:
>>
>> def mystring_pack(s):
>>     if len(s) > 20:
>>         raise ValueError, "a mystring can be at max 20 chars"
>>     return s
>>     # s = (s + "\0"*20)[:20]  # not needed, according to the manual
>>     # s = struct.pack("20s", s)
>>     # As I said, this particular instance of using struct.pack is a
>>     # big fat no-op.

>
> John, the point of the example was to show that one could write a custom
> packer/unpacker which calls struct.pack/unpack and, after that,
> post-processes the results to obtain some custom data type.


What you appear to be doing is proposing an API for extending struct by
registering custom type-codes (ASCII alphabetic?) each requiring three
call-back functions (mypacker, myunpacker, mylength).

Example registration for an "S" string (fixed storage length, true
length determined on unpacking by first occurrence of '\0' (if any)).

struct.register("S", packerS, unpackerS, lengthS)

You give no prescription for what those functions should do. You provide
"examples" which require reverse engineering to deduce what they are
intended to be exemplars of.

Simple-minded folk like myself might expect that the functions would
work something like this:

Packing: when struct.pack reaches the custom code in the format, it does
this (pseudocode):
    obj = _get_next_arg()
    itemstrg = mypacker(obj)
    _append_to_output_string(itemstrg)

Unpacking: when struct.unpack reaches a custom code in the format, it
does this (pseudocode):
    n = mylength()
    # exception if < n bytes remain
    obj = myunpacker(remaining_bytes[:n])
    _append_to_output_tuple(obj)

Thus, in a simple case like the NUL-terminated string:

def lengthS():
    return 20

def packerS(s):
    assert len(s) <= 20
    return s.ljust(20, '\0')
    # alternatively, return struct.pack("20s", s)

def unpackerS(bytes):
    assert len(bytes) == 20
    i = bytes.find('\0')
    if i >= 0:
        return bytes[:i]
    return bytes

In more complicated cases, it may be useful for either/both the
packer/unpacker custom functions to call struct.pack/unpack to assist in
the assembly/disassembly exercise. This should be (1) possible without
perturbing the state of the outer struct.pack/unpack invocation (2)
sufficiently obvious to warrant little more than a passing mention.
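
For instance (a made-up case, not part of Giovanni's proposal; packerC,
unpackerC and lengthC are illustrative names), a fixed-point amount stored as
a scaled little-endian int32 is a code whose callbacks genuinely use struct
internally:

import struct

def lengthC():
    return 4

def packerC(amount):
    # e.g. 12.34 is stored as the little-endian int32 1234
    return struct.pack("<i", int(round(amount * 100)))

def unpackerC(data):
    (cents,) = struct.unpack("<i", data)
    return cents / 100.0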

> Now, I apologize
> if my example wasn't exactly the shortest, most compact, most pythonic piece
> of code. It was not meant to be. It was meant to be very easy to read and
> very clear about what is being done. You are nitpicking that part of my code
> is a no-op. Fine.


Scarcely a nitpick. It was very clear that parts of it were doing
absolutely nothing in a rather byzantine & baroque fashion. What was
unclear was whether this was by accident or design. You say (*after* the
examples) that "As shown, the custom packer/unpacker can call the
original pack/unpack as a basis for their work. ... when called
recursively ...". What basis for what work? As for recursion, I see no
"19s", "18s", etc here


> Sorry if this confused you.


It didn't. As a self-confessed idiot, I am resolutely and irredeemably
unconfused.

> I was just trying to show a
> simple pattern:
>
> custom packer: adjust data, call struct.pack(), return
> custom unpacker: call struct.unpack(), adjust data, return
>
> I should have probably chosen a more complex example, but I did not want to
> confuse readers. It seems I have confused them by choosing too simple an
> example.


The problem was that you chose an example that had minimal justification
(i.e. only the length check) for a custom packer at all (struct.pack
pads the "s" format with NUL bytes) and no use at all for a call to
struct.unpack inside the custom unpacker.

Cheers,
John
 
John Machin
 
      06-01-2006
On 2/06/2006 4:18 AM, Serge Orlov wrote:
> Giovanni Bajo wrote:
>> John Machin wrote:
>>> I am an idiot, so please be gentle with me: I don't understand why you
>>> are using struct.pack at all:

>> Because I want to be able to parse large chunks of binary data with custom
>> formatting. Did you miss the whole point of my message:
>>
>> struct.unpack("3liiSiiShh", data)

>
> Did you want to write struct.unpack("Sheesh", data)? Seriously, the
> main problem of struct is that it uses ad-hoc abbreviations for
> relatively rarely[1] used function calls, and that makes it hard to
> read.


Indeed. The first time I saw something like struct.pack('20H', ...) I
thought it was a FORTRAN format statement.

>
> If you want to parse binary data use pyconstruct
> <http://pyconstruct.wikispaces.com/>
>


Looks promising on the legibility and functionality fronts. Can you make
any comment on the speed? Reason for asking is that Microsoft Excel
files have this weird "RK" format for expressing common float values in
32 bits (refer http://sc.openoffice.org, see under "Documentation"
heading). I wrote and support the xlrd module (see
http://cheeseshop.python.org/pypi/xlrd) for reading those files in
portable pure Python. Below is a function that would plug straight in as
an example of Giovanni's custom unpacker functions. Some of the files
can be very large, and reading rather slow.

Cheers,
John

from struct import unpack

def unpack_RK(rk_str): # arg is 4 bytes
    flags = ord(rk_str[0])
    if flags & 2:
        # There's a SIGNED 30-bit integer in there!
        i, = unpack('<i', rk_str)
        i >>= 2  # div by 4 to drop the 2 flag bits
        if flags & 1:
            return i / 100.0
        return float(i)
    else:
        # It's the most significant 30 bits
        # of an IEEE 754 64-bit FP number
        d, = unpack('<d', '\0\0\0\0' + chr(flags & 252) + rk_str[1:4])
        if flags & 1:
            return d / 100.0
        return d
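
For what it's worth, a quick round trip with a made-up value (not data from a
real file) shows the flag handling; this snippet is only illustrative:

from struct import pack

# 123 stored as a 30-bit integer: value shifted left by 2, with flag bit 1 set
# ("integer") and flag bit 0 clear ("don't divide by 100").
rk_bytes = pack('<i', (123 << 2) | 2)
print unpack_RK(rk_bytes)   # -> 123.0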
 
Serge Orlov
 
      06-02-2006
John Machin wrote:
> On 2/06/2006 4:18 AM, Serge Orlov wrote:
> > If you want to parse binary data use pyconstruct
> > <http://pyconstruct.wikispaces.com/>
> >

>
> Looks promising on the legibility and functionality fronts. Can you make
> any comment on the speed?


I don't really know. I used it for small data parsing, and its performance
was acceptable. As I understand it, it is implemented right now as pure
Python code using struct under the hood. The biggest concern is the
lack of comprehensive documentation; if that scares you, it's not for
you.

> Reason for asking is that Microsoft Excel
> files have this weird "RK" format for expressing common float values in
> 32 bits (refer http://sc.openoffice.org, see under "Documentation"
> heading). I wrote and support the xlrd module (see
> http://cheeseshop.python.org/pypi/xlrd) for reading those files in
> portable pure Python. Below is a function that would plug straight in as
> an example of Giovanni's custom unpacker functions. Some of the files
> can be very large, and reading rather slow.


I *guess* that the *current* implementation of pyconstruct will make
parsing slightly slower. But you have to try to find out.

> from struct import unpack
>
> def unpack_RK(rk_str): # arg is 4 bytes
>     flags = ord(rk_str[0])
>     if flags & 2:
>         # There's a SIGNED 30-bit integer in there!
>         i, = unpack('<i', rk_str)
>         i >>= 2  # div by 4 to drop the 2 flag bits
>         if flags & 1:
>             return i / 100.0
>         return float(i)
>     else:
>         # It's the most significant 30 bits
>         # of an IEEE 754 64-bit FP number
>         d, = unpack('<d', '\0\0\0\0' + chr(flags & 252) + rk_str[1:4])
>         if flags & 1:
>             return d / 100.0
>         return d


I had to look up what < means. Since nobody except this function cares
about the internals of an RK number, you don't need to use pyconstruct to
parse at the bit level. The code will be almost like what you wrote, except
you replace unpack('<d', ...) with Construct.LittleFloat64("").parse(...) and
plug unpack_RK into the pyconstruct framework by deriving from the Field
class. Sure, nobody is going to raise your paycheck because of this rewrite.
The biggest benefit comes from parsing the whole data file with pyconstruct,
not individual fields.

 
 
 
 