Re: Handling delimited strings

 
 
michael@preece.net
      10-29-2005

(E-Mail Removed) wrote:
> One problem I see: your delimiters being in the range 128..255 rather
> assume that "real" file identifiers will not contain characters with
> these values, and this is not the case in Windows.
>
> International experience (the use of Chinese characters in Windows file
> identifiers) has shown me that owing to the proprietary character of
> Windows, the file identifier's syntax was never defined, to my
> knowledge, formally and instead a minimal file syntax applies where ANY
> unicode character other than the semicolon, backslash, asterisk and
> question mark can be and will be accepted by most Windows installations
> as part of the file id.
>
> It is well known also that the period doesn't left-delimit the file
> type, instead the file name to the right of the type can contain
> multiple periods with the right period delimiting the type.
>
> If M$ means Microsoft, then I suggest you BNF formulate the minimal
> syntax of a file identifier and use this to parse the file identifier.
>


Sorry. I'm a bit confused. I was only looking for something to handle
delimited text strings within a single file. How do M$'s file naming
"conventions" come into it? Were you expanding on the idea of using
ReiserFS instead of a program? I realise that the characters within
each string will be limited to the ASCII chars 0-127 inclusive (except
that I'd also like to exclude char 0).

If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact). I guess you could say that any
characters allowed in XML should be allowed. Further, think of two
associated delimited strings - one to hold markup etc., the other the
data.
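
A minimal C sketch of pulling one field out of such a nested delimited string. The thread only says the delimiters lie in 128..255, so the mapping used here (nesting level n is delimited by the byte 255 - n) is an assumption, and the names level_delim and find_field are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical mapping: the delimiter for nesting level n (0..127) is the
   byte 255 - n, so delimiters occupy 128..255 and data stays in 1..127. */
static unsigned char level_delim(unsigned level)
{
    return (unsigned char)(255u - level);
}

/* Locate field number `index` (0-based) at nesting level `level` within
   buf[0..len). Stores the start of the field in *out and returns its
   length, or returns (size_t)-1 if there are fewer than index + 1 fields. */
size_t find_field(const unsigned char *buf, size_t len,
                  unsigned level, size_t index,
                  const unsigned char **out)
{
    unsigned char d = level_delim(level);
    size_t start = 0;

    for (size_t i = 0; i <= len; i++) {
        /* The end of the buffer closes the last field, like a delimiter. */
        if (i == len || buf[i] == d) {
            if (index == 0) {
                *out = buf + start;
                return i - start;
            }
            index--;
            start = i + 1;
        }
    }
    return (size_t)-1;
}
```

To descend one level, call find_field again on the returned field with level + 1. Nothing is copied, so the result is a pointer and length into the original buffer rather than a NUL-terminated string.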

Mike.

 
Steve O'Hara-Smith
      10-29-2005
On 28 Oct 2005 23:16:53 -0700
(E-Mail Removed) wrote:

> If you're wondering where I'm heading with this, think of nested data -
> like XML (only far more compact).


If that's the goal look into ASN1.

--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
| http://www.sohara.org/
 
spinoza1111@yahoo.com
      10-30-2005

(E-Mail Removed) wrote:
> (E-Mail Removed) wrote:
> > One problem I see: your delimiters being in the range 128..255 rather
> > assume that "real" file identifiers will not contain characters with
> > these values, and this is not the case in Windows.
> >
> > International experience (the use of Chinese characters in Windows file
> > identifiers) has shown me that owing to the proprietary character of
> > Windows, the file identifier's syntax was never defined, to my
> > knowledge, formally and instead a minimal file syntax applies where ANY
> > unicode character other than the semicolon, backslash, asterisk and
> > question mark can be and will be accepted by most Windows installations
> > as part of the file id.
> >
> > It is well known also that the period doesn't left-delimit the file
> > type, instead the file name to the right of the type can contain
> > multiple periods with the right period delimiting the type.
> >
> > If M$ means Microsoft, then I suggest you BNF formulate the minimal
> > syntax of a file identifier and use this to parse the file identifier.
> >

>
> Sorry. I'm a bit confused. I was only looking for something to handle
> delimited text strings within a single file. How do M$'s file naming
> "conventions" come into it? Were you expanding on the idea of using
> ReiserFS instead of a program? I realise that the characters within
> each string will be limited to the ASCII chars 0-127 inclusive (except
> that I'd also like to exclude char 0).


OK, my mistake. Thought you were parsing a file name. You said "in
filename" and not "in the file".
>
> If you're wondering where I'm heading with this, think of nested data -
> like XML (only far more compact). I guess you could say that any
> characters allowed in XML should be allowed. Further, think of two
> associated delimited strings - one to hold markup etc., the other the
> data.
>



> Mike.


 
michael@preece.net
      10-31-2005

(E-Mail Removed) wrote:

>
> OK, my mistake. Thought you were parsing a file name. You said "in
> filename" and not "in the file".
>


I guess you see now that I meant "in the file called FILENAME". Sorry
for the confusion. The capitals aren't meant to be read aloud, btw - it's
just a kind of notation that has become a habit, where the capitalized
word relates to a declared variable (or constant). Well - I know what I
mean.

Cheers
Mike.

 
michael@preece.net
      10-31-2005

Steve O'Hara-Smith wrote:

> > If you're wondering where I'm heading with this, think of nested data -
> > like XML (only far more compact).

>
> If that's the goal look into ASN1.
>


Isn't ASN1 mostly about encoding data *type* - along with the data?
That's a separate issue. I'm looking to handle nested delimited strings
of any, or no specified, type. The data type (required for conversion
to/from ASN1, say) of each delimited string, or group of strings, along
with any other metadata such as markup, can be described or defined in
an associated nested delimited string, or two, or three, or whatever.

Nested data is all around: indented program code, newsgroups and the
threads within them, folders/directories, etc. It would be nice
to have a really simple way to represent and manipulate nested
structures up to 128 levels deep - much simpler than ASN1 and much more
compact than XML, and yet easily transformable into either, or any
other, format.

If you take any nested data in any format - XML is an obvious example -
it should be possible to represent it as a simple delimited string as I
described in my OP. It would be good, I reckon, if I (with a little
help) can come up with simple cross-platform tools to perform the
functions also described in my OP.

Cheers
Mike.

 
Steve O'Hara-Smith
      10-31-2005
On 30 Oct 2005 17:02:25 -0800
(E-Mail Removed) wrote:

>
> Steve O'Hara-Smith wrote:
>
> > > If you're wondering where I'm heading with this, think of nested data -
> > > like XML (only far more compact).

> >
> > If that's the goal look into ASN1.
> >

>
> Isn't ASN1 mostly about encoding data *type* - along with the data?


I looked around for some references to give you, but in the current
ASN.1 documentation I found it hard to spot the nested tag-length-value
mechanism that I met as ASN.1 around 1990. I think it's still there
under the hood of the standard types and constructions, though.

The essence of what I was thinking about was nested TLV
structures which always seemed to me to be more robust than the
paired delimiters of XML.
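
A rough C sketch of what writing nested TLV records could look like. The layout (1-byte tag, 2-byte big-endian length, then the value) and the name tlv_put are assumptions made for illustration; ASN.1's own encoding rules are more elaborate than this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: 1-byte tag, 2-byte big-endian length, then value. */
enum { TLV_HDR = 3 };

/* Append one record to out[0..cap) at offset `pos`.
   Returns the new offset, or 0 if the record does not fit. */
size_t tlv_put(unsigned char *out, size_t cap, size_t pos,
               unsigned char tag, const void *val, size_t len)
{
    if (len > 0xFFFF || pos > cap || cap - pos < TLV_HDR + len)
        return 0;
    out[pos]     = tag;
    out[pos + 1] = (unsigned char)(len >> 8);    /* length, high byte */
    out[pos + 2] = (unsigned char)(len & 0xFF);  /* length, low byte  */
    if (len > 0)
        memcpy(out + pos + TLV_HDR, val, len);
    return pos + TLV_HDR + len;
}
```

Nesting falls out of the layout: build the child records in a scratch buffer, then pass that buffer as the value of a container record. Because every value carries its length, the data bytes themselves are unrestricted and no delimiter ever has to be escaped.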

--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
| http://www.sohara.org/
 
Michael Wojcik
      11-01-2005

[Followups restricted to comp.programming.]

In article <(E-Mail Removed)>, Steve O'Hara-Smith <(E-Mail Removed)> writes:
>
> The essence of what I was thinking about was nested TLV
> structures which always seemed to me to be more robust than the
> paired delimiters of XML.


What would make TLV (by which I assume you mean type-length-value
vectors, presumably with binary, fixed-length encodings for type and
length) more robust than XML? It has less redundancy, and therefore
less capacity for error detection and correction.

A trivial example: say type is a single octet, and all 256 type codes
are defined. Then it is impossible to detect if a type value is
wrong (for whatever reason - program error, transmission error, etc),
without additional context.
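
The point can be shown with a small C sketch of a purely structural check. The layout (1-byte type, 1-byte length, value) and the name tlv_well_formed are assumptions for illustration only.

```c
#include <stddef.h>

/* Assumed layout, for illustration only: 1-byte type, 1-byte length, value. */

/* Return 1 if buf[0..len) is a whole number of records whose lengths fit
   exactly, 0 otherwise. Every type byte is accepted, since all 256 codes
   are taken to be defined. */
int tlv_well_formed(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    while (pos < len) {
        if (len - pos < 2)
            return 0;                 /* truncated header */
        size_t vlen = buf[pos + 1];
        if (len - pos - 2 < vlen)
            return 0;                 /* value runs past the end */
        pos += 2 + vlen;
    }
    return 1;
}
```

If a type byte is changed from one defined code to another, this check still passes, because nothing in the encoding contradicts the new value. A damaged length byte may be caught, but only when it happens to make the records overrun the buffer.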

XML makes many tradeoffs, and there are certainly applications where
a TLV encoding of some sort is preferable due to various plausible
constraints. But TLV is not "more robust" than XML in general.

That said, I agree that nested TLV structures look like a better
choice for representing arbitrarily structured data than the OP's
proposal of in-band signalling with special flag bytes. In-band
signalling means restricting the domain of ordinary data values, which
means some kind of shift-encoding of values that are outside that
domain, and that's invariably a mess, error-prone, difficult to enhance
while maintaining backward compatibility, and inefficient.
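
For a sense of what such shift-encoding involves, here is a minimal C sketch. The escape byte 0x1B, the three-byte escape form, and the function names are all invented for illustration; nothing in the thread specifies them.

```c
#include <stddef.h>

enum { ESC = 0x1B };   /* hypothetical escape byte, chosen for illustration */

/* A byte needs escaping if it is NUL, the escape byte itself, or falls in
   the 128..255 range reserved for delimiters. */
static int needs_escape(unsigned char b)
{
    return b == 0 || b == ESC || b >= 0x80;
}

/* Encode in[0..n) into out[0..cap). Each escaped byte becomes ESC followed
   by its two nibbles, each OR-ed with 0x40 so they stay in plain ASCII.
   Returns the encoded length, or (size_t)-1 if out is too small. */
size_t esc_encode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        size_t need = needs_escape(in[i]) ? 3 : 1;
        if (cap - o < need)
            return (size_t)-1;
        if (need == 3) {
            out[o++] = ESC;
            out[o++] = (unsigned char)(0x40 | (in[i] >> 4));
            out[o++] = (unsigned char)(0x40 | (in[i] & 0x0F));
        } else {
            out[o++] = in[i];
        }
    }
    return o;
}

/* Reverse esc_encode. Returns the decoded length, or (size_t)-1 on a
   truncated or malformed escape sequence, or if out is too small. */
size_t esc_decode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b == ESC) {
            if (n - i < 3 || (in[i + 1] & 0xF0) != 0x40
                          || (in[i + 2] & 0xF0) != 0x40)
                return (size_t)-1;
            b = (unsigned char)(((in[i + 1] & 0x0F) << 4) | (in[i + 2] & 0x0F));
            i += 2;
        }
        if (o == cap)
            return (size_t)-1;
        out[o++] = b;
    }
    return o;
}
```

Even this small scheme triples the size of every escaped byte and forces every reader and writer of the data to agree on the escape rules, which illustrates the cost described above.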

--
Michael Wojcik (E-Mail Removed)

Unfortunately, as a software professional, tradition requires me to spend New
Years Eve drinking alone, playing video games and sobbing uncontrollably.
-- Peter Johnson
 
Dave Thompson
      11-14-2005
On 30 Oct 2005 17:02:25 -0800, (E-Mail Removed) wrote:

>
> Steve O'Hara-Smith wrote:
>
> > > If you're wondering where I'm heading with this, think of nested data -
> > > like XML (only far more compact).

> >
> > If that's the goal look into ASN1.
> >

>
> Isn't ASN1 mostly about encoding data *type* - along with the data?
> That's a separate issue. I'm looking to handle nested delimited strings
> of any, or no specified, type. The data type (required for conversion
> to/from ASN1, say) of each delimited string, or group of strings, along
> with any other metadata such as markup, can be described or defined in
> an associated nested delimited string, or two, or three, or whatever.
>

Not inherently. ASN.1 is about encoding any structure defined in a
(specified) data language.

You could certainly do n-ary trees of character strings as array of
(discriminated) either string or (recursively) tree of strings. And
since these types have different primitive tags, you don't need any
added application tags. IIRC, may not be exactly right, I don't
currently have tools or references at hand to check:

StringTree ::= SEQUENCE OF CHOICE { leaf IA5String, subtree StringTree }

or to include the (trivial) case of only one string

StringTree ::= CHOICE { leaf IA5String, subtree SEQUENCE OF StringTree }
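
For comparison, the second definition maps fairly directly onto a discriminated union in C. This is only a sketch of the in-memory shape, not of any ASN.1 encoding, and the type and function names are invented for illustration.

```c
#include <stddef.h>

/* A node is either a leaf string or a sequence of child nodes, mirroring
   the CHOICE between a string and a SEQUENCE OF StringTree.
   All names here are illustrative. */
struct string_tree {
    enum { ST_LEAF, ST_SEQ } kind;      /* discriminator for the union */
    union {
        const char *leaf;               /* valid when kind == ST_LEAF */
        struct {
            const struct string_tree *items;
            size_t count;
        } seq;                          /* valid when kind == ST_SEQ */
    } u;
};

/* Count the leaf strings anywhere under t. */
size_t st_count_leaves(const struct string_tree *t)
{
    if (t->kind == ST_LEAF)
        return 1;
    size_t total = 0;
    for (size_t i = 0; i < t->u.seq.count; i++)
        total += st_count_leaves(&t->u.seq.items[i]);
    return total;
}

/* Nesting depth: 0 for a bare leaf, 1 + deepest child for a sequence. */
size_t st_depth(const struct string_tree *t)
{
    if (t->kind == ST_LEAF)
        return 0;
    size_t deepest = 0;
    for (size_t i = 0; i < t->u.seq.count; i++) {
        size_t d = st_depth(&t->u.seq.items[i]);
        if (d > deepest)
            deepest = d;
    }
    return deepest + 1;
}
```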

ASN.1 is frequently, I think probably more often than not, _used_ in
applications where it is desirable to encode data with type to allow
for extensibility and upgradability in distributed applications. For
example in crypto applications, the ones I have mostly worked on, when
we want to transmit or store a key, what is in the key depends on the
algorithm used, and we know from experience that over time new
algorithms will be created and wanted, so standards like X.509 and
PKCS 4, 10, 8/12 have ASN.1 constructs roughly equivalent to:
struct { OID-identifying-algorithm , data-depending-on-that-OID }

That way when some subset of the users and systems add a new
algorithm, the other ones can unambiguously recognize that it's
something they don't know (yet); and with only a little care in
defining the ASN.1 they can skip the data they don't understand, and
as long as they don't actually need to process that data (only store
or forward it etc.) can proceed OK without even being upgraded. This
is useful for applications that want it, but not mandatory.
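
The skipping behaviour can be sketched in C with a simplified record format. The layout (1-byte type, 2-byte big-endian length, value), the T_TEXT code, and the name tlv_scan are assumptions for illustration; real ASN.1 encodings identify algorithms by OID and use more involved length rules.

```c
#include <stddef.h>

/* Assumed layout, for illustration: 1-byte type, 2-byte big-endian length. */
enum { T_TEXT = 1 };   /* the only type this reader understands */

/* Walk the records in buf[0..len), adding up the value bytes of T_TEXT
   records into *text_bytes and counting records of any other type in
   *skipped. Returns 0 on success, -1 if a record is truncated. */
int tlv_scan(const unsigned char *buf, size_t len,
             size_t *text_bytes, size_t *skipped)
{
    size_t pos = 0;
    *text_bytes = 0;
    *skipped = 0;
    while (pos < len) {
        if (len - pos < 3)
            return -1;                 /* truncated header */
        unsigned char type = buf[pos];
        size_t vlen = ((size_t)buf[pos + 1] << 8) | buf[pos + 2];
        if (len - pos - 3 < vlen)
            return -1;                 /* value runs past the end */
        if (type == T_TEXT)
            *text_bytes += vlen;
        else
            (*skipped)++;              /* unknown: step over the value */
        pos += 3 + vlen;
    }
    return 0;
}
```

The reader never inspects the value of a record it does not recognise. The length alone is enough to step over it, so an older program can keep working on data that contains newer record types.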

That said, I basically concur with mwojcik: ASN.1 is _a_ choice, with
advantages and disadvantages; there are others. One of the features,
IMO often a disadvantage, it shares with XML is that both are designed
very generally, to handle essentially everything anybody wants, so
tools that handle that generality are usually complex and arguably
bloated. But if you don't use those tools and develop your own more
limited specific ones you (must) reimplement quite a few wheels.

- David.Thompson1 at worldnet.att.net
 