# f python?

BartC
Guest
Posts: n/a

 04-08-2012
"Kaz Kylheku" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

> Worse, the one byte Unix mistake being covered is, disappointingly, just a
> clueless rant against null-terminated strings.
>
> Null-terminated strings are infinitely better than the ridiculous
> encapsulation of length + data.
>
> For one thing, if s is a non-empty null terminated string then, cdr(s) is
> also a string representing the rest of that string without the first
> character, where cdr(s) is conveniently defined as s + 1.

If strings are represented as (ptr,length), then a cdr(s) would have to
return (ptr+1,length-1), or (nil,0) if s was one character. No big deal.

(Note I saw your post in comp.lang.python; I don't know about any implications
of that for Lisp.)

And if, instead, you want to represent all but the last character of the
string, then it's just (ptr,length-1). (Some checking is needed around empty
strings, but similar checks are needed around s+1.)

In addition, if you want to represent the middle of a string, then it's also
very easy: (ptr+a,b).
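The three (ptr,length) operations described above can be sketched in Python. This is only an illustration under stated assumptions: the class name StrView and the method names are hypothetical, an integer offset into a shared bytes object stands in for ptr, and an empty view stands in for (nil,0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrView:
    """A counted string: a shared buffer plus (start, length)."""
    buf: bytes    # shared storage; the operations below never copy it
    start: int    # plays the role of ptr
    length: int

    def cdr(self) -> "StrView":
        # (ptr+1, length-1); an empty view stands in for (nil, 0)
        if self.length <= 1:
            return StrView(self.buf, 0, 0)
        return StrView(self.buf, self.start + 1, self.length - 1)

    def butlast(self) -> "StrView":
        # (ptr, length-1), with the empty-string check the post mentions
        if self.length <= 1:
            return StrView(self.buf, 0, 0)
        return StrView(self.buf, self.start, self.length - 1)

    def middle(self, a: int, b: int) -> "StrView":
        # (ptr+a, b): b characters starting a characters in
        if a < 0 or b < 0 or a + b > self.length:
            raise ValueError("slice out of range")
        return StrView(self.buf, self.start + a, b)

    def text(self) -> str:
        # Copies only here, to make the view printable
        return self.buf[self.start:self.start + self.length].decode("ascii")


s = StrView(b"bart", 0, 4)
print(s.cdr().text())         # art
print(s.butlast().text())     # bar
print(s.middle(1, 2).text())  # ar
```

None of the three operations touches the character data; each only builds a new (start, length) pair over the same buffer.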

> Not only can compilers compress storage by recognizing that string
> literals are the suffixes of other string literals, but a lot of string
> manipulation code is simplified, because you can treat a pointer to
> interior of any string as a string.

Yes, the string "bart" also contains "art", "rt" and "t". But with counted
strings, it can also contain "bar", "ba", "b", etc....

There are a few advantages to counted strings too...

> length + data also raises the question: what type is the length field? One
> byte? Two bytes? Four?

Depends on the architecture. But 4+4 for 32-bits, and 8+8 bytes for 64-bits,
I would guess, for general flex strings of any length.

There are other ways of encoding a length.

(For example I use one short string type of maximum M characters, but the
current length N is encoded into the string, without needing any extra count
byte (by fiddling about with the last couple of bytes). If you're trying to
store a short string in an 8-byte field in a struct, then this will let you
use all 8 bytes; a zero-terminated one, only 7.)
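The post does not say how the length is folded into the last bytes, so the following is one possible scheme, not the poster's actual encoding. It assumes an 8-byte field and that no character has a byte value below 8 (true of printable ASCII); the names M, pack_short and unpack_short are hypothetical. A full field is all payload; a shorter string is zero-padded and its length N is stored in the last byte.

```python
M = 8  # field width in bytes (assumption: the 8-byte struct field from the post)


def pack_short(text: bytes) -> bytes:
    """Pack up to M bytes into exactly M bytes with no separate count byte."""
    if len(text) > M:
        raise ValueError("too long for the field")
    if any(b < M for b in text):
        raise ValueError("byte values below M are reserved for the length")
    if len(text) == M:
        return text  # full: every byte is payload
    # Not full: zero padding, then the length N (0..M-1) in the last byte.
    return text + bytes(M - 1 - len(text)) + bytes([len(text)])


def unpack_short(field: bytes) -> bytes:
    """Recover the string from an M-byte field produced by pack_short."""
    if len(field) != M:
        raise ValueError("field must be exactly M bytes")
    last = field[-1]
    n = last if last < M else M  # a payload byte in the last slot means "full"
    return field[:n]


print(unpack_short(pack_short(b"abcdefgh")))  # b'abcdefgh'
print(unpack_short(pack_short(b"abc")))       # b'abc'
```

The price of this particular scheme is the restriction on byte values; a zero-terminated field has a similar restriction (no zero bytes) but can hold only 7 characters.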

> And then you have issues of byte order.

Which also affects every single value of more than one byte.

> Null terminated C strings can be written straight to a binary file or
> network socket and be instantly understood on the other end.

But they can't contain nulls!
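A small Python sketch of the two wire formats makes the trade-off concrete. The function names are hypothetical, and the 4-byte big-endian count is an assumption chosen for the example; fixing the width and byte order explicitly is what answers the "what type is the length field" and byte-order objections.

```python
import struct


def read_cstring(buf: bytes) -> bytes:
    """Read a NUL-terminated string: everything up to the first zero byte."""
    end = buf.index(b"\x00")  # raises ValueError if the terminator is missing
    return buf[:end]


def write_counted(payload: bytes) -> bytes:
    """Length-prefixed framing; '>I' fixes the count as 4 bytes, big-endian."""
    return struct.pack(">I", len(payload)) + payload


def read_counted(buf: bytes) -> bytes:
    """Read one length-prefixed string, rejecting a count that overruns buf."""
    (n,) = struct.unpack_from(">I", buf, 0)
    if len(buf) - 4 < n:
        raise ValueError("length field exceeds available data")
    return buf[4:4 + n]


data = b"ab\x00cd"
print(read_cstring(data + b"\x00"))       # b'ab'
print(read_counted(write_counted(data)))  # b'ab\x00cd'
```

The terminated form silently truncates at the embedded zero byte, while the counted form round-trips it at the cost of four extra bytes and an agreed byte order.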

> Null terminated strings have simplified all kinds of text manipulation,
> lexical scanning, and data storage/communication code resulting in
> immeasurable savings over the years.

They both have their uses.

--
Bartc

Nobody
Guest
Posts: n/a

 04-08-2012
On Sun, 08 Apr 2012 04:11:20 -0700, Xah Lee wrote:

> Ok no problem. My sloppiness. After all, my implementation wasn't
> portable. So, let's fix it. After a while, discovered there's the
> os.sep. Ok, replace "/" with os.sep, done. Then, bang, all hell
> broke loose. Because, the backslash is used as escape in string, so any
> regex that manipulates path got ****ed majorly. So, now you need to
> find a quoting mechanism.

import os
import re

if os.altsep is not None:
    sep_re = '[%s%s]' % (re.escape(os.sep), re.escape(os.altsep))
else:
    sep_re = '[%s]' % re.escape(os.sep)

But really, you should be ranting about regexps rather than Python.
They're convenient if you know exactly what you want to match, but a
nuisance if you need to generate the expression based upon data which is
only available at run-time (and re.escape() only solves one very specific
problem).
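As a rough illustration of building a separator pattern from run-time data, here is a sketch that takes the separators as parameters so it can be tried on any platform. The helper name separator_class is hypothetical; in real use one would presumably pass os.sep and os.altsep. Each character goes through re.escape, because a bare backslash inside a character class would otherwise escape whatever follows it.

```python
import re
from typing import Optional


def separator_class(sep: str, altsep: Optional[str] = None) -> str:
    """Build a regex character class matching any of the given separators."""
    seps = sep if altsep is None else sep + altsep
    # Escape each character so a backslash is matched literally.
    return "[" + "".join(re.escape(c) for c in seps) + "]"


# Windows-style separators, passed explicitly so this runs anywhere.
win = separator_class("\\", "/")
print(re.split(win, r"C:\Users/xah\file.txt"))
# ['C:', 'Users', 'xah', 'file.txt']

# POSIX-style: a single separator and no alternative.
print(re.split(separator_class("/"), "usr/local/bin"))
# ['usr', 'local', 'bin']
```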

Xah Lee
Guest
Posts: n/a

 04-09-2012
Xah Lee wrote:

« http://xahlee.org/comp/****_python.html »

David Canzi wrote

«When Microsoft created MS-DOS, they decided to use '\' as the
separator in file names. This was at a time when several previously
existing interactive operating systems were using '/' as the file name
separator and at least one was using '\' as an escape character. As a
result of Microsoft's decision to use '\' as the separator, people
have had to do extra work to adapt programs written for Windows to run
in non-Windows environments, and vice versa. People have had to do
extra work to write software that is portable between these
environments. People have done extra work while creating tools to
make writing portable software easier. And people have to do extra
work when they use these tools, because using them is still harder
than writing portable code for operating systems that all used '/' as
their separator would have been.»

namekuseijin wrote:

> yes, absolutely. But you got 2 inaccuracies there: 1) Microsoft didn't
> create DOS; 2) ****ing DOS was written in C, and guess what, it uses \ as
> escape character. ****ing microsoft.
>
> > So, when you say **** Python, are you sure you're shooting at the
> > right target?
>
> I agree. **** winDOS and ****ing microsoft.

No. The choice to use backslash rather than slash is actually a good one.

Because slash is one of the useful characters, far more so than backslash.
Users should be able to use it in file names.

i don't know the detailed history of path separator, but if i were to
blame, it's **** unix. The entirety of unix, unix geek, unixers, unix

〈On Unix Filename Characters Problem〉
http://xahlee.org/UnixResource_dir/w...ame_chars.html

〈On Unix File System's Case Sensitivity〉
http://xahlee.org/UnixResource_dir/_/fileCaseSens.html

〈UNIX Tar Problem: File Length Truncation, Unicode Name Support〉
http://xahlee.org/comp/unix_tar_problem.html

〈What Characters Are Not Allowed in File Names?〉
http://xahlee.org/mswin/allowed_char...ile_names.html

〈Unicode Support in File Names: Windows, Mac, Emacs, Unison, Rsync,
USB, Zip〉
http://xahlee.org/mswin/unicode_support_file_names.html

〈The Nature of the Unix Philosophy〉
http://xahlee.org/UnixResource_dir/writ/unix_phil.html

Xah

Alex Mizrahi
Guest
Posts: n/a

 04-09-2012
>> Ok no problem. My sloppiness. After all, my implementation wasn't
>> portable. So, let's fix it. After a while, discovered there's the
>> os.sep. Ok, replace "/" with os.sep, done. Then, bang, all hell
>> broke loose. Because, the backslash is used as escape in string, so any
>> regex that manipulates path got ****ed majorly. So, now you need to
>> find a quoting mechanism.

>
> if os.altsep is not None:
>     sep_re = '[%s%s]' % (re.escape(os.sep), re.escape(os.altsep))
> else:
>     sep_re = '[%s]' % re.escape(os.sep)
>
> But really, you should be ranting about regexps rather than Python.
> They're convenient if you know exactly what you want to match, but a
> nuisance if you need to generate the expression based upon data which is
> only available at run-time (and re.escape() only solves one very specific
> problem).

It isn't a problem of regular expressions, but a problem of syntax for
specification of regular expressions (i.e. them being specified as a
string).

The Common Lisp regex library cl-ppcre allows a regex to be specified as a
parse tree. E.g. "(foo[/\\]bar)" becomes

(:REGISTER (:SEQUENCE "foo" (:CHAR-CLASS #\/ #\\) "bar"))

This is more verbose, but totally unambiguous and requires no escaping.

So this definitely is a problem of Python's regex library, and a problem
of lack of support for nice parse tree representation in code.

cl-ppcre supports both textual perl-compatible regex specification and
parse trees. I would start with a simple string specification, then when
**** hits the fan I can call cl-ppcre:parse-string to get those parse trees
and replace forward slash with backslash. Moreover, I can
automatically convert regexes:

(defun scan-auto/ (regex target-string)
  (let ((fixed-parse-tree (subst '(:char-class #\/ #\\) '(:char-class #\/)
                                 (cl-ppcre:parse-string regex)
                                 :test 'equal)))
    (cl-ppcre:scan-to-strings fixed-parse-tree target-string)))

CL-USER> (scan-auto/ "foo[/]bar" "foo\\bar")
"foo\\bar"
#()
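For comparison, the parse-tree idea can be approximated in Python with a few lines on top of the standard re module. This is a hedged sketch, not a feature of Python's regex library: the nested-tuple format, the node names and the function to_pattern are all hypothetical, and only the three node types used above are handled. Literal text is passed through re.escape, so the caller never writes an escape by hand.

```python
import re


def to_pattern(node) -> str:
    """Render a small nested-tuple parse tree as a Python regex string."""
    if isinstance(node, str):
        return re.escape(node)  # literal text, escaped automatically
    kind, *args = node
    if kind == "sequence":
        return "".join(to_pattern(a) for a in args)
    if kind == "register":
        return "(" + to_pattern(args[0]) + ")"  # capturing group
    if kind == "char-class":
        return "[" + "".join(re.escape(c) for c in args) + "]"
    raise ValueError("unknown node type: %r" % (kind,))


# Same shape as the cl-ppcre tree above; "\\" is a single backslash character.
tree = ("register", ("sequence", "foo", ("char-class", "/", "\\"), "bar"))
pattern = to_pattern(tree)

print(re.fullmatch(pattern, "foo/bar") is not None)   # True
print(re.fullmatch(pattern, "foo\\bar") is not None)  # True
print(re.fullmatch(pattern, "fooXbar") is not None)   # False
```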

Roy Smith
Guest
Posts: n/a

 04-09-2012
In article <4f82d3e2$1$fuzhry+tra$(E-Mail Removed)>,
Shmuel (Seymour J.) Metz <(E-Mail Removed)> wrote:

> >Null terminated strings have simplified all kinds of text
> >manipulation, lexical scanning, and data storage/communication
> >code resulting in immeasurable savings over the years.
>
> Yeah, especially code that needs to deal with lengths and nulls. It's
> great for buffer overruns too.

I once worked on a C++ project that used a string class which kept a
length count, but also allocated one extra byte and stuck a null at the
end of every string.

Kaz Kylheku
Guest
Posts: n/a

 04-09-2012
On 2012-04-09, Shmuel Metz <(E-Mail Removed)> wrote:
> In <(E-Mail Removed)>, on 04/08/2012
> at 07:14 PM, Kaz Kylheku <(E-Mail Removed)> said:
>
>>Null-terminated strings are infinitely better than the ridiculous
>>encapsulation of length + data.
>
> ROTF,LMAO!
>
>>For one thing, if s is a non-empty null terminated string then,
>>cdr(s) is also a string representing the rest of that string
>>without the first character,
>
> Are you really too clueless to differentiate between C and LISP?

In Lisp we can burn a list literal like '(a b c) into ROM, and compute
(b c) without allocating any memory. Null-terminated C strings do the
same thing.

In some Lisp systems, in fact, "CDR coding" was used to save space when
allocating a list all at once. This created something very similar to a
C string: a vector-like object of all the CARs, with a terminating
convention marking the end.

It's logically very similar.

I need not repeat the elegant recursion example for walking a C string.

That example is not possible with the length + data representation.
(Not without breaking the encapsulation and passing the length as a
separate recursion parameter to a recursive routine that works with the
raw data part of the string.)

>>Null terminated strings have simplified all kinds of text
>>manipulation, lexical scanning, and data storage/communication
>>code resulting in immeasurable savings over the years.
>
> Yeah, especially code that needs to deal with lengths and nulls.

To get the length of a string, you call a function, in either
representation, so it is not any more complicated from a coding point of
view. The function is, of course, more expensive if the string is null
terminated, but you can code with awareness of this and not call length
wastefully.

If all else was equal (so that the expense of the length operation were
the /only/ issue) then of course the length + data would be better.

However, all else is not equal.

One thing that is darn useful, for instance, is that p + strlen(p) still
points to a string which is length zero, and this sort of thing is widely
exploited in text processing code. e.g.

  size_t digit_prefix_len = strspn(input_string, "0123456789");
  const char *after_digits = input_string + digit_prefix_len;

  if (*after_digits == 0) {
    /* string consists only of digits: nothing after digits */
  } else {
    /* process part after digits */
  }

It's nice that after_digits is a bona-fide string just like input_string,
without any memory allocation being required.

We can lexically analyze a string without ever asking it what its length
is, and as we march down the string, the remaining suffix of that string
is always a string so we can treat it as one, recurse on it, whatever.

Code that needs to deal with null "characters" is manipulating binary
data, not text, and should use a suitable data structure for that.

> It's great for buffer overruns too.

If we scan for a null terminator which is not there, we have a buffer
overrun.

If a length field in front of string data is incorrect, we also have a
buffer overrun.

A pattern quickly emerges here: invalid, corrupt data produced by buggy
code leads to incorrect results, and behavior that is not well-defined!
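
The digit-prefix scan in this post can be mimicked in Python to show what "the suffix is still a string" buys. This is a sketch only: Python has no pointers, so an integer offset into a NUL-terminated bytes buffer stands in for the char pointer, and this strspn is a hypothetical re-implementation for illustration, not the C library function.

```python
def strspn(buf: bytes, pos: int, accept: bytes) -> int:
    """Count leading bytes at buf[pos:] that are in accept, stopping at NUL."""
    n = 0
    while buf[pos + n] != 0 and buf[pos + n] in accept:
        n += 1
    return n


buf = b"2012abc\x00"  # NUL-terminated, as a C string would be

# The offset of the first non-digit plays the role of after_digits.
after_digits = strspn(buf, 0, b"0123456789")
print(after_digits)            # 4
print(buf[after_digits] == 0)  # False: something follows the digits

# The suffix is scanned the same way, without copying or asking for a length.
print(strspn(buf, after_digits, b"abc"))  # 3
```

One difference from C is worth noting: if the terminator is missing, this version raises IndexError at the end of the buffer instead of reading past it.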

Kaz Kylheku
Guest
Posts: n/a

 04-09-2012
On 2012-04-09, Roy Smith <(E-Mail Removed)> wrote:
> In article <4f82d3e2$1$fuzhry+tra$(E-Mail Removed)>,
> Shmuel (Seymour J.) Metz <(E-Mail Removed)> wrote:
>
>> >Null terminated strings have simplified all kinds of text
>> >manipulation, lexical scanning, and data storage/communication
>> >code resulting in immeasurable savings over the years.

>>
>> Yeah, especially code that needs to deal with lengths and nulls. It's
>> great for buffer overruns too.

>
> I once worked on a C++ project that used a string class which kept a
> length count, but also allocated one extra byte and stuck a null at the
> end of every string.

Me too! I worked on numerous C++ projects with such a string template
class.

It was usually called

std::basic_string

and came from this header called:

#include <string>

which also instantiated it into two flavors under two nicknames:
std::basic_string<char> being introduced as std::string, and
std::basic_string<wchar_t> as std::wstring.

This class had a c_str() function which retrieved a null-terminated
string and so most implementations just stored the data that way, but
some of the versions of that class cached the length of the string
to avoid doing a strlen or wcslen operation on the data.

Rainer Weikusat
Guest
Posts: n/a

 04-09-2012
Shmuel (Seymour J.) Metz <(E-Mail Removed)> writes:

[...]

>>For one thing, if s is a non-empty null terminated string then,
>>cdr(s) is also a string representing the rest of that string
>>without the first character,

>
> Are you really too clueless to differentiate between C and LISP?

In LISP, a list is a set of conses (pairs) whose car (first element of
the pair) contains a value and whose cdr (second element of the pair)
links to the next cons that's part of the list. The end of a list is
marked by a cdr whose value is nil. A so-called 'C string' is a
sequentially allocated sequence of memory locations which contain the
characters making up the string and the end of it is marked by a
memory location holding the value 0. This is logically very similar
to the LISP list and it shouldn't be too difficult to understand that
'cdr(s) is also a string representing the rest of the string' means
'given that s points to a non-empty C string, s + 1 points to a
possibly empty C string which is identical with s with the first
character removed'.
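The analogy drawn above can be put side by side in a short Python sketch. The representation is an assumption made for illustration: two-element tuples serve as cons cells with None as nil, and an integer offset into a NUL-terminated bytes buffer stands in for the C pointer. All function names are hypothetical.

```python
def make_list(*items):
    """Build a chain of (car, cdr) pairs; None plays the role of nil."""
    head = None
    for item in reversed(items):
        head = (item, head)
    return head


def car(cell):
    return cell[0]


def cdr(cell):
    return cell[1]


def walk_list(cell):
    """Follow cdr links until nil, collecting each car."""
    out = []
    while cell is not None:
        out.append(car(cell))
        cell = cdr(cell)
    return out


def walk_cstring(buf: bytes, pos: int = 0):
    """Advance one position at a time until the zero byte (the 's + 1' step)."""
    out = []
    while buf[pos] != 0:
        out.append(chr(buf[pos]))
        pos += 1
    return out


print(walk_list(make_list("a", "b", "c")))  # ['a', 'b', 'c']
print(walk_cstring(b"abc\x00"))             # ['a', 'b', 'c']
print(cdr(make_list("a", "b", "c")))        # ('b', ('c', None))
```

In both loops the "rest" of the structure is reached without allocating anything: the list walk follows an existing link, and the string walk adds one to a position.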

>>Null terminated strings have simplified all kinds of text
>>manipulation, lexical scanning, and data storage/communication
>>code resulting in immeasurable savings over the years.

>
> Yeah, especially code that needs to deal with lengths and nulls. It's
> great for buffer overruns too.

This is, I think, a case where the opinions of people who have used C
strings and the opinions of people who haven't differ greatly. A nice
German proverb applicable to situations like that would be 'Was der
Bauer nicht kennt das frisst er nicht' ('What the farmer doesn't know,
he won't eat') ...

Rainer Weikusat
Guest
Posts: n/a

 04-09-2012
Rainer Weikusat <(E-Mail Removed)> writes:
> Shmuel (Seymour J.) Metz <(E-Mail Removed)> writes:
>
> [...]
>
>>>For one thing, if s is a non-empty null terminated string then,
>>>cdr(s) is also a string representing the rest of that string
>>>without the first character,

>>
>> Are you really too clueless to differentiate between C and LISP?

>
> In LISP, a list is a set of conses (pairs) whose car (first element of
> the pair) contains a value and whose cdr (second element of the pair)
> links to the next cons that's part of the list. The end of a list is
> marked by a cdr whose value is nil.

Addition: This can also be implemented very neatly in Perl by using
two element array references as 'cons cells', toy example

-----------
sub car
{
    return $_[0][0];
}

sub cdr
{
    return $_[0][1];
}

sub list
{
    @_ && [shift, &list];
}

$l = list(0 .. 100);
while ($l) {
    print(car($l), ' ');
    $l = cdr($l);
}
print("\n");
-----------

and for algorithms which are well-suited for linked lists, this can
even outperform (when suitably implemented) an equivalent algorithm
using arrays.

BartC
Guest
Posts: n/a

 04-10-2012
"Shmuel (Seymour J.)Metz" <(E-Mail Removed)> wrote in message
news:4f8410ff$2$fuzhry+tra$(E-Mail Removed) ...
> In <(E-Mail Removed)>, on 04/09/2012
> at 06:55 PM, Kaz Kylheku <(E-Mail Removed)> said:

>>If we scan for a null terminator which is not there, we have a
>>buffer overrun.

>
> You're only thinking of scanning an existing string; think of
> constructing a string. The null only indicates the current length, not
> the amount allocated.
>
>>If a length field in front of string data is incorrect, we also have
>>a buffer overrun.

>
> The languages that I'm aware of that use a string length field also
> use a length field for the allocated storage. More precisely, they
> require that attempts to store beyond the allocated length be
> detected.

I would have thought trying to *read* beyond the current length would be an
error.

Writing beyond the current length, and perhaps beyond the current allocation,
might be OK if the string is allowed to grow; otherwise that's also an error.

In any case, there is no real need for an allocated length to be passed
around with the string, if you are only going to be reading it, or only
modifying the existing characters. And depending on the memory management
arrangements, such a length need not be stored at all.

--
Bartc