"convert" string to bytes without changing data (encoding)

 
 
Chris Angelico
 
      03-29-2012
On Fri, Mar 30, 2012 at 5:00 AM, Ross Ridge wrote:
> Sorry, it would've been more accurate to label the flavour of kool-aid
> Chris Angelico was trying to push as "it's impossible ... without
> encoding":
>
>     What is a string? It's not a series of bytes. You can't convert
>     it without encoding those characters into bytes in some way.


I still stand by that statement. Do you try to convert a "dictionary
of filename to open file object" into a "series of bytes" inside
Python? It doesn't matter that, on some level, it's *stored as* a
series of bytes; the actual object *is not* a series of bytes. There
is no logical equivalency, ergo it is illogical and nonsensical to
expect to turn one into the other without some form of encoding.
Python does include an encoding that can handle lists and
dictionaries. It's called Pickle, and it returns (in Python 3) a bytes
object - which IS a series of bytes. It doesn't simply return some
internal representation.
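
To make the distinction concrete, here is a minimal sketch (not from the thread; the sample string and dictionary are made up for illustration). It shows that the same characters become different byte sequences depending on which encoding is chosen, and that pickle turns a dictionary into a bytes object that can be decoded back.

```python
import pickle

text = "na\u00efve"  # hypothetical sample string with one non-ASCII character

# The same characters map to different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
latin1_bytes = text.encode("latin-1")

# pickle acts as an encoding for arbitrary objects; in Python 3 it yields bytes.
table = {"spam": 1, "eggs": [2, 3]}  # hypothetical sample data
pickled = pickle.dumps(table)
restored = pickle.loads(pickled)

print(utf8_bytes, utf16_bytes, latin1_bytes)
print(type(pickled), restored == table)
```

None of the three byte sequences is "the" bytes of the string; each is the result of one particular encoding choice.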

ChrisA
 
Steven D'Aprano
 
      03-30-2012
On Thu, 29 Mar 2012 17:36:34 +0000, Prasad, Ramit wrote:

>>> Technically, ASCII goes up to 256 but they are not A-z letters.
>>
>> Technically, ASCII is 7-bit, so it goes up to 127.
>>
>> No, ASCII only defines 0-127. Values >=128 are not ASCII.
>>
>> From https://en.wikipedia.org/wiki/ASCII:
>>
>> ASCII includes definitions for 128 characters: 33 are non-printing
>> control characters (now mostly obsolete) that affect how text and
>> space is processed and 95 printable characters, including the space
>> (which is considered an invisible graphic).
>
> Doh! I was mistaking extended ASCII for ASCII. Thanks for the
> correction.


There actually is no such thing as "extended ASCII" -- there is a whole
series of many different "extended ASCIIs". If you look at the encodings
available in (for example) Thunderbird, many of the ISO-8859-* and
Windows-* encodings are "extended ASCII" in the sense that they extend
ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a
different way (hence they are different encodings).
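
A short sketch of that point, using codecs that ship with Python (the byte values are picked purely for illustration): the same byte above 127 decodes to a different character under each "extended ASCII" encoding, and plain ASCII rejects it outright.

```python
# One byte value, several "extended ASCII" interpretations.
euro_in_cp1252 = b"\x80".decode("cp1252")     # Windows-1252: euro sign
a4_in_latin1 = b"\xa4".decode("iso-8859-1")   # ISO-8859-1: currency sign
a4_in_latin9 = b"\xa4".decode("iso-8859-15")  # ISO-8859-15: euro sign


def is_ascii(data: bytes) -> bool:
    """Return True only if every byte is in the ASCII range 0-127."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(repr(euro_in_cp1252), repr(a4_in_latin1), repr(a4_in_latin9))
print(is_ascii(b"\x7f"), is_ascii(b"\x80"))
```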


--
Steven
 
Steven D'Aprano
 
      03-30-2012
On Thu, 29 Mar 2012 11:30:19 -0400, Ross Ridge wrote:

> Steven D'Aprano wrote:
>>Your reaction is to make an equally unjustified estimate of Evan's
>>mindset, namely that he is not just wrong about you, but *deliberately
>>and maliciously* lying about you in the full knowledge that he is wrong.
>
> No, Evan in his own words admitted that his post was meant to be harsh,
> "a bit harsher than it deserves", showing his malicious intent.


Being harsher than it deserves is not synonymous with malicious. You are
making assumptions about Evan's mental state that are not supported by
the evidence. Evan may believe that by "punishing" (for some feeble sense
of punishment) you harshly, he is teaching you better behaviour that will
be to your own benefit; or that it will act as a warning to others.
Either way he may believe that he is actually doing good.

And then he entirely undermined his own actions by admitting that he was
over-reacting. This suggests that, in fact, he wasn't really motivated by
either malice or beneficence but mere frustration.

It is quite clear that Evan let his passions about writing maintainable
code get the best of him. His rant was more about "people like you" than
you personally.

Evan, if you're reading this, I think you owe Ross an apology for flying
off the handle. Ross, I think you owe Evan an apology for unjustified
accusations of malice.


> He made
> accusations that were neither supported by anything I've said


Now that is not actually true. Your posts have defended the idea that
copying the raw internal byte representation of strings is a reasonable
thing to do. You even claimed to know how to do so, for any version of
Python (but so far have ignored my request for you to demonstrate).


> in this
> thread nor by the code I actually write. His accusations about me were
> completely made up, he was not telling the truth and had no reasonable
> basis to believe he was telling the truth. He was maliciously lying and
> I'm completely justified in saying so.


No, they were not completely made up. Your posts give many signs of being
somebody who might very well write code to the implementation rather than
the interface. Whether you are or not is a separate question, but your
posts in this thread indicate that you very likely could be.

If this is not the impression you want to give, then you should
reconsider your posting style.

Ross, to be frank, your posting style in this thread has been cowardly
and pedantic, an obnoxious combination. Please take this as constructive
criticism and not an attack -- you have alienated people in this thread,
leading at least one person to publicly kill-file your future posts. I
choose to assume you aren't aware of why that is, rather than that you are
doing so deliberately.

Without actually coming out and making a clear, explicit statement that
you approve or disapprove of the OP's attempt to use implementation
details, you *imply* support without explicitly giving it; you criticise
others for saying it can't be done without demonstrating that it can be
done. If this is a deliberate rhetorical trick, then shame on you for
being a coward without the conviction to stand behind concrete
expressions of your opinion. If not, then you should be aware that you
are using a rhetorical style that will make many people predisposed to
think you are a ****.

You *might* have said

    Guys, you're technically wrong about this. This is how you can
    retrieve the internal representation of a string as a sequence
    of bytes: ...code... but you shouldn't use this in production
    code because it is fragile and depends on implementation details
    that may break in PyPy and Jython and IronPython.

But you didn't.

You *might* have said

    Wrong, you can convert a string into a sequence of bytes without
    encoding or decoding: ...code... but don't do this.

But you didn't.

Instead you puffed yourself up as a big shot who was more technically
correct than everyone else, but without *actually* demonstrating that you
can do what you said you can do. You labelled as "bullshit" our attempts
to discourage the OP from his misguided approach.

If your intention was to put people off-side, you succeeded very well. If
not, you should be aware that you have, and consider how you might avoid
this in the future.



--
Steven
 
Michael Ströder
 
      03-30-2012
Steven D'Aprano wrote:
> On Thu, 29 Mar 2012 17:36:34 +0000, Prasad, Ramit wrote:
>
>>>> Technically, ASCII goes up to 256 but they are not A-z letters.
>>>>
>>> Technically, ASCII is 7-bit, so it goes up to 127.
>>>
>>> No, ASCII only defines 0-127. Values >=128 are not ASCII.
>>>
>>> From https://en.wikipedia.org/wiki/ASCII:
>>>
>>> ASCII includes definitions for 128 characters: 33 are non-printing
>>> control characters (now mostly obsolete) that affect how text and
>>> space is processed and 95 printable characters, including the space
>>> (which is considered an invisible graphic).
>>
>> Doh! I was mistaking extended ASCII for ASCII. Thanks for the
>> correction.
>
> There actually is no such thing as "extended ASCII" -- there is a whole
> series of many different "extended ASCIIs". If you look at the encodings
> available in (for example) Thunderbird, many of the ISO-8859-* and
> Windows-* encodings are "extended ASCII" in the sense that they extend
> ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a
> different way (hence they are different encodings).


Yupp.

Looking at RFC 1345 some years ago (while having to deal with EBCDIC) made
this all pretty clear to me. I appreciate that someone did this heavy work of
collecting historical encodings.

Ciao, Michael.
 
Serhiy Storchaka
 
      03-30-2012
28.03.12 21:13, Heiko Wundram wrote:
> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.


In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer
(or sys.stdin.buffer.raw for the unbuffered stream) to access the byte
stream. Reading from a file opened in text mode gets you a string too.
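
A minimal sketch of the difference, assuming an in-memory stream as a stand-in for sys.stdin (the helper name make_stream and the sample payload are hypothetical). A text stream decodes for you and returns str, while its .buffer attribute hands back the undecoded bytes.

```python
import io


def make_stream(payload: bytes) -> io.TextIOWrapper:
    """Build a text stream over in-memory bytes, standing in for sys.stdin."""
    return io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")


payload = "caf\u00e9\n".encode("utf-8")  # hypothetical input data

# Reading through the text layer decodes the bytes into a str.
text_stream = make_stream(payload)
text_line = text_stream.readline()

# Reading through .buffer bypasses decoding and returns bytes.
byte_stream = make_stream(payload)
raw_line = byte_stream.buffer.readline()

print(repr(text_line), repr(raw_line))
```

With the real standard input, the equivalent calls would be sys.stdin.readline() and sys.stdin.buffer.readline().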

 
Chris Angelico
 
      03-30-2012
On Sat, Mar 31, 2012 at 6:06 AM, Serhiy Storchaka wrote:
> 28.03.12 21:13, Heiko Wundram wrote:
>
>> Reading from stdin/a file gets you bytes, and
>> not a string, because Python cannot automagically guess what format the
>> input is in.

>
>
> In Python3 reading from stdin gets you string. Use sys.stdin.buffer.raw for
> access to byte stream. And reading from file opened in text mode gets you
> string too.


True. But that's only if it's been told the encoding of stdin (which I
believe is the normal case on Linux). It's still not "automagically
guess(ing)", it's explicitly told.
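
As a rough illustration of where that encoding comes from (a sketch, not from the thread; the in-memory stream and sample bytes are assumptions), the following prints the encoding Python derives from the locale, the one attached to stdin if any, and then decodes a stream whose encoding is stated explicitly.

```python
import io
import locale
import sys

# Encoding Python derives from the environment's locale settings.
default_encoding = locale.getpreferredencoding(False)

# Encoding attached to the current stdin, if it is a text stream.
stdin_encoding = getattr(sys.stdin, "encoding", None)

# Stating the encoding explicitly removes any dependence on the environment.
explicit = io.TextIOWrapper(io.BytesIO(b"\xe9t\xe9\n"), encoding="iso-8859-1")
line = explicit.readline()

print(default_encoding, stdin_encoding, repr(line))
```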

ChrisA
 
Piet van Oostrum
 
      08-29-2012
Ross Ridge writes:

>
> But it is in fact only stored in one particular way, as a series of bytes.
>

No, it can be stored in different ways. Certainly in Python 3.3 and
beyond. And in 3.2 also, depending on wide/narrow build.
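
One way to observe this on CPython 3.3 or later is to measure how much a string grows when one more character is appended. This is only a sketch: it relies on sys.getsizeof, which reports CPython-specific sizes, and the helper name bytes_per_char is made up for illustration.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# Strings whose widest character needs 1, 2, or 4 bytes respectively.
samples = {"ASCII": "a", "BMP": "\u0100", "astral": "\U00010000"}
widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
print(widths)
```

On CPython 3.3+ the three widths differ, so the in-memory layout depends on the string's contents. Other implementations may store strings differently or not support sys.getsizeof at all.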
--
Piet van Oostrum
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
 
Piet van Oostrum
 
      08-29-2012
Heiko Wundram writes:

> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.
>

Huh?

Python 3.3.0rc1 (v3.3.0rc1:8bb5c7bc46ba, Aug 25 2012, 10:09:29)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> x = input()
abcd123
>>> x
'abcd123'
>>> type(x)
<class 'str'>
>>> y = sys.stdin.readline()
abcd123
>>> y
'abcd123\n'
>>> type(y)
<class 'str'>

--
Piet van Oostrum
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
 
Nobody
 
      08-30-2012
On Wed, 29 Aug 2012 19:39:15 -0400, Piet van Oostrum wrote:

>> Reading from stdin/a file gets you bytes, and not a string, because
>> Python cannot automagically guess what format the input is in.
>>

> Huh?


Oh, it can certainly guess (in the absence of any other information, it
uses the current locale). Whether or not that guess is correct is a
different matter.

Realistically, if you want sensible behaviour from Python 3.x, you need
to use an ISO-8859-1 locale. That ensures that conversion between str and
bytes will never fail, and an str-bytes-str or bytes-str-bytes round-trip
will pass data through unmangled.
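
The round-trip claim can be checked directly. A minimal sketch, not from the thread: every byte value survives a bytes-str-bytes round trip through ISO-8859-1. Note, though, that the str-to-bytes direction succeeds only for code points up to U+00FF, which the hypothetical helper encodes_as_latin1 demonstrates.

```python
all_bytes = bytes(range(256))

# bytes -> str -> bytes through ISO-8859-1 is lossless for every byte value.
round_trip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")


def encodes_as_latin1(text: str) -> bool:
    """Return True if every character in text fits in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


print(round_trip == all_bytes)
print(encodes_as_latin1("\u00ff"), encodes_as_latin1("\u20ac"))
```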

 