"convert" string to bytes without changing data (encoding)

 
 
Steven D'Aprano
 
      03-28-2012
On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I am
> often dealing with data, that is basically text, but it can contain
> 8-bit bytes.


All bytes are 8-bit, at least on modern hardware. I think you have to go
back to the 1950s to find 10-bit or 12-bit machines.

> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.


Well you can't do that, because *by definition* you are changing a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?

You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:

bytes = string.encode('ascii', 'replace')
bytes = string.encode('ascii', 'ignore')

When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail:

string = bytes.decode('latin1')

although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
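
A minimal sketch of the calls above, assuming a short sample string (the variable names are illustrative and not from the original post). It shows what each error handler does to non-ASCII characters, and why Latin-1 decoding cannot fail: every byte value 0-255 maps to the code point with the same number.

```python
# Illustrative sample: "café naïve", written with escapes.
text = "caf\u00e9 na\u00efve"

# 'replace' substitutes '?' for each character ASCII cannot represent.
replaced = text.encode("ascii", "replace")

# 'ignore' silently drops those characters instead.
ignored = text.encode("ascii", "ignore")

# Latin-1 assigns a character to every possible byte value, so decoding
# never raises, and encoding again restores the original bytes.
raw = bytes(range(256))
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")

print(replaced)           # b'caf? na?ve'
print(ignored)            # b'caf nave'
print(round_trip == raw)  # True
```

Note that the Latin-1 round trip preserves the bytes, not their meaning: if the data was really UTF-8, the non-ASCII characters in `as_text` will look wrong even though nothing is lost.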

But in a nutshell, it is physically impossible to map the more than a
million possible Unicode code points to just 256 possible byte values
without either throwing some characters away, or performing an encoding.



> As it seems, this would be far easier with python 2.x.


It only seems that way until you try.


--
Steven
 
Prasad, Ramit
 
      03-28-2012
> You can read as bytes and decode as ASCII but ignoring the troublesome
> non-text characters:
>
> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>
> The paragraph is from the German Wikipedia on ASCII, in UTF-8.


I see no non-ASCII characters, not sure if that is because the source
has none or something else. From this example I would not say that
the rest of the text is "unchanged". Decode converts to Unicode;
did you mean encode?

I think "ignore" will remove non-translatable characters and not
leave them in the returned string.
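
A small sketch of that point, assuming a hypothetical UTF-8 input built from two words of the German sample. After decoding with `'ignore'`, the bytes above 127 are gone, and encoding again does not bring them back.

```python
# Hypothetical input: the UTF-8 bytes for "für Parität".
original = "f\u00fcr Parit\u00e4t".encode("utf-8")

# Decoding as ASCII with 'ignore' discards every byte above 127.
decoded = original.decode("ascii", "ignore")

# Encoding the result again cannot restore the discarded bytes.
restored = decoded.encode("ascii")

print(decoded)               # fr Paritt
print(restored == original)  # False
```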

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423

--

This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
 
Tim Chase
 
      03-28-2012
On 03/28/12 13:05, Ross Ridge wrote:
> Ross Ridge <(E-Mail Removed)> wrote:
>> But a Python Unicode string might be stored in several
>> ways; for all you know, it might actually be stored as a sequence of
>> apples in a refrigerator, just as long as they can be referenced
>> correctly.
>
> But it is in fact only stored in one particular way, as a series of bytes.
>
>> There's no logical Python way to turn that into a series of bytes.
>
> Nonsense. Play all the semantic games you want, it already is a series
> of bytes.


Internally, they're a series of bytes, but they are MEANINGLESS
bytes unless you know how they are encoded internally. Those
bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
possible encodings[1]. If you get the internal byte stream,
there's no way to meaningfully operate on it unless you also know
how it's encoded (or you're willing to sacrifice the ability to
reliably get the string back).
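
To make this concrete, here is a short sketch using the public codecs rather than any interpreter internals (the choice of character and the list of encodings are only illustrative). The same one-character string becomes a different byte sequence under each encoding, and reading the bytes back with the wrong one gives different text.

```python
text = "\u00fc"  # the single character "ü"

# Encode the same string with several codecs and compare the bytes.
encodings = ["utf-8", "utf-16-le", "utf-32-le", "latin-1"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, data.hex())

# Interpreting UTF-8 bytes as Latin-1 yields two unrelated characters.
misread = encoded["utf-8"].decode("latin-1")
print(len(misread))  # 2
```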

-tkc

[1]
http://docs.python.org/library/codec...dard-encodings




 
Ethan Furman
 
      03-28-2012
Prasad, Ramit wrote:
>> You can read as bytes and decode as ASCII but ignoring the troublesome
>> non-text characters:
>>
>> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
>> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
>> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
>> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
>> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
>> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
>> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
>> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
>> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
>> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>>
>> The paragraph is from the German Wikipedia on ASCII, in UTF-8.
>
> I see no non-ASCII characters, not sure if that is because the source
> has none or something else.


The 'ignore' argument to .decode() caused all non-ascii characters to be
removed.
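
If the goal is to keep those bytes rather than remove them, one option that is not mentioned in this thread is the `surrogateescape` error handler (PEP 383, available since Python 3.1). The sketch below assumes a hypothetical input with ASCII keywords around bytes of unknown encoding; treat it as an illustration, not as what any poster here recommended.

```python
# Hypothetical mixed input: ASCII text plus two bytes of unknown encoding.
raw = b"name=M\xfcller; size=\xb5m"

# 'surrogateescape' maps each undecodable byte to a lone surrogate
# code point instead of dropping it.
text = raw.decode("ascii", "surrogateescape")

# The ASCII portions can now be handled as ordinary str data.
text = text.replace("name=", "NAME=")

# Encoding with the same handler turns the surrogates back into the
# original bytes.
result = text.encode("ascii", "surrogateescape")
print(result)  # b'NAME=M\xfcller; size=\xb5m'
```

The escaped characters are not valid text, so printing `text` itself or encoding it with a strict handler may raise an error; they are only placeholders that survive the round trip.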

~Ethan~
 
Prasad, Ramit
 
      03-28-2012
>> The right way to convert bytes to strings, and vice versa, is via
>> encoding and decoding operations.
>
> If you want to dictate to the original poster the correct way to do
> things then you don't need to do anything more than that. You don't need
> to pretend like Chris Angelico that there isn't a direct mapping from
> his Python 3 implementation's internal representation of strings
> to bytes in order to label what he's asking for as being "silly".


It might be technically possible to reproduce the internal representation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:


>You can't generally just "deal with the ascii portions" without
>knowing something about the encoding. Say you encounter a byte
>greater than 127. Is it a single non-ASCII character, or is it the
>leading byte of a multi-byte character? If the next character is less
>than 127, is it an ASCII character, or a continuation of the previous
>character? For UTF-8 you could safely assume ASCII, but without
>knowing the encoding, there is no way to be sure. If you just assume
>it's ASCII and manipulate it as such, you could be messing up
>non-ASCII characters.
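
The ambiguity Ian describes can be sketched with one character, chosen here only for illustration: U+30BD (katakana "so"). In Shift JIS its second byte has the same value as an ASCII backslash, while in UTF-8 every byte of a non-ASCII character is 0x80 or above.

```python
char = "\u30bd"  # katakana "so"

sjis = char.encode("shift_jis")
utf8 = char.encode("utf-8")

# A byte-level search for a backslash finds a false match in Shift JIS...
backslash_in_sjis = b"\\" in sjis

# ...but a UTF-8 multi-byte sequence never contains a byte below 0x80.
ascii_bytes_in_utf8 = [b for b in utf8 if b < 0x80]

print(sjis.hex(), backslash_in_sjis)    # 835c True
print(utf8.hex(), ascii_bytes_in_utf8)  # e382bd []
```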


Technically, ASCII itself only goes up to 127; the 8-bit extensions go
up to 255, but those extra values are not A-z letters.

Ramit


 
Ross Ridge
 
      03-28-2012
Tim Chase <(E-Mail Removed)> wrote:
>Internally, they're a series of bytes, but they are MEANINGLESS
>bytes unless you know how they are encoded internally. Those
>bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
>possible encodings[1]. If you get the internal byte stream,
>there's no way to meaningfully operate on it unless you also know
>how it's encoded (or you're willing to sacrifice the ability to
>reliably get the string back).


In practice the number of ways that CPython (the only Python 3
implementation) represents strings is much more limited. Pretending
otherwise really isn't helpful.

Still, if Chris Angelico had used your much less misleading explanation,
then this could've been resolved much quicker. The original poster
didn't buy Chris's bullshit for a minute, instead he had to find out on
his own that the internal representation of strings wasn't what he
expected it to be.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] http://www.velocityreviews.com/forums/(E-Mail Removed)
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
 
Evan Driscoll
 
      03-28-2012
Ross Ridge wrote:
> Steven D'Aprano <(E-Mail Removed)> wrote:
>> The right way to convert bytes to strings, and vice versa, is via
>> encoding and decoding operations.
>
> If you want to dictate to the original poster the correct way to do
> things then you don't need to do anything more than that. You don't need
> to pretend like Chris Angelico that there isn't a direct mapping from
> his Python 3 implementation's internal representation of strings
> to bytes in order to label what he's asking for as being "silly".


That mapping may as well be:

    def get_bytes(some_string):
        import random
        # Deliberately arbitrary: both the length and the contents are random.
        length = random.randint(len(some_string), 5 * len(some_string))
        data = [0] * length
        for i in range(length):
            data[i] = random.randint(0, 255)
        return data

Of course this is hyperbole, but it's essentially about as much
guarantee as to what the result is.

As many others have said, the encoding isn't defined, and I would guess
varies between implementations. (E.g. if Jython and IronPython use their
host platforms' native strings, both have 16-bit chars and thus probably
use UTF-16 encoding. I am not sure what CPython uses, but I bet it's
*not* that.)

It's not even guaranteed that the byte representation won't change! If
something is lazily evaluated or you have a copy-on-write (COW) string or
something, the bytes backing it will differ.


So yes, you can say that pretending there's not a mapping of strings to
internal representation is silly, because there is. However, there's
nothing you can say about that mapping.

Evan
 
Albert W. Hopkins
 
      03-28-2012
On Wed, 2012-03-28 at 14:05 -0400, Ross Ridge wrote:
> Ross Ridge <(E-Mail Removed)> wrote:
> > Of course it is. Conceptually you're not supposed to think of it that
> > way, but a string is stored in memory as a series of bytes.
>
> Chris Angelico <(E-Mail Removed)> wrote:
> > Note that distinction. I said that a string "is not" a series of
> > bytes; you say that it "is stored" as bytes.
>
> The distinction is meaningless. I'm not going to argue with you about what
> you or I meant by the word "is".


Off topic, but obligatory:

https://www.youtube.com/watch?v=j4XT-l-_3y0


 
John Nagle
 
      03-28-2012
On 3/28/2012 10:43 AM, Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> On 28.03.2012 11:43, Peter Daum wrote:


> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.


So why let the data get into a "str" type at all? Do everything
end to end with "bytes" or "bytearray" types.
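
A minimal sketch of that approach, assuming a hypothetical header-style line (the data and variable names are illustrative). The `bytes` type offers most of the familiar `str` methods; they act on ASCII values and pass every other byte through untouched.

```python
# Hypothetical line: ASCII keywords around bytes of unknown encoding.
line = b"subject: Gr\xfc\xdfe aus K\xf6ln\r\n"

# Split on an ASCII separator; no decoding takes place at any point.
key, _, value = line.rstrip(b"\r\n").partition(b": ")

# upper() on bytes only changes the ASCII letters a-z.
header = key.upper() + b": " + value

print(header)  # b'SUBJECT: Gr\xfc\xdfe aus K\xf6ln'
```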

John Nagle
 
Grant Edwards
 
      03-28-2012
On 2012-03-28, Steven D'Aprano <(E-Mail Removed)> wrote:
> On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
>
>> The longer story of my question is: I am new to python (obviously), and
>> since I am not familiar with either one, I thought it would be advisory
>> to go for python 3.x. The biggest problem that I am facing is, that I am
>> often dealing with data, that is basically text, but it can contain
>> 8-bit bytes.

>
> All bytes are 8-bit, at least on modern hardware. I think you have to
> go back to the 1950s to find 10-bit or 12-bit machines.


Well, on anything likely to run Python that's true. There are modern
DSP-oriented CPUs where a byte is 16 or 32 bits (and so is an int and
a long, and a float and a double).

>> As it seems, this would be far easier with python 2.x.

>
> It only seems that way until you try.


It's easy as long as you deal with nothing but ASCII and Latin-1.

--
Grant Edwards, grant.b.edwards at gmail.com
Yow! Somewhere in Tenafly, New Jersey, a chiropractor
is viewing "Leave it to Beaver"!
 