Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > permuting letters and fairy tales

Reply
Thread Tools

permuting letters and fairy tales

 
 
Johannes Nix
Guest
Posts: n/a
 
      11-11-2004

Hello,

yesterday I met a cute person (after my dance class) who told me about an
interesting experiment regarding cognition. People were told to read a
typed text; However, in every word in the text, only the first and the
last letter were in the right place. The other letters had their
positions changed in an arbitrary manner. The surprising result, I was
told, was that people can read this mixed-up text fairly well.

Because I am a somewhat sceptical guy, at times, and because I thought
that I deserved some play, I decided to code the rule above in a
scriptlet. The resulting 23 lines are below, and the outcome is quite
interesting, not only from the point of view of librarians :

--------------------------------------------------
#!/usr/bin/python
import sys
import locale
import string
import re
import Numeric
import RandomArray

locale.setlocale(locale.LC_CTYPE, '')
wordsep = re.compile('([^%s])' % string.letters)

for line in sys.stdin.xreadlines():
for word in wordsep.split(line):
if word and word[0] in string.letters:
word = string.lower(word)
wlen = len(word)
if wlen > 3:
wa = Numeric.array(word)
perm = RandomArray.permutation(wlen-2)
wa[1:wlen-1] = Numeric.take(wa[1:wlen-1],perm)
word = wa.tostring()
sys.stdout.write('%s' % word)

--------------------------------------------------


For the Uninitiated, Numeric is a package which deals with array data;
arrays are mutable sequences and Numeric.take() can reorder items in
them; RandomArray.permutation() delivers the randomized reordering we
need.

Now I have two innocent questions:

- Is it possible to make it a bit more concise )) ?

- Can it coerced to run a little bit faster ?
(on my oldish, 300 MHz-AMD K6 , run time looks like this
for a famous, 2663-word-long fairy tale from the Grimm's brothers:

nix@aster:~> time <HaenselundGretel.txt ./python/perlmutt.py >v

real 0m6.970s
user 0m3.634s
sys 0m0.120s


And two remarks what is interesting about it:

- It's a good example how powerful libraries, like
Numeric, make one's life easier. (BTW, why is Numeric
and stuff like take() still not included in the standard
Library ? Batteries included, but calculator not ?)

- Perhaps it's useful to protect messages in some
regions with not-so-democratic forms of government
against automatic scanning by making the message
machine-unreadable, causing some Orwellian Confusion ?
Of course, texts from Pythonistas would remain suspicious,
due to the large number of "y" occurring in them....

have a nice evening....

Johannes


 
Reply With Quote
 
 
 
 
Steven Bethard
Guest
Posts: n/a
 
      11-11-2004
Johannes Nix <Johannes.Nix <at> gmx.net> writes:
> Now I have two innocent questions:
>
> - Is it possible to make it a bit more concise )) ?
>
> - Can it coerced to run a little bit faster ?
> (on my oldish, 300 MHz-AMD K6 , run time looks like this
> for a famous, 2663-word-long fairy tale from the Grimm's brothers:


So, I don't have Numeric, but I believe I've translated your code to the
equivalent numarray code. I've also written a version of it using only the
standard Python libraries:

------------------------------------------------------------
import sys
import locale
import string
import re
import numarray
import numarray.strings
import numarray.random_array
import random

def f1(file_in, file_out):
wordsep = re.compile('([^%s])' % string.letters)
for line in file_in.xreadlines():
for word in wordsep.split(line):
if word and word[0] in string.letters:
word = string.lower(word)
wlen = len(word)
if wlen > 3:
wa = numarray.strings.array(list(word))
perm = numarray.random_array.permutation(wlen-2)
wa[1:wlen-1] = numarray.take(wa[1:wlen-1],perm)
word = wa.tostring()
file_out.write('%s' % word)
file_out.write('\n')

def f2(file_in, file_out):
word_sep_matcher = re.compile(r'(\W+)')
for line in file_in:
for word in word_sep_matcher.split(line):
if word and word.isalpha():
word = word.lower()
if len(word) > 3:
inner = list(word[1:-1])
random.shuffle(inner)
word = '%s%s%s' % (word[0], ''.join(inner), word[-1])
file_out.write('%s' % word)
file_out.write('\n')
------------------------------------------------------------

As far as conciceness goes, a couple things of note:
(1) I believe r'([^A-z])' does the same thing as your re
(2) xreadlines is deprecated in Python 2.3
(3) str.isalpha() is probably a good substitute for s[0] in string.letters. (If
you want to be strict about the translation, you can do s[0].isalpha())
(4) string.lower(word) is deprecated in favor of word.lower()
(5) perhaps Numeric doesn't support this, but I believe you can replace
wa[1:wlen-1] with wa[1:-1]
(6) you can do something much like what your numarray code does by using
random.shuffle


As far as speed goes, here's the timings I got using your email (minus the code)
as input (your code is saved in 'temp.txt'):

>python -m timeit -n 10000 -s "import temp; file_in = file('temp.txt'); file_out

= file('out.txt', 'w')" "temp.f1(file_in, file_out)"
10000 loops, best of 3: 31.6 usec per loop

>python -m timeit -n 10000 -s "import temp; file_in = file('temp.txt'); file_out

= file('out.txt', 'w')" "temp.f2(file_in, file_out)"
10000 loops, best of 3: 8.02 usec per loop

Looks to me like the one without numarray is much quicker, but I can't be sure
that Numeric would behave in the same manner.

> - It's a good example how powerful libraries, like
> Numeric, make one's life easier. (BTW, why is Numeric
> and stuff like take() still not included in the standard
> Library ? Batteries included, but calculator not ?)


Well, at least in numarray, take is just the functional form of indexing by an
array:

>>> arr = na.arange(20)*2
>>> na.take(arr, na.array([3, 5, 13]))

array([ 6, 10, 26])
>>> arr[na.array([3, 5, 13])]

array([ 6, 10, 26])

While it's true that Python's builtin lists don't support this directly, list
comprehensions make this pretty easy to reproduce:

>>> lst = [2*x for x in range(20)]
>>> [lst[i] for i in [3, 5, 13]]

[6, 10, 26]

> - Perhaps it's useful to protect messages in some
> regions with not-so-democratic forms of government
> against automatic scanning by making the message
> machine-unreadable, causing some Orwellian Confusion ?


Unfortunately, this 'encoding scheme' doesn't work in all languages -- what's
the first 'letter' of a two-character word in, say, Mandarin?

Steve

 
Reply With Quote
 
 
 
 
Andrew Dalke
Guest
Posts: n/a
 
      11-11-2004
Johannes Nix wrote:
> yesterday I met a cute person (after my dance class)


Another dancer, eh? I mostly do Latin dancing (salsa, cha-cha,
nerengue, though haven't got bachata down) some tango and a bit
of swing.

> who told me about an
> interesting experiment regarding cognition. People were told to read a
> typed text; However, in every word in the text, only the first and the
> last letter were in the right place.


That was going around a couple months ago. Here's some links:
http://jwz.livejournal.com/256229.html
http://slashdot.org/article.pl?sid=03/09/15/2227256

> - Is it possible to make it a bit more concise )) ?


Yes. Try this

#!/usr/bin/pthoyn
irpomt re
imropt rondam
ipormt snitrg

# This is lcloae aware, so long as '][-' aren't ltreets.
# (Orhtiewse use re.epscae)

wrod_pat = re.colmipe('[' + string.letters + ']{4,}')

def _jmulbe(m):
s = m.gruop(0)
leretts = lsit(s[1:-1])
rondam.sulfhfe(leetrts)
return s[0] + "".jion(ltrtees) + s[-1]

def jmblue_file(ilnife, ouflite):
for lnie in iinlfe:
lnie = wrod_pat.sub(_julbme, lnie)
olftuie.wirte(line)

if __name__ == "__mian__":
ioprmt sys
julbme_file(sys.sdtin, sys.sdtout)


> - Can it coerced to run a little bit faster ?


You'll have to test it for yourself. I don't have a copy
of your data set and can't find it on-line.

> - It's a good example how powerful libraries, like
> Numeric, make one's life easier. (BTW, why is Numeric
> and stuff like take() still not included in the standard
> Library ? Batteries included, but calculator not ?)


Well, random.shuffle works nicely as does using re.sub along
with a callable. I rarely need Numeric.

> - Perhaps it's useful to protect messages in some
> regions with not-so-democratic forms of government
> against automatic scanning by making the message
> machine-unreadable, causing some Orwellian Confusion ?
> Of course, texts from Pythonistas would remain suspicious,
> due to the large number of "y" occurring in them....


There are many more ways to do that. Eg, see what the
spammers do to get through the repressive forms of email
filters I use.

Oh, and to make life easier for you,

#!/usr/bin/python
import re
import random
import string

# This is locale aware, so long as '][-' aren't letters.
# (Otherwise use re.escape)

word_pat = re.compile('[' + string.letters + ']{4,}')

def _jumble(m):
s = m.group(0)
letters = list(s[1:-1])
random.shuffle(letters)
return s[0] + "".join(letters) + s[-1]

def jumble_file(infile, outfile):
for line in infile:
line = word_pat.sub(_jumble, line)
outfile.write(line)

if __name__ == "__main__":
import sys
jumble_file(sys.stdin, sys.stdout)

Andrew
http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Reply With Quote
 
Scott David Daniels
Guest
Posts: n/a
 
      11-12-2004
Andrew Dalke wrote:
>

Not to prolong this, but if you are dealing with camelCase:

> word_pat = re.compile('[' + string.letters + ']{4,}')


you might want to use:
word_pat = re.compile('[' + string.letters + ']['
+ string.lowercase + ']{3,}')


-Scott David Daniels
(E-Mail Removed)
 
Reply With Quote
 
Steven Bethard
Guest
Posts: n/a
 
      11-12-2004
Scott David Daniels <Scott.Daniels <at> Acm.Org> writes:
>
> Not to prolong this, but if you are dealing with camelCase:
>
> > word_pat = re.compile('[' + string.letters + ']{4,}')

>
> you might want to use:
> word_pat = re.compile('[' + string.letters + ']['
> + string.lowercase + ']{3,}')


Is there any reason not to use [A-z] type regexps?

>>> p = re.compile('[' + string.letters + ']{4,}')
>>> p.findall('fd 234 asdf454 asdfsa4 sadf Qsdha asdfAded')

['asdf', 'asdfsa', 'sadf', 'Qsdha', 'asdfAded']
>>> p = re.compile('[A-z]{4,}')
>>> p.findall('fd 234 asdf454 asdfsa4 sadf Qsdha asdfAded')

['asdf', 'asdfsa', 'sadf', 'Qsdha', 'asdfAded']

>>> p = re.compile('[' + string.letters + ']['+ string.lowercase + ']{3,}')
>>> p.findall('fd 234 asdf454 asdfsa4 sadf Qsdha asdfAded')

['asdf', 'asdfsa', 'sadf', 'Qsdha', 'asdf', 'Aded']
>>> p = re.compile('[A-z][a-z]{3,}')
>>> p.findall('fd 234 asdf454 asdfsa4 sadf Qsdha asdfAded')

['asdf', 'asdfsa', 'sadf', 'Qsdha', 'asdf', 'Aded']

Seems to do the same thing to me, and doesn't require importing string (which
will hopefully be deprecated one of these days...) =)

Steve

 
Reply With Quote
 
Andrew Dalke
Guest
Posts: n/a
 
      11-12-2004
Steven Bethard wrote:
> Is there any reason not to use [A-z] type regexps?


Better support for internationalization, so it will
work in Espa˝a and G÷teborg.



Andrew
(E-Mail Removed)
 
Reply With Quote
 
Steven Bethard
Guest
Posts: n/a
 
      11-12-2004
Andrew Dalke <adalke <at> mindspring.com> writes:
>
> Steven Bethard wrote:
> > Is there any reason not to use [A-z] type regexps?

>
> Better support for internationalization, so it will
> work in Espa├▒a and G├Âteborg.


Ahh. That makes sense of course. Thanks!

I looked again at the re module, and it seems that \w and \W do have
internationalization support... Is there any way to match \w but not \d? Maybe
something like:
r'[^\d\W]{4,}'

This seems to work (maybe?):

>>> p = re.compile(r'[^\d\W]{4,}', re.UNICODE)
>>> p.findall(u'├ęl me compr├│ un globo. 1234 a342')

[u'compr\xf3', u'globo']

I don't know how to check how this works in different locales though...

Steve

 
Reply With Quote
 
Andrew Dalke
Guest
Posts: n/a
 
      11-12-2004
Steven Bethard wrote:
> I looked again at the re module, and it seems that \w and \W do have
> internationalization support... Is there any way to match \w but not \d? Maybe
> something like:
> r'[^\d\W]{4,}'
>
> This seems to work (maybe?):


Yeah, I tried that originally but noticed that digits and '_'
are included, which ruined the idea of the scramble. So I
opted with the OP's choice and hard-coded just the letters.

>>>>p = re.compile(r'[^\d\W]{4,}', re.UNICODE)


Nice way to do it. I would also put '_' in that exclude list.

> I don't know how to check how this works in different
> locales though...


I think I don't understand locales well enough either. It
looks like I need to use re.UNICODE more often.

>>> import string
>>> string.letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVW XYZ'
>>> import re
>>> pat = re.compile(r"[^\d^\W]{4,}", re.UNICODE)
>>> GOT = u"G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"
>>> print GOT.encode("utf8")

G├Âteborg
>>> pat.search(GOT).group(0)

u'G\xf6teborg'
>>> pat = re.compile(r"[^\d^\W]{4,}")
>>> pat.search(GOT).group(0)

u'teborg'
>>>
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")

'C'
>>>


So you can see that re.UNICODE uses the Unicode definition
of what is a letter despite the locale being C.

Therefore, I think your approach is better.

Andrew
(E-Mail Removed)
 
Reply With Quote
 
Carl Banks
Guest
Posts: n/a
 
      11-13-2004
Johannes Nix <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> yesterday I met a cute person


Obviously this wouldn't have happened if you were using Perl.


> (after my dance class) who told me about an
> interesting experiment regarding cognition. People were told to read a
> typed text; However, in every word in the text, only the first and the
> last letter were in the right place. The other letters had their
> positions changed in an arbitrary manner. The surprising result, I was
> told, was that people can read this mixed-up text fairly well.


I never entirely bought this, although I've never gone to the
empirical extreme you did.

If you look at the oft-cited example (see
http://www.snopes.com/language/apocryph/cambridge.asp ), you will see
that the intetior letters are not even close to being in random order.
For the most part, letters near the front of the word remain near the
front, and letters towards the back remain near the back.

My thought is that, the brain doesn't read the word as a whole as they
claim, but neither does it pay too much attention to the _exact_
ordering. So you could scramble the letters a little, and it will
still be recognizable; but scramble them a lot, and you will have
trouble.

I suspect that a fairy tale wouldn't be too hard even with random
ordering, though, since fairy tales don't use a lot of big words.


> Because I am a somewhat sceptical guy, at times, and because I thought
> that I deserved some play, I decided to code the rule above in a
> scriptlet. The resulting 23 lines are below, and the outcome is quite
> interesting, not only from the point of view of librarians :
>
> --------------------------------------------------
> #!/usr/bin/python
> import sys
> import locale
> import string
> import re
> import Numeric
> import RandomArray
>
> locale.setlocale(locale.LC_CTYPE, '')
> wordsep = re.compile('([^%s])' % string.letters)
>
> for line in sys.stdin.xreadlines():
> for word in wordsep.split(line):
> if word and word[0] in string.letters:
> word = string.lower(word)
> wlen = len(word)
> if wlen > 3:
> wa = Numeric.array(word)
> perm = RandomArray.permutation(wlen-2)
> wa[1:wlen-1] = Numeric.take(wa[1:wlen-1],perm)
> word = wa.tostring()
> sys.stdout.write('%s' % word)
>
> --------------------------------------------------
>
>
> For the Uninitiated, Numeric is a package which deals with array data;
> arrays are mutable sequences and Numeric.take() can reorder items in
> them; RandomArray.permutation() delivers the randomized reordering we
> need.
>
> Now I have two innocent questions:
>
> - Is it possible to make it a bit more concise )) ?


Yes, but I'd say you're already at the point where more conciseness
causes an unacceptable hit in the readability. I suspect you also
thought this.


> - Can it coerced to run a little bit faster ?
> (on my oldish, 300 MHz-AMD K6 , run time looks like this
> for a famous, 2663-word-long fairy tale from the Grimm's brothers:
>
> nix@aster:~> time <HaenselundGretel.txt ./python/perlmutt.py >v
>
> real 0m6.970s
> user 0m3.634s
> sys 0m0.120s


Well, the thing about Numeric and numarray is that it's not designed
to scale down well. If performance is your concern, you are probably
better off using regular Python lists if your operations act on small
arrays, as they do in this case. It doesn't surprise me that Steven
Bethard found that ordinary Python ran faster.

However, maybe we can use numarray to a greater advantage. The best
way to use numarray is to operate on as much data as possible, which
in this case is the whole text file. The question is: can we use
operations on the whole text file to do this job? Well, we probably
can't scramble the letters of individual words by operating on the
whole text file at once. But could we find the word boundaries that
way? The answer is yes. Have a look at my snippet to see how:

-----------------------
import numarray as na
from numarray import random_array as ra

def interior_scramble(text):

# First, build a length 256 table indicating if this is a real
# letter. It's a boolean-valued array.

ltable = na.array([ chr(c).isalpha() for c in range(256) ])

# Turn the above text into a numarray array We add a space
# before and after the text, to make it possible to see the
# boundaries on the first and last words

atext = na.fromstring(" %s " % text, na.UInt

# Now, calculate a mask. This uses the binary values of the
# text string to index into the ltable, so that whereever a
# letter appears in the text, a one appears in the mask.

amask = ltable[atext]

# Here's the magic: to find the word boundaries, we xor two
# slices that are offset by one. Learning to use slicing in
# this way is one of the greatest ways to channel the power of
# numarray.

# In the mask array, a start boundary will be a zero followed
# by a one. An end will be a one followed by a zero.
# Therefore, if you xor adjacent entries in the mask array,
# the result will be nonzero whereever there's a boundary.
# So, let's xor two one-off slices (which has the effect of
# xoring adjacent entries) of the mask to get the word
# boundaries.

pat = amask[:-1] ^ amask[1:]

# pat will be nonzero wherever there's a word boundary. Let's
# use numarray.nonzero to get the indices of the word
# boundaries. The word boundaries will be
# start,end,start,end,... so we can reshape the array so that
# each row is a start,end pair.

words = na.reshape(na.nonzero(pat)[0],(-1,2))

# Because of how the resulting array aligns, we have to add 2
# to each start value to get the index where the scrambling
# begins.

words[:] += [2,0]

# We got the word boundaries. Now, loop through them and
# scramble. Because the whole file is a numarray, we don't
# need to convert each word to a numarray and back
# individually. That also saves some time.

for start,end in words:
n = end-start
if n < 2:
continue
perm = ra.permutation(end-start)
atext[start:end] = atext[start:end][perm]

# It's scrambled. Convert the applicable slice back to a
# string.

return atext[1:-1].tostring()
-----------------

I'm not sure if this is faster than your method or the pure-Python
method; I didn't check (I worked too hard writing it . But,
generally speaking, whenever you can operate on your whole data set
using numarray at once, doing so will be a big time savings.


--
CARL BANKS
 
Reply With Quote
 
Bengt Richter
Guest
Posts: n/a
 
      11-13-2004
On Thu, 11 Nov 2004 22:31:57 +0100, Johannes Nix <(E-Mail Removed)> wrote:

>
>Hello,
>
>yesterday I met a cute person (after my dance class) who told me about an
>interesting experiment regarding cognition. People were told to read a
>typed text; However, in every word in the text, only the first and the
>last letter were in the right place. The other letters had their
>positions changed in an arbitrary manner. The surprising result, I was
>told, was that people can read this mixed-up text fairly well.
>
>Because I am a somewhat sceptical guy, at times, and because I thought
>that I deserved some play, I decided to code the rule above in a
>scriptlet. The resulting 23 lines are below, and the outcome is quite
>interesting, not only from the point of view of librarians :
>
>--------------------------------------------------
>#!/usr/bin/python
>import sys
>import locale
>import string
>import re
>import Numeric
>import RandomArray
>
>locale.setlocale(locale.LC_CTYPE, '')
>wordsep = re.compile('([^%s])' % string.letters)
>
>for line in sys.stdin.xreadlines():
> for word in wordsep.split(line):
> if word and word[0] in string.letters:
> word = string.lower(word)
> wlen = len(word)
> if wlen > 3:
> wa = Numeric.array(word)
> perm = RandomArray.permutation(wlen-2)
> wa[1:wlen-1] = Numeric.take(wa[1:wlen-1],perm)
> word = wa.tostring()
> sys.stdout.write('%s' % word)
>
>--------------------------------------------------
>
>
>For the Uninitiated, Numeric is a package which deals with array data;
>arrays are mutable sequences and Numeric.take() can reorder items in
>them; RandomArray.permutation() delivers the randomized reordering we
>need.
>
>Now I have two innocent questions:
>
>- Is it possible to make it a bit more concise )) ?
>
>- Can it coerced to run a little bit faster ?
> (on my oldish, 300 MHz-AMD K6 , run time looks like this
> for a famous, 2663-word-long fairy tale from the Grimm's brothers:
>
> nix@aster:~> time <HaenselundGretel.txt ./python/perlmutt.py >v
>
> real 0m6.970s
> user 0m3.634s
> sys 0m0.120s
>
>
>And two remarks what is interesting about it:
>
>- It's a good example how powerful libraries, like
> Numeric, make one's life easier. (BTW, why is Numeric
> and stuff like take() still not included in the standard
> Library ? Batteries included, but calculator not ?)
>
>- Perhaps it's useful to protect messages in some
> regions with not-so-democratic forms of government
> against automatic scanning by making the message
> machine-unreadable, causing some Orwellian Confusion ?
> Of course, texts from Pythonistas would remain suspicious,
> due to the large number of "y" occurring in them....
>
>have a nice evening....
>

Don't know the speed, but this seems fairly self-documenting to me
(with a little thought :

>>> import random
>>> def messwith(s):

... seqisalpha = False; seq = []
... for c in s:
... if c.isalpha() == seqisalpha: seq.append(c); continue
... elif seqisalpha and len(seq)>3:
... mid = seq[1:-1]
... random.shuffle(mid)
... seq[1:-1] = mid
... yield ''.join(seq)
... seq = [c]
... seqisalpha = c.isalpha()
... if seq: yield ''.join(seq)
...
>>> def jumble(s): return ''.join(messwith(s))

...
>>> jumble('This is an example. It has 7 words ')

'This is an elmxape. It has 7 wrods '

Regards,
Bengt Richter
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
It's not true, it's just all lies and fairy tales!! Lookout@rip.ax.lt Computer Support 0 04-05-2010 01:18 AM
permuting over nested dicts? Christian Meesters Python 8 11-08-2007 12:37 PM
Fairy Tales Roy G Digital Photography 5 12-12-2006 08:40 PM
Permuting using any number of given chars Brian Wakem Perl Misc 1 05-17-2005 10:21 PM
DVD Verdict reviews: SYLVESTER AND THE MAGIC PEBBLE AND MORE MAGICAL TALES and more! DVD Verdict DVD Video 0 04-07-2005 08:10 AM



Advertisments