regexp for removing {} around latin1 characters

 
 
Michael Friendly
 
      11-27-2009
I have BibTeX files containing accented characters in forms like
{Johann Peter S{\"u}ssmilch}
{Johann Peter S\"ussmilch}
where, in BibTex, the {} are optional.

To export these to, e.g., EndNote, I have to translate these latex
encodings to latin1, which I can largely do with the unix recode tool.
However, recode cheerfully copies the {}s which mess up things when
I import them.

% echo '{Johann Peter S{\"u}ssmilch},' | recode latex..latin1
{Johann Peter S{ü}ssmilch},

So, I'm looking to complete the process by finding a regexp to remove
the braces around single accented latin1 characters.

recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"


--
Michael Friendly Email: (E-Mail Removed)
Professor, Psychology Dept.
York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT M3J 1P3 CANADA
 
 
 
 
 
Michael Friendly
 
      11-27-2009
Peter J. Holzer wrote:
> On 2009-11-27 17:39, Glenn Jackman <(E-Mail Removed)> wrote:
>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>> I have BibTeX files containing accented characters in forms like
>>> {Johann Peter S{\"u}ssmilch}
>>> {Johann Peter S\"ussmilch}
>>> where, in BibTex, the {} are optional.

>> [...]
>>> So, I'm looking to complete the process by finding a regexp to remove
>>> the braces around single accented latin1 characters.
>>>
>>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"

>> Maybe:
>>
>> s#{(\\.(?:{.+?}|.+?))}#$1#g

>
> more likely:
>
> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
>
>
> I think you are trying to replace the recode, too, but for that you need
> a lookup table with all the accented characters.
>
> hp
>


No, all I want to do is to strip the {} around the accented characters;
recode does the conversion well. With the small test bib file below,
here's what I get using only recode, vs. recode + perl


% recode latex..latin1 < timeref.bib | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter S{ü}ssmilch},

% recode latex..latin1 < timeref.bib | perl -pe "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
@BOOK{Sussmilch:1741,
author = {Johann Peter Sssmilch},

Note that the ü just disappears.

---begin timeref.bib ---
@ARTICLE{Buache:1752,
author = {Buache, Phillippe},
title = {Essai De G{\'e}ographie Physique},
journal = {M{\'e}moires de L'Acad{\'e}mie Royale des Sciences},
year = {1752},
pages = {399--416},
note = {\Loc{BNF: Ge.FF-8816-8822}},
annote = {Contour map},
oldnum = {2}
}

@BOOK{Crome:1785,
title = {{\"U}ber die Gr{\"o}sse and Bev{\"o}lkerung der
S{\"a}mtlichen Europ{\"a}schen
Staaten},
publisher = {Weygand},
year = {1785},
author = {Crome, August F. W.},
address = {Leipzig},
annote = {Superimposed squares to compare areas (of European states)},
oldnum = {5}
}

@BOOK{Sussmilch:1741,
title = {Die g{\"o}ttliche Ordnung in den Ver\"anderungen des
menschlichen Geschlechts,
aus der Geburt, Tod, und Fortpflantzung},
publisher = {n.p.},
year = {1741},
author = {Johann Peter S{\"u}ssmilch},
address = {Germany},
note = {(published in French translation as \emph{L'ordre divin. dans les
changements de l'esp\`ece humaine, d{\'e}montr{\'e} par la naissance,
la mort et la propagation de celle-ci}, trans: Jean-Marc Rohrbasser,
Paris: INED, 1998, ISBN 2-7332-1019-X)},
url = {http://www.ined.fr/publicat/collections/classiques/Ordivin.htm}
}
--- end timeref.bib -----



--
Michael Friendly Email: (E-Mail Removed)
Professor, Psychology Dept.
York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT M3J 1P3 CANADA
 
 
 
 
 
sln@netherlands.com
 
      11-27-2009
On Fri, 27 Nov 2009 15:59:59 -0500, Michael Friendly <(E-Mail Removed)> wrote:

>Peter J. Holzer wrote:
>> On 2009-11-27 17:39, Glenn Jackman <(E-Mail Removed)> wrote:
>>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>>> I have BibTeX files containing accented characters in forms like
>>>> {Johann Peter S{\"u}ssmilch}
>>>> {Johann Peter S\"ussmilch}
>>>> where, in BibTex, the {} are optional.
>>> [...]
>>>> So, I'm looking to complete the process by finding a regexp to remove
>>>> the braces around single accented latin1 characters.
>>>>
>>>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
>>> Maybe:
>>>
>>> s#{(\\.(?:{.+?}|.+?))}#$1#g

>>
>> more likely:
>>
>> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
>>
>>
>> I think you are trying to replace the recode, too, but for that you need
>> a lookup table with all the accented characters.
>>
>> hp
>>

>
>No, all I want to do is to strip the {} around the accented characters;
>recode does the conversion well. With the small test bib file below,
>here's what I get using only recode, vs. recode + perl
>
>
> % recode latex..latin1 < timeref.bib | grep ssmilch
>@BOOK{Sussmilch:1741,
> author = {Johann Peter S{ü}ssmilch},
>
> % recode latex..latin1 < timeref.bib | perl -pe
>"s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
>@BOOK{Sussmilch:1741,
> author = {Johann Peter Sssmilch},
>
>Note that the ü just disappears.
>

I didn't have a problem with the substitution when run as
a stand-alone Perl program. The one-liner may need its STDOUT
adjusted with binmode().

----------------------
perl gg.pl > itt.txt

itt.txt (from word):
252
unix crlf encoding(iso-8859-1) utf8
{Johann Peter Süssmilch}
::
However, I don't need to set the STDOUT
encoding. The default does the same thing,
probably because the string stayed a byte
string throughout the regex, and the Latin-1
characters 0-255 have the same code points in Unicode.

-sln
--------------------

use strict;
use warnings;

print ord("\xFC"),"\n";  # 252, the Latin-1 code for u-umlaut

my $str = "{Johann Peter S{\xFC}ssmilch}";

$str =~ s/\{([\xC0-\xFF])\}/$1/g;

# try one of these:
binmode (STDOUT, ":encoding(latin-1)");
#binmode (STDOUT);
#binmode (STDOUT, ":raw");

print "@{[PerlIO::get_layers(STDOUT)]}\n";

print "$str\n";
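
The effect of the different binmode() choices can be checked without a terminal getting in the way. The following is only a sketch: the helper name bytes_written is hypothetical, and it writes to an in-memory filehandle rather than STDOUT so the emitted bytes can be inspected directly.

```perl
use strict;
use warnings;

# Write $text to an in-memory file through the given PerlIO layer
# and return the raw bytes that came out the other end.
# (bytes_written is an illustrative helper, not part of the thread.)
sub bytes_written {
    my ($layer, $text) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buf;
}

my $str = "S{\xFC}ssmilch";          # \xFC is u-umlaut in Latin-1
$str =~ s/\{([\xA0-\xFF])\}/$1/g;     # strip the braces, keep the byte

for my $layer (':raw', ':encoding(iso-8859-1)', ':encoding(UTF-8)') {
    my $out = bytes_written($layer, $str);
    # Show each output byte as two hex digits.
    printf "%-22s %s\n", $layer,
        join ' ', map { sprintf '%02X', ord } split //, $out;
}
```

With :raw and :encoding(iso-8859-1) the u-umlaut should come out as the single byte FC; only the UTF-8 layer turns it into the two bytes C3 BC. That is consistent with the observation that the default layer already produces Latin-1 output here.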

 
 
Peter J. Holzer
 
      11-28-2009
[Please don't cc usenet postings]

On 2009-11-27 20:59, Michael Friendly <(E-Mail Removed)> wrote:
> Peter J. Holzer wrote:
>> On 2009-11-27 17:39, Glenn Jackman <(E-Mail Removed)> wrote:
>>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>>> I have BibTeX files containing accented characters in forms like
>>>> {Johann Peter S{\"u}ssmilch}
>>>> {Johann Peter S\"ussmilch}
>>>> where, in BibTex, the {} are optional.
>>> [...]
>>>> So, I'm looking to complete the process by finding a regexp to remove
>>>> the braces around single accented latin1 characters.
>>>>
>>>> recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
>>> Maybe:
>>>
>>> s#{(\\.(?:{.+?}|.+?))}#$1#g

>>
>> more likely:
>>
>> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
>>
>>
>> I think you are trying to replace the recode, too, but for that you need
>> a lookup table with all the accented characters.

>
> No, all I want to do is to strip the {} around the accented characters;


Yes, I was following up to Glenn here, so the "you" was referring to him.
If you look at his regexp, you will see that it matches, for example,
{\"{u}} or {\"u}. That doesn't work after the recode, because the \" has
already been replaced.
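
A small sketch of that difference, assuming Glenn's pattern with its literal braces backslash-escaped (an adjustment made here for portability across perl versions) and a hypothetical wrapper name strip_latex_braces:

```perl
use strict;
use warnings;

# Glenn's substitution, with the literal braces escaped.
# It looks for a brace group that starts with a backslash sequence.
sub strip_latex_braces {
    my ($line) = @_;
    $line =~ s#\{(\\.(?:\{.+?\}|.+?))\}#$1#g;
    return $line;
}

# Before recode: the LaTeX escape is still present, so the pattern matches
# and only the braces around the escape are removed.
print strip_latex_braces('{Johann Peter S{\"u}ssmilch}'), "\n";

# After recode: the escape has become the single byte 0xFC, there is no
# backslash left for the pattern to find, and the line comes back unchanged.
my $recoded = "{Johann Peter S{\xFC}ssmilch}";
print strip_latex_braces($recoded) eq $recoded ? "unchanged\n" : "changed\n";
```

So that pattern is useful only on the raw BibTeX source, while the character-class pattern is the one that fits the output of recode.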


> recode does the conversion well. With the small test bib file below,
> here's what I get using only recode, vs. recode + perl
>
>
> % recode latex..latin1 < timeref.bib | grep ssmilch
> @BOOK{Sussmilch:1741,
> author = {Johann Peter S{ü}ssmilch},
>
> % recode latex..latin1 < timeref.bib | perl -pe
> "s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
> @BOOK{Sussmilch:1741,
> author = {Johann Peter Sssmilch},
>
> Note that the ü just disappears.


You are using a unixish system? The shell replaces $1 inside the double
quotes with the current value of the shell variable $1 (in your case
probably nothing), so that the code that perl sees is:

s|\{([\xA0-\xFF])\}||g

On unixish systems you should always use single quotes to enclose perl
code unless you want the shell to substitute part of your code. Since
you were using double quotes I was assuming you are on Windows.
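
To make the two cases concrete, here is a minimal stand-alone sketch that sidesteps the shell entirely. The sub names strip_braces, strip_braces_mangled and show are hypothetical; the second sub reproduces the substitution perl receives when the shell has replaced $1 with nothing.

```perl
use strict;
use warnings;

# Intended substitution: keep the captured byte, drop the braces around it.
sub strip_braces {
    my ($line) = @_;
    $line =~ s|\{([\xA0-\xFF])\}|$1|g;
    return $line;
}

# What perl is handed once the shell has expanded an unset $1 inside
# double quotes: the replacement part is empty, so the byte is deleted.
sub strip_braces_mangled {
    my ($line) = @_;
    $line =~ s|\{([\xA0-\xFF])\}||g;
    return $line;
}

# Render high bytes as \xNN so the result is readable on any terminal.
sub show {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/sprintf '\\x%02X', ord $1/ge;
    return $s;
}

my $line = "author = {Johann Peter S{\xFC}ssmilch},";   # \xFC is u-umlaut in Latin-1

print show(strip_braces($line)), "\n";          # braces gone, byte kept
print show(strip_braces_mangled($line)), "\n";  # braces and byte both gone
```

On the command line the same intent would be written with single quotes, i.e. perl -pe 's|\{([\xA0-\xFF])\}|$1|g', so that the $1 reaches perl untouched.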

In general you should only use one-liners if you are familiar with the
shell you are using.

hp

 