Regular expression for BOM required

 
 
Peter Gordon
      01-12-2013
#!/cygdrive/c/cygwin/bin/perl
use strict;
use warnings;
use 5.14.0;
open my $fh, '<:encoding(utf16le)', "00Tst.zpl"
    or die "File opening error\n";
while( <$fh> ) {
    say "Found regular expression" if /\xFE\xFF/;
    # say "Found it!" if s/\A.*nm=//;
    print;
}

# I'm trying to match a byte order mark in a file. Below is
# the start of an octal dump of the file.
# 0000000 177377 000156 000155 000075 000142 000157 000164 000164
# The line:
# say "Found it!" if s/\A.*nm=//;
# works correctly, but I can't write a regular expression which matches
# octal 0000000 177377 at the start of a line. Help with the
# regular expression would be appreciated.
# If it matters, I'm working on Windows 7.
 
Peter J. Holzer
      01-12-2013
On 2013-01-12 11:54, Peter Gordon <petergoATnetspace.net.au> wrote:
> #!/cygdrive/c/cygwin/bin/perl
> use strict;
> use warnings;
> use 5.14.0;
> open my $fh, '<:encoding(utf16le)', "00Tst.zpl" or die "File opening error\n";
> while( <$fh> ) {
> say "Found regular expression" if /\xFE\xFF/;


You want to match the single character U+FEFF BOM here, not a sequence
of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
LETTER Y WITH DIAERESIS.

So you have to write

say "Found regular expression" if /\x{FEFF}/;
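
A minimal sketch of the difference, assuming the same UTF-16LE input as in the question. It uses Encode's decode() on a hand-built byte string instead of a file handle, so the sample bytes and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes FF FE are U+FEFF in UTF-16LE; the text "nm=" follows.
my $bytes = "\xFF\xFE" . "n\0m\0=\0";
my $text  = decode('UTF-16LE', $bytes);

# \xFE\xFF means two characters: U+00FE followed by U+00FF.
my $two_chars = $text =~ /\xFE\xFF/ ? 1 : 0;

# \x{FEFF} means the single character U+FEFF.
my $one_char = $text =~ /\x{FEFF}/ ? 1 : 0;

print "two-character pattern matched: $two_chars\n";
print "single-character pattern matched: $one_char\n";
```

Only the second pattern matches, because after decoding the string starts with one character, not two.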

> print;
> }
>
> # I'm trying to match a byte order mark in a file. Below is
> # the start of an octal dump of the file.
> # 0000000 177377 000156 000155 000075 000142 000157 000164 000164

            ^^^^^^
The default output format of od (little endian 16 bit values in octal)
is confusing. Yes, 0xFEFF is 0177377 in octal, but 177377 looks too much
like 7FFF for me to do the bitshift intuitively in my head.

Better to use "od -tx1" or "od -tx2".
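
If you'd rather not do the octal conversion in your head, a short sketch like this does the arithmetic. It only converts the first word from the dump above; the variable names are made up:

```perl
use strict;
use warnings;

# First 16-bit word from the od dump, written in octal.
my $word = oct('177377');

# The same value in hexadecimal.
my $hex = sprintf '%04X', $word;
print "$hex\n";

# The two bytes as stored in a little-endian file, lowest byte first.
my @bytes = map { sprintf '%02x', $_ } unpack 'C2', pack 'v', $word;
print "@bytes\n";
```

The second line of output is the byte sequence a byte-wise dump such as "od -tx1" would show at the start of the file.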

hp



--
_ | Peter J. Holzer | The curse of electronic word processing:
|_|_) | Sysadmin WSR | you keep polishing your text until
| | | (E-Mail Removed) | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fit together. -- Ralph Babel
 
Peter Gordon
      01-12-2013
"Peter J. Holzer" <(E-Mail Removed)> wrote in news:slrnkf30s7.kis.hjp-
(E-Mail Removed):

> You want to match the single character U+FEFF BOM here, not a sequence
> of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
> LETTER Y WITH DIAERESIS.
>
> So you have to write
>
> say "Found regular expression" if /\x{FEFF}/;
>
> print;
> }
>

Thanks Peter,
It was the curly braces which I was missing.
 
Peter J. Holzer
      01-14-2013
On 2013-01-14 10:12, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> Peter Gordon wrote:
>> "Peter J. Holzer" <(E-Mail Removed)> wrote in news:slrnkf30s7.kis.hjp-
>> (E-Mail Removed):
>>
>>> You want to match the single character U+FEFF BOM here, not a sequence
>>> of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
>>> LETTER Y WITH DIAERESIS.
>>>
>>> So you have to write
>>>
>>> say "Found regular expression" if /\x{FEFF}/;
>>>
>>> print;
>>> }
>>>

>> Thanks Peter,
>> It was the curly braces which I was missing.
>>

>
> Presumably you also have to check for the "other order" ?


No. After decoding there is no byte order any more, just characters, and
the character you want to match is \x{FEFF}.

If you try to open a big-endian file with :encoding(utf16le), the script
will die trying to read the first line.

(If you open it with :encoding(utf16), the BOM will be used to determine
endianness and *not* passed through - this seems a little inconsistent
to me)
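
A small sketch of that difference, using Encode's decode() directly rather than an :encoding layer on a file handle (my assumption is that the layer behaves the same way, since it is built on Encode). The sample bytes are made up:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Little-endian BOM (FF FE) followed by the text "nm".
my $bytes = "\xFF\xFE" . "n\0m\0";

# Explicit endianness: the BOM comes through as the character U+FEFF.
my $explicit = decode('UTF-16LE', $bytes);

# Generic UTF-16: the BOM selects the byte order and is consumed.
my $generic = decode('UTF-16', $bytes);

printf "UTF-16LE: %d characters, first is U+%04X\n",
    length($explicit), ord($explicit);
printf "UTF-16:   %d characters, first is U+%04X\n",
    length($generic), ord($generic);
```

With UTF-16LE the result is three characters starting with U+FEFF; with UTF-16 it is just the two characters "nm".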

hp


Peter Gordon
      01-14-2013
bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

> Peter Gordon wrote:
>> "Peter J. Holzer" <(E-Mail Removed)> wrote in news:slrnkf30s7.kis.hjp-
>> (E-Mail Removed):
>>
>>> You want to match the single character U+FEFF BOM here, not a sequence
>>> of two characters U+00FE LATIN SMALL LETTER THORN U+00FF LATIN SMALL
>>> LETTER Y WITH DIAERESIS.
>>>
>>> So you have to write
>>>
>>> say "Found regular expression" if /\x{FEFF}/;
>>>
>>> print;
>>> }
>>>

>> Thanks Peter,
>> It was the curly braces which I was missing.
>>

>
> Presumably you also have to check for the "other order" ?
>
> BugBear

The files I'm editing are playlists of Zoomplayer, an Israeli media
player, so their Unicode encoding and format are consistent. Is there a
way to get Unicode to work with the combination of the diamond operator
and in-place editing?
The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
but crashes when I try to run it with the -i command line option. eg:
$perl -i insertTT.pl aa.zpl

#!/cygdrive/c/cygwin/bin/perl
# Used to insert a "tt=NUMBER: " line into new .df files.
use strict;
use warnings;
use 5.14.0;
use Encode qw(encode decode);
use open qw(:std IN :encoding(utf16-le));

# $^I = ".bak";
my $first = 1;
while( <> ) {
    my $line = $_;
    if ( $first == 1 ) {
        $line =~ s/\x{FEFF}nm=(.*)/nm=$1/;
        $first = 0;
    }
    $line = decode("utf8", $line);
    print $line;
    if ( $line =~ /nm=/ ) {
        my $num = $line;
        chomp($num);
        $num =~ s/nm=.*?(\d+).*/$1/;
        print "tt=$num: \n";
    }
}

 
Peter J. Holzer
      01-15-2013
On 2013-01-14 21:04, Peter Gordon <petergoATnetspace.net.au> wrote:
> The code below runs fine when run as a program eg: $insertTT.pl aa.zpl
> but crashes when I try to run it with the -i command line option. eg:


If perl crashes you should file a bug report.

hp


Peter J. Holzer
      01-17-2013
On 2013-01-17 12:16, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> Peter J. Holzer wrote:
>> On 2013-01-14 10:12, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
>>> Peter Gordon wrote:
>>>> "Peter J. Holzer" <(E-Mail Removed)> wrote in news:slrnkf30s7.kis.hjp-
>>>> (E-Mail Removed):

[$_ was read from a file opened with ":encoding(utf16le)"]
>>>>> say "Found regular expression" if /\x{FEFF}/;

[...]
>>> Presumably you also have to check for the "other order" ?

>>
>> No. After decoding there is no byte order any more, just characters, and
>> the character you want to match is \x{FEFF}.
>>
>> If you try to open a big-endian file with :encoding(utf16le), the script
>> will die trying to read the first line.
>>
>> (If you open it with :encoding(utf16), the BOM will be used to determine
>> endianness and *not* passed through - this seems a little inconsistent
>> to me)

>
> I had (perhaps wrongly) assumed that the OP's true intent (or need)
> was to read the BOM and use it to decide *which* byte order
> was being used, and hence to use the correct decoder.


If that was the intent of the OP, opening the file in one byte order and
checking for a reversed BOM wouldn't work: The diamond operator dies
when it encounters the wrong BOM (of course you could catch the
exception and then try the other endianness).

I think there are two good ways to open UTF-16 files with unknown byte
order:

1) The carefree method: Just use :encoding(utf16), and it will
automatically determine the endianness from the BOM, and you don't
have to care whether the file is little or big endian. Plus, the BOM
is automatically filtered out so you don't have to. On the flipside,
you lose the information about the endianness and the BOM, so if you
need that, this isn't for you.

2) Open the file in binary mode and read the first few bytes. Determine
the correct encoding from those, rewind and set the encoding layer.
This is more work, but a lot more flexible: You can detect any
encoding you want.

As always, there are probably more ways to do it.
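
A rough sketch of the second method, restricted to UTF-16 with a BOM. The helper name open_utf16_by_bom and the temporary demo file are made up for illustration; a real version would probably also want to recognise UTF-8 and UTF-32 BOMs and decide what to do with files that have none:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open $path, choose the decoder from the BOM,
# and return the handle together with the encoding name.
sub open_utf16_by_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";

    # Read the first two bytes in binary mode.
    my $got = read($fh, my $head, 2);
    die "Cannot read $path: $!\n" unless defined $got;

    my $enc = $head eq "\xFF\xFE" ? 'UTF-16LE'
            : $head eq "\xFE\xFF" ? 'UTF-16BE'
            :                       die "No UTF-16 BOM in $path\n";

    # Rewind, then set the encoding layer on the same handle.
    seek($fh, 0, 0) or die "Cannot rewind $path: $!\n";
    binmode($fh, ":encoding($enc)") or die "Cannot set layer: $!\n";
    return ($fh, $enc);
}

# Demo: write a small big-endian file, then read it back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFE\xFF\0n\0m\0=\0\n";
close $tmp or die "Cannot close $path: $!\n";

my ($fh, $enc) = open_utf16_by_bom($path);
my $line = <$fh>;
$line =~ s/\A\x{FEFF}//;    # the BOM is passed through after the rewind
chomp $line;
print "$enc: $line\n";
```

Because the handle is rewound before the layer is set, the BOM arrives as U+FEFF in the first line, so the caller still sees it and can strip it or keep it as needed.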

hp
