Velocity Reviews - Computer Hardware Reviews

Help to find a regular expression to parse po file

 
 
gialloporpora
 
      07-06-2009
Hi all,
I would like to extract the strings from a PO file. To do this I wrote
a small Python function that parses the file and extracts the strings:

import re
regex = re.compile(r"msgid (.*)\nmsgstr (.*)\n\n")
m = regex.findall(s)

where s is the content of a PO file like this:

msgctxt "write ubiquity commands.description"
msgid "Takes you to the Ubiquity <a
href=\"chrome://ubiquity/content/editor.html\">command editor</a> page."
msgstr "Apre l'<a href=\"chrome://ubiquity/content/editor.html\">editor
dei comandi</a> di Ubiquity."


#. list ubiquity commands command:
#. use | to separate multiple name values:
msgctxt "list ubiquity commands.names"
msgid "list ubiquity commands"
msgstr "elenco comandi disponibili"

msgctxt "list ubiquity commands.description"
msgid "Opens <a href=\"chrome://ubiquity/content/cmdlist.html\">the
list</a>\n"
" of all Ubiquity commands available and what they all do."
msgstr "Apre una <a
href=\"chrome://ubiquity/content/cmdlist.html\">pagina</a>\n"
" in cui sono elencati tutti i comandi disponibili e per ognuno
viene spiegato in breve a cosa serve."



#. change ubiquity settings command:
#. use | to separate multiple name values:
msgctxt "change ubiquity settings.names"
msgid "change ubiquity settings|change ubiquity preferences|change
ubiquity skin"
msgstr "modifica impostazioni di ubiquity|modifica preferenze di
ubiquity|modifica tema di ubiquity"

msgctxt "change ubiquity settings.description"
msgid "Takes you to the <a
href=\"chrome://ubiquity/content/settings.html\">settings</a> page,\n"
" where you can change your skin, key combinations, etc."
msgstr "Apre la pagina <a
href=\"chrome://ubiquity/content/settings.html\">delle impostazioni</a>
di Ubiquity,\n"
" dalla quale è possibile modificare la combinazione da tastiera
utilizzata per richiamare Ubiquity, il tema, ecc."



But, obviously, with the code above the last string is not matched. If
I use re.DOTALL so that "." also matches newline characters, it does not
work because the pattern then matches the entire file; I would like the
match to stop when "msgstr" is found.

regex = re.compile(r"msgid (.*)\nmsgstr (.*)\n\n\n", re.DOTALL)

Is it possible or not?
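
For illustration, a minimal sketch of the behaviour described above: with re.DOTALL the greedy ".*" runs to the last "msgstr" in the text, while a non-greedy ".*?" stops at the first one. The sample text `s` and the names `greedy` and `lazy` are made up for this sketch, and the lookahead that ends the second group at a blank line is only one possible choice.

```python
import re

# Hypothetical two-entry sample; real PO files carry more metadata.
s = (
    'msgid "one"\n'
    '" more"\n'
    'msgstr "uno"\n'
    '" ancora"\n'
    '\n'
    'msgid "two"\n'
    'msgstr "due"\n'
)

# Greedy: with DOTALL, ".*" also crosses newlines and takes as much as it can,
# so the first group swallows everything up to the last "msgstr".
greedy = re.compile(r'msgid (.*)\nmsgstr (.*)', re.DOTALL)
greedy_matches = greedy.findall(s)

# Non-greedy: ".*?" stops at the first "\nmsgstr "; the second group stops
# at a blank line or at the end of the text.
lazy = re.compile(r'msgid (.*?)\nmsgstr (.*?)(?=\n\n|\n?\Z)', re.DOTALL)
lazy_matches = lazy.findall(s)

print(len(greedy_matches), len(lazy_matches))
```

The captured groups still contain the surrounding quotes and the line breaks between continuation strings, so they need tidying afterwards.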




 
 
 
 
 
Hallvard B Furuseth
 
      07-06-2009
gialloporpora writes:
> I would like to extract the strings from a PO file. To do this I wrote
> a small Python function that parses the file and extracts the strings:
>
> import re
> regex = re.compile(r"msgid (.*)\nmsgstr (.*)\n\n")
> m = regex.findall(s)


I don't know the syntax of a po file, but this works for the
snippet you posted:

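import re

# One double-quoted string; a backslash escapes the character after it.
arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
# One or more such strings separated by whitespace (continuation lines).
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)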
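
A possible way to try the pattern on a small input is sketched below. The sample text and the name `pairs` are made up for this example, and all pattern pieces are written as raw strings so that "\s" reaches the re module unchanged.

```python
import re

# One double-quoted string; a backslash escapes the character after it.
arg_re = r'"[^\\"]*(?:\\.[^\\"]*)*"'
# One or more such strings separated by whitespace (continuation lines).
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)

# Hypothetical sample with one single-line and one multi-line entry.
sample = (
    'msgctxt "list ubiquity commands.names"\n'
    'msgid "list ubiquity commands"\n'
    'msgstr "elenco comandi disponibili"\n'
    '\n'
    'msgid "Opens <a href=\\"x.html\\">the list</a>\\n"\n'
    '" of all commands."\n'
    'msgstr "Apre una <a href=\\"x.html\\">pagina</a>\\n"\n'
    '" con tutti i comandi."\n'
)

pairs = find_re.findall(sample)
for msgid, msgstr in pairs:
    print(repr(msgid), '->', repr(msgstr))
```

Each captured group keeps its quotes and any whitespace between continuation strings. Note that the final "\n" in the pattern means the last entry is only found if the text ends with a newline.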

However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
something.
Can there be other keywords between msgid and msgstr? If so,
add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
Can msgstr come before msgid? If so, forget using a single regexp.
Anything else to the syntax to look out for? Single quotes, maybe?

Is it a problem if the regexp isn't quite right and doesn't match all
cases, yet doesn't report an error when that happens?

All in all, it may be a bad idea to squeeze this into a single regexp.
It gets ugly real fast. Might be better to parse the file in a more
regular way, maybe using regexps just to extract each (keyword, "value")
pair.
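
As a rough sketch of that line-by-line approach: the function below reads the text one line at a time and uses a regexp only to pull a (keyword, "value") pair out of each line. The names `line_re` and `parse_po` are hypothetical, the format assumptions (entries separated by blank lines, "#" comments, one quoted string per line) are simplifications, and escape sequences inside the strings are left undecoded.

```python
import re

# An optional keyword (e.g. msgid, msgstr, msgstr[0]) followed by one
# double-quoted string; continuation lines have no keyword.
line_re = re.compile(r'^(?:(\w+(?:\[\d+\])?)\s+)?("(?:[^"\\]|\\.)*")\s*$')

def parse_po(text):
    """Yield one dict per entry, mapping keyword -> concatenated raw value."""
    entry, key = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current entry.
            if entry:
                yield entry
            entry, key = {}, None
            continue
        if line.startswith('#'):
            continue  # comment line
        m = line_re.match(line)
        if m is None:
            raise ValueError('unrecognised line: %r' % line)
        keyword, value = m.groups()
        value = value[1:-1]  # drop the surrounding quotes
        if keyword:
            key = keyword
            entry[key] = value
        elif key is not None:
            entry[key] += value  # continuation of the previous keyword
        else:
            raise ValueError('string without keyword: %r' % line)
    if entry:
        yield entry
```

Unlike a single large regexp, this raises an error on a line it does not understand instead of silently skipping it.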

--
Hallvard
 
 
 
 
 
MRAB
 
      07-06-2009
gialloporpora wrote:
> Hi all,
> I would like to extract the strings from a PO file. To do this I wrote
> a small Python function that parses the file and extracts the strings:
>
> import re
> regex = re.compile(r"msgid (.*)\nmsgstr (.*)\n\n")
> m = regex.findall(s)
>
> But, obviously, with the code above the last string is not matched. If
> I use re.DOTALL so that "." also matches newline characters, it does not
> work because the pattern then matches the entire file; I would like the
> match to stop when "msgstr" is found.
>
> regex = re.compile(r"msgid (.*)\nmsgstr (.*)\n\n\n", re.DOTALL)
>
> Is it possible or not?
>

You could try:

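# Single quotes outside because the pattern contains "; re.MULTILINE lets $ match at each line end.
regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)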

and then, if necessary, tidy what you get.
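
A minimal sketch of what the tidying step could look like, assuming the pattern is compiled with re.MULTILINE so that "$" matches at the end of each line. The helper `tidy` and the sample text are hypothetical; the helper only joins continuation lines and strips the quotes, and does not decode escape sequences such as \n or \".

```python
import re

regex = re.compile(
    r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

def tidy(raw):
    """Join continuation lines and drop the surrounding quotes of each one."""
    return ''.join(part.strip()[1:-1] for part in raw.split('\n'))

# Hypothetical sample; "\\n" is a literal backslash followed by "n".
sample = (
    'msgid "first line\\n"\n'
    '" second line"\n'
    'msgstr "prima riga\\n"\n'
    '" seconda riga"\n'
    '\n'
    'msgid "short"\n'
    'msgstr "breve"\n'
)

pairs = [(tidy(a), tidy(b)) for a, b in regex.findall(sample)]
for msgid, msgstr in pairs:
    print(repr(msgid), '->', repr(msgstr))
```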
 
 
gialloporpora
 
      07-06-2009
In reply to Hallvard B Furuseth's message:



Thank you very much, Hallvard, it seems to work. There is a strange
match in the file header, but I can skip the first match.
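
The strange first match is presumably the PO header entry, which gettext stores under an empty msgid. Instead of always dropping the first match, one option is to filter on that, as in the sketch below. The `pairs` list is made-up data in the shape a findall() call with two groups might return (quotes included), and `is_header` is a hypothetical helper.

```python
# Hypothetical (msgid, msgstr) pairs as captured by a two-group regexp,
# quotes included; the first one stands for the PO header entry.
pairs = [
    ('""', '""\n"Content-Type: text/plain; charset=UTF-8\\n"'),
    ('"list ubiquity commands"', '"elenco comandi disponibili"'),
]

def is_header(msgid):
    """True if the captured msgid consists only of empty quoted strings."""
    return all(part.strip() == '""' for part in msgid.split('\n'))

# Keep only real translation entries.
translations = [(i, s) for i, s in pairs if not is_header(i)]
print(translations)
```

Checking every line of the msgid matters because a multi-line msgid may also start with an empty "" on its first line.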


The PO files have this structure:
http://bit.ly/18qbVc

msgid "string to translate"
" second string to match"
" n string to match"
msgstr "translated string"
" second translated string"
" n translated string"
One or more blank lines follow before the next group.

In the past I wrote a Python script to parse PO files where msgid
and msgstr are on two consecutive lines, for example:

msgid "string to translate"
msgstr "translated string"

Now the problem is how to also match the (optional) continuation strings
that follow msgid and msgstr.

Sandro





 
 
gialloporpora
 
      07-06-2009
In reply to MRAB's message:

> You could try:
>
> regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)
>
> and then, if necessary, tidy what you get.



MRAB, thank you for your help. I tried the code posted by Hallvard
first because I saw it first, and it works. Now I'll also check your
suggestion.
Sandro

--
*Pink Floyd – The Great Gig in the Sky* - http://sn.im/kggo7
* FAQ* di /it-alt.comp.software.mozilla/: http://bit.ly/1MZ04d
 