Velocity Reviews

gialloporpora 07-06-2009 02:21 PM

Help to find a regular expression to parse po file
 
Hi all,
I would like to extract strings from a PO file. To do this I have written
a small Python function that parses the file and extracts the strings:

import re
regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
m=regex.findall(s)

where s is the content of a PO file like this:

msgctxt "write ubiquity commands.description"
msgid "Takes you to the Ubiquity <a
href=\"chrome://ubiquity/content/editor.html\">command editor</a> page."
msgstr "Apre l'<a href=\"chrome://ubiquity/content/editor.html\">editor
dei comandi</a> di Ubiquity."


#. list ubiquity commands command:
#. use | to separate multiple name values:
msgctxt "list ubiquity commands.names"
msgid "list ubiquity commands"
msgstr "elenco comandi disponibili"

msgctxt "list ubiquity commands.description"
msgid "Opens <a href=\"chrome://ubiquity/content/cmdlist.html\">the
list</a>\n"
" of all Ubiquity commands available and what they all do."
msgstr "Apre una <a
href=\"chrome://ubiquity/content/cmdlist.html\">pagina</a>\n"
" in cui sono elencati tutti i comandi disponibili e per ognuno
viene spiegato in breve a cosa serve."



#. change ubiquity settings command:
#. use | to separate multiple name values:
msgctxt "change ubiquity settings.names"
msgid "change ubiquity settings|change ubiquity preferences|change
ubiquity skin"
msgstr "modifica impostazioni di ubiquity|modifica preferenze di
ubiquity|modifica tema di ubiquity"

msgctxt "change ubiquity settings.description"
msgid "Takes you to the <a
href=\"chrome://ubiquity/content/settings.html\">settings</a> page,\n"
" where you can change your skin, key combinations, etc."
msgstr "Apre la pagina <a
href=\"chrome://ubiquity/content/settings.html\">delle impostazioni</a>
di Ubiquity,\n"
" dalla quale è possibile modificare la combinazione da tastiera
utilizzata per richiamare Ubiquity, il tema, ecc."



but, obviously, with the code above the last string is not matched. If
I use re.DOTALL so that "." also matches newline characters, it does not
work because it matches the entire file; I would like the match to stop
when "msgstr" is found.

regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n",re.DOTALL)

Is it possible or not?
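
A minimal sketch of why re.DOTALL swallows the whole file, and one way to stop at "msgstr". The sample text and the names sample, greedy and lazy are made up for illustration; the lazy pattern assumes that entries are separated by a blank line and that no quoted value contains a real newline followed by "msgstr ".

```python
import re

# Hypothetical two-entry sample; real PO files carry more fields.
sample = (
    'msgid "first"\n'
    'msgstr "primo"\n'
    '\n'
    'msgid "second line one\\n"\n'
    '" line two"\n'
    'msgstr "secondo uno\\n"\n'
    '" due"\n'
)

# Greedy: with DOTALL the first (.*) runs to the end of the text and
# backtracks only as far as the LAST "\nmsgstr ", so both entries end up
# inside a single match.
greedy = re.compile(r'msgid (.*)\nmsgstr (.*)', re.DOTALL)

# Lazy: (.*?) stops at the FIRST "\nmsgstr "; the lookahead ends the second
# group at a blank line or at the end of the text.
lazy = re.compile(r'msgid (.*?)\nmsgstr (.*?)(?=\n\s*\n|\s*\Z)', re.DOTALL)

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)
```

Here greedy_matches holds a single tuple spanning both entries, while lazy_matches holds one (msgid, msgstr) tuple per entry, with the surrounding quotes still in place.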





Hallvard B Furuseth 07-06-2009 03:04 PM

Re: Help to find a regular expression to parse po file
 
gialloporpora writes:
> I would like to extract strings from a PO file. To do this I have written
> a small Python function that parses the file and extracts the strings:
>
> import re
> regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n")
> m=regex.findall(s)


I don't know the syntax of a po file, but this works for the
snippet you posted:

arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
find_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)

However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
something.
Can there be other keywords between msgid and msgstr? If so,
add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
Can msgstr come before msgid? If so, forget using a single regexp.
Anything else to the syntax to look out for? Single quotes, maybe?

Is it a problem if the regexp isn't quite right and doesn't match all
cases, yet doesn't report an error when that happens?

All in all, it may be a bad idea to squeeze this into a single regexp.
It gets ugly real fast. Might be better to parse the file in a more
regular way, maybe using regexps just to extract each (keyword, "value")
pair.

--
Hallvard

MRAB 07-06-2009 03:12 PM

Re: Help to find a regular expression to parse po file
 
gialloporpora wrote:
> I would like to extract strings from a PO file.
> [...]
> If I use re.DOTALL [...] it matches the entire file; I would like the
> match to stop when "msgstr" is found.
>
> regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n",re.DOTALL)
>
> Is it possible or not?

You could try:

regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)

and then, if necessary, tidy what you get.
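
One possible way to "tidy" a captured group, sketched under assumptions: the capture is a run of double-quoted pieces separated by whitespace, and only the escapes \n, \t, \" and \\ need decoding. The names tidy, QUOTED_PIECE and SIMPLE_ESCAPES are hypothetical, not part of any library.

```python
import re

# One double-quoted piece, allowing backslash escapes inside it.
QUOTED_PIECE = re.compile(r'"((?:[^"\\]|\\.)*)"')
# Only a few common escapes are decoded; anything else is left untouched.
SIMPLE_ESCAPES = {'n': '\n', 't': '\t', '"': '"', '\\': '\\'}


def tidy(raw):
    """Join the quoted pieces of a captured group and decode simple escapes."""
    joined = ''.join(QUOTED_PIECE.findall(raw))
    return re.sub(r'\\(.)',
                  lambda m: SIMPLE_ESCAPES.get(m.group(1), m.group(0)),
                  joined)
```

Applied to each group returned by findall(), this yields the message text without the surrounding quotes and with continuation lines joined.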

gialloporpora 07-06-2009 04:32 PM

Re: Help to find a regular expression to parse po file
 
In reply to Hallvard B Furuseth's message:


> I don't know the syntax of a po file, but this works for the
> snippet you posted:
> [...]

Thank you very much, Hallvard, it seems to work. There is a strange
match in the file header, but I can skip the first match.
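
The strange first match is probably the PO header entry, which gettext stores as an entry whose msgid is the empty string "". Rather than dropping the first match by position, it could be filtered by content. A small sketch, assuming the matches are (msgid, msgstr) tuples of raw captures that still carry their quotes; drop_header and the sample pairs are hypothetical.

```python
def drop_header(pairs):
    """Remove entries whose msgid is the empty string "" (the PO header)."""
    return [(msgid, msgstr) for msgid, msgstr in pairs
            if msgid.strip() != '""']


# Hypothetical findall() output: raw captures still carry their quotes.
pairs = [
    ('""', '"Project-Id-Version: demo\\n"'),
    ('"list ubiquity commands"', '"elenco comandi disponibili"'),
]
translations = drop_header(pairs)
```

A multi-line msgid that merely starts with "" followed by continuation lines is kept, because the whole capture is then longer than the two quote characters.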


The po files have this structure:
http://bit.ly/18qbVc

msgid "string to translate"
" second string to match"
" n string to match"
msgstr "translated string"
" second translated string"
" n translated string"
One or more blank lines precede the next group.

In the past I wrote a Python script to parse PO files where msgid
and msgstr are on two consecutive lines, for example:

msgid "string to translate"
msgstr "translated string"

Now the problem is how to also match the (optional) strings between msgid
and msgstr.

Sandro






gialloporpora 07-06-2009 05:42 PM

Re: Help to find a regular expression to parse po file
 
In reply to MRAB's message:

> gialloporpora wrote:
>> [...]
>> regex=re.compile("msgid (.*)\\nmsgstr (.*)\\n\\n\\n",re.DOTALL)
>>
>> Is it possible or not?

> You could try:
>
> regex = re.compile(r'msgid (.*(?:\n".*")*)\nmsgstr (.*(?:\n".*")*)$', re.MULTILINE)
>
> and then, if necessary, tidy what you get.



MRAB, thank you for your help. I tried the code posted by Hallvard
first because I saw it earlier, and it works. Now I'll also check your
suggestion.
Sandro

--
*Pink Floyd – The Great Gig in the Sky* - http://sn.im/kggo7
* FAQ* di /it-alt.comp.software.mozilla/: http://bit.ly/1MZ04d

