
Words to numbers

 
 
william
 
      09-25-2008
I'm writing Perl scripts to extract data from email messages. Here
are two sample .txt files.
ACNI050124_05_04_59.txt

received fifteen thousand dollars from
an unaffiliated third party

Section 27A of the Securities Act of 1933 and Section 21E of the
Securities Exchange Act of 1934,

involve a number of risks
and uncertainties which could cause actual results to differ
materially from those presently anticipated.

ZLDV060318_19_32_11.txt
We have received one hundred thirty five thousand free trading shares
from a
third party not an officer, director or affiliate shareholder for our
services. We intend to
sell all these shares now, which could cause the stock to go down,
resulting in losses for you.
Do your due diligence before you invest.


I want to produce the following output as an Excel table.

filename                  dollars   shares
ACNI050124_05_04_59.txt   15000     -9
ZLDV060318_19_32_11.txt   -9        135000

-9 simply means that no information about shares or dollars was
found in the file.

It seemed like a simple task at first, but once I started writing the
script I realized it is quite complicated. Any suggestions will be
highly appreciated.

William
 
 
 
 
 
Jim Gibson
 
      09-26-2008
In article
<(E-Mail Removed)>,
william <(E-Mail Removed)> wrote:

> I'm writing perl scripts to retrieve data from email messages. Here
> are two .txt files.
> ACNI050124_05_04_59.txt
>
> received fifteen thousand dollars ...
>
> ZLDV060318_19_32_11.txt
> We have received one hundred thirty five thousand ...


>
>
> I want to achieve the following output to an excel table.
>
> filename
> dollars shares
> ACNI050124_05_04_59.txt 15000 -9
> ZLDV060318_19_32_11.txt -9 135000
>
> -9 simply means that we don't find any information related to shares
> or dollars in the file.
>
> It seems to be a simple task at first. But I realize that it is quite
> complicated when I start to write the script. Any suggestions from you
> will be highly appreciated.


It doesn't seem simple at all. You are trying to parse free-form
English written by various people and extract numerical data from
alphabetic number names. My suggestion is to give it up before you
start.

--
Jim Gibson
 
 
 
 
 
Ted Zlatanov
 
      09-26-2008
On Thu, 25 Sep 2008 17:51:48 -0700 Jim Gibson <(E-Mail Removed)> wrote:

JG> In article
JG> <(E-Mail Removed)>,
JG> william <(E-Mail Removed)> wrote:

>> I'm writing perl scripts to retrieve data from email messages. Here
>> are two .txt files.
>> ACNI050124_05_04_59.txt
>>
>> received fifteen thousand dollars ...
>>
>> ZLDV060318_19_32_11.txt
>> We have received one hundred thirty five thousand ...


>> I want to achieve the following output to an excel table.
>>
>> filename
>> dollars shares
>> ACNI050124_05_04_59.txt 15000 -9
>> ZLDV060318_19_32_11.txt -9 135000
>>
>> -9 simply means that we don't find any information related to shares
>> or dollars in the file.


(the comments are for the OP mainly)

Have you considered empty fields instead of special values to denote
absence of value? Specifically, you may need negative numbers for
shares later if you want to indicate buy/sell modes.
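As a rough illustration of the empty-field idea, here is a minimal sketch that writes one tab-separated row per file (a format Excel can generally open) and leaves a field empty when nothing was found. The sub name format_row and the use of undef to mean "not found" are assumptions made for this example, not anything from the OP's script.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated row.
# An undef value means "nothing found" and becomes an empty field
# instead of a sentinel such as -9.
sub format_row {
    my ($file, $dollars, $shares) = @_;
    my @cells = ($file);
    for my $value ($dollars, $shares) {
        push @cells, defined($value) ? $value : '';
    }
    return join "\t", @cells;
}

print join("\t", 'filename', 'dollars', 'shares'), "\n";
print format_row('ACNI050124_05_04_59.txt', 15000, undef), "\n";
print format_row('ZLDV060318_19_32_11.txt', undef, 135000), "\n";
```

With this layout a negative share count stays available for later use, and an empty cell is unambiguous when the file is read back.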

>>
>> It seems to be a simple task at first. But I realize that it is quite
>> complicated when I start to write the script. Any suggestions from you
>> will be highly appreciated.


JG> It doesn't seem simple at all. You are trying to parse free-form
JG> English written by various people and extract numerical data from
JG> alphabetic number names. My suggestion is to give it up before you
JG> start.

It's not impossible, and certainly it's interesting. Perhaps
http://web.media.mit.edu/~hugo/montylingua/ will be useful; it has Java
and Python interfaces and a Perl interface may be doable. At the very
least you can parse the montylingua analyzer output.

Ted
 
 
william
 
      10-12-2008
Thank you all for the suggestions. In the end, I accomplished the
number extraction with a Perl script. I first build a library of
possible misspellings and convert them to the correct spellings. Then
I use Perl to do a pattern search and convert the English number words
to Arabic numerals. Finally, I extract the numbers using a kind of
fuzzy logic. As for the -9: only positive numbers are needed in my
research design, so I use -9 to indicate either a non-positive number
or that no appropriate number could be found.
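
For readers who want to try something similar, the three steps above might look roughly like the sketch below. It is not the script described in this post: the lookup tables, the sample misspellings in %FIX, the sub names words_to_number and extract_amount, and the rule "allow up to three other words between the number and the unit" are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed lookup tables for English number words.
my %SMALL = (
    zero => 0, one => 1, two => 2, three => 3, four => 4, five => 5,
    six => 6, seven => 7, eight => 8, nine => 9, ten => 10,
    eleven => 11, twelve => 12, thirteen => 13, fourteen => 14,
    fifteen => 15, sixteen => 16, seventeen => 17, eighteen => 18,
    nineteen => 19, twenty => 20, thirty => 30, forty => 40,
    fifty => 50, sixty => 60, seventy => 70, eighty => 80, ninety => 90,
);
my %SCALE = (thousand => 1_000, million => 1_000_000, billion => 1_000_000_000);

# Step 1: a tiny misspelling library (example entries only).
my %FIX = (fourty => 'forty', thousnad => 'thousand', hunderd => 'hundred');

# Alternation of every word that may appear in a number phrase.
my $NUM_WORD = join '|', keys %SMALL, keys %SCALE, keys %FIX, 'hundred';

# Step 2: convert a phrase such as "one hundred thirty five thousand"
# to a number. Returns undef (or an empty list) on an unknown word.
sub words_to_number {
    my ($phrase) = @_;
    my ($total, $current) = (0, 0);
    for my $w (split /[\s-]+/, lc $phrase) {
        next if $w eq '' or $w eq 'and';
        $w = $FIX{$w} if exists $FIX{$w};
        if (exists $SMALL{$w}) {
            $current += $SMALL{$w};
        }
        elsif ($w eq 'hundred') {
            $current = ($current || 1) * 100;
        }
        elsif (exists $SCALE{$w}) {
            $total += ($current || 1) * $SCALE{$w};
            $current = 0;
        }
        else {
            return;
        }
    }
    return $total + $current;
}

# Step 3: find a run of number words followed, within at most three
# other words, by the unit ("dollars" or "shares"). The trailing \b
# keeps "six" from matching inside "sixty".
sub extract_amount {
    my ($text, $unit) = @_;
    if ($text =~ /\b((?:(?:$NUM_WORD)\b[\s-]*(?:and\b\s*)?)+)(?:\w+\s+){0,3}?\Q$unit\E\b/i) {
        return words_to_number($1);
    }
    return;
}

print extract_amount("received fifteen thousand dollars from", "dollars"), "\n";   # 15000
print extract_amount("received one hundred thirty five thousand free trading shares", "shares"), "\n";   # 135000
```

A pattern this loose will also pick up phrases like "one of the shares", so real messages would need extra checks on top of it.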

Using Perl to do natural language processing is really very
interesting. Thank you all again for your input.

William
 
 
sln@netherlands.com
 
      10-12-2008
On Sun, 12 Oct 2008 09:51:57 -0700 (PDT), william <(E-Mail Removed)> wrote:

>Thank you all for the suggestions. Nevertheless, I've accomplished the
>number extraction with Perl script. I first build a library of
>possible misspellings and convert them to correct ones. Then I use
>perl to do a certain pattern search and convert the english numbers to
>arabic numbers. Finally I can extract the numbers using kind of fuzzy
>logic. As to the -9, because only positive numbers are needed in my
>research design. So I use -9 to indicate all non-positive numbers or
>cannot find the appropriate number.
>
>Using perl to do natural language processing is really very
>interesting. Thank you all again for you inputs.
>
>William


-9-1-1

 
 
 
 