Extract Numeric values from string

 
 
Vishal G
Guest
Posts: n/a
 
      09-11-2008
Hi there,

I have searched the whole group looking for a solution to my problem.

Actually, I don't understand Perl regular expressions properly yet...
working on it...

Here is the problem.

I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values

I would like to extract the numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

for example:

offset = 3; length = 4;

so the result string should be $str = "454 67 59 298928";

Thanks in advance

Vishal
 
 
 
 
 
Tomislav Novak
Guest
Posts: n/a
 
      09-11-2008
Vishal G <(E-Mail Removed)> writes:

> I have a string which contains numbers:
>
> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
>
> I would like to extract the numeric values from a specific position up to
> some other position using a regular expression. I don't want to use split
> because it uses a lot of memory.
>
> for example:
>
> offset = 3; length = 4;
>
> so the result string should be $str = "454 67 59 298928";


Well, you could always do something like:

my $regex =
qr/
^
(?:\d+\s*) {$offset}
((?:\d+\s*){$length})
/x;

my ($result) = $str =~ /$regex/;


--
T.
 
 
 
 
 
Ben Morrow
Guest
Posts: n/a
 
      09-11-2008

Quoth Tomislav Novak <(E-Mail Removed)>:
> Vishal G <(E-Mail Removed)> writes:
>
> > I have a string which contains numbers:
> >
> > $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
> >
> > I would like to extract the numeric values from a specific position up to
> > some other position using a regular expression. I don't want to use split
> > because it uses a lot of memory.
> >
> > for example:
> >
> > offset = 3; length = 4;
> >
> > so the result string should be $str = "454 67 59 298928";

>
> Well, you could always do something like:
>
> my $regex =
> qr/
> ^
> (?:\d+\s*) {$offset}
> ((?:\d+\s*){$length})
> /x;


The string apparently contains 112M values. {} quantifiers in Perl cannot
be larger than 32766.

I would suggest running through the string using substr to check one
character at a time. Count the number of spaces, and collect up the
digits as needed. This will be slow, but will avoid copying the string.
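
A minimal sketch of that character-by-character walk, for illustration only. The helper name extract_by_scan is made up here, the offset is assumed to count from 1 as in the original example, and the input is assumed to contain nothing but digits and spaces. The string is passed by reference so it is not copied.

```perl
use strict;
use warnings;

# Hypothetical helper: return $length numbers, starting at the 1-based
# position $offset, by walking the string one character at a time.
sub extract_by_scan {
    my ($str_ref, $offset, $length) = @_;
    return '' if $length < 1;

    my $last      = $offset + $length - 1;  # index of the last wanted number
    my $field     = 0;                      # index of the number we are in
    my $in_number = 0;                      # true while inside a run of digits
    my $out       = '';
    my $len       = length $$str_ref;

    for (my $i = 0; $i < $len; $i++) {
        my $ch = substr($$str_ref, $i, 1);
        if ($ch eq ' ') {                   # separator: leave the current number
            $in_number = 0;
            next;
        }
        if (!$in_number) {                  # first digit of a new number
            $in_number = 1;
            $field++;
            last if $field > $last;         # past the wanted range: stop early
            $out .= ' ' if $field > $offset;
        }
        $out .= $ch if $field >= $offset;   # collect digits inside the range
    }
    return $out;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_by_scan(\$str, 3, 4), "\n";   # 454 67 59 298928
```

Only the result string grows; no list of all the values is ever built.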

In general, perl has a policy of trading memory for speed. If you are
short of memory, I would suggest using a different language with more
appropriate tradeoffs.

Ben

--
Raise your hand if you're invulnerable.
[(E-Mail Removed)]
 
 
Dr.Ruud
Guest
Posts: n/a
 
      09-11-2008
Vishal G schreef:

> I have a string which contains numbers:
>
> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
>
> I would like to extract the numeric values from a specific position up to
> some other position using a regular expression. I don't want to use split
> because it uses a lot of memory.
>
> for example:
>
> offset = 3; length = 4;
>
> so the result string should be $str = "454 67 59 298928";



Maybe you are looking for something like this:

$ perl -Mstrict -Mwarnings -le '
print scalar localtime;
my $s; $s .= "$_ " for 1..10_000_000;
print scalar localtime;

my $offset = 9_999_903;
my $count = 4;

while ($s =~ m/([0-9]+)/g) {
    $count or last;
    --$offset > 0 and next;
    $count-- and print $1;
}
print scalar localtime;
'
Thu Sep 11 14:42:40 2008
Thu Sep 11 14:42:47 2008
9999903
9999904
9999905
9999906
Thu Sep 11 14:42:53 2008


--
Affijn, Ruud

"Gewoon is een tijger."

 
 
cartercc
Guest
Posts: n/a
 
      09-11-2008
On Sep 11, 4:46 am, Vishal G <(E-Mail Removed)> wrote:
> I have a string which contains numbers:
>
> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
>
> I would like to extract the numeric values from a specific position up to
> some other position using a regular expression. I don't want to use split
> because it uses a lot of memory.
>
> for example:
>
> offset = 3; length = 4;
>
> so the result string should be $str = "454 67 59 298928";


Is your string already in memory, or does it come from storage? If the
latter, you might consider replacing the spaces with new lines and
then using a counter to iterate through the file with something like
this:

while (<INFILE>)
{   $counter++;                                              # 1-based line number
    if ($counter < $offset) { next; }                        # not yet at the first wanted value
    elsif ($counter < $offset + $length) { print OUTFILE; }  # inside the wanted range
    else { last; }                                           # past the range: stop reading
}

If your string is already in memory, I would use the C trick of getc()
and test each character, again using a counter for the white space.
Using inline C would probably be faster and you could discard all the
characters you don't need.

int c; /* int, not char, so that EOF can be told apart from data */
while ((c = getc(stdin)) != EOF)
{ //test c, count whitespace, and save what you need
}

CC
 
 
Dr.Ruud
Guest
Posts: n/a
 
      09-11-2008
Dr.Ruud schreef:
> Vishal G:


>> I have a string which contains numbers:
>>
>> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
>>
>> I would like to extract the numeric values from a specific position up to
>> some other position using a regular expression. I don't want to use split
>> because it uses a lot of memory.
>>
>> for example:
>>
>> offset = 3; length = 4;
>>
>> so the result string should be $str = "454 67 59 298928";

>
>
> Maybe you are looking for something like this:
>
> $ perl -Mstrict -Mwarnings -le '
> print scalar localtime;
> my $s; $s .= "$_ " for 1..10_000_000;
> print scalar localtime;
>
> my $offset = 9_999_903;
> my $count = 4;
>
> while ($s =~ m/([0-9]+)/g) {
>     $count or last;
>     --$offset > 0 and next;
>     $count-- and print $1;
> }
> print scalar localtime;
> '
> Thu Sep 11 14:42:40 2008
> Thu Sep 11 14:42:47 2008
> 9999903
> 9999904
> 9999905
> 9999906
> Thu Sep 11 14:42:53 2008


Which means that the while(regexp) skips about 2 million numbers per
second.
So with $offset = 100_000_000 it may take about a minute.

--
Affijn, Ruud

"Gewoon is een tijger."

 
 
Ben Morrow
Guest
Posts: n/a
 
      09-11-2008

Quoth cartercc <(E-Mail Removed)>:
> On Sep 11, 4:46 am, Vishal G <(E-Mail Removed)> wrote:
> > I have a string which contains numbers:
> >
> > $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
> >
> > I would like to extract the numeric values from a specific position up to
> > some other position using a regular expression. I don't want to use split
> > because it uses a lot of memory.
> >
> > for example:
> >
> > offset = 3; length = 4;
> >
> > so the result string should be $str = "454 67 59 298928";

>
> Is your string already in memory, or does it come from storage? If the
> latter, you might consider replacing the spaces with new lines and
> then using a counter to iterate through the file with something like
> this:
>
> while (<INFILE>)


No need to replace the spaces. $/ = " " will work just fine.
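
A rough sketch of the $/ = " " idea, under a few assumptions: the helper name extract_from_handle is invented for this example, the offset counts from 1 as in the original question, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read space-separated numbers from filehandle $fh and
# return $length of them, starting at the 1-based position $offset.
sub extract_from_handle {
    my ($fh, $offset, $length) = @_;
    return '' if $length < 1;

    local $/ = ' ';                 # a "line" now ends at each space
    my @picked;
    my $counter = 0;
    while (my $field = <$fh>) {
        $field =~ s/\s+\z//;        # drop the trailing separator or newline
        next unless length $field;  # skip empty records from doubled spaces
        $counter++;
        next if $counter < $offset; # not yet at the first wanted value
        push @picked, $field;
        last if @picked == $length; # stop reading as soon as we have enough
    }
    return join ' ', @picked;
}

# Usage: an in-memory handle is used here in place of a file on disk.
my $data = "30 574 454 67 59 298928 74 4875 8 934\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
print extract_from_handle($fh, 3, 4), "\n";   # 454 67 59 298928
close $fh;
```

Only one value is held in memory at a time, plus the values that are kept.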

<snip>
> If your string is already in memory, I would use the C trick of getc()


getc reads from a file, not from memory.

Ben

--
You poor take courage, you rich take care:
The Earth was made a common treasury for everyone to share
All things in common, all people one.
'We come in peace'---the order came to cut them down. [(E-Mail Removed)]
 
 
Leon Timmermans
Guest
Posts: n/a
 
      09-11-2008
On Thu, 11 Sep 2008 01:46:08 -0700, Vishal G wrote:

> Hi there,
>
> I have searched the whole group looking for a solution to my problem.
>
> Actually, I don't understand Perl regular expressions properly yet...
> working on it...
>
> Here is the problem.
>
> I have a string which contains numbers:
>
> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
>
> I would like to extract the numeric values from a specific position up to
> some other position using a regular expression. I don't want to use split
> because it uses a lot of memory.
>
> for example:
>
> offset = 3; length = 4;
>
> so the result string should be $str = "454 67 59 298928";
>
> Thanks in advance
>
> Vishal


Why do you store that in a free-format string? I can think of a number
of better ways to store it. You could store it in a binary array (like in
C) and then access it using vec(). Tie::Array::Packed may also be an
interesting approach. By storing your data smarter, you can make an O(N)
algorithm O(1).
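
As a small illustration of the vec() suggestion, here is a sketch that assumes every value fits in an unsigned 32-bit integer. The helper name slice_packed is invented for this example, and the offset again counts from 1.

```perl
use strict;
use warnings;

# One-time conversion: store each number in a fixed-width 32-bit slot.
# Assumption: every value fits in an unsigned 32-bit integer.
my $str    = "30 574 454 67 59 298928 74 4875 8 934";
my $packed = '';
my $i      = 0;
while ($str =~ /([0-9]+)/g) {
    vec($packed, $i, 32) = $1;
    $i++;
}

# Hypothetical helper: each element is found by arithmetic on its index,
# so no scanning of the data is needed.
sub slice_packed {
    my ($buf_ref, $offset, $length) = @_;
    my $count = length($$buf_ref) / 4;       # 4 bytes per stored number
    my @out;
    for my $n ($offset - 1 .. $offset + $length - 2) {
        last if $n >= $count;                # do not read past the end
        push @out, vec($$buf_ref, $n, 32);
    }
    return join ' ', @out;
}

print slice_packed(\$packed, 3, 4), "\n";    # 454 67 59 298928
```

At 4 bytes per value, 112 million numbers would need about 448 MB in this layout, so it only helps if that fits in memory or the data is kept in a file and read with seek.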

Regards,

Leon Timmermans
 
 
Ted Zlatanov
Guest
Posts: n/a
 
      09-11-2008
On Thu, 11 Sep 2008 01:46:08 -0700 (PDT) Vishal G <(E-Mail Removed)> wrote:

VG> Here is the problem..

VG> I have a string which contains numbers:

VG> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values

VG> I would like to extract the numeric values from a specific position up to
VG> some other position using a regular expression. I don't want to use split
VG> because it uses a lot of memory.

This works for me. To avoid dealing with edge cases, I surround the
input with spaces. The assumption is that only digits and spaces are in
your data; the algorithm uses that to find the next space or the next
digit. Note also that slow_extract() is there as a reference to check
the algorithm works OK. It's very possible it has bugs: I wrote it to
show you the general idea of iterating through the string, and tests are
what you see in __DATA__ which is minimal.

You should consider keeping large data sets like this in a database,
e.g. SQLite. Then operating on it from Perl or other languages is much
easier, especially if you index your columns appropriately.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;
use List::Util qw/min/;

my $str = <DATA>; # we keep it global so it's not passed around
chomp $str;
$str = " $str ";


while (<DATA>)
{
    my ($pos, $offset) = m/(\d+)\D+(\d+)/;
    my $slow_result = slow_extract($pos, $offset);
    my $fast_result = fast_extract($pos, $offset);
    my $ok = $slow_result eq $fast_result;
    print "position $pos, offset $offset: $slow_result / $fast_result / OK=$ok\n";
}

sub slow_extract
{
    my $logical_pos = shift @_;
    my $n = shift @_;

    my @numbers = split ' ', $str;
    return join ' ', grep { defined } @numbers[$logical_pos .. $logical_pos+$n-1];
}

sub fast_get_number
{
    my $start_pos = shift @_;

    my @matches = grep { defined && $_ > 0 } map { index($str, $_, $start_pos) } 0..9;

    return unless scalar @matches;

    my $start = min(@matches);
    my $end = index($str, ' ', $start);
    return ($end, substr($str, $start, $end-$start));
}

sub fast_extract
{
    my $logical_pos = shift @_;
    my $n = shift @_;

    my $at = 0;
    my $current_logical_pos = 0;

    my @numbers;
    while (1)
    {
        my @next = fast_get_number($at);
        print Dumper \@next;
        last unless scalar @next;
        last if $next[0] < 0;
        if ($current_logical_pos >= $logical_pos)
        {
            push @numbers, $next[1];
        }
        $at = $next[0];
        last if scalar @numbers == $n;
        $current_logical_pos++;
    }

    return join ' ', @numbers;
}

__DATA__
93430 574 454 67 59 298928 74 4875 8 93430
3 4
5 6
7 8
10 2
 
 
cartercc
Guest
Posts: n/a
 
      09-11-2008
This is why I read this group: I am always learning things at the (small)
cost of exhibiting my own ignorance. The depth of knowledge that some
people have always amazes me, and it is a little depressing to see my
own lack of knowledge.

I have several friends who are medical doctors, and know several of
their children who are in various stages of the medical education
process, and I've always liked that approach: two years in the
classroom and four (or more) in the field. In a job you get stuck in a
rut where you might have the same experience thousands of times,
unlike a forum like c.l.p.m. where you can broaden your knowledge by
way of specific, limited example.

All this as a rather wordy 'Thanks'.

CC

On Sep 11, 9:57 am, Ben Morrow <(E-Mail Removed)> wrote:
> Quoth cartercc <(E-Mail Removed)>:
>
> > On Sep 11, 4:46 am, Vishal G <(E-Mail Removed)> wrote:
> > > I have a string which contains numbers:
> > >
> > > $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values
> > >
> > > I would like to extract the numeric values from a specific position up to
> > > some other position using a regular expression. I don't want to use split
> > > because it uses a lot of memory.
> > >
> > > for example:
> > >
> > > offset = 3; length = 4;
> > >
> > > so the result string should be $str = "454 67 59 298928";
> >
> > Is your string already in memory, or does it come from storage? If the
> > latter, you might consider replacing the spaces with new lines and
> > then using a counter to iterate through the file with something like
> > this:
> >
> > while (<INFILE>)
>
> No need to replace the spaces. $/ = " " will work just fine.
>
> <snip>
>
> > If your string is already in memory, I would use the C trick of getc()
>
> getc reads from a file, not from memory.
>
> Ben
>
> --
> You poor take courage, you rich take care:
> The Earth was made a common treasury for everyone to share
> All things in common, all people one.
> 'We come in peace'---the order came to cut them down. [(E-Mail Removed)]


 