Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > An odd sort requirement - data munging

Reply
Thread Tools

An odd sort requirement - data munging

 
 
cartercc
Guest
Posts: n/a
 
      11-04-2008
I have a series of date related values, such as 01/2007, 02/2007,
03/2008, 04/2006, etc. These values represent a column in various
files that range from several dozen rows deep to almost 1M rows deep.
My job is to create reports from a collection of these types of files.

I create a number of refs to hashes that have the general appearance
of this:
$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

I write to an outfile generally like this:
foreach $k1 (sort keys %h) {
foreach $k2 (sort keys %{$h{$k1}}) (
print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
\n";
}
}

When I run this, it works perfectly, sorting the date values into
perfect numerical order. When something's so perfect, you know it's
wrong!

Here's the problem: the ordering of the dates isn't numerical, the
proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
03/2006, 04/2006 ... 01/2007, 02/2007
03/2007, 04/2007 ... 01/2008, 02/2008

Here's another view of the problem - when I print data for a year, the
ordering is:
03/07, 04/07, 05/07 ... 01/07, 02/07

What I would like to do is overload the sort operator (call it
'sort_y') to sort in this non-numerical order. Can this be done? Can
it be done perhaps in C and compiled to run in Perl? Can it be done
algorithmetically by passing sort a function of some kind? I've tried
this, and the logic is very clumsy and full of stupid relational
operators and elsifs.

The former solution was to copy the data into Excel and manually cut
and paste the columns in the correct order. Some of these reports are
enormous (25 cols by 2500 rows) and I don't want to do this. This took
a lot of time and was very much error prone.

Thanks, CC.
 
Reply With Quote
 
 
 
 
Charlton Wilbur
Guest
Posts: n/a
 
      11-04-2008
>>>>> "cc" == cartercc <(E-Mail Removed)> writes:

cc> What I would like to do is overload the sort operator (call it
cc> 'sort_y') to sort in this non-numerical order. Can this be done?

perldoc -f sort, and notice that sort takes an optional code block or
function name.

Charlton


--
Charlton Wilbur
http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Reply With Quote
 
 
 
 
Uri Guttman
Guest
Posts: n/a
 
      11-04-2008
>>>>> "c" == cartercc <(E-Mail Removed)> writes:

c> I have a series of date related values, such as 01/2007, 02/2007,
c> 03/2008, 04/2006, etc. These values represent a column in various
c> files that range from several dozen rows deep to almost 1M rows deep.
c> My job is to create reports from a collection of these types of files.

c> I create a number of refs to hashes that have the general appearance
c> of this:
c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

you don't need to build up such a hash tree to sort this data. it will
slow you down. you can sort directly on the fields but we don't know the
original record format (a single long string?).

c> I write to an outfile generally like this:
c> foreach $k1 (sort keys %h) {
c> foreach $k2 (sort keys %{$h{$k1}}) (

those are string comparisons (the default for sort).

c> When I run this, it works perfectly, sorting the date values into
c> perfect numerical order. When something's so perfect, you know it's
c> wrong!

you say numerical order (but use string compares as i said above)

c> Here's the problem: the ordering of the dates isn't numerical, the
c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
c> 03/2006, 04/2006 ... 01/2007, 02/2007
c> 03/2007, 04/2007 ... 01/2008, 02/2008


c> Here's another view of the problem - when I print data for a year, the
c> ordering is:
c> 03/07, 04/07, 05/07 ... 01/07, 02/07

can you show the code for this sort?

c> The former solution was to copy the data into Excel and manually cut
c> and paste the columns in the correct order. Some of these reports are
c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
c> a lot of time and was very much error prone.

25 cols x 2500 rows is not enormous for perl.

try using Sort::Maker for this. you just define how you want each key
extracted (using a regex or substr or any code) from your record and how
to sort it (numeric/string, up/down). in your case i would say use a
string sort since you have padded numbers. then make sure you sort the
keys in the order you want - year first, and then month. you will get a
clean and fast sort without all of your extra code and not needing to
use external programs.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
Reply With Quote
 
cartercc
Guest
Posts: n/a
 
      11-04-2008
Thanks, Uri,

On Nov 4, 11:34*am, Uri Guttman <(E-Mail Removed)> wrote:
> >>>>> "c" == cartercc *<(E-Mail Removed)> writes:

>
> * c> I have a series of date related values, such as 01/2007, 02/2007,
> * c> 03/2008, 04/2006, etc. These values represent a column in various
> * c> files that range from several dozen rows deep to almost 1M rows deep.
> * c> My job is to create reports from a collection of these types of files.
>
> * c> I create a number of refs to hashes that have the general appearance
> * c> of this:
> * c> $h{$k1}{$k2}{$k3} => data (generally but not always a simple count).
>
> you don't need to build up such a hash tree to sort this data. it will
> slow you down. you can sort directly on the fields but we don't know the
> original record format (a single long string?).


Actually, this is a general solution to a number of different
problems. The keys could represent a number of different types, such
as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
people names, and so on. The common factor is that every type of field
is both unique and sortable, so I can use the same data structure
regardless of the type of data.

With particular reference to this sort problem, it's the only one type
that can't be sorted normally, which is the out-of-order date types.

> * c> I write to an outfile generally like this:
> * c> foreach $k1 (sort keys %h) {
> * c> * foreach $k2 (sort keys %{$h{$k1}}) (
>
> those are string comparisons (the default for sort).


You are right. Actually, isn't it an ASCII sort? When I sort character
values, case makes a difference.

> * c> When I run this, it works perfectly, sorting the date values into
> * c> perfect numerical order. When something's so perfect, you know it's
> * c> wrong!
>
> you say numerical order (but use string compares as i said above)


True -- excuse my casualness.

> * c> Here's the problem: the ordering of the dates isn't numerical, the
> * c> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
> * c> 03/2006, 04/2006 ... 01/2007, 02/2007
> * c> 03/2007, 04/2007 ... 01/2008, 02/2008
>
> * c> Here's another view of the problem - when I print data for a year,the
> * c> ordering is:
> * c> 03/07, 04/07, 05/07 ... 01/07, 02/07
>
> can you show the code for this sort?


There is no code other than that shown above. I have the data in a
hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
What I want it to print is "3 4 5 1 2". And, I want it to print this
only for this kind of string, not for other strings that might contain
numeric characters, such as zip codes, area codes, ID numbers, etc.

> * c> The former solution was to copy the data into Excel and manually cut
> * c> and paste the columns in the correct order. Some of these reports are
> * c> enormous (25 cols by 2500 rows) and I don't want to do this. This took
> * c> a lot of time and was very much error prone.
>
> 25 cols x 2500 rows is not enormous for perl.


Right. However, it is for manual processing. This is why I want a
scripted solution. My scripted solution is about 95% complete, and if
I could solve this problem it would be close to 100%.

> try using Sort::Maker for this. you just define how you want each key
> extracted (using a regex or substr or any code) from your record and how
> to sort it (numeric/string, up/down). in your case i would say use a
> string sort since you have padded numbers. then make sure you sort the
> keys in the order you want - year first, and then month. you will get a
> clean and fast sort without all of your extra code and not needing to
> use external programs.


Thanks, I'll look at this.

And in replying to your message, an idea occurred to me.

The central component in all these strings is the slash: /. If I write
a function that takes a key as the parameter such as "2006/03" and
returns a string such as "2006/1/03", then I could control exactly how
the sort would work. When I was finished with the report, I could run
it through another function that would substitute "/1/" with just "/"
and that would solve the problem.

Thanks, CC.
 
Reply With Quote
 
Jürgen Exner
Guest
Posts: n/a
 
      11-04-2008
cartercc <(E-Mail Removed)> wrote:
>I have a series of date related values, such as 01/2007, 02/2007,
>03/2008, 04/2006, etc. These values represent a column in various
>files that range from several dozen rows deep to almost 1M rows deep.
>My job is to create reports from a collection of these types of files.
>
>I create a number of refs to hashes that have the general appearance
>of this:
>$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).


After scratching my head for some time I am guessing that probably $k1
contains the month and $k2 contains the year.
If you had provided a minimal, self-contained script as requested in the
posting guidelines it would have been much easier to identify your data
structure.

>I write to an outfile generally like this:
>foreach $k1 (sort keys %h) {
> foreach $k2 (sort keys %{$h{$k1}}) (
> print OUTFILE " <k1>$k1</k1> <k2>$k2</k2> <val>$h{$k1}{$k2}</val>
>\n";
> }
>}
>Here's the problem: the ordering of the dates isn't numerical, the
>proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
>03/2006, 04/2006 ... 01/2007, 02/2007
>03/2007, 04/2007 ... 01/2008, 02/2008


Well, that is what you are asking for. Assuming $k1 and $k2 are month
and year respectively then you are sorting your data by month and within
each month by year.
You could just reverse those two, sorting by year first and then within
each year by month.

Another solution would be to write a custom compare function. You will
have to pass the pair of year and month for each of $a and $b as those
are actually the number you want to sort. And then once you get that
sorted list just loop through it and print the corresponding values from
the data set.
I'd be interested in coding it but I'm not good enough to do it without
any testing and since you didn't provide any self-contained program that
could be used a test bed that's not an option.

Yet another solution would be to change your data structure. Your HoHoA
has the granularity of the time spans reversed. Had you put year as the
top value, then your algorithm above would have worked naturally.

>What I would like to do is overload the sort operator (call it
>'sort_y') to sort in this non-numerical order. Can this be done?


Why would you want to do that? Why don't you simply write your own
custom compare function and use that instead of the default <=>?

>Can
>it be done perhaps in C and compiled to run in Perl? Can it be done
>algorithmetically by passing sort a function of some kind?


Dah, did you even read the man page for sort()? That's what the first
argument of sort() is all about!

jue
 
Reply With Quote
 
cartercc
Guest
Posts: n/a
 
      11-04-2008
On Nov 4, 12:23*pm, Jürgen Exner <(E-Mail Removed)> wrote:
> cartercc <(E-Mail Removed)> wrote:
> >I have a series of date related values, such as 01/2007, 02/2007,
> >03/2008, 04/2006, etc. These values represent a column in various
> >files that range from several dozen rows deep to almost 1M rows deep.
> >My job is to create reports from a collection of these types of files.

>
> >I create a number of refs to hashes that have the general appearance
> >of this:
> >$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).

>
> After scratching my head for some time I am guessing that probably $k1
> contains the month and $k2 contains the year.
> If you had provided a minimal, self-contained script as requested in the
> posting guidelines it would have been much easier to identify your data
> structure.


No. $k1, etc., contains ANYTHING that's sortable and unique. It can
contain names, like "Exner, J", "New York," or "Baltimore" or numbers
(telephone, area code, ID numbers) or other values. The contents of
the keys are not relevant to the code or to the question.

> >Here's the problem: the ordering of the dates isn't numerical, the
> >proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
> >03/2006, 04/2006 ... 01/2007, 02/2007
> >03/2007, 04/2007 ... 01/2008, 02/2008

>
> Well, that is what you are asking for. Assuming $k1 and $k2 are month
> and year respectively then you are sorting your data by month and within
> each month by year.
> You could just reverse those two, sorting by year first and then within
> each year by month.


Actually, no. Here is a sample of a data file:
"07/T1","A27","117"
"07/T1","D01","3"
"07/T1","EA27","30"
"07/T1","EF20","52"
....
"08/T5","V26","17"
"08/T5","W03","11"
"08/T5","W04","4"
"08/T5","W05","1"

Hee is a sample of another data file:
1222413 G07 07/T2 07/RFA 07/T2
1247990 FH1 08/T4 08/RSP 08/T4
1094529 EARMY 05/T4 05/T4 05/T5 07/T1 07/RFA 07/T2
1247991 V24 08/T4 08/RSP 08/T4

As you can see, the 'date' values are unary values and I don't have
any real need to split them.

> Another solution would be to write a custom compare function. You will
> have to pass the pair of year and month for each of $a and $b as those
> are actually the number you want to sort. And then once you get that
> sorted list just loop through it and print the corresponding values from
> the data set.
> I'd be interested in coding it but I'm not good enough to do it without
> any testing and since you didn't provide any self-contained program that
> could be used a test bed that's not an option.


I just posted a half-assed idea of a solution that would require
processing the file two more times to convert and unconvert this field
to something that would sort naturally. I would like to see a custom
compare function, and if you want, I can send you sample data files
(they contain no confidential or sensitive information) and a script
that I use to product an OUTFILE. I've spend a non-trivial amount of
time thinking about it, and I can't see a solution.

> Yet another solution would be to change your data structure. Your HoHoA
> has the granularity of the time spans reversed. Had you put year as the
> top value, then your algorithm above would have worked naturally.


No, because of this:
Calendar year -
07/01, 07/02, 07/03, ...
Reporting year -
07/03, 07/04, ... 07/01, 07/02
Academic year -
07/01, 07/02 ... 08/04, 08/05

Please note that the Academic year crosses year boundries, i.e., from
2007 to 2008, while the Reporting year crosses month boundries, i.e.,
'03' starts the series and '01','02' ends the series.


> Dah, did you even read the man page for sort()? That's what the first
> argument of sort() is all about!


Actually, no, I didn't. I know that sort can take a function as an
argument, but I was focused on the algorithm, not the implementation.
But I'm headed that way now.

CC
 
Reply With Quote
 
Uri Guttman
Guest
Posts: n/a
 
      11-04-2008
>>>>> "c" == cartercc <(E-Mail Removed)> writes:

>> you don't need to build up such a hash tree to sort this data. it will
>> slow you down. you can sort directly on the fields but we don't know the
>> original record format (a single long string?).


c> Actually, this is a general solution to a number of different
c> problems. The keys could represent a number of different types, such
c> as states, cities, course names (e.g., ART 109, BIO 225, MAT 4556),
c> people names, and so on. The common factor is that every type of field
c> is both unique and sortable, so I can use the same data structure
c> regardless of the type of data.

but then you still have issues with key processing (getting the order
you want) and sort key ordering. this is why my module can help.

>> those are string comparisons (the default for sort).


c> You are right. Actually, isn't it an ASCII sort? When I sort character
c> values, case makes a difference.

yes, sort use asciibetical sorting. you can fold case by extracting
string keys and upper/lower casing them all.

>> can you show the code for this sort?


c> There is no code other than that shown above. I have the data in a
c> hash of hash refs. What I meant was -- the INSTITUTIONAL ordering is
c> different than the NUMERIC ordering. What it prints is, "1 2 3 4 5".
c> What I want it to print is "3 4 5 1 2". And, I want it to print this
c> only for this kind of string, not for other strings that might contain
c> numeric characters, such as zip codes, area codes, ID numbers, etc.

oh, i didn't get you wanted an internal wacky ordering. the easiest
thing to do is to extract your keys and map them through a hash which
converts them to an order value that you want. something like this
(highly untested):

# based on the order you show above.

my %sort_ordering = (
1 => 3,
2 => 4,
....

) ;

and a basic sort would be like this:

@sorted = sort { $sort_ordering{$a} cmp $sort_ordering{$b} } @input ;

switch cmp for <=> as desired.

>> try using Sort::Maker for this. you just define how you want each key
>> extracted (using a regex or substr or any code) from your record and how
>> to sort it (numeric/string, up/down). in your case i would say use a
>> string sort since you have padded numbers. then make sure you sort the
>> keys in the order you want - year first, and then month. you will get a
>> clean and fast sort without all of your extra code and not needing to
>> use external programs.


c> Thanks, I'll look at this.

sort::maker will simplify key extraction and custom ordering. you would
build a ordering hash as above but put that conversion in the key
extract code. something like this:

my $sorter = make_sorter( 'GRT',
string => '$sort_ordering{get_the_key($_)}'
) ;

get_the_key is either direct code or a call to (ahem!) get the key from
$_.

c> The central component in all these strings is the slash: /. If I write
c> a function that takes a key as the parameter such as "2006/03" and
c> returns a string such as "2006/1/03", then I could control exactly how
c> the sort would work. When I was finished with the report, I could run
c> it through another function that would substitute "/1/" with just "/"
c> and that would solve the problem.

that is the conversion hash i spoke of. it can be a sub too if the
conversion needs more work or such. actually with sort::maker you would
sort on multiple keys each having its own sort ordering hash as
needed. then you don't need an extraction sub which is slower.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
Reply With Quote
 
Uri Guttman
Guest
Posts: n/a
 
      11-04-2008
>>>>> "GJ" == Glenn Jackman <(E-Mail Removed)> writes:

GJ> At 2008-11-04 09:07AM, "cartercc" wrote:
>> Here's the problem: the ordering of the dates isn't numerical, the
>> proper order is- 03/2005, 04/2005 ... 01/2006, 02/2006
>> 03/2006, 04/2006 ... 01/2007, 02/2007
>> 03/2007, 04/2007 ... 01/2008, 02/2008


GJ> An example of the Schwartzian Transform (look it up in wikipedia):

GJ> my @dates = qw(
GJ> 03/2005 04/2005 01/2006 02/2006 03/2006 04/2006
GJ> 01/2007 02/2007 03/2007 04/2007 01/2008 02/200;
GJ> # or @dates = keys %somehash

GJ> my @sorted =
GJ> map { $_->[0] }
GJ> sort { $a->[1] <=> $b->[1] }
GJ> map { [$_, substr($_,3,4) . substr($_,0,2)] } @dates;


and sort::maker can generate an ST and spit out the code for you or you
can use the code ref it creates. it also removes redundancy (duplicate
key code and hides all the sort/map/sort syntax. and it can generate GRT
sorts which are much faster than the ST.

another trick is to pregenerate sorters for various key sets and print
them out off line. then paste in those sorters (name the subs) and call
the desired one based on the data set being sorted. then you don't need
to hand code all the sort variants of key sets.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Odd requirement behavior between two environments Ian Whitney Ruby 2 07-16-2007 07:05 PM
Data munging, the choices available. irishhacker@gmail.com Perl Misc 0 04-22-2007 01:55 AM
Munging URLs to store data jmensch@shaw.ca ASP .Net 2 06-10-2005 11:06 AM
Ado sort error-Ado Sort -Relate, Compute By, or Sort operations cannot be done on column(s) whose key length is unknown or exceeds 10 KB. Navin ASP General 1 09-09-2003 07:16 AM
Cookie munging problem ASP .Net 5 08-02-2003 06:07 PM



Advertisments