Velocity Reviews - Computer Hardware Reviews

Parse using Text::CSV into Hash

 
 
batrams
11-10-2011
Hi guys - this hunk of code seems to function as intended. After I get
code to work I'm often curious to run it by you gurus to see how you
might have done it. Please don't laugh too hard but any comments are
welcome.




open (TEXTFILE, "USES.txt") || die "not found\n";
my $csv = Text::CSV->new();

while ( <TEXTFILE> ) {
$csv->parse($_);
push(@H_USES, [$csv->fields]);
}
close(TEXTFILE);

# load up the hash
for my $i ( 0 .. $#H_USES ) {
$H_USES{$H_USES[$i][0]} = $H_USES[$i];
}




DB
 
Jim Gibson
11-10-2011
batrams wrote:

> Hi guys - this hunk of code seems to function as intended. After I get
> code to work I'm often curious to run it by you gurus to see how you
> might have done it. Please don't laugh too hard but any comments are
> welcome.
>
>
>
>
> open (TEXTFILE, "USES.txt") || die "not found\n";


Lexical variables with lower-case names are better than package globals
in uppercase. The 'or' operator has better precedence for this use than
'||'. It is always good to include the open error ($!) in the die
message. I would usually put the file name in a scalar variable and
include that in the error message as well.

See the examples in the documentation for Text::CSV for a better style
of programming to read in a text file.

> my $csv = Text::CSV->new();
>
> while ( <TEXTFILE> ) {
> $csv->parse($_);
> push(@H_USES, [$csv->fields]);


I would use lower-case variable names, and I always include the
parentheses for function/subroutine/method calls ($csv->fields()), even
if they are not required, as it gives me a better clue of what is going
on.

You should always 'use strict;' and declare your variables.

> }
> close(TEXTFILE);
>
> # load up the hash
> for my $i ( 0 .. $#H_USES ) {
> $H_USES{$H_USES[$i][0]} = $H_USES[$i];
> }


Loops should be indented, and the integer index is not needed:

for my $row ( @H_USES ) {
    $H_USES{$row->[0]} = $row;
}

But I would populate the hash in the while loop, rather than in a
separate loop afterwards.
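A minimal sketch of that combined loop. The input lines are inlined here so the sketch is self-contained (the original reads USES.txt), and the CPAN module Text::CSV from the original post is assumed to be installed:

```perl
use strict;
use warnings;
use Text::CSV;   # CPAN module used in the original post

my $csv  = Text::CSV->new();
my @rows = ("a1,b1,c1\n", "a2,b2,c2\n");   # stand-in for the file's lines

my %uses;
for my $line (@rows) {
    $csv->parse($line) or die "Bad CSV line: $line";
    my @fields = $csv->fields();
    $uses{ $fields[0] } = \@fields;   # key each row on its first column
}
```

This removes the need for the parallel array and the second loop entirely.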

--
Jim Gibson
 
J. Gleixner
11-10-2011
On 11/10/11 15:57, batrams wrote:
> Hi guys - this hunk of code seems to function as intended. After I get
> code to work I'm often curious to run it by you gurus to see how you
> might have done it. Please don't laugh too hard but any comments are
> welcome.


If you want a hash, see the getline_hr method.
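A hedged sketch of that approach. The column names and the in-memory input are invented for illustration, and Text::CSV is assumed to be available:

```perl
use strict;
use warnings;
use Text::CSV;

my $data = "a1,widget,3\na2,gadget,5\n";
open my $fh, '<', \$data or die "open: $!";   # in-memory filehandle

my $csv = Text::CSV->new({ binary => 1 });
$csv->column_names(qw(id name qty));          # hypothetical column names

my %by_id;
while ( my $row = $csv->getline_hr($fh) ) {
    $by_id{ $row->{id} } = $row;              # each $row is a hashref
}
close $fh;
```

Each record arrives as a hash reference keyed by column name, so there is no index arithmetic at all.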

 
Rainer Weikusat
11-10-2011
Ben Morrow writes:

[...]

> open (my $TEXTFILE, "<", $file)
> or die "Couldn't open '$file': $!\n";


[...]

> - I put back the "\n" on the die message: file and line are useful
> for a programmer, and should be left in on messages you don't
> expect to ever see in production, but messages addressed to the
> end user don't want them.


The end-user is going to report the problem as "It crashed" no matter
what actually occurred and any information said end-user can just
copy&paste into an e-mail is possibly helpful to the person who is
supposed to deal with the issue.

[...]

>> close(TEXTFILE);

>
> Another advantage of lexical filehandles is they auto-close when they go
> out of scope. It's risky taking advantage of this if you have written to
> the handle: in that case you should be calling close explicitly and
> checking the return value.


Or so the Linux close(2) manpage claims. The reason it does this is
because data written to a file residing on a NFS filesystem (and
possibly, other network file systems) isn't necessarily sent to the
server until the (NFS) filehandle is closed. Because of this, the
return value of a close of such a file may be some error which occurred
during network communication. For practical purposes, this is often
irrelevant and - much more importantly - checking the return value of
the close doesn't help except if reporting "This didn't work. Bad for
you!" is considered to be useful. Programs which care about the
integrity of data written to some persistent medium, at least to the
degree this is possible, need to use fsync before closing the file and
need to implement some strategy for dealing with errors returned by
that (for Perl, this means invoking IO::Handle::flush, followed by
IO::Handle::Sync).

The actual write operation can actually fail after close returned
success even for local files.
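For reference, a minimal sketch of that flush-then-sync sequence in Perl, writing to a throwaway temporary file and checking every step's return value:

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

my ($fh, $name) = tempfile();                      # throwaway file for the sketch
print {$fh} "important data\n" or die "print: $!";
$fh->flush or die "flush: $!";                     # PerlIO buffer -> kernel
$fh->sync  or die "sync: $!";                      # kernel -> storage (fsync)
close $fh  or die "close: $!";
```

A real program would additionally need the recovery strategy discussed above; this only shows where the fsync step sits relative to close.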
 
George Mpouras
11-11-2011
my %H_USES;
while (<DATA>) {
    $_ = [ split ',' ];
    $H_USES{ $_->[0] } = $_;
}

use Data::Dumper; print Dumper(\%H_USES);

__DATA__
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a3,b3,c3,d3


 
Justin C
11-11-2011
On 2011-11-10, Ben Morrow wrote:

[snip]

> The only important improvement here is the 'my $TEXTFILE'. Bareword
> filehandles are always global


Hey! And there's the answer to a question I was going to post! Couldn't
figure out why I had to pass a sub a ref to a hash from the calling
hash, but it figured out the FH all by itself. Now I know... and I know
why not to do it too. TY!

[snip]

Justin.

--
Justin C, by the sea.
 
Dr.Ruud
11-11-2011
On 2011-11-11 08:18, George Mpouras wrote:

> split ','


Hmm. What do you think that

split '[,]'

does?
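Both calls split on a literal comma, since split compiles even a plain-string first argument as a pattern; a quick check:

```perl
use strict;
use warnings;

my $line = 'a1,b1,c1';
my @a = split ',',   $line;   # the string ',' is still treated as a pattern
my @b = split '[,]', $line;   # a character class matching that same comma
my @c = split /,/,   $line;   # an explicit regex, same result
```

All three produce ('a1', 'b1', 'c1'), so the '[,]' form buys nothing here.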

--
Ruud
 
batrams
11-11-2011
On 11/10/2011 2:10 PM, J. Gleixner wrote:
> hash, see the getline_hr



Thanks! That looks like what I need

DB
 
Rainer Weikusat
11-11-2011
Ben Morrow writes:
> Quoth Rainer Weikusat:
>> Ben Morrow writes:
>>
>> > open (my $TEXTFILE, "<", $file)
>> > or die "Couldn't open '$file': $!\n";

>>
>> > - I put back the "\n" on the die message: file and line are useful
>> > for a programmer, and should be left in on messages you don't
>> > expect to ever see in production, but messages addressed to the
>> > end user don't want them.

>>
>> The end-user is going to report the problem as "It crashed" no matter
>> what actually occurred and any information said end-user can just
>> copy&paste into an e-mail is possibly helpful to the person who is
>> supposed to deal with the issue.

>
> You are certainly right for a large (and important) class of end-users.
> Programs aimed at such people should, as you say, simply give a message
> saying 'there was a problem' plus some way to send someone competent a
> logfile with as much detailed information as possible.


That wasn't what I was saying. I intended to argue in favor of
omitting the \n because an event which causes a program to die is
usually one some kind of developer will need to deal with and hence,
including information which is only relevant to a developer in the
diagnostic makes sense.

[...]

> Look at the error messages given by the standard Unix utilities. They
> (usually) do include filenames and strerror messages, but very rarely
> include __FILE__ and __LINE__ information.


In C-code, I always include the name of the function where the error
occurred. Granted, the default 'sudden death message suffix' is more
verbose than I consider necessary, OTOH, it contains enough
information to locate the operation that failed in the code and it is
already built in.

[...]

>> > Another advantage of lexical filehandles is they auto-close when they go
>> > out of scope. It's risky taking advantage of this if you have written to
>> > the handle: in that case you should be calling close explicitly and
>> > checking the return value.

>>
>> Or so the Linux close(2) manpage claims. The reason it does this is
>> because data written to a file residing on a NFS filesystem (and
>> possibly, other network file systems) isn't necessarily sent to the
>> server until the (NFS) filehandle is closed. Because of this, the
>> return value of a close of such a file may be some error which occurred
>> during network communication.

>
> You don't need to go anywhere near that far. PerlIO (and stdio in C)
> buffers writes, so when you call close (that is, the equivalent of
> fclose(3), not close(2)) there is still a partial bufferful of data
> which hasn't been sent to the kernel yet. If that write(2) gives an
> error, that error will be returned from close.


So, more generally put: Using hidden buffering mechanisms is risky in
the sense that they decouple the operation which provides the data and
the operation which moves the data to its destination. As side effects
of that, the time window during which data loss could happen becomes
much larger and an application can no longer determine where in the
produced 'output data stream' the problem actually occurred (and try
to recover from that). The justification for hidden buffering is
'usually it works, and while it is working it significantly improves
the performance of the system'.

'Reliable I/O' or even just 'reliably reporting I/O errors' and
'hidden buffering' can't go together and ...

> Additionally, close will return an error if *any* of the previous
> writes failed (unless you call IO::Handle->clearerr), so checking
> the return value of close means you don't need to check the return
> value of print.


.... checking the return value of close is just a fig leaf covering the
problem: If everything worked, it is redundant and when something
didn't work, reporting that "something didn't work" (but I know neither
what nor when and can't do anything to repair that) is not helpful.

>> For practical purposes, this is often irrelevant and - much more
>> importantly - checking the return value of the close doesn't help
>> except if reporting "This didn't work. Bad for you!" is considered
>> to be useful.

>
> It is. Always. It's better to *know* something failed, and have some
> idea as to why, than to think it succeeded when it didn't.


Let's assume that somebody runs a long-running data processing task
supposed to generate a lot of important output. On day three, the disk
is full but the program doesn't notice that. On day fifteen, after the
task has completed, it checks the return value of close and prints
"The disk became full! You lost!".

Do you really think a user will appreciate that?

>> Programs which care about the integrity of data written to some
>> persistent medium, at least to the degree this is possible, need to
>> use fsync before closing the file and need to implement some strategy
>> for dealing with errors returned by that (for Perl, this means
>> invoking IO::Handle::flush, followed by IO::Handle::Sync).

>
> Yes, if you have strict integrity requirements simply checking close
> isn't enough. It does guard against the two most common causes of error,
> though: disk full (or quota exceeded) and network failure.


Or so you hope: In case of a regular file residing on some local
persistent storage medium, a write system call encountering a 'too
little free space to write that' problem is supposed to write as many
bytes as can be written and return this count to the caller. This
means a final write on close will not return ENOSPC but will silently
truncate the file. Since close doesn't return a count, there's no way
to detect that. And there are - of course - people who think "Well, 0
is as good a number as any other number, so, in the case that write
really can't write anything, it is supposed to write as many bytes as
it can, namely, 0 bytes, and report this as successful write of size 0
to the caller" (I'm not making this up). Then, there are 'chamber of
horrors' storage media where 'successfully written' doesn't mean 'can
be read back again' and 'can be read back now' doesn't mean 'can still
be read back five minutes from now', aka 'flash ROM'. In case of
network communication, a remote system is perfectly capable of sending
an acknowledgment for something and die a sudden death before the
acknowledged data actually hit the persistent storage medium and so
on.

Eg, so far, I have written one program supposed to work reliably
even though it is running in Germany and writes to an NFS-mounted
filesystem residing in the UK, and this program not only checks the
return value of each I/O operation but additionally checks that the
state of the remote file system is what it was supposed to be if the
operation was actually performed successfully, and retries everything
(with exponential backoff) until success or administrative
intervention --- network programming remains network programming, no
matter if the file system API is used for high-level convenience.
 
Rainer Weikusat
11-11-2011
Ben Morrow writes:
> Quoth Rainer Weikusat:
>> Ben Morrow writes:
>> > Quoth Rainer Weikusat:
>> >>
>> >> The end-user is going to report the problem as "It crashed" no matter
>> >> what actually occurred and any information said end-user can just
>> >> copy&paste into an e-mail is possibly helpful to the person who is
>> >> supposed to deal with the issue.
>> >
>> > You are certainly right for a large (and important) class of end-users.
>> > Programs aimed at such people should, as you say, simply give a message
>> > saying 'there was a problem' plus some way to send someone competent a
>> > logfile with as much detailed information as possible.

>>
>> That wasn't what I was saying. I intended to argue in favor of
>> omitting the \n because an event which causes a program to die is
>> usually one some kind of developer will need to deal with

>
> I disagree.
>
> ~% grep mauzo /etc/paswd
> grep: /etc/paswd: No such file or directory
>
> That message would *not* be improved by including the file and line in
> grep's source where the error happened to be discovered.


As I already wrote:

,----
|In C-code, I always include the name of the function where the error
|occurred. Granted, the default 'sudden death message suffix' is more
|verbose than I consider necessary, OTOH, it contains enough
|information to locate the operation that failed in the code and it is
|already built in.
`----

>> >> For practical purposes, this is often irrelevant and - much more
>> >> importantly - checking the return value of the close doesn't help
>> >> except if reporting "This didn't work. Bad for you!" is considered
>> >> to be useful.
>> >
>> > It is. Always. It's better to *know* something failed, and have some
>> > idea as to why, than to think it succeeded when it didn't.

>>
>> Let's assume that somebody runs a long-running data processing task
>> supposed to generate a lot of important output. On day three, the disk
>> is full but the program doesn't notice that. On day fifteen, after the
>> task has completed, it checks the return value of close and prints
>> "The disk became full! You lost!".
>>
>> Do you really think a user will appreciate that?

>
> As opposed to the program falsely claiming it completed successfully?
> Yes, absolutely.


I agree with that: Should the program print "I'm happy to inform you
that your task ran to completion" even though it didn't, the person
whose time got lost because of the lack of a sensible error-handling
strategy will certainly be even more ****ed than when the program just
prints "Sorry. Couldn't be bothered to check print return
values. Something failed".

To state this again: Either, the application doesn't care for the
result of an I/O operation (and there are valid reasons for that, eg,
short running, interactive tasks where any failure will be obvious to
the user). In this case, checking the return value of close is
pointless. Or the application does care for the results of I/O
operations. In this case, the hidden buffering facility whose
shortcomings are supposed to be plastered over by checking the return
value of close must not be used. "I'll just say 'Abracadabra' and
firmly believe everything's fine" doesn't work.

>> >> Programs which care about the integrity of data written to some
>> >> persistent medium, at least to the degree this is possible, need to
>> >> use fsync before closing the file and need to implement some strategy
>> >> for dealing with errors returned by that (for Perl, this means
>> >> invoking IO::Handle::flush, followed by IO::Handle::Sync).
>> >
>> > Yes, if you have strict integrity requirements simply checking close
>> > isn't enough. It does guard against the two most common causes of error,
>> > though: disk full (or quota exceeded) and network failure.

>>
>> Or so you hope: In case of a regular file residing on some local
>> persistent storage medium, a write system call encountering a 'too
>> little free space to write that' problem is supposed to write as many
>> bytes as can be written and return this count to the caller. This
>> means a final write on close will not return ENOSPC but will silently
>> truncate the file. Since close doesn't return a count, there's no way
>> to detect that.

>
> RTFS, of either perl or libc (or, you know, the manpages). fclose(3)
> calls fflush(3), which will repeatedly call write(2) until the whole of
> the remaining buffer has been successfully written,


Or so you hope. You cannot possibly have checked the source code of
each past, present and future hidden buffering facility and I
seriously doubt that you did check even a large minority of them, not
the least because the source code of many of them isn't available. And
- of course - since the kernel will usually provide another layer of
hidden buffering, this is still technically equivalent to "La la la. I
didn't hear a thing!".

> Perl's PerlIO functions do exactly the same thing. stdio certainly has
> some design bugs, but it wasn't designed by idiots.


This is not a matter of the design but of the implementation.

>> And there are - of course - people who think "Well, 0
>> is as good a number as any other number, so, in the case that write
>> really can't write anything, it is supposed to write as many bytes as
>> it can, namely, 0 bytes, and report this as successful write of size 0
>> to the caller" (I'm not making this up).

>
> SUSv4 says:
>
> | Where this volume of POSIX.1-2008 requires -1 to be returned and errno
> | set to [EAGAIN], most historical implementations return zero [...].
>
> so, while this behaviour is not strictly POSIX-conforming, a decently-
> written application should handle that case correctly.


And that is:

These functions shall fail if:

[EAGAIN]
The O_NONBLOCK flag is set for the file descriptor and the thread
would be delayed in the write() operation.

Not applicable. In case of the 'write returning zero instead of
ENOSPC' issue (and this is a real implementation), the program will
loop forever in the flushing routine.
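A sketch of a write loop that guards against that failure mode: a syswrite result of 0 is reported as an error rather than retried forever. The helper name is invented for illustration:

```perl
use strict;
use warnings;
use POSIX qw(EINTR);

# Hypothetical helper: write the whole buffer, treating a 0-byte
# result as a failure instead of retrying it indefinitely.
sub write_all {
    my ($fh, $buf) = @_;
    my $off = 0;
    while ( $off < length $buf ) {
        my $n = syswrite($fh, $buf, length($buf) - $off, $off);
        if ( !defined $n ) {
            next if $! == EINTR;   # interrupted system call: retry
            return undef;          # real error, $! describes it
        }
        return undef if $n == 0;   # "wrote nothing": report, don't loop
        $off += $n;
    }
    return $off;
}
```

Returning undef on the zero case lets the caller decide how (and whether) to recover, instead of spinning inside a flushing routine.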
 

