logic question for text file updates

 
 
ccc31807
03-26-2010
We have a csv source file of many thousands of records, with two
columns, the ID and a status field. It has very recently come to my
attention that occasionally the status of a record will change, with
the change being significant enough that the record must be updated
before the process runs. The update files consist of a small subset,
sometimes a very small subset, of the records in the source file. (The
update file has a number of other fields that can change also, but I'm
only concerned with the status field.)

My first inclination is to open the update file, create a hash with
the ID as the key and the status as value, then open the source file,
read each line, update the line if it exists in the hash, and write
each line to a new output file. However, I can think of several
different ways to do this -- I just don't know which way would be
best. I don't particularly want to read every line and write every
line of a source file when only a few lines (if any) need to be
modified.
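In outline, that first approach might look like this (a minimal sketch;
the file names and the two-field pipe layout are assumptions for
illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the update file into a hash: ID => new status.
    my %update;
    open my $upd, '<', 'updates.txt' or die "updates.txt: $!";
    while (<$upd>) {
        chomp;
        my ($id, $status) = (split /\|/)[0, 1];  # ignore the other fields
        $update{$id} = $status;
    }
    close $upd;

    # Stream the source file, patching records that appear in %update.
    open my $in,  '<', 'source.txt'     or die "source.txt: $!";
    open my $out, '>', 'source.txt.new' or die "source.txt.new: $!";
    while (<$in>) {
        chomp;
        my ($id, $status) = split /\|/;
        $status = $update{$id} if exists $update{$id};
        print $out "$id|$status\n";
    }
    close $in;
    close $out;
    rename 'source.txt.new', 'source.txt' or die "rename: $!";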

My second inclination would be to use a database and write an update
query for the records in the update file. But this seems a heavyweight
solution to a lightweight problem -- I would only be using the
database to modify records, not to do any of the things we ordinarily
use databases for.

I've never had to do a small number of updates to a large file before,
and it seems too trivial a task to use a database for. Any suggestions
on a better way to do this?

Thanks, CC.

P.S. - The end product of this process is a data file with
approximately 20 fields, written as comma separated, double quote
delimited text, designed to be imported into Excel and Access by end
users in performance of their duties.
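For what it's worth, a module like Text::CSV can produce that
double-quoted output; a minimal sketch (the file name and the row
contents are made up):

    use strict;
    use warnings;
    use Text::CSV;

    # always_quote gives the "double quote delimited" style described above.
    my $csv = Text::CSV->new({ always_quote => 1, eol => "\n" })
        or die Text::CSV->error_diag;

    open my $fh, '>', 'report.csv' or die "report.csv: $!";
    for my $row ( ['0059485', 'ccc', 'Current'] ) {   # placeholder row
        $csv->print($fh, $row);   # emits "0059485","ccc","Current"
    }
    close $fh;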
 
 
 
 
 
sln@netherlands.com
03-26-2010
On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <(E-Mail Removed)> wrote:

>We have a csv source file of many thousands of records, with two
>columns, the ID and a status field.
>[snip]


You don't have to write a new file back out to disk for a small change.
You could design a disk that can grow or shrink its magnetic material
on the fly, and just insert/remove metal sectors as needed.

But, I think they trashed that idea when they invented fragmentation
capabilities.

To circumvent fragments on a data level, you could re-design
the file record so that a particular field of a record is
fixed width relative to surrounding fields, wide enough
to hold the largest value that field could possibly
encounter.

And if the field is small enough to accommodate all possible
values, there is not that much "air" involved in relation to
the overall file size.

Since the field is fixed, the offset into the record to the
field can be surmised and added to the location of the last
record end position, allowing you to write a new fixed width
value, guaranteeing not to overwrite the next field in that
record.
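A sketch of what that looks like in practice (the 20-character width
is an assumption here):

    # Writing: pad the status to a fixed 20 characters so its byte
    # offset within the record never changes.
    my $fixed = sprintf '%-20s', 'Current';   # left-justified, space-padded
    print "0059485|$fixed|other\n";

    # Reading: trim the padding back off.
    (my $status = $fixed) =~ s/\s+\z//;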

-sln
 
 
 
 
 
ccc31807
03-26-2010
On Mar 26, 2:35 pm, (E-Mail Removed) wrote:
> On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <(E-Mail Removed)> wrote:
> [snip]
>
> Since the field is fixed, the offset into the record to the
> field can be surmised and added to the location of the last
> record end position, allowing you to write a new fixed width
> value, guaranteeing not to overwrite the next field in that
> record.
>
> -sln


The key will always be a seven-character integer. The value will
always be a string with fewer than 20 characters. I COULD use a fixed
width format, but my current format (for the source file) is pipe
separated (e.g. 0059485|Current) and all my logic splits input on the
pipe symbol.

The keys are not consecutive, not ordered, and have large skips, i.e.,
across a key range of several million I might have only ten thousand
records in the source file, in random order (is that an oxymoron?).
Treating the source file as an array would require many more array
elements than records in the file.

CC.
 
 
sln@netherlands.com
03-26-2010
On Fri, 26 Mar 2010 12:00:17 -0700 (PDT), ccc31807 <(E-Mail Removed)> wrote:

>On Mar 26, 2:35 pm, (E-Mail Removed) wrote:
>> On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <(E-Mail Removed)> wrote:


[snip]

>> >My first inclination is to open the update file, create a hash with
>> >the ID as the key and the status as value, then open the source file,
>> >read each line, update the line if it exists in the hash, and write
>> >each line to a new output file.


>> >P.S. - The end product of this process is a data file with
>> >approximately 20 fields, written as comma separated, double quote
>> >delimited text, designed to be imported into Excel and Access by end
>> >users in performance of their duties.

>>


>> Since the field is fixed, the offset into the record to the
>> field can be surmised and added to the location of the last
>> record end position, allowing you to write a new fixed width
>> value, guaranteeing not to overwrite the next field in that
>> record.
>>

>
>The key will always be a seven-character integer. The value will
>always be a string with fewer than 20 characters. I COULD use a fixed
>width format, but my current format (for the source file) is pipe
>separated (e.g. 0059485|Current) and all my logic splits input on the
>pipe symbol.
>[snip]
>
>CC.


So, if you have a source file, delimited by | that you eventually turn
into a double-quote, comma-delimited CSV file, you could make that
status field fixed width (what, 20 chars tops?) in the source. When you
generate the CSV data file, just strip whitespace from the beginning
and end of the field before you double quote it.

Source file:
- fields all dynamic width except status (which is fixed 20 char).
- format

<field1>|<field2>|<field3>|<field4>|<- status, 20 char ->|<field6>|<field7>|<field_last>\n

You know the file position of the previous EOR. Use index() to find the pipe '|' char
of the status field of the current record (4th in the example), add that to the previous
EOR to get the write() position for the new status (if it changed).

To find out if the status changed, do your split /\|/ to get all the
fields, check the ID/status from the update file, and write out the new
"fixed width" status (formatted with sprintf or the like) to the
source file.

When it comes time to generate the csv from the source, just trim spaces before you
write it out.
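A sketch of that in-place update; the file name and the update hash are
made up, and the status being the 5th field follows the example layout
above (values are assumed to fit in 20 characters):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %update = ('0059485' => 'Inactive');   # ID => new status (assumed)

    # '+<' opens the file read/write without truncating it.
    open my $fh, '+<', 'source.dat' or die "source.dat: $!";

    my $pos = 0;   # byte offset of the start of the current record
    while (my $line = <$fh>) {
        (my $rec = $line) =~ s/\n\z//;
        my @fields = split /\|/, $rec;
        my ($id, $status) = @fields[0, 4];    # status is the 5th field here
        $status =~ s/\s+\z//;                 # trim the fixed-width padding
        if (exists $update{$id} and $update{$id} ne $status) {
            # Offset of the status field: skip the first four pipes.
            my $off = 0;
            $off = index($rec, '|', $off) + 1 for 1 .. 4;
            seek $fh, $pos + $off, 0 or die "seek: $!";
            print $fh sprintf '%-20s', $update{$id};  # exactly 20 bytes
            seek $fh, $pos + length($line), 0;        # resume reading
        }
        $pos += length $line;
    }
    close $fh;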

-sln

sln@netherlands.com
03-26-2010
On Fri, 26 Mar 2010 19:38:03 +0000, Ben Morrow <(E-Mail Removed)> wrote:

>

[snip]
>Is there any way of 'blanking' a record? Normal CSV doesn't support
>comments, and if you're importing into Excel you can't extend it to do
>so; what does Excel do if you give it a file like
>
> one|two|three
> ||||||||||||||
> four|five|six


In this case, the newline is the record delimiter and '|' is the
field delimiter. You can have Excel treat consecutive field delimiters
as one, in which case the |||||||||||||| line produces a blank record.
This is the way Excel 2002 works; I don't know whether you can
auto-remove blank records in newer versions.

-sln
 
 
ccc31807
03-26-2010
On Mar 26, 3:42 pm, (E-Mail Removed) wrote:
> So, if you have a source file, delimited by | that you eventually turn
> into a double-quote, comma-delimited CSV file, you could make that
> status field fixed width (what, 20 chars tops?) in the source.
> [snip]


Working backwards, my ultimate output file looks like this:
"id","field2","field3","status","field5","field6"\ n

The 'status' field should be the current status, which rarely changes,
but it's critical to use the most current status.

I get about ten update files a year with the current status and a
number of other fields that I don't care about. I take these files,
strip out everything except the ID and the STATUS, and write that data
into memory.

Working frontwards, I build a source file with the two fields I
referenced, like this: [id|status]

I went ahead and bit the bullet, since I had to do something. I (1)
save the source file to a backup, (2) read in the source file and save
it to a hash on the ids, (3) read in the update file the same way, and
(4) print out the hash to the source file. It's reasonably quick, less
than a second (although I haven't benchmarked it), and seems to be
reliable.
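In code, those four steps come out to roughly this (file names assumed,
and the update file reduced to its first two fields):

    use strict;
    use warnings;
    use File::Copy qw(copy);

    copy 'source.txt', 'source.txt.bak' or die "backup: $!";  # (1)

    my %record;
    for my $file ('source.txt', 'updates.txt') {              # (2), (3)
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            chomp;
            my ($id, $status) = (split /\|/)[0, 1];
            $record{$id} = $status;   # update file overwrites the source
        }
        close $fh;
    }

    open my $out, '>', 'source.txt' or die "source.txt: $!";  # (4)
    print $out "$_|$record{$_}\n" for keys %record;
    close $out;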

That said, I'd like to learn a more elegant way to do it.

CC.

> You know the file position of the previous EOR. Use index() to find
> the pipe '|' char of the status field of the current record, add that
> to the previous EOR to get the write() position for the new status
> (if it changed).
> [snip]


That sounds a lot more complicated than the brute force approach I
used, but I appreciate your suggestion of treating the file as fixed
width, and I will explore it later.

CC
 
 
ccc31807
03-26-2010
On Mar 26, 3:38 pm, Ben Morrow <(E-Mail Removed)> wrote:
> Is there any way of 'blanking' a record? Normal CSV doesn't support
> comments, and if you're importing into Excel you can't extend it to do
> so; what does Excel do if you give it a file like


Actually, I comment CSV files all the time, not for use by Excel, but
for use by my scripts. The 'comments' are interspersed lines
beginning with #, so I can do this:

    while (<INPUT>)
    {
        next if /^#/;
        ...
    }

>     - read the update file(s) into a hash,
>     - open the source file read/write,
>     - go through it looking for the appropriate records,
>     - when you find one, wipe it out without changing the length or
>       removing the newline,
>     - add the changed records onto the end of the file, since the
>       records weren't in order anyway.


I don't see any real difference between this and reading the entire
file into memory, at least for the size of files I'm dealing with. IO
is always a bottleneck, and unless space is limited it's better to use
space than time.
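For reference, a sketch of the blank-and-append approach Ben describes,
with the same assumed two-field layout: overwrite a matched record with
spaces of the same length, then append the corrected records at the end.

    use strict;
    use warnings;

    my %update = ('0059485' => 'Inactive');   # ID => new status (assumed)

    open my $fh, '+<', 'source.txt' or die "source.txt: $!";
    my $pos = 0;
    my @changed;
    while (my $line = <$fh>) {
        my ($id) = split /\|/, $line;
        if (exists $update{$id}) {
            # Wipe the record in place, keeping its length and newline.
            seek $fh, $pos, 0 or die "seek: $!";
            print $fh ' ' x (length($line) - 1);
            seek $fh, $pos + length($line), 0;   # resume reading
            push @changed, "$id|$update{$id}\n";
        }
        $pos += length $line;
    }
    seek $fh, 0, 2;          # 2 = SEEK_END: append the corrected records
    print $fh @changed;
    close $fh;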

> It's generally not worth messing around with approaches like this,
> though. Rewriting a file of a few MB doesn't exactly take long, and it's
> much easier to get right.


Yeah, I'm beginning to think that the investment of my time isn't
worth the results.

Thanks, CC.
 
 
sln@netherlands.com
03-26-2010
On Fri, 26 Mar 2010 13:42:08 -0700 (PDT), ccc31807 <(E-Mail Removed)> wrote:

>
>That sounds a lot more complicated than the brute force approach I
>used, but I appreciate your suggestion of treating the file as fixed
>width, and I will explore it later.
>
>CC


Actually, the brute force method you cite is far and away the
more complicated approach.

Good luck!

-sln
 
 
ccc31807
03-26-2010
On Mar 26, 5:15 pm, (E-Mail Removed) wrote:
> Actually, the brute force method you cite is far and away the
> more complicated approach.


I would be very interested in why you think this. It may depend on
your definition of 'complicated.'

In terms of writing the code, it was pretty simple. First, open the
source file and read it into a hash. Second, open the update file and
read it into the SAME(!) hash (thereby overwriting the old values
where the hash keys are duplicated). Third, write the hash back out to
the source file.

As to 'complicated', I have known people who use Access for data
processing, spending hours on end creating and building Access
databases, queries, and reports to manipulate data. It takes them a
lot longer to generate a report using Access than it does me, using
Perl to munge the data. They say that my way is more 'complicated'
because I use Perl (which is 'harder') and Access is easier. I say my
way is less 'complicated' because I don't have to mess around with
Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
I have a very hard time understanding them, and (from my POV) I would
say that you have a weird conception of 'complicated.'

CC.
 
 
Martijn Lievaart
03-26-2010
On Fri, 26 Mar 2010 12:00:17 -0700, ccc31807 wrote:

> The key will always be a seven character integer. The value will always
> be a string with fewer than 20 characters. I COULD use a fixed width
> format, but my current format (for the source file) is pipe separated
> (e.g. 0059485|Current) and all my logic splits input on the pipe symbol.


Hurray for encapsulation. If you had encapsulated this from the start
(i.e., get a line, call a sub split_to_fields), you would only have
had to update one sub.

Not much help now, but something to keep in the back of your mind when
designing your next program.
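For instance, a minimal sketch of that encapsulation (split_to_fields
is the sub name from above; the data lines are made up):

    use strict;
    use warnings;

    # All knowledge of the record format lives in one sub; switching
    # from pipes to fixed width would mean changing only this.
    sub split_to_fields {
        my ($line) = @_;
        chomp $line;
        return split /\|/, $line;
    }

    while (my $line = <DATA>) {
        my ($id, $status) = split_to_fields($line);
        print "$id => $status\n";
    }

    __DATA__
    0059485|Current
    0061234|Inactive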

M4
 