Velocity Reviews > Newsgroups > Programming > Ruby > File Auditing - Fast DB import and data manipulation

File Auditing - Fast DB import and data manipulation

 
 
bcorrigan78@gmail.com
03-20-2006
Hey Folks,

I'm looking for some suggestions here. I need to process 200 gzip-compressed files each day that contain comma-delimited information in the following format:

HostName, FileName, DirName, Modified_On

Each file contains upwards of 200K rows. I need to compare this information to the information received the day before to look for files that have changed.

My current plan was to:
##
for each compressed file
  read and uncompress each line
  import the line into MySQL
end
run several comparison queries to find changes; save the changes to a table
automatically review the changes based on rules; whatever is left over is unauthorized
##
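The decompress-and-read part of that plan could be sketched in Ruby with the standard library's Zlib (the per-line MySQL insert is left out here, since that's the slow part; the method name is hypothetical):

```ruby
require "zlib"
require "csv"

# Stream a gzip-compressed, comma-delimited file line by line,
# yielding [host, file, dir, modified_on] rows without loading
# the whole file (or the uncompressed text) into memory.
def each_row(gz_path)
  Zlib::GzipReader.open(gz_path) do |gz|
    gz.each_line do |line|
      yield CSV.parse_line(line.chomp)
    end
  end
end
```

Each yielded row could then be batched into a bulk INSERT; a one-row-at-a-time INSERT inside this loop is exactly what makes the import take 10 minutes per file.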

The MySQL import is SLOOOOW. It's taking 10 minutes per file. Extrapolating, that's 2,000 minutes, or about 33 hours, each day just to do the import. Unfortunately, Earth days only have 24 hours.

So, I need some way to compare today's file to yesterday's and see the changes. Is there a good way to do this using the text files and skip the import process? I'm worried that this will slow down the comparison, but I'd like to try it....
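A minimal sketch of a pure-Ruby comparison, assuming yesterday's and today's files have already been gunzipped to plain text: load yesterday's lines into a Set and report the lines in today's file that aren't in it. At 200K rows per file the Set fits comfortably in memory.

```ruby
require "set"

# Return the lines present in today's file but not yesterday's.
# Keying on the whole line means any change to Modified_On (or a
# brand-new file entry) shows up as a changed row.
def changed_rows(yesterday_path, today_path)
  old_lines = File.foreach(yesterday_path).map(&:chomp).to_set
  File.foreach(today_path).map(&:chomp).reject { |l| old_lines.include?(l) }
end
```

This skips the database entirely for the comparison step; only the surviving changed rows would ever need to be loaded into MySQL for the rule-based review.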

Thanks

Brian

 
Corey Lawson
03-21-2006
Well, if you're doing it on *nix, use the 'diff' command. On Windows, use 'fc', and write some other code to handle the differences. Loading the files into the database is not the best way forward. Use 'gunzip' on either system (on Windows you'll need to download gunzip from gnuwin32, http://gnuwin32.sourceforge.net/packages.html; heck, might as well download the packages with diff and grep in them as well).

Wait, this has come up before, and sparked quite the flamefest each time!

With moderately clever code, especially on *nix, you should even be able to do this in a shell script with minimal lines of code, something like this sh sketch:

ff=""
for f in *.gz; do
  # arg, gotta gunzip the file...
  gunzip -c "$f" > "${f%.gz}.txt"
  f="${f%.gz}.txt"
  if [ -n "$ff" ]; then
    # keep only the lines added or changed in the newer file
    diff "$ff" "$f" | grep '^>' > "$f.diff"
  fi
  ff="$f"
done

The databases I've dealt with (Oracle, SQL Server, Access, Postgres)
don't have a good way to do this in SQL or stored procedure language,
because they don't really support recursion at the SQL level. Plus,
it's just too expensive for the database to allocate table space (temp
or not) to load a couple of files, run all the queries that you would
need to run to spot the differences and extract them meaningfully, and
then go on to the next file.

Otherwise, another quick-and-dirty way to do it is to set up the import table with a multi-column primary key or unique constraint, add a field to store the file name (but keep it out of the constraint), and then just bulk import each file. If you updated the filename field after the last load, then all the records that were successfully added this time will have NULL for a filename. Depending on the database, the rows that failed to be added due to constraint violations might be in the error logs, or they may not matter at all. Run any reports you need against the new records before updating the filename field.

Lather-rinse-repeat.
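An in-memory Ruby analogue of that trick, to show the mechanics (a Set stands in for the unique constraint; the class and method names are made up for illustration):

```ruby
require "set"

# Emulates the unique-constraint bulk-import trick: rows already
# "in the table" are silently rejected on insert, so whatever
# survives a load is, by definition, new or changed.
class AuditTable
  def initialize
    # Stands in for a PK on (host, file, dir, modified_on).
    @seen = Set.new
  end

  # Returns the rows from this load that were actually inserted,
  # i.e. the rows that would end up with a NULL filename field.
  def bulk_import(rows)
    rows.select { |row| @seen.add?(row) }
  end
end
```

Day one, every row comes back as new; day two, only the rows whose Modified_On (or any other column) changed survive the insert, which is exactly the report you want.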

For the SQL Server geeks, doing the "loop over the files in a
directory" is possible via DTS (sqldts.com has sample VBScript code to
do it...), but it's a sucky pain in the ass to do. But maybe it's
easier to do in SS2005.

Vance Heron
03-26-2006
You can use the 'mysqlimport' utility to do a bulk load
of the data.
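A hedged sketch of driving mysqlimport from Ruby (the database name, username, and file path are hypothetical; note mysqlimport derives the target table from the file's basename, so audit.txt loads into a table named audit):

```ruby
# Build (but don't yet run) the mysqlimport command line for one
# uncompressed comma-delimited file.
def mysqlimport_cmd(db, path, user: "auditor")
  ["mysqlimport",
   "--local",                   # read the data file on the client side
   "--fields-terminated-by=,",  # input is comma-delimited
   "--user=#{user}",
   db, path]
end

# system(*mysqlimport_cmd("filedb", "audit.txt"))  # would run the bulk load
```

Bulk loading this way (it uses LOAD DATA under the hood) should be dramatically faster than inserting 200K rows one at a time.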

Vance
