Efficient Data Storage

 
 
Aaron DeLoach
      09-08-2004
Hello,

I have run into unfamiliar ground. Some guidance would be appreciated.

This project has grown from 1,000 or so users to over 50,000 users. The
project has been an overall success, so it's time to spend a little on the
investment. Currently, we are getting our own servers (in lieu of ISP shared
servers) set up with mod_perl and are revisiting a lot of the code to make
things more efficient. Hopefully, in a month or so we can make the switch.

At present, user records are stored each in a single file using the
Data::Dumper module and the whole project works through the %user = eval
<FILE> method. User files are stored in directories named after the first
two characters of the user ID to keep the directories smaller, in theory,
for quicker searching of files (?). The records are read/written throughout
the use of the program in the method described.
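
For reference, the read/write path looks roughly like this (a simplified
sketch, not our exact code):

use Data::Dumper;

# Two-character sharding: user "jsmith" lives in users/js/jsmith
sub user_path {
    my ($id) = @_;
    return 'users/' . substr($id, 0, 2) . "/$id";
}

sub load_user {
    my ($id) = @_;
    open my $fh, '<', user_path($id) or die "open: $!";
    my $dumped = do { local $/; <$fh> };   # slurp the whole file
    my $user = eval $dumped;               # file holds "$VAR1 = { ... };"
    die $@ if $@;
    return $user;                          # hashref of the user record
}

sub save_user {
    my ($id, $user) = @_;
    open my $fh, '>', user_path($id) or die "open: $!";
    print {$fh} Dumper($user);             # rewrites the whole record
    close $fh or die "close: $!";
}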

I don't know how much efficiency would be gained by using an alternate
storage method. Perhaps MySQL? None of us are very familiar with databases,
although it doesn't seem very hard. We are looking into storing the records
as binary files which seems promising, but would like some input on the data
storage/retrieval methods available before we do anything.

I should mention that the project was first written in Perl and will remain
that way. Some suggestions were to investigate a different language. But
that's out of the question for now. We would rather increase efficiency in
the Perl code. Servers will remain Linux/Apache.

Any thoughts?




 
Tad McClellan
      09-08-2004
Aaron DeLoach <(E-Mail Removed)> wrote:

> This project has grown from 1,000 or so users to over 50,000 users.


> At present, user records are stored each in a single file using the
> Data::Dumper module



> I don't know how much efficiency would be gained by using an alternate
> storage method. Perhaps MySQL?



Some form of relational database would be an easy way to get
performance gains over a roll-your-own flat file approach.

I'd recommend PostgreSQL over MySQL though.


> We are looking into storing the records
> as binary files which seems promising, but would like some input on the data
> storage/retrieval methods available before we do anything.



If you use an RDBMS you won't _need_ to do anything with regard
to storage and retrieval as the DB will handle all of that for you.

That wheel has been invented and heavily refined, just roll with it!


> Any thoughts?



Use an RDBMS.
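
For the read side, a minimal DBI sketch (the DSN, credentials, and
users table here are illustrative, not prescriptive):

use DBI;

# Illustrative schema: CREATE TABLE users (id TEXT PRIMARY KEY, data TEXT)
my $dbh = DBI->connect('dbi:Pg:dbname=myapp', 'dbuser', 'dbpass',
                       { RaiseError => 1, AutoCommit => 1 });

my $user_id = 'jsmith';    # example
my $sth = $dbh->prepare('SELECT data FROM users WHERE id = ?');
$sth->execute($user_id);
my ($data) = $sth->fetchrow_array;   # the DB does the searching for you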


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Uri Guttman
      09-08-2004
>>>>> "AD" == Aaron DeLoach <(E-Mail Removed)> writes:

AD> At present, user records are stored each in a single file using
AD> the Data::Dumper module and the whole project works through the
AD> %user = eval <FILE> method. User files are stored in directories
AD> named after the first two characters of the user ID to keep the
AD> directories smaller, in theory, for quicker searching of files
AD> (?). The records are read/written throughout the use of the
AD> program in the method described.

as tad suggested, a dbms would be a good idea if you want to migrate from
a flat file. but just using File::Slurp will get you some immediate
speedups over <FILE> with almost no code changes.

changing from Data::Dumper to Storable will also speed things up and
require minimal code changes. try those before you make the leap to a
dbms.
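
the storable swap is about this small (a sketch, path made up):

use Storable qw(store retrieve);

my %user = (name => 'john', visits => 42);   # example record
my $path = 'john.stor';     # real code would keep the sharded dirs

# write: replaces print FILE Dumper(\%user)
store(\%user, $path);       # compact binary format, no stringifying

# read: replaces %user = eval <FILE>
my $href = retrieve($path); # hashref back, no string eval needed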

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
Sherm Pendley
      09-08-2004
Aaron DeLoach wrote:

> At present, user records are stored each in a single file using the
> Data::Dumper module and the whole project works through the %user = eval
> <FILE> method.


This suggests a minor tweak that could result in big gains under
mod_perl. Under traditional CGI, the file needs to be read and eval()'d
for each hit on the CGI.

Reducing the time it takes to read a user record is a good idea, but
with mod_perl you can also reduce the number of times a record is read.
You could take advantage of mod_perl's persistent environment here; keep
a hash of user records, and use an "Orcish Maneuver" to read and eval a
record only if the record you want is currently undef:

$users{$this_user} |= get_user($this_user);

The same sort of thing can be done for output templates, XSLT
transformer objects, and more. It's a very common technique for writing
mod_perl optimized code - Google for "Orcish Maneuver" for many examples.

There are naturally trade-offs to consider too. For example, if the file
has changed, the new data won't be read until the next time a new server
instance spawns. If your traffic is very high, and your server instances
have a lifetime measured in seconds, that may not be a problem. If not,
you might need a more involved conditional that also checks for the age
of the file, instead of the simplistic |= used above.
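
That more involved version might look something like this (just a
sketch; get_user() and user_file_for() are placeholders):

my (%users, %mtimes);   # persist across requests in a mod_perl child

sub cached_user {
    my ($id) = @_;
    my $path  = user_file_for($id);   # placeholder path helper
    my $mtime = (stat $path)[9];      # last-modified time on disk

    # re-read only when we have no copy or the file changed on disk
    if (!defined $users{$id} or $mtimes{$id} != $mtime) {
        $users{$id}  = get_user($id);
        $mtimes{$id} = $mtime;
    }
    return $users{$id};
}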

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
Glenn Jackman
      09-08-2004
At 2004-09-08 04:36PM, Sherm Pendley <(E-Mail Removed)> wrote:
[...]
> a hash of user records, and use an "Orcish Maneuver" to read and eval a
> record only if the record you want is currently undef:
>
> $users{$this_user} |= get_user($this_user);


you mean:
$users{$this_user} ||= get_user($this_user);


--
Glenn Jackman
NCF Sysadmin
(E-Mail Removed)
 
Sherm Pendley
      09-09-2004
Glenn Jackman wrote:

> you mean:
> $users{$this_user} ||= get_user($this_user);


Yes, of course. Dang fingers don't always type what they're told to
type...

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
ctcgag@hotmail.com
      09-09-2004
"Aaron DeLoach" <(E-Mail Removed)> wrote:
> Hello,
>
> I have run into unfamiliar ground. Some guidance would be appreciated.
>
> This project has grown from 1,000 or so users to over 50,000 users. The
> project has been an overall success, so it's time to spend a little on
> the investment.


It makes a huge difference whether those 50,000 users access one cgi page
per week, on average, or one cgi page per minute.

> Currently, we are getting our own servers (in lieu of ISP
> shared servers) set up with mod_perl and are revisiting a lot of the code
> to make things more efficient. Hopefully, in a month or so we can make
> the switch.


Do you have specific performance complaints? One should keep general
efficiency in mind, but it is better to focus on specific problems if they
exist.

> At present, user records are stored each in a single file using the
> Data::Dumper module and the whole project works through the %user = eval
> <FILE> method.


How big are these files?

> User files are stored in directories named after the first
> two characters of the user ID to keep the directories smaller, in theory,
> for quicker searching of files (?). The records are read/written
> throughout the use of the program in the method described.


Hopefully there are a few subroutines which are invoked throughout the
program to cause the files to be read or written. If it is raw IO code,
not subroutine calls, scattered throughout the program, then any changes
will be difficult. In that case, the first thing I would do is leave the
actual physical storage the same, but consolidate all the IO into a few
subroutines, so that you can just swap out subroutines to test different
methods.
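
Something like this, say (sub names are arbitrary, and the 'storable'
branch is just one candidate replacement):

use Data::Dumper;
use Storable qw(store retrieve);

my $BACKEND = $ENV{USER_BACKEND} || 'dumper';   # flip per test run

sub user_path { 'users/' . substr($_[0], 0, 2) . "/$_[0]" }

# Callers use only read_user()/write_user(); trying a new storage
# method means changing nothing but these two bodies.
sub read_user {
    my ($id) = @_;
    return retrieve(user_path($id)) if $BACKEND eq 'storable';
    open my $fh, '<', user_path($id) or die "read $id: $!";
    my $user = eval do { local $/; <$fh> };  # string-eval dumped record
    die $@ if $@;
    return $user;
}

sub write_user {
    my ($id, $user) = @_;
    return store($user, user_path($id)) if $BACKEND eq 'storable';
    open my $fh, '>', user_path($id) or die "write $id: $!";
    print {$fh} Dumper($user);
    close $fh or die "close $id: $!";
}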


> I don't know how much efficiency would be gained by using an alternate
> storage method. Perhaps MySQL?


My gut feeling is that it would not lead to large performance
improvements if all you do is use MySQL instead of the file system
as a bit bucket to store your Data::Dumper strings. Especially if
your server has a lot of memory and aggressively caches the FS.


> None of us are very familiar with
> databases, although it doesn't seem very hard. We are looking into
> storing the records as binary files which seems promising, but would like
> some input on the data storage/retrieval methods available before we do
> anything.


By binary files, do you mean using Storable rather than Data::Dumper?

I would expect that to make more of a performance difference than the
MySQL vs. file system choice.
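
If you want numbers first, Benchmark can compare the two read paths
directly (the record contents here are made up):

use Benchmark qw(cmpthese);
use Data::Dumper;
use Storable qw(freeze thaw);

# A made-up record in the ballpark of the sizes described
my %user   = map { ("field$_" => 'x' x 40) } 1 .. 100;
my $dumped = Dumper(\%user);    # text form, read back with eval
my $frozen = freeze(\%user);    # binary form, read back with thaw

# Negative count: run each sub for ~2 CPU seconds, report rates
cmpthese(-2, {
    dumper_eval   => sub { my $u = eval $dumped },
    storable_thaw => sub { my $u = thaw($frozen) },
});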

> I should mention that the project was first written in Perl and will
> remain that way. Some suggestions were to investigate a different
> language. But that's out of the question for now. We would rather
> increase efficiency in the Perl code. Servers will remain Linux/Apache.
>
> Any thoughts?


I'd spend some time investigating where the time is going now.
Make a script that does something like:

use strict;
use blahblah;
## all the other preliminaries that your real programs have to go through.

exit if $ARGV[0] == 1;   # level 1: compilation and startup only

my $data = load_file_for_user($ARGV[1]);
exit if $ARGV[0] == 2;   # level 2: + reading the file

my $user_ref = eval $data;
exit if $ARGV[0] == 3;   # level 3: + deserializing it

Do_whatever_your_most_common_task_is($user_ref);
exit if $ARGV[0] == 4;   # level 4: + the typical task itself
### etc.


then write another program:

foreach my $level (1..4) {
    my $start = time;                    # reset the clock for each level
    foreach (1..1000) {
        my $u = randomly_chosen_user();
        system "./first_program.pl $level $u" and die $!;
    }
    print "Level $level: ", time - $start, " seconds\n";
}

If level 1 time is almost as big as level 4 time, then the overhead
of compilation and startup is your biggest problem; and so on.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Aaron DeLoach
      09-10-2004
[...]

>> This project has grown from 1,000 or so users to over 50,000 users. The
>> project has been an overall success, so it's time to spend a little on
>> the investment.

>
> It makes a huge difference whether those 50,000 users access one cgi page
> per week, on average, or one cgi page per minute.


On average, there are approximately 15,000 hits each day. These hits come
through two different CGI programs, each using several different sub-programs.

>> Currently, we are getting our own servers (in lieu of ISP
>> shared servers) set up with mod_perl and are revisiting a lot of the code
>> to make things more efficient. Hopefully, in a month or so we can make
>> the switch.

>
> Do you have specific performance complaints? One should keep general
> efficiency in mind, but it is better to focus on specific problems if they
> exist.


We've noticed some performance issues as the user base grows, and we hope to
win some back by moving from the ISP to our own server(s), where we can
benefit from modules/configurations that are unavailable there (mod_perl,
etc.). Our ISP is working with us to smooth the transition and share
knowledge. They're a local company, and we were one of their first customers.

>> At present, user records are stored each in a single file using the
>> Data::Dumper module and the whole project works through the %user = eval
>> <FILE> method.

>
> How big are these files?


They average 4KB. However, we plan to store some additional information in
them to eliminate calls to slower methods/modules, which will bring the
average to 10KB or so.

> Hopefully there are a few subroutines which are invoked throughout the
> program to cause the files to be read or written. If it is raw IO code,
> not subroutine calls, scattered throughout the program, then any changes
> will be difficult. In that case, the first thing I would do is leave the
> actual physical storage the same, but consolidate all the IO into a few
> subroutines, so that you can just swap out subroutines to test different
> methods.


The entire record (file) is eval()'d into a hashref at the beginning of the
program(s). The user actions are performed upon the hashref until the end of
the session, and the hashref is then written back to the file (replacing the
original contents). Writing is done through the Data::Dumper module. Uri
has suggested different methods, which we're looking into.

>> I don't know how much efficiency would be gained by using an alternate
>> storage method. Perhaps MySQL?

>
> My gut feeling is that it would not lead to large performance
> improvements if all you do is use MySQL instead of the file system
> as a bit bucket to store your Data::Dumper strings. Especially if
> your server has a lot of memory and aggressively caches the FS.


Your gut feeling is correct according to our initial research into the db
server/methods. Increasing the FS cache would compensate for any gains from
using a db scenario. Memory is not going to be an issue. It seems that the
mod_perl requirements are going to govern that.

>> None of us are very familiar with
>> databases, although it doesn't seem very hard. We are looking into
>> storing the records as binary files which seems promising, but would like
>> some input on the data storage/retrieval methods available before we do
>> anything.

>
> By binary files, do you mean using Storable rather than Data::Dumper?


Uri introduced us to the Storable module. Until then, we only knew of the
general approach.

> I would expect that to make more of a performance difference than the
> MySQL vs. file system choice.


I'm glad to hear that. It's the way we were hoping to go.

> I'd spend some time investigating where the time is going now.
> Make a script that does something like:
>
> use strict;
> use blahblah;
> ## all the other preliminaries that your real programs have to go through.
>
> exit if $ARGV[0] == 1;
>
> my $data = load_file_for_user($ARGV[1]);
> exit if $ARGV[0] == 2;
>
> my $user_ref = eval $data;
> exit if $ARGV[0] == 3;
>
> Do_whatever_your_most_common_task_is($user_ref);
> exit if $ARGV[0] == 4;
> ### etc.
>
>
> then write another program:
>
> foreach my $level (1..4) {
>     my $start = time;
>     foreach (1..1000) {
>         my $u = randomly_chosen_user();
>         system "./first_program.pl $level $u" and die $!;
>     }
>     print "Level $level: ", time - $start, " seconds\n";
> }
>
> If level 1 time is almost as big as level 4 time, then the overhead
> of compilation and startup is your biggest problem; and so on.


Hmmm... I will update you on the results...

Thanks!


 