Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Data Storage Issue (Basic Issue)

Reply
Thread Tools

Data Storage Issue (Basic Issue)

 
 
Srini
Guest
Posts: n/a
 
      05-09-2008
Which is more efficient? Why??
Storing or Reading a file in database or on to disk in a web server
environment? considering all things like network, connections, memory
are ideal.

My knowledge says that using database is more efficient but few my
pierce argue that storing in disk is more efficient when concurrent
people accessing it...
 
Reply With Quote
 
 
 
 
Tom Anderson
Guest
Posts: n/a
 
      05-09-2008
On Fri, 9 May 2008, Srini wrote:

> Which is more efficient? Why?? Storing or Reading a file in database or
> on to disk in a web server environment? considering all things like
> network, connections, memory are ideal.
>
> My knowledge says that using database is more efficient but few my
> pierce argue that storing in disk is more efficient when concurrent
> people accessing it...


Databases typically store their data in a file [1]. That means that a
file-based solution can always be at least as fast as using a database,
because it can just do what the database does.

The problem is that to make a file-based solution that's as fast as a good
database and also provides things like transactionality, you may have to
write code that's as complex as a database. Which is not good.

If you're mostly reading your data, so you don't have to worry about
concurrency and transactionality, and you have a straightforward
organisation (like having fixed-size records which you can refer to by
index in a sequence), then you can write a simple file-based
implementation that should be faster than a database, because it avoids
the overhead and complexity.

There's nothing in the java libraries, that i'm aware of, for doing this
kind of non-database structured file access. There are things for some
texual formats, like XML and properties files (remember those?!), but
nothing like DBM or COBOL's record-oriented files. There are third-party
libraries, though - see Berkeley DB, Java Edition:

http://www.oracle.com/technology/pro.../je/index.html

and JDBM:

http://jdbm.sourceforge.net/

It's also not that hard to write your own fixed-size record manager, and
not that hard to layer variable-sized records on top of such a thing.

There's also an excellent trick for using the unix filesystem as a
database by storing data in symbolic links: the path of a symlink is
actually an arbitrary text string, so you can store information, rather
than an actual path, in it. Gives you hierarchically organised,
string-keyed records of up to a kilobyte (YMMV) without any actual file
IO!

A performance question for the wise: i hacked up a little fixed-size
record manager, and wrote two backends, one using RandomAccessFile, and
one using a NIO MappedByteBuffer. For both, i provided a way to flush to
disk after each write - with RandomAccessFile, via getFD().sync(), and
with MappedByteBuffer with force(). Timings to do a batch of reads and
writes (100 000 operations, 75% reads, on 10 000 records of 256 bytes
each; a different random pattern each time, on a machine doing nothing but
this and playing MP3s):

Implementation Flush? Time (ms)

RandomAccessFile no 733
RandomAccessFile yes 20659
MappedByteBuffer no 63
MappedByteBuffer yes 33087

The mapped file is an order of magnitude faster without flushing, but 50%
slower with. Any idea why?

tom

[1] Okay, so seriously heavyweight ones use disk extents/partitions and
bypass the filesystem; how much of a difference does that make?

--
.... but when you spin it it looks like a dancing foetus!
 
Reply With Quote
 
 
 
 
Tom Anderson
Guest
Posts: n/a
 
      05-10-2008
On Fri, 9 May 2008, Lew wrote:

> Tom Anderson wrote:
>
>> Databases typically store their data in a file [1]. ... [1] Okay, so
>> seriously heavyweight ones use disk extents/partitions and bypass the
>> filesystem; how much of a difference does that make?

>
> Anecdotally, on a large-scale test, above five-to-one.


Yikes! That's kind of damning for filesystems. I know they do a lot of
stuff that a raw disk doesn't but still. Wow.

tom

--
Finals make a man mean; let's fusc up and write!
 
Reply With Quote
 
Patricia Shanahan
Guest
Posts: n/a
 
      05-10-2008
Lew wrote:
> Eric Sosman wrote:
>> No. "Suggestive," perhaps, or even "cautionary." But for
>> my own part I wouldn't confer "very indicative" on a lab result
>> that cannot be reproduced by an independent experimenter. A
>> citation of a full report or published paper or something of that
>> nature might change my mind, but at the moment ... no.

>
> OK, that's fine. It was something that convinced me that it worked in
> that case because I could interact with people who conducted the test.
> You choose to reject the anecdote because I have no credibility with you.
>
> I'm sure the result can be reproduced by an independent experimenter, if
> they were authorized to use the various proprietary parts.
>
> I would not be surprised if more objective results are available to you
> with a quick Google. Given the amount of time Oracle and others have
> offered direct disk-management modules, I would be surprised if such
> results were not available to you. Do let us know what you find, won't
> you?
>


The Transaction Processing Performance Council, http://www.tpc.org,
creates a series of benchmarks that are strongly dependent on database
performance, and archives results. It should be possible to work out how
the disks are managed, at least from the full disclosure reports.

If anyone is interested in researching this, TPC may be a starting point.

Patricia
 
Reply With Quote
 
Roedy Green
Guest
Posts: n/a
 
      05-11-2008
On Fri, 9 May 2008 07:37:29 -0700 (PDT), Srini <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :

>Storing or Reading a file in database or on to disk in a web server
>environment? considering all things like network, connections, memory
>are ideal.


It doesn't matter since the reading and writing are not
interchangeable.

Reading is faster since it can be cached with no physical disk i/o.

In an SQL database with writing you have to do actual physical i/o
PLUS disk i/o to handle transaction logging and rollback
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
Reply With Quote
 
Tom Anderson
Guest
Posts: n/a
 
      05-11-2008
On Sun, 11 May 2008, Eric Sosman wrote:

> Lew wrote:
>> Eric Sosman wrote:
>>> No. "Suggestive," perhaps, or even "cautionary." But for
>>> my own part I wouldn't confer "very indicative" on a lab result
>>> that cannot be reproduced by an independent experimenter. A
>>> citation of a full report or published paper or something of that
>>> nature might change my mind, but at the moment ... no.

>>
>> OK, that's fine. It was something that convinced me that it worked in that
>> case because I could interact with people who conducted the test. You
>> choose to reject the anecdote because I have no credibility with you.

>
> No, no, no, Lew: Your credibility is just fine (or my credulity; take
> it either way). All I'm saying is that the information you've presented
> is insufficient to support the inference Tom Anderson drew from it
> ("kind of damning for filesystems"), and that he should consider being a
> little less hasty in drawing sweeping conclusions from sketchy and
> incomplete reports.


I'm not saying filesystems are worthless. Just that when it comes to
shunting data on and off a disk, the fact that a different method can work
five times faster than a filesystem means that if you're in the business
of shunting data on and off a disk as fast as possible, a filesystem is
probably not a great idea.

Of course, if there was some particular reason why there was such a big
difference in this case that doesn't apply in all or most cases, then fair
enough, i'm over-generalising. I can't think of what that might be,
though.

> If someone told you there was a "five-to-one difference" between
> ArrayList and plain array, would you conclude that this was "kind of
> damning" for Collections? Of course not. I'm urging Tom not to make
> the analogous error.


If someone (credible) told me that there was a five-to-one difference in
times for an ArrayList.get and an array[], then, if we were in a context
where performance was important, i certainly would take that as kind of
damning for collections. I think that's perfectly reasonable.

tom

--
It's almost over now.
 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      05-12-2008
Lew wrote:
> From what I've been reading and hearing from DBAs (disclaimer: I am but
> a lowly programmer myself, not a DBA), journaling file syatems are not
> the key to safety in big-iron DMBSes like PostgreSQL and Oracle.


I don't think PostgreSQL is big-iron.

Oracle, DB2 and to some extent Sybase and SQLServer are
rather dominant in that area.

Arne
 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      05-12-2008
Lew wrote:
> Tom Anderson wrote:
>> Databases typically store their data in a file [1]. ...
>> [1] Okay, so seriously heavyweight ones use disk extents/partitions
>> and bypass the filesystem; how much of a difference does that make?

>
> Anecdotally, on a large-scale test, above five-to-one.


I doubt that.

There are obviously overhead in the file system, but 400%
overhead in writing to a pre allocated file isntead of a raw
partition does not make any sense.

(if the file needs to be extended during write, then it sounds
very plausible, but that is not a fair comparison)

My impression is that raw partitions was not preferred for
performance but to ensure that there were no file systems
cached not being flushed to the plates.

Arne
 
Reply With Quote
 
Arne Vajhj
Guest
Posts: n/a
 
      05-12-2008
Srini wrote:
> Which is more efficient? Why??
> Storing or Reading a file in database or on to disk in a web server
> environment? considering all things like network, connections, memory
> are ideal.
>
> My knowledge says that using database is more efficient but few my
> pierce argue that storing in disk is more efficient when concurrent
> people accessing it...


There is a myth among many web developers that storing files
in database kills performance.

And my guess is that ASP accessing an Access 95 database on
a Pentium running NT 4 possible did.

But things are different today. With some good hardware and
a good database, then you should be able to get good
performance out if storing files in the database.

And it is so much easier to ensure data integrity and
administer the database solution.

I once did some experiments on the topic. The results are
attached below.

They are ASP.NET against MySQL, but technologies should
not matter much.

Arne

==================================================

File (1 threads): 1,9 get per second
File (10 threads): 2,2 get per second
File with web app cache (1 threads): 6,6 get per second
File with web app cache (10 threads): 16,1 get per second
File directly by web server (1 threads): 17,6 get per second
File directly by web server (10 threads): 20,1 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,8 get per second
Database with web app cache (1 threads): 15,4 get per second
Database with web app cache (10 threads): 19,6 get per second
File (1 threads): 1,9 get per second
File (10 threads): 2,2 get per second
File with web app cache (1 threads): 16,7 get per second
File with web app cache (10 threads): 19,4 get per second
File directly by web server (1 threads): 17,1 get per second
File directly by web server (10 threads): 20 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,9 get per second
Database with web app cache (1 threads): 17,3 get per second
Database with web app cache (10 threads): 19,6 get per second
File (1 threads): 1,9 get per second
File (10 threads): 2,1 get per second
File with web app cache (1 threads): 16,8 get per second
File with web app cache (10 threads): 17,8 get per second
File directly by web server (1 threads): 17,6 get per second
File directly by web server (10 threads): 20,1 get per second
Database (1 threads): 10,7 get per second
Database (10 threads): 11,6 get per second
Database with web app cache (1 threads): 17,1 get per second
Database with web app cache (10 threads): 19,5 get per second
 
Reply With Quote
 
John W Kennedy
Guest
Posts: n/a
 
      05-14-2008
Lew wrote:
> Tom Anderson wrote:
>> If someone (credible) told me that there was a five-to-one difference
>> in times for an ArrayList.get and an array[], then, if we were in a
>> context where performance was important, i certainly would take that
>> as kind of damning for collections. I think that's perfectly reasonable.

>
> Eric is correct that you should not generalize from a single data point,
> especially when full details are not available for that one case. I
> presented the anecdote not to prove anything but to provide evidence, at
> least that the matter is worth pursuing more objectively.
>
> My comment that I lack credibility was not about personality but about
> evidence and what makes it valid. I provided no hard data, nor enough
> information about the anecdote to make it something on which you can
> base conclusions. By hard standards, I indeed do lack credibility to
> assert that direct-to-disk is five times faster. OTOH, it is enough to
> share one experience that indicates that the claim is potentially
> supportable, just maybe.


It depends on how much effort you're willing to put in. A major reason
for databases running faster when off the file system, for example, is
doing things like putting SQL tables into a continuous range of
cylinders, and putting the lowest level of the primary index on track
zero of those same cylinders. Then, to find a given record, you go
through the high-level parts of the index tree, which are probably
cached, anyway; move the access arm to the correct cylinder; read the
low-level index record from track zero (perhaps it is replicated as many
times as it will fit on the track, so that it comes up as fast as
possible); and then read the data without moving the access arm again.
That gives you indexed access only imperceptibly slower than hashed.

Much-larger-than-RAM sorts can use similar tricks, such as merging
RAM-sorted strings from the top half of the disk to the bottom half and
then back again, until the final merge to the target.

IBM mainframes have used methods like this since the mid 1960s.

If, on the other hand, you're just doing the equivalent of reading a
FileInputStream or writing a FileOutputStream, stick with the file
system. Moving it off will accomplish almost nothing.

--
John W. Kennedy
"Give up vows and dogmas, and fixed things, and you may grow like
That. ...you may come to think a blow bad, because it hurts, and not
because it humiliates. You may come to think murder wrong, because it
is violent, and not because it is unjust."
-- G. K. Chesterton. "The Ball and the Cross"
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Why is enterprise storage so much more expensive than personal storage? John Computer Support 4 03-17-2008 09:50 PM
How to access the external storage unit of storage router =?Utf-8?B?SWduYXRpdXM=?= Wireless Networking 4 11-06-2006 06:40 AM
Difference b/w storage class and storage class specifier sarathy C Programming 2 07-17-2006 05:06 PM
Application[] storage issue Curt tabor ASP .Net 2 11-04-2005 08:01 PM
Tutorial: ASP.Net's Data Storage Objects Karl ASP .Net 1 09-06-2004 04:59 PM



Advertisments