Velocity Reviews - Computer Hardware Reviews

File-Reading Best Practices?

 
 
Andreas Wenzke
 
      04-03-2010
I want to parse an XML file manually (but my question would be the same
for any other file format):
What are best-practice guidelines for doing that?

I currently use a char buffer in conjunction with istream::read and then
walk through the buffer step by step.
However, problems will arise when tags span across the buffer, i.e. when
the buffer contains "<h" at the end and the next characters to be read
from the stream are "tml>".
I'm considering using memmove, but I just think there has to be a better
option.

As this is for a university project, I'm not allowed to use the STL
(std::string and so on).
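
The memmove idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ChunkBuffer`, `refill`, `countTags` and the deliberately tiny 8-byte capacity are hypothetical names and values, and no tag may be longer than the buffer. Apart from `<sstream>` (included only so the sketch can be tried on an in-memory string), it uses just `<istream>` and `<cstring>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // only needed to try the sketch on an in-memory string

const std::size_t kCapacity = 8;  // tiny on purpose, so that tags get split

// Hypothetical helper: a fixed buffer that is topped up from a stream.
struct ChunkBuffer {
    char data[kCapacity];
    std::size_t size;  // number of valid bytes in data

    ChunkBuffer() : size(0) {}

    // Drops the first `consumed` bytes, slides the unread tail to the front
    // with memmove (the two ranges may overlap), then reads more bytes.
    // Returns the number of bytes newly read from `in`.
    std::size_t refill(std::istream& in, std::size_t consumed) {
        assert(consumed <= size);
        size -= consumed;
        std::memmove(data, data + consumed, size);
        in.read(data + size, static_cast<std::streamsize>(kCapacity - size));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        size += got;
        return got;
    }
};

// Counts complete "<...>" tags. A tag that is cut off at the end of the
// buffer is kept and completed after the next refill.
// Assumption: no tag is longer than kCapacity bytes.
int countTags(std::istream& in) {
    ChunkBuffer buf;
    int tags = 0;
    std::size_t consumed = 0;
    while (buf.refill(in, consumed) > 0) {
        consumed = 0;
        std::size_t pos = 0;
        while (pos < buf.size) {
            if (buf.data[pos] != '<') {  // plain text: skip it
                ++pos;
                consumed = pos;
                continue;
            }
            const void* end = std::memchr(buf.data + pos, '>', buf.size - pos);
            if (end == 0) break;  // incomplete tag: leave it unconsumed
            pos = static_cast<std::size_t>(
                      static_cast<const char*>(end) - buf.data) + 1;
            consumed = pos;
            ++tags;
        }
    }
    return tags;
}
```

With the input `<html><b>x</b></html>`, the first read yields `<html><b` and the second `>x</b>`, so `<b>` straddles the two reads; because the unread `<b` is moved to the front before the second read, `countTags` still reports 4 tags.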
 
 
Stefan Ram
 
      04-03-2010
Andreas Wenzke writes:
>I want to parse an XML file manually (but my question would be the same
>for any other file format):
>What are best-practice guidelines for doing that?
>I currently use a char buffer in conjunction with istream::read and then
>walk through the buffer step by step.


You seem to think about implementations ("char buffer") early.
I prefer to think about interfaces (.getNextSymbol()) early.

A char is a byte, while XML files are composed of Unicode
characters (code points). If you read them as chars, you
will first have to decode them, so you should at least
implement a UTF-8 reader.
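
A minimal sketch of what such a reader might look like is given here. The function name `readCodePoint` and the two sentinel constants are hypothetical, and the sketch is deliberately incomplete: it checks the byte structure only and does not reject overlong encodings or surrogate code points, which a conforming decoder would have to do.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // only needed to try the sketch on an in-memory string

const long kEndOfInput = -1;  // hypothetical sentinel: no more bytes
const long kMalformed  = -2;  // hypothetical sentinel: invalid byte sequence

// Reads one UTF-8 encoded code point from `in` and returns its value,
// or one of the two sentinels above.
long readCodePoint(std::istream& in) {
    const int eof = std::istream::traits_type::eof();
    int first = in.get();
    if (first == eof) return kEndOfInput;

    long cp;
    int extra;  // number of continuation bytes that must follow
    if (first < 0x80)                { cp = first;        extra = 0; }
    else if ((first & 0xE0) == 0xC0) { cp = first & 0x1F; extra = 1; }
    else if ((first & 0xF0) == 0xE0) { cp = first & 0x0F; extra = 2; }
    else if ((first & 0xF8) == 0xF0) { cp = first & 0x07; extra = 3; }
    else return kMalformed;  // stray continuation byte or invalid lead byte

    for (int i = 0; i < extra; ++i) {
        int next = in.get();
        // Every continuation byte has the bit pattern 10xxxxxx.
        if (next == eof || (next & 0xC0) != 0x80) return kMalformed;
        cp = (cp << 6) | (next & 0x3F);
    }
    return cp;
}
```

For the bytes `61 C3 A4 E2 82 AC`, successive calls return 0x61, 0xE4 and 0x20AC, followed by `kEndOfInput`.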

>However, problems will arise when tags span across the buffer, i.e. when
>the buffer contains "<h" at the end and the next characters to be read
>from the stream are "tml>".
>I'm considering using memmove, but I just think there has to be a better
>option.


Again, it seems strange to me to mention parsing and then,
in the same breath, memmove. You are thinking about
low-level implementation details too early. They should
be hidden behind interfaces, so that they can be changed
later.

>As this is for a university project, I'm not allowed to use the STL
>(std::string and so on).


This newsgroup is about using C++, and when you are not
allowed to use ::std::string and so on, you are not allowed
to use C++, so you are in the wrong newsgroup. Also, nothing
in ISO/IEC 14882:2003(E) is called »STL«, so you are possibly
being taught outdated terms. Maybe that university also is
too low-level.

 
 
Carlo Milanesi
 
      04-03-2010
Andreas Wenzke wrote:
> I want to parse an XML file manually (but my question would be the same
> for any other file format):
> What are best-practice guidelines for doing that?
>
> I currently use a char buffer in conjunction with istream::read and then
> walk through the buffer step by step.
> However, problems will arise when tags span across the buffer, i.e. when
> the buffer contains "<h" at the end and the next characters to be read
> from the stream are "tml>".
> I'm considering using memmove, but I just think there has to be a better
> option.
>
> As this is for a university project, I'm not allowed to use the STL
> (std::string and so on).


Why do universities prohibit the STL?

I think the simplest way to read a file is to use a memory-mapped
file. Memory-mapped files are not standard, though. Does your university allow them?
Here you can find a useful library:
http://en.wikibooks.org/wiki/Optimiz...ry-mapped_file
You may use its class InputMemoryFile to read a file that can fit into
your address space.
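
To give an idea of what memory mapping involves, here is a minimal sketch using the POSIX calls `open`, `fstat`, `mmap` and `munmap`. It is not the `InputMemoryFile` class from the linked library: the wrapper `MappedFile` is a hypothetical name, the code assumes a POSIX system (it will not compile on plain Windows), and an empty file is simply reported as "nothing mapped".

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical read-only wrapper around a memory-mapped file.
class MappedFile {
public:
    explicit MappedFile(const char* path) : data_(0), size_(0) {
        assert(path != 0);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;  // file could not be opened
        struct stat info;
        if (fstat(fd, &info) == 0 && info.st_size > 0) {
            std::size_t length = static_cast<std::size_t>(info.st_size);
            void* p = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const char*>(p);
                size_ = length;
            }
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != 0) munmap(const_cast<char*>(data_), size_);
    }

    // Null if the file could not be mapped or is empty.
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    MappedFile(const MappedFile&);             // not copyable
    MappedFile& operator=(const MappedFile&);  // not assignable
    const char* data_;
    std::size_t size_;
};
```

Once mapped, the whole file is visible as one contiguous `const char` range from `data()` to `data() + size()`, so no token can straddle a buffer boundary.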

--

Carlo Milanesi
http://digilander.libero.it/carlmila
 
 
Andreas Wenzke
 
      04-03-2010
Christian Hackl wrote:
> What are you allowed to use at all, then?


<iostream> and C libraries like <string.h>.

> "STL" is not a synonym for "standard library". In particular,
> std::string is considered a different part of the library than the
> container/algorithm part. If your lecturer does not allow you to use the
> entire standard library except for the C part, then of course streams
> cannot be used, either.


Sorry, <iostream> can be used, of course.

> Anyway, I think that with such course requirements best-practice
> guidelines for file reading in C++ simply cannot be met. (I originally
> learned C++ that way, too, and later had to unlearn much of what had
> been taught to us. It bothers me that C++ is still treated this way at
> universities.)


STL will be taught in detail, though not in this class where the
lecturer wants us to understand the implementation first.
 
 
Andreas Wenzke
 
      04-03-2010
Stefan Ram wrote:
> You seem to think about implementations ("char buffer") early.
> I prefer to think about interfaces (.getNextSymbol()) early.


Care to elaborate a little on this?

> A char is a byte, while XML files are composed of Unicode
> characters (code points). If you read them as chars, you
> will first have to decode them, so you should at least
> implement a UTF-8 reader.


The file-reading part is only a very small part of the whole project.
Implementing UTF-8 parsing isn't likely to have any benefits for my
program (strings will be stored "as is" anyway) and probably isn't going
to earn me many bonus points. However, it would probably make things
more complicated as I'd have to distinguish between ANSI and Unicode chars.

> Again, it seems strange to me to mention parsing and then,
> in the same breath, memmove. You are thinking about
> low-level implementation details too early. They should
> be hidden behind interfaces, so that they can be changed
> later.


I understand your objection, but I don't really know how to implement
that for my current task.

>> As this is for a university project, I'm not allowed to use the STL
>> (std::string and so on).

>
> This newsgroup is about using C++, and when you are not
> allowed to use ::std::string and so on, you are not allowed
> to use C++, so you are in the wrong newsgroup. Also, nothing
> in ISO/IEC 14882:2003(E) is called »STL«, so you are possibly
> being taught outdated terms. Maybe that university also is
> too low-level.


<iostream> and C libraries like <string.h> are allowed.
Other "STL" classes like std::string, std::vector will be allowed in
follow-up classes.

Also, I am of course allowed to implement my own string class etc.
 
 
Andreas Wenzke
 
      04-03-2010
Carlo Milanesi wrote:
> Why do universities prohibit the STL?


Because they want the students to understand the implementation details
first.
The STL will be allowed in follow-up classes.

> I think the simplest way to read a file is to use a memory-mapped
> file. Memory-mapped files are not standard, though. Does your university allow them?


If they're not standard, probably not.

> Here you can find a useful library:


Third-party libraries aren't allowed...
 
 
1jam
 
      04-03-2010
Stefan Ram wrote:

>
>>As this is for a university project, I'm not allowed to use the STL
>>(std::string and so on).

>
> This newsgroup is about using C++, and when you are not
> allowed to use ::std::string and so on, you are not allowed
> to use C++, so you are in the wrong newsgroup.


Not true: in embedded C++ development the STL is still usually shunned. Besides,
C++ was in use for more than a decade before STL implementations matured and
became widely used.
 
 
Stefan Ram
 
      04-03-2010
Andreas Wenzke writes:
>>You seem to think about implementations ("char buffer") early.
>>I prefer to think about interfaces (.getNextSymbol()) early.

>Care to elaborate a little on this?


I separate the code into sub-units.

To parse an XML file, the obvious sub-units would be a
character source (a source of Unicode code points),
then a scanner (lexical analyzer), then a parser (syntactical
analyzer). But you also need to know whether you want to
build a DOM (document object model), make calls to
client functions (like a SAX parser), or do something else.

Anyway, between those units, there are interfaces.
Interfaces, also known as APIs, are similar to abstract
data types: they are sets of documented calls. So I start by
writing them.

Only then do I start to write implementations of these
calls.
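
As a rough illustration of "interfaces first", the first two sub-units could be declared as abstract classes before any buffering code is written. Everything in this sketch is an assumption made for illustration: the names `CharSource`, `Scanner`, `StreamCharSource`, `CrudeScanner` and the `SymbolKind` values are hypothetical, the character source hands out raw bytes rather than decoded code points, and the scanner only classifies single characters.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // only needed to try the sketch on an in-memory string

// Interface 1: a source of characters. How the characters are buffered
// (istream::read, memmove, a memory-mapped file, ...) is hidden behind it.
class CharSource {
public:
    virtual ~CharSource() {}
    // Returns the next character, or -1 when the input is exhausted.
    virtual int getNextChar() = 0;
};

// Interface 2: a scanner that turns characters into symbols.
enum SymbolKind { SYM_TAG_OPEN, SYM_TAG_CLOSE, SYM_TEXT, SYM_END };

class Scanner {
public:
    virtual ~Scanner() {}
    virtual SymbolKind getNextSymbol() = 0;
};

// One possible CharSource: it leaves all buffering to the stream.
class StreamCharSource : public CharSource {
public:
    explicit StreamCharSource(std::istream& in) : in_(in) {}
    virtual int getNextChar() {
        int c = in_.get();
        return c == std::istream::traits_type::eof() ? -1 : c;
    }
private:
    std::istream& in_;
};

// A deliberately crude Scanner: one symbol per character.
class CrudeScanner : public Scanner {
public:
    explicit CrudeScanner(CharSource& source) : source_(source) {}
    virtual SymbolKind getNextSymbol() {
        int c = source_.getNextChar();
        if (c < 0) return SYM_END;
        if (c == '<') return SYM_TAG_OPEN;
        if (c == '>') return SYM_TAG_CLOSE;
        return SYM_TEXT;
    }
private:
    CharSource& source_;
};
```

Because the scanner talks only to `CharSource`, the character source can later be swapped for a different buffering strategy without touching the scanner or the parser built on top of it.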

Some German language notes about software design by me:

http://www.purl.org/stefan_ram/pub/a...sser_programme

>The file-reading part is only a very small part of the whole project.
>Implementing UTF-8 parsing isn't likely to have any benefits for my
>program (strings will be stored "as is" anyway) and probably isn't going
>to earn me many bonus points. However, it would probably make things
>more complicated as I'd have to distinguish between ANSI and Unicode chars.


The XML specification says:

»All XML processors MUST accept the UTF-8 and UTF-16
encodings of Unicode [Unicode]« (uppercase emphasis
was done by the W3C, not by me [Stefan Ram])

http://www.w3.org/TR/REC-xml/

(ISO-8859-1 processing, on the other hand, is not required.)

Reading the XML specification and then writing a correct
implementation is a huge project. Now, you tell me this is
only a very small part of the whole project. You are to use C++,
but then are not allowed to use C++, you are to read XML,
but then are not required to read XML as it's specified.

Such an attitude of doing a huge project in such a messy way
(calling »C++« what is not C++, calling »XML« what is not XML)
seems to be highly inappropriate for a scientific university.
It even would be inappropriate for any other teaching situation,
like, say, a »university of applied science« (»Fachhochschule«).

Let me end this post by a quote from Rob Walling:

»I've known smart developers who don't pay attention to detail.
The result is misspelled database columns, uncommented code,
projects that aren't checked into source control,
software that's not unit tested, unimplemented features,
and so on. All of these can be easily dealt with if
you're building a Google mash-up or a five page website.
But in corporate development each of these screw-ups is
a death knell.

So I'll say it very loud, but I promise I'll only say it once:

I have /never, ever, ever/ seen a great software
developer who does not have amazing attention to detail.«

 
 
James Kanze
 
      04-03-2010
On Apr 3, 10:32 am, Andreas Wenzke wrote:
> I want to parse an XML file manually (but my question would be
> the same for any other file format):
> What are best-practice guidelines for doing that?


> I currently use a char buffer in conjunction with
> istream::read and then walk through the buffer step by step.
> However, problems will arise when tags span across the buffer,
> i.e. when the buffer contains "<h" at the end and the next
> characters to be read from the stream are "tml>". I'm
> considering using memmove, but I just think there has to be a
> better option.


> As this is for a university project, I'm not allowed to use
> the STL (std::string and so on).


The most obvious solution is to ensure that the buffer never
ends in the middle of a token, say by using getline to read
it. This has the additional advantage of making it trivial to
output the line number in error messages. For real XML, it's
probably not a good idea, since the W3C specification requires
recognizing several different line-ending conventions (although
it wouldn't be that difficult to write a custom getline that
recognized them all), but I doubt that that's relevant for a
school project (at least at a level where you aren't allowed to
use the STL).
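
A minimal sketch of the getline approach, using the member function `istream::getline` with a plain char array so that no std::string is involved. The function name `findLine` and the 256-byte line limit are assumptions made for illustration, and the approach as a whole assumes that no token spans a line break.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // only needed to try the sketch on an in-memory string

const std::size_t kMaxLine = 256;  // assumption: no line is longer than this

// Returns the 1-based number of the first line that contains `needle`,
// 0 if no line contains it, or -1 if a line did not fit into the buffer
// (or the stream reported a read error).
int findLine(std::istream& in, const char* needle) {
    assert(needle != 0);
    char line[kMaxLine];
    int lineNumber = 0;
    while (in.getline(line, sizeof line)) {
        ++lineNumber;  // keeping track of the line number costs one increment
        if (std::strstr(line, needle) != 0) return lineNumber;
    }
    // getline sets failbit without eofbit when the buffer fills up
    // before a newline is found.
    if (!in.eof()) return -1;
    return 0;
}
```

In a real parser the loop body would hand each line to the scanner, and `lineNumber` would go straight into the error message when an unexpected token turns up.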

Another solution is to read character by character, using a
state machine to determine where the token ends, and put each
character into your final buffer. In this way, you never have
more than one token in the buffer, and filebuf takes care of the
actual IO buffering.
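
The character-by-character variant might look roughly like this. It is a sketch under stated assumptions: `readToken`, `TokenKind` and the two-state machine are hypothetical, a "token" is either one complete `<...>` tag or a run of text up to the next `<`, and comments, CDATA sections, entities and a `>` inside an attribute value are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>  // strcmp, handy for comparing the tokens afterwards
#include <istream>
#include <sstream>  // only needed to try the sketch on an in-memory string

enum TokenKind { TOKEN_END, TOKEN_TAG, TOKEN_TEXT, TOKEN_ERROR };

// Reads exactly one token into `out`, which has room for `cap` chars
// including the terminating '\0'. The stream does the actual IO buffering.
TokenKind readToken(std::istream& in, char* out, std::size_t cap) {
    assert(cap >= 2);  // at least one character plus '\0'
    const int eof = std::istream::traits_type::eof();
    enum State { IN_TAG, IN_TEXT };
    std::size_t len = 0;
    out[0] = '\0';

    int c = in.get();
    if (c == eof) return TOKEN_END;
    State state = (c == '<') ? IN_TAG : IN_TEXT;
    out[len++] = static_cast<char>(c);

    for (;;) {
        if (state == IN_TAG) {
            c = in.get();
            if (c == eof) return TOKEN_ERROR;  // input ended inside a tag
        } else {
            c = in.peek();                     // look ahead without consuming
            if (c == eof || c == '<') break;   // text ends before the next tag
            in.get();
        }
        if (len + 1 >= cap) return TOKEN_ERROR;  // token too long for `out`
        out[len++] = static_cast<char>(c);
        if (state == IN_TAG && c == '>') break;  // tag is complete
    }
    out[len] = '\0';
    return state == IN_TAG ? TOKEN_TAG : TOKEN_TEXT;
}
```

Called repeatedly on `<p>hi</p>`, it yields the tag `<p>`, the text `hi`, the tag `</p>` and then `TOKEN_END`; the caller's buffer never holds more than one token.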

--
James Kanze
 
 
Andreas Wenzke
 
      04-04-2010
Stefan Ram wrote:
> To parse an XML file, the obvious sub-units would be a
> character source (a source of Unicode code points),
> then a scanner (lexical analyzer), then a parser (syntactical
> analyzer). But you also need to know whether you want to
> build a DOM (document object model), make calls to
> client functions (like a SAX parser), or do something else.


As I only want to parse one certain format, I think this isn't necessary.
Usually, a specific expected token has to be read, otherwise a parsing
error would occur.

> Anyway, between those units, there are interfaces.
> Interfaces, also known as APIs, are similar to abstract
> data types: they are sets of documented calls. So I start by
> writing them.


I have several years of programming experience in C#, so I'm generally
used to developing against interfaces.

But one thing is that I lack experience in C++ and the other is that I
want to get this XML parser done as quickly as possible, so I can
concentrate on the actual project task.

> Some German language notes about software design by me:
>
> http://www.purl.org/stefan_ram/pub/a...sser_programme


"You ain't gonna need it"

I generally understand your objection, but in this case I just want to
get this (pseudo) parser done.

> The XML specification says:
>
> »All XML processors MUST accept the UTF-8 and UTF-16
> encodings of Unicode [Unicode]« (uppercase emphasis
> was done by the W3C, not by me [Stefan Ram])


Actually, I don't think this is emphasis, but rather the normal RFC
way of pointing out that "MUST", "MAY" etc. are to be interpreted as
keywords (see also RFC 2119).

But that aside, I do accept those encodings, I just don't decode them.

> Such an attitude of doing a huge project in such a messy way
> (calling »C++« what is not C++, calling »XML« what is not XML)
> seems to be highly inappropriate for a scientific university.
> It even would be inappropriate for any other teaching situation,
> like, say, a »university of applied science« (»Fachhochschule«).


You have to start /somewhere/. You can't just put everything into a
three-hours-per-week class.

The lecturer is very good (and believe me, I have seen bad classes like
someone teaching a C# "beginner's class" where she would teach "design
patterns" without even explaining what polymorphism or interfaces are),
and whilst I don't think using XML as the input format was quite
necessary, he does a good job.
 