Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Most Interesting Bug Track Down

Reply
Thread Tools

Most Interesting Bug Track Down

 
 
Frederick Gotham
Guest
Posts: n/a
 
      11-24-2006

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing usable which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

I'm sure several regulars here have more interesting stories...

--

Frederick Gotham
 
Reply With Quote
 
 
 
 
Eric Sosman
Guest
Posts: n/a
 
      11-24-2006
Frederick Gotham wrote:

> I thought it might be interesting to share experiences of tracking down a
> subtle or mysterious bug. I myself haven't much experience with tracking
> down bugs, but there's one in particular which comes to mind.
>
> I was writing usable which dealt with strings. As per usual with my code, I
> made it efficient to the extreme. One thing I did was replace, where
> possible, any usages of "strlen" with something like:
>
> struct PtrAndLen {
> char *p;
> size_t len;
> };
>
> This could be initialised with a string as follows:
>
> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
>
> From that point forward in the code, "pal.len" would be used in place of
> strlen.
>
> The code grew though, and at one stage I needed to store info about two
> strings in one of these structures. To do this, I used a null separator,
> e.g.:
>
> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};
>
> All of this, however, was expanded by macros, so I actually had something
> like:
>
> MAKE_STR_INFO("Hello\0Bonjour")
>
> The problem with this, however, was that "strlen" and "pal.len" had
> different values, because strlen only read as far as the first null
> terminator. Anyway, I had to read through the code in detail before the bug
> jumped out at me.


Looks like it may not have jumped high enough: The bug is there
even without embedded '\0' characters.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid
 
Reply With Quote
 
 
 
 
Jack Klein
Guest
Posts: n/a
 
      11-24-2006
On Fri, 24 Nov 2006 22:12:15 GMT, Frederick Gotham
<(E-Mail Removed)> wrote in comp.lang.c:

>
> I thought it might be interesting to share experiences of tracking down a
> subtle or mysterious bug. I myself haven't much experience with tracking
> down bugs, but there's one in particular which comes to mind.
>
> I was writing usable which dealt with strings. As per usual with my code, I
> made it efficient to the extreme. One thing I did was replace, where
> possible, any usages of "strlen" with something like:


Why "as usual"? Once you had the application working correctly, was
it too slow? Did you profile or otherwise test and to prove that this
was a bottleneck?

> struct PtrAndLen {
> char *p;
> size_t len;
> };
>
> This could be initialised with a string as follows:
>
> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
>
> From that point forward in the code, "pal.len" would be used in place of
> strlen.
>
> The code grew though, and at one stage I needed to store info about two
> strings in one of these structures. To do this, I used a null separator,
> e.g.:
>
> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};


So now, thanks to premature optimization, you have violated your
design. You are initializing with something that is not a string, or
more precisely a string plus something else.

> All of this, however, was expanded by macros, so I actually had something
> like:
>
> MAKE_STR_INFO("Hello\0Bonjour")
>
> The problem with this, however, was that "strlen" and "pal.len" had
> different values, because strlen only read as far as the first null
> terminator. Anyway, I had to read through the code in detail before the bug
> jumped out at me.


How could that be a problem? You just said you eliminated all use of
strlen().

> I'm sure several regulars here have more interesting stories...


Sounds like a poor design, aggravated by violating its constraints in
use.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
 
Reply With Quote
 
Frederick Gotham
Guest
Posts: n/a
 
      11-24-2006
Eric Sosman:

> Looks like it may not have jumped high enough: The bug is there
> even without embedded '\0' characters.


Ah yes, I should have mentioned that I took the "sizeof "Hello" - 1" into
account.

--

Frederick Gotham
 
Reply With Quote
 
Frederick Gotham
Guest
Posts: n/a
 
      11-24-2006
Jack Klein:

>> As per usual with my
>> code, I made it efficient to the extreme. One thing I did was replace,
>> where possible, any usages of "strlen" with something like:

>
> Why "as usual"? Once you had the application working correctly, was
> it too slow? Did you profile or otherwise test and to prove that this
> was a bottleneck?



I have never written a program for monetary gain. I program purely for the
enjoyment of programming. If I achieve a certain object, I am not satisfied
-- I want to achieve the objective as efficiently as is possible.


>> struct PtrAndLen {
>> char *p;
>> size_t len;
>> };
>>
>> This could be initialised with a string as follows:
>>
>> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
>>
>> From that point forward in the code, "pal.len" would be used in place
>> of strlen.
>>
>> The code grew though, and at one stage I needed to store info about two
>> strings in one of these structures. To do this, I used a null
>> separator, e.g.:
>>
>> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

>
> So now, thanks to premature optimization, you have violated your
> design. You are initializing with something that is not a string, or
> more precisely a string plus something else.



I can live with the minor complication though, given that my code runs
several orders of magnitude faster than the "play it safe" equivalent.

--

Frederick Gotham
 
Reply With Quote
 
CBFalconer
Guest
Posts: n/a
 
      11-25-2006
Frederick Gotham wrote:
>

.... snip ...
>
> I was writing usable which dealt with strings. As per usual with my
> code, I made it efficient to the extreme. One thing I did was
> replace, where possible, any usages of "strlen" with something like:
>
> struct PtrAndLen {
> char *p;
> size_t len;
> };
>
> This could be initialised with a string as follows:
>
> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
>
> From that point forward in the code, "pal.len" would be used in
> place of strlen.
>
> The code grew though, and at one stage I needed to store info about
> two strings in one of these structures. To do this, I used a null
> separator, e.g.:
>
> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};


Ugh. You deserved anything that happened to you.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>


 
Reply With Quote
 
Gordon Burditt
Guest
Posts: n/a
 
      11-25-2006
>I have never written a program for monetary gain. I program purely for the
>enjoyment of programming. If I achieve a certain object, I am not satisfied
>-- I want to achieve the objective as efficiently as is possible.


If it doesn't have to work correctly, any program can run in 0 time
and 0 bytes.

Even programs like the OS/360's IEFBR14, or the C equivalent:

int main(void)
{
return 0;
}

can be made to work faster if they don't have to run correctly.

 
Reply With Quote
 
Mark F. Haigh
Guest
Posts: n/a
 
      11-25-2006
Frederick Gotham wrote:
> I thought it might be interesting to share experiences of tracking down a
> subtle or mysterious bug. I myself haven't much experience with tracking
> down bugs, but there's one in particular which comes to mind.
>

<snip>
>
> I'm sure several regulars here have more interesting stories...


Ok, I'll bite.

One of the most unusual things I've tracked down was a production-level
C program that suddenly failed with the addition of a single comment in
a single .c file. Mind you, this was an application deployed on custom
hardware and in active use at thousands of sites worldwide. It wasn't
even a new comment-- it was the addition of about 30 characters to an
existing comment.

The usual suspects were considered. Perhaps it was a build-order
issue, I thought. I've run into those before. If there are data
corruption problems in the code, changes in the order that files are
compiled and linked can cause the linker to allocate different storage
locations for different variables. This can cause a "more critical"
(ie pointer) variable to be clobbered rather than another (ie
statistics counter) when corruption occurs.

No dice. My original build, a clean rebuild, and a rebuild of a clean
build with several hundred random files touched all failed. Hmmm. At
best this was inconclusive.

Now this was an embedded product running the vxWorks RTOS. In the bad
old days, on the bad old MMU-less platforms, it could take a real
effort to get debugging information out of a production board if it
crashed hard. Unfortunately, I couldn't track down any boards with
debug facilities, so I had to disassemble the chassis and have a rework
tech solder a DB-9 on it so I could get some serial output from it (my
soldering isn't the greatest).

Much to my chagrin, I quickly realized that any substantial code
modifications other than that comment line caused the code to spring
back to life. "If you can't beat 'em, diff 'em", I thought. I
reverted the original change and rebuilt the file, except this time
saving the assembly and preprocessed output. I then re-added the
problematic comment and rebuilt, again saving the assembly and
preprocessed output. On each build I set the randomization seeds to
the same values, so any pseudo-random numbers used by the compiler
would have the same values on both builds.

I fired up the graphical diff program. No changes to the preprocessed
files except the line numbers, which differed by one. No changes to
the assembly files, except...

.... A single additional line of assembly code. Aparrently a
preprocessor macro in a header file far, far away had decided to use
the value of __LINE__ for field debugging of production code. Since
the CPU was an ARM7, and the ARM instruction set can only load a 12 bit
immediate IIRC, and (shockingly) the __LINE__ value was exactly on the
boundary, the compiler had generated a "load immediate; increment by 1"
instruction sequence instead of just a "load immediate".

But who cares!?! The instruction sequence was correct, anyways.
Perhaps an assembler bug? This did not seem to be likely, given "The
Two General Rules of Compiler and Assembler Bugs":

1. It's not a compiler or assembler bug.
2. See rule #1*.

* Unless you can prove it.

I verified with objdump and the ARM Architecture Reference Manual that
the opcodes being generated by the assembler were indeed correct.
Hmmm.

Every code modification I could think of caused things to start working
again. Things were seeming to me to point toward a hardware bug of
some kind. But given "The Two General Rules of Hardware Bugs": ...

1. It's not a hardware bug.
2. See rule #1*.

* Unless you can prove it.

.... I now had to prove it, especially since the units were already in
the field. On a hunch I looked at the assembly code directly after
this new additional instruction. I saw a load of a value from memory
that was added to another value and used as a pointer.

I inserted a jump to a routine I wrote in assembly that moved the
freshly-loaded value to a known memory location. It then jumped to
some C code in another translation unit that dumped the value, and then
reloaded and re-dumped it.

The value of the original read was 0, and the reload of the same
location generated a valid pointer value. Obviously they should have
been equal (I had already ruled out concurrent access problems
previously).

Turns out the memory subsystem was laid out by a junior designer at
another company who had incorrectly chained some of the RAM clock
traces (IIRC), where they should have all had equal lengths. This
particular instruction sequence, combined with the peculiar cache
behavior of the application had caused it to issue memory reads that
exposed the RAM subsystem timing problems. Rumor had it that somebody
at the other company had done a quickie board spin to increase the
amount of RAM without properly reviewing the change. Amazingly, the
rest of the half-million lines of code seemed to run just fine. How, I
will never know.

The entire thing was so unlikely that I couldn't help but think,
"Great. I'll probably be hit with a meteorite on the way to my car
tonight." Fortunately the bad luck stopped with a new spin of the
board, and I'm still here to tell the tale.


Mark F. Haigh
(E-Mail Removed)

 
Reply With Quote
 
websnarf@gmail.com
Guest
Posts: n/a
 
      11-25-2006
Frederick Gotham wrote:
> I thought it might be interesting to share experiences of tracking down a
> subtle or mysterious bug. I myself haven't much experience with tracking
> down bugs, but there's one in particular which comes to mind.
>
> I was writing usable which dealt with strings. As per usual with my code, I
> made it efficient to the extreme. One thing I did was replace, where
> possible, any usages of "strlen" with something like:
>
> struct PtrAndLen {
> char *p;
> size_t len;
> };
>
> This could be initialised with a string as follows:
>
> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
>
> From that point forward in the code, "pal.len" would be used in place of
> strlen.
>
> The code grew though, and at one stage I needed to store info about two
> strings in one of these structures. To do this, I used a null separator,
> e.g.:
>
> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};
>
> All of this, however, was expanded by macros, so I actually had something
> like:
>
> MAKE_STR_INFO("Hello\0Bonjour")
>
> The problem with this, however, was that "strlen" and "pal.len" had
> different values, because strlen only read as far as the first null
> terminator. Anyway, I had to read through the code in detail before the bug
> jumped out at me.


First of all (sizeof "inline string") is 1+strlen ("inline string").
So I assume you compensated for this in your macro.

Second of all, in "The Better String Library", which does the same
thing, this is not a bug but it fact, the correct behavior. '\0' is a
legitimate character, not a string terminator. Where the semantics
coincide (which is most of the time, when dealing with pure text data)
you can assume strlen(bstring->data) is the same as b->slen. In
Bstrlib, you would never try to mash two strings together using some
kind of hacked representation such as "string1\0string2", that would
make no sense. Because Bstrlib is more consistent in this respect,
these sorts of bugs are far less likely.

> I'm sure several regulars here have more interesting stories...


Oh sure:

1) if (a < 0) a = -a; b = sqrt (a);

2) Anything involving a stack overrun with stack checking turned off.
You just have to be inspired to imagine that this is your problem. The
standard is worthless for helping you here.

3) Assuming that vararg parameters were passed by value and could be
"reset" by retrieving its original value. (No debugger or compiler
diagnostic can help you figure out what is going wrong here.)

4) Watching Microsoft Visual C++ barf on struct tagbstring b = {
sizeof("string")-1, -__LINE__, "string" }; because MS's preprocessor
emitted something like _line+425 for __LINE__, and it complained that
it was not a compile time constant.

5) Adventures with WATCOM C/C++ v11.x's optimizations with "-ol" turned
on. It just fails to build correct code for about 10% of the source
I've written. These are real fun to track down. Like the stack
checking thing, you just have to be inspired to try turning the flag
off to see if it fixes the problem.

Then there's the standard "I forgot I made assumption X in function Y
then passed it parameters which technically violated X even though it
wasn't obvious that it was". Unfortunately, in the C language, these
assumptions often take the form of "allocated at least some certain
amount of space" or "the parameter is a well form non-empty linked
list" etc, and the error is usually undefined behavior.

I don't do a lot of heap or stack smashing anymore these days, as I
generally wrap things in rigourous enough abstractions, and I just
generally use debug heaps while developing. But there can still be
problems of convention. A hash table I implemented has an iterator
mechanism, and I made the termination condition when the index was
greater than the current hash table size -- the problem is that when I
came back to reuse this code after more than a year, I forgot my
convention for termination and thought it was when the index was < 0.
So I walked off the end of the hash array nicely because I did not
sufficiently document the convention. The problem is that I was using
-1 as the start-up index (since 0 may or may not be a valid entry, and
you *have* to perform an increment on every call to the iterator
incrementor) and so could not use < 0 as the terminator condition. But
it meant that my intuition conficted with what was necessary. I fixed
this by creating an "isDone" macro for the iterator.

With multithreaded errors, I already know a priori that they are
difficult. When I can, and I detect such a bug, I will spend a short
amount of time try to track it down. If I can't get it, I junk the
contentious code and start over. Its just a matter of productivity --
these bugs can be so hard, that it will take longer to track them down
than to rewrite the code. Sometimes I don't learn/figure out what I
did wrong, but life is too short.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

 
Reply With Quote
 
Frederick Gotham
Guest
Posts: n/a
 
      11-25-2006
Gordon Burditt:

>>I have never written a program for monetary gain. I program purely for
>>the enjoyment of programming. If I achieve a certain object, I am not
>>satisfied -- I want to achieve the objective as efficiently as is
>>possible.

>
> If it doesn't have to work correctly, any program can run in 0 time
> and 0 bytes.



Which is a tremendous argument, only that my programs do run right in the
end.

--

Frederick Gotham
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
insert track reference in mp3 track D@Z Computer Support 1 05-02-2008 03:32 PM
Flickr: difference between "most relevant" and "most interesting" Max Digital Photography 7 09-26-2007 10:38 PM
Old certification track vs new certification track. Barkley Hughes MCAD 5 11-12-2006 02:41 PM
Hmmmm, most interesting... Mick Sterbs Digital Photography 2 06-29-2004 02:14 AM
Re: Q. What's the 'most interesting thing' that can be done with plain html ? spaghetti HTML 6 08-09-2003 07:58 PM



Advertisments