extracting text from powerpoint file

 
 
code_wrong
      09-12-2005
hi,
I decided to extract the text from some powerpoint files. The results have
thrown up some questions.

When I use the 'char *valid' character array (in the program below) to
choose the characters to write in the new file... the result is totally
different to when I use the line with isalpha() and isdigit().

Yes .. There are more valid characters in the valid array but this is not
the problem .. Using it, I see extra spaces in the new file and it is more
difficult to read (in notepad there appears to be a space between each
character .. in wordpad there are boxes between characters).. why?

Anyone care to investigate and enlighten me? The code is below; all you
need to do is comment and uncomment the two 'if' lines to see the
differences I am talking about.

To use the program (with MS Windows), all you need to do is drag the file you
want to process onto the .exe file.

cheers
cw

the program:
############

#include <stdio.h>
#include <stdlib.h> /* exit(), system() */
#include <string.h> /* strchr(), needed once the strchr line is uncommented */
#include <ctype.h>

void writeFile(FILE *infile, FILE *outfile);

int main(int argc, char *argv[])
{
    FILE *outfile = NULL; /* the file to write to */
    FILE *infile = NULL;  /* the file to read */

    /* argv[1] only exists if a file was dragged onto the .exe */
    if (argc < 2)
    {
        printf("no input file given - fatal error - goodbye\n");
        getchar();
        exit(1);
    }
    if (((infile = fopen(argv[1], "rb")) == NULL) || ((outfile = fopen("new.txt", "wb")) == NULL))
    {
        printf("error opening file - fatal error - goodbye\n");
        getchar();
        exit(1);
    }
    writeFile(infile, outfile);
    fclose(infile);
    fclose(outfile);
    fflush(stdout);
    system("pause");
    return 0;
}

void writeFile(FILE *infile, FILE *outfile)
{
    /* adjacent string literals are joined by the compiler */
    char *valid =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
        "\n.;:<>?/|\\!\"$%^&*()_-=+,#~[]{}";

    int byte;

    /* read one byte at a time; stop at end of file or on a read error */
    while ((byte = fgetc(infile)) != EOF)
    {
        /*if(strchr(valid,byte))*/
        if ((isalpha(byte)) || (isdigit(byte)) || (byte == ' ') || (byte == '\n'))
        {
            fputc(byte, outfile);
        }
    }
}

############


 
Irrwahn Grausewitz
      09-12-2005
"code_wrong" wrote:
<snip>
>When I use the 'char *valid' character array (in the program below) to
>choose the characters to write in the new file... the result is totally
>different to when I use the line with isalpha() and isdigit().
>
>Yes .. There are more valid characters in the valid array but this is not
>the problem .. Using it, I see extra spaces in the new file and it is more
>difficult to read (in notepad there appears to be a space between each
>character .. in wordpad there are boxes between characters).. why?

<snip>
>void writeFile(FILE *infile,FILE *outfile)
>{
> char *valid =
>"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
>\n.;:<>?/|\\!\"$%^&*()_-=+,#~[]{}";


You'd be better off declaring the array static, but that's not the
problem.

> int byte;
>
> while(1)
> {
> byte = fgetc(infile);/*read one byte*/
> if(feof(infile)){break;}/*break from while at end of file*/
>
> /*if(strchr(valid,byte))*/


I've only skimmed over your code, and won't comment on style flaws, but
the line above (the one giving you trouble, if uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))
> {
> fputc(byte,outfile);
> }
> else
> { }
>
> }
>}


Best regards
--
Irrwahn Grausewitz
welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
clc frequent answers: http://benpfaff.org/writings/clc
 
code_wrong
      09-12-2005

"Irrwahn Grausewitz" wrote:

snip

> I've only skimmed over your code, and won't comment on style flaws, but
> the line above (the one giving you trouble, if uncommented, right?) does
> not check for 0 bytes. In the strchr function, the terminating null
> character is considered to be part of the string. You want something
> like:
>
> if( byte && strchr(valid,byte))


snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
So I guess the program reads a null character in the file and writes it to
the output file ...

wonder why there are so many null characters in the powerpoint file (every
second character) ....interesting

cheers
cw




 
Mike Wahler
      09-12-2005

"code_wrong" wrote:
>
> "Irrwahn Grausewitz" wrote:
>
> snip
>
>> I've only skimmed over your code, and won't comment on style flaws, but
>> the line above (the one giving you trouble, if uncommented, right?) does
>> not check for 0 bytes. In the strchr function, the terminating null
>> character is considered to be part of the string. You want something
>> like:
>>
>> if( byte && strchr(valid,byte))

>
> snip
>
> Thanks, you have identified the line of code that was producing the
> boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
> So I guess the program reads a null character in the file and writes it to
> the output file ...
>
> wonder why there are so many null characters in the powerpoint file (every
> second character) ....interesting


Well, it's a 'binary' file (as opposed to 'plain text'), in which embedded
zero characters are common. Your remark about 'every second character'
makes me guess that perhaps (at least part of) the data might be stored
as multibyte or 'wide' characters (e.g. Unicode). You might want to look
into that possibility.

-Mike


 