Trying to compare two files and output it into a third file.

 
 
chutsu
Guest
Posts: n/a
 
      07-29-2009
Ok. So basically I have two files in the form of:

File1:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

File2:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

These two files are similar, but they are not ordered. My goal is to
take the first column from the first file (i.e. "asdfkjsdlfkjsdf") and
search the second file for exactly the same string.

Once there is a match, take the second column from both files (i.e. the
numbers) and output as follows:

File 3:
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133


Can someone help me, I have no idea how to approach this.
Thanks
Chris
 
 
 
 
 
Moi
Guest
Posts: n/a
 
      07-29-2009
On Wed, 29 Jul 2009 13:25:00 -0700, chutsu wrote:

> Ok. So basically I have two files in the form of:
>
> File1:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> File2:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> Now these two files are similar and they are not ordered, my quest is to
> get the first column from the first file (ie "asdfkjsdlfkjsdf") and read
> the second file to find the same exact phrase.
>
> Once you get a match obtain both second columns (ie The numbers) and
> output as follows:
>
> File 3:
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
>
>
> Can someone help me, I have no idea how to approach this. Thanks



Sort/merge, or a "nested table scan".
Some hashing might help.

NB I don't know where the 133 in the result set comes from.
And I don't know why there are *three* tuples in the result set.
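
The hashing idea can be sketched roughly as follows. This is only an illustration under stated assumptions: keys are at most 20 characters, each line is "key value" with an integer value, and the names `Entry`, `hash_key`, `table_load`, `table_find`, `join_files`, `KEY_MAX` and `NBUCKETS` are made up for this example, not taken from the thread.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_MAX  20      /* assumed longest key; keep in sync with "%20s" */
#define NBUCKETS 65536   /* arbitrary table size, chosen for illustration */

typedef struct Entry {
    char key[KEY_MAX + 1];
    int value;
    struct Entry *next;  /* next entry that hashed to the same bucket */
} Entry;

/* djb2-style string hash reduced to a bucket index. */
unsigned long hash_key(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Load every "key value" line of `in` into the table.
 * Returns 0 on success, -1 if memory runs out. */
int table_load(Entry **table, FILE *in)
{
    char key[KEY_MAX + 1];
    int value;

    while (fscanf(in, "%20s %d", key, &value) == 2) {
        Entry *e = malloc(sizeof *e);
        unsigned long h = hash_key(key);
        if (e == NULL)
            return -1;
        strcpy(e->key, key);
        e->value = value;
        e->next = table[h];   /* push onto the front of the bucket chain */
        table[h] = e;
    }
    return 0;
}

/* Return the entry stored under `key`, or NULL if there is none. */
Entry *table_find(Entry **table, const char *key)
{
    Entry *e;
    for (e = table[hash_key(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

/* For each line of `second` whose key is in the table, write
 * "key value1 value2" to `out`. Returns the number of matches. */
int join_files(Entry **table, FILE *second, FILE *out)
{
    char key[KEY_MAX + 1];
    int value, matches = 0;

    while (fscanf(second, "%20s %d", key, &value) == 2) {
        Entry *e = table_find(table, key);
        if (e != NULL) {
            fprintf(out, "%s %d %d\n", key, e->value, value);
            matches++;
        }
    }
    return matches;
}
```

A caller would allocate the table with `calloc(NBUCKETS, sizeof(Entry *))` so every bucket starts empty, load the first file with `table_load`, then pass the second file and the output file to `join_files`. Each file is read once, and neither needs to be sorted. The sketch does not free the entries.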


HTH,
AvK
 
 
 
 
 
Default User
Guest
Posts: n/a
 
      07-29-2009
chutsu wrote:

> Ok. So basically I have two files in the form of:
>
> File1:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> File2:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> Now these two files are similar and they are not ordered, my quest is
> to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
> read the second file to find the same exact phrase.
>
> Once you get a match obtain both second columns (ie The numbers) and
> output as follows:
>
> File 3:
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
>
>
> Can someone help me, I have no idea how to approach this.


You have NO idea? Well, how would you do it by hand? What exactly is
giving you trouble? Do you know how to open files? Read from them?
Compare strings? Do you know what loops are?

If you seriously have no idea how to approach this problem, then you
need to fall back and learn C and programming from the start.
Otherwise, you need to show us what you've tried so we can help direct
you along the correct approach.



Brian

--
Day 177 of the "no grouchy usenet posts" project
 
 
Gene
Guest
Posts: n/a
 
      07-29-2009
On Jul 29, 4:25 pm, chutsu <(E-Mail Removed)> wrote:
> Ok. So basically I have two files in the form of:
>
> File1:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> File2:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> Now these two files are similar and they are not ordered, my quest is
> to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
> read the second file to find the same exact phrase.
>
> Once you get a match obtain both second columns (ie The numbers) and
> output as follows:
>
> File 3:
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
>


Perhaps it's OT, but sometimes IMO the best C is no C. I.e. this is
the kind of problem that perl, awk, and similar languages were meant
to solve.

In perl you'd need only something like this _untested_ code.

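our %pairs;

sub scan {
    my $fn = shift;
    # Three-argument open with a lexical filehandle.
    open(my $fh, '<', $fn) or die "can't open $fn: $!";
    while (<$fh>) {
        my ($key, $val) = /^(\S+)\s+(\d+)$/;
        die "bad data" unless defined $key;
        # Collect every value seen for this key, across both files.
        push @{ $pairs{$key} }, $val;
    }
    close $fh;
}

sub report {
    my $fn = shift;
    open(my $fh, '>', $fn) or die "can't open $fn: $!";
    foreach my $key (keys %pairs) {
        # Keys with only one value had no match; skip them.
        next unless scalar(@{ $pairs{$key} }) > 1;
        print $fh "$key\t" . join("\t", @{ $pairs{$key} }) . "\n";
    }
    close $fh;
}

scan("file1");
scan("file2");
report("file3");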
 
 
chutsu
Guest
Posts: n/a
 
      07-29-2009
On Jul 29, 9:51 pm, "Default User" <(E-Mail Removed)> wrote:
<snip>
> You have NO idea? Well, how would you do it by hand? What exactly is
> giving you trouble? Do you know how to open files? Read from them?
> Compare strings? Do you know what loops are?
>
> If you seriously have no idea how to approach this problem, then you
> need to fall back and learn C and programming from the start.
> Otherwise, you need to show us what you've tried so we can help direct
> you along the correct approach.


To understand my code you need to know more about what these files are.
I'm trying to sort out some DNA data I got. The first stage is to
find which sequences are common to both files, and how many repeats or
"reads" occur.
The first field is the sequence (or tag in my code), the second is the
number of reads.
Data files 1 and 2 will therefore look like:
CAGCTCACTGCA 123
ACGTGCCCCCTT 847
etc... etc...

I've been writing this code and I have no idea why it doesn't work:

the usual includes and opening of files...
This is the bit I can't get to work

// Read file1
while(!feof(file)){

    // Get the tag sequence and the read number
    fscanf(file, "%s", tag_1);

    // Validate the tag is a sequence and not the reads
    if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

        // Read file2
        fscanf(file2, "%s", tag_2);

        // Validate the tag2 is a sequence and not the reads
        if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')
        {

            // Now compare tag1 with tag2 to see if they match
            if(strcmp(tag_1, tag_2)==0){
                printf("match!: %s", tag_1);
            }
        }
    }
}

Note this is by no means finished; I'm working in stages, but this is as
far as I got.
 
 
Morris Keesan
Guest
Posts: n/a
 
      07-29-2009
On Wed, 29 Jul 2009 19:14:35 -0400, chutsu <(E-Mail Removed)> wrote:

> to understand my code you need know more about what these files are.
> So I'm trying to sort out some DNA data I got, the first stage is to
> compare which sequences appear to be common, and how many repeats or
> "reads" occur.
> The first field is the sequence (or tag in my code), the second is the
> number of reads.
> The data file 1 and 2 will therefore look like:
> CAGCTCACTGCA 123
> ACGTGCCCCCTT 847
> etc... etc...


Clarification: It looks like you only want to find a match between
the two files if the matching base sequence is on the same line
number in both files? That appears to be the intent of your code.

And a few questions: How large are these files? Is there
any particular reason to avoid sorting them? And do you have
a guarantee that the two files have the same number of lines?

>
> I've been writing this code and I have no idea why it doesn't work:
>
> the usual inclues and opening file...
> This is the bit I can't get it to work
>
> // Read file1
> while(!feof(file)){


This is a very common error in C code: feof only returns true after
you've attempted to read past the end of the file, NOT when you've
read the last byte of the file.
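
To make the feof point concrete, here is a small sketch contrasting the two loop styles. The function names `count_tokens` and `count_tokens_feof` and the 63-character token limit are assumptions made for this illustration only.

```c
#include <stdio.h>

/* Counts whitespace-separated tokens. The loop is controlled by the
 * return value of fscanf, so it ends on the read that actually fails. */
int count_tokens(FILE *in)
{
    char token[64];
    int n = 0;

    while (fscanf(in, "%63s", token) == 1)
        n++;
    return n;
}

/* The problematic pattern: feof() is still false after the last token
 * has been read when whitespace (such as a newline) follows it, so the
 * body runs one extra time on a read that fails. */
int count_tokens_feof(FILE *in)
{
    char token[64];
    int n = 0;

    while (!feof(in)) {
        fscanf(in, "%63s", token);   /* return value ignored: the bug */
        n++;
    }
    return n;
}
```

On a file that ends with a newline, the second version counts one token too many, because the final pass through the loop body happens after the last successful read.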

>
> // Get the tag sequence and the read number
> fscanf(file, "%s", tag_1);
>
> // Validate the tag is a sequence and not the reads
> if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){
>
> // Read file2
> fscanf(file2, "%s", tag_2);
>
> // Validate the tag2 is a sequence and not the reads
> if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')


Note that you've repeated the code for comparing the first character to
ACGT -- very poor programming practice. If you were going to do this,
it would be worth extracting this test into a subroutine.
But reading the file this way, scanning a single token at a time and
testing the content to figure out which column you've read, is clunky
and error-prone, and I think it's a major source of the confusion in
your code. I suggest code something like this:

/* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
 * nreads_1 nreads_2 are integers. Also assume that you're
 * super-confident of your data format, and that the sequences
 * can't possibly be large enough to overflow tag_1 and tag_2
 */

while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
{
    fscanf(file2, "%s %d", tag_2, &nreads_2);
    if (strcmp(tag_1, tag_2) == 0)
        printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
}

This needs some additional error-checking, but there's a basic
framework for you.
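
One possible shape for that additional error-checking is sketched here. It assumes sequences of at most 20 characters; the function name `compare_lockstep`, the `TAG_MAX` constant and the `out` parameter are inventions for this example. It keeps the same line-by-line comparison as the framework above.

```c
#include <stdio.h>
#include <string.h>

#define TAG_MAX 20   /* assumed longest sequence; keep in sync with "%20s" */

/* Compares record i of `a` with record i of `b` and reports the pairs
 * whose tags are equal. Stops at the first missing or malformed record
 * in either file. Returns the number of matches. */
int compare_lockstep(FILE *a, FILE *b, FILE *out)
{
    char tag_1[TAG_MAX + 1], tag_2[TAG_MAX + 1];
    int nreads_1, nreads_2, matches = 0;

    /* "%20s" caps how much fscanf may store, and "== 2" requires that
     * both fields were converted, not merely that EOF was not hit. */
    while (fscanf(a, "%20s %d", tag_1, &nreads_1) == 2 &&
           fscanf(b, "%20s %d", tag_2, &nreads_2) == 2) {
        if (strcmp(tag_1, tag_2) == 0) {
            fprintf(out, "match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
            matches++;
        }
    }
    return matches;
}
```

Comparing against 2 rather than EOF matters because fscanf returns 0, not EOF, when the input is present but does not fit the format; a loop that tests only for EOF can then spin forever on the same bad input.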
 
 
chutsu
Guest
Posts: n/a
 
      07-30-2009
On Jul 30, 12:41 am, "Morris Keesan" <(E-Mail Removed)> wrote:
<snip>
> I suggest code something like this:
>
> /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
>  * nreads_1 nreads_2 are integers. Also assume that you're
>  * super-confident of your data format, and that the sequences
>  * can't possibly be large enough to overflow tag_1 and tag_2
>  */
>
> while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
> {
>     fscanf(file2, "%s %d", tag_2, &nreads_2);
>     if (strcmp(tag_1, tag_2) == 0)
>         printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
> }
>
> This needs some additional error-checking, but there's a basic
> framework for you.



Wow, that is so much simpler. Anyway, I tried your code, but
it doesn't print anything.
I did some debugging and noticed that if I add a
"printf" after the second "fscanf",
tag_1 no longer holds its value.

while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
    fscanf(file2, "%s %d", tag_2, &reads_2);
    printf("%s\n", tag_1);
    if (strcmp(tag_1, tag_2) == 0)
        printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
}

The program prints a bunch of blank lines, but if I move the
printf statement before the second "fscanf",
it displays the content.
I'm so confused.
 
 
Thomas Matthews
Guest
Posts: n/a
 
      07-30-2009
chutsu wrote:
> Ok. So basically I have two files in the form of:
>
> File1:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> File2:
> asdfkjsdlfkjsdf 1232
> afasdfklsdjfksf 12312
> sdflsadsdffdsfs 32323
>
> Now these two files are similar and they are not ordered, my quest is
> to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
> read the second file to find the same exact phrase.
>
> Once you get a match obtain both second columns (ie The numbers) and
> output as follows:
>
> File 3:
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
> asdfdsfdsfdssa 1232 133
>
>
> Can someone help me, I have no idea how to approach this.
> Thanks
> Chris


You have a key column, so it sounds like a map data structure
would be very helpful.
struct
{
    char * key;
    char * file1_data;
    char * file2_data;
};

Read all the data from the first file into an array of structs like the
one above.
Sort by the key field.
Read the key field from the second file. Search for the key
in memory. If the key is found, set the second file's data in the
structure. If the key is new, append a new struct and
re-sort.
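
The sort-and-search steps might look roughly like this in C, using the standard `qsort` and `bsearch`. It is a sketch under assumptions: keys of at most 20 characters, integer data instead of the `char *` fields above, and only matching keys are of interest, so the "append and re-sort" step is left out. `Record`, `cmp_record`, `load_sorted` and `merge_second` are names invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAG_MAX 20   /* assumed longest key; keep in sync with "%20s" */

typedef struct {
    char key[TAG_MAX + 1];
    int file1_data;
    int file2_data;
    int in_file2;    /* 1 once the key has been seen in the second file */
} Record;

/* Orders records by key; shared by qsort and bsearch. */
int cmp_record(const void *a, const void *b)
{
    return strcmp(((const Record *)a)->key, ((const Record *)b)->key);
}

/* Reads every "key value" line of `in` into a growing heap array and
 * sorts it by key. Stores the record count in *count.
 * Returns NULL if memory runs out; the caller frees the result. */
Record *load_sorted(FILE *in, size_t *count)
{
    size_t cap = 1024, n = 0;
    Record *recs = malloc(cap * sizeof *recs);
    Record r = {0};

    if (recs == NULL)
        return NULL;
    while (fscanf(in, "%20s %d", r.key, &r.file1_data) == 2) {
        if (n == cap) {
            Record *tmp = realloc(recs, 2 * cap * sizeof *recs);
            if (tmp == NULL) {
                free(recs);
                return NULL;
            }
            recs = tmp;
            cap *= 2;
        }
        recs[n++] = r;
    }
    qsort(recs, n, sizeof *recs, cmp_record);
    *count = n;
    return recs;
}

/* Looks up each key of the second file with a binary search and fills
 * in file2_data on a hit. Returns the number of keys found. */
size_t merge_second(Record *recs, size_t count, FILE *in)
{
    Record probe;
    int value;
    size_t matched = 0;

    while (fscanf(in, "%20s %d", probe.key, &value) == 2) {
        Record *hit = bsearch(&probe, recs, count, sizeof *recs, cmp_record);
        if (hit != NULL) {
            hit->file2_data = value;
            hit->in_file2 = 1;
            matched++;
        }
    }
    return matched;
}
```

After both calls, the records with `in_file2` set hold the key together with both values and can be written to the third file in key order.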

In some languages, you can split the key and values into two
pieces:
struct Value
{
    char * first_value;
    char * second_value;
};

You would then use a map (dictionary, associative array):
map[key] = value;


--
Thomas Matthews

 
 
Morris Keesan
Guest
Posts: n/a
 
      07-30-2009
On Wed, 29 Jul 2009 20:01:55 -0400, chutsu <(E-Mail Removed)> wrote:

> On Jul 30, 12:41 am, "Morris Keesan" <(E-Mail Removed)> wrote:

<snip>
>> Clarification: It looks like you only want to find a match between
>> the two files if the matching base sequence is on the same line
>> number in both files? That appears to be the intent of your code.
>>
>> And a few questions: How large are these files? Is there
>> any particular reason to avoid sorting them? And do you have
>> a guarantee that the two files have the same number of lines?





I note that you haven't answered these questions, leaving the rest
of us to guess what it is that you're really trying to do.

<snip>


>> I suggest code something like this:
>>
>> /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
>>  * nreads_1 nreads_2 are integers. Also assume that you're
>>  * super-confident of your data format, and that the sequences
>>  * can't possibly be large enough to overflow tag_1 and tag_2
>>  */
>>
>> while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
>> {
>>     fscanf(file2, "%s %d", tag_2, &nreads_2);
>>     if (strcmp(tag_1, tag_2) == 0)
>>         printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
>> }
>>
>> This needs some additional error-checking, but there's a basic
>> framework for you.

>
>
> Wow, that is so much more simplified. Anyways I tried your code, but
> the it doesn't return anything.
> I have done some error analysis and noticed that if you added a
> "printf" after the second "fscanf"
> the value of tag_1 does not register anymore.
>
> while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
> fscanf(file2, "%s %d", tag_2, &reads_2);
> printf("%s\n", tag_1);
> if (strcmp(tag_1, tag_2) == 0)
> printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
> }
>
> the program does prints a bunch of blank lines, but if I moved the
> printf statement before the second "fscanf"
> displays the content.
> I'm so confused


First: Scroll up a couple of screens, look at the questions I asked
before, and please answer them.

Second: This is all wild speculation without seeing your actual code, but
notice the comment above my code fragment, stating the assumptions that
would need to be made in order for this to work. Note especially the
assumptions about tag_1 and tag_2 pointing to memory which is large enough
to hold the strings. Without seeing your actual code, I can only guess, but
I wouldn't be at all surprised if you have declarations like

char *tag_1;
char *tag_2;

and no code which allocates any space for them to point at.
Please post the whole function which is doing this, or at least
the declarations and the code which opens the files.
 
 
chutsu
Guest
Posts: n/a
 
      07-30-2009

> >> Clarification: It looks like you only want to find a match between
> >> the two files if the matching base sequence is on the same line
> >> number in both files? That appears to be the intent of your code.


Yes, I'm trying to match the base sequence; however, a match does not
necessarily mean they are both on the same line number. So my code was
meant to:
- read the base sequence from the first file
- store that in some variable (i.e. tag_1)
- read the second file to see if a match is found
- if found, printf "match found"
- loop until there are no more base sequences in file 1
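
The steps listed above could be written roughly as follows. This is a sketch, not the poster's code: it assumes tags of at most 20 characters and at most one occurrence of each tag per file, and the names `nested_scan`, `TAG_MAX` and the `out` parameter are made up for the example.

```c
#include <stdio.h>
#include <string.h>

#define TAG_MAX 20   /* assumed longest tag; keep in sync with "%20s" */

/* For every record of `first`, rescans `second` from the top looking
 * for the same tag, and writes "tag reads_1 reads_2" to `out` on a hit.
 * Returns the number of matches. */
int nested_scan(FILE *first, FILE *second, FILE *out)
{
    char tag_1[TAG_MAX + 1], tag_2[TAG_MAX + 1];
    int reads_1, reads_2, matches = 0;

    while (fscanf(first, "%20s %d", tag_1, &reads_1) == 2) {
        rewind(second);   /* restart the inner search at the first line */
        while (fscanf(second, "%20s %d", tag_2, &reads_2) == 2) {
            if (strcmp(tag_1, tag_2) == 0) {
                fprintf(out, "%s %d %d\n", tag_1, reads_1, reads_2);
                matches++;
                break;    /* assumes a tag appears at most once per file */
            }
        }
    }
    return matches;
}
```

The key difference from a loop that reads one record from each file per iteration is the `rewind` and the inner loop: without them, each tag from the first file is only ever compared with the tag on the same line of the second file. The cost is one pass over the second file per record of the first, which grows quadratically with file size.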

Note: I actually want to do more than just printf, but one step at a time.

> >> And a few questions: How large are these files? Is there
> >> any particular reason to avoid sorting them? And do you have
> >> a guarantee that the two files have the same number of lines?


These files are very large, about 120,000 lines long, so I tried
creating multi-dimensional arrays, but it's just too big. The two files
don't have the same number of lines but do have the same format.
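
If "too big" means a large array declared inside a function, one common cause is the limit on automatic (stack) storage; that is an assumption here, since the failing declaration was not posted. A sketch of allocating the same table on the heap instead, with the hypothetical names `load_tags` and `TAG_MAX`:

```c
#include <stdio.h>
#include <stdlib.h>

#define TAG_MAX 20   /* assumed longest tag; keep in sync with "%20s" */

/* Reads up to max_records "tag count" lines into two heap arrays.
 * On success stores the arrays in *tags_out and *reads_out and returns
 * the number of records read; returns -1 if allocation fails.
 * The caller frees both arrays. */
long load_tags(FILE *in, size_t max_records,
               char (**tags_out)[TAG_MAX + 1], int **reads_out)
{
    /* One block holding max_records rows of TAG_MAX + 1 chars each. */
    char (*tags)[TAG_MAX + 1] = malloc(max_records * sizeof *tags);
    int *reads = malloc(max_records * sizeof *reads);
    size_t n = 0;

    if (tags == NULL || reads == NULL) {
        free(tags);
        free(reads);
        return -1;
    }
    while (n < max_records &&
           fscanf(in, "%20s %d", tags[n], &reads[n]) == 2)
        n++;
    *tags_out = tags;
    *reads_out = reads;
    return (long)n;
}
```

For 120,000 records this asks for roughly 2.5 MB of tags plus the counts, which is usually fine for `malloc` even where a local array of the same size is not. The result is still indexed as `tags[i]` and `reads[i]`.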



<snip>
> Without seeing your actual code, I can only guess, but I wouldn't be
> at all surprised if you have declarations like
>
>     char *tag_1;
>     char *tag_2;
>
> and no code which allocates any space for them to point at.
> Please post the whole function which is doing this, or at least
> the declarations and the code which opens the files.


My full code at the moment is:

#include <stdio.h>
#include <string.h>

int main(int argc, char * argv[])
{

    char *file_path="../../data/clustered_tags/clustered_tags_DB2.txt";
    char *file_path2="../../data/clustered_tags/clustered_tags_SC3.txt";
    char tag_1[21];
    char tag_2[21];
    int reads_1;
    int reads_2;
    int i=0;
    FILE *file;
    FILE *file2;


    // Opening file
    file = fopen( file_path, "r" );
    file2 = fopen( file_path2, "r" );

    if(file==NULL || file2==NULL) {
        printf("Error: can't open file.\n");
        return 1;
    }
    else {
        printf("File opened!\n");
    }

    while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
        fscanf(file2, "%s %d", tag_2, &reads_2);
        printf("%s\n", tag_1);
        if (strcmp(tag_1, tag_2) == 0)
            printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    }

    fclose(file);
    fclose(file2);
    return 0;
}
 