Need help with pattern matching/substitution.

 
 
Hemant Shah
      09-26-2006

Folks,

I have a script that reads data from a file. Each line in the file is a
comma-separated list of values.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

I want to check if any value contains a quote and add another quote to it.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN''T DELETE '


How do I formulate a pattern for this substitution?

Thanks.

--
Hemant Shah
E-mail:
ASCII ribbon campaign against HTML mail and postings
TO REPLY, REMOVE NoJunkMail FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, it's backed up on tape somewhere.
Above opinions are mine only. Others can have their own.
 
Peter Scott
      09-27-2006
On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
> I have a script that reads data from a file. Each line in the file is a
> comma-separated list of values.
>
> Example:
>
> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '


That isn't valid in any CSV format I know. To keep us from guessing,
please post the specification for the syntax of the lines you are parsing.

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/

 
Hemant Shah
      09-27-2006
While stranded on information super highway Peter Scott wrote:
> On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
>> I have a script that reads data from a file. Each line in the file is a
>> comma-separated list of values.
>>
>> Example:
>>
>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

>
> That isn't valid in any CSV format I know. To keep us from guessing,
> please post the specification for the syntax of the lines you are parsing.


Yes, this is not a valid CSV format. The format is:

TableName|CSV format for data values.
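
A minimal sketch of how a line in this format could be separated into its two parts. The helper name `split_record` is hypothetical, and the sketch assumes the table name itself never contains a `|`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a record into its table name and the raw value list.
# The limit of 2 keeps any later '|' characters inside the value list.
sub split_record {
    my ($line) = @_;
    my ($table, $values) = split /\|/, $line, 2;
    return ($table, $values);
}

my ($table, $values) = split_record("DWH_TRRM|'001','A000855747',100");
print "table:  $table\n";
print "values: $values\n";
```

Splitting only on the first `|` leaves the value list untouched, so any quote handling can be applied to the values alone.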




Dr.Ruud
      09-28-2006
Hemant Shah wrote:

> I have a script that reads data from a file. Each line in the file is
> a comma-separated list of values.
>
> Example:
>
> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
>
> I want to check if any value contains a quote and add another quote
> to it.


That is only possible if you can make several assumptions about the
structure of each record.
Are they fixed length?

An error-prone approach:

#!/usr/bin/perl
use warnings ;
use strict ;

while ( <DATA> )
{
    chomp ;

    # replace '...'<comma> by "..."<newline>
    s/'(.*?)',/"$1"\n/g ;

    # special treatment of the last field
    s/(?:\n|,)'(.*)'$/\n"$1"/ ;

    # double any quotes
    s/'/''/g ;

    # change all <newline>s back to commas
    s/\n/,/g ;

    print "$_\n" ;
}

__DATA__
DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006',10,'CAN'T DELETE '

This prints:
DWH_TRRM|"001","A000855747",100,"02-20-2006","CAN DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006","CAN''T DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006",10,"CAN''T DELETE "

--
Affijn, Ruud

"Gewoon is een tijger."


 
Peter Scott
      09-28-2006
On Wed, 27 Sep 2006 18:06:56 +0000, Hemant Shah wrote:
> While stranded on information super highway Peter Scott wrote:
>> On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
>>> I have a script that reads data from a file. Each line in the file is a
>>> comma-separated list of values.
>>>
>>> Example:
>>>
>>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
>>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

>>
>> That isn't valid in any CSV format I know. To keep us from guessing,
>> please post the specification for the syntax of the lines you are parsing.

>
> Yes, this is not a valid CSV format. The format is:
>
> TableName|CSV format for data values.


It still isn't valid. The fields are single quoted yet the last field in
the second record has an unescaped single quote in it. That doesn't
correspond to any CSV format I know of and is ambiguous. Without more
information limiting the syntax, Abigail is right: your problem is
unsolvable.
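
To illustrate the ambiguity, here is a small sketch. The writer `naive_join` is hypothetical: it quotes every field with single quotes and escapes nothing, which is an assumption about how such data might be produced. Two different field lists then serialize to the same line, so no reader can tell them apart.

```perl
use strict;
use warnings;

# Hypothetical writer: wrap each field in single quotes, escape nothing.
sub naive_join {
    my (@fields) = @_;
    return join ',', map { "'$_'" } @fields;
}

my $two_fields = naive_join('CAN', 'T');    # two short fields
my $one_field  = naive_join("CAN','T");     # one field containing ','
print "$two_fields\n";
print "$one_field\n";
print "identical\n" if $two_fields eq $one_field;
```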

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/

 
Ted Zlatanov
      09-28-2006
On 28 Sep 2006, wrote:

On Wed, 27 Sep 2006 18:06:56 +0000, Hemant Shah wrote:
> While stranded on information super highway Peter Scott wrote:
>> On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
>>>> I have a script that reads data from a file. Each line in the file is a
>>>> comma-separated list of values.
>>>>
>>>> Example:
>>>>
>>>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
>>>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
>>>
>>> That isn't valid in any CSV format I know. To keep us from guessing,
>>> please post the specification for the syntax of the lines you are parsing.

>>
>> Yes, this is not a valid CSV format. The format is:
>>
>> TableName|CSV format for data values.

>
> It still isn't valid. The fields are single quoted yet the last field in
> the second record has an unescaped single quote in it. That doesn't
> correspond to any CSV format I know of and is ambiguous. Without more
> information limiting the syntax, Abigail is right: your problem is
> unsolvable.


I think with the (maybe reasonable, maybe not) assumption that
internal quotes will never have a comma next to them, it's solvable.
I agree that as it stands, the format is not parseable.
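
Under that assumption, a rough sketch might look like this. The function name `double_internal_quotes` is made up for illustration. It treats a quote as a delimiter when it directly follows `|` or `,`, or directly precedes `,` or the end of the line, and doubles every other quote. It also assumes every line starts with the `TableName|` prefix and has had its newline removed.

```perl
use strict;
use warnings;

# Double every single quote that is not a field delimiter.
# Assumption: a delimiter quote always touches '|', ',' or the end of the
# line, and an internal quote never does.
sub double_internal_quotes {
    my ($line) = @_;
    $line =~ s/(?<![|,])'(?!,|$)/''/g;
    return $line;
}

my $in  = "DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '";
my $out = double_internal_quotes($in);
print "$out\n";
# prints: DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN''T DELETE '
```

A value such as `'A','B'` inside a field would still break this, which is exactly the case the assumption rules out.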

The OP may want to look at the producers of that data and see if they
can be fixed to produce real CSV at the source. It will be a lot
easier than fixing it after the damage to the data has been done.

Ted
 
Hemant Shah
      09-28-2006
While stranded on information super highway Dr.Ruud wrote:
> Hemant Shah wrote:
>
>> I have a script that reads data from a file. Each line in the file is
>> a comma-separated list of values.
>>
>> Example:
>>
>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
>>
>> I want to check if any value contains a quote and add another quote
>> to it.

>
> That is only possible if you can make several assumptions, from the
> structure of each record.
> Are they fixed length?


No, each line may have a different length, depending on the number of
columns in the table. The string containing a quote can also be at any
position in the list.

The problem is that this file is generated by a COBOL program that reads
EBCDIC data and generates an ASCII file. My Perl script has to read the file
and load the data into DB2 tables. There is not much string manipulation they
can do in COBOL, so I have to do it in Perl.

It was decided that the strings will never contain a '?', so the COBOL
program now writes '?' in place of each quote in the data, and my Perl script
replaces every '?' with two quotes.
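
A minimal sketch of that replacement step, assuming '?' really never occurs in the data for any other reason. The function name `restore_quotes` is only illustrative.

```perl
use strict;
use warnings;

# Replace the '?' placeholder written by the COBOL side with a doubled
# single quote. Assumes '?' never appears as real data.
sub restore_quotes {
    my ($line) = @_;
    $line =~ s/\?/''/g;
    return $line;
}

my $fixed = restore_quotes("DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN?T DELETE '");
print "$fixed\n";
# prints: DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN''T DELETE '
```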



