need help-how to parse references

 
 
susan
      07-25-2003
Hello everyone:

I am a perl beginner. I am practicing to parse a list of different
references. The list looks like the reference list that follows a paper. In
the list, every reference has different numbers of authors. Most
references are either books or journals. I would like to separate each
field, for example, the result I assume looks like:
author article name journal name or book name volume#...year
Alison Balter Access 2000 development 1999

I feel it is hard to find a regular expression to separate them. Can
anyone advise me where I can find more information?

Thanks.

Susan
 
 
Tad McClellan
      07-25-2003
susan <(E-Mail Removed)> wrote:


> I am practicing to parse a list of different
> references.



If you show us several good examples of your data, we would
probably be able to help you.

But you didn't, so we can't.

Have you seen the Posting Guidelines that are posted here frequently?


> for example, the result I assume looks like:
                          ^^^^^^^^
> author article name journal name or book name volume#...year
> Alison Balter Access 2000 development 1999



So that is the output you want from your program?

What does the input look like?

We cannot parse data that we know nothing about.

If that _is_ meant to be your input, then why must you "assume"
what it looks like?

We must know the input with great precision if we are to devise
a way to process it. "Assuming" what the input looks like will
not result in an answer that is useable in real life.


> I feel it is hard to find a regular expression to separate them.



Maybe you do not need a regular expression to separate them.

Maybe you could use some other approach...


> Can
> anyone advise me where I can find more information?



... but without knowing what you have, and how you want to
transform it, we cannot advise one way or the other.


Your post does not contain the information we need to answer your question.

Show us some example input. (one record is not good enough)

Show us some desired output (for that same data).

Tell (and show) us anything you know about the format of the input data:

Can fields be "missing" or "empty"? How can you tell when they are?
Do the fields always line up in columns?
Is there some separator between each column?

If you can do something like that, then we would have a really
good chance of being able to help you with your problem.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
susan
      07-26-2003
Hello friends,

I am sorry I didn't provide enough information about the input. Here
is the example of my text file for the references:

REFERENCES
Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
Cell,2nd edition, Garland Publishing, New York,1989.
Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
Sci.Am.243, 100-125(1980).
Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
S.,Fluorescence Quen-
ching: A tool for Single-Molecule Protein-Folding Study,
Natl.Acad.Sci.19,14,41-
64(2000).

I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
{ print "$Line\n"}
This will keep all the references having a single line. But I don't
know how to tell the computer to consider from "Alberts..." to "1989."
is only one citation. Furthermore, I want to separate the
information, and the output should look like:

1st author 2nd author Article_Name Journal_or_BookName
Alberts,B Bray,D Molecular Biology of the Cell
Van Holde Chromatic
Bauer,W.R Crick,F.H.C Supercoiled DNA Sci.Am.

I plan to parse the 1st author, article name and journal name first
since they provide basic information. The final goal is to parse
all the information.

Thanks for your advice.

Susan


 
 
Sam Holden
      07-26-2003
On 25 Jul 2003 17:36:27 -0700, susan <(E-Mail Removed)> wrote:
> Hello friends,
>
> I am sorry I didn't provide enough information about the input. Here
> is the example of my text file for the references:
>
> REFERENCES
> Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
> Cell,2nd edition, Garland Publishing, New York,1989.
> Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
> Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
> Sci.Am.243, 100-125(1980).
> Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
> S.,Fluorescence Quen-
> ching: A tool for Single-Molecule Protein-Folding Study,
> Natl.Acad.Sci.19,14,41-
> 64(2000).
>
> I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
> { print "$Line\n"}
> This will keep all the references having a single line. But I don't
> know how to tell the computer to consider from "Alberts..." to "1989."
> is only one citation. Furthermore, I want to separate the
> information, and the output should look like:


The output part should be easy when compared to extracting the data.

For extracting the data I suspect you may be out of luck for a purely
automated system - the data is designed for humans and even then there
are probably cases that are ambiguous (for humans, let alone machines).

This is the sort of problem for which the "human in the loop" approach tends
to be best. Parse as best you can, hopefully give a "score" to the parse and
let a human check the results.
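
A minimal sketch of that idea, assuming each parse is stored as a hash with authors, title and year keys. The hash layout, the name score_parse and the three checks are made up for illustration; none of them come from this thread, and real checks would need tuning against real data.

```perl
use strict;
use warnings;

# Hypothetical scoring: one point for each sanity check the parse passes.
sub score_parse {
    my ($parsed) = @_;    # hash ref: authors (array ref), title, year
    my $score = 0;

    # At least one author was recognised.
    $score++ if @{ $parsed->{authors} || [] };

    # A non-empty title was recognised.
    $score++ if defined $parsed->{title} && length $parsed->{title};

    # The year looks like a plausible publication year.
    $score++ if defined $parsed->{year} && $parsed->{year} =~ /^(?:19|20)\d\d$/;

    return $score;
}

my @parses = (
    { authors => [ 'Alberts, B.', 'Bray, D.' ], title => 'Molecular Biology of the Cell', year => 1989 },
    { authors => [],                            title => 'Chromatic',                     year => 1989 },
);

for my $p (@parses) {
    my $score = score_parse($p);
    my $flag  = $score < 3 ? 'CHECK BY HAND' : 'ok';
    print "$score/3 $flag: $p->{title}\n";
}
```

Anything below the top score goes to the human; the threshold is an arbitrary choice here.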

References have the nice property of being referenced in multiple places, and
also have things like citeseer, so if you find something you've found before
it's more likely to be correct, and if a citeseer search for your parsed
result is successful you probably got it right too.

Authors have a reasonably consistent format (Lastname, Initials,); publishers,
journals, proceedings and the like can be covered by enumerating the
known ones (which should cover a large majority of possibilities). And a year
reference will usually end the reference. So it should be easy to get something
which works on the vast majority of references (after all, nothing you can
do will make the system work on an incorrect reference, and they exist...)
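
A rough sketch of those three heuristics, assuming each reference has already been joined onto a single line. The names split_reference, @KNOWN_VENUES and $AUTHOR are invented for this example, and the venue list holds only the four names that appear in the sample data; a usable list would be much longer.

```perl
use strict;
use warnings;

# Assumed, hand-maintained list; these four names come from the sample data.
my @KNOWN_VENUES = ( 'Garland Publishing', 'Springer-Verlag', 'Sci.Am.', 'Natl.Acad.Sci.' );

# One "Lastname, Initials," group, optionally preceded by "and".
my $AUTHOR = qr{
    (?:and\s+)?
    ( [A-Z][a-z]+ (?:[ -][A-Z][a-z]+)* )   # last name, e.g. "Bauer"
    \s*,\s*
    ( (?:[A-Z]\.?\s?)+ )                   # initials, e.g. "W. R." or "F.H.C"
    \s*,\s*
}x;

sub split_reference {
    my ($ref) = @_;
    my %parsed = ( authors => [], title => undef, venue => undef, year => undef );

    # Consume author groups from the start of the line, one at a time.
    pos($ref) = 0;
    while ( $ref =~ /\G$AUTHOR/gc ) {
        push @{ $parsed{authors} }, "$1, $2";
    }
    my $rest = substr( $ref, pos($ref) );

    # A year, with or without parentheses, right before the final dot.
    $parsed{year} = $1 if $rest =~ s/\s*,?\s*\(?(\d{4})\)?\.\s*$//;

    # Without a known venue, everything that is left counts as the title.
    $parsed{title} = $rest;
    for my $venue (@KNOWN_VENUES) {
        my $at = index( $rest, $venue );
        next if $at < 0;
        $parsed{venue} = $venue;
        ( $parsed{title} = substr( $rest, 0, $at ) ) =~ s/[\s,]+$//;
        last;
    }
    return \%parsed;
}

my $p = split_reference(
    'Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA, Sci.Am.243, 100-125(1980).'
);
print join( ' | ', @{ $p->{authors} } ), "\n";
print "$p->{title} / $p->{venue} / $p->{year}\n";
```

The "Van Holde.,Chromatic" line has no initials, so this finds no authors for it. That is exactly the kind of record that should be flagged for a human rather than guessed at.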

As an aside:

I'm amazed that academia hasn't worked out an ID system with publishers. Page
numbers suck (and I've seen at least one great study of incorrect references
spreading through a population; that study interpreted it as a symptom of people
giving references they haven't actually read, whereas I interpret it as copying
the reference data of a read paper from another paper, which I've done
more than once). The actual proceedings, etc. have ISBNs. Giving each paper an
ID and then requiring that references have [ISBN.ID] after the human readable
text would make life *so* much easier.

[snip TOFU - please don't do that]

--
Sam Holden

 
 
Tad McClellan
      07-26-2003
susan <(E-Mail Removed)> wrote:


> Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
> Cell,2nd edition, Garland Publishing, New York,1989.



Ends with 4 digits and a dot.


> Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
> Sci.Am.243, 100-125(1980).



Ends with open paren, 4 digits, close paren and a dot.


> I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
> { print "$Line\n"}
> This will keep all the references having a single line. But I don't
> know how to tell the computer to consider from "Alberts..." to "1989."



Do you know how to tell _us_ how to unambiguously determine the
end of a record?

We need to know what must be done before we can write code
for you that will do it.


> is only one citation.



If this description fits your data, then you can separate out
the records easily enough:

Every record ends with either
5 chars: 4 digits and a dot
or
7 chars: open paren, 4 digits, close paren and a dot


----------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

$_ = '
Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
Cell,2nd edition, Garland Publishing, New York,1989.
Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
Sci.Am.243, 100-125(1980).
Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
S.,Fluorescence Quen-
ching: A tool for Single-Molecule Protein-Folding Study,
Natl.Acad.Sci.19,14,41-
64(2000).
';


#while ( /^([A-Z].*?(\d{4}|\(\d{4}\))\.)$/gmsx ) {

while ( /^(                       # start of line, start of memory
          [A-Z].*?                # starts with upper case letter
          ( \d{4} | \(\d{4}\) )   # 4 digits with or without parens
          \.                      # dot
        )$                        # end of memory, end of line
       /gmsx ) {    # gym sox (gimsox), according to Damian Conway

    print "$1\n------\n";
}
----------------------------------------
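
Each record captured in $1 above still contains the line breaks from the file, including words split as "Quen-" / "ching". Below is a small sketch of one way to flatten a record onto one line. The name join_wrapped_record is made up for this example, and the hyphen rule is an assumption: a hyphen at a line end between two lower case letters is treated as a word split and dropped, and any other hyphen at a line end (such as the page range "41-" / "64") is kept. A genuinely hyphenated word that happens to wrap at its hyphen will be joined wrongly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one multi-line record into a single line.
sub join_wrapped_record {
    my ($record) = @_;
    $record =~ s/(?<=[a-z])-\n(?=[a-z])//g;   # "Quen-\nching" -> "Quenching"
    $record =~ s/-\n/-/g;                     # "41-\n64"      -> "41-64"
    $record =~ s/\s*\n\s*/ /g;                # other line breaks -> one space
    $record =~ s/^\s+|\s+$//g;                # trim both ends
    return $record;
}

my $record = "Zhuang,X.,Ha,T., and Chi,\n"
           . "S.,Fluorescence Quen-\n"
           . "ching: A tool,\n"
           . "Natl.Acad.Sci.19,14,41-\n"
           . "64(2000).";

print join_wrapped_record($record), "\n";
```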


> Furthermore, I want to separate the
> information,



You're on your own with that one.

It is more an Artificial Intelligence question than a Perl question.

The info is already hamburger. You cannot make steak out of it.


> (E-Mail Removed) (Tad McClellan) wrote in message news:<(E-Mail Removed)>...



[ snip a bit of TOFU ]


>> Have you seen the Posting Guidelines that are posted here frequently?



Have you done that yet?

Please do. Thanks.



[ snip some more unlovely TOFU ]

--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 