Usable street address parser in Python?

 
 
John Nagle
      04-17-2010
Is there a usable street address parser available? There are some
bad ones out there, but nothing good that I've found other than commercial
products with large databases. I don't need 100% accuracy, but I'd like
to be able to extract street name and street number for at least 98% of
US mailing addresses.

There's pyparsing, of course. There's a street address parser as an
example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
It's not very good. It gets all of the following wrong:

1500 Deer Creek Lane (Parses "Creek" as a street type)
186 Avenue A (NYC street)
2081 N Webb Rd (Parses N Webb as a street name)
2081 N. Webb Rd (Parses N as street name)
1515 West 22nd Street (Parses "West" as name)
2029 Stierlin Court (Street names starting with "St" misparse.)

Some special cases don't work either, unsurprisingly:
P.O. Box 33170
The Landmark @ One Market, Suite 200
One Market, Suite 200
One Market

Much of the problem is that this parser starts at the beginning of the string.
US street addresses are best parsed from the end, says the USPS. That's why
things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
expressions are the right tool for this job.

There must be something out there a little better than this.

John Nagle
 
John Roth
      04-18-2010
You have my sympathy. I used to work on the address parser module at
Trans Union, and I've never seen another piece of code that had as
many special cases, odd rules and stuff that absolutely didn't make
any sense until one of the old hands showed you the situation it was
supposed to handle.

And most of those files were supposed to be up to USPS mass mailing
standards.

When the USPS says that addresses are best parsed from the end, they
aren't talking about the street address; they're talking about the
address as a whole, where it's easiest if you look for a zip first,
then the state, etc. The best approach I know of for the street
address is simply to tokenize the thing, and then do some pattern
matching. Trying to use any kind of deterministic parser is going to
fail big time.
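
A minimal sketch of the "start from the end" idea for the address as a whole, assuming the input ends in a two-letter state and a ZIP or ZIP+4. The function name split_from_end is hypothetical, and the state is not checked against a real list of abbreviations:

```python
import re

# Five-digit ZIP, optionally followed by the four-digit extension.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def split_from_end(address):
    """Peel ZIP and state off the end; return (remaining_tokens, state, zip) or None."""
    tokens = address.replace(",", " ").split()
    # The last token must look like a ZIP code.
    if not tokens or not ZIP_RE.match(tokens[-1]):
        return None
    zip_code = tokens.pop()
    # The token before it must look like a two-letter state (assumption: not validated).
    if not tokens or len(tokens[-1]) != 2 or not tokens[-1].isalpha():
        return None
    state = tokens.pop().upper()
    return tokens, state, zip_code

print(split_from_end("3340 Astoria Way NE, Salem, OR 97303"))
```

Whatever tokens remain (city plus street line) can then go to the tokenize-and-match step, and a None result flags an address that needs other handling.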

IMO, 98% is way too high for any module except one that's been given a
lot of love by a company that does this as part of their core
business. There's a reason why commercial products come with huge
databases: it's impossible to parse everything correctly with a single
set of rules. Those databases also contain the actual street names
and address ranges by ZIP code, so that direct marketing files can be
cleansed to USPS standards.

That said, I don't see any reason why any of the examples in your
first group should be misparsed by a competent parser.

Sorry I don't have any real help for you.

John Roth
 
Paul McGuire
      04-19-2010
Please take a look at the updated form of this parser. It turns out
there actually *were* some bugs in the old form, plus there was no
provision for PO Boxes, avenues that start with "Avenue" instead of
ending with it, or house numbers spelled out as words. The only one
I consider a "special case" is the support for "Avenue X" instead of
"X Avenue"; support for the rest was added in a fairly general
way. With these bug fixes, I hope your hit rate improves. (There
are also some simple attempts at handling apt/suite numbers, and APO and
AFP in addition to PO boxes. If that is not exactly what you need, the
parser should be pretty straightforward to extend to support other options.)

-- Paul
 
Stefan Behnel
      04-19-2010
John Nagle, 17.04.2010 21:23:
> Is there a usable street address parser available?


What kind of street address are you talking about? Only US-American ones?

I ask because street addresses are written differently all over the world.
Some have house numbers, some use letters or a combination, some have no house
numbers at all. Some use ordinal numbers, others use regular numbers. Some
put the house number before the street name, some after it. And this is
not a comprehensive list, nor is the job finished after parsing the
line that gives you the street (assuming there is such a thing in the first
place).

Stefan

 
John Nagle
      04-20-2010
John Nagle wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at
> "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".


The author of that module has changed the code, and it has some
new features. This is much better.

Unfortunately, it no longer runs with the released
version of "pyparsing" (1.5.2, from April 2009), because it uses
"originalTextFor", a feature introduced since then. I worked around that,
but then discovered that the new version is case-sensitive, so I changed
"Keyword" to "CaselessKeyword" where appropriate.

I put in the full list of USPS street types, and discovered
that "1500 DEER CREEK LANE" still parses with a street name
of "DEER" and a street type of "CREEK", because "CREEK" is a
USPS street type. I need to do something to pick up the last street
type, not the first. I'm not sure how to do that with pyparsing.
Maybe if I buy the book...
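
Outside pyparsing, one way to prefer the last street type is to tokenize the line and scan it from the right. This is only a sketch in plain Python, not a pyparsing grammar. The function name split_street is hypothetical, and STREET_TYPES is a tiny illustrative subset rather than the full USPS list:

```python
# Tiny illustrative subset; the real USPS list of street types is much longer.
STREET_TYPES = {"AVE", "AVENUE", "COURT", "CREEK", "CT", "LANE", "LN",
                "RD", "ROAD", "ST", "STREET", "WAY"}

def split_street(line):
    """Return (number, name, street_type), or None when nothing matches."""
    tokens = line.upper().replace(".", "").split()
    if len(tokens) < 3 or not tokens[0].isdigit():
        return None
    # Scan right to left so the LAST street type wins ("CREEK" stays in the name).
    # Stopping at index 2 leaves at least one token for the name.
    for i in range(len(tokens) - 1, 1, -1):
        if tokens[i] in STREET_TYPES:
            # Tokens after the type (directional, suite) are ignored in this sketch.
            return tokens[0], " ".join(tokens[1:i]), tokens[i]
    return None

print(split_street("1500 Deer Creek Lane"))
```

Lines such as "186 Avenue A" come back as None here, so they can be counted as failures and given their own rule instead of being silently misparsed.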

There's still a problem with "2081 N Webb Rd", where the street name
comes out as "N WEBB".
Addresses like "1234 5th St. S." yield a street name of "5 TH",
but if the directional comes before the name, it ends up as part of the name.

Getting closer, though. If I can get to 95% of common cases, I'll
be happy.


John Nagle
 
John Yeung
      04-20-2010
My response is similar to John Roth's. It's mainly just sympathy.

I deal with addresses a lot, and I know that a really good parser is
both rare/expensive to find and difficult to write yourself. We have
commercial, USPS-certified products where I work, and even with those
I've written a good deal of pre-processing and post-processing code,
consisting almost entirely of very silly-looking fixes for special
cases.

I don't have any experience whatsoever with pyparsing, but I will say
I agree that you should try to get the street type from the end of the
line. Just be aware that it can be valid to leave off the street type
completely. And of course it's a plus if you can handle suites that
are on the same line as the street (which is where the USPS prefers
them to be).

I would take the approach which John R. seems to be suggesting, which
is to tokenize and then write a whole bunch of very hairy, special-
case-laden logic. I'm almost positive this is what all the
commercial packages are doing, and I have a tough time imagining what
else you could do. Addresses inherently have a high degree of
irregularity.

Good luck!

John Y.
 
Iain King
      04-20-2010
Not sure on the volume of addresses you're working with, but as an
alternative you could try grabbing the zip code, looking up all
addresses in that zip code, and then finding whatever one of those
address strings most closely resembles your address string (smallest
Levenshtein distance?).
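
A rough sketch of that lookup, assuming the candidate addresses for one ZIP code are already in memory as plain strings (where they come from is left open). The helper name closest_address is hypothetical; levenshtein is the standard dynamic-programming edit distance:

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute if different
        prev = cur
    return prev[-1]

def closest_address(query, candidates):
    """Return the candidate with the smallest edit distance to query (case-insensitive)."""
    q = query.upper()
    return min(candidates, key=lambda c: levenshtein(q, c.upper()))

# Hypothetical candidate list for a single ZIP code.
known = ["1500 DEER CREEK LANE", "1515 WEST 22ND STREET"]
print(closest_address("1500 Deer Creek Ln", known))
```

In practice a cutoff on the distance would probably be needed too, so that an address with no real counterpart in the list is rejected rather than matched to its least bad neighbour.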

Iain
 
Grant Edwards
      04-20-2010
On 2010-04-20, Tim Roberts wrote:

> This is a very tricky problem. Consider Salem, Oregon, which puts the
> direction after the street:
>
> 3340 Astoria Way NE
> Salem, OR 97303


In Minneapolis, the direction comes before the street in some
quadrants and after it in others. I used to live on W 43rd Street.
Now I live on 24th Ave NE. And just to be more inconsistent, only the
"NE" section uses two-letter directions; everywhere else it's just W, S, N,
or E.
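
Since the directional can sit on either side, a tokenized approach can check both ends of the street part. A small sketch, with the hypothetical helper strip_directionals applied to the tokens that follow the house number:

```python
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}

def strip_directionals(tokens):
    """Split street tokens into (pre_directional, core_tokens, post_directional)."""
    tokens = [t.upper().rstrip(".") for t in tokens]
    pre = post = None
    # Always keep at least one token for the street itself.
    if len(tokens) > 1 and tokens[0] in DIRECTIONALS:
        pre, tokens = tokens[0], tokens[1:]
    if len(tokens) > 1 and tokens[-1] in DIRECTIONALS:
        post, tokens = tokens[-1], tokens[:-1]
    return pre, tokens, post

print(strip_directionals("W 43rd Street".split()))
print(strip_directionals("24th Ave NE".split()))
```

A street that is literally named after a direction ("North Street") would be mangled by this, so it would need an extra rule or an entry in an exception list.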

--
Grant Edwards, grant.b.edwards at gmail.com
Yow! Is it NOUVELLE CUISINE when 3 olives are struggling with a scallop
in a plate of SAUCE MORNAY?
 
John Nagle
      04-20-2010
Iain King wrote:
> Not sure on the volume of addresses you're working with, but as an
> alternative you could try grabbing the zip code, looking up all
> addresses in that zip code, and then finding whatever one of those
> address strings most closely resembles your address string (smallest
> Levenshtein distance?).


The parser doesn't have to be perfect, but it should
reliably report when it fails. Then I can run the hard cases through
one of the commercial online address standardizers. I'd like to
be able to knock off the easy cases cheaply.

What I want to do is to first extract the street number and
undecorated street name only, match that to a large database of US businesses
stored in MySQL, and then find the best match from the database
hits. So I need reliable extraction of undecorated street name and number. The
other fields are less important.

John Nagle
 
Albert van der Horst
      04-21-2010
John Nagle wrote:
>Iain King wrote:
>> Not sure on the volume of addresses you're working with, but as an
>> alternative you could try grabbing the zip code, looking up all
>> addresses in that zip code, and then finding whatever one of those
>> address strings most closely resembles your address string (smallest
>> Levenshtein distance?).

>
> The parser doesn't have to be perfect, but it should
> reliably report when it fails. Then I can run the hard cases through
> one of the commercial online address standardizers. I'd like to
> be able to knock off the easy cases cheaply.


In a similar situation (analysing assembler code sequences for their
stack effect) I did the exact reverse.
I made a list of all exceptions, and checked against that first.
If the input is not an exception, the rule should apply.
If it doesn't, call Houston.
(Of course, one starts by making the input canonical: all upper case,
maybe reordering, etc.)
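
Carried over to addresses, that order of checks might look like the sketch below. Everything named here (UnparsedAddress, EXCEPTIONS, canonical, number_and_name) is hypothetical. The exception entries are made-up illustrations, and the "general rule" is deliberately crude: a numeric house number first, and the last token assumed to be the street type.

```python
class UnparsedAddress(Exception):
    """Raised when neither the exception list nor the general rule applies."""

# Hypothetical hand-maintained table of known oddities, keyed by canonical form.
EXCEPTIONS = {
    "ONE MARKET": ("1", "MARKET"),
    "186 AVENUE A": ("186", "AVENUE A"),
}

def canonical(line):
    """Upper-case, drop periods, turn commas into spaces, collapse whitespace."""
    return " ".join(line.upper().replace(".", "").replace(",", " ").split())

def number_and_name(line):
    key = canonical(line)
    if key in EXCEPTIONS:                          # 1. known exceptions first
        return EXCEPTIONS[key]
    tokens = key.split()
    if len(tokens) >= 3 and tokens[0].isdigit():   # 2. otherwise the general rule
        return tokens[0], " ".join(tokens[1:-1])
    raise UnparsedAddress(line)                    # 3. otherwise "call Houston"

print(number_and_name("2029 Stierlin Court"))
```

Raising an exception rather than returning a guess is what makes the failures countable, so the hard cases can be routed elsewhere.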

>
> What I want to do is to first extract the street number and
>undecorated street name only, match that to a large database of US businesses
>stored in MySQL, and then find the best match from the database
>hits. So I need reliable extraction of undecorated street name and number. The
>other fields are less important.


This kind of problem remains very tricky ...

At least in the Netherlands we have a book that specifies how
a street name should officially be spelled using a limited
number of characters.

>
> John Nagle


Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

 