Hashing Street Addresses ?

david_navigator · January 29, 2021

I need to store some data indexed on two physical street addresses - a source and a destination. Sometimes this street address might be as simple as just a Country, other times it could be a full Address down to the building's number.
I don't need to actually store the addresses.

In reality the source address will probably only have about a dozen different values, whereas the destination address will run in to 1000's.

I'm guessing that some kind of HASH is required of each address but this is an area that I don't really know much about and after any advice.

David Heffernan · January 29, 2021

First of all, don't worry about hashing the data. Work out how to represent the data, what data structure to use. And then define what you mean by equality for this data. Once you have done that, and have a clear specification of that, defining a hash function should be obvious. But if not, we will be able to help at that point. But first you need to define the data structure and the equality operator.

david_navigator · January 29, 2021

22 minutes ago, David Heffernan said:

First of all, don't worry about hashing the data. Work out how to represent the data, what data structure to use. And then define what you mean by equality for this data. Once you have done that, and have a clear specification of that, defining a hash function should be obvious. But if not, we will be able to help at that point. But first you need to define the data structure and the equality operator.

Not sure I follow you.

I have two addresses and against those I need to store about 10 floats. This will all be in a database table.

The two addresses come from a third party app, they're sent as structured string data e.g
Field1 - address_line_1

Field2 - address_line_2

Field3 - Town

Fied4 - County

Field5 - PostCode

Field6 - Country

Fields can be empty.
There's no unique ID or anything useful, just a load of strings. The one thing that is consistent is that the same street address will always have the same field values.

Knowing the original source of the data, I think it's acceptable to concatenate all the fields into one string and remove all whitespace/punctuation.

Thus I could be sent

Field1 - 1688 High St.

Field3 - Knowle

Fied4 - Solihull

Field5 - B93 0LY

and convert it to 1688highstknowlesolihullb930ly

which I could then store as one of the keys e.g

Source | Destination | Float1 | Float 2 | etc

1688highstknowlesolihullb930ly | 1600pennsylvaniaavenuenwwashingtondc20500unitedstates | Float 1 | Float 2 | etc

But as there's no maximum length of any of the field data I'm sent, these keys could end up being unnecessarily long, so this doesn't feel like the right approach.

David Heffernan · January 29, 2021

OK, maybe I misunderstood. I saw "hashing" and "indexed on", and inferred that you want to use a hashed collection like a dictionary. Hence the need to define data structure and what equality means. But perhaps you want to hash this data for some other reason. What will you do with this hash?

david_navigator · January 29, 2021

Maybe I'm using the wrong terminology - this isn't a area I'm too clued up on.

I was hoping that irrespective of the length of the address I could calculate some value (of known max length) that uniquely represented the address and could be used in in the database table

e.g in pseudo code

UNIQUEVALUE1 := Function ConvertAddress('1688highstknowlesolihullb930ly');

UNIQUEVALUE2 := Function ConvertAddress('1600pennsylvaniaavenuenwwashingtondc20500unitedstates');

DB TABLE

UNIQUEVALUE1 | UNIQUEVALUE2 | Float 1 | Float 2 | etc

Lars Fosdal · January 29, 2021

It looks like he wants to use a hash as a signature for an address - but that signature is only as good as the code doing signature building and the quality of the supplied data.

The algorithm would need to understand ordering, abbreviations, perhaps even common typos.

David Heffernan · January 29, 2021

Yeah, that's not a hash. And what you want (known max length) is clearly impossible. If you could store arbitrary data in a lossless way in a data of a fixed length, then well, you can see where I am going.

David Heffernan · January 29, 2021

1 minute ago, Lars Fosdal said:

It looks like he wants to use a hash as a signature for an address - but that signature is only as good as the code doing signature building and the quality of the supplied data.

The algorithm would need to understand ordering, abbreviations, perhaps even common typos.

What is a signature? All these terms being bandied about!!

david_navigator · January 29, 2021

I think I'm using the term HASH to refer to a checksum and was kind of hoping that different addresses would give different checksums.
Kind of how What3Words works.

Lars Fosdal · January 29, 2021

Signature - a unique footprint of data

Hash - algorithmic calculation on a signature

Lars Fosdal · January 29, 2021

4 minutes ago, david_navigator said:

I think I'm using the term HASH to refer to a checksum and was kind of hoping that different addresses would give different checksums.
Kind of how What3Words works.

A major difference here is that a coordinate in What3Words is strictly defined.

A signature built from combined data from external sources is far more difficult.

Consider '1688 High St.' vs '1688 High Street' as an example.

david_navigator · January 29, 2021

3 minutes ago, Lars Fosdal said:

Consider '1688 High St.' vs '1688 High Street' as an example.

In this particular scenario '1688 High St.' & '1688 High Street' would represent two distinctly different locations and thus two different records in the database. Even '1688 High St' (no period) would represent a third location.

Lars Fosdal · January 29, 2021

Even though they may refer to the same physical address?

In a warehouse management system, unique client identities and their addresses is a continuous source of "entertainment".

david_navigator · January 29, 2021

3 minutes ago, Lars Fosdal said:

Even though they may refer to the same physical address?

Yes. The system that sends the address data is pretty good at making sure that there isn't more than one version of an address for a physical location and for those that did slip though it would be acceptable to have "duplicate" data in this table.

David Heffernan · January 29, 2021

1 hour ago, david_navigator said:

different addresses would give different checksums

Two different values can have the same checksum. Read my previous post again. What you are expecting is impossible. I mean it. Impossible.

Lars Fosdal · January 29, 2021

There is one way that can avoid duplicate "hashes" or in this case - ensure a unique key - but it is not super performant, since lookup a database is expensive - but it will give you a fixed length unique identity for a signature.

Create a signature table in a DB and add a text field for the signature and a constraint for it to be unique.

Add an identity column as either an auto incremented int field, or an auto-generated GUID field.

The identity column will be your unique "hash".

Edit: If the "hash" will be exposed to third party - the GUID is recommended - since you cannot easily inject randoms GUID to access data - as compared to a sequential id which are predictable.

A.M. Hoornweg · January 29, 2021

Can you use the postal code to identify the town and street?

In the Netherlands we have a system where the postal code combined with the house nr uniquely identifies an address.

For example, enter "Netherlands 2594 BD 10" in google Maps and you'll see where our king lives.

david_navigator · January 29, 2021

1 hour ago, A.M. Hoornweg said:

Can you use the postal code to identify the town and street?

Sadly not. In the USA Zip codes are not as precise as Netherlands or UK, as they refer to the sorting & delivery district, not the street.

Darian Miller · January 29, 2021

53 minutes ago, david_navigator said:

Sadly not. In the USA Zip codes are not as precise as Netherlands or UK, as they refer to the sorting & delivery district, not the street.

Also, zipcodes in the U.S. change all the time. Your best bet is to standardize the address and use that. (120,N,Franklin,St,Chicago,IL,USA) The standardization doesn't have to be perfect, just consistent. Just know that it won't ever be 100% correct and you will eventually end up with duplicates with differently formatted addresses.

Otherwise just create a full covering index. Have an address table with a GUID as Primary Key and your 6 address fields. Create an index on your 6 address fields as varchars.

Create a values table with your source/destination address guids and then your 10 floats.

To optimize have a small Source Address table, and a larger Destination Address table.

Edited January 29, 2021 by Darian Miller

Remy Lebeau · January 29, 2021

2 hours ago, Darian Miller said:

Your best bet is to standardize the address and use that. (120,N,Franklin,St,Chicago,IL,USA) The standardization doesn't have to be perfect, just consistent. Just know that it won't ever be 100% correct and you will eventually end up with duplicates with differently formatted addresses.

I can tell you from personal experience that parsing US addresses is not easy or fun. I have an application I worked on, which includes a feature to gather real-time address data from various PBX systems and then standardize the components of those addresses into a database table, similar to what you have shown above. That project has over 150 parsers for all kinds of different ways that PBXs format their address data. And when a customer happens to identify a new PBX/format that is not covered by the existing parsers, it usually takes me about 3 days to identify the new format (which is rarely documented), write a parser for it, and test it against not only the formatted data that is expected, but also all kinds of corner-case formatting variations that inevitably crop up in production. Because people can't seem to input similar addresses with any amount of reliable consistency. Or even stupid things, like putting the Town name in the Address1/2 field, or the County in the Town field, or even the whole address as a single text in the Address1 field. It happens.

Edited January 29, 2021 by Remy Lebeau

Darian Miller · January 29, 2021

16 minutes ago, Remy Lebeau said:

I can tell you from personal experience that parsing US addresses is not easy or fun. I have an application I worked on, which includes a feature to gather real-time address data from various PBX systems and then standardize the components of those addresses into a database table, similar to what you have shown above. That project has over 150 parsers for all kinds of different ways that PBXs format their address data. And when a customer happens to identify a new PBX/format that is not covered by the existing parsers, it usually takes me about 3 days to identify the new format (which is rarely documented), write a parser for it, and test it against not only the formatted data that is expected, but also all kinds of corner-case formatting variations that inevitably crop up in production. Because people can't seem to input similar addresses with any amount of reliable consistency. Or even stupid things, like putting the Town name in the Address1/2 field, or the County in the Town field, or even the whole address as a single text in the Address1 field. It happens.

I handled address parsing for many years. There was one parsing routine which took a single line street address as input and chopped it into pieces. One source of input was hundreds of thousands of consumers would type their name/address into a web form every month and we imported each into a database with the street number, street name, street type, direction, PO Box, RR, Apartment... you can get it mostly correct, but perfection isn't possible. We pulled many millions of credit reports using that data, so it was pretty good and rarely completley failed. The data was also used for mailings, which we corrected with the NCOA database. We also compared to the OFAC/SDN list and some other lists... seems like every different address list has its quirks.

That single routine was written in Clipper in the mid 90's, converted to XBase++ in the late 90's, and then converted to Delphi early 2000's. It's not pretty code after the conversions and many tweaks. But perfection wasn't ever the target - if they typed in a town name into the address field, then so be it. That single routine may be the cause of many Americans having an alternate address listed on their credit report. (We pulled a few hundred million credit reports with that code over the years.) But the credit agencies handling of addresses is also extremely funky since they have been dealing with this problem for decades and they get data from so many different sources. They have a specific list of street types that are supported and each agency has a slightly different list. They take your name/address/ssn inputs and do a best guess-match to which consumer files to include in the credit report output (surprisingly, all the various pieces of the credit file are mainly stored by name/address and not SSN.) You can get a slightly different credit report depending on slight variation of your inputs. I've seen that happen many times and trying to explain it to an end user is always fun.

American address data sucks. Even the printed barcode id is not a permanent street address identifier. You simply have to establish your acceptable level of suckiness and stick with it.

david_navigator · January 29, 2021

Thanks for all the replies and suggestions. Just went for a walk with the dog and realised that I don't need the address info at all, I can index the other data fields (5 x integers) to find the record I need 🙂

Amazing how doing something different often brings alternative solutions to the forefront.

Darian Miller · January 29, 2021

44 minutes ago, david_navigator said:

Thanks for all the replies and suggestions. Just went for a walk with the dog and realised that I don't need the address info at all, I can index the other data fields (5 x integers) to find the record I need 🙂

Amazing how doing something different often brings alternative solutions to the forefront.

LOL. Feed that dog an extra snack.

david_navigator · January 30, 2021

I think it's the sharing effect. I can't think of the number of times I've described an issue in a NG or a forum and realised the solution before I pressed submit.

Darian Miller · January 30, 2021

Rubber Duck Debugging: https://en.wikipedia.org/wiki/Rubber_duck_debugging

Sign In

Hashing Street Addresses ?

Recommended Posts

david_navigator 12

Share this post

Link to post

David Heffernan 2481

Share this post

Link to post

david_navigator 12

Share this post

Link to post

David Heffernan 2481

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Lars Fosdal 1922

Share this post

Link to post

David Heffernan 2481

Share this post

Link to post

David Heffernan 2481

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Lars Fosdal 1922

Share this post

Link to post

Lars Fosdal 1922

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Lars Fosdal 1922

Share this post

Link to post

david_navigator 12

Share this post

Link to post

David Heffernan 2481

Share this post

Link to post

Lars Fosdal 1922

Share this post

Link to post

A.M. Hoornweg 161

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Darian Miller 391

Share this post

Link to post

Remy Lebeau 1656

Share this post

Link to post

Darian Miller 391

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Darian Miller 391

Share this post

Link to post

david_navigator 12

Share this post

Link to post

Darian Miller 391

Share this post

Link to post

Create an account or sign in to comment

Create an account