The obfuscated address contest

Programmers sometimes organize contests for writing code that a compiler understands perfectly, but that people find very difficult to read.

When working on products for address standardisation, one can discover an interesting variant: people sometimes write – unintentionally, I suppose – addresses in such a way that they are rather understandable for people, but very difficult to process for computers.

Consider for example this street name:

Kerkchoosteeg hoogl

The official version is:

Hooglandsekerk-choorsteeg (‘high land church – choir alley’)

The written version contains several errors:

  • A hyphen is missing.
  • One ‘r’ is missing.
  • One word (‘Hooglandsekerk’) has been split up into two words.
  • The first word (‘Hooglandse’) is written at the end.
  • One word is abbreviated (‘hoogl’).

The first two errors are not very special, but the last three can only be discovered together: the split of ‘hooglandsekerk’ into two words can only be detected if, at the same time, it is understood that the left part has been abbreviated and moved to the end.
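One way to picture this combined search is the minimal Python sketch below. It is an illustration under stated assumptions, not how any particular product works: the function names and the value of ABBREVIATION_COST are hypothetical. Every ordering of the written words is tried, each word is allowed to cover any stretch of the official name (ignoring hyphens and spaces), and a word that is a prefix of its stretch is treated as a cheap abbreviation.

```python
from itertools import permutations

ABBREVIATION_COST = 0.5  # hypothetical weight: an abbreviation counts as half a typo


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def token_cost(token, segment):
    """Cost of reading `token` as `segment`: typos, or a cheap abbreviation."""
    cost = float(edit_distance(token, segment))
    if segment.startswith(token) and segment != token:
        cost = min(cost, ABBREVIATION_COST)
    return cost


def align(tokens, target):
    """Cheapest way for the tokens, in this order, to cover the whole target."""
    if len(tokens) == 1:
        return token_cost(tokens[0], target)
    best = float("inf")
    # Leave at least one character of the target for every remaining token.
    for k in range(1, len(target) - len(tokens) + 2):
        best = min(best,
                   token_cost(tokens[0], target[:k]) + align(tokens[1:], target[k:]))
    return best


def match_cost(written, official):
    """Lowest cost over all word orders; hyphens and spaces in the official name are ignored."""
    target = "".join(ch for ch in official.lower() if ch.isalpha())
    tokens = written.lower().split()
    return min(align(list(order), target) for order in permutations(tokens))


print(match_cost("Kerkchoosteeg hoogl", "Hooglandsekerk-choorsteeg"))
```

With these assumed weights the best reading moves ‘hoogl’ to the front, treats it as an abbreviation of ‘hooglandse’, and pays for one missing ‘r’. Trying all permutations is only workable for short street names; it is used here to keep the sketch small.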

Similar problems occur in:

O Lieve Vrouweschutstr

Official:

O.L.V. Schutsstraat

‘O.L.V.’ is a common abbreviation of ‘Onze Lieve Vrouwe’ (‘Our Lady’). To determine that this abbreviation plays a role here, it must be understood that the word ‘Vrouwe’ is part of ‘Vrouweschutstr’ and that the rest of this word matches ‘Schutsstraat’ fairly well, which in turn is only possible if the missing ‘s’ and the abbreviation ‘str’ (for ‘straat’) are handled correctly.

An extra complication occurs if there are multiple candidates:

rue dendicolle

This resembles two official streets:

RUE HENRI COLLET
RUE JEAN RENAUD DANDICOLLE

The first candidate has four differences:

  • Three typos (‘d’-‘h’, ‘d’-‘r’, missing ‘t’).
  • One missing space.

The second candidate has three differences:

  • Two missing first names.
  • One typo (‘e’-‘a’; in this context these letters are pronounced the same in French).

The second street matches clearly better; this can only be determined if the errors in the first case are considered more severe than the errors in the second case.
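The idea of weighting errors by severity can be sketched in Python as follows. All weights and names here are assumptions chosen for illustration: a substitution between letters that sound alike is made cheaper than an ordinary typo, and each leading word of the official name that the writer left out (such as a first name) gets a small fixed penalty. The street type ‘rue’ is assumed to have been stripped already, and ‘e’/‘a’ are treated as sounding alike everywhere, which is a simplification.

```python
# Hypothetical weights, chosen only for illustration.
PHONETIC_PAIRS = {frozenset("ea")}  # simplification: 'en' and 'an' sound alike in French
PHONETIC_COST = 0.3                 # typo that keeps the pronunciation
EDIT_COST = 1.0                     # any other inserted, deleted or replaced character
DROPPED_WORD_COST = 0.5             # leading word of the official name left out


def weighted_distance(a, b):
    """Levenshtein distance in which sound-alike substitutions are cheaper."""
    prev = [j * EDIT_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * EDIT_COST]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif frozenset((ca, cb)) in PHONETIC_PAIRS:
                sub = PHONETIC_COST
            else:
                sub = EDIT_COST
            cur.append(min(prev[j] + EDIT_COST,      # delete ca
                           cur[j - 1] + EDIT_COST,   # insert cb
                           prev[j - 1] + sub))       # match or substitute
        prev = cur
    return prev[-1]


def score(written, official):
    """Lowest cost over all ways of skipping leading words of the official name."""
    words = official.lower().split()
    written = written.lower()
    return min(
        skipped * DROPPED_WORD_COST
        + weighted_distance(written, " ".join(words[skipped:]))
        for skipped in range(len(words))
    )


candidates = ["HENRI COLLET", "JEAN RENAUD DANDICOLLE"]
for name in sorted(candidates, key=lambda c: score("dendicolle", c)):
    print(name, score("dendicolle", name))
```

Under these assumed weights ‘JEAN RENAUD DANDICOLLE’ scores about 1.3 and ‘HENRI COLLET’ 4.0, so the second street is ranked first. With a plain count of differences the two candidates would be much closer (three against four).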

If an address consists of many elements, there are even more possibilities to make things difficult:

30 FERMONT ROAD
199 CANARY WHARF
E33 9SF
E33 9SF LONDON

Official:

Flat 199
Canary Wharf
30 Fairmont Road
LONDON
E33 9SA

  • Street and house number are on the first line, but should be on the third line.
  • ‘Fermont’ must be written as ‘Fairmont’; the pronunciation is not equal, but fairly similar.
  • The address contains two postcodes; neither is in the right position and both are incorrect.
  • ‘199’ should be ‘Flat 199’ and must be written on a separate line.
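A first step towards handling misplaced fields is to recognise an element by its shape rather than by its position. The Python sketch below does this for the postcodes only. The regular expression is a simplified approximation of the UK postcode format and the function name is hypothetical; both are assumptions made for this illustration.

```python
import re

# Simplified approximation of a UK postcode (assumption, not the full official format).
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def split_postcodes(lines):
    """Collect postcode-like tokens from any line; return them with the remaining text."""
    postcodes, rest = [], []
    for line in lines:
        upper = line.upper()
        postcodes.extend(POSTCODE.findall(upper))
        remainder = POSTCODE.sub("", upper).strip()
        if remainder:
            rest.append(remainder)
    # Duplicates are collapsed: the same postcode written twice is one piece of evidence.
    return sorted(set(postcodes)), rest


written = ["30 FERMONT ROAD", "199 CANARY WHARF", "E33 9SF", "E33 9SF LONDON"]
print(split_postcodes(written))
```

This only separates the postcode from the other lines. The extracted ‘E33 9SF’ still differs from the official ‘E33 9SA’, so it would have to be compared with some tolerance, and the remaining lines still need to be matched against street, building and town in any order.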

A product that can recognize the error situations shown in these examples must be able to switch constantly between different error types. Searching for displaced words or address fields must be combined with resolving abbreviations, deciding whether to accept typing errors, and distinguishing between typos that lead to a different pronunciation and typos that don’t.
A product like this is never completely finished; therefore, when developing, it is good to start with the most common errors and the most common combinations of errors, and to add further error situations in later releases. Investigating examples like these at an early stage of development helps to set up an architecture that is ready for further development.

Anyone got other nice examples?

(The addresses are real life examples; only the British example has been changed for privacy reasons, without changing the errors.)


2 Responses to “The obfuscated address contest”

  1. Pim Hermans says:

    In general nice examples but there are some textual errors:

    The second street matches clearly better; this can only be determined if the errors in the second case are considered more severe than the errors in the second case

    Two times the “second case” is used. Seems to be incorrect.

    Gr,

    Pim

  2. Emil van den Berg says:

    You are right, Pim; I changed the text accordingly.
    Thanks for close reading :-)

    Greets,
    Emil van den Berg
