September 1st, 2010 by Holger Wandt

As I was sitting on a terrace in Barcelona during my recent holiday, I found a copy of the Independent, the well-known British newspaper. Having all the time in the world, I started reading and came across an article about the North Yorkshire Police storing data on more than 180,000 people, including their date of birth and ethnicity. The vast majority of these people had given this information voluntarily and had not committed any crime.
When privacy campaigners questioned the need for compiling such a database, a police spokesperson answered: “The system is used by many police forces in the UK and internationally to record all information relevant to policing, everything from details of arrested individuals, suspects, victims, witnesses and sources of information as well as addresses, phone numbers and vehicles. The information logged and cross-referenced in the system is absolutely vital to allow us to provide the effective policing service that the people of North Yorkshire and the City of York demand.”
I think that this is a very dangerous comment. What about the possibility of mixing up data of witnesses and criminals? How do the police forces create a unique view on their “customers”? What will be the consequences of so-called “database errors”?
Of course I understand that police forces all over the world need information to do their work properly and to prevent crime and other undesirable behaviour. But reading a comment like the above, I wonder whether law enforcement agencies are truly aware of the essential role of data quality in modern police work.
Tags: database errors, information gathering, law enforcement, police, unique view
Posted in Data Quality | No Comments »
July 2nd, 2010 by Holger Wandt

When the Thomas family from Ohio embarked on a recent trip from Cleveland to Minneapolis, they were in for a huge but unpleasant surprise. It appeared that 6-year-old Alyssa Thomas’ name was on the Homeland Security no-fly list, a list that is used to prevent individuals with known or suspected ties to terrorism from flying. The girl’s father, Santhosh Thomas, states that the worst thing his daughter has ever done is probably being mean to her sister, but that this should hardly be a matter for the Department of Homeland Security.
The Thomases were eventually allowed to fly that day, but they were told to contact Homeland Security to clear up the matter. Alyssa has now received a letter from the government, notifying the six-year-old that nothing will be changed and that they will neither confirm nor deny any information they have about her or someone else with the same name. Read the rest of this entry »
Tags: anti-terror, Homeland Security, name check, name matching, no-fly lists
Posted in Data Governance, Data Quality | 1 Comment »
April 9th, 2010 by Holger Wandt

An increasing number of companies have to deal with data from the world’s fastest emerging economy: China. And the big question in this issue is of course: How can we compare these “strange” Chinese characters with our own writing set?
The grammar and character set of our Western alphabetic languages (such as English, French, Dutch or German) differ tremendously from those of Mandarin Chinese, which is the language spoken by most people in the People’s Republic of China and by many abroad. Mandarin is a tonal language with an ideographic character set. Almost all characters have a semantic and a phonetic component. The pitch of the pronunciation ultimately determines the meaning.
Complicated? Definitely. But what about the other way around? Have you ever thought about the difficulties the Chinese have to face when trying to convert their language into meaningful English?
This phenomenon is sometimes hilariously illustrated by the many public signs in China that are meant to inform foreign visitors or to help them find their way around.
This is truly a delightful side effect of internationalization. … Read the rest of this entry »
Tags: Chinese characters, fault-tolerance, internationalisation, internationalization, language, matching
Posted in Data Quality | No Comments »
March 9th, 2010 by Vincent van Hunnik

Have you ever tried to get contact details in and out of a CRM system, and ended up with a bigger mess? I have. The concept is easy: store all information about prospects and customers in one system, allowing you to have your communication efforts streamlined.
Reality, however, is harder: contact details entered on your website should be fed to the system automatically. Sending your periodic newsletter should be based on the details in your CRM system. Not to mention dealing with information on bounces. Integrating your CRM system(s) with mass mailing, campaign management and self service portals is helpful, but for some reason the major means of transporting lead and customer information still seems to be Excel… Leaving you with the necessity to mass import results, new contacts and changed information. Read the rest of this entry »
Tags: campaign management, contact details, CRM-system, mass mailing, self service
Posted in Data Quality, Data Services | No Comments »
March 2nd, 2010 by Jacques Baron

Everybody who has ever been on holiday in France has probably had a neighbour named Gaston, Jacques, Louis, Claire or Françoise. We are used to those first names; they evoke the “France profonde”: sleepy villages at the end of a road, films by Pagnol or Rohmer, walks along the Seine in the shadow of “Notre Dame” in the spring, coffee on a terrace on the Boulevard Saint-Germain where an obsequious garçon, named Marcel, is looking at your girlfriend or wife in a way you do not really appreciate. This particular image of France is in danger. In a few years our entire frame of reference could have disappeared.
Nowadays French parents let their imagination run free when choosing first names for their children. Looking at recent entries in the civil registry, you will find rather unusual first names like Bulle, Héribert, Loeva, Hermès, Evolène, and Argan.
These first names have all kinds of origins. For example, they can be a combination of first names (Timéo, which is derived from Timothée and Théo), or they can be different spellings of known first names (Lilou becomes Lee-Lou). We can also find names from Greek or Celtic mythology or even from literature, like Arwen, a character from the novel Lord of the Rings. Read the rest of this entry »
Tags: civil registry, French names, processing French data
Posted in Data Quality, Data Services | 1 Comment »
February 15th, 2010 by Winfried van Holland

The Norwegian Fødselsnummer (birth number) is an 11-digit number with 2 control digits. The 10th digit is a control digit calculated with a weighted modulo 11 variant over the first 9 digits. The 11th digit is a control digit calculated with another weighted modulo 11 variant over the first 9 digits combined with the 10th digit.
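A minimal sketch of the two control digits described above. The post does not list the weights, so the two weight tuples below are an assumption taken from the commonly documented scheme, and the function names are purely illustrative. A computed control value of 10 means that the 9-digit prefix cannot be issued at all.

```python
# Assumption: weights as commonly documented for the Fødselsnummer;
# the post itself does not list them.
K1_WEIGHTS = (3, 7, 6, 1, 8, 9, 4, 5, 2)
K2_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def _is_ascii_digits(text):
    """True if text is non-empty and consists only of the characters 0-9."""
    return text != "" and all(c in "0123456789" for c in text)


def _control_digit(digits, weights):
    """Return the modulo 11 control digit, or None if it would be 10."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    control = (11 - remainder) % 11
    return None if control == 10 else control


def fnr_control_digits(first_nine):
    """Compute (10th, 11th) digit for ddmmyy + individual number, or None."""
    if len(first_nine) != 9 or not _is_ascii_digits(first_nine):
        raise ValueError("expected exactly 9 digits")
    digits = [int(c) for c in first_nine]
    k1 = _control_digit(digits, K1_WEIGHTS)
    if k1 is None:
        return None
    # The second control digit also covers the first control digit.
    k2 = _control_digit(digits + [k1], K2_WEIGHTS)
    if k2 is None:
        return None
    return k1, k2


def is_valid_fnr(number):
    """Check the length, the character set and both control digits."""
    if len(number) != 11 or not _is_ascii_digits(number):
        return False
    expected = fnr_control_digits(number[:9])
    return expected == (int(number[9]), int(number[10]))
```

For the made-up prefix 010170001, `fnr_control_digits` returns `(0, 8)`, so `is_valid_fnr("01017000108")` is true while any other final digit fails the check.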
As in other countries, this number is based on the date of birth. The first 6 digits represent the birth date as “ddmmyy”. The problem with a 6-digit date is that you cannot identify the century: is a Fødselsnummer starting with 121009 someone born in 1909 or 2009? The Norwegian government has solved this by dividing the following 3 digits (the individual number) into ranges, each representing a certain era. If you were born between 1854 and 1899, your individual number must be between 500 and 749; if you were born between 1900 and 1999, it lies between 000 and 499; and for those born between 2000 and 2039, it lies between 500 and 999. There are some exceptions for those with an individual number between 900 and 999. Read the rest of this entry »
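The century rule can be sketched as a small lookup on the two-digit year and the individual number. The three main ranges come straight from the text above; the post does not spell out the 900-999 exception, so the branch that maps it to 1940-1999 is an assumption based on the commonly documented rule. The function name is illustrative.

```python
def fnr_birth_year(number):
    """Infer the 4-digit birth year from the first 9 digits, or return None."""
    yy = int(number[4:6])
    individual = int(number[6:9])
    if individual <= 499:
        return 1900 + yy          # 000-499: born 1900-1999
    if individual <= 749 and yy >= 54:
        return 1800 + yy          # 500-749: born 1854-1899
    if yy <= 39:
        return 2000 + yy          # 500-999: born 2000-2039
    if individual >= 900:
        # Assumption: the 900-999 exception covers births in 1940-1999.
        return 1900 + yy
    return None                   # combination not covered by any range
```

With the prefix from the text, `fnr_birth_year("121009123")` gives 1909 and `fnr_birth_year("121009623")` gives 2009: the same six date digits, told apart only by the individual number.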
Tags: personal identification number
Posted in Data Quality | 1 Comment »
February 11th, 2010 by Winfried van Holland

Professional matching engines are becoming more and more intelligent. Within Human Inference, we also see that our matching techniques are capable of using more and more intelligence, and needless to say we incorporate this intelligence in our engines in order to adapt to the way that humans do their matching.
Traditional data quality or matching engines were based on atomic string comparison functions like match-codes, phonetic comparison, Levenshtein string distance, n-gram comparisons or similar functions. These kinds of functions are relatively easy to implement and to use, although a significant amount of plumbing is needed to get reasonable results. Open source projects like the Lucene search engine, and variants, provide a solid and proven set of these functions. The drawback of these functions is that it’s not always clear which function to use for which purpose. An even larger issue is the fact that these low-level DQ functions cannot distinguish between apples and oranges: you end up comparing family names with street names. We still see that BI vendors, for example, claim to provide data quality functionality while they only provide these atomic comparisons. Read the rest of this entry »
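To make the "atomic" nature of these functions concrete, here is a minimal sketch of two of them in plain Python: a Levenshtein distance and a bigram similarity (Dice coefficient over character bigrams). These are generic textbook versions written for illustration, not the implementation of any particular engine or of Lucene.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigram_similarity(a, b):
    """Dice coefficient over the sets of character bigrams, between 0.0 and 1.0."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

Both functions only ever see two strings. `levenshtein("Baker", "Baker")` is 0 whether the first value is a family name and the second a street name or not, which is exactly the apples-and-oranges problem: knowing which fields are sensible to compare has to come from somewhere else.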
Tags: apples and oranges, atomic string comparison, cultural differences, information retrieval, intelligent matching methods, Lucene
Posted in Data Quality | No Comments »
February 1st, 2010 by Winfried van Holland
The Finnish national personal identification number, the Henkilötunnus (also known as Hetu or Ht), has the following format: ddmmyyc999C. For details on how to calculate the control character, I refer to the overview blog on National Identification Numbers.
Validating the Hetu 270368A172X shows that it is indeed a correct number. The number 270368172 indeed yields 29 for the modulo 31 proof, which is represented by the control character “X” in the checksum list. The number shows that this is the 86th girl born on the 27th of March 2068.
The latter is exactly the starting point for the discussion on validity. Although the number itself is well formed and passes all the automatic checks, dealing with this number in a data quality assessment will raise your digital eyebrow. In the data quality world we would nowadays say that this Hetu is wrong, that it cannot be correct.
So always use a bit of human inference when dealing with Finnish national personal identification numbers.
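The difference between "well formed" and "plausible" can be sketched in a few lines. The control character list and the century signs (`+`, `-`, `A`) below are the commonly documented ones and are assumptions here, since the post refers to another entry for those details; they do reproduce the example above, where remainder 29 maps to “X”. The function names are illustrative.

```python
from datetime import date

# Assumption: commonly documented control characters (G, I, O, Q, Z are skipped).
CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"
# Assumption: century signs for the "c" position in ddmmyyc999C.
CENTURY = {"+": 1800, "-": 1900, "A": 2000}


def parse_hetu(hetu):
    """Return (birth date, individual number) or raise ValueError."""
    if len(hetu) != 11 or hetu[6] not in CENTURY:
        raise ValueError("unexpected format")
    digits = hetu[:6] + hetu[7:10]
    if not all(c in "0123456789" for c in digits):
        raise ValueError("unexpected format")
    # Modulo 31 proof over the 9 digits (date plus individual number).
    if CHECK_CHARS[int(digits) % 31] != hetu[10]:
        raise ValueError("control character mismatch")
    day, month, yy = int(hetu[0:2]), int(hetu[2:4]), int(hetu[4:6])
    birth = date(CENTURY[hetu[6]] + yy, month, day)  # rejects impossible dates
    return birth, int(hetu[7:10])


def is_plausible(hetu, today):
    """A well-formed Hetu is only plausible if the birth date is not in the future."""
    birth, _ = parse_hetu(hetu)
    return birth <= today
```

`parse_hetu("270368A172X")` passes every automatic check and returns 27 March 2068 with individual number 172, yet `is_plausible("270368A172X", date(2010, 2, 1))` is false. The reference date is passed in explicitly so that the check stays reproducible.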
Tags: Finland, Human Inference, National Identification Numbers, personal identification number
Posted in Data Quality | 1 Comment »
January 19th, 2010 by Winfried van Holland

The national personal identification number in the Netherlands is called the Burgerservicenummer (abbreviated as BSN, introduced in November 2007). It is a 9-digit number that can be validated by a weighted 11-proof. Basically, each digit position has a weighting factor, and the sum of the digits multiplied by their weights must be exactly divisible by 11.
A nice effect of this weighted 11-proof is that any 2 distinct numbers differ in at least 2 digits. You need to make at least 2 changes to get from one number to another: it might be that there are 2 completely different digits (e.g., 112682765 and 112682777) or that you need to swap one digit and change another (e.g., 427096509 and 427096510).
Mathematically, there can still be two consecutive numbers like 427096169 and 427096170, which still need 2 changes to get from one to the other. Read the rest of this entry »
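A minimal sketch of the weighted 11-proof. The post does not list the weights; the tuple below (9 down to 2, with -1 for the last digit) is the commonly documented one and is an assumption here, although it does accept every example number mentioned above. The function names are illustrative.

```python
# Assumption: commonly documented BSN weights; the last digit counts as -1.
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_bsn(number):
    """Weighted 11-proof: the weighted digit sum must be divisible by 11."""
    if len(number) != 9 or not all(c in "0123456789" for c in number):
        return False
    total = sum(int(c) * w for c, w in zip(number, BSN_WEIGHTS))
    return total % 11 == 0


def digit_differences(a, b):
    """Number of positions in which two equally long numbers differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

Because 11 is prime and no weight is a multiple of 11, changing a single digit always changes the weighted sum by something that is not divisible by 11. That is why two valid numbers can never differ in just one position, even when they are consecutive: `digit_differences("427096169", "427096170")` is 2.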
Tags: 11-proof, Personal Identification Numbers, statistics
Posted in Data Quality | 1 Comment »
January 19th, 2010 by Winfried van Holland
Within Europe there is no such thing as a European Social Security Number or European Identification Number. A lot of countries have their own system, and other countries are struggling to get a system into place.
The struggle of some countries has to do with historical reasons and with privacy aspects. Unique identification is not always used in favour of the community. And some of the identification systems in use contain privacy-sensitive information, such as date of birth, gender and/or place of birth, while older systems might even contain religious or other privacy-sensitive information.
A wide range of countries use the combination of date of birth, gender identification and the political region where you were born. In such a mechanism it is most common that part of the identification number is a 2-digit or 3-digit serial number to identify the unique male or female born on a specific date (or born in a specific month). Some countries assign odd serial numbers to males and even ones to females. Bulgaria is the only one that wants “odd” females. Some countries prefer to divide by range (0-499 male, 500-999 female). And some countries like Norway make nice combinations to include the century of birth or period of birth in the serial number. Read the rest of this entry »
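The three gender conventions mentioned above can be sketched as one small function. The scheme labels (`odd_male`, `odd_female`, `range`) are hypothetical names chosen for this illustration and do not correspond to any official terminology; the range split assumes a 3-digit serial number.

```python
def gender_from_serial(serial, scheme):
    """Derive the gender from a serial number under one of three conventions."""
    if scheme == "odd_male":       # odd serial numbers for males, even for females
        return "male" if serial % 2 == 1 else "female"
    if scheme == "odd_female":     # the reverse convention, as described for Bulgaria
        return "female" if serial % 2 == 1 else "male"
    if scheme == "range":          # 0-499 male, 500-999 female
        return "male" if serial <= 499 else "female"
    raise ValueError("unknown scheme: " + scheme)
```

The same serial number, say 173, reads as male under `odd_male` and as female under `odd_female`, so the country convention has to be known before the field can be interpreted.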
Tags: identification, privacy, privacy-sensitive, social security number, unique identification
Posted in Data Quality | 2 Comments »