Census 2016: why the privacy assurances from the ABS are not good enough

A concerned user of the online link aggregator Reddit recently highlighted some issues with the way the Australian Bureau of Statistics (ABS) intends to link your personal data across other government databases. In addition, an article posted today by the former Deputy Privacy Commissioner of NSW, Anna Johnston, outlines her reasons for boycotting the 2016 census. This led us to question some of the methods the ABS may intend to use, and why they are a matter of concern and of public interest.


How is the ABS planning to connect your data?

As mentioned here, the ABS will be attempting to link your census data to other government databases with a Statistical Linkage Key (SLK). We can assume for now that they are using SLK-581. This appears to be a common way for statisticians to create a semi-reliable link between datasets.

How does the SLK work?

Well, the method of generating this key is somewhat straightforward.

the concatenation of the 2nd, 3rd and 5th letters of the family name, the 2nd and 3rd letters of the given name, date of birth as a character string of the form ddmmyyyy, followed by the character ‘1’ for male and ‘2’ for female. Non-alphabetic letters in names are excluded (for example, hyphens and apostrophes), and where a name contains insufficient letters, the character ‘2’ is used as a place marker for absent key letters. The character ‘9’ is used for any other missing data so that the linkage key always has a length of 14 characters

Source

From this you will end up with a 14-character key in the format "XXXXXDDMMYYYYS": five name characters, eight date-of-birth digits and one digit for sex.

Some examples:

| Given name | Surname | Date of birth | Sex | SLK-581 | Issue |
|---|---|---|---|---|---|
| BENJAMIN | GREGORY | 9/12/30 | M | REOEN091219301 | |
| BENJAMIN | GREGORY | 9/12/30 | M | REOEN091219301 | Client has more than one client ID |
| BARBARA | BUTLER | 15/08/23 | F | UTEAR150819232 | |
| MARIA | VUTTESQUE | 15/08/23 | F | UTEAR150819232 | Two clients have same sex, components of name and date of birth |
| JIMMY | BLACK | 1/01/20 | M | LAKIM010119201 | Default date of birth used |
| MALVERN | GREY | 1/01/20 | M | RE2AL010119201 | Default date of birth used |
| JOHN | SMITH | 20/05/22 | M | MIHOH200519221 | |
| JOHN | SMITH | 20/05/22 | M | MIHOH200519221 | Two clients have a same name and date of birth |
| LAVINIA | WALTERS | 12/02/16 | F | ALEAV120219162 | |
| WINNY | WALTERS | 12/02/16 | F | ALEIN120219162 | Pseudonyms |
| ZU | LU | 6/06/37 | M | U22U2060619371 | Short name |
| XXXX | XXXXX | 22/11/07 | M | XXXXX221119071 | Missing name |

Source (pg 19)
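To make the rule quoted earlier concrete, here is a minimal Python sketch of it. This is our own illustration, not ABS code: the function names `slk581` and `_pick` are made up, and the handling of a completely empty name (padding with '9') and of an unknown sex is our reading of the "other missing data" clause, so treat those parts as assumptions.

```python
import re
from datetime import date
from typing import Optional, Sequence


def _pick(name: Optional[str], positions: Sequence[int]) -> str:
    """Return the letters at the given 1-based positions of a name.

    Non-alphabetic characters (hyphens, apostrophes, spaces) are dropped
    first. '2' marks a position the name is too short to fill.
    """
    # Only A-Z is kept; dropping accented letters is an assumption.
    letters = re.sub(r"[^A-Z]", "", (name or "").upper())
    if not letters:
        # Assumption: a wholly missing name is padded with '9'.
        return "9" * len(positions)
    return "".join(letters[p - 1] if p <= len(letters) else "2" for p in positions)


def slk581(given_name: Optional[str], family_name: Optional[str],
           dob: Optional[date], sex: Optional[str]) -> str:
    """Build a 14-character SLK-581 style key from name, date of birth and sex."""
    family_part = _pick(family_name, (2, 3, 5))   # 2nd, 3rd and 5th letters
    given_part = _pick(given_name, (2, 3))        # 2nd and 3rd letters
    if dob is None:
        dob_part = "9" * 8                        # assumption: missing date
    else:
        dob_part = f"{dob.day:02d}{dob.month:02d}{dob.year:04d}"  # ddmmyyyy
    # '1' for male, '2' for female; '9' for anything else is an assumption.
    sex_part = {"M": "1", "F": "2"}.get((sex or "").upper(), "9")
    return family_part + given_part + dob_part + sex_part


print(slk581("JIMMY", "BLACK", date(1920, 1, 1), "M"))   # LAKIM010119201
print(slk581("MALVERN", "GREY", date(1920, 1, 1), "M"))  # RE2AL010119201
```

The two printed keys match the Jimmy Black and Malvern Grey rows in the table above. Note how little of the name survives: the key is built from five letters, a date and one digit.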

There are a few exceptions, but generally this is how it works. You can see now why leaving out your name helps keep your data anonymous: without it, the name part of the key cannot be built. This is simple to do on the paper form but difficult in the online version. Some have suggested that, to get past the validation in the online form, you simply enter “WITHHELD” as your name, as this is not necessarily providing false information. We cannot confirm whether this is adequate and suggest you seek legal counsel should you choose to pursue this path.

But what about security?

The ABS has stated that it will hash your SLK to keep it secure. In computing, hashing runs an input through an algorithm to produce a new, fixed-length string of characters. A cryptographic hash is designed to be one-way: there is nothing to decrypt, so the original text cannot be computed back from the hash.

For example, if we run the rather outdated MD5 hashing algorithm (the ABS should at least be using SHA-2) over Jimmy Black's SLK of "LAKIM010119201" (see above), we get an MD5 hash of "71515F908B37834393062176FF72A11F" (online hash generators). This will be the same every time you run the hashing algorithm over Jimmy's SLK.
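If you would rather not paste an SLK into an online generator, the same thing can be done locally. The sketch below uses Python's standard `hashlib` module; the helper name `hash_slk` is ours, and which algorithm the ABS actually uses is not something we know, so MD5 and SHA-256 are shown only for comparison.

```python
import hashlib


def hash_slk(slk: str, algorithm: str = "sha256") -> str:
    """Return the upper-case hex digest of an SLK under the named algorithm."""
    return hashlib.new(algorithm, slk.encode("utf-8")).hexdigest().upper()


slk = "LAKIM010119201"  # Jimmy Black's SLK from the table above

# Running the same input twice always gives the same digest.
print("MD5:    ", hash_slk(slk, "md5"))
print("MD5:    ", hash_slk(slk, "md5"))
print("SHA-256:", hash_slk(slk, "sha256"))
```

The digest length depends only on the algorithm (32 hex characters for MD5, 64 for SHA-256), not on the input.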

So, if you can guess Jimmy's SLK you can recreate his hash. No need to decrypt anything. However, to protect against this you can add what is known as a salt. By adding a randomly generated, and of course very secret, salt to the SLK before hashing, you generate a hash that can't easily be recreated.
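To show how little guessing an unsalted hash needs, here is a rough sketch. It assumes the person guessing already knows the five name characters and that the hash is plain, unsalted SHA-256 of the SLK; both are assumptions for illustration, and the function names are ours.

```python
import hashlib
from datetime import date, timedelta
from typing import Optional


def unsalted_hash(slk: str) -> str:
    """Plain SHA-256 of the SLK, with no salt (an assumption for this sketch)."""
    return hashlib.sha256(slk.encode("utf-8")).hexdigest()


def guess_slk(target_hash: str, name_part: str,
              first_year: int, last_year: int) -> Optional[str]:
    """Try every date of birth in the range, with both sex codes, for a known name part."""
    day = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while day <= end:
        dob = f"{day.day:02d}{day.month:02d}{day.year:04d}"
        for sex in ("1", "2"):
            candidate = name_part + dob + sex
            if unsalted_hash(candidate) == target_hash:
                return candidate
        day += timedelta(days=1)
    return None


# Pretend all we hold is the hash of Jimmy Black's SLK.
target = unsalted_hash("LAKIM010119201")
print(guess_slk(target, "LAKIM", 1900, 1999))  # LAKIM010119201
```

A century of birth dates times two sex codes is roughly 73,000 candidates, which takes well under a second to hash on an ordinary laptop.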

For example, adding a random salt of "_t^5g29J;x" to Jimmy's SLK, like so: "LAKIM010119201_t^5g29J;x", gives us an MD5 hash of "81C2B9922803752CB8B2DE2CC847D85C".
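A minimal sketch of salting, assuming (as in the example above) that the salt is simply appended to the SLK before hashing. We use SHA-256 rather than MD5, and `salted_hash` is our own helper name; how the ABS combines salt and key is not something we know.

```python
import hashlib
import secrets


def salted_hash(slk: str, salt: str) -> str:
    """Hash the SLK with the salt appended, mirroring the example above."""
    return hashlib.sha256((slk + salt).encode("utf-8")).hexdigest().upper()


slk = "LAKIM010119201"
salt = secrets.token_hex(16)  # 32 random hex characters, to be kept secret

print("without salt:", salted_hash(slk, ""))
print("with salt:   ", salted_hash(slk, salt))
# Anyone holding the same salt gets the same hash again.
print("repeatable:  ", salted_hash(slk, salt) == salted_hash(slk, salt))
```

Someone who guesses the SLK but not the salt ends up with a different hash. A keyed construction such as HMAC (Python's `hmac` module) is the more usual way to mix a secret value into a hash, but appending keeps the sketch close to the example in the text.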

For sure this makes the SLK hash a decently secure key to use for matching people across databases. But therein lies the rub. Some form of this key will be regenerated for every census, forever. It doesn't matter whether the ABS deletes your name after 18 months or 4 years. In 5, 10 or 50 years they can simply generate the key again using the same hashing algorithm with a known salt.

Now what?

As you might already see, many other government databases could generate this key, with a fairly good chance of linking your data across databases. What this means is that any government database containing enough details to generate an SLK can potentially be linked back to your full census data. Of course, we would hope that the salt used by the ABS to generate your SLK hash would never be accessible to any other government departments, thereby ensuring only the ABS will be able to create these database links, but hope isn’t really the most pragmatic way to protect your privacy.
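The linking step itself is not complicated. The sketch below joins two made-up record sets on a salted hash of the SLK. Everything in it is hypothetical: the salt, the field names and the records are invented for illustration (the SLKs are borrowed from the table above), and we assume SHA-256 with an appended salt.

```python
import hashlib
from typing import Dict, List


def linkage_key(slk: str, salt: str) -> str:
    """Salted SHA-256 of an SLK (assumed scheme, for illustration only)."""
    return hashlib.sha256((slk + salt).encode("utf-8")).hexdigest()


def link(records_a: List[dict], records_b: List[dict], salt: str) -> List[dict]:
    """Join two record lists wherever the salted hash of the SLK matches."""
    index: Dict[str, List[dict]] = {}
    for rec in records_a:
        index.setdefault(linkage_key(rec["slk"], salt), []).append(rec)
    linked = []
    for rec in records_b:
        # Two different people can share an SLK, so one key may match several records.
        for match in index.get(linkage_key(rec["slk"], salt), []):
            linked.append({**match, **rec})
    return linked


SALT = "hypothetical-shared-salt"  # made up; stands in for the secret salt
census = [
    {"slk": "LAKIM010119201", "census_field": "household details"},
    {"slk": "MIHOH200519221", "census_field": "household details"},
]
other_db = [
    {"slk": "LAKIM010119201", "other_field": "service record"},
    {"slk": "ALEAV120219162", "other_field": "service record"},
]

linked = link(census, other_db, SALT)
print(linked)
```

In a real system each database would presumably store only the hashed key rather than the SLK; the hash is computed on the fly here to keep the sketch short. Whoever holds the salt and enough details to build an SLK can produce the same key.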

Presumably all ABS staff are as committed to confidentiality as previous Commonwealth Statistician Sir Roland Wilson was. When lawfully asked by the Commissioner of Taxation to supply data on a citizen to help convict them in a court case, he had the data destroyed rather than hand it over and have the public lose trust in the system [1]. Acts like this give me trust in the people who handle our sensitive data and pride in humanity. But times have changed, and the hierarchies that once ruled are now composed differently, with ambiguous objectives. Perhaps if the government, and more specifically the ABS, were more proactively transparent about their processes, they would attract more trust and goodwill from those of us citizens concerned with scope creep.

I have nothing to hide so why should I care?

Aaaarrrgghhh!!!

Arguing that you don't care about the right to privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say.

Edward Snowden 


TL;DR: What does this all mean?

Ultimately this means that having your name kept for up to 4 years is of less concern than the fact that it is being stored at all. The way it is being stored allows for it to be re-linked to your extensive demographic information, and for all databases across time to be able to link their data back to you, for good or for bad.

 


[1] Informing a Nation - the evolution of the Australian Bureau of Statistics (pg 9) (PDF)