We look at how custom-built software, powerful data mining tools and old-fashioned manual labour helped expose thousands of anonymous offshore account holders.
Sorting through 260 gigabytes (GB) of any data is not an easy task. That task is made even more complicated when the data is "not structured or clean" and consists of "a large and mainly unsorted collation of company and trust documents and instructions, e-mails, large and small databases and spreadsheets, personal identity documents, accounting information and agents' and companies' internal papers and reports."
That was the challenge facing the International Consortium of Investigative Journalists (ICIJ) following the leak of a huge trove of sensitive information about offshore accounts mainly relating to the British Virgin Islands.
To put it in context, the volume of leaked data dealt with during the investigation into offshore accounts is 160 times larger than the cache of US State Department cables leaked to and published by Wikileaks in 2010.
The information landed on the desk of Gerard Ryle, director of the ICIJ, last year. The "computer hard drive" on which the information was stored was not sent to Ryle randomly, but as a result of his three-year investigation into Australia's Firepower scandal - a case involving offshore havens and corporate fraud.
The amount of information stored on the hard drive was staggering. The 260GB of data contained 2.5 million files, including 2 million emails, four large databases and half a million text, PDF, spreadsheet, image and web files.
During the course of the investigation details of more than 122,000 offshore companies or trusts, nearly 12,000 intermediaries (agents or "introducers"), and about 130,000 records on the people and agents who run, own, benefit from or hide behind offshore companies were discovered.
The volume of data was daunting. While the ICIJ used some of the most sophisticated data mining technology available over the course of the investigation, it all began with a group of journalists in New Zealand manually sifting through the reams of data to try to see exactly what they had on their hands.
This manual analysis led to the identification of which countries the investigation would need to focus on, and therefore which countries the ICIJ needed reporters to work in.
Having established the scope of the investigation, 86 journalists in 46 countries around the globe began trying to untangle the complex set of data in front of them.
One of the main issues facing investigators was that tens of thousands of the documents could not be searched by conventional software: they were images, such as photographs and scans, which contain no machine-readable text.
To overcome this problem, optical character recognition (OCR) software was used to re-scan the unreadable files, identifying and logging the names and numbers that appeared in the images.
According to the ICIJ, this technique "brought to the surface dozens of important new documents, including passports, contracts and letters explaining how companies were controlled."
The next step was to begin analysing the huge volume of data the ICIJ had on its hands.
At the scale the ICIJ was dealing with, it was not possible to simply go and look for the interesting pieces of data, so it turned to "free text retrieval" software.
This software can automatically analyse vast troves of data, many times bigger than the one in this investigation, by pre-indexing every number, word and name, making it possible for complex queries to be completed in milliseconds.
This powerful software has been traditionally used by the world's intelligence agencies, law firms and commercial corporations but not investigative journalists as the cost is prohibitive.
However, thanks to Australian company Nuix granting the ICIJ a number of licences free of charge, the consortium was able to use Nuix's high-end e-discovery software to help index the 260GB of data.
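The pre-indexing idea described above can be illustrated with a toy inverted index. This is only a minimal sketch of the general technique, not a description of how Nuix's software works; the function names and the sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, invented for illustration only.
documents = {
    "doc1": "Instruction to incorporate Example Holdings Ltd in Tortola",
    "doc2": "Invoice 4471 for Example Holdings Ltd, annual fee",
    "doc3": "Passport copy for J. Smith, director",
}
index = build_index(documents)
```

Because the index is built once in advance, a query such as `search(index, "example holdings")` only has to intersect a few small sets rather than re-read every file, which is why lookups stay fast even as the collection grows.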
With the data indexed and journalists primed in countries around the world, what the ICIJ needed was a centralised, online hub, where those involved in the investigation could log on and search the data.
Developed and deployed in Britain
Known as Interdata, the ICIJ's online search and retrieval system was developed and deployed by a British programmer in less than two weeks in December 2012, to "support an urgent need to get relevant documents and files out faster for research by dozens of new journalists who were joining the expanding Offshore Project."
Since the system went online, journalists have made more than 28,000 online searches and downloaded over 53,000 documents.
Other specially built programs allowed names and addresses to be checked and matched, and have spotted thousands of cases where the same person's data had been entered numerous times in different ways for different companies.
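The ICIJ has not published how its matching programs work, but the underlying problem can be sketched in a few lines. The example below assumes a simple approach: normalise each name so that case, punctuation and word order no longer matter, then flag near-identical spellings with Python's standard `difflib`. The records, function names and the 0.85 threshold are all hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, drop punctuation, and sort the words so order does not matter."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(words))


def group_records(records):
    """Group (name, company) records whose normalised names are identical."""
    groups = defaultdict(list)
    for name, company in records:
        groups[normalise(name)].append((name, company))
    return groups


def similar(a, b, threshold=0.85):
    """Flag two names as a likely match if their normalised forms are close."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


# Hypothetical records, invented for illustration only.
records = [
    ("John A. Smith", "Alpha Ltd"),
    ("SMITH, John A", "Beta Inc"),
    ("Jon A Smith", "Gamma SA"),
    ("Maria Lopez", "Delta Corp"),
]
groups = group_records(records)
```

Here the first two records fall into the same group, while `similar("Jon A Smith", "John A. Smith")` catches the misspelt third entry. A threshold this simple would produce false matches on a real data set, so any hits would still need checking by a reporter.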
Another specially built piece of software identifies the country associated with each person and company, even when geographic data has not been entered fully or correctly.
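One plausible way to infer a country from incomplete address data is to look for known place names and abbreviations. The sketch below assumes a small hand-made lookup table; it is not the ICIJ's program, and the `HINTS` table and `infer_country` function are hypothetical.

```python
import re

# Hypothetical lookup table; a real system would need a far larger gazetteer.
HINTS = {
    "british virgin islands": "British Virgin Islands",
    "bvi": "British Virgin Islands",
    "tortola": "British Virgin Islands",
    "road town": "British Virgin Islands",
    "united kingdom": "United Kingdom",
    "uk": "United Kingdom",
    "london": "United Kingdom",
    "new zealand": "New Zealand",
    "nz": "New Zealand",
    "auckland": "New Zealand",
}


def infer_country(address):
    """Return the country suggested by the address, or None if no hint matches."""
    # Keep only letters, and pad with spaces so hints match whole words only.
    text = " " + " ".join(re.findall(r"[a-z]+", address.lower())) + " "
    # Check longer hints first so the most specific phrase wins.
    for hint in sorted(HINTS, key=len, reverse=True):
        if f" {hint} " in text:
            return HINTS[hint]
    return None
```

With this table, `infer_country("PO Box 123, Road Town, Tortola")` resolves to the British Virgin Islands even though no country was entered, while an address with no recognisable hint returns `None` and would be left for manual review.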
Over a three-month period, the ICIJ was also able to recover and rebuild databases detailing offshore companies and the people who had set up and operated them.