The British Library is one of six 'legal deposit libraries' to begin archiving the UK web, including one billion pages from 4.8 million websites, blogs, forums and social media sites.

The six libraries will gather more than a billion pages from 4.8 million UK websites. (Credit: Reuters)

The collection begins this week and marks the biggest change in hundreds of years to the way libraries gather data, with plans to capture and record around a billion web pages per year for future generations to read.

Culture Minister Ed Vaizey MP said: "Preserving and maintaining a record of everything that has been published provides a priceless resource for the researchers of today and the future.

"So it's right that these long-standing arrangements have now been brought up to date for the 21st century, covering the UK's digital publications for the first time."

Starting on 6 April, the operation will run an automatic "web harvest" of 4.8 million UK websites amounting to around one billion pages. The process is expected to take three months, followed by two months of processing the data to make it easily searchable in the British Library's database, which already holds 750 million pages of newsprint.

Previously, under the Legal Deposit Libraries Act 2003, libraries were given copies of all major print publications, but had to ask permission each time they wanted to store something published online; now, the libraries involved will be granted access to all UK online publications.

Along with the British Library, the National Libraries of Scotland and Wales, the Bodleian Libraries in Oxford, Cambridge University Library and Trinity College Library Dublin will all help to gather the data and make it available for visitors.

Social networking

The archive will include everything from mainstream news websites - including access to content previously locked behind paywalls - to blogs, forums and eventually content from social networks, such as tweets and Facebook posts made by users with their privacy settings set to public.

While only posts and messages published publicly will be harvested, this is a step too far according to Nick Pickles, director of UK privacy campaign group Big Brother Watch, who told the Financial Times:

"The danger of unintended consequences is magnified by how wide they've cast the net."

Pickles said that many people who use these social networking sites may not realise that what they upload to the public web will be preserved forever.

Swallowed by a black hole

"Ten years ago, there was a very real danger of a black hole opening up and swallowing our digital heritage," said Roly Keating, chief executive of the British Library, "with millions of web pages, e-publications and other non-print items falling through cracks of a system that was devised primarily to capture ink and paper."

Items that have fallen through the cracks include online reports of major news events, and stories that broke on Twitter and blogs but did not appear in as much detail or with as much analysis in printed publications.

Lucie Burgess, project leader at the British Library, said: "If you want a picture of what life is like today in the UK you have to look at the web. We have already lost a lot of material, particularly around events such as the 7/7 London bombings or the 2008 financial crisis.

"That material has fallen into the digital black hole of the 21st century because we haven't been able to capture it. Most of that material has already been lost or taken down. The social media reaction has gone."

To preserve the web archives, data will be stored on numerous servers across the country and there will be self-replicating copies to ensure nothing is lost if a server is damaged.