The Interface Between the Worlds of Cloud Computing and the Semantic Web

Paul Miller




Repositories in the Cloud? Why on Earth Not?

Open Access is an important component of today’s scholarly ecosystem

To be honest, I’ve never fully understood Higher Education’s penchant for building ‘institutional repositories.’ These frequently under-populated aggregations of academic papers produced by ‘research active’ employees of a particular university appear aligned almost exclusively to vaguely expressed institutional imperatives, and seem largely unrelated to either the selfish aspirations of the contributing authors or the tangible relationships they painstakingly construct with others across their chosen discipline. The ‘repository’ all too often appears a bureaucratic solution to a problem that the supposed beneficiaries do not recognise; a technological aberration that sits outside the conversational flow of the Web to which it is only tenuously attached.

Furthermore, ‘Open Access’ and ‘Repository’ typically go hand in hand. If you support Open Access you need a repository, and if you question the role of repositories you’re in the pocket of evil publishers who want to lock up everything ever written and lease reading rights back to the employers of those who wrote the stuff in the first place.

Open Access is an important component of today’s scholarly ecosystem. It’s not the only answer, and it’s not perfect, but it does have a significant part to play. Institutions have a role in preserving, disseminating and exploiting the work of their employees, but these are very different tasks that may benefit from different solutions. In too many cases, the repository is by default seen as a preservation mechanism and a dissemination vehicle, and as such it may fail to cost-effectively achieve either aim.

There are some large, well known, and research-intensive institutions where it might be possible to make a compelling argument for projecting a strong institutional image around a single ‘home’ for all of that research output. Never mind, for a moment, that so much research today is the result of inter-institutional collaboration, or that the eminent researcher might wish to take ‘their’ research publications with them as they move from Oxford to Harvard to York during their glittering career.

Alongside those institutions sit a plethora of others where research of equal quality is also being conducted; there just, maybe, isn’t quite as much of it. Bombarded by ‘advice’ and funding, and desperate to keep up with the Russell Group, ever more institutions blindly join the repository cult and wonder why their new toys do not fill to overflowing with the jewels of scholarly erudition.

As research becomes increasingly data-rich, the whole cycle looks set to repeat. The recently released Panton Principles for Open Data in Science are to be welcomed, but I’ll bet the institutional response will all too often be the commissioning of a ‘data repository’ to sit alongside the ‘publication repository’ they already don’t use.

All of which is a rather long-winded way of introducing the fact that Eduserv’s Andy Powell has asked me to facilitate an afternoon breakout session on ‘Policy Issues’ at the Repositories in the Cloud event Eduserv and JISC are holding in London on Tuesday.

“This free event, organised jointly by Eduserv and the JISC, will bring together software developers, repository managers, service providers, funding and advisory bodies to discuss the potential policy and technical issues associated with cloud computing and the delivery of repository services in UK HEIs.”

In a post on 11 February, Andy invited participants to share some of their views ahead of the meeting, and on 19 February he wrote about some of his own thoughts.

Like Andy, I struggled somewhat to nail down a coherent set of thoughts about the issue of pushing today’s repositories into the Cloud. On one level, I wonder whether the vast majority of institutions with small (and relatively low traffic) repositories would see much of a tangible efficiency gain or cost saving by moving off an in-house computer to rent an equivalent Virtual Machine from Amazon, Rackspace, or any of their competitors. If we’re talking about IT systems within a typical university, there are others (email, calendaring, pools of compute resource for research jobs, etc) that appear more immediately compelling for the shift Cloud-ward. Which is not to say that there isn’t a clear opportunity for someone trusted to step into this space and offer a SaaS repository to which institutions might affordably subscribe. Eduserv? Mimas? Edina? The British Library? The National Archives? Duraspace? Any could, and if we’re not ready for something more then at least one probably should.

However, a bolder reconsideration of what repositories are and what they’re for might very well lead to something interesting, sustainable, and perfectly suited for benefitting from Cloud Computing’s strengths.

Why does a paper have to be ‘deposited’ in a repository? Why does a single paper with three authors from three institutions have to be deposited in three separate institutional repositories? Why does that same paper have to be deposited – separately – in the subject repository favoured by scholars in the relevant discipline? Why does the institution’s very reasonable desire to protect, preserve, promote and disseminate its excellence mean that it has to run systems in perpetuity that preserve and permit access? Why do we address the fundamentally different (perhaps even contradictory) problems of access and preservation in the same system? Why can’t the individual researcher easily assemble a view across their publication history, regardless of the institution within which they happened to reside as they wrote each paper? Why don’t the assemblages of papers reflect personal, professional and disciplinary relationships, alongside (or instead of) the contractual accident of employee-employer relationships? Why isn’t the wealth of metadata implicit to any publication (authors, subjects, dates, citations, and more) available and actionable, both inside the repository and far beyond it across the Web? Why isn’t there a tight and active association between the paper and the data from which its findings were derived (something for which Internet Archaeology was demonstrating utility a very long time ago)?
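To make the ‘wealth of metadata implicit to any publication’ concrete, here is a minimal sketch of what a machine-actionable description of a three-author paper might look like, expressed as JSON-LD. The identifiers, the example values, and the choice of Dublin Core terms are my illustrative assumptions, not anything proposed in the post:

```python
import json

# A hypothetical paper's implicit metadata, expressed as JSON-LD so that
# authors, dates and citations become actionable both inside a repository
# and across the wider Web. All URIs here are placeholders.
paper = {
    "@context": {"dc": "http://purl.org/dc/terms/"},
    "@id": "http://example.org/papers/123",
    "dc:title": "A Study of Repository Use",
    "dc:creator": [                     # three authors, three institutions
        "http://example.org/people/alice",
        "http://example.org/people/bob",
        "http://example.org/people/carol",
    ],
    "dc:date": "2010-02-22",
    "dc:references": ["http://example.org/papers/99"],  # a citation link
}

print(json.dumps(paper, indent=2))
```

Because the authors and citations are Web identifiers rather than free text, the same record could surface in an institutional view, a disciplinary view, or a researcher’s personal publication history without being re-deposited anywhere.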

Scholarly papers principally comprise text, augmented by the occasional static image. They’re not big, and they don’t tend to change very fast. In many ways, they represent a fairly easy problem set with which to work. As more and more data becomes key to research in a growing number of subject areas, the problems are set to become far larger and far more difficult. For individual universities to even consider replicating the process by which they all ended up with their repositories of text surely seems madness in this data-rich environment. Even with levels of uptake as low as those seen in too many text repositories, the issues of data management, curation, access and dissemination are too great to be sensibly solved in the institutional machine room. Services like InfoChimps and Amazon’s own Public Data Sets offering show some of the ways that we might begin to work with data at scale. Might we, for example, come to recognise, as Amazon has, that it’s actually cheaper and quicker to entrust large data sets to FedEx rather than transmit them over the Internet?
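The FedEx trade-off is easy to sketch with a little arithmetic. The figures below (a 10 TB data set, a sustained 100 Mbit/s institutional link, a two-day courier) are illustrative assumptions of mine, not numbers from the post or from Amazon:

```python
# Back-of-envelope comparison: moving a large research data set over the
# network versus shipping physical media. All figures are hypothetical.

def transfer_days(size_terabytes: float, link_megabits_per_sec: float) -> float:
    """Days needed to push the data set over a fully sustained link."""
    bits = size_terabytes * 1e12 * 8              # TB -> bits (decimal units)
    seconds = bits / (link_megabits_per_sec * 1e6)
    return seconds / 86400                        # seconds -> days

online_days = transfer_days(10, 100)              # roughly nine days
courier_days = 2                                  # assumed shipping time

print(f"network: {online_days:.1f} days, courier: {courier_days} days")
```

At these assumed figures the courier wins comfortably, and that is before contention on the link, transfer failures, or checksum re-runs are considered; the gap only widens as data sets grow faster than institutional bandwidth.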

‘The answer’ might be some central service for the community, funded by JISC like the Arts & Humanities Data Service (AHDS) of old. Or it might be something different, something nimbler, more responsive, more flexible to individual, institutional, and disciplinary requirements, and something more scalable to new disciplines; institutional support for and use of existing Cloud infrastructures extending far beyond UK Higher Education, aligned with a clear understanding of the separation between preservation and access.

I certainly don’t have all the answers, but I do believe that simply asking whether or not we should move existing repositories to the Cloud is to miss the point. Rather, we should ask what role the Cloud might play in addressing the business requirements to which the institutional repository was our initial – faltering – response. The answer might very well be ‘None,’ but I doubt it.

I look forward to Tuesday’s discussion. I’m not going there to push my personal view that individual institutions frequently shouldn’t be building, running or populating their own repositories at all. I’m going there to facilitate the discussion those in the room want to have, and to learn from their experiences and their perspectives.



Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.

He blogs at www.cloudofdata.com.