Repositories for research and teaching/learning material: The debate continues at #rpmeet

[Image: diagram of the programme structure]

Last week I attended the JISC Repository and Preservation end of programme meeting in Birmingham. I recall being very nervous at my first JISC event in November 2007 but feel much more at ease now and enjoyed the event immensely. The programme has certainly been successful in fostering a sense of community, though it’s an unusual social experience to meet people face to face, often for the very first time, when one feels one already knows them from reading their blogs and following them on Twitter.

During one of the breakout sessions on the first day I made a bee-line for a discussion about repositories for learning and teaching materials – as opposed to OA research repositories. I use the word “opposed” advisedly as there is certainly some strong sentiment around the issue, particularly with respect to using a common software platform. As a representative of a project that is adapting a learning object repository to also serve as an effective Open Access research repository, I’m finding it a little difficult to understand the vehemence of some of this opposition, though I would be the first to acknowledge a steep learning curve and recognise that we have required extensive development, not of intraLibrary itself perhaps, but of an appropriate web infrastructure surrounding it. And yes, we would certainly have been able to implement a functioning OA research repository more quickly using EPrints or DSpace; however, from the outset, it was vital that our repository had the capacity to fulfil its broader potential – in the words of Clifford Lynch: “[A] mature and fully realised institutional repository will contain the intellectual works of faculty and students – both research and teaching materials – and also documentation of the activities of the institution itself in the form of records of events and performance and of the ongoing intellectual life of the institution.” [Lynch, Clifford A. “Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age.” ARL Bimonthly Report 226 (2003).]

It’s also important to be pragmatic. Historically, Leeds Metropolitan University is a polytechnic that gained chartered university status in 1992; its heritage is very much in teaching and learning rather than research with, arguably, a more vocational than academic flavour. In recent years, the research profile has steadily increased, culminating in unprecedented success in the 2008 RAE, and the university is naturally keen to capitalise on this success and enhance its research profile further whilst also continuing to emphasise its student-focussed teaching and learning credentials. The implementation of an integrated repository to support both research outputs and learning objects reflects this dual focus. Clifford Lynch’s article suggests that the concept of a central system to manage disparate resources in this way has been implicit within the sector for some years; however, the technology has tended to focus on Open Access to research, with the two most widely used software platforms being EPrints, developed at the University of Southampton in 2000, and DSpace, developed at MIT in 2002. Early versions of both platforms were primarily designed to manage text-based resources (though subsequent versions of EPrints and DSpace can manage a wide range of digital file formats).

NB. In an extended discussion on this issue on JISC-REPOSITORIES (archive here), RepositoryMan Les Carr of EPrints refers to the fact that he still comes across the firmly held (and spurious) belief that because EPrints is used for Open Access it can’t be used for multimedia files or scientific data.

The session was chaired by Amber Thomas of JISC and I asked a somewhat blunt, perhaps naive, question about JISC’s perspectives on combined repositories of research and teaching materials. Amber suggested that JISC have been deliberately neutral on the issue, which is also perhaps emphasised by the diagrammatic representation of the programme structure reproduced above.

Some of the commentators last Wednesday were adamant that, though it may well be possible to manage different types of resources with a single system, it was far from desirable, with one colleague making the pithy analogous observation that you can write letters in Excel but that doesn’t make it right. Phil Barker of CETIS was also at the discussion and, in a recent blog post on the “question of whether research outputs and learning materials should [be] stored in the same repository”, is “inclined to think the answer is no, the purpose of the repository is different, a learning material isn’t an output, sharing means something different for the two resource types.” Phil goes on to say that “If you think a repository is a database and a bit-store then you may come to a different conclusion, but I think a repository is a service offered to people and your choice of starting point in offering that service will affect how easy your journey is.” (Full post here)

I’d certainly concede that our journey hasn’t been an easy one, and I also agree that a repository is a service offered to people; with our repository start-up, and also Streamline and PERSoNA, that is certainly the approach we have tried to take. With intraLibrary and the SRU interface we now have an incipient infrastructure to manage both research material and learning objects. The discrete types of material can be managed entirely separately; however, there is also potential for the ongoing development of a holistic approach to the management of the full range of digital resources produced by a modern university. As we develop our infrastructure further, I hope we can utilise appropriate web technology around a central management system (intraLibrary) to achieve decentralised resource discovery through appropriate interfaces, widgets and environments – the VLE, for example.

[Image: JISC meeting 2009 poster]

Then of course there is the small matter of persuading academics to part with their resources, not to mention IPR, copyright and quality control issues…

Open Access to research is an evolving paradigm and represents a considerable shift in the established academic publishing process; Open Access to a broader range of educational resources still more so. Any paradigm shift is likely to take time to evolve and Open Access, to research and other materials, is no exception, especially given that academia, perhaps, tends to subscribe rather strongly to established tradition!

JISC’s current OER programme should go some way to addressing many of these issues but infrastructure is the foundation. The perfect system almost certainly doesn’t exist and it’s surely important to be pragmatic when implementing and developing an appropriate system. Here’s to ongoing discussion, debate and development.

Google indexing and SEO

It is crucial that both the Open Access full text research content of the repository and metadata records of citation material are fully indexed by Google (and other search engines); in the future it is also likely to be required for other Open Educational Resources (learning objects). However, site:http://repository-intralibrary.leedsmet.ac.uk/ currently returns just 4 results (in addition to the Login page itself) and it is a bit of a mystery how these 4 are actually being picked up when the majority of records are not.

In intraLibrary, for a given collection, the administrator may choose to:

• Allow published content in this collection to be searched by external systems

This effectively means SRU (Search/Retrieve via URL), a standard search protocol utilising CQL (Contextual Query Language, formerly Common Query Language).

• Allow published records in this collection to be harvested by external systems

This effectively means harvesting by OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting).
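
To make the two options a little more concrete, here is a minimal Python sketch of the kind of request an external system might send in each case. The base URLs are hypothetical placeholders (the actual intraLibrary endpoint paths are not given here), and the dc.creator index assumes the server exposes the Dublin Core context set; the parameter names themselves are those defined by SRU 1.1 and OAI-PMH 2.0.

```python
from urllib.parse import urlencode

# Hypothetical endpoints: the real intraLibrary paths are not given in this post.
SRU_BASE = "http://repository.example.ac.uk/sru"
OAI_BASE = "http://repository.example.ac.uk/oai"


def sru_search_url(cql_query, max_records=10, base=SRU_BASE):
    """Build an SRU 1.1 searchRetrieve request for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return base + "?" + urlencode(params)


def oai_list_records_url(metadata_prefix="oai_dc", set_spec=None, base=OAI_BASE):
    """Build an OAI-PMH ListRecords request, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base + "?" + urlencode(params)


# Search: a CQL query on the (assumed) Dublin Core creator index.
print(sru_search_url('dc.creator = "Morton, Veronica"'))
# Harvest: every record, described in simple Dublin Core.
print(oai_list_records_url())
```

Neither request is something Googlebot will issue on its own, which is why the indexing question below arises at all.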

XML Sitemaps

Intrallect have suggested that it is necessary to implement an XML sitemap to ensure that content is properly crawled by Google. Until 2008, Google did support sitemaps using OAI-PMH but has since withdrawn this and now supports only the standard XML format. Intrallect have therefore developed a software tool that converts OAI-PMH output to an appropriate XML format. A sitemap has been generated and registered using Google’s webmaster tools but is currently registering a series of errors that indicate “This URL is not allowed for a Sitemap at this location”. Nine sequential errors are listed, starting from the very first URL; it seems that the crawl does not go any further, and none of the 100+ URLs in the sitemap have been successfully recognised. Two possible reasons have been suggested for this:

• All of the URLs in the sitemap are external; it may be that Google does not permit URLs outside the mapped domain.
• There is a problem with the XML itself.

Sitemap here: http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml
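
The first of these suggestions can be checked mechanically. The sitemaps.org protocol says that a sitemap may only list URLs on its own host and at or below its own directory, so a file at /sitemap/Sitemap.xml covers only URLs beginning with /sitemap/. The Python sketch below applies that rule to a sitemap; the sample entries use made-up example.org addresses, not our real records, and this is offered as a way of testing the hypothesis rather than a confirmed diagnosis.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def allowed_prefix(sitemap_url):
    """Return the URL prefix that entries in this sitemap must start with."""
    parts = urlsplit(sitemap_url)
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def out_of_scope(sitemap_url, sitemap_xml):
    """List the <loc> values that fall outside the sitemap's own location."""
    prefix = allowed_prefix(sitemap_url)
    root = ET.fromstring(sitemap_xml)
    locs = [(el.text or "").strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [loc for loc in locs if not loc.startswith(prefix)]


# Made-up sample entries, purely for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://repo.example.org/sitemap/record1.html</loc></url>
  <url><loc>http://repo.example.org/records/record2.html</loc></url>
  <url><loc>http://other.example.org/sitemap/record3.html</loc></url>
</urlset>"""

# The prefix implied by where our own sitemap currently lives.
print(allowed_prefix("http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml"))
# Entries a crawler would be expected to reject for the sample sitemap.
print(out_of_scope("http://repo.example.org/sitemap/Sitemap.xml", SAMPLE))
```

If this rule is what is being tripped, serving the sitemap from the root of the host whose URLs it lists would be one thing to try.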

Sitemaps using RSS

It is also possible to submit a sitemap based on RSS; however, this approach has not been any more successful, as the Open URL/virtual file paths generated by intraLibrary are inaccessible to Google, resulting in the following warning:

URLs not followed
When we tested a sample of URLs from your Sitemap, we found that some URLs redirect to other locations. We recommend that your Sitemap contain URLs that point to the final destination (the redirect target) instead of redirecting to another URL.
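
Google's recommendation could, in principle, be followed by resolving each URL to its redirect target before it goes into the sitemap. The following is a minimal Python sketch under that assumption; it has not been tried against intraLibrary, and it would only help if the final destinations are themselves publicly reachable. The function names are my own, and the demonstration uses a stand-in lookup table with made-up addresses instead of real network calls.

```python
import urllib.request


def final_url(url, timeout=10):
    """Fetch the URL and return the address that finally served the content.

    urlopen follows HTTP redirects automatically; geturl() reports where
    the request ended up.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def resolve_all(urls, resolver=final_url):
    """Return (original, target, was_redirected) for each sitemap URL."""
    report = []
    for url in urls:
        target = resolver(url)
        report.append((url, target, url != target))
    return report


# Offline demonstration: a dictionary stands in for the network.
FAKE_REDIRECTS = {
    "http://repo.example.org/open/123": "http://repo.example.org/files/paper.pdf",
    "http://repo.example.org/files/other.pdf": "http://repo.example.org/files/other.pdf",
}

for original, target, redirected in resolve_all(FAKE_REDIRECTS, resolver=FAKE_REDIRECTS.get):
    print(original, "->", target, "(redirected)" if redirected else "(direct)")
```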

Google and SRU

Though SRU does not facilitate indexing by Google per se, the integration of the SRU Open Search interface may provide a potential solution. A search for site:http://repository.leedsmet.ac.uk/ currently returns 247 records; these largely appear to represent Googlebot following the various browse links (many of which themselves return no results where there is no content to find!). In addition, Googlebot appears to be following hyperlinked author names, publishers and subjects in the individual metadata records:

[Image: Google search results for the repository]

The third of these, “The Repository search for Morton, Veronica”, links to the two metadata records associated with that name, as though it had simply been entered into http://repository.leedsmet.ac.uk/ as a search term:

http://repository.leedsmet.ac.uk/main/search.php?q=Morton%2C+Veronica+
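
The query string in that link is just the author name, URL-encoded: the comma becomes %2C and each space (including the trailing one) becomes a plus sign. A short Python sketch reproduces the encoding and reverses it; the helper name is my own, and this says nothing about how the SRU interface itself builds its links.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_PAGE = "http://repository.leedsmet.ac.uk/main/search.php"


def author_search_url(author):
    """Build the search link for an author name exactly as it is typed."""
    return SEARCH_PAGE + "?" + urlencode({"q": author})


url = author_search_url("Morton, Veronica ")
print(url)
# Decoding the query string recovers the original search term.
print(repr(parse_qs(urlsplit(url).query)["q"][0]))
```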

Presumably these records were initially indexed via the appropriate links on the browse interface (http://repository.leedsmet.ac.uk/main/browse.php; Faculty of Health and R – Medicine) and then re-indexed via the hyperlinks embedded in the metadata records. It is interesting to note that, though Morton, Veronica only has two records associated with her name, this record appears relatively high – at the top of the second page – and this is probably because there are so many other authors also associated with these papers; all of these names are hyperlinked, giving over 21 separate indexable links.

It seems that we might need to formalise the structure of the SRU to ensure it is optimised for Google, possibly with some sort of SRU sitemap. For example, we could generate a page that links to all the individual metadata records in the repository and optimise this page to be crawled by search engine spiders (it doesn’t need to be human readable; it could be XML), which could then follow the links to the associated metadata.
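
As a rough illustration of the XML variant, the Python sketch below turns a list of record URLs into a standard sitemap. How the list of record URLs would be obtained (from SRU, OAI-PMH or elsewhere) is left open, and the URLs shown are hypothetical, since the address pattern for an individual metadata record in our interface is not settled.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return sitemap XML with one <url><loc> entry per record URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for record_url in record_urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serialising.
        ET.SubElement(url_el, "loc").text = record_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical record URLs, for illustration only.
records = [
    "http://repo.example.org/main/record?id=1&view=full",
    "http://repo.example.org/main/record?id=2&view=full",
]
print(build_sitemap(records))
```

The resulting text should be saved as UTF-8 and placed where its location covers the URLs it lists.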

It also seems to me that Search Engine Optimisation will need to comprise appropriate customisation of the SRU interface; for example, we want to facilitate browse by author which, in turn, will provide indexable links for Googlebot.

Full text indexing

There is also the issue of indexing full text. As already mentioned, Google does not follow the Open URL/virtual file paths generated by intraLibrary, and all the results from site:http://repository.leedsmet.ac.uk/ are search results. Potentially this is a benefit in as much as people are less likely to bypass the metadata record and go directly to the PDF, but we do also want to facilitate full text indexing. We may have to wait for Intrallect on this; they have assured us they are looking into facilitating full text indexing, probably via intraLibrary itself rather than the SRU.
