Repository News

Implementing an Institutional Repository for Leeds Metropolitan University

Posts Tagged ‘Google indexing’

Leeds Met Repository Open Search Version 2.0

Posted by Nick on November 9, 2009

This is a bit of a trailer for our shiny new interface, which I hope will go live in the next week or so, and a rundown of some of the new features.

It’s far from perfect and should still be seen as a beta – we very much need real users to start using it and I’m feeling a little nervous about how it will be received as I know how much work Mike, in particular, has put into it.

The interface has evolved from an SRU client developed by IRISS – http://www.iriss.org.uk/learnx – which is available under GNU General Public Licence v.3 at http://code.google.com/p/sruopensearch/ (N.B. We still intend to release our modified code under a similar licence.) Learning Exchange Open Search is a great front end for searching intraLibrary but, with just a simple search box, it lacked the advanced search functionality that was essential for us. We also wanted to use intraLibrary to manage resources for teaching & learning as well as facilitating Open Access to our research collection in accordance with the EPrints model.

The tabbed interface incorporates an “Advanced search” form that allows users to combine multiple fields using AND/OR. They can also restrict a search to either “Research” or “Open Educational Resources”; this uses authentication tokens to return results from the appropriate collections in intraLibrary:

[Screenshot: advanced search form]
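To illustrate how an advanced search form like this might combine several fields, here is a minimal sketch that turns (index, value) pairs into a CQL query string joined by AND or OR. It is not Mike's code: the function names are hypothetical, and the index names used in the example (dc.title, dc.creator) are assumptions about what the server exposes.

```python
from typing import List, Tuple


def cql_term(index: str, value: str) -> str:
    """Return one CQL clause, e.g. dc.title = "open access"."""
    # Escape backslashes first, then double quotes, so the value stays quoted.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{index} = "{escaped}"'


def build_cql(clauses: List[Tuple[str, str]], operator: str = "and") -> str:
    """Join (index, value) pairs with a single boolean operator."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    # Skip fields the user left blank on the form.
    terms = [cql_term(index, value) for index, value in clauses if value.strip()]
    if not terms:
        raise ValueError("at least one non-empty search value is required")
    return f" {operator} ".join(terms)


# Index names below are assumptions, used for illustration only.
print(build_cql([("dc.title", "repository"), ("dc.creator", "Smith")], "and"))
# dc.title = "repository" and dc.creator = "Smith"
```

A real form would probably let users mix AND and OR between fields; this sketch keeps to one operator per query for simplicity.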

There are also big changes in the way that results are returned. Mike has been able to use a unique identifier to build individual pages for each record, so a search now returns a set of results that indicates whether or not each record has the full text available:

[Screenshot: repository search results]

These titles then link through to a static HTML page comprising all of the metadata associated with that record including a published URL and, where the full text is available, a link to the PDF in intraLibrary:

[Screenshot: static record page]
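As a rough idea of what building such a page involves, the sketch below renders a record's metadata as a small static HTML document, adding a PDF link only when full text exists. The function name, the metadata keys and the example URLs are all hypothetical; the real interface will certainly differ.

```python
from html import escape
from typing import Dict, Optional


def render_record_page(metadata: Dict[str, str], published_url: str,
                       pdf_url: Optional[str] = None) -> str:
    """Render one record as a static HTML page (illustrative layout only)."""
    title = escape(metadata.get("title", "Untitled record"))
    # One table row per metadata field, escaped so values cannot break the markup.
    rows = "\n".join(
        f"    <tr><th>{escape(name)}</th><td>{escape(value)}</td></tr>"
        for name, value in metadata.items()
    )
    if pdf_url:
        full_text = f'  <p><a href="{escape(pdf_url)}">Full text (PDF)</a></p>'
    else:
        full_text = "  <p>Full text not available.</p>"
    link = escape(published_url)
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        f"<head><title>{title}</title></head>",
        "<body>",
        f"  <h1>{title}</h1>",
        "  <table>",
        rows,
        "  </table>",
        f'  <p>Published URL: <a href="{link}">{link}</a></p>',
        full_text,
        "</body>",
        "</html>",
    ])


# Hypothetical record and URLs, for illustration only.
page = render_record_page(
    {"title": "Fish & Chips", "author": "A. N. Author"},
    "https://repository.example.org/records/1.html",
)
print(page)
```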

This static page should be indexed more effectively than was the case before, though there is one small fly left in the ointment: the public URL generated by intraLibrary to download the full text is dynamic, which means it cannot be indexed by Google. I’m not sure if it will be possible for Intrallect to do anything about this, though they are aware of the need for full text indexing and are looking into the problem.

Posted in Adapting intraLibrary, Open Search V2.0

Development of Research Repository Aspect of IntraLibrary

Posted by Nick on June 1, 2009

On Friday Mike and I visited colleagues at Keele University for a meeting with Charles Duncan from Intrallect to consider development priorities for intraLibrary to better serve our needs as a research repository. Over four and a half hours we considered the basic issues that need addressing, as well as looking forward to some more ambitious functionality and integration with the wider research infrastructure as we move towards the REF.

I was particularly interested to learn about how Keele are implementing Symplectic’s publications management system – http://www.symplectic.co.uk/ – which regularly trawls Web of Science and PubMed Central for information about Keele’s academic publications. Symplectic have clearly been thinking about integration with IRs and there’s even a link to SHERPA/RoMEO. The system was used at Imperial College London for the RAE 2008 process and includes link functionality with DSpace, which is that institution’s IR platform – http://spiral.imperial.ac.uk/. Intrallect are currently liaising with Symplectic about integration with intraLibrary. I’m not certain precisely what form this would take, but in an ideal world it would be great if we could auto-populate as much metadata as possible (title/bibliographic info/abstract/author/copyright status according to RoMEO) and automatically nudge academics for full text where appropriate!

At Leeds Met we currently lack any form of research database, which is why I’ve been exploring what are essentially manual workflows to populate the repository with all research output. I’m not sure how expensive Symplectic is, and it may be difficult to justify given this institution’s relatively small research output; the repository may well have to be the research database, which is the assumption I’ve been working on. We will also want to explore the soon-to-be-released Web of Science API which may, in any case, enable us to emulate some of this functionality ourselves.

The first item on our agenda was somewhat more prosaic and focussed on our immediate functional requirements – SRU searching and metadata. Mike has been working on incorporating advanced search into the SRU interface and has come up against a couple of issues when searching by author and date, which are essentially artifacts of having to query DC rather than LOM. In the LOM, creators and contributors are clearly differentiated; querying by DC, however, conflates the creator and author roles, which may (will) be different if resources are uploaded by someone other than the author:

  • Searching dc.creator will search for the creator and author roles
  • Searching dc.contributor will search for the content provider role

In addition:

  • Searching by dc.date only searches data that relates to the intraLibrary submission process (i.e. the deposit date, and perhaps modification dates if you added an author later on for example)
  • The only way to search journal dates is to use the default free text search that searches everything (or most fields anyway).
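For readers unfamiliar with SRU, the sketch below shows roughly what a searchRetrieve request against the dc.creator index looks like once it is assembled into a URL. The parameter names (operation, version, query, startRecord, maximumRecords) come from the SRU 1.1 standard; the endpoint address is hypothetical, and I am assuming version 1.1 rather than describing what intraLibrary actually expects.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint; the real intraLibrary SRU address is not given here.
SRU_BASE = "https://repository.example.org/sru"


def sru_search_url(cql_query: str, start_record: int = 1,
                   maximum_records: int = 10, base: str = SRU_BASE) -> str:
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",            # assumed protocol version
        "query": cql_query,          # CQL, e.g. dc.creator = "Smith"
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-encodes the quotes, spaces and '=' inside the query.
    return f"{base}?{urlencode(params)}"


url = sru_search_url('dc.creator = "Smith"')
print(url)
# Decoding the query string again recovers the original CQL.
print(parse_qs(urlparse(url).query)["query"][0])
```

As described above, a dc.creator search of this kind would match both the creator and author roles, so the caller cannot tell the two apart until the LOM itself is queryable.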

The solution, of course, is to make it possible to query the LOM by SRU, and this is now Intrallect’s intention – indeed, to render all LOM fields queryable, which would include user-generated tags, for example.

The next big question is exposure of open content to search engines, and Charles gave us an overview of plans to develop an object “home page” with a static URL, which should help in this area. We also discussed sitemaps and what needs to be done external to intraLibrary. I’m still unclear on how we can improve the format of results returned by Google from the SRU interface; to repeat, Google IS indexing http://repository.leedsmet.ac.uk/, with site:http://repository.leedsmet.ac.uk/ currently returning over 500 records. However, this is fairly unstructured: Google is simply following links from http://repository.leedsmet.ac.uk/main/browse.php, and any subsequent links Googlebot encounters are also indexed and returned as “The Repository search for [link name]”. Ideally I’d like results to be returned in a more structured and user-friendly form. Many queries actually return no results where there is (yet) no content to find, though where there is content, Google is indexing all human-readable metadata. I’m also not certain whether Googlebot is finding its way into the full text via the Open URL/virtual file paths generated by intraLibrary. Full text indexing within intraLibrary itself has also been promised.
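On the sitemap side, the part that has to happen outside intraLibrary could be as simple as writing out a list of static record URLs in the standard sitemaps.org format. The following is a minimal sketch under that assumption; the page URLs and dates are made up, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Iterable, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, date]]) -> str:
    """Return a sitemap XML document for (url, last-modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod uses the W3C date format, YYYY-MM-DD.
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical static record pages, for illustration only.
sitemap = build_sitemap([
    ("https://repository.example.org/records/1.html", date(2009, 6, 1)),
    ("https://repository.example.org/records/2.html", date(2009, 5, 20)),
])
print(sitemap)
```

Listing only stable record pages, rather than search result links, is one way to steer a crawler towards pages that always have content.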

In short, I’m really not sure how all of these factors may combine to be exploited by a next generation SRU interface!

We then touched upon self-archiving and (semi) mediated workflows; potentially developing SWORD based quick deposit from desktop/web, ideally with automatic metadata generation.

The two other major issues we considered are:

  • Policy metadata – handling embargoes

This is pretty crucial to an OA archive of research as many publishers of academic journals specify an embargo period of 12 or 18 months from the date of publication before a paper can be made available in a repository.  We need to be able to add a paper to intraLibrary upon receipt but restrict access until the embargo has expired and for this to happen automatically.  On one level, this functionality should be fairly straightforward to achieve by having intraLibrary check today’s date against an embargo date specified in the metadata; it’s a little more complicated than that though as we would want the metadata to be visible before the embargo date, just not the full text.
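The date check itself can be sketched in a few lines. In this illustration the metadata is always returned, and the full text link is included only once the embargo date has been reached. The Record class, the embargo_until field and the example URL are all hypothetical, and treating the embargo date itself as the first day of availability is my assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass
class Record:
    title: str
    full_text_url: str
    embargo_until: Optional[date] = None  # hypothetical policy metadata field


def full_text_available(record: Record, today: Optional[date] = None) -> bool:
    """True if there is no embargo or the embargo date has been reached."""
    today = today or date.today()
    return record.embargo_until is None or today >= record.embargo_until


def public_view(record: Record, today: Optional[date] = None) -> Dict[str, str]:
    """Metadata is always visible; the full text link only after the embargo."""
    view = {"title": record.title}
    if full_text_available(record, today):
        view["full_text_url"] = record.full_text_url
    else:
        view["embargo_until"] = record.embargo_until.isoformat()
    return view


paper = Record("An embargoed paper",
               "https://repository.example.org/paper.pdf",
               embargo_until=date(2010, 6, 1))
print(public_view(paper, today=date(2009, 6, 1)))
print(public_view(paper, today=date(2010, 6, 1)))
```

Because the check runs each time the record is viewed, access opens up automatically on the embargo date without anyone having to edit the record.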

  • Cover pages for PDF

It was suggested that a coversheet should be generated by intraLibrary on the fly which would certainly be useful as manually creating cover sheets for each and every article is time consuming to say the least; this would be useful functionality for CLA materials which also require a coversheet.

These developments will take some time to implement and the next stage is to prioritise them by anonymous e-postal ballot. Intrallect hope we will start to see some of the major initiatives in a build towards the end of the year.

Thank you to our colleagues at Keele for making us welcome and for feeding us!

Posted in Adapting intraLibrary, Open Access