Implementing the Symplectic API

We’ve made real progress implementing the Symplectic API which I hope will help motivate academic staff to update and maintain their Symplectic profile and, who knows, perhaps even encourage them to upload full-text to the repository! Kudos to web-developer Mike Taylor who has done all the clever stuff (though this summary reflects my understanding so may contain errors!)

As can be seen in the screen-shot below, Mike has been able to submit a query to the API (using Leeds Met username as a parameter) and differentially parse the resulting XML by publication type including, where available, links to DOI and full-text in the repository (currently labelled as Public URL). Symplectic also has the option to “favourite” records, which is flagged in the XML and which we’ve used to identify “Selected publications” in order to give academics greater control over their profile (there is also a “make invisible” option to prevent specific records being exposed from the API).
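As a rough illustration of the parsing step described above, here is a minimal Python sketch. The element and attribute names (`publication`, `type`, `favourite`, `doi`, `public-url`) are invented for the example and are not the real Symplectic schema, and this is not Mike's code; only the idea of grouping by publication type and pulling out favourited records comes from the description.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical response: these element and attribute names are made up
# for illustration and do not reflect the real Symplectic API schema.
SAMPLE_XML = """
<publications>
  <publication type="journal-article" favourite="true">
    <title>Paper one</title>
    <doi>10.1000/example.1</doi>
    <public-url>http://repository.example.ac.uk/record/1</public-url>
  </publication>
  <publication type="book-chapter" favourite="false">
    <title>Chapter two</title>
  </publication>
  <publication type="journal-article" favourite="false">
    <title>Paper three</title>
    <doi>10.1000/example.3</doi>
  </publication>
</publications>
"""


def parse_publications(xml_text):
    """Group records by publication type and collect favourited ones."""
    by_type = defaultdict(list)
    selected = []
    for pub in ET.fromstring(xml_text).iter("publication"):
        doi = pub.findtext("doi")
        record = {
            "title": pub.findtext("title"),
            # Link to the published version only when a DOI is present.
            "doi_link": "https://doi.org/" + doi if doi else None,
            # Full-text link in the repository, if one was exposed.
            "public_url": pub.findtext("public-url"),
        }
        by_type[pub.get("type")].append(record)
        if pub.get("favourite") == "true":
            selected.append(record)
    return dict(by_type), selected


by_type, selected = parse_publications(SAMPLE_XML)
for pub_type, records in by_type.items():
    print(pub_type, [r["title"] for r in records])
print("Selected publications:", [r["title"] for r in selected])
```

Records without a DOI or a repository link simply get `None` for that field, so a page template can decide whether to show the link at all.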

The next step will be to liaise with the corporate web-team to explore how the feed can be embedded in the institutional CMS. We’ve already picked a few brains and it shouldn’t be too difficult, though there are still one or two technical issues, including how best to submit a query. We wouldn’t want to use username as that would be a privacy issue; the preference would be email address, though this will require a layer of translation from email address (which isn’t searchable) to either Leeds Metropolitan username or Symplectic internal user id. In addition, the API isn’t designed to be hammered dynamically, so results need to be cached, which raises the question of how best to refresh that cache to reflect changes that academics may wish to make to their profile.
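To make the caching and translation questions a little more concrete, here is a minimal Python sketch of one possible approach: a time-limited cache keyed on the internal username, fronted by an email lookup. Everything in it is an assumption for illustration (the `ProfileCache` class, the lookup table, the one hour default and the stand-in fetch function); it is not how the CMS integration will necessarily work.

```python
import time


class ProfileCache:
    """Cache per-user API results and refresh them after a time limit."""

    def __init__(self, fetch, email_to_user, max_age_seconds=3600, clock=time.time):
        self._fetch = fetch                  # callable: username -> result
        self._email_to_user = email_to_user  # hypothetical lookup table
        self._max_age = max_age_seconds
        self._clock = clock
        self._store = {}                     # username -> (timestamp, result)

    def get(self, email):
        # Translate the public identifier (email) to the internal one.
        username = self._email_to_user[email.lower()]
        cached = self._store.get(username)
        now = self._clock()
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]
        result = self._fetch(username)
        self._store[username] = (now, result)
        return result

    def invalidate(self, email):
        """Drop a cached entry, e.g. after an academic edits their profile."""
        self._store.pop(self._email_to_user[email.lower()], None)


calls = []


def fake_fetch(username):
    # Stand-in for the real API request; records each call it receives.
    calls.append(username)
    return "<publications for %s>" % username


cache = ProfileCache(fake_fetch, {"a.person@example.ac.uk": "aperson01"})
cache.get("A.Person@example.ac.uk")
cache.get("a.person@example.ac.uk")   # second request is served from the cache
print(len(calls))
```

A fixed expiry time is the simplest refresh policy; the `invalidate` method hints at the alternative of refreshing on demand when a profile is known to have changed.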


An institutional tangram – musings on developing an integrated research management system

“The tangram (Chinese: 七巧板; pinyin: qī qiǎo bǎn; literally “seven boards of skill”) is a dissection puzzle consisting of seven flat shapes, called tans, which are put together to form shapes. The objective of the puzzle is to form a specific shape (given only an outline or silhouette) using all seven pieces, which may not overlap.”

http://en.wikipedia.org/wiki/Tangram

Having implemented an institutional repository at Leeds Metropolitan and learned by experience some of the difficulties associated with advocacy around the use of that repository (both for OA research and OER) I have become all too aware “that repositories are ‘lonely and isolated’; still very much under-used and not sufficiently linked to other university systems”. So said JISC’s Andy McGregor at an event called “Learning How to Play Nicely: Repositories and CRIS” in May 2010 at Leeds Metropolitan (see my report for Ariadne here). This quote is still relevant, though perhaps a little less so than when I heard it nearly 2 years ago, thanks to the ongoing work of JISC and particularly the RSP. In any case, the event was a revelation for me and I have coveted a so-called Current Research Information System (or CRIS for short) ever since!

And now, in Symplectic Elements, I have one…or at least the components of one (click on image for full size.)

The finished tangram? (click on image for full size)

It’s a puzzle though. A tangram if you will…one with considerably more than seven pieces:

intraLibrary, Symplectic, institutional website, University Research Office (URO), faculty research administrators, The Research Excellence Framework (REF), academic staff, web-developers, bibliographic information, research outputs, Open Educational Resources (OER)…

In fact, this may well not be all the pieces…pretty sure a few have been pushed down the back of the settee. I’ll look for them later.

Anyway, tortured metaphors aside, I have become increasingly aware that working in a large institution, in a role that encompasses technology and institutional policy (though I’m not, by any means, a policy maker…or indeed a real techie) is largely about communication and getting the right people, with the right skills, in the right place at the right time! Absorb policy and technical requirements from senior stakeholders and communicate those requirements to the proper techies – while also trying to ensure any motivating passions of one’s own don’t get lost along the way – Open Access to research and Open Education in my case.

For various reasons, individual user accounts have never been implemented for our repository and historically it has been administered centrally from the Library. In Symplectic we now have a system that is populated with central HR data; all staff will have an account they can access with their standard user name and password from where they can manage their own research profile including uploading full-text outputs directly to the repository*. In addition, administration by the University Research Office and faculty research administrators will be more easily centralised (particularly for the REF).

* In actual fact this functionality is not yet available, pending development work from Intrallect to capture the Atom feed from Symplectic and transform it with XSLT to a suitable format for intraLibrary. I think.

One of the clever bits of functionality used to sell the software is automatic retrieval of bibliographic data from online citation databases – we are currently running against various APIs: Web of Science (lite), PubMed and arXiv – but I think this may actually be a bit of a red herring for an institution like Leeds Metropolitan, at least until more (preferably free) data sources are available (JournalTOCs API please!). Early testing has shown that, at best, it will only retrieve a subset of (the types of) outputs that we will need to record. It will therefore be necessary to manually import existing records (e.g. from EndNote) as well as implementing other administrative procedures at faculty level to capture information at the point of publication, especially for book-items, monographs, conference material, reports and grey literature.

More important, I think, to ensure that academic staff actually engage with the software rather than just seeing it as a tool for administrators, is to re-use the data to generate a list of research outputs – a dynamic bibliography – on a personal web-profile which has the potential to dramatically increase the visibility of research including Open Access to full-text.

Developing staff profiles of this type has been something of an obsession of mine for a while; we explored doing so from the repository (using SRU and email address as a Unique Identifier) and did develop a working prototype. Symplectic, however, integrated with central HR data and with its more sophisticated API, should make it much easier, at least from a technical perspective, and we are currently liaising with the central web-team to develop something similar to this example from Keele University – http://www.keele.ac.uk/chemistry/staff/mormerod/ (like us, Keele run Symplectic alongside intraLibrary.)

N.B. From the Symplectic interface, a user is able to “favourite” a research record and a flag comes out in the xml from the API which I understand is used on this page to display “Selected Publications”. DOI is also available from the API to link to the published version and if a user uploads full-text to the repository from Symplectic, this link is also in the xml – the first two records on this page include links to the full-text in Keele’s intraLibrary repository.

Our own Library web-dev Mike Taylor has been looking at the Symplectic API in detail and has put together a couple of prototype pages on a development server and after a meeting this week with a representative of the central web-team I’m reasonably confident we can move forward with this work fairly quickly…though there’s still a bit of a chicken & egg situation in populating the Symplectic database to then be re-surfaced via the API in this way.

There is also the question of whether we might alter our repository policy to become full-text only. One limitation of repositories across UK HE, measured against their original conception (in the arXiv mould) of holding, disseminating and preserving full-text research outputs, is that they have in effect become “diluted” by metadata records for which it has not (yet) been possible to procure full-text or for which copyright does not permit deposit. “Hybrid” repositories like ours, of full-text and metadata, typically contain more metadata records than full-text (see figures from the RSP survey here). As I have argued on the UKCoRR blog, I think it makes sense to separate a bibliographic database (in Symplectic) from full-text only in a repository.

N.B. As Symplectic does not have the same search functionality as the repository, this approach has the potential disadvantage that it makes it more difficult to search across the entire corpus of research records, though one potential solution may be along the lines of that implemented by City Research Online which, in my view, is rapidly becoming an exemplar of a research management system (Symplectic) + full-text repository (EPrints). Another good example is St Andrews (PURE + DSpace) who presented a case study at “Learning How to Play Nicely: Repositories and CRIS” (video here).

And what of OER? Along with our EasyDeposit SWORD interface, using OER to resource the refocus of the undergraduate curriculum and the soon to be released intraLibrary 3.5 that will enable us to harvest OER from other repositories…for now I think they may be the bits down the back of the settee…

Infrastructure schematic (1st draft)

There are several significant developments that will impact on our repository / research management / OER dissemination and discovery over the next 12 months or so…briefly these are:

This is a quick schematic of how the developing infrastructure might look (a bit big to fit in my WordPress theme so click on image for full size):

BiblioSight project recommended for funding

We’ve just learned that we’ve been successful in our most recent funding bid to JISC’s Rapid Innovation call.

Outline project description:

“The project will aim to exploit the Web of Science Web Services API that uses standard transport protocols, such as HTTP, and message formats, such as SOAP and XML, to facilitate the exchange of data between Web of Knowledge and a custom application. It will build on work undertaken by the JISC funded SUE project, Implementing an Institutional Repository for Leeds Metropolitan University to integrate bibliographic information from Web of Science into the Leeds Met Open Access repository of research; this will facilitate automatic update when a published article appears in Web of Science. The aim is to integrate the technology into an efficient workflow to populate the repository with citation information / full text; we will also build on work undertaken by the JISC funded PERSoNA project and aim to develop a ‘widget’ that can easily be added to a personal environment like iGoogle or personal/communal environment like netvibes and that will extract bibliographic information – and potentially also bibliometrics – for authenticated Leeds Met staff in line with Web of Science licensing.”

Repositories for research and teaching/learning material: The debate continues at #rpmeet

[Image: diagram of the programme structure]

Last week I attended the JISC Repository and Preservation end of programme meeting in Birmingham. I recall being very nervous at my first JISC event in November 2007 but feel much more at ease now and enjoyed the event immensely; the programme has certainly been successful in fostering a sense of community, though it’s an unusual social experience to meet people face to face, often for the very first time, when one feels one already knows them from reading their blog and following them on Twitter.

During one of the breakout sessions on the first day I made a bee-line for a discussion about repositories for learning and teaching materials – as opposed to OA research repositories. I use the word “opposed” advisedly as there is certainly some strong sentiment around the issue, particularly with respect to using a common software platform. As a representative of a project that is adapting a learning object repository to also serve as an effective Open Access research repository I’m finding it a little difficult to understand the vehemence of some of this opposition, though I would be the first to acknowledge a steep learning curve and recognise that we have required extensive development, not of intraLibrary itself perhaps, but of an appropriate web infrastructure surrounding it. And yes, we would certainly have been able to implement a functioning OA research repository more quickly using EPrints or DSpace. However, from the outset, it was vital that our repository had the capacity to fulfil its broader potential – in the words of Clifford Lynch: “[A] mature and fully realised institutional repository will contain the intellectual works of faculty and students – both research and teaching materials – and also documentation of the activities of the institution itself in the form of records of events and performance and of the ongoing intellectual life of the institution.” [Lynch, Clifford A. “Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age.” ARL Bimonthly Report 226 (2003).]

It’s also important to be pragmatic.  Historically, Leeds Metropolitan University is a polytechnic that gained chartered university status in 1992; its heritage is very much in teaching and learning rather than research with, arguably, a more vocational than academic flavour.  In recent years, the research profile has steadily increased, culminating in unprecedented success in the 2008 RAE and the university is naturally keen to capitalise on this success, enhance its research profile further whilst also continuing to emphasise its student focussed teaching and learning credentials. The implementation of an integrated repository to support both research outputs and learning objects reflects this dual focus.  Clifford Lynch’s article suggests that the concept of a central system to manage disparate resources in this way has been implicit within the sector for some years, however, the technology has tended to focus on Open Access to research, with the two most widely used software platforms being EPrints, developed at the University of Southampton in 2000, and DSpace, developed at MIT in 2002; early versions of both platforms were primarily designed to manage text based resources (though subsequent versions of EPrints and DSpace can manage a wide range of digital file formats.)  

NB. In an extended discussion on this issue on JISC-REPOSITORIES (archive here) RepositoryMan Les Carr of EPrints refers to the fact that he still comes across the firmly held (and spurious) belief that because EPrints is used for Open Access it can’t be used for multimedia files or scientific data.

The session was chaired by Amber Thomas of JISC and I asked a somewhat blunt, perhaps naive, question about JISC’s perspectives on combined repositories of research and teaching materials. Amber suggested that JISC have been deliberately neutral on the issue, which is also perhaps emphasised by the diagrammatic representation of the programme structure reproduced above.

Some of the commentators last Wednesday were adamant that, though it may well be possible to manage different types of resources with a single system, it was far from desirable, with one colleague making the pithy analogous observation that you can write letters in Excel but that doesn’t make it right. Phil Barker of CETIS was also at the discussion and in a recent blog post on the “question of whether research outputs and learning materials should stored in the same repository” is “inclined to think the answer is no, the purpose of the repository is different, a learning material isn’t an output, sharing means something different for the two resource types.” Phil goes on to say that “If you think a repository is a database and a bit-store then you may come to a different conclusion, but I think a repository is a service offered to people and your choice of starting point in offering that service will affect how easy your journey is.” (Full post here)

I’d certainly concede that our journey hasn’t been an easy one and I also agree that a repository is a service offered to people and with our repository start-up, and also Streamline and PERSoNA, that is certainly the approach we have tried to take; with intraLibrary and the SRU interface we now have an incipient infrastructure to manage both research material and learning objects; the discrete types of material can be managed entirely separately, however, there is also potential for the ongoing development of a holistic approach to the management of the full range of digital resources produced by a modern university and as we develop our infrastructure further I hope we can utilise appropriate web-technology around a central management system (intraLibrary) to achieve decentralised resource discovery – through appropriate interfaces, widgets and environments – the VLE for example.

[Image: JISC meeting 09 poster]

Then of course there is the small matter of persuading academics to part with their resources, not to mention IPR, copyright and quality control issues…

Open Access to research is an evolving paradigm and represents a considerable shift in the established academic publishing process; Open Access to a broader range of educational resources still more so. Any paradigm shift is likely to take time to evolve and Open Access, to research and other materials, is no exception, especially given that academia, perhaps, tends to subscribe rather strongly to established tradition!

JISC’s current OER programme should go some way to addressing many of these issues but infrastructure is the foundation. The perfect system almost certainly doesn’t exist and it’s surely important to be pragmatic when implementing and developing an appropriate system. Here’s to ongoing discussion, debate and development.

OER project: Unicycle

As mentioned in a previous post, colleagues at Leeds Met have recently been successful in the JISC call for the Open Educational Resources programme. Simon Thomson, the project manager for Unicycle, has given me an overview of the project and how the repository will contribute to project deliverables. In essence we need to make 360 credits worth of content locally and publicly available – both via our own repository and JORUM Open. This will equate to approximately 3600 hours of material and Simon already has some ideas of where this will come from – CETL workshops for example. Unicycle will explicitly repurpose / share existing material; it will not create new material.

Simon hopes to assemble an “editorial” team comprising an academic representative and a learning technologist from each of the 6 Leeds Met faculties; Simon and I will also be members of this team, which will convene every month to assess / quality check potential content. My job will be to ensure material is in a suitable format for ingest and appropriately tagged with metadata; to get stuff IN, ensure that it is discoverable and can be got back OUT! In the first instance I anticipate cataloguing resources against the JACS system and using the JORUM metadata template already in place; this would seem sensible in view of the fact that the same resources will also be stored in JORUM Open and it will certainly be desirable to liaise with that service throughout the project.

N.B.  Rather than dual deposit in this way, might JORUM explore harvesting open content from our repository / other repositories of Open Educational Resources?

Another crucial area, of course, is the licensing issue; both Simon and I anticipate using some flavour of Creative Commons and again this is an area that will benefit from liaison with JORUM – especially in view of their evolving 3 licence model.

On a more technical note I will also be very interested to see how JORUM will be facilitating open search functionality.  Currently there are a series of RSS subject feeds at http://www.jorum.ac.uk/support/rssfeeds.html#subjectfeeds but these still need authentication to access the resources; presumably they will need to implement some sort of portal based on OAI-PMH or SRU – might they also look at searching other repositories (like ours!) using OAI-PMH for example?

Lack of incentive for sharing is recognised as a problem in the context of reusable learning objects and another crucial element of the project will be to identify / implement reward and recognition policies though cultural change with respect to OERs will no doubt be a long term process both institutionally and within HE as a whole.

Google indexing and SEO

It is crucial that both the Open Access full text research content of the repository and metadata records of citation material are fully indexed by Google (and other search engines); in the future it is also likely to be required for other Open Educational Resources (learning objects). However, site:http://repository-intralibrary.leedsmet.ac.uk/ currently returns just 4 results (in addition to the Login page itself) and it is a bit of a mystery how these 4 are actually being picked up when the majority of records are not.

In intraLibrary, for a given collection, the administrator may choose to:

• Allow published content in this collection to be searched by external systems

This effectively means SRU (Search/Retrieve via URL), a standard search protocol utilising CQL (Contextual Query Language, formerly known as Common Query Language).

• Allow published records in this collection to be harvested by external systems

This effectively means harvest by OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting).
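For anyone unfamiliar with the two protocols, both boil down to plain HTTP requests with a handful of standard parameters. The Python sketch below builds one of each. The base URLs are hypothetical (the real intraLibrary endpoint paths are not given here) and the `dc.creator` index is an assumption about what a given server supports; the parameter names themselves (`operation`, `version`, `query`, `maximumRecords`, `verb`, `metadataPrefix`) come from the SRU and OAI-PMH specifications.

```python
from urllib.parse import urlencode

# Hypothetical endpoints: the real intraLibrary paths are not given here.
SRU_BASE = "http://repository.example.ac.uk/sru"
OAI_BASE = "http://repository.example.ac.uk/oai"


def sru_search_url(cql_query, maximum_records=10):
    """Build an SRU searchRetrieve request for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return SRU_BASE + "?" + urlencode(params)


def oai_list_records_url(metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for harvesting."""
    return OAI_BASE + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )


# The dc.creator index is an assumption about what the server supports.
print(sru_search_url('dc.creator = "Morton"'))
print(oai_list_records_url())
```

The difference in purpose shows in the requests: SRU answers a specific query, whereas OAI-PMH hands over records in bulk for another system to index.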

XML Sitemaps

Intrallect have suggested that it is necessary to implement an XML sitemap to ensure that content is properly crawled by Google. Until 2008, Google did support sitemaps using OAI-PMH but have since withdrawn this and now support only the standard XML format. Intrallect have therefore developed a software tool that converts OAI-PMH output to an appropriate XML format. A sitemap has been generated and registered using Google’s webmaster tools but it is currently registering a series of errors that indicate “This URL is not allowed for a Sitemap at this location”. 9 errors are listed, running sequentially from the very first URL; it seems that the crawl does not go any further and none of the 100+ URLs in the sitemap have been successfully recognised. Two possible reasons have been suggested for this:

• All of the URLs in the sitemap are external; it may be that Google does not permit URLs outside the mapped domain.
• There is a problem with the XML itself

Sitemap here: http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml
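On the first of the two suggested reasons, my understanding of the sitemaps.org protocol is that a sitemap may only list URLs on the same host, at or below the directory that holds the sitemap file, so a file at /sitemap/Sitemap.xml would only be allowed to cover URLs starting with /sitemap/. The Python sketch below applies that rule. The page URLs in it are hypothetical paths invented to exercise the check, not the real contents of our sitemap, and Google may apply further rules of its own.

```python
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url is within the scope of a sitemap at sitemap_url.

    Under the sitemaps.org protocol a sitemap may only list URLs on the same
    scheme and host, at or below the directory that holds the sitemap file.
    """
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False
    # Directory containing the sitemap, always ending in "/".
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    return page.path.startswith(directory)


SITEMAP = "http://repository-intralibrary.leedsmet.ac.uk/sitemap/Sitemap.xml"
# The page URLs below are hypothetical paths used only to exercise the check.
for url in [
    "http://repository-intralibrary.leedsmet.ac.uk/sitemap/page.html",
    "http://repository-intralibrary.leedsmet.ac.uk/records/item01",
    "http://repository.leedsmet.ac.uk/main/search.php",
]:
    print(allowed_in_sitemap(SITEMAP, url), url)
```

If this is indeed the cause, moving the sitemap file to the root of the host that actually serves the records would be the obvious thing to try.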

Sitemaps using RSS

It is also possible to submit a sitemap based on RSS, however, this approach has not been any more successful as the Open URL/virtual file paths generated by intraLibrary are inaccessible to Google resulting in the following warning:

URLs not followed
When we tested a sample of URLs from your Sitemap, we found that some URLs redirect to other locations. We recommend that your Sitemap contain URLs that point to the final destination (the redirect target) instead of redirecting to another URL.
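One way to act on that recommendation would be to resolve each URL to its final destination before writing it into the sitemap. A minimal Python sketch follows; the helper names `final_url` and `redirecting_urls` are my own, and it assumes the redirects can be followed without a login, which may well not hold for intraLibrary's virtual file paths.

```python
import urllib.request


def final_url(url, opener=urllib.request.urlopen):
    """Follow redirects and return the URL that finally served the content."""
    # urlopen follows HTTP redirects itself; geturl() reports where it ended up.
    with opener(url) as response:
        return response.geturl()


def redirecting_urls(urls, opener=urllib.request.urlopen):
    """Map each URL that redirects to its final destination."""
    moved = {}
    for url in urls:
        target = final_url(url, opener)
        if target != url:
            moved[url] = target
    return moved
```

Calling `redirecting_urls` on the list of sitemap URLs would return only those that redirect, paired with the address that should replace them. The `opener` argument is there so the function can be tried out against a stand-in without touching the network.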

Google and SRU

Though SRU does not facilitate indexing by Google per se, the integration of the SRU Open Search interface may provide a potential solution. site:http://repository.leedsmet.ac.uk/ currently returns 247 records; largely these appear to represent Googlebot following the various browse links (many of which themselves return no results where there is no content to find!) In addition, Googlebot appears to be following hyperlinked author names, publisher and subject(s) in the individual metadata records:

[Screenshot: Google results for site:http://repository.leedsmet.ac.uk/]

The third of these “The Repository search for Morton, Veronica” links to the two metadata records associated with that name as though it had simply been entered into http://repository.leedsmet.ac.uk/ as a search term:

http://repository.leedsmet.ac.uk/main/search.php?q=Morton%2C+Veronica+

Presumably these records were initially indexed via the appropriate links on the browse interface – http://repository.leedsmet.ac.uk/main/browse.php (Faculty of Health and R – Medicine) – and then re-indexed via the hyperlinks embedded in the metadata records. It is interesting to note that, though Morton, Veronica only has two records associated with her name, this record appears relatively high – at the top of the second page – and this is probably because there are so many other authors also associated with these papers; all of these names are hyperlinked, giving over 21 separate indexable links.
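The search URL quoted above is nothing more than the author name form-encoded into the q parameter, which is why every hyperlinked name becomes a distinct, crawlable address. A small Python sketch of the encoding (the function name is my own, and the trailing space in the sample name is an assumption inferred from the trailing "+" in the URL):

```python
from urllib.parse import urlencode, parse_qs, urlsplit

SEARCH_PAGE = "http://repository.leedsmet.ac.uk/main/search.php"


def author_search_url(author):
    """Build the search URL that a hyperlinked author name points to."""
    # urlencode turns the comma into %2C and each space into "+".
    return SEARCH_PAGE + "?" + urlencode({"q": author})


url = author_search_url("Morton, Veronica ")
print(url)
# Decoding the query string recovers the original search term.
print(parse_qs(urlsplit(url).query)["q"][0])
```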

It seems that we might need to formalise the structure of the SRU to ensure it is optimised for Google, possibly with some sort of SRU sitemap. For example, we could generate a page that links to all the individual metadata records in the repository and optimise this page to be crawled by search engine spiders (it doesn’t need to be human readable; it could be XML), which could then follow the links to the associated metadata.
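Such a page could simply be a standard XML sitemap listing one URL per metadata record. Here is a minimal Python sketch of generating one. The record URLs are hypothetical (the real pattern for an individual record page on the SRU interface is an assumption) and this is not the conversion tool Intrallect have developed.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return a sitemaps.org XML document listing one <url> per record."""
    # Serialise the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for record_url in record_urls:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = record_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs: the real record page pattern is an assumption.
records = [
    "http://repository.leedsmet.ac.uk/main/view_record.php?identifier=1&collection=2",
    "http://repository.leedsmet.ac.uk/main/view_record.php?identifier=3&collection=2",
]
print(build_sitemap(records))
```

Building the document with an XML library rather than by string concatenation means characters such as "&" in the record URLs are escaped automatically, which matters because malformed XML is one way for a sitemap to be rejected.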

It also seems to me that Search Engine Optimisation will need to comprise appropriate customisation of the SRU interface; for example, we want to facilitate browse by author which, in turn, will provide indexable links for Googlebot.

Full text indexing

There is also the issue of indexing full text. As already mentioned, Google does not follow the Open URL/virtual file paths generated by intraLibrary and all the results from site:http://repository.leedsmet.ac.uk/ are search results. Potentially this is a benefit in as much as people are less likely to bypass the metadata record and go directly to the PDF but we do also want to facilitate full text indexing. We may have to wait for Intrallect on this who have assured us they are looking into facilitating full text indexing – probably via intraLibrary itself rather than the SRU.