Still baffled by Google…

Just reproducing an email to ukcorr-discuss here in case any technically minded folk not on the list might pass by these parts…

To revisit the whole Google Scholar / full-text indexing “thing”: I was just looking at results in GS for a particular academic who has raised a query about his full text not being visible in Google Scholar. He has six full-text items in the repository, but a site: search of GS appears to return only two:

http://scholar.google.co.uk/scholar?hl=en&q=site%3Ahttp%3A%2F%2Frepository-intralibrary.leedsmet.ac.uk+%22x.+font%22&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0
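For anyone who doesn't fancy reading percent-encoding by eye, here is a minimal Python sketch that unpicks the query string of the GS link above, using only the standard library. The names `GS_URL` and `decode_query` are my own, purely for illustration; nothing here talks to Google, it just decodes the URL.

```python
from urllib.parse import urlsplit, parse_qs

# The Google Scholar search URL quoted above, split over lines for readability.
GS_URL = (
    "http://scholar.google.co.uk/scholar?hl=en"
    "&q=site%3Ahttp%3A%2F%2Frepository-intralibrary.leedsmet.ac.uk+%22x.+font%22"
    "&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0"
)


def decode_query(url):
    """Return the URL's non-empty query parameters as a dict of decoded strings."""
    # parse_qs decodes %XX escapes, turns '+' into a space and,
    # by default, drops parameters with blank values (such as as_ylo here).
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}


params = decode_query(GS_URL)
print(params["q"])
```

The decoded `q` parameter shows the search is simply a site: restriction to the repository host plus the quoted phrase "x. font".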

Initially I thought it might be an artefact of when the full text was added. The records were all added at the same time (24th May 2011), but full text was added at that point for only one of the GS results (plus one that isn't indexed at all; see below), and in October 2011 for all the others (including the other GS result)…and that’s still a good 6 months, which you would think would be long enough to be indexed. Wouldn’t you?

Normal Google, by contrast, returns 4 full-text records:

https://www.google.co.uk/search?hl=en&as_q=&as_epq=xavier+font&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=http%3A%2F%2Frepository-intralibrary.leedsmet.ac.uk%2F&as_occt=any&safe=images&tbs=&as_filetype=pdf&as_rights=

The two results missing even from normal Google are http://repository.leedsmet.ac.uk/main/view_record.php?identifier=4881&SearchGroup=Research (full text added 24th May 2011) and http://repository.leedsmet.ac.uk/main/view_record.php?identifier=4893&SearchGroup=Research (full text added 10th October 2011).

The only other difference I can spot is that several of those not indexed in GS don’t have metadata in the PDF, which is why normal Google has just picked them up as “Leeds Metropolitan University Repository” from the coversheet…

As a caveat, there is a technical peculiarity: we effectively have a two-server setup, with our Open Search interface on an institutional server querying intraLibrary by SRU, while the software itself is hosted for us in a server farm somewhere. That might explain idiosyncratic behaviour to some extent…

Am I missing anything else?!