Tuesday, November 30, 2010

Discovering Discovery: DSpace + Solr tips & tricks


DSpace 1.7.0, which is due for release on December 17th, will include a new module called "DSpace Discovery", contributed by the fine folk at @mire.

Discovery adds the ability to use Apache Solr for search, an XMLUI aspect that replaces (most of) the old 'ArtifactBrowser' to enable easy navigation through configurable facets, and a service to allow external sites to perform searches. In future releases, searching will get even easier as autocompletion is added to search boxes.

It's incredibly easy to set up, and because the Solr index exists alongside your traditional plain-old-Lucene search indices, you can switch back and forth without any hassle: no rebuilds, no re-indexing; just enabling/disabling the relevant XMLUI aspects.

You may have seen similar interfaces on other sites: Solr is being used for generic discovery interfaces like Blacklight, as a full-text search module in Drupal, and as a custom solution for in-house sites.

You've also possibly seen DSpace Discovery in action at Dryad, an international biosciences data repository.

You can read more, including the official documentation and development roadmap, on the DSpace wiki.

I've installed DSpace 1.7, now what do I do?

The Discovery Configuration guide in the DSpace documentation/wiki will get you up and running in no time.

I want to create some custom facets/filters. They don't exist as fields in my metadata registries so I can't easily configure them in dspace-solr-search.cfg. Can I configure Solr directly?

Yes! Let me give you an example:

(note: please excuse and ignore my horrible usage of qualified DC -- it's just an example!)

I've been working on a new repository/archive for the Archive of Māori and Pacific Music at The University of Auckland Library, and we had a few pieces of metadata we wanted to treat differently for the purposes of navigation -- a 3-tier "location" for each recording, which we wanted to combine into a single "Place" facet, and fields for both "iwi of the performer" and "iwi of the composer", which we wanted to combine into a single "Iwi" facet.

(for those outside New Zealand, iwi means 'people', and in this case, refers to Māori tribal affiliation, eg. Ngāti Porou or Tainui)

Here's how the Solr schema for DSpace Discovery is configured for faceted/filtered search:

* Defines a dspaceFilter type, which is a fairly simple Solr field type that converts to lowercase and preserves the entire string as a single token (ie. no splitting on spaces or commas, etc.)

* Copies every metadata value into a dynamic field, named [schema.element.qualifier]_filter, eg. dc.title_filter or dc.identifier.issn_filter
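As a rough sketch of what those two pieces could look like in schema.xml: the attribute values here are my assumptions rather than a copy of the shipped schema, so check your own file. solr.KeywordTokenizerFactory and solr.LowerCaseFilterFactory are the stock Solr classes for "keep the whole value as one token" and "convert it to lowercase".

```xml
<!-- Sketch only: attributes are assumed and may differ from the schema.xml shipped with DSpace -->
<fieldType name="dspaceFilter" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as one token: no splitting on spaces or commas -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- convert that single token to lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- any field whose name ends in _filter is indexed with the dspaceFilter type -->
<dynamicField name="*_filter" type="dspaceFilter" indexed="true" stored="true" multiValued="true"/>
```

With a type like this, a value such as "Hawkes Bay" would be indexed as the single token "hawkes bay", which is why facet values show up in lowercase.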

So we have three tiers of location data that might look something like:

dc.coverage.spatial_country: "New Zealand"
dc.coverage.spatial_region: "Hawkes Bay"
dc.coverage.spatial_locality: "Waipukurau"

Now, we edit [dspace]/solr/search/conf/schema.xml and add the following new field definitions beneath the definitions for internal fields like "search.resourceid":

<field name="spatial_filter" type="dspaceFilter" indexed="true" stored="true" multiValued="true"/>
<copyField source="dc.coverage.*" dest="spatial_filter"/>

This will take all values where schema is "dc" and element is "coverage", and copy them into a new spatial_filter field, which can then be accessed by dspace-solr-search.cfg when configuring your facets/filters.

Note that this particular example would also copy dc.coverage.temporal values, if any existed. A narrower pattern, dc.coverage.spatial*, is strictly better for this example, but the element-wide wildcard shown here is the form that applies to most use cases (eg. dc.subject.*, dc.identifier.*, dc.contributor.*, dc.title.*).
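For illustration, here is a sketch of the narrower pattern, plus the same trick applied to the "Iwi" facet mentioned earlier. The spatial lines would replace the earlier pair rather than sit alongside it. The two iwi source field names are hypothetical placeholders (substitute whichever fields hold those values in your own registry), and I'm assuming your schema already accepts fields with those names.

```xml
<!-- Place facet: only fields starting with dc.coverage.spatial are copied,
     so dc.coverage.temporal is left out -->
<field name="spatial_filter" type="dspaceFilter" indexed="true" stored="true" multiValued="true"/>
<copyField source="dc.coverage.spatial*" dest="spatial_filter"/>

<!-- Iwi facet: HYPOTHETICAL source field names, for illustration only -->
<field name="iwi_filter" type="dspaceFilter" indexed="true" stored="true" multiValued="true"/>
<copyField source="dc.contributor.performer_iwi" dest="iwi_filter"/>
<copyField source="dc.contributor.composer_iwi" dest="iwi_filter"/>
```

Because both iwi sources feed one multi-valued destination, a recording would appear under an iwi in the facet whether that iwi belongs to the performer or the composer.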

Now all that's left is to add our new "spatial" field to our lists of facets and filters in [dspace]/config/dspace-solr-search.cfg, rebuild our discovery index (I recommend deleting and rebuilding when altering schema.xml) and create some new i18n labels for displaying in XMLUI.

DSpace Discovery will surface our new, helpful "Places" facet which we've created without touching our stored metadata or legacy browse/search indices. Check it out:

If we select "new zealand" and "waikato" to filter our results, the Place facet is now going to tell us about places just within "Waikato, New Zealand".


And that's all! The data does most of the work for us, and DSpace Discovery handles the rest.

In DSpace 1.6.x, I could export a CSV containing item metadata from my search results. Is that possible in DSpace Discovery?

Yes, sort of -- I've written an updated CSV exporter for XMLUI to work with Discovery, but it wasn't written in time for 1.7. It should be in the next release, and I will put a patch up on JIRA shortly for those who wish to use it with 1.7.0.

You mentioned the ability for external sites to query DSpace Discovery -- tell me more!


I'd love to, but I haven't played around with it quite enough to feel like I could do this topic justice -- watch this space!

If you have any questions or tips to share about DSpace Discovery or Apache Solr, please send me an email or leave a comment, or hop over to the DSpace Mailing Lists.

Monday, October 25, 2010

GSoC Mentor Summit 2010

(Note: I'll stick up some photos in a followup post once the official ones go up, and I get the stuff off my camera)

So, the GSoC mentor summit was pure awesome. I've had virtually no unconference/barcamp experience, and I knew the level of smarts would be extremely high, so I was a bit nervous at first about being overwhelmed, or that there would be an atmosphere of elitism, or that the whole thing would be a chaotic mess that I wouldn't be able to participate in. My concerns turned out to be entirely unfounded.

Perhaps it's the fact that the sort of people who get into mentoring have a great attitude in the first place, or perhaps I've just been too cynical about 'FOSS personalities'... whatever the case, the unconference worked brilliantly. There were no egos driving sessions, no elitism or flaming; it was all just pure, unadulterated geekery that allowed for participation by everyone and somehow ran like clockwork. I learned heaps, met a whole bunch of cool smart people, and I have my usual post-conference 'vibe' that motivates me to spend even more time hacking and contributing to all sorts of stuff.

I also wasn't sure how many projects would be related in any way to the education/GLAM sector, and was pleasantly surprised there too: I met people working on enhancing text with semantic markup (FISE), some folk developing an open source web conferencing tool made to plug into LMSes (BigBlueButton), the Creative Commons people were there, and many more that have just slipped my mind right now.

Of personal interest were sessions around GIS and managing/manipulating geo-spatial data. I've been doing some mashups and webapp work at home around the new TradeMe API (amongst other things) using Google Maps and geocoding (or reverse geocoding) locations, so it was great to learn some more about OpenStreetMap, PostGIS, OpenLayers and similar tools, as well as the challenges facing developers in data storage and interchange. (I also met the other two Kiwis attending the summit at this session, oddly enough.)

Sessions I attended: (some names are paraphrased since they were just written on a whiteboard)
  • Liberate your data!
  • Distributed systems and security
  • OpenStreetMap routing demo (shortest path) with geofabrik.de
  • Geo-spatial data
  • Anyone can be a great mentor
  • Open Source licensing and copyright issues
  • Final session/feedback
Notes were taken in realtime, en masse, using an Etherpad instance provided by the ever-helpful OSUOSL team (and a few similar tools like TypeWithMe). I'll put notes up once they're available on the wiki or I've saved them somewhere.

As well as attending the unconference sessions, I spent a fair bit of time hacking DSpace with fellow committer Mark Diggory and talking geek with him -- always a good opportunity when most DSpace developers are a whole hemisphere away from me.

I just missed out on the "git for data" session, which was a pity, but I'll take a look at the notes once they're up -- they should be full of goodness.

Post-summit resolutions:
  • Get even more involved in GSoC next year and put the lessons I've learned into practice
  • Start pronouncing "data" properly (I wince every time I hear myself say "dah-tah")
  • Follow up on all the GIS tricks and tools I learned about
  • Introduce BigBlueButton to the NZ e-learning community and any university staff who run webinars
  • Start using my camera instead of leaving it in my damn backpack all the time
  • Blog more (or at least write more)
  • Come back to San Francisco some time

Big props to the Google Open Source Programs Office for running GSoC in the first place (especially Carol and Cat) and for organising a brilliant mentor summit, and to all the org mentors/admins who showed up and made the summit what it was.

ps. If you want to catch a glimpse of what was going down while it was going down, as well as some of the aftermath, take a look at the #gsoc #mentorsummit twitter stream

Obligatory first post

I finally wrote something longer than 140 characters that I felt like sharing, and rather than try to resurrect my old WordPress blog which was more of a travel diary than anything else, I figured I'd keep this one just about work and other geeky activities, and let Blogspot do all the hosting for me.
That may change, who knows.