Final Progress Post

September 7th, 2010

gMan: From research data repositories to virtual research environments: (re-)activating archival knowledge for the Humanities

Studies of humanities scholars have demonstrated that they continue to rely on primary materials held in dedicated collections in special places, in repositories and archives, and it is in repositories (and archives) that the scholar carries out the work of assessing these source materials. In the UK and elsewhere there are significant digitisation programmes for humanities material, which to an increasing extent are able to provide the humanities researcher with digital surrogates for the physical archives. In some cases major memory institutions are systematically digitising the material for which they are responsible, but nevertheless digitisation is on the whole a somewhat piecemeal affair, and is carried out to different extents (e.g. image only or image plus OCR) and quality levels, depending on the availability of funds. Individual projects may address a particular set of archival material relating to a particular research topic, resulting in numerous dispersed (albeit usually online) resources, developed using different technologies and standards. Archival material is thus made easier to access, creating new possibilities for the researcher, but on the other hand this very availability raises new issues.

Our work sets out to investigate how (digital) repository content can be delivered to humanities researchers more effectively, independently of the location and implementation of that content, and with special means provided for customising the retrieval, management and manipulation of this information. Traditional finding aids are to be complemented by more sophisticated retrieval means. In particular, the personal copy of a finding aid that is often quoted as an important prerequisite for specialised research in archives is complemented by the ability to create on demand relevance indexes on the unstructured resources, and to combine the resources in new ways. We consider this to be the grand integration challenge for research repositories in the humanities, delivering data-driven humanities.
Our starting point was D4Science, a production-level infrastructure that serves mainly scientific communities but is not biased towards any particular discipline, and which has great potential for meeting the needs we have identified for building VREs by combining repository resources. gCube, on which the infrastructure is based, is a distributed, service-based system designed to support the full life-cycle of modern research, with particular emphasis on application-level requirements for information and knowledge management. In gCube, VREs can be interactively designed and configured on demand, and the system is responsible for their physical deployment and correct operation in the infrastructure. Computational resources are exploited for computationally demanding tasks such as on-demand indexing of large collections.
We have been investigating how humanities repository resources can be imported into gCube, and how the VRE can be enhanced with further services according to the needs of the targeted research community. The gCube system is designed for extensibility; communities are encouraged to tailor the functionality to their particular needs by developing new services or plugins. We have focused on importing existing Humanities research collections, of which there are plenty, often held in databases hidden behind web front ends. gCube provides a well-defined archival import service, which is of great use here: the Humanities have many existing collections produced in various digitisation and online analysis projects, and at the end of such projects it is often difficult to reuse these collections for future collaborative research. In gMan, we have shown a possible way forward based on scientific research infrastructures.

Some of the services are better explained in the screencast we produced: http://gman.cerch.kcl.ac.uk/?p=105

And some example collections we used to show the potential of gMan services: http://fresh.cerch.kcl.ac.uk/collections/

Link to technical documentation of gCube: https://wiki.gcore.research-infrastructures.eu/documentation/index.php/GCube_Wiki

The import scripts we used can be directly reused from within the gMan infrastructure. If you have further questions about this, please contact us directly.

Date prototype was launched at Digital Humanities 2010: 08/07/2010

Website of the gMan service: http://portal.d4science.research-infrastructures.eu/ If you want to play with it, please contact us. We had to protect it with a password, as there is some expensive infrastructure involved, but access is generally available to anybody.

Project Team Names, Emails and Organisations: http://gman.cerch.kcl.ac.uk/about


If you are interested in details and further collaboration, please contact Tobias Blanke.

Screen Cast

August 27th, 2010

This is the first gMan screencast, demonstrating browsing a collection, gathering information objects (IOs) into a new personal virtual collection, and annotating some of the IOs. The annotations include both text notes and relationships to other IOs.

gMan: Browsing and Annotation from gMan Project on Vimeo.

Advanced Search Queries

August 27th, 2010

In gMan we have a number of gCube’s search tools [1] available to us, as can be seen below, including browsing a single collection, a full text search in the guise of Simple Search, and the two we’ll initially look at here: Combined Search and Refine previous search.

A Combined Search can perform either a logical OR (match any condition) or a logical AND (match all conditions) on all fields used in the search query, for example:

Give me all the Information Objects (IOs) that have “Gaius Julius” in the Title field, or “-0050” in NotBefore or “-0001” in NotAfter [2]

Give me all the Information Objects (IOs) that have “Gaius Julius” in the Title field, and “-0050” in NotBefore and “-0001” in NotAfter [2]
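
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the matching logic, not gCube code: the `combined_search` function, the representation of an IO as a dictionary and the sample records are all assumptions made for the example, and the date fields are matched as plain strings.

```python
def combined_search(records, conditions, mode="AND"):
    """Return the records that satisfy the field conditions.

    records    -- list of dicts mapping field name to string value
                  (a hypothetical stand-in for IO metadata)
    conditions -- dict mapping field name to the text that must occur in it
    mode       -- "AND" (match all conditions) or "OR" (match any condition)
    """
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    return [
        rec for rec in records
        if combine(text in rec.get(field, "") for field, text in conditions.items())
    ]


# Invented sample data, for illustration only.
ios = [
    {"Title": "Gaius Julius Caesar", "NotBefore": "-0050", "NotAfter": "-0001"},
    {"Title": "Gaius Julius Zoilos", "NotBefore": "0001", "NotAfter": "0050"},
    {"Title": "Marcus Antonius", "NotBefore": "-0050", "NotAfter": "-0030"},
]
query = {"Title": "Gaius Julius", "NotBefore": "-0050", "NotAfter": "-0001"}

matches_all = combined_search(ios, query, mode="AND")  # every condition must hold
matches_any = combined_search(ios, query, mode="OR")   # one condition is enough
```

With this sample data the AND search returns only the first record, while the OR search returns all three, since each of them satisfies at least one condition.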

We can add more conditions in a second sub-search by using ‘Refine previous results’. This only queries the result set from a previous search, but with a little careful forethought it does allow ANDs and ORs to be combined with the previous combined search. However, we cannot combine the previous search (Simple Search [3] or Combined Search) with a new collection, or even remove a collection. Neither can we use the results from one query as keywords in a second query.

refine.jpg

Although a Refined Search is slightly complex to set up, these types of search are pretty straightforward. A more interesting search query might look something like:

Get all the IOs (A) from collection X where K appears in field R. Now use the unique values in field S (of A) as queries set (L) against field T in collection Z and give me the final Search Results (B).

….example in set notation….

This sort of search would, for example, allow us to look up variations in the spelling of a name in one collection and then look for any matches in a second collection. An interesting aspect of this form of search is that the collections used need not have matching schemas, as we have two separate queries with different inputs, fields searched and outputs.
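
As a rough sketch of what such a chained query would do, here it is in Python. Nothing here is an existing gCube feature: the function names, the dictionary records and the two tiny sample collections are invented for illustration, and matching is simple substring matching.

```python
def search(collection, field, keyword):
    """Return the records of a collection whose `field` contains `keyword`."""
    return [rec for rec in collection if keyword in rec.get(field, "")]


def chained_search(coll_x, field_r, keyword_k, field_s, coll_z, field_t):
    """Run the two-step query described above and return the result set B."""
    # Step 1: A = records of X where K appears in field R.
    set_a = search(coll_x, field_r, keyword_k)
    # Step 2: L = unique, non-empty values of field S across A.
    queries_l = {rec[field_s] for rec in set_a if rec.get(field_s)}
    # Step 3: B = records of Z where field T contains any value in L.
    return [
        rec for rec in coll_z
        if any(q in rec.get(field_t, "") for q in queries_l)
    ]


# Two invented collections with different schemas.
names_x = [
    {"name": "Zoilos", "variant": "Zoilus"},
    {"name": "Zoilos", "variant": "Zoilos"},
    {"name": "Attalos", "variant": "Attalus"},
]
texts_z = [
    {"text": "Dedication by Zoilus, priest of Aphrodite"},
    {"text": "Letter of Attalus"},
    {"text": "Honours for Zoilos"},
]

results_b = chained_search(names_x, "name", "Zoilos", "variant", texts_z, "text")
```

Here the spelling variants found in the first collection ("Zoilus" and "Zoilos") become the query set for the second, which returns the first and third texts even though the two collections share no field names.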

What other advanced searches could we see in gCube that would be useful for the Arts & Humanities?


[1] Details on the different search tools can be found on the D4Science wiki.

[2] This is an example using the indexed EpiDoc fields. Please see a previous blog entry for date range issues. Furthermore, negative numbers are currently not supported in these fields.

[3] A Simple Search is a full text search across all text fields.

Dates Use Case

August 9th, 2010

A key Humanities use case for data discovery is querying for date ranges, for example:

  • “Give me all the documents in the month of September 1939”
  • “Are there any records that refer to Carthage before 264 BC?”
  • “Return all the records between 1225 and 1350 that refer to the Magna Carta”
  • “Are there any obituaries for Oscar Wilde after his death in 1900 in these collections?”
  • “Get all inscriptions that were possibly created between 50 B.C. and 100 A.D.”

This is further complicated by the fact that the digital record may not have a fixed date (dd-mm-yyyy) but instead a date range that is considered a possible period for the creation of the original physical object. For example:

  • An inscription in the IAph collection may have a date range of <date notBefore="0034" notAfter="0066" exact="none">Early first century A.D.</date>
  • A book may have a number of publication dates, depending upon the version and where it was published
  • A flint arrow head’s date is measured in epochs.

In gMan we can only match exact dates. For example, we can do a combined search with values for the fields notBefore (100) and notAfter (200), but it will only find records with exactly those values in the metadata of the information object, i.e. notBefore == 100 & notAfter == 200.

For gCube to fully support the research questions humanists might have, it needs more comprehensive date handling tools. Import scripts could help by generating dates in a format that is compatible with such tools. An interim solution using date conversion on import could be investigated in gMan (or a future project) by creating a numeric value for the year values that may occur in the metadata and using numeric comparisons (rather than string comparisons) in a combined search. This requires careful consideration by the researcher when creating a query to ensure that the date range required is part of the search criteria.
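
To make the interim idea a little more concrete, here is a minimal Python sketch of numeric year conversion and a range-overlap test. It is not something gMan or gCube currently does: the function names and sample records are invented, years are treated simply as signed integers, and how BC years should be numbered is a convention that would need to be agreed for each collection.

```python
def year_to_int(value):
    """Convert a year string such as "0034" or "-0050" to a signed integer."""
    return int(value.strip())


def overlaps(record, query_from, query_to):
    """True if the record's [notBefore, notAfter] range overlaps the query range."""
    start = year_to_int(record["notBefore"])
    end = year_to_int(record["notAfter"])
    return start <= query_to and end >= query_from


# String comparison gives the wrong order for negative or unpadded years:
# "-0100" < "-0050" is False and "100" < "34" is True, whereas numerically
# -100 < -50 is True and 100 < 34 is False.

# Invented records, for illustration only.
records = [
    {"id": "a", "notBefore": "0034", "notAfter": "0066"},
    {"id": "b", "notBefore": "-0100", "notAfter": "-0060"},
    {"id": "c", "notBefore": "0150", "notAfter": "0200"},
    {"id": "d", "notBefore": "-0060", "notAfter": "0020"},
]

# "Possibly created between 50 B.C. and 100 A.D.", written as signed years.
possible = [r["id"] for r in records if overlaps(r, -50, 100)]
```

Records "a" and "d" are returned because their possible date ranges intersect the query range, even though neither has notBefore or notAfter values equal to the query bounds.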

If gCube is to be widely used in the Arts & Humanities then the development of a date range portlet, or changes to the combined search, should be considered as a key development objective.

footnote: There is one further issue to consider, and that is the difference between the Julian and Gregorian calendars. Although the use case for conversion between the two may be rare (an astronomer looking at pre-Gregorian archives for the appearance of a certain phenomenon is an example), there is a likely need where the content of collections crosses the boundary between the two systems, which occurred at different times in different countries. It may be desirable to keep this problem in mind and address it only when it becomes an issue with specific imported collections.

Digital Classicist Seminar – the Results

July 30th, 2010

Slides from the seminar are available (note the similarity to some of Mark’s slides from Open Repositories), as is a podcast. There were some difficulties: there was no Firefox on the presentation PC and, as this is a must for gCube, there was no live demo of gMan, which is a shame.

Open Repositories

July 22nd, 2010

I (Mark) attended Open Repositories 2010 in Madrid, and I gave a talk on gMan in one of the main sessions (a session on repository infrastructures). The session was chaired by David Flanders from JISC.

Open Repositories aims “to bring together individuals and organizations responsible for the conception, development, implementation and management of digital repositories, as well as stakeholders who interact with them, to address theoretical, practical, and strategic issues”, to quote from the conference website.

Here is the presentation. I split it up into three parts as it was too big for WordPress.

gMan-OR2010-Hedges-slides-1-10

gMan-OR2010-Hedges-slides-11-20

gMan-OR2010-Hedges-slides-21-29

All three initial collections are in!

July 19th, 2010

Finally, the Projet Volterra collection (Roman laws) has been imported into gMan and indexed. As expected, this caused some difficulties, which would be the same for any collection originally held in an SQL database. The original data was imported into a MySQL database and the xmldump command used to output some form of XML. The tags in the XML mark-up were badly formed because the original column names in the database contained spaces. Also, the XML dump was a single file, whereas the gCube import prefers one file per record. These were problems that were easy to solve.
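
For anyone facing the same two problems, a clean-up along these lines can be sketched in Python. This is not the script we used: the function names, the file naming pattern and the column names in the sample rows are invented for illustration, and it assumes the rows have already been read from the database as dictionaries.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def sanitise_tag(column_name):
    """Turn a database column name into a usable XML element name."""
    tag = re.sub(r"[^A-Za-z0-9_.-]", "_", column_name.strip())
    # Element names may not start with a digit, hyphen or full stop.
    if not tag or not (tag[0].isalpha() or tag[0] == "_"):
        tag = "_" + tag
    return tag


def write_records(rows, out_dir, root_tag="record"):
    """Write one XML file per row and return the list of file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, row in enumerate(rows, start=1):
        root = ET.Element(root_tag)
        for column, value in row.items():
            if value is None or value == "":
                continue  # leave out empty fields (a choice made for this sketch)
            ET.SubElement(root, sanitise_tag(column)).text = str(value)
        path = out / f"{root_tag}_{number:05d}.xml"
        ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


# Hypothetical rows; the column names are made up for the example.
rows = [
    {"Law ID": 1, "Emperor name": "Constantine", "Notes": ""},
    {"Law ID": 2, "Emperor name": "Theodosius II", "Notes": "fragment"},
]

with tempfile.TemporaryDirectory() as tmp:
    files = write_records(rows, tmp)
    first_tags = [child.tag for child in ET.parse(str(files[0])).getroot()]
```

Each row ends up in its own well-formed file, with "Law ID" written as the element `Law_ID` and so on.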

Other problems were ignored to simulate what might happen for a user. Differences in naming between possibly similar fields in the collection (there are 6 tables that were not normalised) were not reconciled, and the XML was not structured in any way. Certain fields had no entries in any record; these were ignored in the import in this instance.

Because there is no common schema across all three collections (Volterra has its own), it is not possible to undertake a ‘combined search’; however, a ‘full text’ search is still possible. It is possible to perform a ‘combined’ search across the annotations for all the collections, as these share the schema of the gCube annotation service. We will investigate the possibility of creating a common schema and indices for all three collections, although the usefulness might be limited.

Digital Classicists Seminar

July 16th, 2010

The Digital Classicist seminar on Friday 23rd will be about gMan as an on-demand VRE environment for the humanities, which is especially fitting as we are working with classical collections. An abstract for the seminar and location details are on the DC site.

gMan gets its own VO!

June 30th, 2010

Now that’s a PR-style headline, but it is rather important news. There is now an Arts-Humanities (A-H) Virtual Organisation (VO) in the D4Science infrastructure, which contains the gMan VRE.

Why is this important? Firstly, for gMan we can try some data importing without affecting other people’s data if things go a bit wrong: the VO limits the scope of actions and the visibility of data. Secondly, we can now create new VREs on demand within the A-H VO. Most importantly, it represents a commitment from the D4Science consortium to support Arts & Humanities VREs.

Importing HGV

April 15th, 2010

The documents of HGV, the Heidelberg Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens, are now available in the gMan VRE. This required importing, describing, and indexing the documents in the Ecosystem VO which hosts the VRE.

To import and describe the documents, we followed the same approach used for Inscriptions of Aphrodisias (InsAph), as both collections represent their documents in EpiDoc/TEI. We defined an import script for the Archive Import Service (AIS) and used the AIS Portlet to submit the script to the AIS instance deployed in the Ecosystem VO. This required Data Manager rights in the Ecosystem VO.

The script creates a content collection and a metadata collection for the corpus, and it populates both with Information Objects (IOs) that have EpiDoc/TEI documents as payloads. It then draws binary relationships between content IOs and metadata IOs, as well as between the content and metadata collections themselves (which are IOs in their own right in gCube). The script iterates over the documents of the corpus and assumes that they can be enumerated and accessed over the network. As the existing front-end to the corpus lacks a programmatic API to enumerate and access the corpus, the documents were first copied locally and then republished at a known address in the D4Science domain, along with a pre-computed enumeration of their names. The script can then access the enumeration and use it to fetch and process each document in turn.
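
The overall shape of that import can be sketched in Python. To be clear, this is not the AIS script itself, nor the scripting language the AIS uses: the `Store` class, the relationship labels, the membership relationships and the sample documents are all assumptions made to show the flow, and `fetch` stands in for whatever retrieves a document by name over the network.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class InformationObject:
    oid: int
    kind: str          # "collection", "content" or "metadata"
    payload: str = ""


@dataclass
class Store:
    """Toy in-memory stand-in for the storage layer (hypothetical)."""
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (source oid, label, target oid)
    _ids: count = field(default_factory=lambda: count(1))

    def create(self, kind, payload=""):
        io = InformationObject(next(self._ids), kind, payload)
        self.objects.append(io)
        return io

    def relate(self, source, label, target):
        self.relations.append((source.oid, label, target.oid))


def import_corpus(names, fetch, store):
    """Import every named document; `fetch(name)` returns its EpiDoc/TEI text."""
    content_coll = store.create("collection", "content")
    metadata_coll = store.create("collection", "metadata")
    store.relate(metadata_coll, "describes", content_coll)
    for name in names:
        document = fetch(name)
        content_io = store.create("content", document)
        metadata_io = store.create("metadata", document)
        store.relate(content_coll, "has-member", content_io)
        store.relate(metadata_coll, "has-member", metadata_io)
        store.relate(metadata_io, "describes", content_io)
    return content_coll, metadata_coll


# Invented documents standing in for the republished corpus and its enumeration.
docs = {"doc1.xml": "<TEI>first</TEI>", "doc2.xml": "<TEI>second</TEI>"}
names = sorted(docs)

store = Store()
import_corpus(names, docs.__getitem__, store)
```

In a real run `fetch` would read each document from the known address, but passing it in as a parameter keeps the iteration logic separate from the network access.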

With the documents imported and described in gCube, we created forward and inverted indices on the full text of selected metadata fields; the forward index is relied upon to browse the collection and the inverted index is used to answer free-form queries. We used gCube’s IRBootstrapper portlet to configure and launch the process that creates the indices. This also required Data Manager rights in the Ecosystem VO. The IRBootstrapper abstracts away a number of interactions among gCube services, including the FullText Index service, the Forward Index service, and the Data Transformation service (gDTS). The first two services are responsible for the creation, storage, and consumption of the corresponding indices, while the third transforms the metadata into the ingestion formats required by the previous two services. We simply had to feed the IRBootstrapper with the configuration required by the gDTS: a stylesheet that defines the transformation required to ingest metadata into the FullText Index service, and a list of paths that identify the metadata fields to ingest into the Forward Index service. The process is now stored and can be executed at any time in the future to rebuild or update the indices.
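
For readers unfamiliar with the two kinds of index, the general idea can be illustrated in Python. This is a generic sketch and says nothing about how the gCube index services are implemented: the function names, the tokenisation and the sample records are assumptions for the example.

```python
import re
from collections import defaultdict


def build_forward_index(records, field):
    """Sorted (value, id) pairs for one metadata field, suitable for ordered browsing."""
    return sorted((rec[field], rec_id) for rec_id, rec in records.items() if field in rec)


def build_inverted_index(records, fields):
    """Map each lower-cased word in the chosen fields to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for name in fields:
            for token in re.findall(r"\w+", rec.get(name, "").lower()):
                index[token].add(rec_id)
    return index


def lookup(index, query):
    """Ids of records containing every word of a free-form query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


# Invented records, for illustration only.
records = {
    "doc1": {"title": "Receipt for grain", "place": "Oxyrhynchos"},
    "doc2": {"title": "Letter about grain delivery", "place": "Arsinoites"},
    "doc3": {"title": "Tax receipt", "place": "Oxyrhynchos"},
}

by_place = build_forward_index(records, "place")
words = build_inverted_index(records, ["title"])
hits = lookup(words, "grain receipt")
```

The forward index lists the records in order of place name, which is what browsing needs, while the inverted index answers the free-form query "grain receipt" by intersecting the record sets for each word.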

Finally, we modified the configuration of the D4Science portal to include the fields that are available to users for browsing and searching HGV, as well as to specify the fields in the presentation of query results.