IPM Text Encoding Part 1: Calendars
Posted by: pgooch 7 years, 4 months ago
The process of digitizing the inquisitions post mortem records involves encoding 29 volumes of text into a computer-interpretable format that provides both structural and semantic markup. We are using Extensible Markup Language (XML), and specifically the XML vocabulary defined by the Text Encoding Initiative (TEI) P5 guidelines. Encoding the data in this way allows us to identify - and make searchable - specific items of interest within each inquisition, such as:
- the subject of the inquisition, his/her date of death and details of any heirs,
- the date and location of the inquisition, the writ clerk, escheator, and jurors,
- details of the holdings of which the subject was tenant-in-chief (see 'A very short introduction to the inquisitions post mortem').
Each of the 29 volumes contains around 750 calendar entries, and each volume mentions around 5000 significant persons (excluding jurors). In addition to structuring the main calendar into XML, we need to identify specific people, their roles and relationships; places, including specific types of place such as counties, parishes and manors; and the relationships between people and places.
Clearly then, this is an immense task! Although we can extract the digital text of each calendar (from the typeset PDF, by scanning and OCR of the printed volumes, or by having the text rekeyed from scratch by a specialist company), structuring the text into XML and annotating people, places and relationships would normally need to be done manually, which, on this scale, would be prohibitively expensive and time-consuming.
However, we have the advantage that the structure of each inquisition is fairly consistent and predictable. To leverage this, we are making use of the following techniques to automate some of the XML markup:
Regular expressions: these allow patterns to be written that match a given sequence of characters or types of characters. For example, the regular expression
would identify a line beginning with one or more digits, followed by one or more spaces, followed by the word 'Writ', and then any other characters up to a line break or carriage return. This would identify lines such as
352 Writ 22 Nov. 1418.
359 Writ, devenerunt, 18 Nov. 1419.
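As a rough illustration, a pattern matching that description can be tried out in Python. The pattern below is an assumed reconstruction from the prose description, not necessarily the expression used in the project, and the sample text is just the two lines above plus one non-matching line.

```python
import re

# Assumed reconstruction of the pattern described in the text:
# start of line, one or more digits, one or more spaces, the word "Writ",
# then anything up to the end of the line.
WRIT_LINE = re.compile(r"^\d+ +Writ\b.*$", re.MULTILINE)

sample = (
    "352 Writ 22 Nov. 1418.\n"
    "359 Writ, devenerunt, 18 Nov. 1419.\n"
    "He died on 27 December 1406."
)

# findall returns the full text of every matching line.
matches = WRIT_LINE.findall(sample)
for line in matches:
    print(line)
```

Only the two numbered 'Writ' lines are returned; the third line does not start with digits and is skipped.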
Text identified with regular expressions can be grouped and restructured by adding additional content around the found text via a series of search and replace operations (which can be automated via a script). We are using regular expressions to add the first layer of structural markup to the plain text.
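A minimal sketch of such a search and replace step, in Python, might look like the following. The `writ` element name and its `n` attribute are hypothetical placeholders chosen for illustration; they are not the project's actual markup or a claim about TEI P5.

```python
import re
from xml.sax.saxutils import escape

# Two capture groups: the entry number and the rest of the writ line.
WRIT_LINE = re.compile(r"^(\d+) +(Writ\b.*)$", re.MULTILINE)


def wrap_writ(match):
    """Wrap one matched line in a (hypothetical) <writ> element."""
    number, body = match.group(1), match.group(2)
    # escape() protects any &, < or > that may occur in the source text.
    return f'<writ n="{number}">{escape(body)}</writ>'


def mark_up_writs(text):
    """Add the first layer of structure; non-matching lines are left alone."""
    return WRIT_LINE.sub(wrap_writ, text)


plain = (
    "352 Writ 22 Nov. 1418.\n"
    "359 Writ, devenerunt, 18 Nov. 1419.\n"
    "He died on 27 December 1406."
)
marked = mark_up_writs(plain)
print(marked)
```

Grouping the number and the body separately is what allows the found text to be restructured rather than merely located.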
Natural language processing: using the GATE framework, we have developed text processing pipelines to identify people, places, and their relationships. Rather than matching strings against huge lookup lists of place names or person names, these pipelines apply a sequence of lexico-syntactic patterns. This processing step adds semantic markup to the text.
Let's look at a concrete example:
1 Writ mandamus. ‡ 3 May 1423. [Thoralby] Regarding lands held of Henry V.
CITY OF YORK. Inquisition. York. 19 May 1423. [Esyngwald]
John Hewyk; Robert Middelton; Thomas Doncastre; Thomas Neuton; Gilbert Walker; Richard Chaundeller; William Perceay; Richard Neuland; Thomas Bemeslay; Robert Ketill; Walter Luket; and William Coupeland.
He held 13 messuages and 6 selions in Bootham in the suburbs in his demesne as of fee of Henry IV in burgage tenure as the whole of the city is held, annual value 8 marks.
He died on 27 December 1406. Roger Strange is his son and heir, aged 38 years and more.
Since his death the messuages and selions have been successively occupied, and the issues taken, by William de Bowes and John Petyt until 3 February 1410, John Waterton until his own death, and then Richard Waterton who still occupies, manner or title unknown.
C 139/1/1 mm.1–2
Combining the first two processes above gives us the following output as displayed in the GATE user interface.
Here we can see the different types of structures and entities identified, as well as specific metadata about Hugh Strange, such as his gender, his date of death, and his son and heir Roger Strange.
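To give a feel for how a lexico-syntactic pattern can recover this kind of metadata, here is a small Python sketch. It is only a stand-in using plain regular expressions, not the GATE pipeline itself, and the two patterns and the field names in the result are assumptions made for illustration.

```python
import re

# Pattern for the death statement, e.g. "He died on 27 December 1406".
DEATH = re.compile(
    r"\b(?P<pronoun>He|She) died on "
    r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4})"
)

# Pattern for the heir statement, e.g. "Roger Strange is his son and heir, aged 38 years".
# The name part only handles capitalised words, so a name such as
# "William de Bowes" would need a richer pattern.
HEIR = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is (?:his|her) "
    r"(?P<relation>son|daughter|brother|sister) and heir, "
    r"aged (?P<age>\d+) years"
)


def extract_metadata(text):
    """Return whatever the two patterns can find in the text."""
    found = {}
    death = DEATH.search(text)
    if death:
        found["gender"] = "male" if death.group("pronoun") == "He" else "female"
        found["death"] = f"{death.group('day')} {death.group('month')} {death.group('year')}"
    heir = HEIR.search(text)
    if heir:
        found["heir"] = heir.group("name")
        found["relation"] = heir.group("relation")
        found["heir_age"] = int(heir.group("age"))
    return found


sentence = (
    "He died on 27 December 1406. "
    "Roger Strange is his son and heir, aged 38 years and more."
)
metadata = extract_metadata(sentence)
print(metadata)
```

The point of the sketch is that the surrounding wording ("died on", "is his son and heir, aged") identifies the role of each name and date, so no lookup list of known people is needed.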
Finally, the XSLT process transforms the output from GATE into TEI P5 XML:
This can then be corrected and edited by our research team. As you might notice, some of the metadata identified in GATE is not used in the inline XML - this is because we store this metadata in a topic map database known as EATS - Entity Authoring Tool Set, which is based on the Python Django framework. Records within EATS are then linked back to the main calendar XML.
In Part 2 of this blog post, I will give an overview of the process we are using to automatically generate this EATS database of person-person, person-role and person-place relations from the back-of-volume indexes.