British Universities Film & Video Council

moving image and sound, knowledge and access

Digitisation Timeline

What are the stages involved in transforming boxes of BECTU membership forms into a searchable database of digitised content? This interactive timeline describes the journey from initial scoping to user testing.

  • What do we want to do?

    This stage defined the initial scope of the digitisation and was led by what the research team wanted to do with the collection. As well as publishing the forms, the team wanted to extract key data to inform their research. The membership forms were mainly handwritten, so optical character recognition (OCR) could not be applied to make them searchable. It was therefore decided to transcribe six fields (a threshold determined by cost) so that key data within the forms could be interrogated. The six fields initially proposed were: applicant name, gender, nationality, employer name, position held and membership number. After some discussion these were changed to: gender, employer name, department, grade, membership number and date of application. The position held field actually contained two separate fields that were valuable for analysis: department and grade. The date of application was vital to give the data a chronology. We wanted to create a digital copy of each membership form together with extracted data from the six specified fields. The data would then be searchable in a database and linked to a copy of the form.

  • Agreements

    In order to publish the membership forms, Learning on Screen and BECTU needed to agree how this would be done, and this was captured in an agreement. It defined the access and, broadly, the terms of use: the forms would only be available to UK higher and further education institutions, and the terms of use would be clearly visible.

  • The Collection

    The collection had been roughly scoped, with an estimated total of around 68,000 forms. Initially we needed to do two things: compile an inventory and prepare the forms for digitisation. By compiling a detailed inventory we could a) check the original scope, b) extract forms for the pilot and c) provide a listing against which the material could be checked on return from the digitisation company. The digitisation company would only scan the forms, so all related documentation, e.g. correspondence, had to be removed.

  • What’s The Process?

    So how would we digitise 68,000 membership forms? Initially by asking lots of questions. What specification, from file formats to file names? What process should be employed? What checking processes, from transcription to forms? We worked out the answers to these questions with the digitisation company and then conducted a pilot to test them. For example, we needed a clear idea of how we would keep track of individual pages and related data during the process. The membership number was the key individual identifier here, and we could track each form if the number was used in the filenames for the digitised pages and in the CSV files. Accurate transcription of the membership number from the scanned pages was vital, as was accurately substituting it into the JPEG filenames.
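
    The tracking described above can be sketched as a cross-check between scan filenames and transcribed data. This is only an illustration: the filename pattern (`<membership number>_<page>.jpg`) and the CSV column name `membership_number` are assumptions, not the specification agreed with the digitisation company.

```python
import csv
import io
import re

# Hypothetical filename pattern: five-digit membership number, underscore, page number.
FILENAME_RE = re.compile(r"^([0-9]{5})_([0-9]+)\.jpg$")


def numbers_from_filenames(filenames):
    """Return the set of membership numbers found in well-formed JPEG filenames."""
    found = set()
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match:
            found.add(match.group(1))
    return found


def numbers_from_csv(csv_text, column="membership_number"):
    """Return the set of membership numbers listed in a transcription CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    numbers = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            numbers.add(value)
    return numbers


def cross_check(filenames, csv_text):
    """Report numbers that appear on only one side (scans vs. transcribed data)."""
    scanned = numbers_from_filenames(filenames)
    transcribed = numbers_from_csv(csv_text)
    return {
        "scan_without_data": sorted(scanned - transcribed),
        "data_without_scan": sorted(transcribed - scanned),
    }
```

    Any number reported on only one side would point to a transcription or renaming error to investigate before the batch is accepted.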

  • The Pilot

    We sampled around 100 forms at regular intervals across the collection in order to test the process, from the order to the accuracy of the transcription. This raised a number of issues, from confusion caused by a change of application format in relation to Department and Grade to finding a standard way to distinguish between fields that were blank and those that had illegible content. The feedback and solutions were integrated into the process.

  • Digitisation

    Probationary Forms

    Two major issues emerged during the process. The first was the existence of probationary forms, which were different from full membership application forms. This led to duplicated membership numbers, or to forms with no number but the same name as another form. This was outside the original scope, which specified one form per member, so we had to decide a) whether to include these and b) if so, how to do it. Although identifying and linking these forms was time-consuming, we felt the value they added to the collection was worth it. We had to adjust the specification for the membership number from a five-figure number to a five-figure number with ‘a’ or ‘b’ appended.
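
    A minimal sketch of how the adjusted membership number format might be validated and used to link forms. It assumes the suffix is optional, so that a member with a single form keeps the plain five-figure number; the text does not say whether that was the case, and the function names are hypothetical.

```python
import re

# Assumption: five digits, optionally followed by 'a' or 'b'.
MEMBERSHIP_RE = re.compile(r"([0-9]{5})([ab])?")


def parse_membership_number(value):
    """Return (base_number, suffix) or None if the value does not fit the format."""
    match = MEMBERSHIP_RE.fullmatch(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


def group_linked_forms(values):
    """Group valid identifiers by base number so linked forms sit together."""
    groups = {}
    for value in values:
        parsed = parse_membership_number(value)
        if parsed is None:
            continue  # malformed identifiers are left for manual checking
        groups.setdefault(parsed[0], []).append(value.strip())
    return groups
```

    For example, `group_linked_forms(["12345a", "12345b", "20001"])` would place the two suffixed forms under the same base number.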

    Scanning

    The other issue was the automatic edge detection used during the scanning process, which sliced around one centimetre off the bottom of the form. The membership number was generally handwritten on the form and was at times very close to its bottom edge. This meant there were batches of around 100 forms that were recorded as having no number when in fact the number was illegible because only half of it was visible. Although we considered many solutions, in the end these forms were individually identified and manually re-scanned.

  • Database

    The initial development of the database was carried out using the data and scans from the pilot. Here we wanted to test the design of the search form, the presentation of the search results and the publication of the membership forms. Some of this was fundamental: making sure that all the data within a record was the right data and that the right form linked with the right record. But it also highlighted unexpected issues, such as those linked to Position Held. This consisted of two fields, Department and Grade, but what we had not realised was that over time these had changed name and reversed position to Grade, then Department. They had been transcribed as Position Held Field 1 and Position Held Field 2, so we had to analyse the changes in the forms over time and develop a formula to identify the relevant data.
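
    One way such a formula could work is to pick the field order from the date of application. The sketch below is illustrative only: the changeover date is a placeholder, since the text does not say when the form layout changed, and a single changeover is itself an assumption.

```python
from datetime import date

# Hypothetical placeholder: the real changeover date is not given in the text.
LAYOUT_CHANGE = date(1950, 1, 1)


def split_position_held(field_1, field_2, application_date):
    """Map the two transcribed Position Held fields to department and grade.

    Assumes forms dated before LAYOUT_CHANGE list Department then Grade,
    and forms dated on or after it list Grade then Department.
    """
    if application_date is None:
        # Without a date the layout cannot be inferred; flag for manual review.
        return {"department": None, "grade": None, "needs_review": True}
    if application_date < LAYOUT_CHANGE:
        department, grade = field_1, field_2
    else:
        grade, department = field_1, field_2
    return {"department": department, "grade": grade, "needs_review": False}
```

    Records without a date of application are flagged rather than guessed, because the layout cannot be inferred for them.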

  • Visualisation Tool

    We wanted to create a dynamic visualisation tool that could be applied to search results within the database. The first task was to identify appropriate sets of data within the membership forms that would provide a meaningful comparison. This came down to time (the date of application) and gender (the gender of the applicant), so the visualisation would show the number of membership forms submitted each year by male and female applicants. In terms of design, this was initially presented as a line graph but, after feedback, the totals for each gender were added, along with an interactive mouse-over feature on the individual lines to show the annual totals.
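
    The data behind such a graph can be sketched as a simple count of forms per year and gender. The record keys `year` and `gender` are assumed names for illustration, not the database's actual field names.

```python
from collections import Counter


def forms_per_year_by_gender(records):
    """Return (years, series, totals) ready for plotting one line per gender."""
    counts = Counter()
    for record in records:
        year = record.get("year")
        gender = record.get("gender")
        if year is None or not gender:
            continue  # records missing either value cannot be placed on the graph
        counts[(year, gender)] += 1

    years = sorted({year for year, _ in counts})
    genders = sorted({gender for _, gender in counts})
    # One list of annual totals per gender, aligned with `years`.
    series = {g: [counts[(y, g)] for y in years] for g in genders}
    # Overall total per gender, as shown alongside the graph.
    totals = {g: sum(series[g]) for g in genders}
    return years, series, totals
```

    The `series` values are the annual totals a mouse-over would display, and `totals` gives the overall figure for each gender.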

  • Digital Viewer

    One of the major challenges in publishing the membership forms was how to redact personal information. This information is generally handwritten, so OCR would not help us to identify it, but OCR could be used to find the printed name of the field, e.g. Address. A mask was then applied to a pre-set area starting from the beginning of this word. This was effective in the vast majority of cases but failed when either the OCR was unsuccessful or the format of the form had changed marginally in terms of vertical spacing. In these cases the redaction had to be applied manually.
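
    The masking step can be sketched as computing a rectangle from the position of the field label. The mask dimensions below are invented placeholders, and the bounding box is assumed to come from whatever OCR tool located the label; this is not the viewer's actual implementation.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical pre-set mask size in pixels; the real values are not given in the text.
MASK_WIDTH = 900
MASK_HEIGHT = 120


def mask_for_label(label_box: Optional[Box], page_width: int, page_height: int) -> Optional[Box]:
    """Return the area to redact, starting at the beginning of the field label.

    Returns None when OCR did not find the label, meaning the form
    needs manual redaction.
    """
    if label_box is None:
        return None
    left = label_box.left
    top = label_box.top
    # Clamp the pre-set area so the mask never extends beyond the page.
    right = min(left + MASK_WIDTH, page_width)
    bottom = min(top + MASK_HEIGHT, page_height)
    return Box(left, top, right, bottom)
```

    Because the area is fixed relative to the label, a form with slightly different vertical spacing can leave part of the handwriting outside the mask, which matches the failure case described above.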

  • User Testing

    The research team compiled a group for user testing of the database and we kept the parameters fairly broad. The guidelines asked the group to note anything that was confusing, to suggest ideas to make the database easier to use and to comment on what worked well. They were also asked to give feedback on specific elements such as the search form, the search results and the individual record. The feedback was then collated and we decided with the research team what could and should be implemented. Examples of changes made included those to the visualisation tool; changes to Sort parameters, e.g. under Date, listing records with no date at the end of the results instead of the beginning; and developing a Go To Page option to allow for easier navigation over large volumes of search results.
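
    The two navigation changes can be illustrated with a short sketch. It assumes each record is a dictionary with a `date` key holding an ISO-format string or `None`, and a page size of 20; both are assumptions for illustration.

```python
import math


def sort_by_date_undated_last(records):
    """Sort by date ascending, placing records with no date at the end."""
    # In the tuple key, False sorts before True, so dated records come first.
    return sorted(records, key=lambda r: (r.get("date") is None, r.get("date") or ""))


def go_to_page(results, page, page_size=20):
    """Return (items on the page, page actually shown, total page count)."""
    page_count = max(1, math.ceil(len(results) / page_size))
    # Clamp out-of-range requests to the first or last page.
    page = min(max(page, 1), page_count)
    start = (page - 1) * page_size
    return results[start:start + page_size], page, page_count
```

    Clamping the requested page means that a number beyond the last page shows the final page rather than an empty result.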

 

Page citation: BECTU Membership Database, Linda Kaye, Women’s Work in British Film and Television, http://bufvc.ac.uk/womenswork/bectu-membership-database (Access date)