Reference networks for targeted findability

Enference represents more than twenty years of experience with the design and realization of digital knowledge systems. In particular with the type of systems we refer to as semantic reference networks.

A semantic reference network is a network of references that is used to link knowledge resources ‑ like text documents, database records, web pages, etcetera ‑ to structures of keywords, thus making information searchable in a highly directed way.

Home A

To realize this, the knowledge worker organizes the available digital knowledge, applies meaningful connections, and makes content items searchable. For this he or she is equipped with tools that don't require any technical expertise, only domain knowledge is needed.

Mapping knowledge

  • Keywords that are typical for describing the knowledge domain are selected and organized into structured lists ‑ so-called thesauri.
  • Tools are available to support building and maintaining these thesauri, using world-wide standards and formats.
  • Smart functions help to link all data in the emerging "findability system" to the thesauri.

Making connections

  • To realize findability data are extracted from the knowledge resources that are appropriate to collect and represent specific information, so-called metadata. These metadata connect ‑ as we will see ‑ the resources to the thesauri.
  • Tools are available to support the selection of the appropriate metadata and to link them to keywords in existing thesauri. Conversely thesauri can be expanded based on the extracted metadata.
  • Knowledge resources can include text documents, database records, images, web pages, etcetera.

Home B

Putting knowledge to work

light on

  • Web developers use the thesaurus functionalities and the possibilities for linking these with knowledge resources to equip end users with different advanced aids for retrieving information from these sources.
  • This can be used to build simple applications, like a library of manuals and product descriptions that have many linked parts and can be easily kept up to date.
  • It is also possible to build complex systems, for instance in the area of workflow management, where knowledge and procedures are closely linked and the knowledge system has to cooperate with other applications.

Example: Cultural Heritage

The Cultural Heritage Agency of the Netherlands, a service of the Ministry of Education, Culture and Science, wants to improve access to knowledge a much as possible, with the aim of "unburdening". This means: supporting citizens, entrepreneurs, knowledge institutions and governmental organizations with their initiatives, helping with the execution of their tasks, and facilitating the necessary processes.

The motto is "linking", linking of knowledge domains like archaeology, built and movable heritage, cultural landscapes and spatial planning. This is an infrastructural task.

A lot of specialized knowledge is available in all sorts of resources inside and outside the heritage domain. In heritage institutions, in planning departments of municipalities, in the land register, in the Royal Library, the National Archive and the Institute for Sound and Vision. Improving access to that knowledge makes it faster, easier and cheaper for citizens, companies, research institutions, municipalities and financiers to make decisions and developing new initiatives.

This requires an efficient solution: a digital infrastructure that links data from diverse and dispersed sources. The Agency is currently building such a digital infrastructure, with many partners sharing their data and in which information can easily be traced and selected by directed search, resulting in its fast and cheap availability for a broad group of potential users.

How does a reference network look like?

To be able to connect the information in the resources via thesauri, an infrastructure is used that exists of three layers.

  1. The data layer contains the data directly related to the original content, information that is stored in diverse resources, for instance case oriented data about legal procedures or production processes, data about objects like paintings, archaeological finds or listed buildings, biodiversity data, etcetera.
  2. The reference layer contains the structured lists of keywords that are linked to the source data. Also this layer can contain links between items in the source data themselves. So this layer holds all the elements necessary for directed findability.
  3. And finally the application layer contains all the software that makes use of the source data and the connections ‑ for instance web sites, digital desks, apps to edit data, etcetera. This is the layer that will be primary addressed by end users.

Together these three layers form the knowledge system. The reference network is composed of the reference layer plus the linked metadata records. It is the central part of the knowledge system. Represented schematically it looks like this:

How can you link?

For applications that use the reference network there are basically two ways to retrieve specific information from character based source data: searching for text via keywords, or searching for sets of numbers, like dates or coordinates of locations. The first method is called "semantic search", the second could be called "mathematical search". The two methods also can be combined.

Example: Semantic and mathematical search

During a research about agriculture and livestock in the Middle Ages the question arises where and when chicken farms have emerged.

To find the answer, the researchers make use of databases with archaeological finds. They filter the data by the period "Middle Ages", by the material "animal bone" and by the species "Gallus gallus domesticus" or "chicken, hen, fowl". These terms, respectively, are taken from a thesaurus on periods, a thesaurus on archaeological materials, and a archaeozoological thesaurus.

The set of source data that remains after filtering can be divided into five periods: the Early Middle Ages, that exist of the periods A, B, C and D, and the Late Middle Ages. This division can be done based on the numeric values indicating the start and end of a period, for instance 900-1050. It also can be done by taking the name of a period, for instance "Middle Ages early D", a semantic selection.

Next, for each period the finds are plotted on a map based on the coordinates of their location ‑ a mathematical activity. Each find is rendered as a colored dot, yellow, orange, red or purple, the color representing the amount of bones per find, respectively 1-10, 11-100, 101-1000, and more than 1000 bones.

By observing per period the concentrations of the purple dots (and, possibly, of the red ones), the researchers get an idea where and when chicken farms arose.

In a reference network data are linked semantically, that is, by words. If in a specific domain certain keywords are used again and again, they can be collected for reuse in a list, like an index in a book. And just like with such an index, direct links can be created between the keywords in the list and related words in the source data. This way searching becomes actually more like selecting relevant links, which is much faster then having to start from scratch for each search.

Source data and their metadata

The "related words in the source data" that are mention above exist usually in metadata records. Content items in the data layer are often provided with a separate block of data with information about the item itself: who is the author, what is the creation date, what are keywords that represent the content, etcetera. This kind of "data about data" is called metadata. A record with metadata can be compared with an old-fashioned card from the catalogue of a library.

Example: Metadata

An article from a magazine (digital or on paper) can have the following metadata:

label jansen_c_1986_2
title Eating habits in the Middle Ages
author Jansen, C.
publication date 1986
published in Cultural Historical Perspective
keywords archaeology, archaeozoology, Bovidae, Gallus gallus domesticus, Middle Ages
link http://www.rce.org/articles/archaeology/jansen_c_1986

metadata

The metadata are gathered in a record, which is linked to both the thesauri (by using their keywords) and to the related content item (file) in the data layer. In the example above, for instance, this link to the source is realized through the hyperlink at the bottom.

Sometimes the metadata are originally already present, but when items without metadata are imported into the data layer, the metadata have to be created explicitly afterwards. If the new items consist mainly of text, constructing the metadata records often can be automated with the help of natural language processing techniques.

The lists with common keywords on the one hand, the metadata records on the other hand, and the connections between these two form the core of the reference layer.

Smart keyword lists

If providing metadata (which data do you use for the "catalogue card", for the record?) is based on pre-established lists of keywords, those lists are called controlled vocabularies. With them, filling the metadata records takes places in a highly structured way, they provide for univocality. Thus findability of items in the data layer can be realized in a rather precise and consistent way.

A next step is to add synonyms, conjugations and language versions to the terms in the controlled vocabularies. Starting with the search term "car", you will then also find content related to "cars", "auto", "automobile", "motor vehicle" and "voiture". A term which is provided with these kind of variants is called a concept.

Example: A concept with language variants and synonyms

The concept "boarded sail" from a thesaurus of windmill parts:

label boarded sail
description A mill's sail where a wooden board is used to catch the wind, instead of the common canvas.
English name boarded sail
English synonym wood-framed sail
Dutch name bordwiek
Dutch synonym wiek met borden, wiek met bordentuig
German name Holzgatterflügel
German synonym Türenflügel
French name ailes à quarterons
French synonym voilure de bois
source Dictionary of Molinology

Furthermore, in many cases the description that goes with a concept can be pulled apart in separate derivable properties, which as such can play a distinct role in searches.

Example: A concept with mathematical data

The concept "Early Middle Ages D" from a thesaurus of archeological periods:

label Early Middle Ages D
start date 900
end date 1050
description The period from the 5th century (after the disintegration of the Roman Empire) until the mid-10th century. Sub period D covers the last 150 years of this period.

In the last example the start and end date are derived from the description. This type of numeric properties can be used for mathematical search, like filtering all content items that have a date between 500 and 1000 AD.

Thesauri

When finally the concepts are arranged hierarchically ‑ by placing for instance the terms "Early Middle Ages" and "Late Middle Ages" under the term "Middle Ages ‑ the keyword list becomes a thesaurus.

The hierarchy makes it possible to search in a very structured way and thus retrieve much more relevant information than in a simple search. If a certain search term has underlying terms in the hierarchy, then also content linked to those terms will be included in the search result. With the term "Middle Ages" you will also find content that is linked to the terms "Early Middle Ages" and "Late Middle Ages", and possible other terms that exist in that part of the hierarchy.

thesauri

Example: Thesauri

The thesaurus of archeological periods is structured according to a time-based division into periods.

[...]
Middle Ages (450 -1500)
    Early Middle Ages (450 ‑ 1050)
        Early Middle Ages A (450 ‑ 525)
        Early Middle Ages B (525 ‑ 725)
        Early Middle Ages C (725 ‑ 900)
        Early Middle Ages D (900 ‑ 1050)
    Late Middle Ages (1050 ‑ 1500)
        Late Middle Ages A (1050 ‑ 1250)
        Late Middle Ages B (1250 ‑ 1500)
[...]

The archeozoological thesaurus is structured according to a tight taxonomic classification: a species is placed under a genus, which on its turn is placed under a family.

[...]
Bovidae (a large family of ruminants distinguished by their hollow unbranched horns)
    Bison (genus)
        Bison priscus (species: bison)
    Bos (genus)
        Bos primigenius (species: aurochs)
        Bos taurus (species: ox)
    Capra (genus)
    Capra hircus (species: goat)
    Ovis (genus)
        Ovis ammon (species: moufflon)
        Ovis aries (species: sheep)
[...]

References with a name

A reference network is inhabited by items, it is an item-based findability system. Those items are either a concept ‑ basically a term representing a class of similar objects or phenomena, like "book" or "biology" ‑ or a record representing a real life instance of an object or phenomenon ‑ like the writer or publisher of a specific book.

A record representing a real life instance exists of a set of properties that is selected to realize a desired form of findability. Links to concepts are mostly used for global classification of the instance, while links to other records or to fixed values like numbers or strings of characters give more specific information.

Example: A record representing the book "On the origin of species"

hasLabel On the origin of species
hasItemType book
hasAuthor Darwin, Charles
hasPublisher Murray, John
hasPublicationDate 1859

The example above is a record with some data about a real life object that is typified as a "book" by the reference to the corresponding concept "book". Also reference is made to a record about a person "Darwin, Charles" and to another record about a person "Murray, John". These three items are shown in the examples below.

Example: The concept "book"

hasLabel book
hasItemType concept
hasDescription A long written work that can be read from a stack of sheets of paper or on an electronic device.

Example: A record representing the person "Darwin, Charles"

hasLabel Darwin, Charles
hasItemType person
hasDate 1809-02-12 / 1882-04-19
isAuthorOf On the origin of species

Example: A record representing the person "Murray, John"

hasLabel Murray, John
hasItemType person
hasDate 1808 / 1892
isPublisherOf On the origin of species

As you can see in the example of the record representing the book "On the origin of species", the links to the persons "Darwin, Charles" and "Murray, John" are typified as a "hasAuthor link" and a "hasPublisher link". From this you ‑ and software ‑ can easily infer that Charles Darwin is the author of the book and John Murray the publisher. Likewise you can infer that the fixed numeric value "1859" indicates the year of publication.

named references

Usually different types of reference (links with a different name) point to keywords in different thesauri. For instance, the reference type "hasMaterial" will be linked to keywords in a thesaurus of materials, while the reference type "hasPeriod" will point to keywords in a thesaurus of periods. Also reference can be made to a record rather than to a concept (keyword) in a thesaurus: the reference type "hasAuthor" points to a record of the person Charles Darwin, a "real life object".

These named references like "hasAuthor", "hasPublisher", "hasPublicationDate", and so forth, can make your search very focused. If you want to collect all records in a certain knowledge system about books written by Charles Darwin, you simply select the item type "book" and the reference named "hasAuthor" with the value "Darwin, Charles" (or "Charles Darwin"). Provided that the knowledge system is well maintained, these three choices will give you exactly what you asked for, nothing less and nothing more.

Knowledge workers at the steering wheel

The model has been applied in different projects over the last couple of years (see below). It can be described as a "simple model for authoring of item-based reference networks".

  • Authoring: Starting point is the availability of an authoring tool ‑ comparable to a word processor, for example ‑ that can be easily operated by non-technical knowledge workers.
  • Content: There is a set of simple rules that guide knowledge workers while creating a well-functioning knowledge system.
  • Item-based: Creating the system is done by defining which types of items play a role and which properties they need for that role. Applying interim changes should be easy.
  • Reference networks: In doing the knowledge workers build a reference network, adhering to a pre-defined set of simple rules. With the same rules as a blueprint, developers can realize various advanced forms of findability for the end users. Interim changes in content or reference are no threat to the proper functioning of their applications.

A few sample projects

  • Dutch Species Register
    The Species Register is an hierarchical structure which houses all Dutch species (plants, animals and other living beings), the so called taxonomic thesaurus.
    → read more
  • Facet determination
    In the project Facet Determination tools and workflows were developed that enable "computer illiterate" experts (like many biologists) to build end-user-friendly determination systems in a very easy way.
    → read more
  • Sterna project
    The goal of the Sterna-project was to develop best practices for dynamic and multilingual knowledge systems that use dispersed sources ‑ of the thirteen participating European organizations ‑ with a multitude of different data models.
    → read more
  • DocumentChecker
    In the project DocumentChecker tools were developed for metadating documents in a largely automated process with the use of natural language processing.
    → read more
  • RCE Semantic Network
    The Dutch Cultural Heritage Agency (RCE) is implementing an infrastructure that anables participants (from listed building owners to regulation officials) to share data, where findability of information is easy and focused, and where availability of relevant knowledge is fast and cheap for a large group of potential users.
    → read more
  • Framework for Educational Concepts
    The Framework for Educational Concepts houses al formal and informal information which is being developed in the context of design, planning, realization and evaluation of (formal) education in The Netherlands.
    → read more
  • Research project Modelling and Findability
    The goal of the project Modelling and Findability is to investigate the formulation of an generic theoretical framework that is useful for non-technical management and content experts to design knowledge systems as a reference network.
    → read more

Acknowledgements

Part of this text is written in cooperation with Gemmeke van Kempen (GemRedactie), describing the basics of the projects "RCE Semantic Network" and "HeritageThesaurus" of the Netherlands Cultural Heritage Agency. Most examples are drawn from these projects.


top