Clone

HTTPS: git clone https://vervis.peers.community/repos/yEzqv

SSH: git clone USERNAME@vervis.peers.community:yEzqv

Branches

master

integrity.mdwn

Purpose

Examine the various data integrity models possible for a semantic database and choose the ones the expression model supports.

For example, given a statement xRy, the database may reject it if R is not a property, or it can just accept it, or it can accept it and automatically add the R-is-a-property statement if it doesn’t exist yet. The same goes for property domain and range: either deduce from them, or enforce them. Other issues probably exist.
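The three behaviors above can be sketched roughly as follows. This is only an illustration, not the expression model's actual API: the store is assumed to be a plain set of (subject, predicate, object) tuples, and the names `Policy`, `add_statement`, `isA` and `Property` are made up for the example.

```python
from enum import Enum


class Policy(Enum):
    REJECT = "reject"  # refuse xRy unless R is already declared a property
    ACCEPT = "accept"  # store xRy as-is, without any check
    DEDUCE = "deduce"  # store xRy and add the missing "R is a property" statement


def add_statement(store, statement, policy):
    """Add a (subject, predicate, object) triple to `store` under `policy`.

    Returns True if the statement was stored, False if it was rejected.
    """
    _, predicate, _ = statement
    # Hypothetical encoding of "R is a property" as an ordinary triple.
    declaration = (predicate, "isA", "Property")
    if declaration not in store:
        if policy is Policy.REJECT:
            return False
        if policy is Policy.DEDUCE:
            store.add(declaration)
    store.add(statement)
    return True
```

With `Policy.DEDUCE`, the first xRy statement also records that R is a property, so a later statement using R passes even under `Policy.REJECT`.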

Content

One of the core components of many computer systems, especially servers in client-server designs, is access control. Access control is the guard watching the door: It says who can make changes, and which changes they can make. Some changes are not allowed to be made by anyone, and these changes are managed by another component. It’s called validation, or integrity. It is a component which says which changes to the data are valid and which are not, regardless of who makes them.

However, in an open collaborative free environment, these aspects of databases become a bit less important. In addition, due to the universal nature of semantic models, any introduction of data brings in new information, and it becomes harder, sometimes impossible, to test its validity against a trusted source. There is no authority deciding which information and which access are valid, and which are not.

There are several kinds of integrity constraints, and they can be demonstrated by the same examples given in the Constraints page on the model level:

Every person has two eyes. Therefore, if we have Eye objects and Person objects, we’d like to define exactly two eyes for each person. But if someone adds three hasEye triples for the same person, we easily get a person described to have three eyes.
In a given organization there may be between 2 and 5 advisors. But it’s possible to add fewer than 2 or more than 5 hasAdvisor triples.
It’s possible for two Persons to have the same Eye.
It’s possible for an Eye to belong to person A, while person B is defined to have that Eye.
It’s possible to have an Eye which doesn’t belong to any person.
It’s possible for two people to be parents of each other.
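To make the cases above concrete, here is a small sketch that scans a collection of triples and reports some of these situations. It assumes triples are plain (subject, predicate, object) tuples; `hasEye` is taken from the examples, while `hasParent` and the function name are invented for the illustration. Note that it only reports what it finds and does not reject anything.

```python
from collections import defaultdict


def find_violations(triples):
    """Return human-readable descriptions of suspicious patterns in `triples`."""
    eyes_of = defaultdict(set)    # person -> eyes attached via hasEye
    owners_of = defaultdict(set)  # eye -> persons claiming it
    parent_pairs = set()          # (child, parent) pairs

    for subject, predicate, obj in triples:
        if predicate == "hasEye":
            eyes_of[subject].add(obj)
            owners_of[obj].add(subject)
        elif predicate == "hasParent":
            parent_pairs.add((subject, obj))

    problems = []
    # Only persons with at least one hasEye triple are seen here.
    for person, eyes in sorted(eyes_of.items()):
        if len(eyes) != 2:
            problems.append(f"{person} has {len(eyes)} eyes")
    for eye, owners in sorted(owners_of.items()):
        if len(owners) > 1:
            problems.append(f"{eye} is shared by {', '.join(sorted(owners))}")
    for child, parent in sorted(parent_pairs):
        # The child < parent test reports each mutual pair only once.
        if child < parent and (parent, child) in parent_pairs:
            problems.append(f"{child} and {parent} are parents of each other")
    return problems
```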

These constraints are usually enforced in databases, because it is how relational databases are usually used. For example, a People table with two Eye columns would enforce both eyes to be “not null”, i.e. always have values. But the expression model designed here tries to change the situation, and introduce a new model.

Can you imagine a person with three eyes? I can. I guess you can too. So is it possible that such a person exists? Maybe. Can it be proved that no person with three eyes will ever be born? No. It can happen, even if the probability is very close to 0. Therefore, “every person has two eyes” is misleading. What it actually means is “every person so far has two eyes”. In other words, it is not a physical law, but a mere observation.

Relational databases are very limited in the expressiveness and flexibility they allow: adding new columns to a table is not always a trivial operation, and even then the number of columns has to be determined in advance. When the number is unlimited, e.g. a garden can have many flowers, we don’t create columns “flower 1”, “flower 2”, etc. Instead we just create a new table which contains garden-flower pairs. But doing the same for eyes, i.e. moving the “eye” columns to a whole new table, is a serious change to the database schema. All the program code using this data would have to be changed.

For this reason, relational databases make assumptions, e.g. assume the only genders are “male” and “female”, assume all the people have a first name and a last name, assume every song is created by a person and so on. Therefore these databases pretend the assumptions are facts, and cannot easily adapt to changes of the assumptions, for example:

a person who is neither male nor female
a person who has a single name, no “first” or “last” name
a song created by a computer, i.e. artificial intelligence

A semantic database can easily adapt to such changes. Of course not all possible changes can be anticipated, and semantic databases have cases which require non-trivial changes, but they are much better at accepting forms of data that were neither planned nor foreseen. Therefore they don’t need to make the assumptions that relational databases make. The database administrators are freed from the work of maintaining integrity and assumption correctness.

As long as only pure truth is inserted into a database, it is safe. For example, if a person with three eyes is born, adding a new Person resource with three related Eye resources is okay. An Eye can never belong to two different people, unless something like that happens in reality. The problems begin when false data is inserted. More precisely, when contradicting information is inserted.

Let’s see an example. Assume you download a song from the internet, and it doesn’t contain any metadata: author, time of creation, etc. So you add it manually. You find out when the song was written - 1966 - and add the statement:

[the song] [wasWrittenInYear] 1966

Now assume some time passes, and one day you install a new music player. This music player can fetch song information from the web automatically. According to that information, the song was written in 1965! Assume the player doesn’t detect your statement, and now both statements exist:

song was written in 1966
song was written in 1965

Using the constraint rules mentioned in the Constraints page, specifically property cardinality, the problem can be solved: every time a statement whose predicate is wasWrittenInYear is added to the database, the query “give me the existing wasWrittenInYear statements for this song” must be executed, and the new statement is added only if the query result is the empty set.
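A minimal sketch of this check-before-insert rule, assuming the database is simply a Python set of (subject, predicate, object) tuples. The function names are made up for the example, and a real store would use an index rather than a linear scan.

```python
def existing_statements(store, subject, predicate):
    """The query: all stored statements with this subject and predicate."""
    return {t for t in store if t[0] == subject and t[1] == predicate}


def add_single_valued(store, statement):
    """Add `statement` only if no statement with the same subject and
    predicate exists yet. Returns True if it was added."""
    subject, predicate, _ = statement
    if existing_statements(store, subject, predicate):
        return False  # the query result is not empty: refuse the new value
    store.add(statement)
    return True
```

With this rule in place, the manually entered 1966 statement stays, and the player's later attempt to add 1965 is refused. Note that every write now involves a read.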

However, assume such an operation takes too much time for a common computer. Taking all database traffic and a whole collection of constraints into consideration, the performance may change drastically: every write operation which requires constraint-validating queries now also becomes a read operation, and these validation queries add overhead to every write. What if it’s too much for a common home computer to handle?

Several options exist:

Allow programs to run the verifications if and when they wish, or to specify in each query whether they wish the verification to run
Have the database run constraint tests periodically async, i.e. hold a list of changes in a queue and validate them in the background
Allow the user to run “database cleanup” whenever she wishes, or set it to run every hour/day/week
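The second and third options can be sketched as a store that accepts every write immediately and checks the queued changes later. This is only an illustration under assumptions: the class and method names are invented, only single-valued properties are checked, and `cleanup` is called explicitly here, although a timer or background thread could call it just as well.

```python
from collections import deque


class DeferredStore:
    """Accepts all writes at once; validation happens later in cleanup()."""

    def __init__(self, single_valued):
        self.triples = set()
        self.pending = deque()  # changes not validated yet
        self.single_valued = set(single_valued)

    def add(self, statement):
        # No validating query here, so a write stays a pure write.
        self.triples.add(statement)
        self.pending.append(statement)

    def cleanup(self):
        """Validate the queued changes and return the conflicts found.

        Each conflict is (subject, predicate, sorted list of values).
        Conflicts are only reported; resolving them is left to the caller.
        """
        conflicts = []
        seen = set()
        while self.pending:
            subject, predicate, _ = self.pending.popleft()
            key = (subject, predicate)
            if predicate not in self.single_valued or key in seen:
                continue
            seen.add(key)
            values = sorted(
                o for s, p, o in self.triples if s == subject and p == predicate
            )
            if len(values) > 1:
                conflicts.append((subject, predicate, values))
        return conflicts
```

In the song example, both year statements would be stored, and the next cleanup would report the song as having two values for wasWrittenInYear.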

In practice, semantic desktop databases are expected to manage with a daily or weekly cleanup without trouble: the chance that a well-written, integrated application creates invalid statements is low; even then, other apps may detect and fix them; and if not, it is not a critical problem if an invalid statement exists for several hours until it is fixed. For large databases serving many users, where reliability and integrity are important, it may be better to use one of the first two options.

Before we proceed to constraint modeling, here are points we need to refer to:

Stating that two classes are disjoint
Stating that two classes are equivalent, possibly by a pair of isSubclassOf statements
Stating that two objects are not related through a property, e.g. “Anne is not John’s wife”
Stating that two properties complement each other, e.g. aRb is true iff aSb is false
Stating that two objects refer to the same thing
Stating in addition to “the size is 5” things like “the size is not 5 / larger than 5 / smaller than 5 / odd number”
Using cardinality as a statement, e.g. “John has 5 children” is information regardless of whether it is enforced
Inverse properties
Symmetric property, i.e. it’s the inverse of itself
Asymmetric property
Disjoint properties, e.g. aRb and aSb can’t be true at the same time
Reflexive, Irreflexive, Functional (i.e. cardinality 0 or 1), InverseFunctional, Transitive
Property chains, e.g. hasGrandchild is hasChild of hasChild
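A few of the property characteristics listed above can be illustrated with small checks over stored triples. This is a sketch under assumptions: triples are plain (subject, predicate, object) tuples, the function names are invented, and the checks only look at the statements actually stored, so they say nothing about statements that are merely missing.

```python
def pairs(triples, prop):
    """All (subject, object) pairs related through `prop`."""
    return {(s, o) for s, p, o in triples if p == prop}


def is_symmetric(triples, prop):
    rel = pairs(triples, prop)
    return all((b, a) in rel for a, b in rel)


def is_asymmetric(triples, prop):
    # aRb excludes bRa; this also excludes aRa.
    rel = pairs(triples, prop)
    return all((b, a) not in rel for a, b in rel)


def is_functional(triples, prop):
    # Cardinality 0 or 1: no subject appears with two different objects.
    subjects = [s for s, _ in pairs(triples, prop)]
    return len(subjects) == len(set(subjects))


def chain(triples, first, second):
    """Pairs (a, c) such that a `first` b and b `second` c for some b."""
    left, right = pairs(triples, first), pairs(triples, second)
    return {(a, c) for a, b in left for b2, c in right if b == b2}
```

For example, `chain(triples, "hasChild", "hasChild")` gives the pairs that a hasGrandchild property would relate.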

Constraints can be divided into two groups:

Semantic
Non-Semantic

Semantic constraints are related to the meaning of the model concepts, and non-semantic constraints are arbitrary limitations. Example:

Semantic: Each person has exactly two parents
Semantic: A person cannot be their own parent
Semantic: A person is not a machine
Non-semantic: John’s house must have at least five rooms
Non-semantic: Every American commercial software company in the database must have between 2 and 5 managers
Non-semantic: All John’s friends are good people.

Now let’s see how constraints are expressed.

TODO move things to the constraints page under model level
