Mirror of the Rel4tion website/wiki source, view at <http://rel4tion.org>

[[ 🗃 ^yEzqv rel4tion-wiki ]] :: [📥 Inbox] [📤 Outbox] [🐤 Followers] [🤝 Collaborators] [🛠 Commits]

Clone

HTTPS: git clone https://vervis.peers.community/repos/yEzqv

SSH: git clone USERNAME@vervis.peers.community:yEzqv

Branches

master

design2.mdwn

In [[Design]] I did some basic modeling, but I feel it ended up being more like brainstorming, and I’m losing the top-down view of the big picture. I’d like to start again, and add new things.

I’m going to define query model using abstract syntax. Although probably unnecessary at this point, I will use EBNF because it’s standard and I want to know how to use it. So far I’ve been using other random notations which are probably very close to EBNF, but EBNF parser software would fail to parse them. Also, translation into the concrete syntax of a query language will become easier.

My source for EBNF syntax rules is Wikipedia: https://http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form.

I’m going to design the query model in an iterative process, in which I will refine the definitions with each iteration until they feel stable and complete enough.

Abstract Math

In the [[design]] page, I defined a query as a 6-section unit:

Namespaces
Parameters
Variables
Known
Condition
Tests

I will begin with some mathematical modeling, so the redundant part of this model will be temporarily ignored.

First, there are at least two types of symbols in a query:

Symbols which refer to specific Entities
Placeholders for which the query engine will need to find matching Entities

Let’s define sets:

T - all possible entities in the world
P - all possible placeholders in the world

A single statement is composed 4 components, the ISPO model. Therefore:

S - all possible statements

We can actually define S easily:

S = T^4

Now, what about statement patterns used in queries? More definitions:

Element: a component of a statement pattern
E - the set of all elements
E = (T union P)

Now we can define (or at least set an upper limit to) the scope of simple statement patterns:

Q - all possible statement patterns
Q = E^4 = (T union P)^4

However, this information is not enough in order to resolve a query: We need to know which placeholders should get their own columns. Let’s assume for now that all placeholders should. Then we already have enough information to define a query resolution function!

First, let’s define the return type.

(m_1 … m_n) - a table title row, n>0
(t_1 … t_n) - a table data row, n>0
M_n - all possible title rows of length n
R_n - all possible rows of length n
M - the set of all valid tables
M = {(M_n, 2^(R_n)) : n is a positive integer}

Now, the query solver:

f: 2^Q -> M
f takes a set of statement patterns, and returns the solution table

This model is compact, but quite inconvenient because all the metadata of the patterns is encoded inside them. Each element e is either a placeholder p or an entity symbol t. Therefore, at least until I have a full initial model in place, I’ll use a more human friendly representation instead.

Nicer Math

A minimal query is a query containing just a single statement pattern. I’ll use them soon. First here’s a scheme for AND queries:

q = (P, V, K, T)

Where:

q is the query
P is a finite non-empty set, containing the parameters
V is a finite set, disjoint with P, containing the variables
K is a finite set, disjoint with P and V, containing the known entities
T is a finite non-empty set of statement patterns, i.e. quads whose elements are from P, V and K

Requirements:

Each p in P must appear in some t in T
Each v in V must appear in some t in T
Each k in K must appear in some t in T

Good. Now back to minimal queries.

q = (P, V, K, t)

Where:

q is the query
P is a finite non-empty set, containing the parameters
V is a finite set, disjoint with P, containing the variables
K is a finite set, disjoint with P and V, containing the known entities
t is a statement pattern, i.e. a quad whose elements are from P, V and K

Requirements:

Each p in P must appear in t
Each v in V must appear in t
Each k in K must appear in t

Let’s define what f should do for a given q. Assume E is the set of all possible entities. Therefore:

Let’s denote the set containment relation with (=
K (= E

f takes two things as parameters:

q, the query
D (= E^4, the set of all quads in the database

It returns two things:

m, a permutation of P
R, a finite set of rows, i.e. tuples from E^|P| whose elements exist in D

How does it do that? What does it return exactly? Let’s define.

For a given q = (P, V, K, t):

P = {p_1 … p_n}
V = {v_1 … v_m}
K = {k_1 … k_l}
t = (c_i, c_s, c_p, c_o)

f returns (p_1 … p_n) and the set of all rows:

r = (e_1 … e_n)

such that there exist - wait a second. I need to define statement pattern substitution. I’ll call that function g. Here it is:

g (t, A) = s

g takes a statement pattern containing variables/parameters W={w_i_1 … w_i_u} and a mapping A of them to elements from D, i.e. of the form “w => e”. A must have exactly one match for each w in W. g returns the statement s, which is t after all variables/parameters in it have been replaced with their mappings. The mapping A can be given in the form B=>X, where B is a tuple of items from P and V, and X is tuple of entities from D.

Starting f again.

f takes two things as parameters:

q, the query
D (= E^4, the set of all quads in the database

It returns two things:

m, a permutation of P
R, a finite set of rows, i.e. tuples from E^|P| whose elements exist in D

How does it do that? What does it return exactly? Let’s define.

For a given q = (P, V, K, t):

P = {p_1 … p_n}
V = {v_1 … v_m}
K = {k_1 … k_l}
t = (c_i, c_s, c_p, c_o)

f returns (p_1 … p_n) and the set of all rows:

r = (e_1 … e_n)

such that there exists a tuple G = (d_1 … d_n) of elements from D for which there exists a tuple H = (e_1 … e_m) such that the statement g(t, P=>G union V=>H) is true in D.

Let’s start adding features.

Logical Connectives

Logical connectives (OR, AND, NOT) are modeled by transforming the statement pattern t into a tree where each leaf node is a statement pattern, and each non-leaf node is a connective. OR and AND have two children, and NOT has a single child.

I’m going to define recursively how they work, by defining f(q, D) in terms of components of the tree. For leaf nodes, the definition is the same one we gave for minimal (single-statement) queries.

Actually, I’d like to think again about the connectives and redefine them. The ones I’m sure about are:

OR(a, b) matches if a matches or b matches
AND(a, b) matches if a matches and b matches

Now here are interesting ideas for NOT:

NOT(a) matches if D contradicts a
UNKNOWN(a) matches if a doesn’t match

UNKNOWN is relatively easy to implement, while NOT requires inference and some extra intelligence in the algorithms. But I will define both.

UNKNOWN

q = (P, V, K, T)
z = (P, V, K, UNKNOWN(T))
|P| = n

Goal: Define f(z, D) using f(q, D).

Solution:

Y_i(D) is the set of all i-tuples of entities in D.
Binary operation is set difference
f(z, D) = Y_n(D) f(q, D)

OR

q = (P, V, K, T)
w = (P, V, K, U)
z = (P, V, K, OR(T, U))
|P| = n

Goal: Define f(z, D) using f(q, D) and f(w, D).

Solution:

Binary operation + is set union
f(z, D) = f(q, D) + f(w, D)

AND

q = (P, V, K, T)
w = (P, V, K, U)
z = (P, V, K, AND(T, U))
|P| = n

Goal: Define f(z, D) using f(q, D) and f(w, D).

Solution:

Binary operation * is set intersection
f(z, D) = f(q, D) * f(w, D)

NOT

q = (P, V, K, T)
z = (P, V, K, NOT(T))
|P| = n

Goal: Define f(z, D) using f(q, D).

So far “known” meant something is explicitly stated. Now I’d like to introduce negative entailment. If something is true, it may be concluded that something else is false. For example, if lives-in-country is a FunctionalProperty and we know John lives in Denmark, then any statement of the form "John lives in ___" where the country is not Denmark is known to be false.

Let’s also define the Constradiction Set of a database. For a set of statements D, a set of statements C(D) is the contradiction set of D if for each statement in C(D), negative entailment on D concludes the statement is false.

Solution:

C(D) is the contradiction set of D
f(z, D) = f(q, C(D))

Symbols

I’d like to assign symbols to the connectives:

OR: or, |
AND: and, &
UNKNOWN: ukn, ~
NOT: not, !

Query Splitting

In the definitions above, I assumed P, V and K are identical in the arguments for OR and AND. But it’s not always true! For example: “Give me all the people who know someone (A) named John or are friends of someone (B) named Alice and someone (C) named Bob”. The first part has a single variable A while the second part has two, B and C. Is such a query valid?

Technically, if you start with “if there exists X such that”, then the condition must refer to X. But we can also define a simple rule to cover this case: If there is no reference to X, then the answer is always the boolean truth value of the condition. Example:

f(x) = “There exists y such that x + x = x”

If x + x = x, then for each value of y is it still true. Not only some y exists for which it’s true. It’s actually true for every y we take. If x + x != x, then it doesn’t matter which y we take - it’s still false. So no y can make it true, no such y exists.

Conclusion: We just evaluate the expression inside.

Final conclusion: Both parts of the OR/AND have to use all members of P, but they don’t have to use all members of K and V. However, together they do have to use P, V and K just like any other statement pattern.

Translation

Due to the usage of Smaoin itself to model the labeling and i18n systems, using namespaces and labels directly in place of uids is nothing more than syntactic sugar. But before I let myself use them, I want to define exactly how this sugar works, and how to convert it into the statement patterns we’ve already seen.

First of all, a query can have the same header components an Idan files has. For simplicity, assume I use just the language tag and namespace prefixes (e.g. no default namespace tag). Now the query looks like this:

q = (P, V, K, T, l, N)

Where:

l is a natural language
N is a set of namespace-prefix pairs

Goal: Convert q into a uid-only query q’ = (P, V’, K’, T’).

Here is an algorithm, assuming no chaining is done (e.g. x:y:z) for simplicity:

Step 1: Check if K contains an entity of the form ‘prefix:label’, where ‘prefix’ is a namespace prefix string. If it does, define K’ to be K with ‘prefix’ replaced by the uid of the namespace, matched using N. Then repeat this step. If K doesn’t contain any such entity, proceed to step 2.

Step 2: Check if K contains an entity of the form ‘uid:label’, where ‘uid’ is a namespace uid. If it does, create new uniquely named variables ?x, ?i, ?j and assign V to be (V union {?x, ?i, ?j}). Assign K’ to be K with ‘uid:label’ replaced with ‘?x’, and replace all occurences in T as well, producing T*. Now add a new statement, producing T’:

T’ = AND{T*, (?i, ?x, has-label, ‘label’), (?j, ?x, belongs-to-namespace, ‘uid’)}

Finally, add has-label and belongs-to-namespace (which are uids) to K. Then, repeat step 2 with the resulting q’.

If K doesn’t have any entity of the form ‘uid:label’, the q’ we now have is the value we return. Done.

Relation Operators

Just like in [[design]], these are relations (predicates) for which membership is computed rather than tested against a listing of their content. Statement patterns which use them can be used just like any other statement pattern, but the way they are matched is different.

First, while an atomic statement pattern is a quad:

t = (i, s, p, o),

a relation operator pattern is a triple:

a = (x, u, y)

and is matched (i.e. evaluated to ‘true’) if the operator u applied to x and y - i.e. u(x, y) - returns ‘true’.

Since u always returns a value, UNKNOWN doesn’t make much sense. But it is possible to define that u returns “unknown” if it cannot compare the operands, e.g. if a number comperator is passed a pair of strings. Otherwise, applying ukn to a relation operator pattern always evaluates to ‘false’.

Values

The K component of a statement pattern doesn’t make sense for Values, because there’s an infinite number of them, and not all of them are used in each query. So in addition to P, V and K, each statement pattern component may also be a member of Z, the set of Values in Smaoin.

Value Operators

In a similar manner to [[design]], but maybe defined slightly differently, these are functions which take one or more Entities and return an Entity. They do not use the database content. These functions can be used in place of statement pattern components. In other words, instead of a component c of t, we can now have u(x1…x2), where each xi is a parameter, a variable, a known entity or a value, and u is a value operator.

Examples:

String concatenation
Number arithmetics
Encoding conversion
Boolean algebra on Boolean values
Number to string conversion and formatting
String to number parsing
Min/max/average of sets of numbers

Transformations

These work in a similar way to value operators, but they’re used in the P section rather than in T. From now on, each member of P can not only be a mere symbol, but may instead be a function application m = u(x1…xn) m is a title for the column, u is a value operator and each xi is a parameter (title), a variable, a known entity or a value.

Optional Fields

So far, all our placeholders were required to have a match. Parameter matches are inserted into the result table. variables matches are not. But what if we want some parameter to be optional? If a match for it is found - great, insert it. If not, insert a blank cell instead of skipping the tuple.

Let’s take a query:

q = (P, V, K, T)

Until now, each p_i was required. Now, let’s mark one of them, say p_1, as optional. It means that all the statements in which p_1 appears do not have to match. If they all do - we have a value for p_1. If we don’t, still take the values of the other parameters, but make the substitution for p_1 blank.

This definition also means all the patterns containing p_1 are themselves optional patterns, but unlike Tests (next section), their truth value is not directly inserted into the table.

From now on, optional parameters have their own query component, A.

q = (P, A, V, K, T)

Tests

Tests are like optional fields - but instead they are optional patterns. Have a pattern which doesn’t have to match. For each potential row tuple, check if the optional pattern matches. Add a column to the table, which contains “true” or “false” accordingly.

In other words, instead of a single pattern tree, we now have several. After we match the first one, we check the truth value of the others. They may be existential tests, i.e. check whether some query would return at least one row. But that probably can work using variables, we’ll see in a moment.

Let’s assign letters. In addition to T, we can now have tests:

H = {C_1 … C_w}

So now q looks like this, if we use namespaces and languages:

q = (P, A, V, K, T, H, l, N)

Existential tests can be created by having a variable that’s used only in H, not in T. The same is true for regular patterns in T: To have an existential part like EXISTS in SPARQL, you just make a new variable and write statement patters which use it.

Alternative Transformation Model

Instead of modeling transformations in P, we can reuse value operators and have a more compact model. Here’s how.

Let’s define a special predefined relation operator, the “identity” predicate. It compares a pair of entities and determines whether they are identical. Identity means exact match, e.g. very close floating-point numbers are not identical. Let’s denote this operator with ‘=’.

Now we can use a value operator in a relation operator pattern. For example, instead of having a parameter in P defined by a function:

m = u(x1…xn)

We can have just “m” as a member of P, and have the following pattern in T:

m = u(x1…xn)

Where ‘=’ is the identity operator! It’s simple and compact. Maybe the evaluation algorithm looks much less obvious or clear this way (it’s less clear we first match all x1…xn and then compute m, without using it in any database indexing etc.), but that’s the business of the implementation only, and will be planned later.

What’s Next

I’m translating [[sparql-examples]] to Naya: [[examples]]. I’m also discussing new issues in [[design3]].

[[TODO|TODO/OPEN]] Follow the [[sparql-examples]] and bring features here if there’s anything I missed here. Oh, I think I missed aggregation functions (count, average, etc.). Also there’s an uncommitted file idan-naya-dev in rdd-wiki which talks about property extension and other things.

[See repo JSON]

Clone

Branches

Tags

design2.mdwn

Abstract Math

Nicer Math

Logical Connectives

UNKNOWN

OR

AND

NOT

Symbols

Query Splitting

Translation

Relation Operators

Values

Value Operators

Transformations

Optional Fields

Tests

Alternative Transformation Model

What’s Next