Clone

HTTPS: git clone https://vervis.peers.community/repos/yEzqv

SSH: git clone USERNAME@vervis.peers.community:yEzqv

Branches

master

i18n.mdwn

Purpose

Plan and design the internationalization system, allowing all names to be translated and viewed in the user’s preferred language. Classes, objects, properties and even the presentation model (syntax of the computer language) can and should be internationalized and localized.

Content

Programming languages are meant to be used by technical people and to be shared by people speaking different languages, and as a result code is usually written only in English. But the design of the expression model puts the users and their needs in the center. One of the important aspects of user-friendliness is usage of the user’s local language, which means the expression model should at least try to implement localization.

And indeed, the expression model has this capability.

Here are two possible models for internationalization (i18n) of the expression model:

Base/translation model: Each concept and each resource has a label in a base language used as a default for worldwide communication (e.g. English) and the label may have translations to other languages.
Decentralized model: Each concept and each resource may have labels in many different languages, with no single language serving as a base language. It may be possible for the user to define the system’s default language, but to the database the notion of a special base language doesn’t exist.
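
The difference between the two models shows up in how a label is looked up. The following is a minimal Python sketch rather than the project's design: the dictionaries, the `Cat` resource and the function names are hypothetical and serve only as illustration.

```python
# Base/translation model: one base label plus optional translations.
# All names and data here are hypothetical.
BASE_MODEL = {
    "Cat": {"base": "cat", "translations": {"French": "chat"}},
}

# Decentralized model: every language is stored the same way.
DECENTRALIZED_MODEL = {
    "Cat": {"English": "cat", "French": "chat"},
}


def label_base_model(resource, language):
    """Return the translation if present, otherwise fall back to the base label."""
    entry = BASE_MODEL[resource]
    return entry["translations"].get(language, entry["base"])


def label_decentralized(resource, language):
    """Return the label in the given language, or None (no built-in fallback)."""
    return DECENTRALIZED_MODEL[resource].get(language)
```

In the first model a missing translation silently falls back to the base label. In the second, any fallback would have to come from a user or system preference rather than from the data itself.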

In a similar manner to the implementation of namespace-label access to resources, there are two approaches here: Store the multilingual names directly in the database, or use optimized tables. Here we will examine the first approach.

So far, the label of a resource has been given to it through the hasLabel property. But it is no longer enough, because a label must now have a language. Therefore, a label is no longer a mere string, but a concept which can relate to other concepts. Thus a new class Label is introduced.

Before the modeling approach is explained further, it is very important to note that a design in which an isInLanguage property maps a hasLabel statement to a Language resource is not a good design. The language of the label is a direct trait of the label alone, and has nothing to do with the resource itself, therefore it is wrong to declare the whole statement as being in a specific language. The label itself has a language; the statement and the resource do not.

A Label has three important ways to relate to other entities:

Have a textual representation through a property such as hasText, which matches the label to a text value
Have a language associated with it through a property such as isInLanguage, which matches the label to a Language resource
Have a resource using it as a label, e.g. through a property hasLabel which matches a resource to a label it uses

Assuming the first, direct approach, in which a resource needs to be identified using the Labels stored in the database, what is the identification process? We start with two things:

Label text
Language

For simplicity, assume we are not using a namespace and we already have the uid of the Language resource. Then from all Labels in the given language, we need to find the one whose text is exactly the one given to us by the user. The query may look like this:

Parameters: *t* (label text), *l* (label language identifier)
Give me: ?r
Such that:
*	?r hasLabel *T*
*	*T* isInLanguage *l*
*	*T* hasText *t*
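
As a rough illustration, the query above can be run against a small in-memory set of statements. This is only a sketch under assumptions: the tuple-based store, the `Cat` resource, the label identifiers and the function name `find_resources` are all hypothetical.

```python
# Hypothetical statements (subject, property, object); not taken from a real database.
TRIPLES = {
    ("cat_en", "isA", "Label"),
    ("cat_en", "isInLanguage", "English"),
    ("cat_en", "hasText", "cat"),
    ("cat_fr", "isA", "Label"),
    ("cat_fr", "isInLanguage", "French"),
    ("cat_fr", "hasText", "chat"),
    ("Cat", "hasLabel", "cat_en"),
    ("Cat", "hasLabel", "cat_fr"),
}


def find_resources(text, language, triples=TRIPLES):
    """Return every r such that: r hasLabel T, T isInLanguage language, T hasText text."""
    in_language = {s for (s, p, o) in triples if p == "isInLanguage" and o == language}
    with_text = {s for (s, p, o) in triples if p == "hasText" and o == text}
    labels = in_language & with_text  # candidate values for T
    return {s for (s, p, o) in triples if p == "hasLabel" and o in labels}
```

For example, `find_resources("chat", "French")` yields the `Cat` resource, while the same text paired with English yields nothing.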

Conceptually we can break the process into three steps:

f(L) - take a language L and get the set of labels in that language
g(t, S) - take a text t and a set of labels S, and find the label in S whose text is t
h(b) - take a label b and find the uid of the resource which uses it

Then the query q(t, L) taking a label text and a language identifier can be expressed as follows:

q(t, L) = h(g(t,f(L)))
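
The composition can be sketched directly in Python. This is an illustration only: the `Label` record, the sample data and the uids are hypothetical, and each label is assumed to be used by exactly one resource, which is a simplification of the model described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    text: str
    language: str
    resource_uid: str  # simplification: one resource per label


# Hypothetical sample data.
LABELS = [
    Label("dog", "English", "uid-1"),
    Label("chien", "French", "uid-1"),
    Label("cat", "English", "uid-2"),
]


def f(language):
    """Take a language and return the set of labels in that language."""
    return {lb for lb in LABELS if lb.language == language}


def g(text, labels):
    """Take a text and a set of labels; return the label whose text matches, or None."""
    return next((lb for lb in labels if lb.text == text), None)


def h(label):
    """Take a label and return the uid of the resource which uses it."""
    return label.resource_uid if label is not None else None


def q(text, language):
    """q(t, L) = h(g(t, f(L)))"""
    return h(g(text, f(language)))
```

Filtering by language first and by text second mirrors the order of the three steps. A real store could apply the two filters in either order, or use an index over both.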

The i18n method described above assumes we translate resource names to many languages. But translating them is not enough: The keywords of the presentation model, the query language and so on should be translated too. The approach to localizing keywords is similar: We use the same query, only now we are not looking for some general resource. This time we are specifically looking for a member of Keyword, where Keyword is the class of all keywords in our specific computer language.
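
A keyword lookup is then the same query with one extra condition, namely that the resource is a member of Keyword. In the sketch below, the statements, the identifiers and the choice of "si" as a French rendering of "if" are hypothetical.

```python
# Hypothetical statements: a keyword and an ordinary resource share the English text "if".
TRIPLES = {
    ("kw_if", "isA", "Keyword"),
    ("kw_if", "hasLabel", "kw_if_en"),
    ("kw_if", "hasLabel", "kw_if_fr"),
    ("kw_if_en", "isInLanguage", "English"),
    ("kw_if_en", "hasText", "if"),
    ("kw_if_fr", "isInLanguage", "French"),
    ("kw_if_fr", "hasText", "si"),
    ("IfPoem", "hasLabel", "poem_en"),
    ("poem_en", "isInLanguage", "English"),
    ("poem_en", "hasText", "if"),
}


def holds(subject, prop, obj):
    """True if the statement is present in the store."""
    return (subject, prop, obj) in TRIPLES


def find_keyword(text, language):
    """Same label query as for resources, restricted to members of Keyword."""
    return {
        r
        for (r, p, label) in TRIPLES
        if p == "hasLabel"
        and holds(label, "isInLanguage", language)
        and holds(label, "hasText", text)
        and holds(r, "isA", "Keyword")
    }
```

Here `find_keyword("if", "English")` returns only `kw_if`: the `IfPoem` resource carries the same text but is filtered out by the class condition.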

This model of i18n may be simple and logical, but several problems arise which we must deal with:

What to do if two resources have the same name
What if two resources share translations

Let’s see why these are problems. Currently, as explained above, each resource is related separately to each of its names through a separate Label resource. Thus, while the same Label can be used by many resources, it is possible that two resources from different namespaces use the same name, and then they need to point to the translations separately. This is unnecessary duplication, since the translation has nothing to do with namespaces and modeling. It depends only on the meaning of the word as it appears in the dictionary.

Translation has two different aspects in the context of computer entity labeling:

Taking a concept and getting the word describing it in a given language
Taking a computer language concept and getting the name chosen for it in a given language

The first aspect would actually result in an electronic dictionary. The second aspect is a mixture: It can use words from the first aspect, or use new labels which give existing (or new) words a new meaning specific to the computer language they’re defined under.

Let’s start facing problems. Assume there is a Notion class, representing a unit of meaning. Each Notion can have many labels, preferably at least one Label for each language. Now assume two resources use the same name, but different meanings of it. For example:

Bat - the class of all bats (i.e. the animal)
Bat - the class of all clubs used in games (i.e. the club)

Should they use the same label? If we examine other languages, we will find that these two things are represented by different words. Therefore, a single label must not be used; otherwise there will be no way to separate translations of one meaning from translations of the other. Conclusion: Each such “label” is actually a Notion and has its own set of translations.
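
The Bat example can be sketched as two separate Notion objects that happen to share their English text. The `Notion` record, the uids, the helper names and the choice of Spanish as the second language are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Notion:
    uid: str
    labels: dict = field(default_factory=dict)  # language -> label text


# Two meanings, one English word, two different Spanish words.
bat_animal = Notion("notion:bat-animal", {"English": "bat", "Spanish": "murciélago"})
bat_club = Notion("notion:bat-club", {"English": "bat", "Spanish": "bate"})
NOTIONS = [bat_animal, bat_club]


def notions_named(text, language):
    """Return all notions whose label in the given language is exactly this text."""
    return [n for n in NOTIONS if n.labels.get(language) == text]


def translate(notion, language):
    """Return the notion's label in the given language, or None if it has none."""
    return notion.labels.get(language)
```

Looking up "bat" in English returns both notions, which is the ambiguity described above. Each notion still translates to its own Spanish word because the translations hang off the Notion and not off a shared string.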

Now let’s assume we have a pair of resources referring to the same meaning. In this case, in order to avoid duplication, they should use the same label. However, there’s also the case where a word is used in a computer language to denote a concept not precisely matching the word’s meaning, or a new “word” is invented. Programming language keywords, such as typedef in C/C++, are an example. In this case, it is a new Notion with its own set of translations. It shouldn’t use an existing label.

Example why it’s dangerous: Take the keyword class, which exists in many programming languages. If we decide to translate it to other languages, we may decide for a specific language to use a word that is not the translation of “class” to that language. In this case, the translation of “class” as an English word and the translation of class as a programming concept are different. It is possible to copy translations, but they should be kept as separate definitions in order to allow using arbitrary translations regardless of natural language rules, which do not apply to or restrict computer languages.

An important question we haven’t discussed is the following: What if we use text values instead of Labels? This is the current usage of Labels:

[class] [isA] [Label]
[class] [isInLanguage] [English]
[class] [hasText] "class"
[Class] [hasLabel] [class]

Now assume we use plain text values instead of Labels:

[English] [hasWord] "class"
[Class] [hasLabel] "class"

This definitely looks more compact. And for the examples we have seen so far, maybe it works. But there is a hidden design issue here, which reveals itself once we expand our model a bit. The design issue is that we use character strings to represent words. The truth is that the string is not the word. The string is just a textual representation, while a word can be communicated by sound, sign language, new artificial languages and so on. Strings are just values in the expression model, and don’t represent things we want to describe.

If you’re not convinced, let’s examine a simple expansion of our model, which easily exposes the problem. Assume we want to add a definition concept, i.e. express a relation between a word and its dictionary definition. Either we define a Definition class which has a word text and definition text, or we define a Word class. Either way, we cannot use a hasDefinition property which matches a word text to a definition text. That’s because the subject of a statement must be a resource. Values’ definitions are pre-existing, thus further descriptions of them are not allowed.
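
The restriction can be made concrete with a small sketch in which a statement refuses a plain value as its subject. The `Resource` and `Statement` types, the uid and the definition text are hypothetical and only illustrate the rule stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    uid: str


@dataclass(frozen=True)
class Statement:
    subject: Resource
    prop: str
    obj: object  # a Resource or a plain value such as a string

    def __post_init__(self):
        # Values cannot be described further, so they cannot be subjects.
        if not isinstance(self.subject, Resource):
            raise TypeError("the subject of a statement must be a resource")


# A Word resource can carry a definition...
word = Resource("word:class")
ok = Statement(word, "hasDefinition", "some definition text")


def value_subject_is_accepted():
    """...but the bare string "class" cannot be the subject of a statement."""
    try:
        Statement("class", "hasDefinition", "some definition text")
    except TypeError:
        return False
    return True
```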

As you can see, we quickly reached a conclusion that a value cannot represent an abstract concept, and thus a class for words/terms/definition is necessary.

Now it is time to deal with namespaces. Namespace translations can be handled in exactly the same way described above for resources and keywords, but there is one important difference in the nature of the names: While resources and keywords tend to be real words, namespaces are often 2-4 letter strings, which are usually abbreviations. Examples:

foaf (friend of a friend)
doap (description of a project)
dc (Dublin Core)
xs (XML Schema)
nao (NEPOMUK Annotation Ontology)

There are two options to deal with namespace labels:

Localize them just like keywords and resources
Use the original name in the base language

The second case requires that a base language is specified, but it can be done easily inside the database itself. Then the label is searched in the set of base language labels and matched to the corresponding namespace.
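
A minimal sketch of the second option follows, assuming the base language is itself recorded as a statement in the database. The `System` resource, the `hasBaseLanguage` property and the namespace identifiers are hypothetical names introduced for this example.

```python
# Hypothetical statements; the base language is stored alongside the namespace labels.
TRIPLES = {
    ("System", "hasBaseLanguage", "English"),
    ("ns_foaf", "hasLabel", "foaf_en"),
    ("foaf_en", "isInLanguage", "English"),
    ("foaf_en", "hasText", "foaf"),
    ("ns_dc", "hasLabel", "dc_en"),
    ("dc_en", "isInLanguage", "English"),
    ("dc_en", "hasText", "dc"),
}


def base_language():
    """Read the base language from the database itself."""
    return next(o for (s, p, o) in TRIPLES if s == "System" and p == "hasBaseLanguage")


def resolve_namespace(name):
    """Search only the base-language labels and return the matching namespace, or None."""
    language = base_language()
    in_base = {s for (s, p, o) in TRIPLES if p == "isInLanguage" and o == language}
    with_text = {s for (s, p, o) in TRIPLES if p == "hasText" and o == name}
    labels = in_base & with_text
    matches = {s for (s, p, o) in TRIPLES if p == "hasLabel" and o in labels}
    return next(iter(matches), None)
```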
