2.mdwn

[[!template id=ticket class=task]] [[!tag /projects/idan/decisions]]

[[!meta title="Choose language tag naming scheme"]]

Issue

Unicode has characters that represent language tags. They could be used in addition to the two-letter ASCII tags. Section 23.9 of Unicode 7 (in chapter 23) covers them. The two-letter and four-letter tags and all the related names are probably explained well on Wikipedia: [[!wikipedia "Language tag"]]. I noticed that the tags that are (or may be) used in Turtle differ a bit from the ones I see in locale names, so this is worth a look.

Process

Tag Characters

Unicode’s language tag characters ('\E0000' - '\E007F') are deprecated. The Unicode standard recommends putting language tags in higher-level protocols, which provide them in their own syntax. Idan does so too, so language tag characters won’t be used.
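If an implementation wanted to reject these characters in input text, the check is a simple range test on code points. This is a minimal sketch in Python; the function name `has_tag_characters` is made up for illustration and is not part of Idan.

```python
# Code point range of the Unicode tag characters block (U+E0000 to U+E007F).
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F

def has_tag_characters(text):
    """Return True if text contains any character from the tag block."""
    return any(TAG_FIRST <= ord(ch) <= TAG_LAST for ch in text)
```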

Turtle and BCP47

In Turtle and in SPARQL the rule for language tags is:

LANGTAG  ::=  '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*

This alone doesn’t say much. The Turtle specification refers to the RDF Concepts document, which refers to BCP47 as the definition of language tags. Let’s go there too. This is also what Wikipedia calls the IETF language tag, which probably means the best thing to do is to use these tags the way other formal languages do. While reading, I’ll try to summarize the rules and define rules for Idan that stay compatible with the BCP.
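To see how little the Turtle rule constrains, here is the LANGTAG production transcribed into a Python regular expression. It is only a sketch of the rule quoted above; the name `is_turtle_langtag` is made up for illustration.

```python
import re

# Direct transcription of: LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
TURTLE_LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")

def is_turtle_langtag(token):
    """Return True if the whole token matches the Turtle LANGTAG rule."""
    return TURTLE_LANGTAG.fullmatch(token) is not None
```

The rule accepts `@en-US`, but it also accepts something like `@whatever-123abc`, because it only checks the shape and not the subtag lengths or order. That part comes from BCP47.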

Language Tag Modeling

A language tag is a sequence of one or more subtags, separated by hyphens (“-”, \x2D in ASCII and in Unicode).

There are different types of subtag, each of which is distinguished by length, position in the tag, and content: each subtag’s type can be recognized solely by these features. This makes it possible to extract and assign some semantic information to the subtags, even if the specific subtag values are not recognized. Thus, a language tag processor need not have a list of valid tags or subtags (that is, a copy of some version of the IANA Language Subtag Registry) in order to perform common searching and matching operations.
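Here is a rough illustration of that idea in Python: it labels each subtag using only length, position and content, without consulting the registry. It is a simplified sketch and not a full BCP47 implementation. The length rules follow the BCP, but the function name `classify` and the labels are my own. It also treats every one-character subtag as the start of an extension, including the private-use `x`.

```python
import re

ALPHA = re.compile(r"[A-Za-z]+")
DIGIT = re.compile(r"[0-9]+")
ALNUM = re.compile(r"[A-Za-z0-9]+")

# Order in which subtag types may appear within a tag.
ORDER = ["language", "extlang", "script", "region", "variant", "extension"]

def classify(tag):
    """Return a list of (type, subtag) pairs for a hyphen-separated tag."""
    parts = tag.split("-")
    labels = [("language", parts[0])]
    stage = 0  # index into ORDER of the latest type seen so far
    for sub in parts[1:]:
        n = len(sub)
        if not ALNUM.fullmatch(sub) or n > 8:
            kind = "invalid"
        elif stage == 5 or n == 1:
            # A one-character subtag starts an extension; what follows belongs to it.
            kind = "extension"
        elif stage <= 1 and n == 3 and ALPHA.fullmatch(sub):
            kind = "extlang"
        elif stage <= 1 and n == 4 and ALPHA.fullmatch(sub):
            kind = "script"
        elif stage <= 2 and ((n == 2 and ALPHA.fullmatch(sub))
                             or (n == 3 and DIGIT.fullmatch(sub))):
            kind = "region"
        elif 5 <= n <= 8 or (n == 4 and sub[0].isdigit()):
            kind = "variant"
        else:
            kind = "invalid"
        if kind != "invalid":
            stage = ORDER.index(kind)
        labels.append((kind, sub))
    return labels
```

For example, `classify("zh-Hant-TW")` labels `Hant` as a script and `TW` as a region purely from their shape, even though the code knows nothing about Chinese or Taiwan.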

For consistency, I’m downloading the current version of the registry and storing it locally. It’s [[here|language-subtag-registry]].

Before we present the syntax of language tags, there’s an important point regarding use in Idan. Since I already designed the l10n and language system, I will probably need to update it to support the various kinds of subtags and do smart matching of language tags. Should I keep the rule that says that each language tag refers to an nli:Language defined under lang? With the variety of subtags, does it still make sense? Should I perhaps define subtags and related concepts in NLI instead? Also, there must be a smart way to determine the parent-child relations between languages, so that languages contain all the children they validly can, but no more than that (e.g. maybe general Chinese and some specific dialect of Chinese have unique words, so neither can fully be contained by the other, i.e. be a parent of the other).

The current system in NLI - Languages and has_dialect - does work and Idan therefore does have a stable basis to work with. No reason to worry. Language tags are just an improvement.

So here’s the syntax, based on the one from the BCP but not identical (e.g. I’m dropping the old “grandfathered” tags and the “private use” syntax). I’m using the same notation used in [[Idan’s spec|/projects/idan/spec]].

Hold on… before I can really write the complete syntax, I must decide how these tags are related to the header’s language chooser, lang member language labels, language specifiers in ns:label references and NLI members.

Important question: Can any language, region and variant be combined? If not, what are the rules?

Let’s start with a bit of modeling. Things like regions aren’t specific to languages, but I’ll move them as needed later. For now the whole thing is in NLI. Some classes may be:

Ummm…

Is “language” the whole tag, or just the primary first part? I need some terms for this, from linguistics. Let’s see Wikipedia.

First, there are language families. Suppose each one is a Class, and we can have Languages which are members of families. Each language has several codes according to several standards. In addition to the ISO ones, Wikipedia also lists the [[!wikipedia Glottolog]] code - seems a good idea because they publish under CC BY-SA (the Ethnologue is proprietary at the time of writing, i.e. not a free culture work; see [[here|http://www.ethnologue.com/terms-use]]). Also check out the [[!wikipedia "Linguasphere Observatory"]].

Languages also have [[!wikipedia dialect]]s, in the linguistic sense. There are various kinds of them, such as [[!wikipedia sociolect]]s, [[!wikipedia ethnolect]]s and regiolects. Dialects are one kind of [[!wikipedia "variety (linguistics)"]].

Worth noting: While the variety of properties for IETF language tags allows a huge number of combinations, it’s very inconvenient for all the subtags to be understood and managed by everyone. People just need to know their tag, and use it everywhere. Perhaps it then makes sense to have predefined languages in a central place, and other text can refer to these. For example, if language X isn’t spoken in region Y, combining them doesn’t make much sense.

Let’s call the various varieties “lects” from now on, to distinguish from “languages”.

Language tags can have script subtags. These indicate the characters used for writing, i.e. the [[!wikipedia "writing system"]].

Region subtags can be country codes or numbers as assigned in UN M.49, where continents etc. get three-digit codes.

A short break from learning - back to modeling. Let’s call each specific variety nli:Lect (term taken from Wikipedia). Each lect is a variant of some nli:Language. For example, American English may be a variant of English.
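To make the relation concrete, here is a tiny sketch of it as Python data classes. This is only an illustration of "each lect is a variant of some language". The class and field names (`Language`, `Lect`, `variant_of`, `region`) are placeholders of mine, not actual NLI definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Language:
    """Stand-in for nli:Language."""
    name: str

@dataclass(frozen=True)
class Lect:
    """Stand-in for nli:Lect: a specific variety of some language."""
    name: str
    variant_of: Language
    region: Optional[str] = None  # hypothetical field; regions may move elsewhere later

english = Language("English")
american_english = Lect("American English", variant_of=english, region="US")
```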

Here’s a summary of concepts so far - because I’m getting confused. Certain stressful things going on in my life are distracting me from this a bit, so I need a reminder or I’ll never make a good decision here… here it is:

Language
Dialect
Sociolect
Ethnolect
Regiolect
Variety
Lect
Writing System
Language Family
Region
Continent
Country

Language Tag Syntax

langtag   = language
            (sep script)?
            (sep region)?
            (sep variant)*
            (sep extension)*
language  = letter #2-3 (sep extlang)?
script    =
region    =
variant   =
extension =

sep       = "-"
letter    = [a-zA-Z]
digit     = [0-9]
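For reference while the rules above are still incomplete, here is how the corresponding BCP47 (RFC 5646) productions look as a Python regular expression, following the outline above: no grandfathered tags, no private use, and a primary language of 2-3 letters. This is a sketch of the BCP rules only, not a decision for Idan. The names `LANGTAG` and `parse` are made up for illustration.

```python
import re

LANGTAG = re.compile(r"""
    (?P<language>[A-Za-z]{2,3})                              # primary language
    (?:-(?P<extlang>[A-Za-z]{3}(?:-[A-Za-z]{3}){0,2}))?      # up to 3 extlangs
    (?:-(?P<script>[A-Za-z]{4}))?                            # 4 letters
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?                   # 2 letters or 3 digits
    (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*)
""", re.VERBOSE)

def parse(tag):
    """Split a tag into its parts, or return None if it doesn't match."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    parts = m.groupdict()
    # Variants are captured as one "-a-b" string; turn them into a list.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    parts["extensions"] = parts["extensions"].lstrip("-")
    return parts
```

Note that this checks only the shape of a tag. Whether a given combination of language, region and variant is meaningful is a separate question that needs the registry.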

Links:

[[!wikipedia "ISO 639-1"]]
[[!wikipedia "ISO 639-2"]]
[[!wikipedia "ISO 639-3"]]
[[!wikipedia "ISO 639-5"]]

Decision

None yet.
