Mirror of the Rel4tion website/wiki source, view at <http://rel4tion.org>
1.mdwn
[[!template id=ticket class=task done=yes]] [[!tag /projects/idan/decisions]]
[[!meta title=“Characters allowed in String and Character literals”]]
Problem
While String and Character support full Unicode, some characters don’t make sense when specified as-is in Idan files, such as a newline. A linefeed Character literal would look like this:
<%> myns:favorite_character '
'
And if someone edited the file on another system and changed the line ending character from LF to something else, the meaning of the statement would change! Very bad.
In general, control characters, invisible characters and the like are not a good idea to write as-is in character and string literals.
Solution Process
First Thoughts
On one hand, Idan should be more accessible and less technical than programming languages: easy, convenient and more-or-less intuitive to write. On the other hand, the text must be clear, simple and readable. Therefore, the solution should probably be somewhere between the “ASCII only” you expect in programming languages, and the “any character” you might see in data languages that are less human-oriented than Idan.
First let’s deal with characters, and then adapt the solution to strings.
At the time of writing, the tutorial says anything except for CR and LF is allowed in Character literals. This solves the ugliness in the example above, but it doesn’t solve the problem for control/invisible characters.
An important point which should be examined is whether forbidding these characters is the right thing to do, as opposed to discouraging their use as-is while supporting them in the syntax. Why not provide the alternative? Do make a list of suggested categories to escape, while still allowing them to be typed into the file as-is.
For example, the direction change characters. I type in Hebrew often, which is a right-to-left language, and I sometimes use the RTL and LTR direction control characters. Typing such a character alone as a Character literal is a very bad idea, because it looks just like any other invisible character would look:
''
And it also looks exactly like a pair of `'`s with nothing in between, which is an error. Hmmm… looks like the parser should at least warn about this.
The next steps would be:
- See what XML, Turtle and Haskell do
- Understand Unicode character properties better
XML
[[XML 1.0 5th edition|http://www.w3.org/TR/REC-xml/]] says the following in section 2.2 titled “Characters”:
A character is an atomic unit of text as specified by ISO/IEC 10646:2000. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char.
That’s a bit strange, isn’t it? I got the impression that Unicode is a superset of the whole of ASCII. Where are all the other characters between null (`\0`) and space (`\x20`)? Here are the ISO standards:
- [[ISO/IEC 10646-1:2000|http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=29819]] (withdrawn standard)
- [[ISO/IEC 10646:2014|http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=63182]] (current standard)
It seems, judging by other sources, that the whole ASCII indeed exists as-is in Unicode. Is this reflected in both of the documents? Checking… ummm it’s not what I thought. The 2000 standard isn’t even public, and my browser fails to show the preview. Looks like it needs JS. The 2014 standard is available as a PDF file, but the top of the download page says all the files are proprietary. It begins to stink… then when I click the download link of the 2014 standard, I get a license agreement form. Each line is more disgusting than the one before it. I don’t agree to these terms. While I could click Agree anyway or download/create a torrent of this file, this time it’s all about standards, so we’ll just use Wikipedia and other freely available resources. I’m sure you’re doing great work, ISO, but it doesn’t make sense to tell me that 1 printed copy is okay but 2 copies are a reason to “punish by criminal law”. Publish your work with love, or not at all.
Going back to XML, that quote is followed by a definition:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
and a comment:
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
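To make the gap concrete, here is a minimal Python sketch of the `Char` production quoted above. The names `XML_CHAR_RANGES` and `is_xml_char` are made up for this illustration; only the ranges come from the XML specification.

```python
# Ranges copied from the XML 1.0 "Char" production quoted above.
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches XML's Char."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


# Tab is accepted, but the other ASCII controls below space are not.
for sample in ["\t", "\x00", "\x1b", "a", "\ufffe"]:
    print(f"U+{ord(sample):04X}", is_xml_char(sample))
```

So NUL, ESC and the rest of the low ASCII controls (other than tab, LF and CR) simply fall outside the ranges.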
Then there’s a note:
Document authors are encouraged to avoid “compatibility characters”, as defined in section 2.3 of Unicode. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
Then this:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
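The discouraged ranges follow a simple pattern: three fixed ranges, plus the last two code points of each of planes 1 through 16. A rough Python sketch, assuming that reading of the list (the function names are hypothetical):

```python
def xml_discouraged_ranges():
    """Build the list of discouraged ranges from the XML 1.0 note."""
    ranges = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDEF)]
    # The last two code points of planes 1..16: xFFFE and xFFFF.
    for plane in range(1, 0x11):
        base = plane << 16
        ranges.append((base + 0xFFFE, base + 0xFFFF))
    return ranges


def is_xml_discouraged(ch: str) -> bool:
    """Return True if ch falls in one of the discouraged ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in xml_discouraged_ranges())


print(is_xml_discouraged("\x7f"))  # DELETE, a control character
print(is_xml_discouraged("\x85"))  # NEXT LINE sits in the gap between the first two ranges
```

A parser could use a check like this to warn rather than reject, which is what the XML note itself suggests.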
This may or may not be up-to-date with the current version of Unicode. Unicode’s own file - the character database from old version 2.1 - does have all those control characters which XML seems to ignore. The data can be found [[here|http://www.unicode.org/Public/2.1-Update/UnicodeData-2.1.2.txt]].
Turtle
Turtle’s double-quoted strings forbid as-is usage of `"`, `\`, CR and LF, which are available via escape sequences. All the other characters defined as “Char” above, in the XML specification, are allowed as-is.
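As an illustration of how small that forbidden set is, here is a Python sketch that writes a value as a Turtle double-quoted string, escaping only the four characters mentioned above. `turtle_quote` and `TURTLE_ESCAPES` are names invented for this example, and the sketch does not handle Turtle's other string forms.

```python
# The four characters that may not appear as-is, and their escapes.
TURTLE_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\r": "\\r",
    "\n": "\\n",
}


def turtle_quote(value: str) -> str:
    """Wrap value in double quotes, escaping only what must be escaped."""
    body = "".join(TURTLE_ESCAPES.get(ch, ch) for ch in value)
    return '"' + body + '"'


print(turtle_quote('say "hi"\nbye'))
print(turtle_quote("שלום"))  # everything else passes through untouched
```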
Haskell
[[Haskell|https://www.haskell.org/onlinereport/haskell2010/haskellch10.html]] strings similarly forbid `"` and `\` (characters forbid `'`), and allow all the other characters matched by a grammar symbol called graphic in the Report. Its rule is:
graphic → small | large | symbol | digit | special | " | '
small and large are for Unicode letters. digit is for Unicode decimal digits. symbol is for most Unicode symbols and punctuation. special is for several ASCII characters which have special meaning in Haskell syntax. So the general idea is: Letters, digits, symbols and punctuation.
Character Properties
Are there character properties we can use to identify the characters that are visible and safe to type into literals in Idan? [[Chapter 4|http://www.unicode.org/versions/Unicode7.0.0/ch04.pdf]] of Unicode version 7 lists the properties, on page 4. Wikipedia has some info too. Each character has a single General Category. The major categories seem relevant: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Is it enough to test by these?
Actually the [[character database|http://www.unicode.org/reports/tr44/#Properties]] seems to have useful info too. We need to know exactly what these categories mean. Wikipedia also has a page about [[!wikipedia “Unicode character property”]]. A lot of info!
General Categories aren’t the only useful thing. There are also Basic Types: Graphic, Format, Control, Private-use, Surrogate, Noncharacter and Reserved.
Questions:
- Is limiting to Graphical reasonable?
- Are all Graphical characters okay?
Graphicals are: Letters, Marks, Numbers, Punctuations, Symbols and some Separators. It means that some Separators and all the Others are excluded.
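A quick way to experiment with this definition is Python's standard `unicodedata` module, which exposes the General Category of each character. The sketch below assumes that "some Separators" means exactly the space separators (`Zs`), with line and paragraph separators (`Zl`, `Zp`) excluded; `is_graphic` is a name chosen for this example, and the results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata


def is_graphic(ch: str) -> bool:
    """Return True if ch has the Graphic basic type.

    Graphic here means General Category L*, M*, N*, P*, S* or Zs.
    """
    category = unicodedata.category(ch)
    return category[0] in "LMNPS" or category == "Zs"


for sample in ["a", "\u05d0", " ", "\u0301", "\n", "\u200f", "\u2028"]:
    print(f"U+{ord(sample):04X}", unicodedata.category(sample), is_graphic(sample))
```

The right-to-left mark U+200F has category `Cf` (Format), so it is not Graphic, which matches the intuition that it should not appear as-is in a literal.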
Reading… reading… reading…
It seems that picking up Graphical will work, except for one small issue: non spacing marks, i.e. category `Mn`. These are marks that don’t occupy line space, and instead can appear below the previous character or above it and so on. So using one of these between `'`s is a bit ugly. Possible solution: allow spaces before and after the character! Instead of this:
quote char quote
have this:
quote space* char space* quote
The space character itself can be a “special case”.
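The padded form can be sketched with a regular expression. This is only an illustration in Python, not Idan's actual lexer: `parse_padded_char` is a hypothetical name, and treating exactly `' '` as the space literal is one possible reading of the "special case".

```python
import re

# quote space* char space* quote, where char is any single non-space character
PADDED_CHAR = re.compile(r"' *([^ ]) *'")


def parse_padded_char(literal: str) -> str:
    """Return the character denoted by a possibly space-padded literal."""
    if literal == "' '":
        # Assumed special case: a lone space between the quotes is the space itself.
        return " "
    match = PADDED_CHAR.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a character literal: {literal!r}")
    return match.group(1)


# A combining acute accent (category Mn) padded with spaces so it has room to render.
print(hex(ord(parse_padded_char("' \u0301 '"))))
print(parse_padded_char("'a'"))
```

With this reading, a literal holding two or more spaces and nothing else is rejected, since it is unclear which space would be the character.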
The rest of the characters are undefined, invisible or have no graphical form (e.g. control characters at the beginning of ASCII). It still may be nice if the lexer catches them and maybe uses Unicode knowledge to produce friendly errors etc., but for literals Graphical looks like a good solution.
Another point to consider: Maybe the ASCII space `\x20` isn’t the only line space character that is graphical. If that’s true, limiting to Graphical may not be enough. UCD file [[PropList.txt|http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt]] lists in lines 11-21 the 25 characters with the White_Space property. These are the same 25 in Unicode 6.3, listed in Wikipedia at [[!wikipedia “Unicode character property#Whitespace”]]. So how many of these are Graphical? Here’s a list:
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
New suggestion: Allow all non-whitespace Graphicals and the ASCII space. Hmmm, what if in some languages it’s common to use a space that isn’t the ASCII one? Hard to say. Right now it seems no other Graphical space is too common to exclude from the as-is set, but I suppose it could change in the future.
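This suggestion is easy to try out in Python. The sketch assumes that the graphical whitespace characters are exactly those with General Category `Zs`, which matches the list above for the Unicode versions discussed here; `allowed_as_is` and `graphical_spaces` are names invented for the example.

```python
import unicodedata


def allowed_as_is(ch: str) -> bool:
    """Non-whitespace Graphic characters, plus the ASCII space."""
    category = unicodedata.category(ch)
    if category == "Zs":
        # Graphic whitespace: only the ASCII space is accepted.
        return ch == " "
    return category[0] in "LMNPS"


# All space separators in the Basic Multilingual Plane, as hex code points.
# Every entry in the list above lies in the BMP, so higher planes are skipped.
graphical_spaces = [
    f"{cp:04X}" for cp in range(0x10000) if unicodedata.category(chr(cp)) == "Zs"
]
print(graphical_spaces)
print(allowed_as_is(" "), allowed_as_is("\u00a0"))
```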
Another idea: Allow characters to be specified by their name, which is much more indicative than their hex number. Are names safe to use for long term? Chapter 4 of Unicode 7 says that names are unique, and a name is immutable once assigned. Implementations can rely on name uniqueness. Great! Just one problem: The names are in English. Still useful, of course. There are also name aliases, in English too, in the file NameAliases.txt. I could even allow flexible syntax in which letter case, spaces and mediating hyphens don’t matter.
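Flexible name matching could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the normalization rule (uppercase, underscores and hyphens between two other characters become spaces), the function names, and the choice to index only the names that `unicodedata.name` reports, which excludes aliases from NameAliases.txt and therefore the control characters. Whether this normalization ever maps two different names to the same key has not been checked; the sketch keeps the lowest code point if it does.

```python
import re
import unicodedata
from functools import lru_cache


def _key(name: str) -> str:
    """Normalize a name: case, underscores and mediating hyphens don't matter."""
    text = name.upper().replace("_", " ")
    # Only hyphens with a non-space, non-hyphen character on both sides.
    text = re.sub(r"(?<=[^\s-])-(?=[^\s-])", " ", text)
    return " ".join(text.split())


@lru_cache(maxsize=1)
def _name_index() -> dict:
    """Map normalized names to characters; built once, on first use."""
    index = {}
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            index.setdefault(_key(name), chr(cp))
    return index


def char_by_name(name: str) -> str:
    """Look up a character by a loosely written Unicode name."""
    try:
        return _name_index()[_key(name)]
    except KeyError:
        raise ValueError(f"unknown character name: {name!r}") from None


print(hex(ord(char_by_name("hebrew letter alef"))))
print(hex(ord(char_by_name("Right_To_Left_Mark"))))
```

Building the index walks the whole code space once, so a real implementation would more likely ship a precomputed table.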
XML Revisited
Now I understand better what “control” and “undefined” mean there. A parser can warn if these characters are found.
Character Names
Decision
There are several ways to specify characters:
- The character as-is between single quotes
- A string escape sequence between single quotes
- A numeric escape sequence between single quotes
- An optionally space-surrounded character as-is between `'''`s
- A character name between backticks, possibly with langtag appended
Rules and notes:
- Only and all Graphical characters are allowed as-is, except for `'` and `\`, which must be escaped
- Warn about graphical whitespace that isn’t the ASCII space `'\x20'`
- Warn about use of `'''` and surrounding spaces where not necessary
- Character names are case insensitive and mediating hyphens can be replaced with underscores or ASCII spaces
- The `\&` sequence isn’t allowed; it’s only for strings
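The first two rules above can be sketched as a small classifier. This is a hedged illustration in Python rather than the real Idan parser: `check_as_is` and its three verdict strings are invented here, and Graphical is taken to mean General Category L, M, N, P, S or Zs.

```python
import unicodedata


def check_as_is(ch: str) -> str:
    """Classify a character written as-is in a literal: 'ok', 'warn' or 'error'."""
    if ch in ("'", "\\"):
        return "error"  # must be written as an escape sequence
    category = unicodedata.category(ch)
    if category == "Zs":
        # Graphical whitespace other than the ASCII space only triggers a warning.
        return "ok" if ch == " " else "warn"
    if category[0] in "LMNPS":
        return "ok"
    return "error"  # control, format, unassigned and so on


for sample in ["a", " ", "\u00a0", "'", "\u200f"]:
    print(f"U+{ord(sample):04X}", check_as_is(sample))
```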
There are several ways to specify strings:
- Between `"`s on a single line, with `\` and `"` escaped
- Between `"""`s on 1 or more lines, with `\` and `"""` escaped
- A sequence of strings of types 1 and/or 2, the resulting value being their concatenation