Sep. 4th, 2014
no, seriously
Sep. 4th, 2014 03:57 pm
We have UTF-8, which uses bytes to encode Unicode characters. In alphabetic languages a Unicode character is just a letter or a symbol, but once we get into the Chinese code range the characters become symbols that carry a fair amount of meaning on their own.
So why not go further: start using 64-bit codepoints and just enumerate everything? Letters and other characters go first; then words of all languages; then standardized objects (the Eiffel Tower, the Rubik's Cube, this new code table, the weather in San Mateo on July 16th, 2008 at 7:32 pm (ok, maybe), you, me, etc.).
The thing is, there must be a standard representation; okay, maybe. The bigger issue is that alphabets are for linear streaming of ideas, so maybe we should limit ourselves to the linear stuff: basically, just languages.
But anyway.
Ok, 256 bits must be enough for everything except the list of all groups of 256-bit characters.
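To put rough numbers on this, here is a small Python sketch. The helper names `utf8_len` and `code_space` are made up for illustration, and the sample characters are just my picks; it only shows how many bytes UTF-8 spends on a letter versus a Chinese character, and how many distinct things a fixed-width code of a given size could enumerate.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


def code_space(bits: int) -> int:
    """Number of distinct codepoints a fixed-width code of `bits` bits can name."""
    return 2 ** bits


# A Latin letter fits in one byte; a common Chinese character takes three.
print(utf8_len("A"), utf8_len("中"))

# Size of the hypothetical 64-bit and 256-bit code tables.
print(code_space(64))
print(code_space(256))
```

For comparison, Unicode as it stands has room for 1,114,112 codepoints (U+0000 through U+10FFFF), so even the 64-bit table would be astronomically larger.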