Re: unicode length
- Posted by jmduro Jan 04, 2020
Thank you Pete,
With a simple mapping function, UTF-8-aware versions of almost all string functions can be built:
include std/types.e
include std/text.e
include std/sequence.e

constant FROM=1, UPTO=2

------------------------------------------------------------------------------
-- returns position of each utf8 character in a string as a sequence of pairs
-- {from, upto}
-- example:
-- umap("$£€¥") = {
--   {1,1}, -- first character ranges from position 1 to position 1
--   {2,3}, -- second character ranges from position 2 to position 3
--   {4,6}, -- third character ranges from position 4 to position 6
--   {7,8}  -- fourth character ranges from position 7 to position 8
-- }
function umap(string s)
  integer i = 1
  sequence res = {}
  while i <= length(s) do
    integer si = s[i]
    if and_bits(si, #80) = #00 then      -- 0xxxxxxx: 1-byte character
      res = append(res, {i, i})
      i += 1
    elsif and_bits(si, #E0) = #C0 then   -- 110xxxxx: 2-byte character
      res = append(res, {i, i+1})
      i += 2
    elsif and_bits(si, #F0) = #E0 then   -- 1110xxxx: 3-byte character
      res = append(res, {i, i+2})
      i += 3
    elsif and_bits(si, #F8) = #F0 then   -- 11110xxx: 4-byte character
      res = append(res, {i, i+3})
      i += 4
    else                                 -- invalid lead byte: treated as 1 byte
      res = append(res, {i, i})
      i += 1
    end if
  end while
  return res
end function

------------------------------------------------------------------------------
function ulength(string s)
  return length(umap(s))
end function

------------------------------------------------------------------------------
function uhead(string s, integer n)
  sequence um = umap(s)
  return head(s, um[n][UPTO])
end function

------------------------------------------------------------------------------
function ureverse(string s)
  sequence um = umap(s)
  sequence res = {}
  for i = length(um) to 1 by -1 do
    res &= s[um[i][FROM]..um[i][UPTO]]
  end for
  return res
end function

------------------------------------------------------------------------------
function uremove(string s, integer start, integer stop=start)
  sequence um = umap(s)
  return remove(s, um[start][FROM], um[stop][UPTO])
end function

------------------------------------------------------------------------------
function ureplace(string s, object what, integer start, integer stop=start)
  sequence um = umap(s)
  sequence res = remove(s, um[start][FROM], um[stop][UPTO])
  -- splice() rather than insert(): insert() would nest a string
  -- replacement as a single element instead of merging its bytes
  return splice(res, what, um[start][FROM])
end function

------------------------------------------------------------------------------
This is just a subset of what can be done.
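As one more illustration of what else can be built the same way, here is a minimal sketch of a character-wise slice. To keep it self-contained it does not reuse umap; instead it records where each character starts by skipping UTF-8 continuation bytes (10xxxxxx). The names ustarts and uslice are hypothetical, and the input is assumed to be valid UTF-8.

```euphoria
-- hypothetical helper: byte index where each character starts,
-- followed by a sentinel one past the end of the string
function ustarts(sequence s)
  sequence res = {}
  for i = 1 to length(s) do
    if and_bits(s[i], #C0) != #80 then   -- not a continuation byte
      res &= i
    end if
  end for
  return res & (length(s) + 1)
end function

-- hypothetical helper: characters start..stop (1-based, inclusive)
function uslice(sequence s, integer start, integer stop)
  sequence st = ustarts(s)
  return s[st[start] .. st[stop + 1] - 1]
end function

puts(1, uslice("$£€¥", 2, 3) & "\n")   -- the pound and euro signs
```

Like the functions above, this sketch does no bounds checking: start and stop must lie within the character count.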
Of course, UTF-8 strings must be written in a source file saved with UTF-8 encoding, so this is mostly useful on Linux. Even US programmers may need it, for example to draw box frames.
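The box frame case can be sketched as follows. This is only an illustration under assumptions: the source file is saved as UTF-8, the terminal displays UTF-8, and every character occupies one column. The names ucount and boxed are hypothetical. The frame width is taken from the character count rather than from length(), which would count bytes and make the frame too wide.

```euphoria
-- hypothetical helper: counts characters by skipping continuation bytes
function ucount(sequence s)
  integer n = 0
  for i = 1 to length(s) do
    if and_bits(s[i], #C0) != #80 then
      n += 1
    end if
  end for
  return n
end function

-- hypothetical helper: draws a single-line frame around a UTF-8 text
procedure boxed(sequence text)
  integer w = ucount(text) + 2   -- one space of padding on each side
  sequence bar = ""
  for i = 1 to w do
    bar &= "─"
  end for
  puts(1, "┌" & bar & "┐\n")
  puts(1, "│ " & text & " │\n")
  puts(1, "└" & bar & "┘\n")
end procedure

boxed("$£€¥")
```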
Here is an example that compares ASCII and UTF-8 functions:
sequence currencies = "$£€¥" -- dollar, pound, euro, yen

printf(1, "currencies = %s\n", {currencies})
printf(1, "umap(currencies) = %s\n", {sprint(umap(currencies))})
puts(1, "\n")
printf(1, "length(currencies) = %d\n", {length(currencies)})
printf(1, "ulength(currencies) = %d\n", {ulength(currencies)})
puts(1, "\n")
printf(1, "head(currencies, 3) = %s\n", {head(currencies, 3)})
printf(1, "uhead(currencies, 3) = %s\n", {uhead(currencies, 3)})
puts(1, "\n")
printf(1, "reverse(currencies) = %s\n", {reverse(currencies)})
printf(1, "ureverse(currencies) = %s\n", {ureverse(currencies)})
puts(1, "\n")
printf(1, "remove(currencies, 2) = %s\n", {remove(currencies, 2)})
printf(1, "uremove(currencies, 2) = %s\n", {uremove(currencies, 2)})
puts(1, "\n")
printf(1, "replace(currencies, 'E', 3) = %s\n", {replace(currencies, 'E', 3)})
printf(1, "ureplace(currencies, 'E', 3) = %s\n", {ureplace(currencies, 'E', 3)})
Jean-Marc