1. I hope there's a more efficient way of doing this
- Posted by axtens Oct 18, 2008
- 1103 views
G'day everyone,
Now that I've given up on returning an array of BSTRs (for now at least), it's time to get on to the project itself. First off the rank is a DLL function which lists the Unicode range that each character in the input stream falls into.
The VB6 code calling the DLL, and using a typelib to avoid the use of Declare statements, looks like this:
    Dim cfg As Std.Config

    Sub Main()
        Set cfg = New Config
        cfg.Load "test.cfg"
        sText = cfg.RecallElse("test", "dog")
        Dim s As String
        s = BLOCKS(sText)
        Dim a() As String
        a = Split(s, vbTab)
        Dim b
        For Each b In a
            Debug.Print b
        Next
    End Sub

The cfg call is to a home-grown COM DLL which provides access to .cfg files (ANSI or Unicode). In test.cfg is stored the following, which probably won't reproduce very well here, so beneath it are the results of a Euphoria print of the data parsed from it.
test=¶á‡â…¿æ¯…毅訜è¨ð’…ñ„‘„ò™¦™
{13,{182,4557,8575,27589,27589,35356,35357,55304,57157,55505,56388,55846,56729}}

Now comes the ugliest bit of Euphoria I think I've ever had the cause to generate (I can send you the VBScript that did it if you like). My question is (in expectation that you're not going to scroll down to the end) is this: Is there any better way of doing NumToBlock? I'm fairly sure there's no 'case' statement in Euphoria, but things have changed a fair bit since v1.5 (when I first encountered the language), so there may well be a solution standing right under my nose.
    include machine.e
    include dll.e
    include wildcard.e
    include w32msgs.e
    include w32def_series.ew
    include Unicode.ew
    include variant.ew

    function peek_vb( atom ptr )
        sequence str, temp, len
        str = ""
        ptr -= 4
        len = peek( {ptr,4} ) -- get length DWORD
        ptr += 4
        temp = peek( {ptr, 2} )
        ptr += 2
        while not equal( temp, {0,0} ) do
            str &= ( temp[2] * 256 + temp[1] )
            temp = peek( {ptr,2} )
            ptr += 2
        end while
        return { bytes_to_int(len) / 2 , str }
    end function

    function NumToBlock( atom codepoint )
        if ( codepoint >= #0000 and codepoint <= #007F ) then return "Basic Latin" end if
        if ( codepoint >= #0080 and codepoint <= #00FF ) then return "Latin-1 Supplement" end if
        if ( codepoint >= #0100 and codepoint <= #017F ) then return "Latin Extended-A" end if
        if ( codepoint >= #0180 and codepoint <= #024F ) then return "Latin Extended-B" end if
        if ( codepoint >= #0250 and codepoint <= #02AF ) then return "IPA Extensions" end if
        if ( codepoint >= #02B0 and codepoint <= #02FF ) then return "Spacing Modifier Letters" end if
        if ( codepoint >= #0300 and codepoint <= #036F ) then return "Combining Diacritical Marks" end if
        if ( codepoint >= #0370 and codepoint <= #03FF ) then return "Greek and Coptic" end if
        if ( codepoint >= #0400 and codepoint <= #04FF ) then return "Cyrillic" end if
        if ( codepoint >= #0500 and codepoint <= #052F ) then return "Cyrillic Supplement" end if
        if ( codepoint >= #0530 and codepoint <= #058F ) then return "Armenian" end if
        if ( codepoint >= #0590 and codepoint <= #05FF ) then return "Hebrew" end if
        if ( codepoint >= #0600 and codepoint <= #06FF ) then return "Arabic" end if
        if ( codepoint >= #0700 and codepoint <= #074F ) then return "Syriac" end if
        if ( codepoint >= #0750 and codepoint <= #077F ) then return "Arabic Supplement" end if
        if ( codepoint >= #0780 and codepoint <= #07BF ) then return "Thaana" end if
        if ( codepoint >= #07C0 and codepoint <= #07FF ) then return "NKo" end if
        if ( codepoint >= #0900 and codepoint <= #097F ) then return "Devanagari" end if
        if ( codepoint >= #0980 and codepoint <= #09FF ) then return "Bengali" end if
        if ( codepoint >= #0A00 and codepoint <= #0A7F ) then return "Gurmukhi" end if
        if ( codepoint >= #0A80 and codepoint <= #0AFF ) then return "Gujarati" end if
        if ( codepoint >= #0B00 and codepoint <= #0B7F ) then return "Oriya" end if
        if ( codepoint >= #0B80 and codepoint <= #0BFF ) then return "Tamil" end if
        if ( codepoint >= #0C00 and codepoint <= #0C7F ) then return "Telugu" end if
        if ( codepoint >= #0C80 and codepoint <= #0CFF ) then return "Kannada" end if
        if ( codepoint >= #0D00 and codepoint <= #0D7F ) then return "Malayalam" end if
        if ( codepoint >= #0D80 and codepoint <= #0DFF ) then return "Sinhala" end if
        if ( codepoint >= #0E00 and codepoint <= #0E7F ) then return "Thai" end if
        if ( codepoint >= #0E80 and codepoint <= #0EFF ) then return "Lao" end if
        if ( codepoint >= #0F00 and codepoint <= #0FFF ) then return "Tibetan" end if
        if ( codepoint >= #1000 and codepoint <= #109F ) then return "Myanmar" end if
        if ( codepoint >= #10A0 and codepoint <= #10FF ) then return "Georgian" end if
        if ( codepoint >= #1100 and codepoint <= #11FF ) then return "Hangul Jamo" end if
        if ( codepoint >= #1200 and codepoint <= #137F ) then return "Ethiopic" end if
        if ( codepoint >= #1380 and codepoint <= #139F ) then return "Ethiopic Supplement" end if
        if ( codepoint >= #13A0 and codepoint <= #13FF ) then return "Cherokee" end if
        if ( codepoint >= #1400 and codepoint <= #167F ) then return "Unified Canadian Aboriginal Syllabics" end if
        if ( codepoint >= #1680 and codepoint <= #169F ) then return "Ogham" end if
        if ( codepoint >= #16A0 and codepoint <= #16FF ) then return "Runic" end if
        if ( codepoint >= #1700 and codepoint <= #171F ) then return "Tagalog" end if
        if ( codepoint >= #1720 and codepoint <= #173F ) then return "Hanunoo" end if
        if ( codepoint >= #1740 and codepoint <= #175F ) then return "Buhid" end if
        if ( codepoint >= #1760 and codepoint <= #177F ) then return "Tagbanwa" end if
        if ( codepoint >= #1780 and codepoint <= #17FF ) then return "Khmer" end if
        if ( codepoint >= #1800 and codepoint <= #18AF ) then return "Mongolian" end if
        if ( codepoint >= #1900 and codepoint <= #194F ) then return "Limbu" end if
        if ( codepoint >= #1950 and codepoint <= #197F ) then return "Tai Le" end if
        if ( codepoint >= #1980 and codepoint <= #19DF ) then return "New Tai Lue" end if
        if ( codepoint >= #19E0 and codepoint <= #19FF ) then return "Khmer Symbols" end if
        if ( codepoint >= #1A00 and codepoint <= #1A1F ) then return "Buginese" end if
        if ( codepoint >= #1B00 and codepoint <= #1B7F ) then return "Balinese" end if
        if ( codepoint >= #1B80 and codepoint <= #1BBF ) then return "Sundanese" end if
        if ( codepoint >= #1C00 and codepoint <= #1C4F ) then return "Lepcha" end if
        if ( codepoint >= #1C50 and codepoint <= #1C7F ) then return "Ol Chiki" end if
        if ( codepoint >= #1D00 and codepoint <= #1D7F ) then return "Phonetic Extensions" end if
        if ( codepoint >= #1D80 and codepoint <= #1DBF ) then return "Phonetic Extensions Supplement" end if
        if ( codepoint >= #1DC0 and codepoint <= #1DFF ) then return "Combining Diacritical Marks Supplement" end if
        if ( codepoint >= #1E00 and codepoint <= #1EFF ) then return "Latin Extended Additional" end if
        if ( codepoint >= #1F00 and codepoint <= #1FFF ) then return "Greek Extended" end if
        if ( codepoint >= #2000 and codepoint <= #206F ) then return "General Punctuation" end if
        if ( codepoint >= #2070 and codepoint <= #209F ) then return "Superscripts and Subscripts" end if
        if ( codepoint >= #20A0 and codepoint <= #20CF ) then return "Currency Symbols" end if
        if ( codepoint >= #20D0 and codepoint <= #20FF ) then return "Combining Diacritical Marks for Symbols" end if
        if ( codepoint >= #2100 and codepoint <= #214F ) then return "Letterlike Symbols" end if
        if ( codepoint >= #2150 and codepoint <= #218F ) then return "Number Forms" end if
        if ( codepoint >= #2190 and codepoint <= #21FF ) then return "Arrows" end if
        if ( codepoint >= #2200 and codepoint <= #22FF ) then return "Mathematical Operators" end if
        if ( codepoint >= #2300 and codepoint <= #23FF ) then return "Miscellaneous Technical" end if
        if ( codepoint >= #2400 and codepoint <= #243F ) then return "Control Pictures" end if
        if ( codepoint >= #2440 and codepoint <= #245F ) then return "Optical Character Recognition" end if
        if ( codepoint >= #2460 and codepoint <= #24FF ) then return "Enclosed Alphanumerics" end if
        if ( codepoint >= #2500 and codepoint <= #257F ) then return "Box Drawing" end if
        if ( codepoint >= #2580 and codepoint <= #259F ) then return "Block Elements" end if
        if ( codepoint >= #25A0 and codepoint <= #25FF ) then return "Geometric Shapes" end if
        if ( codepoint >= #2600 and codepoint <= #26FF ) then return "Miscellaneous Symbols" end if
        if ( codepoint >= #2700 and codepoint <= #27BF ) then return "Dingbats" end if
        if ( codepoint >= #27C0 and codepoint <= #27EF ) then return "Miscellaneous Mathematical Symbols-A" end if
        if ( codepoint >= #27F0 and codepoint <= #27FF ) then return "Supplemental Arrows-A" end if
        if ( codepoint >= #2800 and codepoint <= #28FF ) then return "Braille Patterns" end if
        if ( codepoint >= #2900 and codepoint <= #297F ) then return "Supplemental Arrows-B" end if
        if ( codepoint >= #2980 and codepoint <= #29FF ) then return "Miscellaneous Mathematical Symbols-B" end if
        if ( codepoint >= #2A00 and codepoint <= #2AFF ) then return "Supplemental Mathematical Operators" end if
        if ( codepoint >= #2B00 and codepoint <= #2BFF ) then return "Miscellaneous Symbols and Arrows" end if
        if ( codepoint >= #2C00 and codepoint <= #2C5F ) then return "Glagolitic" end if
        if ( codepoint >= #2C60 and codepoint <= #2C7F ) then return "Latin Extended-C" end if
        if ( codepoint >= #2C80 and codepoint <= #2CFF ) then return "Coptic" end if
        if ( codepoint >= #2D00 and codepoint <= #2D2F ) then return "Georgian Supplement" end if
        if ( codepoint >= #2D30 and codepoint <= #2D7F ) then return "Tifinagh" end if
        if ( codepoint >= #2D80 and codepoint <= #2DDF ) then return "Ethiopic Extended" end if
        if ( codepoint >= #2DE0 and codepoint <= #2DFF ) then return "Cyrillic Extended-A" end if
        if ( codepoint >= #2E00 and codepoint <= #2E7F ) then return "Supplemental Punctuation" end if
        if ( codepoint >= #2E80 and codepoint <= #2EFF ) then return "CJK Radicals Supplement" end if
        if ( codepoint >= #2F00 and codepoint <= #2FDF ) then return "Kangxi Radicals" end if
        if ( codepoint >= #2FF0 and codepoint <= #2FFF ) then return "Ideographic Description Characters" end if
        if ( codepoint >= #3000 and codepoint <= #303F ) then return "CJK Symbols and Punctuation" end if
        if ( codepoint >= #3040 and codepoint <= #309F ) then return "Hiragana" end if
        if ( codepoint >= #30A0 and codepoint <= #30FF ) then return "Katakana" end if
        if ( codepoint >= #3100 and codepoint <= #312F ) then return "Bopomofo" end if
        if ( codepoint >= #3130 and codepoint <= #318F ) then return "Hangul Compatibility Jamo" end if
        if ( codepoint >= #3190 and codepoint <= #319F ) then return "Kanbun" end if
        if ( codepoint >= #31A0 and codepoint <= #31BF ) then return "Bopomofo Extended" end if
        if ( codepoint >= #31C0 and codepoint <= #31EF ) then return "CJK Strokes" end if
        if ( codepoint >= #31F0 and codepoint <= #31FF ) then return "Katakana Phonetic Extensions" end if
        if ( codepoint >= #3200 and codepoint <= #32FF ) then return "Enclosed CJK Letters and Months" end if
        if ( codepoint >= #3300 and codepoint <= #33FF ) then return "CJK Compatibility" end if
        if ( codepoint >= #3400 and codepoint <= #4DBF ) then return "CJK Unified Ideographs Extension A" end if
        if ( codepoint >= #4DC0 and codepoint <= #4DFF ) then return "Yijing Hexagram Symbols" end if
        if ( codepoint >= #4E00 and codepoint <= #9FFF ) then return "CJK Unified Ideographs" end if
        if ( codepoint >= #A000 and codepoint <= #A48F ) then return "Yi Syllables" end if
        if ( codepoint >= #A490 and codepoint <= #A4CF ) then return "Yi Radicals" end if
        if ( codepoint >= #A500 and codepoint <= #A63F ) then return "Vai" end if
        if ( codepoint >= #A640 and codepoint <= #A69F ) then return "Cyrillic Extended-B" end if
        if ( codepoint >= #A700 and codepoint <= #A71F ) then return "Modifier Tone Letters" end if
        if ( codepoint >= #A720 and codepoint <= #A7FF ) then return "Latin Extended-D" end if
        if ( codepoint >= #A800 and codepoint <= #A82F ) then return "Syloti Nagri" end if
        if ( codepoint >= #A840 and codepoint <= #A87F ) then return "Phags-pa" end if
        if ( codepoint >= #A880 and codepoint <= #A8DF ) then return "Saurashtra" end if
        if ( codepoint >= #A900 and codepoint <= #A92F ) then return "Kayah Li" end if
        if ( codepoint >= #A930 and codepoint <= #A95F ) then return "Rejang" end if
        if ( codepoint >= #AA00 and codepoint <= #AA5F ) then return "Cham" end if
        if ( codepoint >= #AC00 and codepoint <= #D7AF ) then return "Hangul Syllables" end if
        if ( codepoint >= #D800 and codepoint <= #DB7F ) then return "High Surrogates" end if
        if ( codepoint >= #DB80 and codepoint <= #DBFF ) then return "High Private Use Surrogates" end if
        if ( codepoint >= #DC00 and codepoint <= #DFFF ) then return "Low Surrogates" end if
        if ( codepoint >= #E000 and codepoint <= #F8FF ) then return "Private Use Area" end if
        if ( codepoint >= #F900 and codepoint <= #FAFF ) then return "CJK Compatibility Ideographs" end if
        if ( codepoint >= #FB00 and codepoint <= #FB4F ) then return "Alphabetic Presentation Forms" end if
        if ( codepoint >= #FB50 and codepoint <= #FDFF ) then return "Arabic Presentation Forms-A" end if
        if ( codepoint >= #FE00 and codepoint <= #FE0F ) then return "Variation Selectors" end if
        if ( codepoint >= #FE10 and codepoint <= #FE1F ) then return "Vertical Forms" end if
        if ( codepoint >= #FE20 and codepoint <= #FE2F ) then return "Combining Half Marks" end if
        if ( codepoint >= #FE30 and codepoint <= #FE4F ) then return "CJK Compatibility Forms" end if
        if ( codepoint >= #FE50 and codepoint <= #FE6F ) then return "Small Form Variants" end if
        if ( codepoint >= #FE70 and codepoint <= #FEFF ) then return "Arabic Presentation Forms-B" end if
        if ( codepoint >= #FF00 and codepoint <= #FFEF ) then return "Halfwidth and Fullwidth Forms" end if
        if ( codepoint >= #FFF0 and codepoint <= #FFFF ) then return "Specials" end if
        if ( codepoint >= #10000 and codepoint <= #1007F ) then return "Linear B Syllabary" end if
        if ( codepoint >= #10080 and codepoint <= #100FF ) then return "Linear B Ideograms" end if
        if ( codepoint >= #10100 and codepoint <= #1013F ) then return "Aegean Numbers" end if
        if ( codepoint >= #10140 and codepoint <= #1018F ) then return "Ancient Greek Numbers" end if
        if ( codepoint >= #10190 and codepoint <= #101CF ) then return "Ancient Symbols" end if
        if ( codepoint >= #101D0 and codepoint <= #101FF ) then return "Phaistos Disc" end if
        if ( codepoint >= #10280 and codepoint <= #1029F ) then return "Lycian" end if
        if ( codepoint >= #102A0 and codepoint <= #102DF ) then return "Carian" end if
        if ( codepoint >= #10300 and codepoint <= #1032F ) then return "Old Italic" end if
        if ( codepoint >= #10330 and codepoint <= #1034F ) then return "Gothic" end if
        if ( codepoint >= #10380 and codepoint <= #1039F ) then return "Ugaritic" end if
        if ( codepoint >= #103A0 and codepoint <= #103DF ) then return "Old Persian" end if
        if ( codepoint >= #10400 and codepoint <= #1044F ) then return "Deseret" end if
        if ( codepoint >= #10450 and codepoint <= #1047F ) then return "Shavian" end if
        if ( codepoint >= #10480 and codepoint <= #104AF ) then return "Osmanya" end if
        if ( codepoint >= #10800 and codepoint <= #1083F ) then return "Cypriot Syllabary" end if
        if ( codepoint >= #10900 and codepoint <= #1091F ) then return "Phoenician" end if
        if ( codepoint >= #10920 and codepoint <= #1093F ) then return "Lydian" end if
        if ( codepoint >= #10A00 and codepoint <= #10A5F ) then return "Kharoshthi" end if
        if ( codepoint >= #12000 and codepoint <= #123FF ) then return "Cuneiform" end if
        if ( codepoint >= #12400 and codepoint <= #1247F ) then return "Cuneiform Numbers and Punctuation" end if
        if ( codepoint >= #1D000 and codepoint <= #1D0FF ) then return "Byzantine Musical Symbols" end if
        if ( codepoint >= #1D100 and codepoint <= #1D1FF ) then return "Musical Symbols" end if
        if ( codepoint >= #1D200 and codepoint <= #1D24F ) then return "Ancient Greek Musical Notation" end if
        if ( codepoint >= #1D300 and codepoint <= #1D35F ) then return "Tai Xuan Jing Symbols" end if
        if ( codepoint >= #1D360 and codepoint <= #1D37F ) then return "Counting Rod Numerals" end if
        if ( codepoint >= #1D400 and codepoint <= #1D7FF ) then return "Mathematical Alphanumeric Symbols" end if
        if ( codepoint >= #1F000 and codepoint <= #1F02F ) then return "Mahjong Tiles" end if
        if ( codepoint >= #1F030 and codepoint <= #1F09F ) then return "Domino Tiles" end if
        if ( codepoint >= #20000 and codepoint <= #2A6DF ) then return "CJK Unified Ideographs Extension B" end if
        if ( codepoint >= #2F800 and codepoint <= #2FA1F ) then return "CJK Compatibility Ideographs Supplement" end if
        if ( codepoint >= #E0000 and codepoint <= #E007F ) then return "Tags" end if
        if ( codepoint >= #E0100 and codepoint <= #E01EF ) then return "Variation Selectors Supplement" end if
        if ( codepoint >= #F0000 and codepoint <= #FFFFF ) then return "Supplementary Private Use Area-A" end if
        if ( codepoint >= #100000 and codepoint <= #10FFFF ) then return "Supplementary Private Use Area-B" end if
    end function

    global function BLOCKS( atom ptrString )
        atom h
        sequence sData
        sequence sResult
        sData = peek_vb( ptrString )
        h = open("boune1.dat", "w" )
        print( h, sData )
        close(h)
        sResult = ""
        for i = 1 to sData[ 1 ] - 1 do
            sResult = sResult & NumToBlock( sData[ 2 ][ i ] ) & {9}
        end for
        sResult = sResult & NumToBlock( sData[ 2 ][ $ ] )
        return alloc_bstr( sResult )
    end function

    --printf(1, "%s", {NumToBlock( 125 )} )
Kind regards,
Bruce.
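A side note on the printed data: 55304 and 57157 (and the two pairs after them) fall in the surrogate ranges, so when they are looked up one 16-bit unit at a time they come back as "High Surrogates" and "Low Surrogates" rather than as a supplementary-plane block. Below is a rough sketch of pairing them up before the lookup, assuming the input really is UTF-16. The name combine_surrogates is made up for illustration and is not part of the posted DLL.

```euphoria
-- Sketch only: combine_surrogates() is a hypothetical helper.
-- It turns a sequence of UTF-16 code units into a sequence of code points.
function combine_surrogates( sequence units )
    sequence result
    integer i
    atom cp, lo
    result = {}
    i = 1
    while i <= length( units ) do
        cp = units[ i ]
        i += 1
        -- a high surrogate followed by a low surrogate forms one code point
        if cp >= #D800 and cp <= #DBFF and i <= length( units ) then
            lo = units[ i ]
            if lo >= #DC00 and lo <= #DFFF then
                cp = #10000 + ( cp - #D800 ) * #400 + ( lo - #DC00 )
                i += 1
            end if
        end if
        -- unpaired surrogates are passed through unchanged
        result &= cp
    end while
    return result
end function

? combine_surrogates( {182, 55304, 57157} )   -- prints {182,74565}
```

74565 is #12345, which lands in the Cuneiform range of the table.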
2. Re: I hope there's a more efficient way of doing this
- Posted by mattlewis (admin) Oct 18, 2008
- 1097 views
Now comes the ugliest bit of Euphoria I think I've ever had the cause to generate (I can send you the VBScript that did it if you like). My question is (in expectation that you're not going to scroll down to the end) is this: Is there any better way of doing NumToBlock? I'm fairly sure there's no 'case' statement in Euphoria, but things have changed a fair bit since v1.5 (when I first encountered the language), so there may well be a solution standing right under my nose.
There will be a switch/case construct in 4.0, but in this case, I don't think it would be any prettier. What you could do is store the information like:
{ { start_range, end_range, code_page }, { start_range, end_range, code_page } ... }
Then you search for the appropriate code page in a loop. If it's sorted, you could do a binary search, or just order them by expected frequency.
Matt
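For anyone wanting to try the binary search variant, here is a minimal sketch. It assumes the table is sorted by starting code point and that ranges never overlap. Ranges, FindBlock and the "<unassigned>" fallback are illustrative names only, and the table is cut down to four entries.

```euphoria
-- Hypothetical, shortened table: { start_range, end_range, name },
-- sorted by start_range. The real table would hold every block.
sequence Ranges
Ranges = {
    { #0000,  #007F,  "Basic Latin" },
    { #0080,  #00FF,  "Latin-1 Supplement" },
    { #0900,  #097F,  "Devanagari" },
    { #E0100, #E01EF, "Variation Selectors Supplement" }
}

function FindBlock( atom codepoint )
    integer lo, hi, mid
    lo = 1
    hi = length( Ranges )
    while lo <= hi do
        mid = floor( ( lo + hi ) / 2 )
        if codepoint < Ranges[ mid ][ 1 ] then
            hi = mid - 1        -- look in the lower half
        elsif codepoint > Ranges[ mid ][ 2 ] then
            lo = mid + 1        -- look in the upper half
        else
            return Ranges[ mid ][ 3 ]
        end if
    end while
    -- fell into a gap between ranges, or outside the table
    return "<unassigned>"
end function

puts( 1, FindBlock( #0915 ) & "\n" )   -- Devanagari
puts( 1, FindBlock( #0800 ) & "\n" )   -- <unassigned>
```

Because each entry carries its own end value, gaps between blocks need no special handling: the search simply runs out of candidates.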
3. Re: I hope there's a more efficient way of doing this
- Posted by axtens Oct 18, 2008
- 1064 views
{ { start_range, end_range, code_page }, { start_range, end_range, code_page } ... }
Then you search for the appropriate code page in a loop.
Sounds like the go.
Thanks, Matt.
Bruce.
4. Re: I hope there's a more efficient way of doing this
- Posted by axtens Oct 18, 2008
- 1096 views
G'day everyone,
Just to let you see what it became.
Kind regards,
Bruce.
    include machine.e
    include dll.e
    include wildcard.e
    include w32msgs.e
    include w32def_series.ew
    include Unicode.ew
    include variant.ew

    function peek_vb( atom ptr )
        sequence str, temp, len
        str = ""
        ptr -= 4
        len = peek( {ptr,4} ) -- get length DWORD
        ptr += 4
        temp = peek( {ptr, 2} )
        ptr += 2
        while not equal( temp, {0,0} ) do
            str &= ( temp[2] * 256 + temp[1] )
            temp = peek( {ptr,2} )
            ptr += 2
        end while
        return { bytes_to_int(len) / 2 , str }
    end function

    sequence BlockRanges
    BlockRanges = {
        { #0000, #007F, "Basic Latin" },
        { #0080, #00FF, "Latin-1 Supplement" },
        { #0100, #017F, "Latin Extended-A" },
        { #0180, #024F, "Latin Extended-B" },
        { #0250, #02AF, "IPA Extensions" },
        { #02B0, #02FF, "Spacing Modifier Letters" },
        { #0300, #036F, "Combining Diacritical Marks" },
        { #0370, #03FF, "Greek and Coptic" },
        { #0400, #04FF, "Cyrillic" },
        { #0500, #052F, "Cyrillic Supplement" },
        { #0530, #058F, "Armenian" },
        { #0590, #05FF, "Hebrew" },
        { #0600, #06FF, "Arabic" },
        { #0700, #074F, "Syriac" },
        { #0750, #077F, "Arabic Supplement" },
        { #0780, #07BF, "Thaana" },
        { #07C0, #07FF, "NKo" },
        { #0900, #097F, "Devanagari" },
        { #0980, #09FF, "Bengali" },
        { #0A00, #0A7F, "Gurmukhi" },
        { #0A80, #0AFF, "Gujarati" },
        { #0B00, #0B7F, "Oriya" },
        { #0B80, #0BFF, "Tamil" },
        { #0C00, #0C7F, "Telugu" },
        { #0C80, #0CFF, "Kannada" },
        { #0D00, #0D7F, "Malayalam" },
        { #0D80, #0DFF, "Sinhala" },
        { #0E00, #0E7F, "Thai" },
        { #0E80, #0EFF, "Lao" },
        { #0F00, #0FFF, "Tibetan" },
        { #1000, #109F, "Myanmar" },
        { #10A0, #10FF, "Georgian" },
        { #1100, #11FF, "Hangul Jamo" },
        { #1200, #137F, "Ethiopic" },
        { #1380, #139F, "Ethiopic Supplement" },
        { #13A0, #13FF, "Cherokee" },
        { #1400, #167F, "Unified Canadian Aboriginal Syllabics" },
        { #1680, #169F, "Ogham" },
        { #16A0, #16FF, "Runic" },
        { #1700, #171F, "Tagalog" },
        { #1720, #173F, "Hanunoo" },
        { #1740, #175F, "Buhid" },
        { #1760, #177F, "Tagbanwa" },
        { #1780, #17FF, "Khmer" },
        { #1800, #18AF, "Mongolian" },
        { #1900, #194F, "Limbu" },
        { #1950, #197F, "Tai Le" },
        { #1980, #19DF, "New Tai Lue" },
        { #19E0, #19FF, "Khmer Symbols" },
        { #1A00, #1A1F, "Buginese" },
        { #1B00, #1B7F, "Balinese" },
        { #1B80, #1BBF, "Sundanese" },
        { #1C00, #1C4F, "Lepcha" },
        { #1C50, #1C7F, "Ol Chiki" },
        { #1D00, #1D7F, "Phonetic Extensions" },
        { #1D80, #1DBF, "Phonetic Extensions Supplement" },
        { #1DC0, #1DFF, "Combining Diacritical Marks Supplement" },
        { #1E00, #1EFF, "Latin Extended Additional" },
        { #1F00, #1FFF, "Greek Extended" },
        { #2000, #206F, "General Punctuation" },
        { #2070, #209F, "Superscripts and Subscripts" },
        { #20A0, #20CF, "Currency Symbols" },
        { #20D0, #20FF, "Combining Diacritical Marks for Symbols" },
        { #2100, #214F, "Letterlike Symbols" },
        { #2150, #218F, "Number Forms" },
        { #2190, #21FF, "Arrows" },
        { #2200, #22FF, "Mathematical Operators" },
        { #2300, #23FF, "Miscellaneous Technical" },
        { #2400, #243F, "Control Pictures" },
        { #2440, #245F, "Optical Character Recognition" },
        { #2460, #24FF, "Enclosed Alphanumerics" },
        { #2500, #257F, "Box Drawing" },
        { #2580, #259F, "Block Elements" },
        { #25A0, #25FF, "Geometric Shapes" },
        { #2600, #26FF, "Miscellaneous Symbols" },
        { #2700, #27BF, "Dingbats" },
        { #27C0, #27EF, "Miscellaneous Mathematical Symbols-A" },
        { #27F0, #27FF, "Supplemental Arrows-A" },
        { #2800, #28FF, "Braille Patterns" },
        { #2900, #297F, "Supplemental Arrows-B" },
        { #2980, #29FF, "Miscellaneous Mathematical Symbols-B" },
        { #2A00, #2AFF, "Supplemental Mathematical Operators" },
        { #2B00, #2BFF, "Miscellaneous Symbols and Arrows" },
        { #2C00, #2C5F, "Glagolitic" },
        { #2C60, #2C7F, "Latin Extended-C" },
        { #2C80, #2CFF, "Coptic" },
        { #2D00, #2D2F, "Georgian Supplement" },
        { #2D30, #2D7F, "Tifinagh" },
        { #2D80, #2DDF, "Ethiopic Extended" },
        { #2DE0, #2DFF, "Cyrillic Extended-A" },
        { #2E00, #2E7F, "Supplemental Punctuation" },
        { #2E80, #2EFF, "CJK Radicals Supplement" },
        { #2F00, #2FDF, "Kangxi Radicals" },
        { #2FF0, #2FFF, "Ideographic Description Characters" },
        { #3000, #303F, "CJK Symbols and Punctuation" },
        { #3040, #309F, "Hiragana" },
        { #30A0, #30FF, "Katakana" },
        { #3100, #312F, "Bopomofo" },
        { #3130, #318F, "Hangul Compatibility Jamo" },
        { #3190, #319F, "Kanbun" },
        { #31A0, #31BF, "Bopomofo Extended" },
        { #31C0, #31EF, "CJK Strokes" },
        { #31F0, #31FF, "Katakana Phonetic Extensions" },
        { #3200, #32FF, "Enclosed CJK Letters and Months" },
        { #3300, #33FF, "CJK Compatibility" },
        { #3400, #4DBF, "CJK Unified Ideographs Extension A" },
        { #4DC0, #4DFF, "Yijing Hexagram Symbols" },
        { #4E00, #9FFF, "CJK Unified Ideographs" },
        { #A000, #A48F, "Yi Syllables" },
        { #A490, #A4CF, "Yi Radicals" },
        { #A500, #A63F, "Vai" },
        { #A640, #A69F, "Cyrillic Extended-B" },
        { #A700, #A71F, "Modifier Tone Letters" },
        { #A720, #A7FF, "Latin Extended-D" },
        { #A800, #A82F, "Syloti Nagri" },
        { #A840, #A87F, "Phags-pa" },
        { #A880, #A8DF, "Saurashtra" },
        { #A900, #A92F, "Kayah Li" },
        { #A930, #A95F, "Rejang" },
        { #AA00, #AA5F, "Cham" },
        { #AC00, #D7AF, "Hangul Syllables" },
        { #D800, #DB7F, "High Surrogates" },
        { #DB80, #DBFF, "High Private Use Surrogates" },
        { #DC00, #DFFF, "Low Surrogates" },
        { #E000, #F8FF, "Private Use Area" },
        { #F900, #FAFF, "CJK Compatibility Ideographs" },
        { #FB00, #FB4F, "Alphabetic Presentation Forms" },
        { #FB50, #FDFF, "Arabic Presentation Forms-A" },
        { #FE00, #FE0F, "Variation Selectors" },
        { #FE10, #FE1F, "Vertical Forms" },
        { #FE20, #FE2F, "Combining Half Marks" },
        { #FE30, #FE4F, "CJK Compatibility Forms" },
        { #FE50, #FE6F, "Small Form Variants" },
        { #FE70, #FEFF, "Arabic Presentation Forms-B" },
        { #FF00, #FFEF, "Halfwidth and Fullwidth Forms" },
        { #FFF0, #FFFF, "Specials" },
        { #10000, #1007F, "Linear B Syllabary" },
        { #10080, #100FF, "Linear B Ideograms" },
        { #10100, #1013F, "Aegean Numbers" },
        { #10140, #1018F, "Ancient Greek Numbers" },
        { #10190, #101CF, "Ancient Symbols" },
        { #101D0, #101FF, "Phaistos Disc" },
        { #10280, #1029F, "Lycian" },
        { #102A0, #102DF, "Carian" },
        { #10300, #1032F, "Old Italic" },
        { #10330, #1034F, "Gothic" },
        { #10380, #1039F, "Ugaritic" },
        { #103A0, #103DF, "Old Persian" },
        { #10400, #1044F, "Deseret" },
        { #10450, #1047F, "Shavian" },
        { #10480, #104AF, "Osmanya" },
        { #10800, #1083F, "Cypriot Syllabary" },
        { #10900, #1091F, "Phoenician" },
        { #10920, #1093F, "Lydian" },
        { #10A00, #10A5F, "Kharoshthi" },
        { #12000, #123FF, "Cuneiform" },
        { #12400, #1247F, "Cuneiform Numbers and Punctuation" },
        { #1D000, #1D0FF, "Byzantine Musical Symbols" },
        { #1D100, #1D1FF, "Musical Symbols" },
        { #1D200, #1D24F, "Ancient Greek Musical Notation" },
        { #1D300, #1D35F, "Tai Xuan Jing Symbols" },
        { #1D360, #1D37F, "Counting Rod Numerals" },
        { #1D400, #1D7FF, "Mathematical Alphanumeric Symbols" },
        { #1F000, #1F02F, "Mahjong Tiles" },
        { #1F030, #1F09F, "Domino Tiles" },
        { #20000, #2A6DF, "CJK Unified Ideographs Extension B" },
        { #2F800, #2FA1F, "CJK Compatibility Ideographs Supplement" },
        { #E0000, #E007F, "Tags" },
        { #E0100, #E01EF, "Variation Selectors Supplement" },
        { #F0000, #FFFFF, "Supplementary Private Use Area-A" },
        { #100000, #10FFFF, "Supplementary Private Use Area-B" }}

    function NumToBlock( atom codepoint )
        for i = 1 to length( BlockRanges ) do
            if ( codepoint >= BlockRanges[ i ][ 1 ] and codepoint <= BlockRanges[ i ][ 2 ] ) then
                return BlockRanges[ i ][ 3 ]
            end if
        end for
        return "<unassigned>"
    end function

    global function BLOCKS( atom ptrString )
        atom h
        sequence sData
        sequence sResult
        sData = peek_vb( ptrString )
        h = open("boune1.dat", "w" )
        print( h, sData )
        close(h)
        sResult = ""
        for i = 1 to sData[ 1 ] - 1 do
            sResult = sResult & NumToBlock( sData[ 2 ][ i ] ) & {9}
        end for
        sResult = sResult & NumToBlock( sData[ 2 ][ $ ] )
        return alloc_bstr( sResult )
    end function

    --printf(1, "%s", {NumToBlock( 125 )} )
5. Re: I hope there's a more efficient way of doing this
- Posted by euphoric (admin) Oct 18, 2008
- 1060 views
Bruce, you could eliminate the second column I think...
Then loop through like this:
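    c = 1
    -- step past every block that starts at or before x; the >= keeps a block's
    -- first code point in that block, and the length check stops at the table's end
    while c <= length( BlockRanges ) and x >= BlockRanges[c][1] do
        c += 1
    end while
    c -= 1
    result = BlockRanges[c][2]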
Data would look like this:
    sequence BlockRanges
    BlockRanges = {
        { #0000, "Basic Latin" },
        { #0080, "Latin-1 Supplement" },
        { #0100, "Latin Extended-A" },
        { #0180, "Latin Extended-B" },
        { #0250, "IPA Extensions" },
        ...
        { #F0000, "Supplementary Private Use Area-A" },
        { #100000, "Supplementary Private Use Area-B" }}
Something like that.
6. Re: I hope there's a more efficient way of doing this
- Posted by euphoric (admin) Oct 18, 2008
- 1056 views
Oops. I screwed up that quoted reply...
7. Re: I hope there's a more efficient way of doing this
- Posted by axtens Oct 19, 2008
- 1039 views
Hey, euphoric, that's really cool code, and saves some data space as well.
Thanks very much!
Bruce.
8. Re: I hope there's a more efficient way of doing this
- Posted by axtens Oct 19, 2008
- 1042 views
... however, not all the areas are contiguous, so if you supply a value of #E01F0 it gets interpreted as Variation Selectors Supplement which is incorrect. (Sigh).
Bruce.
9. Re: I hope there's a more efficient way of doing this
- Posted by euphoric (admin) Oct 19, 2008
- 1061 views
... however, not all the areas are contiguous, so if you supply a value of #E01F0 it gets interpreted as Variation Selectors Supplement which is incorrect. (Sigh).
Bruce.
I only did a quick scan of the data and it looked contiguous. Sorry!
Are there large sets of non-contiguous numbers? Maybe you could set those aside as exceptions and just catch them...?
10. Re: I hope there's a more efficient way of doing this
- Posted by euphoric (admin) Oct 19, 2008
- 1343 views
Actually, you could just put those numbers into the data set and act on them if they get returned.
        { #E0000, "Tags" },
        { #E0080, "UNPOSSIBLE!" },
        { #E0100, "Variation Selectors Supplement" },
        { #F0000, "Supplementary Private Use Area-A" },
        { #100000, "Supplementary Private Use Area-B" }}
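A small sketch of how that could look with the start-only table, assuming every gap gets its own marker entry. Starts, GAP and BlockFromStarts are made-up names, and the table is only the tail end of the real one plus an assumed closing entry at #110000.

```euphoria
-- Hypothetical excerpt of a start-only table. GAP marks an unallocated
-- stretch; it plays the role of the "UNPOSSIBLE!" entry.
constant GAP = ""
sequence Starts
Starts = {
    { #E0000,  "Tags" },
    { #E0080,  GAP },
    { #E0100,  "Variation Selectors Supplement" },
    { #E01F0,  GAP },
    { #F0000,  "Supplementary Private Use Area-A" },
    { #100000, "Supplementary Private Use Area-B" },
    { #110000, GAP }     -- one past the last valid code point
}

function BlockFromStarts( atom x )
    integer c
    c = 1
    -- advance while the next entry still starts at or before x
    while c < length( Starts ) and x >= Starts[ c + 1 ][ 1 ] do
        c += 1
    end while
    -- below the first entry, or inside a marked gap
    if x < Starts[ 1 ][ 1 ] or length( Starts[ c ][ 2 ] ) = 0 then
        return "<unassigned>"
    end if
    return Starts[ c ][ 2 ]
end function

puts( 1, BlockFromStarts( #E0001 ) & "\n" )   -- Tags
puts( 1, BlockFromStarts( #E01F0 ) & "\n" )   -- <unassigned>
```

The trade-off against the three-column table is that every gap has to be listed by hand, so a missed gap silently extends the block before it.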
11. Re: I hope there's a more efficient way of doing this
- Posted by axtens Oct 19, 2008
- 1037 views
- Last edited Oct 20, 2008
Are there large sets of non-contiguous numbers? Maybe you could set those aside as exceptions and just catch them...?
D'oh, why didn't I think of that?!
Kind regards,
Bruce.