1. I hope there's a more efficient way of doing this

G'day everyone,

Now that I've given up on returning an array of BSTRs (for now at least), it's time to get on to the project itself. First off the rank is a DLL function which lists the Unicode block that each character in the input stream falls into.

The VB6 code calling the DLL, which uses a typelib to avoid Declare statements, looks like this:

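Dim cfg As Std.Config 
Sub Main() 
    Dim sText As String   ' was undeclared; fails under Option Explicit 
    Set cfg = New Config 
    cfg.Load "test.cfg" 
    sText = cfg.RecallElse("test", "dog") 
    Dim s As String 
    s = BLOCKS(sText)     ' the DLL returns the block names as one tab-delimited string 
    Dim a() As String 
    a = Split(s, vbTab) 
    Dim b 
    For Each b In a 
        Debug.Print b 
    Next 
End Sub 
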
The cfg call is to a home-grown COM DLL which provides access to .cfg files (ANSI or Unicode). In test.cfg is stored the following, which probably won't reproduce very well here, so beneath it are the results of a Euphoria print of the data parsed from it.
test=¶ᇍⅿ毅毅訜訝𒍅񄑄򙦙 
    {13,{182,4557,8575,27589,27589,35356,35357,55304,57157,55505,56388,55846,56729}} 
Now comes the ugliest bit of Euphoria I think I've ever had cause to generate (I can send you the VBScript that did it if you like). My question (in the expectation that you're not going to scroll down to the end) is this: is there any better way of doing NumToBlock? I'm fairly sure there's no 'case' statement in Euphoria, but things have changed a fair bit since v1.5 (when I first encountered the language), so there may well be a solution standing right under my nose.

include machine.e 
include dll.e 
include wildcard.e 
include w32msgs.e 
include w32def_series.ew 
include Unicode.ew 
include variant.ew 
   
function peek_vb( atom ptr ) 
    sequence str, temp, len 
     
    str = "" 
     
    ptr -= 4 
    len = peek( {ptr,4} ) -- get length DWORD 
    ptr += 4 
     
    temp = peek( {ptr, 2} ) 
    ptr += 2 
    while not equal( temp, {0,0} ) do 
        str &= ( temp[2] * 256 + temp[1] ) 
        temp = peek( {ptr,2} ) 
        ptr += 2 
    end while 
    return { bytes_to_int(len) / 2 , str } 
end function 
 
function NumToBlock( atom codepoint ) 
	if ( codepoint >= #0000 and codepoint <= #007F ) then 
		return "Basic Latin" 
	end if 
	if ( codepoint >= #0080 and codepoint <= #00FF ) then 
		return "Latin-1 Supplement" 
	end if 
	if ( codepoint >= #0100 and codepoint <= #017F ) then 
		return "Latin Extended-A" 
	end if 
	if ( codepoint >= #0180 and codepoint <= #024F ) then 
		return "Latin Extended-B" 
	end if 
	if ( codepoint >= #0250 and codepoint <= #02AF ) then 
		return "IPA Extensions" 
	end if 
	if ( codepoint >= #02B0 and codepoint <= #02FF ) then 
		return "Spacing Modifier Letters" 
	end if 
	if ( codepoint >= #0300 and codepoint <= #036F ) then 
		return "Combining Diacritical Marks" 
	end if 
	if ( codepoint >= #0370 and codepoint <= #03FF ) then 
		return "Greek and Coptic" 
	end if 
	if ( codepoint >= #0400 and codepoint <= #04FF ) then 
		return "Cyrillic" 
	end if 
	if ( codepoint >= #0500 and codepoint <= #052F ) then 
		return "Cyrillic Supplement" 
	end if 
	if ( codepoint >= #0530 and codepoint <= #058F ) then 
		return "Armenian" 
	end if 
	if ( codepoint >= #0590 and codepoint <= #05FF ) then 
		return "Hebrew" 
	end if 
	if ( codepoint >= #0600 and codepoint <= #06FF ) then 
		return "Arabic" 
	end if 
	if ( codepoint >= #0700 and codepoint <= #074F ) then 
		return "Syriac" 
	end if 
	if ( codepoint >= #0750 and codepoint <= #077F ) then 
		return "Arabic Supplement" 
	end if 
	if ( codepoint >= #0780 and codepoint <= #07BF ) then 
		return "Thaana" 
	end if 
	if ( codepoint >= #07C0 and codepoint <= #07FF ) then 
		return "NKo" 
	end if 
	if ( codepoint >= #0900 and codepoint <= #097F ) then 
		return "Devanagari" 
	end if 
	if ( codepoint >= #0980 and codepoint <= #09FF ) then 
		return "Bengali" 
	end if 
	if ( codepoint >= #0A00 and codepoint <= #0A7F ) then 
		return "Gurmukhi" 
	end if 
	if ( codepoint >= #0A80 and codepoint <= #0AFF ) then 
		return "Gujarati" 
	end if 
	if ( codepoint >= #0B00 and codepoint <= #0B7F ) then 
		return "Oriya" 
	end if 
	if ( codepoint >= #0B80 and codepoint <= #0BFF ) then 
		return "Tamil" 
	end if 
	if ( codepoint >= #0C00 and codepoint <= #0C7F ) then 
		return "Telugu" 
	end if 
	if ( codepoint >= #0C80 and codepoint <= #0CFF ) then 
		return "Kannada" 
	end if 
	if ( codepoint >= #0D00 and codepoint <= #0D7F ) then 
		return "Malayalam" 
	end if 
	if ( codepoint >= #0D80 and codepoint <= #0DFF ) then 
		return "Sinhala" 
	end if 
	if ( codepoint >= #0E00 and codepoint <= #0E7F ) then 
		return "Thai" 
	end if 
	if ( codepoint >= #0E80 and codepoint <= #0EFF ) then 
		return "Lao" 
	end if 
	if ( codepoint >= #0F00 and codepoint <= #0FFF ) then 
		return "Tibetan" 
	end if 
	if ( codepoint >= #1000 and codepoint <= #109F ) then 
		return "Myanmar" 
	end if 
	if ( codepoint >= #10A0 and codepoint <= #10FF ) then 
		return "Georgian" 
	end if 
	if ( codepoint >= #1100 and codepoint <= #11FF ) then 
		return "Hangul Jamo" 
	end if 
	if ( codepoint >= #1200 and codepoint <= #137F ) then 
		return "Ethiopic" 
	end if 
	if ( codepoint >= #1380 and codepoint <= #139F ) then 
		return "Ethiopic Supplement" 
	end if 
	if ( codepoint >= #13A0 and codepoint <= #13FF ) then 
		return "Cherokee" 
	end if 
	if ( codepoint >= #1400 and codepoint <= #167F ) then 
		return "Unified Canadian Aboriginal Syllabics" 
	end if 
	if ( codepoint >= #1680 and codepoint <= #169F ) then 
		return "Ogham" 
	end if 
	if ( codepoint >= #16A0 and codepoint <= #16FF ) then 
		return "Runic" 
	end if 
	if ( codepoint >= #1700 and codepoint <= #171F ) then 
		return "Tagalog" 
	end if 
	if ( codepoint >= #1720 and codepoint <= #173F ) then 
		return "Hanunoo" 
	end if 
	if ( codepoint >= #1740 and codepoint <= #175F ) then 
		return "Buhid" 
	end if 
	if ( codepoint >= #1760 and codepoint <= #177F ) then 
		return "Tagbanwa" 
	end if 
	if ( codepoint >= #1780 and codepoint <= #17FF ) then 
		return "Khmer" 
	end if 
	if ( codepoint >= #1800 and codepoint <= #18AF ) then 
		return "Mongolian" 
	end if 
	if ( codepoint >= #1900 and codepoint <= #194F ) then 
		return "Limbu" 
	end if 
	if ( codepoint >= #1950 and codepoint <= #197F ) then 
		return "Tai Le" 
	end if 
	if ( codepoint >= #1980 and codepoint <= #19DF ) then 
		return "New Tai Lue" 
	end if 
	if ( codepoint >= #19E0 and codepoint <= #19FF ) then 
		return "Khmer Symbols" 
	end if 
	if ( codepoint >= #1A00 and codepoint <= #1A1F ) then 
		return "Buginese" 
	end if 
	if ( codepoint >= #1B00 and codepoint <= #1B7F ) then 
		return "Balinese" 
	end if 
	if ( codepoint >= #1B80 and codepoint <= #1BBF ) then 
		return "Sundanese" 
	end if 
	if ( codepoint >= #1C00 and codepoint <= #1C4F ) then 
		return "Lepcha" 
	end if 
	if ( codepoint >= #1C50 and codepoint <= #1C7F ) then 
		return "Ol Chiki" 
	end if 
	if ( codepoint >= #1D00 and codepoint <= #1D7F ) then 
		return "Phonetic Extensions" 
	end if 
	if ( codepoint >= #1D80 and codepoint <= #1DBF ) then 
		return "Phonetic Extensions Supplement" 
	end if 
	if ( codepoint >= #1DC0 and codepoint <= #1DFF ) then 
		return "Combining Diacritical Marks Supplement" 
	end if 
	if ( codepoint >= #1E00 and codepoint <= #1EFF ) then 
		return "Latin Extended Additional" 
	end if 
	if ( codepoint >= #1F00 and codepoint <= #1FFF ) then 
		return "Greek Extended" 
	end if 
	if ( codepoint >= #2000 and codepoint <= #206F ) then 
		return "General Punctuation" 
	end if 
	if ( codepoint >= #2070 and codepoint <= #209F ) then 
		return "Superscripts and Subscripts" 
	end if 
	if ( codepoint >= #20A0 and codepoint <= #20CF ) then 
		return "Currency Symbols" 
	end if 
	if ( codepoint >= #20D0 and codepoint <= #20FF ) then 
		return "Combining Diacritical Marks for Symbols" 
	end if 
	if ( codepoint >= #2100 and codepoint <= #214F ) then 
		return "Letterlike Symbols" 
	end if 
	if ( codepoint >= #2150 and codepoint <= #218F ) then 
		return "Number Forms" 
	end if 
	if ( codepoint >= #2190 and codepoint <= #21FF ) then 
		return "Arrows" 
	end if 
	if ( codepoint >= #2200 and codepoint <= #22FF ) then 
		return "Mathematical Operators" 
	end if 
	if ( codepoint >= #2300 and codepoint <= #23FF ) then 
		return "Miscellaneous Technical" 
	end if 
	if ( codepoint >= #2400 and codepoint <= #243F ) then 
		return "Control Pictures" 
	end if 
	if ( codepoint >= #2440 and codepoint <= #245F ) then 
		return "Optical Character Recognition" 
	end if 
	if ( codepoint >= #2460 and codepoint <= #24FF ) then 
		return "Enclosed Alphanumerics" 
	end if 
	if ( codepoint >= #2500 and codepoint <= #257F ) then 
		return "Box Drawing" 
	end if 
	if ( codepoint >= #2580 and codepoint <= #259F ) then 
		return "Block Elements" 
	end if 
	if ( codepoint >= #25A0 and codepoint <= #25FF ) then 
		return "Geometric Shapes" 
	end if 
	if ( codepoint >= #2600 and codepoint <= #26FF ) then 
		return "Miscellaneous Symbols" 
	end if 
	if ( codepoint >= #2700 and codepoint <= #27BF ) then 
		return "Dingbats" 
	end if 
	if ( codepoint >= #27C0 and codepoint <= #27EF ) then 
		return "Miscellaneous Mathematical Symbols-A" 
	end if 
	if ( codepoint >= #27F0 and codepoint <= #27FF ) then 
		return "Supplemental Arrows-A" 
	end if 
	if ( codepoint >= #2800 and codepoint <= #28FF ) then 
		return "Braille Patterns" 
	end if 
	if ( codepoint >= #2900 and codepoint <= #297F ) then 
		return "Supplemental Arrows-B" 
	end if 
	if ( codepoint >= #2980 and codepoint <= #29FF ) then 
		return "Miscellaneous Mathematical Symbols-B" 
	end if 
	if ( codepoint >= #2A00 and codepoint <= #2AFF ) then 
		return "Supplemental Mathematical Operators" 
	end if 
	if ( codepoint >= #2B00 and codepoint <= #2BFF ) then 
		return "Miscellaneous Symbols and Arrows" 
	end if 
	if ( codepoint >= #2C00 and codepoint <= #2C5F ) then 
		return "Glagolitic" 
	end if 
	if ( codepoint >= #2C60 and codepoint <= #2C7F ) then 
		return "Latin Extended-C" 
	end if 
	if ( codepoint >= #2C80 and codepoint <= #2CFF ) then 
		return "Coptic" 
	end if 
	if ( codepoint >= #2D00 and codepoint <= #2D2F ) then 
		return "Georgian Supplement" 
	end if 
	if ( codepoint >= #2D30 and codepoint <= #2D7F ) then 
		return "Tifinagh" 
	end if 
	if ( codepoint >= #2D80 and codepoint <= #2DDF ) then 
		return "Ethiopic Extended" 
	end if 
	if ( codepoint >= #2DE0 and codepoint <= #2DFF ) then 
		return "Cyrillic Extended-A" 
	end if 
	if ( codepoint >= #2E00 and codepoint <= #2E7F ) then 
		return "Supplemental Punctuation" 
	end if 
	if ( codepoint >= #2E80 and codepoint <= #2EFF ) then 
		return "CJK Radicals Supplement" 
	end if 
	if ( codepoint >= #2F00 and codepoint <= #2FDF ) then 
		return "Kangxi Radicals" 
	end if 
	if ( codepoint >= #2FF0 and codepoint <= #2FFF ) then 
		return "Ideographic Description Characters" 
	end if 
	if ( codepoint >= #3000 and codepoint <= #303F ) then 
		return "CJK Symbols and Punctuation" 
	end if 
	if ( codepoint >= #3040 and codepoint <= #309F ) then 
		return "Hiragana" 
	end if 
	if ( codepoint >= #30A0 and codepoint <= #30FF ) then 
		return "Katakana" 
	end if 
	if ( codepoint >= #3100 and codepoint <= #312F ) then 
		return "Bopomofo" 
	end if 
	if ( codepoint >= #3130 and codepoint <= #318F ) then 
		return "Hangul Compatibility Jamo" 
	end if 
	if ( codepoint >= #3190 and codepoint <= #319F ) then 
		return "Kanbun" 
	end if 
	if ( codepoint >= #31A0 and codepoint <= #31BF ) then 
		return "Bopomofo Extended" 
	end if 
	if ( codepoint >= #31C0 and codepoint <= #31EF ) then 
		return "CJK Strokes" 
	end if 
	if ( codepoint >= #31F0 and codepoint <= #31FF ) then 
		return "Katakana Phonetic Extensions" 
	end if 
	if ( codepoint >= #3200 and codepoint <= #32FF ) then 
		return "Enclosed CJK Letters and Months" 
	end if 
	if ( codepoint >= #3300 and codepoint <= #33FF ) then 
		return "CJK Compatibility" 
	end if 
	if ( codepoint >= #3400 and codepoint <= #4DBF ) then 
		return "CJK Unified Ideographs Extension A" 
	end if 
	if ( codepoint >= #4DC0 and codepoint <= #4DFF ) then 
		return "Yijing Hexagram Symbols" 
	end if 
	if ( codepoint >= #4E00 and codepoint <= #9FFF ) then 
		return "CJK Unified Ideographs" 
	end if 
	if ( codepoint >= #A000 and codepoint <= #A48F ) then 
		return "Yi Syllables" 
	end if 
	if ( codepoint >= #A490 and codepoint <= #A4CF ) then 
		return "Yi Radicals" 
	end if 
	if ( codepoint >= #A500 and codepoint <= #A63F ) then 
		return "Vai" 
	end if 
	if ( codepoint >= #A640 and codepoint <= #A69F ) then 
		return "Cyrillic Extended-B" 
	end if 
	if ( codepoint >= #A700 and codepoint <= #A71F ) then 
		return "Modifier Tone Letters" 
	end if 
	if ( codepoint >= #A720 and codepoint <= #A7FF ) then 
		return "Latin Extended-D" 
	end if 
	if ( codepoint >= #A800 and codepoint <= #A82F ) then 
		return "Syloti Nagri" 
	end if 
	if ( codepoint >= #A840 and codepoint <= #A87F ) then 
		return "Phags-pa" 
	end if 
	if ( codepoint >= #A880 and codepoint <= #A8DF ) then 
		return "Saurashtra" 
	end if 
	if ( codepoint >= #A900 and codepoint <= #A92F ) then 
		return "Kayah Li" 
	end if 
	if ( codepoint >= #A930 and codepoint <= #A95F ) then 
		return "Rejang" 
	end if 
	if ( codepoint >= #AA00 and codepoint <= #AA5F ) then 
		return "Cham" 
	end if 
	if ( codepoint >= #AC00 and codepoint <= #D7AF ) then 
		return "Hangul Syllables" 
	end if 
	if ( codepoint >= #D800 and codepoint <= #DB7F ) then 
		return "High Surrogates" 
	end if 
	if ( codepoint >= #DB80 and codepoint <= #DBFF ) then 
		return "High Private Use Surrogates" 
	end if 
	if ( codepoint >= #DC00 and codepoint <= #DFFF ) then 
		return "Low Surrogates" 
	end if 
	if ( codepoint >= #E000 and codepoint <= #F8FF ) then 
		return "Private Use Area" 
	end if 
	if ( codepoint >= #F900 and codepoint <= #FAFF ) then 
		return "CJK Compatibility Ideographs" 
	end if 
	if ( codepoint >= #FB00 and codepoint <= #FB4F ) then 
		return "Alphabetic Presentation Forms" 
	end if 
	if ( codepoint >= #FB50 and codepoint <= #FDFF ) then 
		return "Arabic Presentation Forms-A" 
	end if 
	if ( codepoint >= #FE00 and codepoint <= #FE0F ) then 
		return "Variation Selectors" 
	end if 
	if ( codepoint >= #FE10 and codepoint <= #FE1F ) then 
		return "Vertical Forms" 
	end if 
	if ( codepoint >= #FE20 and codepoint <= #FE2F ) then 
		return "Combining Half Marks" 
	end if 
	if ( codepoint >= #FE30 and codepoint <= #FE4F ) then 
		return "CJK Compatibility Forms" 
	end if 
	if ( codepoint >= #FE50 and codepoint <= #FE6F ) then 
		return "Small Form Variants" 
	end if 
	if ( codepoint >= #FE70 and codepoint <= #FEFF ) then 
		return "Arabic Presentation Forms-B" 
	end if 
	if ( codepoint >= #FF00 and codepoint <= #FFEF ) then 
		return "Halfwidth and Fullwidth Forms" 
	end if 
	if ( codepoint >= #FFF0 and codepoint <= #FFFF ) then 
		return "Specials" 
	end if 
	if ( codepoint >= #10000 and codepoint <= #1007F ) then 
		return "Linear B Syllabary" 
	end if 
	if ( codepoint >= #10080 and codepoint <= #100FF ) then 
		return "Linear B Ideograms" 
	end if 
	if ( codepoint >= #10100 and codepoint <= #1013F ) then 
		return "Aegean Numbers" 
	end if 
	if ( codepoint >= #10140 and codepoint <= #1018F ) then 
		return "Ancient Greek Numbers" 
	end if 
	if ( codepoint >= #10190 and codepoint <= #101CF ) then 
		return "Ancient Symbols" 
	end if 
	if ( codepoint >= #101D0 and codepoint <= #101FF ) then 
		return "Phaistos Disc" 
	end if 
	if ( codepoint >= #10280 and codepoint <= #1029F ) then 
		return "Lycian" 
	end if 
	if ( codepoint >= #102A0 and codepoint <= #102DF ) then 
		return "Carian" 
	end if 
	if ( codepoint >= #10300 and codepoint <= #1032F ) then 
		return "Old Italic" 
	end if 
	if ( codepoint >= #10330 and codepoint <= #1034F ) then 
		return "Gothic" 
	end if 
	if ( codepoint >= #10380 and codepoint <= #1039F ) then 
		return "Ugaritic" 
	end if 
	if ( codepoint >= #103A0 and codepoint <= #103DF ) then 
		return "Old Persian" 
	end if 
	if ( codepoint >= #10400 and codepoint <= #1044F ) then 
		return "Deseret" 
	end if 
	if ( codepoint >= #10450 and codepoint <= #1047F ) then 
		return "Shavian" 
	end if 
	if ( codepoint >= #10480 and codepoint <= #104AF ) then 
		return "Osmanya" 
	end if 
	if ( codepoint >= #10800 and codepoint <= #1083F ) then 
		return "Cypriot Syllabary" 
	end if 
	if ( codepoint >= #10900 and codepoint <= #1091F ) then 
		return "Phoenician" 
	end if 
	if ( codepoint >= #10920 and codepoint <= #1093F ) then 
		return "Lydian" 
	end if 
	if ( codepoint >= #10A00 and codepoint <= #10A5F ) then 
		return "Kharoshthi" 
	end if 
	if ( codepoint >= #12000 and codepoint <= #123FF ) then 
		return "Cuneiform" 
	end if 
	if ( codepoint >= #12400 and codepoint <= #1247F ) then 
		return "Cuneiform Numbers and Punctuation" 
	end if 
	if ( codepoint >= #1D000 and codepoint <= #1D0FF ) then 
		return "Byzantine Musical Symbols" 
	end if 
	if ( codepoint >= #1D100 and codepoint <= #1D1FF ) then 
		return "Musical Symbols" 
	end if 
	if ( codepoint >= #1D200 and codepoint <= #1D24F ) then 
		return "Ancient Greek Musical Notation" 
	end if 
	if ( codepoint >= #1D300 and codepoint <= #1D35F ) then 
		return "Tai Xuan Jing Symbols" 
	end if 
	if ( codepoint >= #1D360 and codepoint <= #1D37F ) then 
		return "Counting Rod Numerals" 
	end if 
	if ( codepoint >= #1D400 and codepoint <= #1D7FF ) then 
		return "Mathematical Alphanumeric Symbols" 
	end if 
	if ( codepoint >= #1F000 and codepoint <= #1F02F ) then 
		return "Mahjong Tiles" 
	end if 
	if ( codepoint >= #1F030 and codepoint <= #1F09F ) then 
		return "Domino Tiles" 
	end if 
	if ( codepoint >= #20000 and codepoint <= #2A6DF ) then 
		return "CJK Unified Ideographs Extension B" 
	end if 
	if ( codepoint >= #2F800 and codepoint <= #2FA1F ) then 
		return "CJK Compatibility Ideographs Supplement" 
	end if 
	if ( codepoint >= #E0000 and codepoint <= #E007F ) then 
		return "Tags" 
	end if 
	if ( codepoint >= #E0100 and codepoint <= #E01EF ) then 
		return "Variation Selectors Supplement" 
	end if 
	if ( codepoint >= #F0000 and codepoint <= #FFFFF ) then 
		return "Supplementary Private Use Area-A" 
	end if 
	if ( codepoint >= #100000 and codepoint <= #10FFFF ) then 
		return "Supplementary Private Use Area-B" 
	end if 
	return "<unassigned>" -- no block matched; a function must not fall off the end 
end function 
 
global function BLOCKS( atom ptrString ) 
  atom h 
  sequence sData 
  sequence sResult 
   
  sData = peek_vb( ptrString )  
 
  h = open("boune1.dat", "w" ) 
  print( h, sData ) 
  close(h) 
   
  sResult = "" 
  -- Join the block names with tabs; an empty input string gives an empty result 
  for i = 1 to length( sData[ 2 ] ) do 
    if i > 1 then 
      sResult = sResult & {9} 
    end if 
    sResult = sResult & NumToBlock( sData[ 2 ][ i ] ) 
  end for 
   
  return alloc_bstr( sResult ) 
end function 
 
--printf(1, "%s", {NumToBlock( 125 )} ) 
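
A side note on the data above: peek_vb returns UTF-16 code units, so a pair such as 55304, 57157 from the dump would reach NumToBlock as two separate values and come back as "High Surrogates" and "Low Surrogates". If the aim is to classify whole characters, the pairs would need combining first. Here is a minimal sketch of that step, assuming a hypothetical helper called CombineSurrogates that is not part of the posted DLL:

```euphoria
-- Hypothetical helper (not in the original code): merge UTF-16 surrogate
-- pairs into single code points; lone surrogates pass through unchanged.
function CombineSurrogates( sequence units )
    sequence points
    integer i
    atom u
    points = {}
    i = 1
    while i <= length( units ) do
        u = units[ i ]
        if u >= #D800 and u <= #DBFF and i < length( units ) then
            if units[ i + 1 ] >= #DC00 and units[ i + 1 ] <= #DFFF then
                -- high surrogate followed by low surrogate: combine them
                u = #10000 + ( u - #D800 ) * #400 + ( units[ i + 1 ] - #DC00 )
                i += 1 -- skip the low surrogate just consumed
            end if
        end if
        points &= u
        i += 1
    end while
    return points
end function

-- 55304 and 57157 appear as a pair in the dump above
print( 1, CombineSurrogates( { 182, 55304, 57157 } ) )
puts( 1, "\n" )
-- prints {182,74565}
```

74565 is #12345, which falls inside the Cuneiform range of the table, so NumToBlock would then report the block of the character itself rather than the two surrogate blocks.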

Kind regards,

Bruce.


2. Re: I hope there's a more efficient way of doing this

axtens said...

Now comes the ugliest bit of Euphoria I think I've ever had cause to generate (I can send you the VBScript that did it if you like). My question (in the expectation that you're not going to scroll down to the end) is this: is there any better way of doing NumToBlock? I'm fairly sure there's no 'case' statement in Euphoria, but things have changed a fair bit since v1.5 (when I first encountered the language), so there may well be a solution standing right under my nose.

There will be a switch/case construct in 4.0, but in this case, I don't think it would be any prettier. What you could do is store the information like:

 { 
  { start_range, end_range, code_page }, 
  { start_range, end_range, code_page } 
  ... 
 } 

Then you search for the appropriate code page in a loop. If it's sorted, you could do a binary search, or just order them by expected frequency.
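
The binary-search option could look something like the following. It is only a sketch under assumptions: the table is cut down to three rows for brevity and must be sorted by start value, and the names Ranges and FindBlock, along with the "<unassigned>" fallback, are placeholders rather than anything from the posted code.

```euphoria
-- Hypothetical, shortened table: { start, end, name }, sorted by start.
sequence Ranges
Ranges = {
    { #0000, #007F, "Basic Latin" },
    { #0080, #00FF, "Latin-1 Supplement" },
    { #0900, #097F, "Devanagari" }
}

-- Binary search over the sorted ranges.
-- Returns "<unassigned>" when the code point falls in a gap between ranges.
function FindBlock( atom codepoint )
    integer lo, hi, mid
    lo = 1
    hi = length( Ranges )
    while lo <= hi do
        mid = floor( ( lo + hi ) / 2 )
        if codepoint < Ranges[ mid ][ 1 ] then
            hi = mid - 1        -- target lies before this range
        elsif codepoint > Ranges[ mid ][ 2 ] then
            lo = mid + 1        -- target lies after this range
        else
            return Ranges[ mid ][ 3 ]
        end if
    end while
    return "<unassigned>"
end function

printf( 1, "%s\n", { FindBlock( #00B6 ) } )  -- Latin-1 Supplement
printf( 1, "%s\n", { FindBlock( #0500 ) } )  -- <unassigned> (gap in this short table)
```

Keeping the end value in each row is what lets the search report gaps correctly; with roughly 150 rows a lookup needs at most eight comparisons of this kind.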

Matt


3. Re: I hope there's a more efficient way of doing this

mattlewis said...
 { 
  { start_range, end_range, code_page }, 
  { start_range, end_range, code_page } 
  ... 
 } 

Then you search for the appropriate code page in a loop.

Sounds like the go.

Thanks, Matt.

Bruce.


4. Re: I hope there's a more efficient way of doing this

G'day everyone,

Just to let you see what it became.

Kind regards,

Bruce.

include machine.e 
include dll.e 
include wildcard.e 
include w32msgs.e 
include w32def_series.ew 
include Unicode.ew 
include variant.ew 
   
function peek_vb( atom ptr ) 
    sequence str, temp, len 
     
    str = "" 
     
    ptr -= 4 
    len = peek( {ptr,4} ) -- get length DWORD 
    ptr += 4 
     
    temp = peek( {ptr, 2} ) 
    ptr += 2 
    while not equal( temp, {0,0} ) do 
        str &= ( temp[2] * 256 + temp[1] ) 
        temp = peek( {ptr,2} ) 
        ptr += 2 
    end while 
    return { bytes_to_int(len) / 2 , str } 
end function 
 
sequence BlockRanges 
BlockRanges = { 
	{ #0000, #007F, "Basic Latin" }, 
	{ #0080, #00FF, "Latin-1 Supplement" }, 
	{ #0100, #017F, "Latin Extended-A" }, 
	{ #0180, #024F, "Latin Extended-B" }, 
	{ #0250, #02AF, "IPA Extensions" }, 
	{ #02B0, #02FF, "Spacing Modifier Letters" }, 
	{ #0300, #036F, "Combining Diacritical Marks" }, 
	{ #0370, #03FF, "Greek and Coptic" }, 
	{ #0400, #04FF, "Cyrillic" }, 
	{ #0500, #052F, "Cyrillic Supplement" }, 
	{ #0530, #058F, "Armenian" }, 
	{ #0590, #05FF, "Hebrew" }, 
	{ #0600, #06FF, "Arabic" }, 
	{ #0700, #074F, "Syriac" }, 
	{ #0750, #077F, "Arabic Supplement" }, 
	{ #0780, #07BF, "Thaana" }, 
	{ #07C0, #07FF, "NKo" }, 
	{ #0900, #097F, "Devanagari" }, 
	{ #0980, #09FF, "Bengali" }, 
	{ #0A00, #0A7F, "Gurmukhi" }, 
	{ #0A80, #0AFF, "Gujarati" }, 
	{ #0B00, #0B7F, "Oriya" }, 
	{ #0B80, #0BFF, "Tamil" }, 
	{ #0C00, #0C7F, "Telugu" }, 
	{ #0C80, #0CFF, "Kannada" }, 
	{ #0D00, #0D7F, "Malayalam" }, 
	{ #0D80, #0DFF, "Sinhala" }, 
	{ #0E00, #0E7F, "Thai" }, 
	{ #0E80, #0EFF, "Lao" }, 
	{ #0F00, #0FFF, "Tibetan" }, 
	{ #1000, #109F, "Myanmar" }, 
	{ #10A0, #10FF, "Georgian" }, 
	{ #1100, #11FF, "Hangul Jamo" }, 
	{ #1200, #137F, "Ethiopic" }, 
	{ #1380, #139F, "Ethiopic Supplement" }, 
	{ #13A0, #13FF, "Cherokee" }, 
	{ #1400, #167F, "Unified Canadian Aboriginal Syllabics" }, 
	{ #1680, #169F, "Ogham" }, 
	{ #16A0, #16FF, "Runic" }, 
	{ #1700, #171F, "Tagalog" }, 
	{ #1720, #173F, "Hanunoo" }, 
	{ #1740, #175F, "Buhid" }, 
	{ #1760, #177F, "Tagbanwa" }, 
	{ #1780, #17FF, "Khmer" }, 
	{ #1800, #18AF, "Mongolian" }, 
	{ #1900, #194F, "Limbu" }, 
	{ #1950, #197F, "Tai Le" }, 
	{ #1980, #19DF, "New Tai Lue" }, 
	{ #19E0, #19FF, "Khmer Symbols" }, 
	{ #1A00, #1A1F, "Buginese" }, 
	{ #1B00, #1B7F, "Balinese" }, 
	{ #1B80, #1BBF, "Sundanese" }, 
	{ #1C00, #1C4F, "Lepcha" }, 
	{ #1C50, #1C7F, "Ol Chiki" }, 
	{ #1D00, #1D7F, "Phonetic Extensions" }, 
	{ #1D80, #1DBF, "Phonetic Extensions Supplement" }, 
	{ #1DC0, #1DFF, "Combining Diacritical Marks Supplement" }, 
	{ #1E00, #1EFF, "Latin Extended Additional" }, 
	{ #1F00, #1FFF, "Greek Extended" }, 
	{ #2000, #206F, "General Punctuation" }, 
	{ #2070, #209F, "Superscripts and Subscripts" }, 
	{ #20A0, #20CF, "Currency Symbols" }, 
	{ #20D0, #20FF, "Combining Diacritical Marks for Symbols" }, 
	{ #2100, #214F, "Letterlike Symbols" }, 
	{ #2150, #218F, "Number Forms" }, 
	{ #2190, #21FF, "Arrows" }, 
	{ #2200, #22FF, "Mathematical Operators" }, 
	{ #2300, #23FF, "Miscellaneous Technical" }, 
	{ #2400, #243F, "Control Pictures" }, 
	{ #2440, #245F, "Optical Character Recognition" }, 
	{ #2460, #24FF, "Enclosed Alphanumerics" }, 
	{ #2500, #257F, "Box Drawing" }, 
	{ #2580, #259F, "Block Elements" }, 
	{ #25A0, #25FF, "Geometric Shapes" }, 
	{ #2600, #26FF, "Miscellaneous Symbols" }, 
	{ #2700, #27BF, "Dingbats" }, 
	{ #27C0, #27EF, "Miscellaneous Mathematical Symbols-A" }, 
	{ #27F0, #27FF, "Supplemental Arrows-A" }, 
	{ #2800, #28FF, "Braille Patterns" }, 
	{ #2900, #297F, "Supplemental Arrows-B" }, 
	{ #2980, #29FF, "Miscellaneous Mathematical Symbols-B" }, 
	{ #2A00, #2AFF, "Supplemental Mathematical Operators" }, 
	{ #2B00, #2BFF, "Miscellaneous Symbols and Arrows" }, 
	{ #2C00, #2C5F, "Glagolitic" }, 
	{ #2C60, #2C7F, "Latin Extended-C" }, 
	{ #2C80, #2CFF, "Coptic" }, 
	{ #2D00, #2D2F, "Georgian Supplement" }, 
	{ #2D30, #2D7F, "Tifinagh" }, 
	{ #2D80, #2DDF, "Ethiopic Extended" }, 
	{ #2DE0, #2DFF, "Cyrillic Extended-A" }, 
	{ #2E00, #2E7F, "Supplemental Punctuation" }, 
	{ #2E80, #2EFF, "CJK Radicals Supplement" }, 
	{ #2F00, #2FDF, "Kangxi Radicals" }, 
	{ #2FF0, #2FFF, "Ideographic Description Characters" }, 
	{ #3000, #303F, "CJK Symbols and Punctuation" }, 
	{ #3040, #309F, "Hiragana" }, 
	{ #30A0, #30FF, "Katakana" }, 
	{ #3100, #312F, "Bopomofo" }, 
	{ #3130, #318F, "Hangul Compatibility Jamo" }, 
	{ #3190, #319F, "Kanbun" }, 
	{ #31A0, #31BF, "Bopomofo Extended" }, 
	{ #31C0, #31EF, "CJK Strokes" }, 
	{ #31F0, #31FF, "Katakana Phonetic Extensions" }, 
	{ #3200, #32FF, "Enclosed CJK Letters and Months" }, 
	{ #3300, #33FF, "CJK Compatibility" }, 
	{ #3400, #4DBF, "CJK Unified Ideographs Extension A" }, 
	{ #4DC0, #4DFF, "Yijing Hexagram Symbols" }, 
	{ #4E00, #9FFF, "CJK Unified Ideographs" }, 
	{ #A000, #A48F, "Yi Syllables" }, 
	{ #A490, #A4CF, "Yi Radicals" }, 
	{ #A500, #A63F, "Vai" }, 
	{ #A640, #A69F, "Cyrillic Extended-B" }, 
	{ #A700, #A71F, "Modifier Tone Letters" }, 
	{ #A720, #A7FF, "Latin Extended-D" }, 
	{ #A800, #A82F, "Syloti Nagri" }, 
	{ #A840, #A87F, "Phags-pa" }, 
	{ #A880, #A8DF, "Saurashtra" }, 
	{ #A900, #A92F, "Kayah Li" }, 
	{ #A930, #A95F, "Rejang" }, 
	{ #AA00, #AA5F, "Cham" }, 
	{ #AC00, #D7AF, "Hangul Syllables" }, 
	{ #D800, #DB7F, "High Surrogates" }, 
	{ #DB80, #DBFF, "High Private Use Surrogates" }, 
	{ #DC00, #DFFF, "Low Surrogates" }, 
	{ #E000, #F8FF, "Private Use Area" }, 
	{ #F900, #FAFF, "CJK Compatibility Ideographs" }, 
	{ #FB00, #FB4F, "Alphabetic Presentation Forms" }, 
	{ #FB50, #FDFF, "Arabic Presentation Forms-A" }, 
	{ #FE00, #FE0F, "Variation Selectors" }, 
	{ #FE10, #FE1F, "Vertical Forms" }, 
	{ #FE20, #FE2F, "Combining Half Marks" }, 
	{ #FE30, #FE4F, "CJK Compatibility Forms" }, 
	{ #FE50, #FE6F, "Small Form Variants" }, 
	{ #FE70, #FEFF, "Arabic Presentation Forms-B" }, 
	{ #FF00, #FFEF, "Halfwidth and Fullwidth Forms" }, 
	{ #FFF0, #FFFF, "Specials" }, 
	{ #10000, #1007F, "Linear B Syllabary" }, 
	{ #10080, #100FF, "Linear B Ideograms" }, 
	{ #10100, #1013F, "Aegean Numbers" }, 
	{ #10140, #1018F, "Ancient Greek Numbers" }, 
	{ #10190, #101CF, "Ancient Symbols" }, 
	{ #101D0, #101FF, "Phaistos Disc" }, 
	{ #10280, #1029F, "Lycian" }, 
	{ #102A0, #102DF, "Carian" }, 
	{ #10300, #1032F, "Old Italic" }, 
	{ #10330, #1034F, "Gothic" }, 
	{ #10380, #1039F, "Ugaritic" }, 
	{ #103A0, #103DF, "Old Persian" }, 
	{ #10400, #1044F, "Deseret" }, 
	{ #10450, #1047F, "Shavian" }, 
	{ #10480, #104AF, "Osmanya" }, 
	{ #10800, #1083F, "Cypriot Syllabary" }, 
	{ #10900, #1091F, "Phoenician" }, 
	{ #10920, #1093F, "Lydian" }, 
	{ #10A00, #10A5F, "Kharoshthi" }, 
	{ #12000, #123FF, "Cuneiform" }, 
	{ #12400, #1247F, "Cuneiform Numbers and Punctuation" }, 
	{ #1D000, #1D0FF, "Byzantine Musical Symbols" }, 
	{ #1D100, #1D1FF, "Musical Symbols" }, 
	{ #1D200, #1D24F, "Ancient Greek Musical Notation" }, 
	{ #1D300, #1D35F, "Tai Xuan Jing Symbols" }, 
	{ #1D360, #1D37F, "Counting Rod Numerals" }, 
	{ #1D400, #1D7FF, "Mathematical Alphanumeric Symbols" }, 
	{ #1F000, #1F02F, "Mahjong Tiles" }, 
	{ #1F030, #1F09F, "Domino Tiles" }, 
	{ #20000, #2A6DF, "CJK Unified Ideographs Extension B" }, 
	{ #2F800, #2FA1F, "CJK Compatibility Ideographs Supplement" }, 
	{ #E0000, #E007F, "Tags" }, 
	{ #E0100, #E01EF, "Variation Selectors Supplement" }, 
	{ #F0000, #FFFFF, "Supplementary Private Use Area-A" }, 
	{ #100000, #10FFFF, "Supplementary Private Use Area-B" }} 
 
function NumToBlock( atom codepoint ) 
	for i = 1 to length( BlockRanges ) do 
		if ( codepoint >= BlockRanges[ i ][ 1 ] and codepoint <= BlockRanges[ i ][ 2 ] ) then 
			return BlockRanges[ i ][ 3 ] 
		end if 
	end for 
	return "<unassigned>" 
end function 
 
global function BLOCKS( atom ptrString ) 
  atom h 
  sequence sData 
  sequence sResult 
   
  sData = peek_vb( ptrString )  
 
  h = open("boune1.dat", "w" ) 
  print( h, sData ) 
  close(h) 
   
  sResult = "" 
  -- Join the block names with tabs; an empty input string gives an empty result 
  for i = 1 to length( sData[ 2 ] ) do 
    if i > 1 then 
      sResult = sResult & {9} 
    end if 
    sResult = sResult & NumToBlock( sData[ 2 ][ i ] ) 
  end for 
   
  return alloc_bstr( sResult ) 
end function 
 
--printf(1, "%s", {NumToBlock( 125 )} ) 
 

5. Re: I hope there's a more efficient way of doing this

axtens said...

Bruce, you could eliminate the second column I think...

Then loop through like this:

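c = 1 
-- step past every block that starts at or before x; the length check 
-- stops the loop running off the end of the table 
while c <= length( BlockRanges ) and x >= BlockRanges[c][1] do 
 c+=1 
end while 
c-=1 -- back to the last block that starts at or before x 
result = BlockRanges[c][2] 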

Data would look like this:

sequence BlockRanges 
BlockRanges = { 
	{ #0000, "Basic Latin" }, 
	{ #0080, "Latin-1 Supplement" }, 
	{ #0100, "Latin Extended-A" }, 
	{ #0180, "Latin Extended-B" }, 
	{ #0250, "IPA Extensions" }, 
        ... 
	{ #F0000, "Supplementary Private Use Area-A" }, 
	{ #100000, "Supplementary Private Use Area-B" }} 
 

Something like that.


6. Re: I hope there's a more efficient way of doing this

Oops. I screwed up that quoted reply... :(


7. Re: I hope there's a more efficient way of doing this

Hey, euphoric, that's really cool code, and saves some data space as well.

Thanks very much!

Bruce.


8. Re: I hope there's a more efficient way of doing this

... however, not all the areas are contiguous, so if you supply a value of #E01F0 it gets interpreted as Variation Selectors Supplement which is incorrect. (Sigh).

Bruce.


9. Re: I hope there's a more efficient way of doing this

axtens said...

... however, not all the areas are contiguous, so if you supply a value of #E01F0 it gets interpreted as Variation Selectors Supplement which is incorrect. (Sigh).

Bruce.

I only did a quick scan of the data and it looked contiguous. Sorry! :)

Are there large sets of non-contiguous numbers? Maybe you could set those aside as exceptions and just catch them...?


10. Re: I hope there's a more efficient way of doing this

Actually, you could just put those numbers into the data set and act on them if they get returned.

	{ #E0000, "Tags" },  
        { #E0080, "UNPOSSIBLE!" }, 
	{ #E0100, "Variation Selectors Supplement" },  
	{ #F0000, "Supplementary Private Use Area-A" },  
	{ #100000, "Supplementary Private Use Area-B" }}  
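
Putting the two ideas together, a start-only table with explicit gap entries might be searched as below. This is a sketch, not code anyone in the thread posted: the table is shortened to the tail end, Starts and LookupStart are made-up names, an empty name stands in for the gap marker, and the gap entry at #E01F0 is added by hand. The loop uses >= and a bounds check so that a value equal to a start, or one beyond the last entry, does not run off the table.

```euphoria
-- Hypothetical, shortened table: { first code point, name }, sorted.
-- An empty name marks a gap between blocks.
sequence Starts
Starts = {
    { #E0000, "Tags" },
    { #E0080, "" },
    { #E0100, "Variation Selectors Supplement" },
    { #E01F0, "" },
    { #F0000, "Supplementary Private Use Area-A" }
}

function LookupStart( atom x )
    integer c
    c = 1
    -- advance past every entry that starts at or before x
    while c <= length( Starts ) and x >= Starts[ c ][ 1 ] do
        c += 1
    end while
    c -= 1
    -- c = 0 means x is below the first entry; an empty name means a gap
    if c = 0 or length( Starts[ c ][ 2 ] ) = 0 then
        return "<unassigned>"
    end if
    return Starts[ c ][ 2 ]
end function

printf( 1, "%s\n", { LookupStart( #E01F0 ) } )  -- <unassigned>
printf( 1, "%s\n", { LookupStart( #E0100 ) } )  -- Variation Selectors Supplement
```

The cost of dropping the end column is one extra row per gap, so whether this saves space over the three-column table depends on how many gaps the full block list has.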


11. Re: I hope there's a more efficient way of doing this

euphoric said...

Are there large sets of non-contiguous numbers? Maybe you could set those aside as exceptions and just catch them...?

D'oh, why didn't I think of that?!

Kind regards,

Bruce.

