I hope there's a more efficient way of doing this

G'day everyone,

Now that I've given up on returning an array of BSTRs (for now at least), it's time to get on to the project itself. First off the rank is a DLL function which lists the Unicode block that each character in the input string falls into.

The VB6 code calling the DLL, which uses a typelib to avoid Declare statements, looks like this:

Dim cfg As Std.Config 
Sub Main() 
    Set cfg = New Config 
    cfg.Load "test.cfg" 
    sText = cfg.RecallElse("test", "dog") 
    Dim s As String 
    s = BLOCKS(sText) 
    Dim a() As String 
    a = Split(s, vbTab) 
    Dim b 
    For Each b In a 
        Debug.Print b 
    Next 
End Sub 

The cfg call is to a home-grown COM DLL which provides access to .cfg files (ANSI or Unicode). test.cfg contains the following line, which probably won't reproduce very well here, so beneath it are the results of a Euphoria print of the data parsed from it.
test=¶ᇍⅿ毅毅訜訝𒍅񄑄򙦙 
    {13,{182,4557,8575,27589,27589,35356,35357,55304,57157,55505,56388,55846,56729}} 
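
One thing worth noting about that print: the last six values are UTF-16 surrogate pairs, not code points, so a per-value lookup would report them as surrogates rather than as the characters they encode. Below is a minimal sketch of folding pairs back into code points before the lookup. combine_surrogates is a hypothetical helper name, not something from the include files, and unpaired surrogates are simply passed through unchanged, which is an assumption about the desired behaviour.

```euphoria
-- Hypothetical helper: fold UTF-16 surrogate pairs into single code points.
-- units is a sequence of 16-bit code units, such as the second element
-- of the {length, data} pair shown above.
function combine_surrogates( sequence units )
    sequence points
    integer i
    atom hi, lo

    points = {}
    i = 1
    while i <= length( units ) do
        hi = units[i]
        if hi >= #D800 and hi <= #DBFF and i < length( units ) then
            lo = units[i + 1]
            if lo >= #DC00 and lo <= #DFFF then
                -- 10 bits from each half, offset by #10000
                points &= #10000 + ( hi - #D800 ) * #400 + ( lo - #DC00 )
                i += 2
            else
                points &= hi   -- unpaired high surrogate, passed through
                i += 1
            end if
        else
            points &= hi       -- ordinary BMP value (or a stray low surrogate)
            i += 1
        end if
    end while
    return points
end function

print( 1, combine_surrogates( {182, 4557, 8575, 27589, 27589, 35356, 35357,
                               55304, 57157, 55505, 56388, 55846, 56729} ) )
```

With the test data, the three pairs become #12345, #44444 and #99999 (74565, 279620 and 629145), giving ten code points from thirteen code units. Only the first of those three falls inside a block in the list below (Cuneiform).
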
Now comes the ugliest bit of Euphoria I think I've ever had cause to generate (I can send you the VBScript that did it if you like). My question, in the expectation that you're not going to scroll down to the end, is this: is there any better way of doing NumToBlock? I'm fairly sure there's no 'case' statement in Euphoria, but things have changed a fair bit since v1.5 (when I first encountered the language), so there may well be a solution sitting right under my nose.

include machine.e 
include dll.e 
include wildcard.e 
include w32msgs.e 
include w32def_series.ew 
include Unicode.ew 
include variant.ew 
   
-- Read the VB string (BSTR) at ptr; returns {length in UTF-16 code units, code units} 
function peek_vb( atom ptr ) 
    sequence str, temp, len 
     
    str = "" 
     
    ptr -= 4 
    len = peek( {ptr,4} ) -- get length DWORD 
    ptr += 4 
     
    temp = peek( {ptr, 2} ) 
    ptr += 2 
    while not equal( temp, {0,0} ) do 
        str &= ( temp[2] * 256 + temp[1] ) 
        temp = peek( {ptr,2} ) 
        ptr += 2 
    end while 
    return { bytes_to_int(len) / 2 , str } 
end function 
 
function NumToBlock( atom codepoint ) 
	if ( codepoint >= #0000 and codepoint <= #007F ) then 
		return "Basic Latin" 
	end if 
	if ( codepoint >= #0080 and codepoint <= #00FF ) then 
		return "Latin-1 Supplement" 
	end if 
	if ( codepoint >= #0100 and codepoint <= #017F ) then 
		return "Latin Extended-A" 
	end if 
	if ( codepoint >= #0180 and codepoint <= #024F ) then 
		return "Latin Extended-B" 
	end if 
	if ( codepoint >= #0250 and codepoint <= #02AF ) then 
		return "IPA Extensions" 
	end if 
	if ( codepoint >= #02B0 and codepoint <= #02FF ) then 
		return "Spacing Modifier Letters" 
	end if 
	if ( codepoint >= #0300 and codepoint <= #036F ) then 
		return "Combining Diacritical Marks" 
	end if 
	if ( codepoint >= #0370 and codepoint <= #03FF ) then 
		return "Greek and Coptic" 
	end if 
	if ( codepoint >= #0400 and codepoint <= #04FF ) then 
		return "Cyrillic" 
	end if 
	if ( codepoint >= #0500 and codepoint <= #052F ) then 
		return "Cyrillic Supplement" 
	end if 
	if ( codepoint >= #0530 and codepoint <= #058F ) then 
		return "Armenian" 
	end if 
	if ( codepoint >= #0590 and codepoint <= #05FF ) then 
		return "Hebrew" 
	end if 
	if ( codepoint >= #0600 and codepoint <= #06FF ) then 
		return "Arabic" 
	end if 
	if ( codepoint >= #0700 and codepoint <= #074F ) then 
		return "Syriac" 
	end if 
	if ( codepoint >= #0750 and codepoint <= #077F ) then 
		return "Arabic Supplement" 
	end if 
	if ( codepoint >= #0780 and codepoint <= #07BF ) then 
		return "Thaana" 
	end if 
	if ( codepoint >= #07C0 and codepoint <= #07FF ) then 
		return "NKo" 
	end if 
	if ( codepoint >= #0900 and codepoint <= #097F ) then 
		return "Devanagari" 
	end if 
	if ( codepoint >= #0980 and codepoint <= #09FF ) then 
		return "Bengali" 
	end if 
	if ( codepoint >= #0A00 and codepoint <= #0A7F ) then 
		return "Gurmukhi" 
	end if 
	if ( codepoint >= #0A80 and codepoint <= #0AFF ) then 
		return "Gujarati" 
	end if 
	if ( codepoint >= #0B00 and codepoint <= #0B7F ) then 
		return "Oriya" 
	end if 
	if ( codepoint >= #0B80 and codepoint <= #0BFF ) then 
		return "Tamil" 
	end if 
	if ( codepoint >= #0C00 and codepoint <= #0C7F ) then 
		return "Telugu" 
	end if 
	if ( codepoint >= #0C80 and codepoint <= #0CFF ) then 
		return "Kannada" 
	end if 
	if ( codepoint >= #0D00 and codepoint <= #0D7F ) then 
		return "Malayalam" 
	end if 
	if ( codepoint >= #0D80 and codepoint <= #0DFF ) then 
		return "Sinhala" 
	end if 
	if ( codepoint >= #0E00 and codepoint <= #0E7F ) then 
		return "Thai" 
	end if 
	if ( codepoint >= #0E80 and codepoint <= #0EFF ) then 
		return "Lao" 
	end if 
	if ( codepoint >= #0F00 and codepoint <= #0FFF ) then 
		return "Tibetan" 
	end if 
	if ( codepoint >= #1000 and codepoint <= #109F ) then 
		return "Myanmar" 
	end if 
	if ( codepoint >= #10A0 and codepoint <= #10FF ) then 
		return "Georgian" 
	end if 
	if ( codepoint >= #1100 and codepoint <= #11FF ) then 
		return "Hangul Jamo" 
	end if 
	if ( codepoint >= #1200 and codepoint <= #137F ) then 
		return "Ethiopic" 
	end if 
	if ( codepoint >= #1380 and codepoint <= #139F ) then 
		return "Ethiopic Supplement" 
	end if 
	if ( codepoint >= #13A0 and codepoint <= #13FF ) then 
		return "Cherokee" 
	end if 
	if ( codepoint >= #1400 and codepoint <= #167F ) then 
		return "Unified Canadian Aboriginal Syllabics" 
	end if 
	if ( codepoint >= #1680 and codepoint <= #169F ) then 
		return "Ogham" 
	end if 
	if ( codepoint >= #16A0 and codepoint <= #16FF ) then 
		return "Runic" 
	end if 
	if ( codepoint >= #1700 and codepoint <= #171F ) then 
		return "Tagalog" 
	end if 
	if ( codepoint >= #1720 and codepoint <= #173F ) then 
		return "Hanunoo" 
	end if 
	if ( codepoint >= #1740 and codepoint <= #175F ) then 
		return "Buhid" 
	end if 
	if ( codepoint >= #1760 and codepoint <= #177F ) then 
		return "Tagbanwa" 
	end if 
	if ( codepoint >= #1780 and codepoint <= #17FF ) then 
		return "Khmer" 
	end if 
	if ( codepoint >= #1800 and codepoint <= #18AF ) then 
		return "Mongolian" 
	end if 
	if ( codepoint >= #1900 and codepoint <= #194F ) then 
		return "Limbu" 
	end if 
	if ( codepoint >= #1950 and codepoint <= #197F ) then 
		return "Tai Le" 
	end if 
	if ( codepoint >= #1980 and codepoint <= #19DF ) then 
		return "New Tai Lue" 
	end if 
	if ( codepoint >= #19E0 and codepoint <= #19FF ) then 
		return "Khmer Symbols" 
	end if 
	if ( codepoint >= #1A00 and codepoint <= #1A1F ) then 
		return "Buginese" 
	end if 
	if ( codepoint >= #1B00 and codepoint <= #1B7F ) then 
		return "Balinese" 
	end if 
	if ( codepoint >= #1B80 and codepoint <= #1BBF ) then 
		return "Sundanese" 
	end if 
	if ( codepoint >= #1C00 and codepoint <= #1C4F ) then 
		return "Lepcha" 
	end if 
	if ( codepoint >= #1C50 and codepoint <= #1C7F ) then 
		return "Ol Chiki" 
	end if 
	if ( codepoint >= #1D00 and codepoint <= #1D7F ) then 
		return "Phonetic Extensions" 
	end if 
	if ( codepoint >= #1D80 and codepoint <= #1DBF ) then 
		return "Phonetic Extensions Supplement" 
	end if 
	if ( codepoint >= #1DC0 and codepoint <= #1DFF ) then 
		return "Combining Diacritical Marks Supplement" 
	end if 
	if ( codepoint >= #1E00 and codepoint <= #1EFF ) then 
		return "Latin Extended Additional" 
	end if 
	if ( codepoint >= #1F00 and codepoint <= #1FFF ) then 
		return "Greek Extended" 
	end if 
	if ( codepoint >= #2000 and codepoint <= #206F ) then 
		return "General Punctuation" 
	end if 
	if ( codepoint >= #2070 and codepoint <= #209F ) then 
		return "Superscripts and Subscripts" 
	end if 
	if ( codepoint >= #20A0 and codepoint <= #20CF ) then 
		return "Currency Symbols" 
	end if 
	if ( codepoint >= #20D0 and codepoint <= #20FF ) then 
		return "Combining Diacritical Marks for Symbols" 
	end if 
	if ( codepoint >= #2100 and codepoint <= #214F ) then 
		return "Letterlike Symbols" 
	end if 
	if ( codepoint >= #2150 and codepoint <= #218F ) then 
		return "Number Forms" 
	end if 
	if ( codepoint >= #2190 and codepoint <= #21FF ) then 
		return "Arrows" 
	end if 
	if ( codepoint >= #2200 and codepoint <= #22FF ) then 
		return "Mathematical Operators" 
	end if 
	if ( codepoint >= #2300 and codepoint <= #23FF ) then 
		return "Miscellaneous Technical" 
	end if 
	if ( codepoint >= #2400 and codepoint <= #243F ) then 
		return "Control Pictures" 
	end if 
	if ( codepoint >= #2440 and codepoint <= #245F ) then 
		return "Optical Character Recognition" 
	end if 
	if ( codepoint >= #2460 and codepoint <= #24FF ) then 
		return "Enclosed Alphanumerics" 
	end if 
	if ( codepoint >= #2500 and codepoint <= #257F ) then 
		return "Box Drawing" 
	end if 
	if ( codepoint >= #2580 and codepoint <= #259F ) then 
		return "Block Elements" 
	end if 
	if ( codepoint >= #25A0 and codepoint <= #25FF ) then 
		return "Geometric Shapes" 
	end if 
	if ( codepoint >= #2600 and codepoint <= #26FF ) then 
		return "Miscellaneous Symbols" 
	end if 
	if ( codepoint >= #2700 and codepoint <= #27BF ) then 
		return "Dingbats" 
	end if 
	if ( codepoint >= #27C0 and codepoint <= #27EF ) then 
		return "Miscellaneous Mathematical Symbols-A" 
	end if 
	if ( codepoint >= #27F0 and codepoint <= #27FF ) then 
		return "Supplemental Arrows-A" 
	end if 
	if ( codepoint >= #2800 and codepoint <= #28FF ) then 
		return "Braille Patterns" 
	end if 
	if ( codepoint >= #2900 and codepoint <= #297F ) then 
		return "Supplemental Arrows-B" 
	end if 
	if ( codepoint >= #2980 and codepoint <= #29FF ) then 
		return "Miscellaneous Mathematical Symbols-B" 
	end if 
	if ( codepoint >= #2A00 and codepoint <= #2AFF ) then 
		return "Supplemental Mathematical Operators" 
	end if 
	if ( codepoint >= #2B00 and codepoint <= #2BFF ) then 
		return "Miscellaneous Symbols and Arrows" 
	end if 
	if ( codepoint >= #2C00 and codepoint <= #2C5F ) then 
		return "Glagolitic" 
	end if 
	if ( codepoint >= #2C60 and codepoint <= #2C7F ) then 
		return "Latin Extended-C" 
	end if 
	if ( codepoint >= #2C80 and codepoint <= #2CFF ) then 
		return "Coptic" 
	end if 
	if ( codepoint >= #2D00 and codepoint <= #2D2F ) then 
		return "Georgian Supplement" 
	end if 
	if ( codepoint >= #2D30 and codepoint <= #2D7F ) then 
		return "Tifinagh" 
	end if 
	if ( codepoint >= #2D80 and codepoint <= #2DDF ) then 
		return "Ethiopic Extended" 
	end if 
	if ( codepoint >= #2DE0 and codepoint <= #2DFF ) then 
		return "Cyrillic Extended-A" 
	end if 
	if ( codepoint >= #2E00 and codepoint <= #2E7F ) then 
		return "Supplemental Punctuation" 
	end if 
	if ( codepoint >= #2E80 and codepoint <= #2EFF ) then 
		return "CJK Radicals Supplement" 
	end if 
	if ( codepoint >= #2F00 and codepoint <= #2FDF ) then 
		return "Kangxi Radicals" 
	end if 
	if ( codepoint >= #2FF0 and codepoint <= #2FFF ) then 
		return "Ideographic Description Characters" 
	end if 
	if ( codepoint >= #3000 and codepoint <= #303F ) then 
		return "CJK Symbols and Punctuation" 
	end if 
	if ( codepoint >= #3040 and codepoint <= #309F ) then 
		return "Hiragana" 
	end if 
	if ( codepoint >= #30A0 and codepoint <= #30FF ) then 
		return "Katakana" 
	end if 
	if ( codepoint >= #3100 and codepoint <= #312F ) then 
		return "Bopomofo" 
	end if 
	if ( codepoint >= #3130 and codepoint <= #318F ) then 
		return "Hangul Compatibility Jamo" 
	end if 
	if ( codepoint >= #3190 and codepoint <= #319F ) then 
		return "Kanbun" 
	end if 
	if ( codepoint >= #31A0 and codepoint <= #31BF ) then 
		return "Bopomofo Extended" 
	end if 
	if ( codepoint >= #31C0 and codepoint <= #31EF ) then 
		return "CJK Strokes" 
	end if 
	if ( codepoint >= #31F0 and codepoint <= #31FF ) then 
		return "Katakana Phonetic Extensions" 
	end if 
	if ( codepoint >= #3200 and codepoint <= #32FF ) then 
		return "Enclosed CJK Letters and Months" 
	end if 
	if ( codepoint >= #3300 and codepoint <= #33FF ) then 
		return "CJK Compatibility" 
	end if 
	if ( codepoint >= #3400 and codepoint <= #4DBF ) then 
		return "CJK Unified Ideographs Extension A" 
	end if 
	if ( codepoint >= #4DC0 and codepoint <= #4DFF ) then 
		return "Yijing Hexagram Symbols" 
	end if 
	if ( codepoint >= #4E00 and codepoint <= #9FFF ) then 
		return "CJK Unified Ideographs" 
	end if 
	if ( codepoint >= #A000 and codepoint <= #A48F ) then 
		return "Yi Syllables" 
	end if 
	if ( codepoint >= #A490 and codepoint <= #A4CF ) then 
		return "Yi Radicals" 
	end if 
	if ( codepoint >= #A500 and codepoint <= #A63F ) then 
		return "Vai" 
	end if 
	if ( codepoint >= #A640 and codepoint <= #A69F ) then 
		return "Cyrillic Extended-B" 
	end if 
	if ( codepoint >= #A700 and codepoint <= #A71F ) then 
		return "Modifier Tone Letters" 
	end if 
	if ( codepoint >= #A720 and codepoint <= #A7FF ) then 
		return "Latin Extended-D" 
	end if 
	if ( codepoint >= #A800 and codepoint <= #A82F ) then 
		return "Syloti Nagri" 
	end if 
	if ( codepoint >= #A840 and codepoint <= #A87F ) then 
		return "Phags-pa" 
	end if 
	if ( codepoint >= #A880 and codepoint <= #A8DF ) then 
		return "Saurashtra" 
	end if 
	if ( codepoint >= #A900 and codepoint <= #A92F ) then 
		return "Kayah Li" 
	end if 
	if ( codepoint >= #A930 and codepoint <= #A95F ) then 
		return "Rejang" 
	end if 
	if ( codepoint >= #AA00 and codepoint <= #AA5F ) then 
		return "Cham" 
	end if 
	if ( codepoint >= #AC00 and codepoint <= #D7AF ) then 
		return "Hangul Syllables" 
	end if 
	if ( codepoint >= #D800 and codepoint <= #DB7F ) then 
		return "High Surrogates" 
	end if 
	if ( codepoint >= #DB80 and codepoint <= #DBFF ) then 
		return "High Private Use Surrogates" 
	end if 
	if ( codepoint >= #DC00 and codepoint <= #DFFF ) then 
		return "Low Surrogates" 
	end if 
	if ( codepoint >= #E000 and codepoint <= #F8FF ) then 
		return "Private Use Area" 
	end if 
	if ( codepoint >= #F900 and codepoint <= #FAFF ) then 
		return "CJK Compatibility Ideographs" 
	end if 
	if ( codepoint >= #FB00 and codepoint <= #FB4F ) then 
		return "Alphabetic Presentation Forms" 
	end if 
	if ( codepoint >= #FB50 and codepoint <= #FDFF ) then 
		return "Arabic Presentation Forms-A" 
	end if 
	if ( codepoint >= #FE00 and codepoint <= #FE0F ) then 
		return "Variation Selectors" 
	end if 
	if ( codepoint >= #FE10 and codepoint <= #FE1F ) then 
		return "Vertical Forms" 
	end if 
	if ( codepoint >= #FE20 and codepoint <= #FE2F ) then 
		return "Combining Half Marks" 
	end if 
	if ( codepoint >= #FE30 and codepoint <= #FE4F ) then 
		return "CJK Compatibility Forms" 
	end if 
	if ( codepoint >= #FE50 and codepoint <= #FE6F ) then 
		return "Small Form Variants" 
	end if 
	if ( codepoint >= #FE70 and codepoint <= #FEFF ) then 
		return "Arabic Presentation Forms-B" 
	end if 
	if ( codepoint >= #FF00 and codepoint <= #FFEF ) then 
		return "Halfwidth and Fullwidth Forms" 
	end if 
	if ( codepoint >= #FFF0 and codepoint <= #FFFF ) then 
		return "Specials" 
	end if 
	if ( codepoint >= #10000 and codepoint <= #1007F ) then 
		return "Linear B Syllabary" 
	end if 
	if ( codepoint >= #10080 and codepoint <= #100FF ) then 
		return "Linear B Ideograms" 
	end if 
	if ( codepoint >= #10100 and codepoint <= #1013F ) then 
		return "Aegean Numbers" 
	end if 
	if ( codepoint >= #10140 and codepoint <= #1018F ) then 
		return "Ancient Greek Numbers" 
	end if 
	if ( codepoint >= #10190 and codepoint <= #101CF ) then 
		return "Ancient Symbols" 
	end if 
	if ( codepoint >= #101D0 and codepoint <= #101FF ) then 
		return "Phaistos Disc" 
	end if 
	if ( codepoint >= #10280 and codepoint <= #1029F ) then 
		return "Lycian" 
	end if 
	if ( codepoint >= #102A0 and codepoint <= #102DF ) then 
		return "Carian" 
	end if 
	if ( codepoint >= #10300 and codepoint <= #1032F ) then 
		return "Old Italic" 
	end if 
	if ( codepoint >= #10330 and codepoint <= #1034F ) then 
		return "Gothic" 
	end if 
	if ( codepoint >= #10380 and codepoint <= #1039F ) then 
		return "Ugaritic" 
	end if 
	if ( codepoint >= #103A0 and codepoint <= #103DF ) then 
		return "Old Persian" 
	end if 
	if ( codepoint >= #10400 and codepoint <= #1044F ) then 
		return "Deseret" 
	end if 
	if ( codepoint >= #10450 and codepoint <= #1047F ) then 
		return "Shavian" 
	end if 
	if ( codepoint >= #10480 and codepoint <= #104AF ) then 
		return "Osmanya" 
	end if 
	if ( codepoint >= #10800 and codepoint <= #1083F ) then 
		return "Cypriot Syllabary" 
	end if 
	if ( codepoint >= #10900 and codepoint <= #1091F ) then 
		return "Phoenician" 
	end if 
	if ( codepoint >= #10920 and codepoint <= #1093F ) then 
		return "Lydian" 
	end if 
	if ( codepoint >= #10A00 and codepoint <= #10A5F ) then 
		return "Kharoshthi" 
	end if 
	if ( codepoint >= #12000 and codepoint <= #123FF ) then 
		return "Cuneiform" 
	end if 
	if ( codepoint >= #12400 and codepoint <= #1247F ) then 
		return "Cuneiform Numbers and Punctuation" 
	end if 
	if ( codepoint >= #1D000 and codepoint <= #1D0FF ) then 
		return "Byzantine Musical Symbols" 
	end if 
	if ( codepoint >= #1D100 and codepoint <= #1D1FF ) then 
		return "Musical Symbols" 
	end if 
	if ( codepoint >= #1D200 and codepoint <= #1D24F ) then 
		return "Ancient Greek Musical Notation" 
	end if 
	if ( codepoint >= #1D300 and codepoint <= #1D35F ) then 
		return "Tai Xuan Jing Symbols" 
	end if 
	if ( codepoint >= #1D360 and codepoint <= #1D37F ) then 
		return "Counting Rod Numerals" 
	end if 
	if ( codepoint >= #1D400 and codepoint <= #1D7FF ) then 
		return "Mathematical Alphanumeric Symbols" 
	end if 
	if ( codepoint >= #1F000 and codepoint <= #1F02F ) then 
		return "Mahjong Tiles" 
	end if 
	if ( codepoint >= #1F030 and codepoint <= #1F09F ) then 
		return "Domino Tiles" 
	end if 
	if ( codepoint >= #20000 and codepoint <= #2A6DF ) then 
		return "CJK Unified Ideographs Extension B" 
	end if 
	if ( codepoint >= #2F800 and codepoint <= #2FA1F ) then 
		return "CJK Compatibility Ideographs Supplement" 
	end if 
	if ( codepoint >= #E0000 and codepoint <= #E007F ) then 
		return "Tags" 
	end if 
	if ( codepoint >= #E0100 and codepoint <= #E01EF ) then 
		return "Variation Selectors Supplement" 
	end if 
	if ( codepoint >= #F0000 and codepoint <= #FFFFF ) then 
		return "Supplementary Private Use Area-A" 
	end if 
	if ( codepoint >= #100000 and codepoint <= #10FFFF ) then 
		return "Supplementary Private Use Area-B" 
	end if 
	return "No_Block" -- code point is not in any block listed above 
end function 
 
global function BLOCKS( atom ptrString ) 
  atom h 
  sequence sData 
  sequence sResult 
   
  sData = peek_vb( ptrString )  
 
  h = open("boune1.dat", "w" ) 
  print( h, sData ) 
  close(h) 
   
  sResult = "" 
  for i = 1 to length( sData[ 2 ] ) do 
    if i > 1 then 
      sResult &= 9 -- tab separator between block names 
    end if 
    sResult &= NumToBlock( sData[ 2 ][ i ] ) 
  end for 
   
  return alloc_bstr( sResult ) 
end function 
 
--printf(1, "%s", {NumToBlock( 125 )} ) 
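
For comparison, a table-driven version might be one way to shrink NumToBlock. This is only a minimal sketch, assuming the table is kept sorted by starting code point so that it can be binary-searched. BLOCK_TABLE and FindBlock are hypothetical names, and only a handful of the ranges above are filled in.

```euphoria
-- Hypothetical table: {first code point, last code point, block name},
-- sorted by first code point. Only a few rows are shown here.
constant BLOCK_TABLE = {
    { #0000,  #007F,  "Basic Latin" },
    { #0080,  #00FF,  "Latin-1 Supplement" },
    { #0100,  #017F,  "Latin Extended-A" },
    { #0400,  #04FF,  "Cyrillic" },
    { #4E00,  #9FFF,  "CJK Unified Ideographs" },
    { #12000, #123FF, "Cuneiform" }
}

-- Binary search over the sorted table; returns the block name, or
-- "No_Block" when the code point falls in a gap between ranges.
function FindBlock( atom codepoint )
    integer lo, hi, mid

    lo = 1
    hi = length( BLOCK_TABLE )
    while lo <= hi do
        mid = floor( ( lo + hi ) / 2 )
        if codepoint < BLOCK_TABLE[mid][1] then
            hi = mid - 1
        elsif codepoint > BLOCK_TABLE[mid][2] then
            lo = mid + 1
        else
            return BLOCK_TABLE[mid][3]
        end if
    end while
    return "No_Block"
end function

puts( 1, FindBlock( #6BC5 ) & "\n" )   -- 27589, one of the test values
```

The call at the end should report "CJK Unified Ideographs". The generated if-chain then reduces to generated table rows, and a lookup takes a number of comparisons proportional to the logarithm of the table size rather than to its length.
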

Kind regards,

Bruce.
