Re: bug in char_test() ?
- Posted by DerekParnell (admin) Oct 15, 2012
I decided to give the char_test() in types.e some work in an app. The app's job is to strip out and separate non-English and non-printable data, simply to lessen the computer's workload and narrow the focus for now. There are gigabytes of it, which is too much data right now.
I started by asking some questions in #euphoria irc channel:
Is everything in CS_Printable printable, and all else isn't?
So i can use it to filter out the Korean, Chinese, Afghani, Iranian, Polish, Turkish, etc?
The printable attribute in this library strictly refers to the bytes in the range 32 to 126 inclusive. These are the ASCII characters from SPACE to TILDE. This library does NOT know about any other character encoding schemes other than ASCII.
I got no answers, i coded it up anyhow, and it broke:
My source code is: if char_test(dataline,CS_Printable)
The error printed to screen by the interpreter is:
c:\Euphoria-4.0.4\include\std\types.e:191 in function char_test()
type_check failure, char_set is 6
Of course this would fail in this manner. The documentation states that the second argument to char_test must be a sequence and that CS_Printable is an enum.
dataline = /a/[/r/CapableOf/,/c/en/dog/,/c/en/fight/] /r/CapableOf /c/en/dog /c/en/fight /ctx/all 1 /s/activity/globalmind/assert,/s/contributor/globalmind/ks/maybe /e/624633f49ed73af19503a5cd75c97210b58ea47f /d/globalmind a dog can fight.
types.e line 191 = public function char_test(object test_data, sequence char_set)
i called it with char_test(dataline,CS_Printable)
where CS_Printable = { {' ', '~'} } per line 253 of types.e
It would appear that you have mis-read that line of code. The line actually reads ...
Defined_Sets[CS_Printable ] = {{' ', '~'}}
which is NOT saying that CS_Printable is a sequence but that Defined_Sets indexed by CS_Printable is a sequence.
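To make the distinction concrete, here is a minimal sketch using made-up stand-in names (SET_DIGIT, SET_PRINTABLE and my_sets are hypothetical and are not the identifiers or the code in types.e). It only shows that an enum member is an integer, while the slot it indexes holds the sequence:

```euphoria
-- Hypothetical stand-in names; this is not the code in types.e.
enum SET_DIGIT, SET_PRINTABLE          -- SET_DIGIT = 1, SET_PRINTABLE = 2

sequence my_sets = repeat({}, 2)       -- one slot per enum value
my_sets[SET_DIGIT]     = {{'0', '9'}}
my_sets[SET_PRINTABLE] = {{' ', '~'}}

? SET_PRINTABLE                        -- the enum itself is just an integer
? sequence(my_sets[SET_PRINTABLE])     -- 1: the indexed slot is a sequence
```

Passing SET_PRINTABLE where a sequence parameter is declared would trip the same kind of type_check failure shown in the error above.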
The code ran 100's of times before this crash, and nowhere did i redefine CS_Printable to be 6.
I can only guess that your call to char_test was not executed every time, but eventually it was executed and that's when the program crashed.
Edit:
CS_Printable is enumerated as 6, it's set in
public procedure set_default_charsets()
but Defined_Sets isn't public, so i cannot call
if char_test(dataline,Defined_Sets[CS_Printable]) then
or
if char_test(dataline,types:Defined_Sets[CS_Printable]) then
I coded the library this way because Euphoria does not implement read-only data. I did not want the predefined character sets to be casually modifiable by the coder. The library has two helper routines, get_charsets() and set_charsets(). The first returns a list of all the predefined sets. The second allows the coder to modify one or more sets.
So how do i make this work?
By appending all the various enumerated types by copy/pasting out of types.e until a text line passed as TRUE, i changed my source to:
if char_test(dataline,{{' ', '~'},{' ', '/'},{':', '?'},{'[', '`'},{'{', '~'},{' ', '~'}, " ", "\t\t", "\n\n", "\r\r", {8,8}, {7,7}})
You can use get_charsets() and then use CS_Printable to index it, or more simply just use the t_print() function, which is also in the library. There is a library function for each of the predefined character sets.
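As a rough sketch of the simpler route, assuming Euphoria 4.0 with std/types.e on the include path, either of the following should work. The dataline value here is a shortened stand-in for the real input, not the line from the original program:

```euphoria
include std/types.e

-- Stand-in test data; the real program reads its lines from a file.
sequence dataline = "a dog can fight."

-- Option 1: the predefined test for the printable set.
if t_print(dataline) then
    puts(1, "t_print: every character is in the range 32 to 126\n")
end if

-- Option 2: pass the range explicitly as a sequence of {low, high} pairs.
if char_test(dataline, {{' ', '~'}}) then
    puts(1, "char_test: every character is in the range 32 to 126\n")
end if
```

One thing to watch for: a line read with gets() still carries its trailing newline (byte 10), which is outside 32 to 126, so it may need trimming before either test can pass.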