1. bug in char_test() ?


I decided to give the char_test() in types.e some work in an app. The app's job is to strip out and separate non-English and non-printable data, merely to lessen the computer's workload and narrow the focus for now. There are gigabytes of it, which is too much data right now.

I started by asking some questions in #euphoria irc channel:

Is everything in CS_Printable printable, and all else isn't?
So i can use it to filter out the Korean, Chinese, Afghani, Iranian, Polish, Turkish, etc?

I got no answers, i coded it up anyhow, and it broke:

My source code is: if char_test(dataline,CS_Printable)

The error printed to screen by the interpreter is:
c:\Euphoria-4.0.4\include\std\types.e:191 in function char_test()
type_check failure, char_set is 6

dataline = /a/[/r/CapableOf/,/c/en/dog/,/c/en/fight/] /r/CapableOf /c/en/dog /c/en/fight /ctx/all 1 /s/activity/globalmind/assert,/s/contributor/globalmind/ks/maybe /e/624633f49ed73af19503a5cd75c97210b58ea47f /d/globalmind a dog can fight.

types.e line 191 = public function char_test(object test_data, sequence char_set)

i called it with char_test(dataline,CS_Printable)
where CS_Printable = { {' ', '~'} } per line 253 of types.e


The code ran 100's of times before this crash, and nowhere did i redefine CS_Printable to be 6.

useless

Edit:
CS_Printable is enumerated as 6, it's set in
public procedure set_default_charsets()
but Defined_Sets isn't public, so i cannot call
if char_test(dataline,Defined_Sets[CS_Printable]) then
or
if char_test(dataline,types:Defined_Sets[CS_Printable]) then

So how do i make this work?
By appending all the various enumerated types by copy/pasting out of types.e until a text line passed as TRUE, i changed my source to:

if char_test(dataline,{{' ', '~'},{' ', '/'},{':', '?'},{'[', '`'},{'{', '~'},{' ', '~'}, "  ", "\t\t", "\n\n", "\r\r", {8,8}, {7,7}}) 



useless


2. Re: bug in char_test() ?

useless_ said...

The code ran 100's of times before this crash, and nowhere did i redefine CS_Printable to be 6.

It should not have run at all.

useless_ said...

Edit:
CS_Printable is enumerated as 6, it's set in
public procedure set_default_charsets()
but Defined_Sets isn't public, so i cannot call
if char_test(dataline,Defined_Sets[CS_Printable]) then
or
if char_test(dataline,types:Defined_Sets[CS_Printable]) then

So how do i make this work?
By appending all the various enumerated types by copy/pasting out of types.e until a text line passed as TRUE, i changed my source to:

if char_test(dataline,{{' ', '~'},{' ', '/'},{':', '?'},{'[', '`'},{'{', '~'},{' ', '~'}, "  ", "\t\t", "\n\n", "\r\r", {8,8}, {7,7}}) 

It seems to me that you have a bit of redundancy there in your ranges. Several ranges have the same start, a few have the same end, and two are identical. Programmatically, you can use get_charsets() to get the various sets that are defined, and then you can put them together as required.
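A minimal sketch of that approach follows. It assumes get_charsets() returns a sequence of {id, definition} pairs, as described in the std/types.e documentation. The helper find_charset() is hypothetical and not part of the library; it searches by id rather than relying on the position of each pair.

```euphoria
include std/types.e

-- Hypothetical helper (not in std/types.e): look up one predefined
-- character set by its enum id.
-- Assumption: get_charsets() returns a sequence of {id, definition} pairs.
function find_charset(integer id)
    sequence sets = get_charsets()
    for i = 1 to length(sets) do
        if equal(sets[i][1], id) then
            return sets[i][2]
        end if
    end for
    return {}   -- id not found
end function

sequence printable = find_charset(CS_Printable)
sequence dataline = "a dog can fight."

-- char_test() needs a sequence of ranges as its second argument,
-- which is what the lookup above supplies.
if length(printable) and char_test(dataline, printable) then
    puts(1, "printable\n")
else
    puts(1, "not printable\n")
end if
```

Several sets retrieved this way can be joined with & before being passed to char_test().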

Matt


3. Re: bug in char_test() ?

mattlewis said...
useless_ said...

The code ran 100's of times before this crash, and nowhere did i redefine CS_Printable to be 6.

It should not have run at all.

useless_ said...

Edit:
CS_Printable is enumerated as 6, it's set in
public procedure set_default_charsets()
but Defined_Sets isn't public, so i cannot call
if char_test(dataline,Defined_Sets[CS_Printable]) then
or
if char_test(dataline,types:Defined_Sets[CS_Printable]) then

So how do i make this work?
By appending all the various enumerated types by copy/pasting out of types.e until a text line passed as TRUE, i changed my source to:

if char_test(dataline,{{' ', '~'},{' ', '/'},{':', '?'},{'[', '`'},{'{', '~'},{' ', '~'}, "  ", "\t\t", "\n\n", "\r\r", {8,8}, {7,7}}) 

It seems to me that you have a bit of redundancy there in your ranges. Several ranges have the same start, a few have the same ends, two are identical.

Yes, i know. I made every attempt to make the app work and no attempt to optimise it.

mattlewis said...

Programmatically, you can use get_charsets() to get the various sets that are defined, and then you can put them together as required.

Matt


So not being able to use them by name directly is done on purpose? Why? Namespace pollution? I guess i missed the line in the manual about that. To me, get_charsets() seemed to be for modifying the charsets, and retrieving the charset(s) so i could send it(s) right back in char_test() didn't occur to me. I didn't see what was in each charset by reading http://openeuphoria.org/docs/std_types.html , i discovered them by accident when reading std/types.e.

useless


4. Re: bug in char_test() ?

useless_ said...

So not being able to use them by name directly is done on purpose? Why? Namespace pollution?

It seems that way. I suspect to prevent accidentally changing them without using set_charsets(). I wasn't involved in its development, so this is all speculation on my part.

useless_ said...

I guess i missed the line in the manual about that. To me, get_charsets() seemed to be for modifying the charsets, and retrieving the charset(s) so i could send it(s) right back in char_test() didn't occur to me. I didn't see what was in each charset by reading http://openeuphoria.org/docs/std_types.html , i discovered them by accident when reading std/types.e.

The enum names seem pretty descriptive to me, though the documentation probably isn't clear enough about connecting those to the sets. I don't know that I've ever really looked at that stuff until you posted about it.

Matt


5. Re: bug in char_test() ?

Kat said...


I decided to give the char_test() in types.e some work in an app. The app's job is to strip out and separate non-English and non-printable data, merely to lessen the computer's workload and narrow the focus for now. There are gigabytes of it, which is too much data right now.

I started by asking some questions in #euphoria irc channel:

Is everything in CS_Printable printable, and all else isn't?
So i can use it to filter out the Korean, Chinese, Afghani, Iranian, Polish, Turkish, etc?

The printable attribute in this library strictly refers to the bytes in the range 32 to 126 inclusive. These are the ASCII characters from SPACE to TILDE. This library does NOT know about any other character encoding schemes other than ASCII.
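To make the byte range concrete, here is a small sketch that does the same 32 to 126 check by hand, without the library. The function name all_printable_ascii() is made up for illustration. Note what this kind of test can and cannot do: it rejects text containing non-ASCII bytes (for example UTF-8 encoded Polish diacritics), but it cannot tell English from any other language that happens to be written using only ASCII characters.

```euphoria
-- Hypothetical helper: returns 1 if every element of s is an integer
-- in the range 32 (SPACE) to 126 (TILDE), otherwise 0.
function all_printable_ascii(sequence s)
    for i = 1 to length(s) do
        if not integer(s[i]) or s[i] < 32 or s[i] > 126 then
            return 0
        end if
    end for
    return length(s) > 0   -- treat an empty line as not printable
end function

-- Plain ASCII passes the check.
printf(1, "%d\n", all_printable_ascii("a dog can fight."))

-- #C5, #BC is the UTF-8 encoding of the Polish letter z-with-dot-above;
-- both bytes are above 126, so the check fails.
printf(1, "%d\n", all_printable_ascii({#C5, #BC} & "aba"))

-- A tab (byte 9) is below 32, so it also fails.
printf(1, "%d\n", all_printable_ascii("a\tb"))
```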

Kat said...

I got no answers, i coded it up anyhow, and it broke:

My source code is: if char_test(dataline,CS_Printable)

The error printed to screen by the interpreter is:
c:\Euphoria-4.0.4\include\std\types.e:191 in function char_test()
type_check failure, char_set is 6

Of course this would fail in this manner. The documentation states that the second argument to char_test must be a sequence and that CS_Printable is an enum.

Kat said...

dataline = /a/[/r/CapableOf/,/c/en/dog/,/c/en/fight/] /r/CapableOf /c/en/dog /c/en/fight /ctx/all 1 /s/activity/globalmind/assert,/s/contributor/globalmind/ks/maybe /e/624633f49ed73af19503a5cd75c97210b58ea47f /d/globalmind a dog can fight.

types.e line 191 = public function char_test(object test_data, sequence char_set)

i called it with char_test(dataline,CS_Printable)
where CS_Printable = { {' ', '~'} } per line 253 of types.e


It would appear that you have mis-read that line of code. The line actually reads ...

Defined_Sets[CS_Printable] = {{' ', '~'}}

which is NOT saying that CS_Printable is a sequence but that Defined_Sets indexed by CS_Printable is a sequence.
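A quick way to see the difference is to print the enum itself. This is only an illustrative sketch; the exact integer depends on the enum order in the installed std/types.e (the error message above reports 6 for Euphoria 4.0.4).

```euphoria
include std/types.e

-- CS_Printable is an enum member, i.e. a plain integer used as an index,
-- not the character set definition itself.
printf(1, "CS_Printable = %d\n", CS_Printable)

-- sequence() is the built-in type test; this shows whether the value
-- would pass the 'sequence char_set' parameter check of char_test().
printf(1, "is a sequence? %d\n", sequence(CS_Printable))
```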

Kat said...

The code ran 100's of times before this crash, and nowhere did i redefine CS_Printable to be 6.

I can only guess that your call to char_test was not executed every time, but eventually it was executed and that's when the program crashed.

Kat said...

Edit:
CS_Printable is enumerated as 6, it's set in
public procedure set_default_charsets()
but Defined_Sets isn't public, so i cannot call
if char_test(dataline,Defined_Sets[CS_Printable]) then
or
if char_test(dataline,types:Defined_Sets[CS_Printable]) then

I coded the library this way because Euphoria does not implement read-only data. I did not want the predefined character sets to be casually modifiable by the coder. The library has two helper routines, get_charsets() and set_charsets(). The first returns a list of all the predefined sets. The second allows the coder to modify one or more sets.

Kat said...

So how do i make this work?
By appending all the various enumerated types by copy/pasting out of types.e until a text line passed as TRUE, i changed my source to:

if char_test(dataline,{{' ', '~'},{' ', '/'},{':', '?'},{'[', '`'},{'{', '~'},{' ', '~'}, "  ", "\t\t", "\n\n", "\r\r", {8,8}, {7,7}}) 

You can use get_charsets() and then use CS_Printable to index the result, or more simply just use the t_print() function, which is also in the library. There is a library function for each of the predefined character sets.
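As a rough sketch of the simpler route, assuming t_print() tests its argument against the predefined printable set as described above, the original filter line could look something like this. The sample dataline is shortened from the one in the first post.

```euphoria
include std/types.e

sequence dataline = "/c/en/dog /c/en/fight a dog can fight."

-- t_print() is declared as a type in std/types.e, and a type can be
-- called like a function that returns true or false.
if t_print(dataline) then
    puts(1, "keep\n")    -- every character is in the printable set
else
    puts(1, "skip\n")    -- at least one character is outside it
end if
```

This avoids touching Defined_Sets or building a range list by hand.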
