OpenEuphoria: Forum: Integrating string types into EUPHORIA handling

1. Integrating string types into EUPHORIA handling

Posted by SDPringle May 04, 2014

Forked from Re: Euphoria vs The Other Guys

DerekParnell said...

SDPringle said...

... Strings are easily implemented as a UDT.

Not quite. UDTs have some restrictions. They are only used during assignments or explicit if UDT(x) tests. We can't tailor other operations to deal with UDTs, nor can we arbitrarily have existing library functions know what to do with a UDT when they encounter one. That, of course, applies to built-in types too.

What I really meant is that we can use UDTs for assignments and explicit checks. In fact, the standard library has 'string', so I should have said a UDT already exists there. Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA? I don't think we need it.
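To make the "assignments and explicit checks" point concrete, here is a minimal sketch. The type name byte_text is made up for illustration (the standard library's 'string' plays a similar role), and the exact range check is an assumption.

```euphoria
-- byte_text is a hypothetical UDT used only for this illustration.
type byte_text(object x)
    if not sequence(x) then
        return 0
    end if
    for i = 1 to length(x) do
        if not integer(x[i]) or x[i] < 0 or x[i] > 255 then
            return 0
        end if
    end for
    return 1
end type

byte_text s = "hello"                       -- the check runs on this assignment

-- Explicit tests: a type can be called like a function returning 1 or 0.
printf(1, "%d\n", byte_text("abc"))         -- prints 1
printf(1, "%d\n", byte_text({65, {66}}))    -- prints 0 (contains a subsequence)

-- Operators know nothing about byte_text. This is ordinary element-wise
-- sequence arithmetic; the result would only be checked if it were
-- assigned back to a byte_text variable.
sequence t = s + 1000
printf(1, "%d\n", byte_text(t))             -- prints 0
```

Nothing stops the last addition from running, which is the restriction Derek describes: the type only guards what gets stored in a byte_text variable.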

DerekParnell said...

SDPringle said...

If added as a builtin type, you would only need to modify the parser part.

Actually, it turns out to be a lot more complicated than that. A new built-in type has semantic implications that would undoubtedly affect nearly every aspect of the existing language operations.

One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int. If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.
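The 9 versus 16 figure is just the number of (left operand, right operand) representation pairs. A small sketch that counts them, where "string" stands for a hypothetical fourth representation and case_count is an illustrative helper:

```euphoria
-- Number of operand-type pairs a binary operation has to handle
-- when each operand can be any one of the given representations.
function case_count(sequence reps)
    return length(reps) * length(reps)
end function

sequence reps = {"int", "struct d", "struct s1"}
printf(1, "%d representations -> %d cases\n", {length(reps), case_count(reps)})

-- "string" here stands for a hypothetical fourth representation.
reps = append(reps, "string")
printf(1, "%d representations -> %d cases\n", {length(reps), case_count(reps)})

-- List the pairs that the fourth representation would add.
for i = 1 to length(reps) do
    for j = 1 to length(reps) do
        if i = 4 or j = 4 then
            printf(1, "%s op %s\n", {reps[i], reps[j]})
        end if
    end for
end for
```

The loop lists seven new pairs, which is the difference between 16 and 9.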

DerekParnell said...

SDPringle said...

... working with a string is the same as working with a sequence of integers, or a sequence of anythings.

Again, only if you are talking about sequence operations, but even then we would need to add some extra semantics to some operations. For example, adding an atom to a String type should do what, exactly? Issue a runtime error, maybe? Then there's the whole complication of how to output String data. If the string is going to a console device, we would probably need to convert it to UTF-8. But what about output to a file? Or output to a windowed application: UTF-16 or ASCII? Now consider functions like upper(), etc.; these need special handling for Unicode strings.
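As an illustration of the conversion step mentioned for console output, here is a rough sketch of encoding a sequence of code points as UTF-8 bytes. The function name to_utf8 is hypothetical, and the sketch assumes its input is a flat sequence of valid code points (it does not reject surrogates or values above #10FFFF).

```euphoria
-- Hypothetical helper: encode a sequence of Unicode code points as UTF-8 bytes.
function to_utf8(sequence codepoints)
    sequence bytes = {}
    integer cp
    for i = 1 to length(codepoints) do
        cp = codepoints[i]
        if cp < #80 then
            -- 1 byte: plain ASCII
            bytes &= cp
        elsif cp < #800 then
            -- 2 bytes: 110xxxxx 10xxxxxx
            bytes &= {#C0 + floor(cp / #40),
                      #80 + and_bits(cp, #3F)}
        elsif cp < #10000 then
            -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes &= {#E0 + floor(cp / #1000),
                      #80 + and_bits(floor(cp / #40), #3F),
                      #80 + and_bits(cp, #3F)}
        else
            -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            bytes &= {#F0 + floor(cp / #40000),
                      #80 + and_bits(floor(cp / #1000), #3F),
                      #80 + and_bits(floor(cp / #40), #3F),
                      #80 + and_bits(cp, #3F)}
        end if
    end for
    return bytes
end function

-- U+00E9 (e with acute accent) becomes the two bytes #C3 #A9.
puts(1, to_utf8({'c', 'a', 'f', #E9}) & "\n")
```

Whether the bytes display correctly still depends on the terminal's encoding, which is part of the complication being described.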


I have thought this through along the same path. I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.

I pass the buck to library implementations and that means us with the Standard Library. If we had different UDTs for different encodings we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell if they were encoded as we normally encode strings if you passed the wrong kind of string to the wrong place. You could have magic numbers in strings but they would be magic, string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly library implementors should always specify what kind of string they are expecting for a given routine.
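A small sketch of why the type system cannot always tell encodings apart. The names ascii_string and byte_string and the helper all_in_range are illustrative assumptions, not standard library definitions.

```euphoria
-- Returns 1 if x is a flat sequence of integers within lo..hi.
function all_in_range(object x, integer lo, integer hi)
    if not sequence(x) then
        return 0
    end if
    for i = 1 to length(x) do
        if not integer(x[i]) or x[i] < lo or x[i] > hi then
            return 0
        end if
    end for
    return 1
end function

-- Hypothetical encoding-specific UDTs.
type ascii_string(object x)
    return all_in_range(x, 0, 127)
end type

type byte_string(object x)
    return all_in_range(x, 0, 255)
end type

-- The same two bytes: one character if read as UTF-8,
-- two characters if read as ISO 8859-1.
sequence as_utf8   = {#C3, #A9}
sequence as_latin1 = {#C3, #A9}

printf(1, "%d\n", byte_string(as_utf8))         -- prints 1
printf(1, "%d\n", byte_string(as_latin1))       -- prints 1
printf(1, "%d\n", equal(as_utf8, as_latin1))    -- prints 1
printf(1, "%d\n", ascii_string(as_utf8))        -- prints 0
```

A range check can separate ASCII from everything else, but it cannot say which 8-bit encoding the author of the data had in mind. That is why the routine's documentation has to say what it expects.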


2. Re: Integrating string types into EUPHORIA handling

Posted by jimcbrown (admin) May 04, 2014

SDPringle said...

Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA?

Isn't the second part (operator overloading) automatically covered by the first part (cannot use UDT for things other than assignments and tests)?

SDPringle said...

DerekParnell said...

SDPringle said...

If added as a builtin type, you would only need to modify the parser part.

Actually, it turns out to be a lot more complicated than that. A new built-in type has semantic implications that would undoubtedly affect nearly every aspect of the existing language operations.

One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int.

When would we make the special call? Are you saying that we should make string a sort of 'built-in' UDT, handled and implemented the same way as other UDTs are, but in the same builtin namespace that the other builtin types are?

SDPringle said...

If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.

A 'built-in' UDT wouldn't require any changes to binary operations.

If it's not a 'built-in' UDT but a true new datatype, then there'd be a lot more changes, as Derek says.

SDPringle said...

I have thought this through along the same path. I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.

This is doable without adding a new builtin type. We'd just add a new C type, say C_STRING in addition to C_POINTER and C_CHAR. The user passes that type into define_c_func/proc and then later when the user calls c_func/c_proc, the callc code would take a sequence, validate it (so no subsequences in that sequence, for example), and convert it to a NUL-terminated char array.

Or, for C_WSTRING, do conversion into UTF-16. For rare cases, C_WWSTRING would convert to UTF-32, while C_USTRING would do UTF-8.
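The validate-and-convert step can be sketched in plain Euphoria. This is not the actual callc code; the helper names is_flat, to_c_string_bytes and to_utf16_units are hypothetical, and returning -1 on invalid input is just one possible way to signal the error.

```euphoria
-- Returns 1 if s contains only atoms (no subsequences).
function is_flat(sequence s)
    for i = 1 to length(s) do
        if sequence(s[i]) then
            return 0
        end if
    end for
    return 1
end function

-- C_STRING idea: flat sequence -> byte values with a trailing NUL.
function to_c_string_bytes(sequence s)
    if not is_flat(s) then
        return -1       -- the caller decides how to report the error
    end if
    return s & 0
end function

-- C_WSTRING idea: code points -> UTF-16 code units with a trailing NUL.
-- Anything above #FFFF is split into a surrogate pair.
function to_utf16_units(sequence codepoints)
    sequence units = {}
    integer cp
    if not is_flat(codepoints) then
        return -1
    end if
    for i = 1 to length(codepoints) do
        cp = codepoints[i]
        if cp < #10000 then
            units &= cp
        else
            cp -= #10000
            units &= {#D800 + floor(cp / #400),
                      #DC00 + and_bits(cp, #3FF)}
        end if
    end for
    return units & 0
end function

? to_c_string_bytes("hi")           -- the two characters followed by 0
? to_c_string_bytes({"nested"})     -- -1, contains a subsequence
? to_utf16_units({#1F600})          -- a surrogate pair followed by 0
```

A real implementation would then copy the values into allocated memory and free it after the call.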

SDPringle said...

I pass the buck to library implementations and that means us with the Standard Library.

I'm confused. Should string be a builtin type or not?

SDPringle said...

If we had different UDTs for different encodings we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell if they were encoded as we normally encode strings if you passed the wrong kind of string to the wrong place. You could have magic numbers in strings but they would be magic, string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly library implementors should always specify what kind of string they are expecting for a given routine.

I like the KISS principle myself. But maybe that's just me.


3. Re: Integrating string types into EUPHORIA handling

Posted by DerekParnell (admin) May 04, 2014

SDPringle said...

... Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA?

Both.

string A 
string B 
string C 
object D 
 
A = "Some text" 
B = "And here is some more text" 
 
C = A + B  -- Should the parser or runtime allow this? Is it addition or concatenation? 
 
D = A * B  -- And what about this (assigning to an 'object')?
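For comparison, this is how the operators behave on plain sequences today, which is the baseline any String semantics would have to depart from or keep. The variable names are only for this sketch.

```euphoria
sequence a = "abc"
sequence b = "xyz"

puts(1, (a & b) & "\n")     -- & concatenates: abcxyz
puts(1, (a + 1) & "\n")     -- + with an atom is element-wise: bcd
? a + b                     -- + with an equal-length sequence is element-wise
```

With plain sequences, + on two sequences of different lengths (as in the A + B example above) stops with a runtime error, since element-wise operations require matching lengths.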

SDPringle said...

One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int. If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.

Yes, and don't forget the translator too. It emits code depending on the data types of the operands.

SDPringle said...

I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.

I pass the buck to library implementations and that means us with the Standard Library. If we had different UDTs for different encodings we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell if they were encoded as we normally encode strings if you passed the wrong kind of string to the wrong place. You could have magic numbers in strings but they would be magic, string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly library implementors should always specify what kind of string they are expecting for a given routine.

I too feel that string handling, especially for Unicode, needs to be library based rather than built into the language. However, we should also not be reinventing the wheel. There exist a few good C-based Unicode libraries already, and we should leverage those; the best way to do that is to statically link the Unicode library into the EUI executable. An alternative would be to create a DLL shim and load the Unicode library at runtime as needed. Either way, the Unicode-specific functionality would not be coded in Euphoria, only the bare minimum required to interface with the third-party routines.
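The runtime-loading alternative might look roughly like the sketch below. The library file name and the routine name are placeholders, not a real library's API; only open_dll, define_c_func, c_func and the C_INT constant come from Euphoria itself.

```euphoria
include std/dll.e

-- Placeholder names: substitute the shim and routine of a real Unicode library.
constant LIB_NAME  = "libunicode_shim.so"
constant FUNC_NAME = "shim_to_upper"

-- open_dll returns 0 when the library cannot be loaded.
atom lib = open_dll(LIB_NAME)
integer rid = -1
if lib != 0 then
    rid = define_c_func(lib, FUNC_NAME, {C_INT}, C_INT)
end if

-- Upper-case one code point through the external library, or return it
-- unchanged when the library or routine is not available.
function unicode_upper_char(integer cp)
    if rid = -1 then
        return cp
    end if
    return c_func(rid, {cp})
end function

printf(1, "%d\n", unicode_upper_char('a'))
```

Only the thin wrapper lives in Euphoria; the case-mapping tables and rules stay in the third-party library.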


4. Re: Integrating string types into EUPHORIA handling

Posted by SDPringle May 04, 2014

Just to be clear, I was not advocating creating a builtin string type. It was rather a brainstorming post. The stuff should be done in the library routines. We already have cstring and string in the library. Come to think of it, maybe we should make the std/regex.e stuff take cstring rather than string.

Things like lower() and upper() should be worked out to use Unicode. Since they already work using encodings, we can add UTF-32 as one of the encodings. I say we can put the functionality of std/unicode.e into std/text.e this way.

Shawn


5. Re: Integrating string types into EUPHORIA handling

Posted by ghaberek (admin) May 05, 2014

jimcbrown said...

This is doable without adding a new builtin type. We'd just add a new C type, say C_STRING in addition to C_POINTER and C_CHAR. The user passes that type into define_c_func/proc and then later when the user calls c_func/c_proc, the callc code would take a sequence, validate it (so no subsequences in that sequence, for example), and convert it to a NUL-terminated char array.

Or, for C_WSTRING, do conversion into UTF-16. For rare cases, C_WWSTRING would convert to UTF-32, while C_USTRING would do UTF-8.

euadvlib does this. :)

override function c_func( integer rid, sequence args = {} ) 
     
    -- get the argument types and return type 
    sequence arg_types = map:get( m_args, rid, {} ) 
    atom return_type = map:get( m_retv, rid, 0 ) 
     
    -- make sure we have the correct number of arguments 
    if length( args ) = length( arg_types ) then 
         
        for i = 1 to length( args ) do 
             
            -- allocate strings as necessary 
             
            if sequence( args[i] ) and arg_types[i] = C_STRING then 
                args[i] = allocate_string( args[i], 1 ) 
            elsif sequence( args[i] ) and arg_types[i] = C_WSTRING then 
                args[i] = allocate_wstring( args[i], 1 ) 
            end if 
             
        end for 
         
    end if 
     
    -- call the function 
    object result = eu:c_func( rid, args ) 
     
    -- fetch strings as necessary 
    if return_type = C_STRING then 
        result = peek_string( result ) 
    elsif return_type = C_WSTRING then 
        result = peek_wstring( result ) 
    end if 
     
    return result 
end function

-Greg


6. Re: Integrating string types into EUPHORIA handling

Posted by jimcbrown (admin) May 05, 2014

ghaberek said...

euadvlib does this. :)

That's pretty cool. I'd love to see more of euadvlib added to the stdlib!

