1. Integrating string types into EUPHORIA handling
- Posted by SDPringle May 04, 2014
- 1542 views
Forked from Re: Euphoria vs The Other Guys
... Strings are easily implemented as a UDT.
Not quite. UDTs have some restrictions. They are only used during assignments or explicit if UDT(x) tests. We can't tailor other operations to deal with UDTs, nor can we arbitrarily have existing library functions know what to do with a UDT when they encounter one. That, of course, applies to built-in types too.
I really mean we can use UDTs for assignments and explicit checks. In fact the standard library has 'string' so I should have said a UDT already exists there. Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA? I don't think we need to.
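To make the "assignments and explicit checks" point concrete, here is a minimal sketch of a string-like UDT. The name flat_string is hypothetical (chosen so it does not clash with the 'string' type already in the standard library), and the 0..255 range is an assumption for the sake of the example, not a statement about how the library type is defined.

```euphoria
-- Hypothetical UDT for illustration only; not the standard library's 'string'.
type flat_string(object x)
    if not sequence(x) then
        return 0
    end if
    for i = 1 to length(x) do
        -- reject nested sequences and non-integer atoms
        if not integer(x[i]) then
            return 0
        end if
        -- assumed range: one byte per character
        if x[i] < 0 or x[i] > 255 then
            return 0
        end if
    end for
    return 1
end type

flat_string s = "hello"        -- the type routine runs on this assignment
? flat_string("hello")         -- explicit test, prints 1
? flat_string({1, {2}, 3})     -- explicit test, prints 0 (nested sequence)
```

Nothing else consults the type routine: an expression such as s + 1 is evaluated as ordinary sequence arithmetic, and the check only runs again if the result is assigned back to a flat_string variable.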
If added as a builtin type, you would only need to modify the parser part.
Actually, it turns out to be a lot more complicated than that. A new built-in type has semantic implications that would undoubtedly affect nearly every aspect of the existing language operations.
One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int. If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.
... working with a string is the same as working with a sequence of integers, or a sequence of anythings.
Again, only if you are talking about sequence operations, but even then we would need to add some extra semantics to some operations. For example, adding an atom to a String type should do what, exactly? Issue a runtime error, maybe? Then there's the whole complication of how to output String data. If the string is going to a console device, we would probably need to convert it to UTF8. But what about output to a file? Or output to a windowed application: UTF16 or ASCII? Now consider functions like upper() etc.; these need special handling for Unicode strings.
I have thought this through along the same path. I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.
I pass the buck to library implementations, and that means us with the Standard Library. If we had a different UDT for each encoding, we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell, given the way we normally encode strings, whether you passed the wrong kind of string to the wrong place. You could have magic numbers in strings, but then they would be magic string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly, library implementors should always specify what kind of string they are expecting for a given routine.
2. Re: Integrating string types into EUPHORIA handling
- Posted by jimcbrown (admin) May 04, 2014
- 1516 views
Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA?
Isn't the second part (operator overloading) automatically covered by the first part (cannot use UDT for things other than assignments and tests)?
If added as a builtin type, you would only need to modify the parser part.
Actually, it turns out to be a lot more complicated than that. A new built-in type has semantic implications that would undoubtedly affect nearly every aspect of the existing language operations.
One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int.
When would we make the special call? Are you saying that we should make string a sort of 'built-in' UDT, handled and implemented the same way as other UDTs are, but living in the same builtin namespace as the other builtin types?
If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.
A 'built-in' UDT wouldn't require any changes to binary operations.
If it's not a 'built-in' UDT but a true new datatype, then there'd be a lot more changes, as Derek says.
I have thought this through along the same path. I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.
This is doable without adding a new builtin type. We'd just add a new C type, say C_STRING, in addition to C_POINTER and C_CHAR. The user passes that type into define_c_func/proc, and later, when the user calls c_func/c_proc, the callc code would take a sequence, validate it (so no subsequences in that sequence, for example), and convert it to a NUL-terminated char array.
Or, for C_WSTRING, do conversion into UTF-16. For rare cases, C_WWSTRING would convert to UTF-32, while C_USTRING would do UTF-8.
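As a rough user-level sketch of the validate-and-convert step described above (the real thing would live inside the callc code, which this does not attempt to show), the two helper names is_c_string and to_c_string are hypothetical. allocate_string and free come from std/machine.e, and peek_string is a built-in.

```euphoria
include std/machine.e   -- allocate_string(), free()

-- Hypothetical helper: 1 if s has no subsequences and every element is a
-- non-zero byte value (an embedded 0 would cut the C string short).
function is_c_string(sequence s)
    for i = 1 to length(s) do
        if not integer(s[i]) then
            return 0
        end if
        if s[i] < 1 or s[i] > 255 then
            return 0
        end if
    end for
    return 1
end function

-- Hypothetical helper: validate, then copy into NUL-terminated memory.
-- Returns the address, or 0 if the sequence is not a valid C string.
function to_c_string(sequence s)
    if not is_c_string(s) then
        return 0
    end if
    return allocate_string(s)
end function

atom p = to_c_string("hello")
if p != 0 then
    puts(1, peek_string(p) & "\n")   -- read it back to confirm the round trip
    free(p)
end if
? to_c_string({1, {2}, 3})           -- prints 0: rejected, contains a subsequence
```

A C_WSTRING or C_USTRING variant would differ only in the range check and in the encoding step before the copy.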
I pass the buck to library implementations, and that means us with the Standard Library.
I'm confused. Should string be a builtin type or not?
If we had a different UDT for each encoding, we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell, given the way we normally encode strings, whether you passed the wrong kind of string to the wrong place. You could have magic numbers in strings, but then they would be magic string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly, library implementors should always specify what kind of string they are expecting for a given routine.
I like the KISS principle myself. But maybe that's just me.
3. Re: Integrating string types into EUPHORIA handling
- Posted by DerekParnell (admin) May 04, 2014
- 1477 views
... Did you mean we cannot use UDT routines for things other than assignments and tests or did you mean we don't have operator overloading in EUPHORIA?
Both.
    string A
    string B
    string C
    object D

    A = "Some text"
    B = "And here is some more text"
    C = A + B -- Should the parser or runtime allow this? Is it addition or concatenation?
    D = A * B -- And what about this (assigning to an 'object')?
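For comparison, this is what the same operators do today on plain sequences, which is the behaviour a new String type would have to either keep or override. It is a small sketch using only built-ins; the variable names are arbitrary.

```euphoria
sequence a = "abc"
sequence b = "xyz"

? equal(a & b, "abcxyz")   -- & is concatenation: prints 1
? equal(a + 1, "bcd")      -- an atom is added to every element: prints 1
? a + b                    -- element-wise addition: prints {217,219,221}

-- a + "longer" would stop with a runtime error, because element-wise
-- operations require both sequences to have the same length.
```

So with ordinary sequences, + is always arithmetic and never concatenation, and the two strings in the example above would fail at runtime simply because their lengths differ.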
One could implement string the same way as sequence is. That is to say use the same structure. We would need to add a special call to make sure it is a valid string but other than that just use struct s1. We have three representations for objects: struct s1, struct d, and int. If you add another we go from binary operations having to handle 9 cases to operations having to handle 16.
Yes, and don't forget the translator too. It emits code depending on the data types of the operands.
I would love for the interpreter to magically help the user and re-encode the strings to the way they ought to be when calling C routines.
I pass the buck to library implementations, and that means us with the Standard Library. If we had a different UDT for each encoding, we could have ascii_string, codepage800string, utf_string, etc. The EUPHORIA type system could not always tell, given the way we normally encode strings, whether you passed the wrong kind of string to the wrong place. You could have magic numbers in strings, but then they would be magic string packages instead of just strings. We could just leave things simple and let the users figure it out. Most importantly, library implementors should always specify what kind of string they are expecting for a given routine.
I too feel that string handling, especially for Unicode, needs to be library-based rather than built into the language. However, we also should not be reinventing the wheel. There exist a few good C-based Unicode libraries already and we should leverage those, and the best way to do that is to statically link the Unicode library into the EUI executable. An alternative would be to create a DLL shim and load the Unicode library at runtime as needed. Either way, the Unicode-specific functionality would not be coded in Euphoria, only the bare minimum required to interface with the third-party routines.
4. Re: Integrating string types into EUPHORIA handling
- Posted by SDPringle May 04, 2014
- 1440 views
Just to be clear, I was not advocating creating a builtin string type. It was rather a brainstorming post. The work should be done in the library routines. We already have cstring and string in the library. Come to think of it, maybe we should make the std/regex.e stuff take cstring rather than string.
Things like lower() and upper() should be worked out to use Unicode. Since this already works using encodings, we can make utf32 one of the encodings. I say we can put the functionality of std/unicode.e into std/text.e this way.
Shawn
5. Re: Integrating string types into EUPHORIA handling
- Posted by ghaberek (admin) May 05, 2014
- 1402 views
This is doable without adding a new builtin type. We'd just add a new C type, say C_STRING, in addition to C_POINTER and C_CHAR. The user passes that type into define_c_func/proc, and later, when the user calls c_func/c_proc, the callc code would take a sequence, validate it (so no subsequences in that sequence, for example), and convert it to a NUL-terminated char array.
Or, for C_WSTRING, do conversion into UTF-16. For rare cases, C_WWSTRING would convert to UTF-32, while C_USTRING would do UTF-8.
euadvlib does this.
    override function c_func( integer rid, sequence args = {} )

        -- get the argument types and return type
        sequence arg_types = map:get( m_args, rid, {} )
        atom return_type = map:get( m_retv, rid, 0 )

        -- make sure we have the correct number of arguments
        if length( args ) = length( arg_types ) then
            for i = 1 to length( args ) do
                -- allocate strings as necessary
                if sequence( args[i] ) and arg_types[i] = C_STRING then
                    args[i] = allocate_string( args[i], 1 )
                elsif sequence( args[i] ) and arg_types[i] = C_WSTRING then
                    args[i] = allocate_wstring( args[i], 1 )
                end if
            end for
        end if

        -- call the function
        object result = eu:c_func( rid, args )

        -- fetch strings as necessary
        if return_type = C_STRING then
            result = peek_string( result )
        elsif return_type = C_WSTRING then
            result = peek_wstring( result )
        end if

        return result
    end function
-Greg
6. Re: Integrating string types into EUPHORIA handling
- Posted by jimcbrown (admin) May 05, 2014
- 1400 views
euadvlib does this.
That's pretty cool. I'd love to see more of euadvlib added to the stdlib!