1. better flatten()?
- Posted by petelomax Jul 26, 2015
- 1713 views
I decided it was about time I added flatten() to Phix, so I had a look at the one in std\sequence.e...
public function flatten(sequence s, object delim = "") sequence ret object x integer len integer pos sequence temp if atom(delim) then delim = {delim} end if ret = s pos = 1 len = length(ret) while pos<=len do x = ret[pos] if sequence(x) then if length(delim)=0 then ret = ret[1..pos-1] & flatten(x) & ret[pos+1..$] else temp = ret[1..pos-1] & flatten(x) if pos!=length(ret) then ret = temp & delim & ret[pos+1..$] else ret = temp & ret[pos+1..$] end if end if len = length(ret) else pos += 1 end if end while return ret end function
Hmmm... after staring at it incredulously for a while, I finally grasped that what it was doing was to flatten(x) then quite pointlessly check that the just processed flatten(x)[2..$] were indeed all atoms and do nothing to them. Unless it is part of the plan that delim can be a nested sequence which you want to it to repeatedly flatten, which is equally ridiculous. It is also shunting the end of s/ret around on every iteration for no good reason. Anyway, I decided it was utter shite, so I wrote my own. Not entirely surprisingly (to me) it ran between 50 and 250 times faster.
Does anyone want to check that this does what the docs say it should and is indeed significantly faster?
function flatten2(sequence s, object delim = "") sequence res = "" object si if length(s)=0 then return s end if si = s[1] if atom(si) then res = append(res,si) else res = flatten2(si) end if for i=2 to length(s) do res &= delim si = s[i] if atom(si) then res = append(res,si) else res &= flatten2(si) end if end for return res end function
It seems much clearer to me as well. The specific test I used was:
sequence st = repeat(repeat(repeat("1",1000),100-0),100-0) atom t0, t1, t2 sequence t for i=1 to 3 do t0 = time() t = flatten(st,'\n') t1 = time() t = flatten2(st,'\n') t2 = time() printf(1,"f:%3.2fs, f2:%3.2fs\n",{t1-t0,t2-t1}) end for
The results on Phix varied depending on which level of repeat had 1000, whereas on eui (a somewhat dated r3412) it was about 25 times faster wherever the 1000 was.
I aslo tried
sequence st = repeat("this",100000)
and got 248s for flatten vs 0.02s for flatten2... which is 12,400 times faster.
Pete
EDIT: I just spotted that neither flatten nor flatten2() pass delim when making their recursive calls - should they?
2. Re: better flatten()?
- Posted by Spock Jul 27, 2015
- 1760 views
I decided it was about time I added flatten() to Phix, so I had a look at the one in std\sequence.e
... after staring at it incredulously for a while.. I decided it was utter shite, so I wrote my own. Not entirely surprisingly ... it ran between 50 and 250 times faster.
Does anyone want to check that this does what the docs say it should and is indeed significantly faster?
The results on Phix varied depending on which level of repeat had 1000, whereas on eui (a somewhat dated r3412) it was about 25 times faster wherever the 1000 was.
...
Pete
EDIT: I just spotted that neither flatten nor flatten2() pass delim when making their recursive calls - should they?
Pete, I get results similar to yours using a simplified version (in Orac, that don't quite get the last delimiter right..anyway). I think the delimiter should be passed in the recursion otherwise it is just emitted at the top-level which implies the author could only imagine a nested structure with a few fixed levels - quite bizarre for Euphoria.
On the last test my code is faster than flatten() by a factor of roughly 3000 times..
Thank you. For the umpteenth time it highlights why I hardly ever bother with \std
Spock
function flattenX(seq s, obj delim="") -- init seq ret = "" -- loop for i = 1 to ~ s do obj x = s.i if atom(x) then ret &= x & delim else ret &= flattenX( x, delim ) end if end for -- exit return ret end function
3. Re: better flatten()?
- Posted by ghaberek (admin) Jul 27, 2015
- 1635 views
I noticed a peculiar example in the documentation for flatten:
Example 3:
Using the delimiter argument. s = flatten({"abc", "def", "ghi"}, ", ") -- s is "abc, def, ghi"
Pete's flatten2() accomplishes this correctly, but Spock's flattenX() does not. flattenX() also leaves a trailing delimiter which should not be part of the correct output.
s = flatten2({"abc", "def", "ghi"}, ", ") -- s is "abc, def, ghi" s = flattenX({"abc", "def", "ghi"}, ", ") -- s is "a, b, c, d, e, f, g, h, i, "
However, taking this a step further, I noticed that flatten() and flatten2() do not seem to handle nested strings correctly. Notice how the nested strings get merged together:
s = flatten({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") -- s is "abc, def, ghi, jklmnopqr, stu, vwx, yz" s = flatten2({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") -- s is "abc, def, ghi, jklmnopqr, stu, vwx, yz"
So I made my own attempt to better handle this behavior and I wrote two functions: flatten_all() and flatten_seq().
-- -- string type borrowed from std/types.e -- type string( object x ) if not sequence(x) then return 0 end if for i = 1 to length(x) do if not integer(x[i]) then return 0 end if if x[i] < 0 then return 0 end if if x[i] > 255 then return 0 end if end for return 1 end type -- -- an array of only string objects -- type string_array( object x ) if atom( x ) then return 0 end if for i = 1 to length( x ) do if not string( x[i] ) then return 0 end if end for return 1 end type -- -- flatten a sequence into its raw atoms -- function flatten_all( object s1, object delim = "" ) if atom( s1 ) then return {s1} end if sequence s2 = {} for i = 1 to length( s1 ) do s2 &= flatten_all( s1[i] ) end for return join( s2, delim ) end function -- -- flatten a sequence, preserving nested strings -- function flatten_seq( sequence s1, object delim = "" ) sequence s2 = {} for i = 1 to length( s1 ) do if string( s1[i] ) then -- append string item s2 &= {s1[i]} elsif string_array( s1[i] ) then -- append the whole array s2 &= s1[i] else -- append the raw atoms s2 &= flatten_all( s1[i] ) end if end for return join( s2, delim ) end function
s = flatten_all({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") -- s is "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z" s = flatten_seq({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") -- s is "abc, def, ghi, jkl, mno, pqr, stu, vwx, yz"
These are nearly as fast as flatten2() or flattenX() and I believe they produce the most "correct" output thus far.
I did some testing on large random sequences and flatten_seq() seems to be just as fast as flatten2(), while flatten_all() is about 3-5 times slower.
I'll admit I'm probably taking a performance hit by using join() but it made the code come out a lot cleaner.
-Greg
4. Re: better flatten()?
- Posted by petelomax Jul 27, 2015
- 1632 views
Independently of Greg, and with the now recursive passing of delim, I also realised that strings needed a bit of help.
This is my latest version (I'll take a closer look at Greg's in a minute, and (btw) I'm just using the Phix builtin string() function here)
global function flatten(sequence s, object delim = "") sequence res = "" object si for i=1 to length(s) do if i!=1 then res &= delim end if si = s[i] --minor edit (replaced with next 2 lines): -- if atom(si) then -- res = append(res,si) -- elsif string(si) then if atom(si) or string(si) then res &= si else res &= flatten(si,delim) end if end for return res end function
This makes
s = flatten({"one", "two", "three"}, '\n')
produce
-- s is "one\ntwo\nthree"
rather than (ugh)
-- s is "o\nn\ne\nt\nw\no\nt\nh\nr\ne\ne"
Pete
5. Re: better flatten()?
- Posted by jimcbrown (admin) Jul 27, 2015
- 1631 views
and (btw) I'm just using the Phix builtin string() function here
What are the differences between this and from the OE std string() function ? http://openeuphoria.org/docs/std_types.html#_2199_string
Aside, of course, from the obvious (speed!)....
6. Re: better flatten()?
- Posted by petelomax Jul 27, 2015
- 1648 views
I noticed that flatten2() does not seem to handle nested strings correctly.
The revised version above yields the much more reasonable "abc, def, ghi, jkl, mno, pqr, stu, vwx, yz", thanks
So I made my own attempt to better handle this behavior and I wrote two functions: flatten_all() and flatten_seq().
Without adequate testing, may I quickly suggest:
-- -- string type borrowed from std/types.e -- type string( object x ) -- unchanged if not sequence(x) then return 0 end if for i = 1 to length(x) do if not integer(x[i]) then return 0 end if if x[i] < 0 then return 0 end if if x[i] > 255 then return 0 end if end for return 1 end type -- -- flatten a sequence, preserving nested strings -- function flatten_seq2( sequence s1, string delim = "" ) sequence s2 = "" object s1i for i = 1 to length( s1 ) do if i!=1 then s2 &= delim end if s1i = s1[i] if atom( s1i ) or string(s1i) then -- (edit: wrong test) s2 &= s1i else s2 &= flatten_seq2( s1i,delim ) end if end for return s2 end function
That string_array thing just looked wrong to me, like it was only ever going to work at a single nested level. Of course flatten() needs to do something sensible when passed a tree of words.
Oh, I had to remove join because I was making it recursive, it now calls itself rather than flatten_all, and that just feels more sensible.
I'm thinking it is reasonable to say that when processing non-string stuff you should not use a delimiter, and if you do specify one there is a fairly strong implication that you are expecting a single string back.
To that end I sneaked in a type check on delim.
I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.
Pete
7. Re: better flatten()?
- Posted by petelomax Jul 27, 2015
- 1652 views
and (btw) I'm just using the Phix builtin string() function here
What are the differences between this and from the OE std string() function ? http://openeuphoria.org/docs/std_types.html#_2199_string
Aside, of course, from the obvious (speed!)....
The other obvious one is that you don't need to include anything to invoke string() in Phix, which is mainly why I mention it.
Otherwise nothing major, except that in Phix strings can get auto-expanded to dword-sequences, which leads to the subtle difference:
sequence s s = "fred" -- string(s) yields true on Phix and OE s &= -1 -- string(s) yields false on Phix and OE s = s[1..$-1] -- string(s) yields false on Phix but true on OE
Obviously in Phix, if you want something to stay a string you keep it in a string, so that anomaly is very unlikely to happen.
You could also argue that last true on OE is a bit of a "false positive" in the sense that it looks like a string, but it probably ain't really supposed to be one by then.
Oh, I forgot about declaration semantics:
string("fred") -- true on Phix and OE string({'f','r','e','d'}) -- false on Phix, true on OE.
You can of course puts() etc dword_sequences as if they were strings, without any hassle.
Pete
8. Re: better flatten()?
- Posted by DerekParnell (admin) Jul 27, 2015
- 1619 views
The library version is definitely buggy and sub-optimal.
Here's my efforts ...
public function flatten(sequence pSource, object pDelim={}) sequence lResult object lTemp lResult = {} for i = 1 to length(pSource) do lTemp = pSource[i] if atom(lTemp) then lResult &= lTemp else lResult &= flatten(lTemp, pDelim) -- Only add delimiter if not the last sequence. if i != length(pSource) then lResult &= pDelim end if end if end for return lResult end function
9. Re: better flatten()?
- Posted by petelomax Jul 27, 2015
- 1627 views
Here's my efforts ...
unfortunately this is what I get:
s = flatten({"abc",'x','y'}, ", ") -- s is "abc, xy"
whereas I was expecting to see "abc, x, y", which is what my latest code above (just updated with a minor edit that did not change the outcome), Greg's, and my hastily-just-updated (which did change the result) version of Greg's code deliver.
Pete
10. Re: better flatten()?
- Posted by euphoric (admin) Jul 27, 2015
- 1618 views
Is one of these better/faster algorithms going to be adopted into Euphoria source? Just curious.
11. Re: better flatten()?
- Posted by ghaberek (admin) Jul 27, 2015
- 1604 views
I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.
It seemed reasonable to me that flatten_all() always returned a sequence of atoms, but flatten_seq() handled strings gracefully.
-Greg
12. Re: better flatten()?
- Posted by DerekParnell (admin) Jul 27, 2015
- 1614 views
Here's my efforts ...
unfortunately this is what I get:
s = flatten({"abc",'x','y'}, ", ") -- s is "abc, xy"
whereas I was expecting to see "abc, x, y", which is what my latest code above (just updated with a minor edit that did not change the outcome), Greg's, and my hastily-just-updated (which did change the result) version of Greg's code deliver.
Pete
I was going by the documentation in the current library one. It says 'An optional delimiter to place after each flattened sub-sequence (except the last one).' And as the source in your last example {"abc",'x','y'} has only one sub-sequence (the "abc"), I would have expected only one delimiter to be inserted. However, I do acknowledge that I interpreted the phrase (except the last one) to mean (except if the sub-sequence is the last element). Therefore I feel my code is producing what the original author has specified.
This why functional specifications are so important and extremely hard to write, often harder than implementing them in code.
13. Re: better flatten()?
- Posted by DerekParnell (admin) Jul 27, 2015
- 1606 views
I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.
It seemed reasonable to me that flatten_all() always returned a sequence of atoms, but flatten_seq() handled strings gracefully.
-Greg
My code implements flatten_all when the delimiter is {}, and flatten_seq when given a delimiter. Euphoria makes no fundamental distinction between strings and sequences. For example, how should flatten() handle {"Range", {-3.45, 102.50}} given the delimiter " is "?
14. Re: better flatten()?
- Posted by jimcbrown (admin) Jul 27, 2015
- 1619 views
I was going by the documentation in the current library one. It says 'An optional delimiter to place after each flattened sub-sequence (except the last one).' And as the source in your last example {"abc",'x','y'} has only one sub-sequence (the "abc"), I would have expected only one delimiter to be inserted. However, I do acknowledge that I interpreted the phrase (except the last one) to mean (except if the sub-sequence is the last element). Therefore I feel my code is producing what the original author has specified.
Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best. http://scm.openeuphoria.org/hg/euphoria/annotate/9f25f7612b39/include/std/sequence.e#l1803
(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')
This why functional specifications are so important and extremely hard to write, often harder than implementing them in code.
Agreed in full.
I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)
15. Re: better flatten()?
- Posted by DerekParnell (admin) Jul 28, 2015
- 1577 views
Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best.
Yeah ... you'd think so, right? Oh well ... I can't remember doing any of those changes to Jeremy's original code.
(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')
I might if I could work out Hg. (That software is sooooo unintuitive, IMO)
I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)
I don't know. I have not really had a need to use this function.
16. Re: better flatten()?
- Posted by jimcbrown (admin) Jul 28, 2015
- 1575 views
Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best.
Yeah ... you'd think so, right? Oh well ... I can't remember doing any of those changes to Jeremy's original code.
Understandable, considering it's been over half a decade. (Anyways, I think most of the sillyness in that function is from CChris.)
(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')
I might if I could work out Hg. (That software is sooooo unintuitive, IMO)
Note to Tom!
I'm wondering if we shoudl revisit Heechee or another Hg-SVN gateway. Or even just move back to svn altogether.
I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)
I don't know. I have not really had a need to use this function.
Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?
17. Re: better flatten()?
- Posted by petelomax Jul 28, 2015
- 1573 views
I don't know. I have not really had a need to use this function.
Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?
I am beginning to think you might be quite right there, instead of flatten(s,delim) the programming construct to use should perhaps be join(flatten(s),delim), or often just join(s,delim).
I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...
I found the following uses of flatten() with a delimiter:
std\cmdline.e\build_commandline() -- wants string handling, eg flatten({"del","filename"}," ") -> "del filename", rather than "d e l f i l e n a m e". redydemo\gui\objects\hypertext.e() -- "" edita\eacons.ew (private routine) -- "", for messagebox, eg flatten({"line1","line2"},"\n") -> "line1\nline2", rather than "l\ni\nn\ne\n1\nl\ni\nn\ne\n2" edita\eafext.ew ("") -- "" edita\eatabl.e ("") -- ""
and only demo\net\pastey.ex without, which relies on \n being left in by read_lines() for something similar to eacons.ew above.
Surprisingly, all existing uses of flatten() that I could find were "sequence of string -> string", and all were really join() under a different name.
Pete
18. Re: better flatten()?
- Posted by jimcbrown (admin) Jul 28, 2015
- 1570 views
I don't know. I have not really had a need to use this function.
Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?
I am beginning to think you might be quite right there, instead of flatten(s,delim) the programming construct to use should perhaps be join(flatten(s),delim), or often just join(s,delim).
Agreed in full.
I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...
What would this sister function do?
I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search... I found the following uses of flatten() with a delimiter:
std\cmdline.e\build_commandline() -- wants string handling, eg flatten({"del","filename"}," ") -> "del filename", rather than "d e l f i l e n a m e". redydemo\gui\objects\hypertext.e() -- "" edita\eacons.ew (private routine) -- "", for messagebox, eg flatten({"line1","line2"},"\n") -> "line1\nline2", rather than "l\ni\nn\ne\n1\nl\ni\nn\ne\n2" edita\eafext.ew ("") -- "" edita\eatabl.e ("") -- ""
and only demo\net\pastey.ex without, which relies on \n being left in by read_lines() for something similar to eacons.ew above.
Surprisingly, all existing uses of flatten() that I could find were "sequence of string -> string", and all were really join() under a different name.
Pete
Yes, I think all those cases should be changed to replace the flatten() call with a join() call.
Maybe, a flatten(join(...)) call, in case (for example) a caller of build_commandline() passes in a sequence with more than a single sublevel (e.g. more depth than a mere sequence of strings). OTOH maybe we should just fail early in that case.
19. Re: better flatten()?
- Posted by petelomax Jul 28, 2015
- 1557 views
I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...
What would this sister function do?
I just meant that instead of an optional flag on the standard flatten(), you could have a similar function that always performed the string checks. I was thinking (given the plan to remove the delimiter)
flatten({{"this","that"},{"the","other"}}) --> "thisthattheother" flattens({{"this","that"},{"the","other"}}) --> {"this","that","the","other"}
However, as I am now thinking that lists of strings, whether you want a delimiter or not, are better passed to join(), any more deeply nested structures that happen to contain strings are likely to contain all other kinds of junk, and passing that sort of thing to flatten() now seems much less likely to yield any kind of useful result no matter how you slice and dice it. If your application has something like { {"this","that"},{"the","other"} } or worse, then it is probably better to roll your own disentangle(), maybe one that sorts and removes duplicates, or only selects those words where some other flag is set, than expect standard routines to have the magic to deal with every possible situation.
Pete