1. better flatten()?

I decided it was about time I added flatten() to Phix, so I had a look at the one in std\sequence.e...

public function flatten(sequence s, object delim = "") 
sequence ret 
object x 
integer len 
integer pos 
sequence temp 
 
    if atom(delim) then 
        delim = {delim} 
    end if 
    ret = s 
    pos = 1 
    len = length(ret) 
    while pos<=len do 
        x = ret[pos] 
        if sequence(x) then 
            if length(delim)=0 then 
                ret = ret[1..pos-1] & flatten(x) & ret[pos+1..$] 
            else 
                temp = ret[1..pos-1] & flatten(x) 
                if pos!=length(ret) then 
                    ret = temp &  delim & ret[pos+1..$] 
                else 
                    ret = temp & ret[pos+1..$] 
                end if 
            end if 
            len = length(ret) 
        else 
            pos += 1 
        end if 
    end while 
 
    return ret 
end function 

Hmmm... after staring at it incredulously for a while, I finally grasped that what it was doing was to flatten(x) then quite pointlessly check that the just processed flatten(x)[2..$] were indeed all atoms and do nothing to them. Unless it is part of the plan that delim can be a nested sequence which you want to it to repeatedly flatten, which is equally ridiculous. It is also shunting the end of s/ret around on every iteration for no good reason. Anyway, I decided it was utter shite, so I wrote my own. Not entirely surprisingly (to me) it ran between 50 and 250 times faster.

Does anyone want to check that this does what the docs say it should and is indeed significantly faster?

function flatten2(sequence s, object delim = "") 
sequence res = "" 
object si 
 
    if length(s)=0 then return s end if 
    si = s[1] 
    if atom(si) then 
        res = append(res,si) 
    else 
        res = flatten2(si) 
    end if 
    for i=2 to length(s) do 
        res &= delim 
        si = s[i] 
        if atom(si) then 
            res = append(res,si) 
        else 
            res &= flatten2(si) 
        end if 
    end for 
    return res 
end function 

It seems much clearer to me as well. The specific test I used was:

sequence st = repeat(repeat(repeat("1",1000),100-0),100-0) 
atom t0, t1, t2 
sequence t 
for i=1 to 3 do 
    t0 = time() 
    t = flatten(st,'\n') 
    t1 = time() 
    t = flatten2(st,'\n') 
    t2 = time() 
    printf(1,"f:%3.2fs, f2:%3.2fs\n",{t1-t0,t2-t1}) 
end for 

The results on Phix varied depending on which level of repeat had 1000, whereas on eui (a somewhat dated r3412) it was about 25 times faster wherever the 1000 was.

I aslo tried

sequence st = repeat("this",100000) 

and got 248s for flatten vs 0.02s for flatten2... which is 12,400 times faster.

Pete

EDIT: I just spotted that neither flatten nor flatten2() pass delim when making their recursive calls - should they?

new topic     » topic index » view message » categorize

2. Re: better flatten()?

petelomax said...

I decided it was about time I added flatten() to Phix, so I had a look at the one in std\sequence.e

... after staring at it incredulously for a while.. I decided it was utter shite, so I wrote my own. Not entirely surprisingly ... it ran between 50 and 250 times faster.

Does anyone want to check that this does what the docs say it should and is indeed significantly faster?

The results on Phix varied depending on which level of repeat had 1000, whereas on eui (a somewhat dated r3412) it was about 25 times faster wherever the 1000 was.

...

Pete

EDIT: I just spotted that neither flatten nor flatten2() pass delim when making their recursive calls - should they?

Pete, I get results similar to yours using a simplified version (in Orac, that don't quite get the last delimiter right..anyway). I think the delimiter should be passed in the recursion otherwise it is just emitted at the top-level which implies the author could only imagine a nested structure with a few fixed levels - quite bizarre for Euphoria.

On the last test my code is faster than flatten() by a factor of roughly 3000 times..

Thank you. For the umpteenth time it highlights why I hardly ever bother with \std

Spock

function flattenX(seq s, obj delim="") 
 
	-- init 
	seq ret = "" 
 
	-- loop 
	for i = 1 to ~ s do 
	   obj x = s.i 
	   if atom(x) then 
	      ret &= x & delim 
	   else 
	      ret &= flattenX( x, delim ) 
	   end if 
	end for 
 
	-- exit 
	return ret 
 
end function 
new topic     » goto parent     » topic index » view message » categorize

3. Re: better flatten()?

I noticed a peculiar example in the documentation for flatten:

said...

Example 3:

Using the delimiter argument. 
s = flatten({"abc", "def", "ghi"}, ", ") 
-- s is "abc, def, ghi" 

Pete's flatten2() accomplishes this correctly, but Spock's flattenX() does not. flattenX() also leaves a trailing delimiter which should not be part of the correct output.

s = flatten2({"abc", "def", "ghi"}, ", ") 
-- s is "abc, def, ghi" 
 
s = flattenX({"abc", "def", "ghi"}, ", ") 
-- s is "a, b, c, d, e, f, g, h, i, " 

However, taking this a step further, I noticed that flatten() and flatten2() do not seem to handle nested strings correctly. Notice how the nested strings get merged together:

s = flatten({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") 
-- s is "abc, def, ghi, jklmnopqr, stu, vwx, yz" 
 
s = flatten2({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") 
-- s is "abc, def, ghi, jklmnopqr, stu, vwx, yz" 

So I made my own attempt to better handle this behavior and I wrote two functions: flatten_all() and flatten_seq().

-- 
-- string type borrowed from std/types.e 
-- 
type string( object x ) 
     
    if not sequence(x) then 
        return 0 
    end if 
     
    for i = 1 to length(x) do 
        if not integer(x[i]) then 
            return 0 
        end if 
        if x[i] < 0 then 
            return 0 
        end if 
        if x[i] > 255 then 
            return 0 
        end if 
    end for 
     
    return 1 
end type 
 
-- 
-- an array of only string objects 
-- 
type string_array( object x ) 
     
    if atom( x ) then return 0 end if 
     
    for i = 1 to length( x ) do 
        if not string( x[i] ) then 
            return 0 
        end if 
    end for 
     
    return 1 
end type 
 
-- 
-- flatten a sequence into its raw atoms 
-- 
function flatten_all( object s1, object delim = "" ) 
     
    if atom( s1 ) then return {s1} end if 
     
    sequence s2 = {} 
     
    for i = 1 to length( s1 ) do 
        s2 &= flatten_all( s1[i] ) 
    end for 
     
    return join( s2, delim ) 
end function 
 
-- 
-- flatten a sequence, preserving nested strings 
-- 
function flatten_seq( sequence s1, object delim = "" ) 
     
    sequence s2 = {} 
     
    for i = 1 to length( s1 ) do 
         
        if string( s1[i] ) then 
            -- append string item 
            s2 &= {s1[i]} 
             
        elsif string_array( s1[i] ) then 
            -- append the whole array 
            s2 &= s1[i] 
             
        else 
            -- append the raw atoms 
            s2 &= flatten_all( s1[i] ) 
             
        end if 
         
    end for 
     
    return join( s2, delim ) 
end function 
s = flatten_all({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") 
-- s is "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z" 
 
s = flatten_seq({"abc", "def", "ghi", {"jkl", "mno", "pqr"}, "stu", "vwx", "yz"}, ", ") 
-- s is "abc, def, ghi, jkl, mno, pqr, stu, vwx, yz" 

These are nearly as fast as flatten2() or flattenX() and I believe they produce the most "correct" output thus far.

I did some testing on large random sequences and flatten_seq() seems to be just as fast as flatten2(), while flatten_all() is about 3-5 times slower.

I'll admit I'm probably taking a performance hit by using join() but it made the code come out a lot cleaner.

-Greg

new topic     » goto parent     » topic index » view message » categorize

4. Re: better flatten()?

Independently of Greg, and with the now recursive passing of delim, I also realised that strings needed a bit of help.

This is my latest version (I'll take a closer look at Greg's in a minute, and (btw) I'm just using the Phix builtin string() function here)

global function flatten(sequence s, object delim = "") 
sequence res = "" 
object si 
 
    for i=1 to length(s) do 
        if i!=1 then 
            res &= delim 
        end if 
        si = s[i] 
--minor edit (replaced with next 2 lines): 
--      if atom(si) then 
--          res = append(res,si) 
--      elsif string(si) then 
        if atom(si) 
        or string(si) then 
            res &= si 
        else 
            res &= flatten(si,delim) 
        end if 
    end for 
    return res 
end function 

This makes

s = flatten({"one", "two", "three"}, '\n') 

produce

-- s is "one\ntwo\nthree" 

rather than (ugh)

-- s is "o\nn\ne\nt\nw\no\nt\nh\nr\ne\ne" 

Pete

new topic     » goto parent     » topic index » view message » categorize

5. Re: better flatten()?

petelomax said...

and (btw) I'm just using the Phix builtin string() function here

What are the differences between this and from the OE std string() function ? http://openeuphoria.org/docs/std_types.html#_2199_string

Aside, of course, from the obvious (speed!)....

new topic     » goto parent     » topic index » view message » categorize

6. Re: better flatten()?

ghaberek said...

I noticed that flatten2() does not seem to handle nested strings correctly.

The revised version above yields the much more reasonable "abc, def, ghi, jkl, mno, pqr, stu, vwx, yz", thanks

ghaberek said...

So I made my own attempt to better handle this behavior and I wrote two functions: flatten_all() and flatten_seq().

Without adequate testing, may I quickly suggest:

-- 
-- string type borrowed from std/types.e 
-- 
type string( object x )  -- unchanged 
     
    if not sequence(x) then 
        return 0 
    end if 
     
    for i = 1 to length(x) do 
        if not integer(x[i]) then 
            return 0 
        end if 
        if x[i] < 0 then 
            return 0 
        end if 
        if x[i] > 255 then 
            return 0 
        end if 
    end for 
     
    return 1 
end type 
 
-- 
-- flatten a sequence, preserving nested strings 
-- 
function flatten_seq2( sequence s1, string delim = "" )  
      
sequence s2 = ""  
object s1i 
      
    for i = 1 to length( s1 ) do  
        if i!=1 then  
            s2 &= delim  
        end if  
        s1i = s1[i] 
        if atom( s1i ) or string(s1i) then  -- (edit: wrong test) 
            s2 &= s1i  
              
        else  
            s2 &= flatten_seq2( s1i,delim )  
              
        end if  
          
    end for  
      
    return s2  
end function  

That string_array thing just looked wrong to me, like it was only ever going to work at a single nested level. Of course flatten() needs to do something sensible when passed a tree of words.

Oh, I had to remove join because I was making it recursive, it now calls itself rather than flatten_all, and that just feels more sensible.

I'm thinking it is reasonable to say that when processing non-string stuff you should not use a delimiter, and if you do specify one there is a fairly strong implication that you are expecting a single string back.

To that end I sneaked in a type check on delim.

I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.

Pete

new topic     » goto parent     » topic index » view message » categorize

7. Re: better flatten()?

jimcbrown said...
petelomax said...

and (btw) I'm just using the Phix builtin string() function here

What are the differences between this and from the OE std string() function ? http://openeuphoria.org/docs/std_types.html#_2199_string

Aside, of course, from the obvious (speed!)....

The other obvious one is that you don't need to include anything to invoke string() in Phix, which is mainly why I mention it.

Otherwise nothing major, except that in Phix strings can get auto-expanded to dword-sequences, which leads to the subtle difference:

sequence s 
s = "fred"  -- string(s) yields true on Phix and OE 
s &= -1     -- string(s) yields false on Phix and OE 
s = s[1..$-1] -- string(s) yields false on Phix but true on OE 

Obviously in Phix, if you want something to stay a string you keep it in a string, so that anomaly is very unlikely to happen.

You could also argue that last true on OE is a bit of a "false positive" in the sense that it looks like a string, but it probably ain't really supposed to be one by then.

Oh, I forgot about declaration semantics:

string("fred") -- true on Phix and OE 
string({'f','r','e','d'}) -- false on Phix, true on OE. 

You can of course puts() etc dword_sequences as if they were strings, without any hassle.

Pete

new topic     » goto parent     » topic index » view message » categorize

8. Re: better flatten()?

The library version is definitely buggy and sub-optimal.

Here's my efforts ...

public function flatten(sequence pSource, object pDelim={}) 
	sequence lResult 
	object lTemp 
	 
	lResult = {} 
	 
	for i = 1 to length(pSource) do 
		lTemp = pSource[i] 
		if atom(lTemp) then 
			lResult &= lTemp 
		else 
			lResult &= flatten(lTemp, pDelim) 
			-- Only add delimiter if not the last sequence. 
			if i != length(pSource) then 
				lResult &= pDelim 
			end if 
		end if 
	end for 
	return lResult 
	 
end function 
 
new topic     » goto parent     » topic index » view message » categorize

9. Re: better flatten()?

DerekParnell said...

Here's my efforts ...

unfortunately this is what I get:

s = flatten({"abc",'x','y'}, ", ") 
-- s is "abc, xy" 

whereas I was expecting to see "abc, x, y", which is what my latest code above (just updated with a minor edit that did not change the outcome), Greg's, and my hastily-just-updated (which did change the result) version of Greg's code deliver.

Pete

new topic     » goto parent     » topic index » view message » categorize

10. Re: better flatten()?

Is one of these better/faster algorithms going to be adopted into Euphoria source? Just curious.

new topic     » goto parent     » topic index » view message » categorize

11. Re: better flatten()?

petelomax said...

I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.

It seemed reasonable to me that flatten_all() always returned a sequence of atoms, but flatten_seq() handled strings gracefully.

-Greg

new topic     » goto parent     » topic index » view message » categorize

12. Re: better flatten()?

petelomax said...
DerekParnell said...

Here's my efforts ...

unfortunately this is what I get:

s = flatten({"abc",'x','y'}, ", ") 
-- s is "abc, xy" 

whereas I was expecting to see "abc, x, y", which is what my latest code above (just updated with a minor edit that did not change the outcome), Greg's, and my hastily-just-updated (which did change the result) version of Greg's code deliver.

Pete

I was going by the documentation in the current library one. It says 'An optional delimiter to place after each flattened sub-sequence (except the last one).' And as the source in your last example {"abc",'x','y'} has only one sub-sequence (the "abc"), I would have expected only one delimiter to be inserted. However, I do acknowledge that I interpreted the phrase (except the last one) to mean (except if the sub-sequence is the last element). Therefore I feel my code is producing what the original author has specified.

This why functional specifications are so important and extremely hard to write, often harder than implementing them in code.

new topic     » goto parent     » topic index » view message » categorize

13. Re: better flatten()?

ghaberek said...
petelomax said...

I'm not absolutely against the idea of two functions, but I'm not convinced we really need that split yet either.

It seemed reasonable to me that flatten_all() always returned a sequence of atoms, but flatten_seq() handled strings gracefully.

-Greg

My code implements flatten_all when the delimiter is {}, and flatten_seq when given a delimiter. Euphoria makes no fundamental distinction between strings and sequences. For example, how should flatten() handle {"Range", {-3.45, 102.50}} given the delimiter " is "?

new topic     » goto parent     » topic index » view message » categorize

14. Re: better flatten()?

DerekParnell said...

I was going by the documentation in the current library one. It says 'An optional delimiter to place after each flattened sub-sequence (except the last one).' And as the source in your last example {"abc",'x','y'} has only one sub-sequence (the "abc"), I would have expected only one delimiter to be inserted. However, I do acknowledge that I interpreted the phrase (except the last one) to mean (except if the sub-sequence is the last element). Therefore I feel my code is producing what the original author has specified.

Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best. http://scm.openeuphoria.org/hg/euphoria/annotate/9f25f7612b39/include/std/sequence.e#l1803

(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')

DerekParnell said...

This why functional specifications are so important and extremely hard to write, often harder than implementing them in code.

Agreed in full.

I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)

new topic     » goto parent     » topic index » view message » categorize

15. Re: better flatten()?

jimcbrown said...

Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best.

Yeah ... you'd think so, right? Oh well ... I can't remember doing any of those changes to Jeremy's original code.

jimcbrown said...

(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')

I might if I could work out Hg. (That software is sooooo unintuitive, IMO)

jimcbrown said...

I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)

I don't know. I have not really had a need to use this function.

new topic     » goto parent     » topic index » view message » categorize

16. Re: better flatten()?

DerekParnell said...
jimcbrown said...

Considering that you apparently seem to be the original author of that phrase, and the delim functionality of flatten() itself in the first place, I think you would know that best.

Yeah ... you'd think so, right? Oh well ... I can't remember doing any of those changes to Jeremy's original code.

Understandable, considering it's been over half a decade. (Anyways, I think most of the sillyness in that function is from CChris.)

DerekParnell said...
jimcbrown said...

(I would suggest modifying the docs to make that more explict in the future however: 'An optional delimiter to place after each flattened sub-sequence (except if the sub-sequence is the last element).')

I might if I could work out Hg. (That software is sooooo unintuitive, IMO)

Note to Tom!

I'm wondering if we shoudl revisit Heechee or another Hg-SVN gateway. Or even just move back to svn altogether.

DerekParnell said...
jimcbrown said...

I am curious as to the rationale, though, of distinguishing between the two cases. That is, why should flatten() only insert a delimiter up to the last subsequence (or last element that is a sequence), and continue to append atomic elements after that without a delimiter? Seems to open the door to other kinds of weird cases. (E.g. if flatten() is passed a sequence that is already flat, and also given a delimiter, should merely return the original string or should it insert a delimiter between each of the atomic elements?)

I don't know. I have not really had a need to use this function.

Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?

new topic     » goto parent     » topic index » view message » categorize

17. Re: better flatten()?

jimcbrown said...
DerekParnell said...

I don't know. I have not really had a need to use this function.

Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?

I am beginning to think you might be quite right there, instead of flatten(s,delim) the programming construct to use should perhaps be join(flatten(s),delim), or often just join(s,delim).

I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...

I found the following uses of flatten() with a delimiter:

std\cmdline.e\build_commandline()    -- wants string handling, eg flatten({"del","filename"}," ") -> "del filename", rather than "d e l f i l e n a m e". 
redydemo\gui\objects\hypertext.e()   -- "" 
edita\eacons.ew (private routine)    -- "", for messagebox, eg flatten({"line1","line2"},"\n") -> "line1\nline2", rather than "l\ni\nn\ne\n1\nl\ni\nn\ne\n2" 
edita\eafext.ew ("")                 -- "" 
edita\eatabl.e ("")                  -- "" 

and only demo\net\pastey.ex without, which relies on \n being left in by read_lines() for something similar to eacons.ew above.

Surprisingly, all existing uses of flatten() that I could find were "sequence of string -> string", and all were really join() under a different name.

Pete

new topic     » goto parent     » topic index » view message » categorize

18. Re: better flatten()?

petelomax said...
jimcbrown said...
DerekParnell said...

I don't know. I have not really had a need to use this function.

Why have the delim stuff in the first place? Euphoria generally doesn't follow the KISS principle, but I wonder if that's making the function overcomplicated. Does anybody use it?

I am beginning to think you might be quite right there, instead of flatten(s,delim) the programming construct to use should perhaps be join(flatten(s),delim), or often just join(s,delim).

Agreed in full.

petelomax said...

I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...

What would this sister function do?

petelomax said...

I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search... I found the following uses of flatten() with a delimiter:

std\cmdline.e\build_commandline()    -- wants string handling, eg flatten({"del","filename"}," ") -> "del filename", rather than "d e l f i l e n a m e". 
redydemo\gui\objects\hypertext.e()   -- "" 
edita\eacons.ew (private routine)    -- "", for messagebox, eg flatten({"line1","line2"},"\n") -> "line1\nline2", rather than "l\ni\nn\ne\n1\nl\ni\nn\ne\n2" 
edita\eafext.ew ("")                 -- "" 
edita\eatabl.e ("")                  -- "" 

and only demo\net\pastey.ex without, which relies on \n being left in by read_lines() for something similar to eacons.ew above.

Surprisingly, all existing uses of flatten() that I could find were "sequence of string -> string", and all were really join() under a different name.

Pete

Yes, I think all those cases should be changed to replace the flatten() call with a join() call.

Maybe, a flatten(join(...)) call, in case (for example) a caller of build_commandline() passes in a sequence with more than a single sublevel (e.g. more depth than a mere sequence of strings). OTOH maybe we should just fail early in that case.

new topic     » goto parent     » topic index » view message » categorize

19. Re: better flatten()?

jimcbrown said...
petelomax said...

I was also thinking that flatten() should either have a "check for string elements" option or a sister function, at least that was until I did a quick search...

What would this sister function do?

I just meant that instead of an optional flag on the standard flatten(), you could have a similar function that always performed the string checks. I was thinking (given the plan to remove the delimiter)

flatten({{"this","that"},{"the","other"}}) --> "thisthattheother" 
flattens({{"this","that"},{"the","other"}}) --> {"this","that","the","other"} 

However, as I am now thinking that lists of strings, whether you want a delimiter or not, are better passed to join(), any more deeply nested structures that happen to contain strings are likely to contain all other kinds of junk, and passing that sort of thing to flatten() now seems much less likely to yield any kind of useful result no matter how you slice and dice it. If your application has something like { {"this","that"},{"the","other"} } or worse, then it is probably better to roll your own disentangle(), maybe one that sorts and removes duplicates, or only selects those words where some other flag is set, than expect standard routines to have the magic to deal with every possible situation.

Pete

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu