1. Optimizing multiple searches

CODE:

sequence lines = {"this is first line",  
		"second line",  
		"another line",  
		"and this is another line- will be fourth", 
		"and last line"}  
sequence terms = {"line", "will", "this"}  
 
sequence outlist = {}  
for i=1 to length(terms) do   
	for j=1 to length(lines) do   
		if match(terms[i], lines[j]) then   
		outlist = append(outlist, lines[j])  
		end if  
	end for   
	lines = outlist   
	outlist = {}   
end for   
display(lines) 

Output:

{ 
  "and this is another line- will be fourth" 
} 

I have 2 lists (lines and terms) and I have to select out only those entries of 'lines' where all items of 'terms' are present. Above code works but can it be optimized or shortened since I see that functions such as match_all, find_all, find_each etc are also available?

(Not able to put hashes for code in above text).

new topic     » topic index » view message » categorize

2. Re: Optimizing multiple searches

rneu said...

(Not able to put hashes for code in above text).

Tip: use the <eucode> tag for Euphoria code and {{{ / }}} for all other formatted text (e.g. console output). See CreoleHelp for more info.

-Greg

new topic     » goto parent     » topic index » view message » categorize

3. Re: Optimizing multiple searches

rneu said...

I have 2 lists (lines and terms) and I have to select out only those entries of 'lines' where all items of 'terms' are present. Above code works but can it be optimized or shortened since I see that functions such as match_all, find_all, find_each etc are also available?

Since you require all terms to match, then you can jump out of the loop of terms as soon as you don't find one you're looking for. This is easier if you invert the for loops: loop through all of the lines, scanning each one for the required terms.

include std/console.e 
include std/types.e 
 
sequence lines = { 
    "this is first line", 
    "second line", 
    "another line", 
    "and this is another line- will be fourth", 
    "and last line" 
} 
 
sequence terms = {"line", "will", "this"} 
 
sequence outlist = {} 
 
for i = 1 to length( lines ) do 
 
    integer found_all = TRUE 
 
    for j = 1 to length( terms ) do 
 
        if not match( terms[j], lines[i] ) then 
            found_all = FALSE 
            break 
        end if 
 
    end for 
 
    if found_all then 
        outlist = append( outlist, lines[i] ) 
    end if 
 
end for 
 
display( outlist ) 

-Greg

new topic     » goto parent     » topic index » view message » categorize

4. Re: Optimizing multiple searches

rneu said...

I have 2 lists (lines and terms) and I have to select out only those entries of 'lines' where all items of 'terms' are present. Above code works but can it be optimized or shortened since I see that functions such as match_all, find_all, find_each etc are also available?

(Not able to put hashes ## for code in above text).

Although there is a match_any() function in Euphoria it is not of the kind that works in this case. If there were you could've shortened the loop to:

for i=1 to length(lines) do 
   if match_any(terms, lines.i) = length(terms) then 
      outlist &= { lines.i } 
   end if 
end for 

You can easily write your own function, eg:

function match_any(sequence terms, sequence target) 
   sequence ret = {} 
   for i=1 to length(terms) do 
      integer n = match(terms.i, target) 
      if n then ret &= n end if 
   end for 
   return ret 
end function 

But the first thing to ask really is what are the data set sizes and what optimization (and other programming) goals do you have? Maybe when you say "optimize" you really mean beautify? Or is it lowest number of lines of code? Or is it maximum performance? There are compromises or trade-offs to be made in every choice. But the very first goal you must strive for is, of course, correctness. If it works correctly and you can live with the performance.. you're done!

Spock

new topic     » goto parent     » topic index » view message » categorize

5. Re: Optimizing multiple searches

Not impressed when looking at the manual...
http://openeuphoria.org/docs/std_search.html#_2480_match_all said...
public function match_all(sequence needle, sequence haystack, integer start = 1)
Match all items of haystack in needle.

Is it me or is that just completely backwards? It should perhaps say:
Locate all instances(/subsequences) of needle in haystack.

The example from the above link may clarify my wording change for the manual, but also shows that match_all() is not at all what is need for the task as asked for above:

s = match_all("the", "the dog chased the cat under the table.") 
-- s is {1,16,30} 

Anyway, going back to the original question, here's my suggestion

sequence lines = {  
    "this is first line",  
    "second line",  
    "another line",  
    "and this is another line- will be fourth",  
    "and last line"  
}  
  
sequence terms = {"line", "will", "this"}  
 
function match_alls(sequence terms, sequence lines) 
integer k = 0 
    for i=1 to length(lines) do 
        for j=1 to length(terms) do 
            if match(terms[j],lines[i])=0 then exit end if 
            if j=length(terms) then 
                k += 1 
                lines[k] = lines[i] 
            end if 
        end for 
    end for 
    return lines[1..k] 
end function 
 
?match_alls(terms,lines) 

Output

{"and this is another line- will be fourth"} 
(you may want to use display() instead of ?)

To get Spock's suggestion to work as I thought it should, and match my output, I had to change

   if match_any(terms, lines[i]) = length(terms) then  

to

   if length(match_any(terms, lines[i])) = length(terms) then  

Pete

PS: Spock asked some questions that need answering (I would have kept schtum were it not for the manual thing)

new topic     » goto parent     » topic index » view message » categorize

6. Re: Optimizing multiple searches

petelomax said...

To get Spock's suggestion to work as I thought it should, and match my output, I had to change [it] to

   if length(match_any(terms, lines[i])) = length(terms) then  

Pete

Nice catch.

Spock

new topic     » goto parent     » topic index » view message » categorize

7. Re: Optimizing multiple searches

search.e library takes some untangling before use.

Always look to the idea that Euphoria divides options into two: atom|sequence, one|many, 0|1 .

  • You are searching for something in a sequence or a string sequence.
  • An item is "usualy often an atom, but could be an object viewed as one whole."
  • An {items list} is "several items that could satisfy the search."
  • A slice is "a contiguous sequence that is part of an equal or longer sequence."
  • A {slices list} is "several slices that could satisfy the search."
  • A "new" sequence is "returned after a search and replace action."

find

A find function "locates an item in a list."

input search result use
item 0|1 is_in_list is_in_range
index find rfind
{ indices } find_all find_all_but
"new" find_replace




{ items list } 0|1 match_any
index find_any
{ indices } find_each

Warning: the match_any function does not fit the naming pattern.

match

A match function "locates a slice in a list."

input search result use
slice 0|1 begins ends
index match
{ indices } match_all
"new" match_replace

Note: for {slices list} you have to write nested loops.



Wanted Search Libray

An improved std/search.e would not have any conflicts with the std/regex.e library.

A lowercase search function ( find, match ) uses one item as a search argument. An uppercase search function ( Find, Match ) uses a list of items as a search argument.

find

A find function "searches for an item in a list."
A Find function "searches for {items} in a list."

input search result use
item 0|1 is_in_list is_in_range
index find rfind
{ indices } find_all find_all_but
"new" find_replace




{ items list } 0|1
index Find
{ indices } Find_all

A match function "searches for a slice in a list." A Match function "searches for {slices} in a list."

input search result use
slice 0|1 begins ends
index match
{ indices } match_all
"new" match_replace




{ items list } 0|1
index Match
{ indices } Match_all

_tom

new topic     » goto parent     » topic index » view message » categorize

8. Re: Optimizing multiple searches

search.e library takes some untangling before use.

Always look to the idea that Euphoria divides options into two: atom|sequence, one|many, 0|1 .

  • You are searching for something in a sequence or a string sequence.
  • An item is "usualy often an atom, but could be an object viewed as one whole."
  • An {items list} is "several items that could satisfy the search."
  • A slice is "a contiguous sequence that is part of an equal or longer sequence."
  • A {slices list} is "several slices that could satisfy the search."
  • A "new" sequence is "returned after a search and replace action."

find

A find function "locates an item in a list."

input search result use
item 0|1 is_in_list is_in_range
index find rfind
{ indices } find_all find_all_but
"new" find_replace




{ items list } 0|1 match_any
index find_any
{ indices } find_each

Warning: the match_any function does not fit the naming pattern.

match

A match function "locates a slice in a list."

input search result use
slice 0|1 begins ends
index match
{ indices } match_all
"new" match_replace

Note: for {slices list} you have to write nested loops.



Wanted Search Libray

An improved std/search.e would not have any conflicts with the std/regex.e library.

A lowercase search function ( find, match ) uses one item as a search argument. An uppercase search function ( Find, Match ) uses a list of items as a search argument.

find

A find function "searches for an item in a list."
A Find function "searches for {items} in a list."

input search result use
item 0|1 is_in_list is_in_range
index find rfind
{ indices } find_all find_all_but
"new" find_replace




{ items list } 0|1
index Find
{ indices } Find_all

A match function "searches for a slice in a list." A Match function "searches for {slices} in a list."

input search result use
slice 0|1 begins ends
index match
{ indices } match_all
"new" match_replace




{ items list } 0|1
index Match
{ indices } Match_all

_tom

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu