1. Regular Expressions: find_replace_limit doesn't make all replacements

The minimal program below misses some replacements after the 6th match, then makes some of the later ones.

 
include dll.e 
include std/error.e 
 
include std/get.e 
include std/math.e 
include std/console.e 
include std/os.e 
include std/types.e 
include std/io.e 
--include machine.e 
include std/regex.e 
 
sequence fname="Y:\\temp\\ASX-screener\\2022-02-24\\9.txt" 
 
--object data = read_lines(fname) 
 
regex r = regex:new("\\s+(\\S+)", BSR_ANYCRLF ) 
tr(not regex(r), {"Error parsing regex;  %s", error_message(r)}) 
integer fin=open(fname, "r") 
object in, formatted 
 
while 1 do 
	in = gets(fin) 
	if not sequence(in) then exit end if 
	while 1 do 	--Must repeat replace to do missed replacements 
		formatted=in 
		in=regex:find_replace_limit(r, formatted, "~\\1", 20) 
		if equal(formatted, in) then exit end if 
		  --until replace makes no changes (hack; shouldn't need this loop) 
		exit  --comment out this line to repeat the replacement as a kludge fix 
	end while 
	puts(1, in & '\n') 
end while 
 
 

Text in the file (tab-separated fields):

        0.013AUD        0.00%   0.000AUD        Sell    42      1       4.379MAUD      -       -0.00AUD        -       Energy Minerals 

Gives this output:

~0.013AUD~0.00%~0.000AUD~Sell~42~1  4.379MAUD~- -0.00AUD~-  Energy~Minerals     

If I repeat the replacement, the missed matches get replaced. The same expression works without problems in super-sed 3.60 in PCRE mode.
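As an independent cross-check of the expected result, here is a minimal sketch in Python (chosen only because it is easy to run anywhere; it is not part of the original report). It applies the same expression with Python's `re.sub`, which replaces every non-overlapping match in a single pass. The sample line is assumed to use single tabs between fields and one space inside "Energy Minerals".

```python
import re

# Sample line, assuming single tabs between the fields.
line = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals"

# Same expression as in the report: a run of whitespace followed by a
# captured run of non-whitespace, replaced by '~' plus the capture.
result = re.sub(r"\s+(\S+)", r"~\1", line)

print(result)
```

Every field, including the single-character ones such as `1` and `-`, should come out prefixed with `~` and with no whitespace left over.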


2. Re: Regular Expressions: find_replace_limit doesn't make all replacements

Is this what you're looking for? It works fine for me. I also tested here: https://regex101.com/r/wWCCCJ/1

include std/regex.e 
regex pattern = regex:new( `\s+(\s+)` ) 
 
sequence string = "        0.013AUD        0.00%   0.000AUD        Sell    42      1       4.379MAUD      -       -0.00AUD        -       Energy Minerals\n" 
sequence result = regex:find_replace_limit( pattern, string, `~\1`, 20 ) 
 
? equal( result, "~ 0.013AUD~ 0.00%~ 0.000AUD~ Sell~ 42~ 1~ 4.379MAUD~ -~ -0.00AUD~ -~ Energy Minerals\n" ) 
-- prints 1, strings are equal 

PS: Use the "raw" backtick strings when writing regular expressions to avoid escaping backslashes and help prevent errors.
PPS: Use the {{{ and }}} blocks when you want to format raw plain text in forum posts. I've edited them in for you above.

-Greg


3. Re: Regular Expressions: find_replace_limit doesn't make all replacements

Thanks, Greg, for making a minimal test case. I've changed your program to match my description (tabs, not spaces, in the input) and fixed the regex: \s+(\s+) should be \s+(\S+). Thanks for the heads-up on backticks; I found no mention of them in the manual. What are the full details on them? I couldn't find much on the forum.

I used double quotes in your regex:find_replace_limit( pattern, string, `~\1`, 20 ) call, since official Eu 4.1 didn't accept the backtick form. After running it, I get the error I originally described, where several replacements weren't made, which may imply a serious error in find_replace_limit() / find_replace(). The output I get is (each space shown is one tab):

~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals

i.e., as a hex dump:

0000 7e 30 2e 30 31 33 41 55  44 7e 30 2e 30 30 25 7e  ~0.013AUD~0.00%~ 
0010 30 2e 30 30 30 41 55 44  7e 53 65 6c 6c 7e 34 32  0.000AUD~Sell~42 
0020 7e 31 09 34 2e 33 37 39  4d 41 55 44 7e 2d 09 2d  ~1.4.379MAUD~-.- 
0030 30 2e 30 30 41 55 44 7e  2d 09 45 6e 65 72 67 79  0.00AUD~-.Energy 
0040 7e 4d 69 6e 65 72 61 6c  73                       ~Minerals 

ghaberek said...

Is this what you're looking for? It works fine for me. I also tested here: https://regex101.com/r/wWCCCJ/1

Errors as noted above. How can you enter tabs there? Great link for quick testing, though. Users with super-sed can do this offline with:

sed -R -e "s/\s+(\S+)/~\1/g" 9.txt

include std/regex.e 
regex pattern = regex:new( `\s+(\S+)` )  -- ie a run of whitespace, and a capture of a run of non-whitespace. 
 
sequence string = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals" 
puts(1, regex:find_replace_limit( pattern, string, "~\\1", 20 ) )   -- i.e. like regex:split(): separate non-whitespace fields with a single '~', compressing whitespace 

4. Re: Regular Expressions: find_replace_limit doesn't make all replacements

Here is some new demo code comparing it against the equivalent, correct behaviour of find_replace_callback(). The bad output, followed by the good output, is:

~0.013AUD~0.00%~0.000AUD~Sell~42~1  4.379MAUD~- -0.00AUD~-  Energy~Minerals     
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals       

So it seems conclusive that there is a bug in Euphoria's backend implementation of find_replace_limit(). If anyone can point me to where this lives on GitHub, I'll take a quick look.

include std/regex.e 
function my_convert(sequence params) 
	return "~" & params[2]	--replace matched whitespace (\s)  with '~' and append the first capture of non-whitespace (\S) 
end function 
 
regex pattern = regex:new( `\s+(\S+)` ) 
 
sequence string = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals" 
 
puts(1, regex:find_replace_limit( pattern, string, "~\\1", 20 ) & '\n') 
puts(1, find_replace_callback(regex:new( `\s+(\S+)` ), string,routine_id("my_convert")) ) 
abort(2) 

5. Re: Regular Expressions: find_replace_limit doesn't make all replacements

Not sure I can be that much help here, but I suspect it is length 1 rather than spaces vs. tabs:

include std/regex.e  
regex pattern = regex:new( `\s+(\S+)` )  
  
sequence string = " 0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals\n"  
sequence expctd = "~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals\n" 
sequence result = regex:find_replace_limit( pattern, string, `~\1`, 20 )  
sequence reslt2 = regex:find_replace_limit( pattern, result, `~\1`, 20 )  
  
puts(1,string) 
puts(1,expctd) 
puts(1,result) 
puts(1,reslt2) 
? equal( result, expctd )  
? equal( reslt2, expctd )  

Running the above on https://tio.run/#euphoria4 gives me

 0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals 
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals 
~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals 
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals 
0 
1 

(The second shot works because you've got rid of all the length 1 previous substitutions, and that would still be true even if they were originally back-to-back)

As above, no expert here, but the code you are looking for might be in https://github.com/OpenEuphoria/euphoria/blob/99dff754918b9f66267b631d9c0be1d63c256d87/source/be_pcre.c right at the end,

start_from = ovector[rc] + rep_s->length; 

might perhaps be missing a -1 [and don't blame me if that goes into an infinite loop/make sure you test with a trailing space on the substitution]
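To make the length-1 idea concrete, here is a toy model in Python. It is not the code from be_pcre.c: the loop, the function name `replace_all`, and the `resume_offset` parameter are all assumptions made up for illustration. The only claim is that resuming the search one character too late, relative to the capture start, reproduces the output seen above, because the separator after a one-character field gets stepped over while longer fields are unaffected.

```python
import re

PATTERN = re.compile(r"\s+(\S+)")

def replace_all(text: str, resume_offset: int) -> str:
    """Toy replace loop, valid for single-character separators only.

    After each replacement the next search resumes at
    (start of the capture) + resume_offset. Both the loop and the
    offsets are hypothetical; they are not taken from be_pcre.c.
    """
    pos = 0
    while True:
        m = PATTERN.search(text, pos)
        if m is None:
            return text
        # Swap the matched whitespace for '~' and keep the capture.
        text = text[:m.start()] + "~" + m.group(1) + text[m.end():]
        pos = m.start(1) + resume_offset

line = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals"

# Offset 2 lands one past the end of the replacement when the capture
# is a single character, so the tab that follows is never examined.
one_too_far = replace_all(line, 2)
# Offset 1 never lands beyond the end of the replacement.
on_target = replace_all(line, 1)
# A second pass over the faulty result picks up the skipped fields.
second_pass = replace_all(one_too_far, 2)

print(one_too_far.replace("\t", " "))  # tabs shown as spaces
print(on_target)
print(second_pass)
```

In this model the first line printed still has separators after `1` and after each `-`, while the other two lines are fully converted, which mirrors the 0 and 1 results above. Whether the real fix is exactly a -1 on that line would still need checking against the C source.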

HTH, Pete


6. Re: Regular Expressions: find_replace_limit doesn't make all replacements

abuaf said...

Thanks for the heads-up on backticks; I found no mention of them in the manual. What are the full details on them? I couldn't find much on the forum.

If you're using Euphoria 4.1, you should have a local copy of the documentation in the "docs" folder. Raw strings are covered in section 4.1 Definition.

abuaf said...

Errors as noted above. How can you enter tabs there? Great link for quick testing, though. Users with super-sed can do this offline with:

When using raw strings, technically `\t` is the same as {'\\','t'} which of course isn't an actual Tab character like '\t'.

However, the regular expressions engine (PCRE) interprets that string as a tab character. Read more here: pcresyntax specification.
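The same distinction can be tried out in Python, used here only as an easy-to-run stand-in: Python's `r"..."` prefix plays roughly the role of Euphoria's backtick strings, and Python's `re` module (not PCRE) does the matching. Treat this as an analogy under those assumptions rather than a statement about Euphoria's internals.

```python
import re

raw = r"\t"      # two characters: a backslash and the letter t
escaped = "\t"   # one character: an actual tab

print(len(raw), len(escaped))

# The regex engine itself reads the two-character sequence \t as
# "match a tab", so both spellings find the tab in the subject.
subject = "a\tb"
raw_hit = re.search(raw, subject) is not None
escaped_hit = re.search(escaped, subject) is not None
print(raw_hit, escaped_hit)
```

So a raw string is fine for the pattern, where the engine decodes the escape, but the subject text needs a real tab character, which means an ordinary double-quoted string with `\t` in it.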

-Greg


7. Re: Regular Expressions: find_replace_limit doesn't make all replacements

petelomax said...

Not sure I can be that much help here, but I suspect it is length 1 rather than spaces vs. tabs:

(The second shot works because you've got rid of all the length 1 previous substitutions, and that would still be true even if they were originally back-to-back)

As above, no expert here, but the code you are looking for might be in https://github.com/OpenEuphoria/euphoria/blob/99dff754918b9f66267b631d9c0be1d63c256d87/source/be_pcre.c right at the end,

start_from = ovector[rc] + rep_s->length; 

might perhaps be missing a -1 [and don't blame me if that goes into an infinite loop/make sure you test with a trailing space on the substitution]

Thanks for finding this Pete. In my testing I had also determined this seemed related to the length-one capture groups being skipped due to an off-by-one error. I think you're correct about that line being the culprit here. I will do some more testing and get this resolved. (I really need to get our tickets moved to GitHub.)

-Greg
