1. Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by abuaf Feb 26, 2022
- 1198 views
The minimal program below misses a few replacements after the 6th match, then makes a few more after that.
include dll.e
include std/error.e
include std/get.e
include std/math.e
include std/console.e
include std/os.e
include std/types.e
include std/io.e
--include machine.e
include std/regex.e

sequence fname="Y:\\temp\\ASX-screener\\2022-02-24\\9.txt"
end if
--object data = read_lines(fname)
regex r = regex:new("\\s+(\\S+)", BSR_ANYCRLF )
tr(not regex(r), {"Error parsing regex; %s", error_message(r)})
integer fin=open(fname, "r")
object in, formatted
while 1 do
    in = gets(fin)
    if not sequence(in) then exit end if
    while 1 do --Must repeat replace to do missed replacements
        formatted=in
        in=regex:find_replace_limit(r, formatted, "~\\1", 20)
        if equal(formatted, in) then exit end if --until replace makes no changes (hack; shouldnt need this loop)
        exit --comment out this line to repeat the replacement as a cludge fix
    end while
    puts(1, in & '\n')
end while
Text in file (\t tab-separated fields);
0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals
Gives output;
~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals
If I repeat the replacement, then the missing matches get done. No problem using super-sed 3.60 and PCRE mode.
2. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by ghaberek (admin) Feb 26, 2022
- 1192 views
Is this what you're looking for? It works fine for me. I also tested here: https://regex101.com/r/wWCCCJ/1
include std/regex.e

regex pattern = regex:new( `\s+(\s+)` )
sequence string = " 0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals\n"
sequence result = regex:find_replace_limit( pattern, string, `~\1`, 20 )

? equal( result, "~ 0.013AUD~ 0.00%~ 0.000AUD~ Sell~ 42~ 1~ 4.379MAUD~ -~ -0.00AUD~ -~ Energy Minerals\n" ) -- prints 1, strings are equal
PS: Use the "raw" backtick strings when writing regular expressions to avoid escaping backslashes and help prevent errors.
PPS: Use the {{{ and }}} blocks when you want to format raw plain text in forum posts. I've edited them in for you above.
-Greg
3. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by abuaf Feb 27, 2022
- 1160 views
Thanks Greg for making a minimal test case. I've fixed your program to match my description (tabs, not spaces, on input) and fixed the error: \s+(\s+) should be \s+(\S+). Thanks for the heads up on backticks; no mention in the manual. What are the full details on them? Couldn't find much on the forum.
Used double quotes for your regex:find_replace_limit( pattern, string, `\1`, 20 ) since official eu 4.1 didn't accept this. After running, I get the error as I originally described, where several replacements weren't made, which may imply a serious error in find_replace_limit() / find_replace(). The output I get is (each space is 1 tab);
~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals
ie
0000 7e 30 2e 30 31 33 41 55 44 7e 30 2e 30 30 25 7e ~0.013AUD~0.00%~
0010 30 2e 30 30 30 41 55 44 7e 53 65 6c 6c 7e 34 32 0.000AUD~Sell~42
0020 7e 31 09 34 2e 33 37 39 4d 41 55 44 7e 2d 09 2d ~1.4.379MAUD~-.-
0030 30 2e 30 30 41 55 44 7e 2d 09 45 6e 65 72 67 79 0.00AUD~-.Energy
0040 7e 4d 69 6e 65 72 61 6c 73 ~Minerals
Is this what you're looking for? It works fine for me. I also tested here: https://regex101.com/r/wWCCCJ/1
Errors as noted above; how can you enter tabs? But great link for quick testing. Users with super-sed can do this offline with;
sed -R -e "s/\s+(\S+)/~\1/g" 9.txt
include std/regex.e

regex pattern = regex:new( `\s+(\S+)` ) -- ie a run of whitespace, and a capture of a run of non-whitespace.
sequence string = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals"
puts(1, regex:find_replace_limit( pattern, string, "~\\1", 20 ) ) --ie like regex:split() to separate non-whitespace fields with a single '~` by compressing whitespace .
4. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by abuaf Feb 27, 2022
- 1117 views
Some new demo code comparing against the equivalent, correct behaviour of find_replace_callback(). Comparative bad then good output is;
~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals
So it would seem conclusive that there is a bug in Euphoria's backend implementation of find_replace_limit(). If anyone wants to point me to where this is on GitHub, I'll take a quick look.
include std/regex.e

function my_convert(sequence params)
    return "~" & params[2] --replace matched whitespace (\s) with '~' and append the first capture of non-whitespace (\S)
end function

regex pattern = regex:new( `\s+(\S+)` )
sequence string = "\t0.013AUD\t0.00%\t0.000AUD\tSell\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals"
puts(1, regex:find_replace_limit( pattern, string, "~\\1", 20 ) & '\n')
puts(1, find_replace_callback(regex:new( `\s+(\S+)` ), string,routine_id("my_convert")) )
abort(2)
5. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by petelomax Feb 28, 2022
- 1128 views
Not sure I can be that much help here, but I suspect it is length 1 rather than spaces vs. tabs:
include std/regex.e

regex pattern = regex:new( `\s+(\S+)` )
sequence string = " 0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals\n"
sequence expctd = "~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals\n"
sequence result = regex:find_replace_limit( pattern, string, `~\1`, 20 )
sequence reslt2 = regex:find_replace_limit( pattern, result, `~\1`, 20 )
puts(1,string)
puts(1,expctd)
puts(1,result)
puts(1,reslt2)
? equal( result, expctd )
? equal( reslt2, expctd )
Running the above on https://tio.run/#euphoria4 gives me
 0.013AUD 0.00% 0.000AUD Sell 42 1 4.379MAUD - -0.00AUD - Energy Minerals
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals
~0.013AUD~0.00%~0.000AUD~Sell~42~1 4.379MAUD~- -0.00AUD~- Energy~Minerals
~0.013AUD~0.00%~0.000AUD~Sell~42~1~4.379MAUD~-~-0.00AUD~-~Energy~Minerals
0
1
(The second shot works because you've got rid of all the length 1 previous substitutions, and that would still be true even if they were originally back-to-back)
As above, no expert here, but the code you are looking for might be in https://github.com/OpenEuphoria/euphoria/blob/99dff754918b9f66267b631d9c0be1d63c256d87/source/be_pcre.c right at the end,
start_from = ovector[rc] + rep_s->length;
might perhaps be missing a -1 [and don't blame me if that goes into an infinite loop/make sure you test with a trailing space on the substitution]
HTH, Pete
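For anyone following along, the length-1 idea can be tried out with a toy model. The sketch below is not the be_pcre.c code: `replace_model` and its `skip` parameter are made up for illustration. It assumes the backend rewrites the text in place and resumes scanning a fixed 3 characters (the length of the template `~\1`) after the start of each match, which is only a guess at what the faulty offset does. Under that assumption a 2-character replacement such as `~1` leaves the resume point just past the next tab, so the following field is missed.

```euphoria
include std/regex.e

-- Toy model only (hypothetical helper, not part of std/regex.e).
-- Rewrites `text` in place. skip = 0 resumes right after each replacement
-- (the intended behaviour); any other value resumes `skip` characters
-- after the start of the match (the assumed faulty offset).
function replace_model(sequence text, integer skip)
    regex re = regex:new(`\s+(\S+)`)
    object m
    sequence rep
    integer from = 1
    while from <= length(text) do
        m = regex:find(re, text, from)
        if atom(m) then exit end if          -- no further match
        -- m[1] is the whole match, m[2] the first capture, each {start, end}
        rep = "~" & text[m[2][1] .. m[2][2]]
        text = text[1 .. m[1][1]-1] & rep & text[m[1][2]+1 .. $]
        if skip = 0 then
            from = m[1][1] + length(rep)     -- just past the replacement
        else
            from = m[1][1] + skip            -- fixed distance from match start
        end if
    end while
    return text
end function

sequence s = "\t42\t1\t4.379MAUD\t-\t-0.00AUD\t-\tEnergy Minerals"
puts(1, replace_model(s, 0) & '\n')
puts(1, replace_model(s, 3) & '\n')          -- 3 = length(`~\1`)
```

With skip 0 every field gets its `~`. With skip 3 the field after each one-character capture keeps its tab, which is the same pattern as the output reported above.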
6. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by ghaberek (admin) Feb 28, 2022
- 1091 views
Thanks for the heads up on backticks; no mention in the manual. What are the full details on them? Couldn't find much on the forum.
If you're using Euphoria 4.1, you should have a local copy of the documentation in the "docs" folder. Raw strings are covered in section 4.1 Definition.
Errors as noted above; how can you enter tabs? But great link for quick testing. Users with super-sed can do this offline with;
When using raw strings, technically `\t` is the same as {'\\','t'} which of course isn't an actual Tab character like '\t'.
However, the regular expressions engine (PCRE) interprets that string as a tab character. Read more here: pcresyntax specification.
-Greg
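A quick way to see the raw-string point for yourself is to compare lengths. This is only a small sketch using the built-in `length()` and `equal()`; nothing here depends on std/regex.e.

```euphoria
-- Raw (backtick) string: the backslash is kept, so this is '\' followed by 't'.
? length(`\t`)                       -- prints 2
-- Double-quoted string: the escape is processed into a single Tab character.
? length("\t")                       -- prints 1
-- The same regex pattern written both ways is the identical sequence.
? equal(`\s+(\S+)`, "\\s+(\\S+)")    -- prints 1
```

So a pattern can be written either way, but a literal Tab in the text being searched still needs "\t" in a double-quoted string.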
7. Re: Regular Expressions: find_replace_limit doesn't make all replacements
- Posted by ghaberek (admin) Feb 28, 2022
- 1096 views
Not sure I can be that much help here, but I suspect it is length 1 rather than spaces vs. tabs:
(The second shot works because you've got rid of all the length 1 previous substitutions, and that would still be true even if they were originally back-to-back)
As above, no expert here, but the code you are looking for might be in https://github.com/OpenEuphoria/euphoria/blob/99dff754918b9f66267b631d9c0be1d63c256d87/source/be_pcre.c right at the end,
start_from = ovector[rc] + rep_s->length;
might perhaps be missing a -1 [and don't blame me if that goes into an infinite loop/make sure you test with a trailing space on the substitution]
Thanks for finding this Pete. In my testing I had also determined this seemed related to the length-one capture groups being skipped due to an off-by-one error. I think you're correct about that line being the culprit here. I will do some more testing and get this resolved. (I really need to get our tickets moved to GitHub.)
-Greg