1. RegEx with phix - howto
- Posted by begin Jul 03, 2018
- 1375 views
hi,
I am trying to match this with a regex in phix:
10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7
The expression always starts with '10.'
I can't get it to work. Can anyone help me out?
richard
2. Re: RegEx with phix - howto
- Posted by _tom (admin) Jul 03, 2018
- 1320 views
Exactly what are you trying to do?
- what is the source line of text?
- what is the pattern you are trying to match?
_tom
3. Re: RegEx with phix - howto
- Posted by begin Jul 03, 2018
- 1322 views
The text to regex is: 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7
It always starts with 10. and ends with a space, CR or TAB. There is no length limit; it is just a DOI number, and 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 is a good example of one. Usually they are embedded in text, at the top or the bottom of a page. The %20 stands for a hex number: a % followed by at most 2 characters.
4. Re: RegEx with phix - howto
- Posted by _tom (admin) Jul 03, 2018
- 1443 views
First rule of regex: do not use regex.
Try something simple first.
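include std/search.e

sequence line = "lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum."
integer n, m

-- locate the "10." prefix that starts every DOI
n = match( "10.", line )
? n
if n then
    -- the DOI ends at the first whitespace character after the prefix
    m = find_any( " \t\r\n", line, n )
    ? m
    if m = 0 then
        m = length(line) + 1  -- no terminator: the DOI runs to the end of the line
    end if
    puts(1, line[n..m-1] )  -- m-1 leaves out the terminating whitespace
end if
--> prints the `DOI` number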
_tom
5. Re: RegEx with phix - howto
- Posted by begin Jul 03, 2018
- 1278 views
wow
Occam's razor - I hadn't thought of that - great. Nevertheless, a regex solution would be great, if possible with phix.
thank you
richard
6. Re: RegEx with phix - howto
- Posted by ghaberek (admin) Jul 03, 2018
- 1310 views
_tom said: "First rule of regex: do not use regex."
You can take my regular expressions when you pry them from my cold dead hands.
begin said: "The text to regex is: 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7
It always starts with 10. and ends with a space, CR or TAB. There is no length limit; it is just a DOI number, and 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 is a good example of one. Usually they are embedded in text, at the top or the bottom of a page. The %20 stands for a hex number: a % followed by at most 2 characters."
Are you trying to extract those hexadecimal values, or do you just want to validate that the string is a DOI number?
I found this page via a quick Google search: https://www.crossref.org/blog/dois-and-matching-regular-expressions/
The author implies that there may be several formats of DOI, each with their own unique contents.
I guess if you're looking for a quick-and-dirty DOI validation expression, use something like this: ^10.\d{4,9}[^\s]+$
Here's how I'd do it in Euphoria. (Sorry, I don't really use Phix.)
include std/regex.e
include std/text.e

constant re_doi = regex:new( `^10.\d{4,9}[^\s]+$` )

public function valid_doi( sequence doi_number )
    return regex:is_match( re_doi, text:trim(doi_number) )
end function

? valid_doi( "10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7" ) -- prints 1
? valid_doi( "help me, obi-wan kenobi" ) -- prints 0
Edit: I should note, that it may not be helpful to consider a string valid if it ends in whitespace. Might want to trim() it first.
-Greg
7. Re: RegEx with phix - howto
- Posted by begin Jul 03, 2018
- 1276 views
Yes, thank you. I tried that and many others, but phix returns
^10.\d{4,9}[^\s]+$ ^invalid escape (negation mismatch)
I guess for now I have to take 'Occam's razor'. I can do SSNs, Dewey numbers etc. no problem, but DOIs #?*#s?!h=i?!t!. That sucks.
Looks like I can post CDEEM tomorrow; it might be interesting for some people for frequency analysis and statistical smoothing.
richard
8. Re: RegEx with phix - howto
- Posted by begin Jul 03, 2018
- 1269 views
Not doable; there needs to be some verification. Merde.
Does anybody have an idea?
10. Re: RegEx with phix - howto
- Posted by _tom (admin) Jul 03, 2018
- 1289 views
I tried this using phix:
include builtins/regex.e

regex_options( RE_PIKEVM )
regex_options( RE_EARLY_EXIT )

sequence line = `lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum.`
sequence rx = `(10(\.)(.* ))\b`
sequence DOI = regex( rx, line )
? DOI
if DOI = {} then
    puts(1, "not found..." )
else
    puts(1, line[ DOI[1] .. DOI[2] ] )
end if
puts(1, "\n\n" )
--> 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 l
-- ? where does the trailing `l` come from?
Removing the trailing ` l` noise may not be too difficult.
_tom
11. Re: RegEx with phix - howto
- Posted by petelomax Jul 03, 2018
- 1256 views
begin said: "yes thank you. i tried that and many others but phix returns
^10.\d{4,9}[^\s]+$ ^invalid escape (negation mismatch)"
That's probably a bug. Workaround below. I've not been well for the past few days (nothing serious, just a touch of man-flu, again) and my head's still groggy, but I'll take a closer look when my head clears.
_tom said: "? where does the trailing `l` come from?"
As per the docs: "The result is pairs of start/end+1 indexes", plus your regular expression has a trailing space.
I tried this (in lieu of trying to fix anything):
include builtins\regex.e

sequence line = "lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum.",
         rx = `10.\d{4,9}\S+`,
         res = regex(rx,line)
integer {s,e} = res
?line[s..e-1]
ie: I used \S instead of [^\s], and you don't use the outer ^$ unless you need the whole string to match, and I got:
"10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7"
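For anyone wanting to reuse this, here is a slightly more defensive sketch of the same idea. It is untested and rests on two assumptions: that regex() returns {} when nothing matches (as the `DOI = {}` test earlier in this thread suggests), and that `\.` is accepted as a literal dot (it was used that way in the earlier pattern). The name extract_doi is made up for illustration; it is not part of builtins\regex.e.

```phix
include builtins\regex.e

-- Hypothetical helper: returns the first DOI-like token in line, or "" if none.
function extract_doi(string line)
    -- `\.` matches a literal dot; a bare `.` would match any character
    sequence res = regex(`10\.\d{4,9}\S+`, line)
    if length(res)<2 then
        return ""               -- assumption: no match gives {}
    end if
    integer {s,e} = res         -- first pair is the whole match
    return line[s..e-1]         -- e is end+1, as per the docs
end function

?extract_doi("see 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 for details")
?extract_doi("help me, obi-wan kenobi")
```

Escaping the dot should stop the pattern from accepting something like "10x1016..." as a DOI, and the length check avoids destructuring an empty result.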
Pete
12. Re: RegEx with phix - howto
- Posted by begin Jul 04, 2018
- 1239 views
hi pete,
I hope you are well. We had such a flu epidemic here too and my whole family went through it. Unfortunately a wave of TB is running now (migrants) and in our county they have had 5 sick pupils in schools already. The government now tells us there is nothing to worry about? Thanks for your help, it works well.
richard
13. Re: RegEx with phix - howto
- Posted by petelomax Jul 05, 2018
- 1206 views
I've just pushed a quick bugfix to builtins\regex.e, so that [^\s] is permitted and treated the same as \S
Pete