1. RegEx with phix - howto

hi,

I'm trying to match this with a regex:

10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7

The expression always starts with '10.'

I'm using phix and can't get it to work. Can anyone help me out?

richard


2. Re: RegEx with phix - howto

Exactly what are you trying to do?

  • what is the source line of text?
  • what is the pattern you are trying to match?

_tom


3. Re: RegEx with phix - howto

The text to regex is: 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7

It always starts with 10. and ends with a space, CR or TAB. There is no length limit; it is just a DOI number, and 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 is a good example of one. Usually they are embedded in text, at the top or the bottom of a page. The %20 stands for a hex number: a % followed by at most 2 characters.


4. Re: RegEx with phix - howto

First rule of regex: do not use regex.

Try something simple first.

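include std/search.e 
 
sequence line = "lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum." 
 
integer n, m 
 
-- locate the "10." prefix 
n =  match( "10.", line ) 
? n  
 
if n then 
	-- locate the first whitespace character after the prefix 
	m = find_any( " \t\r\n", line, n ) 
	? m 
	 
	-- no whitespace found: the DOI runs to the end of the line 
	if m = 0 then m = length(line) + 1 end if 
	 
	-- slice up to, but not including, the delimiter 
	puts(1, line[n..m-1] ) 
end if 
-->  a `DOI` number 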

_tom


5. Re: RegEx with phix - howto

Wow.

Occam's razor. I hadn't thought of that, great. Nevertheless, a regex solution would be great, if that is possible with phix.

thank you

richard


6. Re: RegEx with phix - howto

_tom said...

First rule of regex: do not use regex.

You can take my regular expressions when you pry them from my cold dead hands.

begin said...

The text to regex is: 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7

It always starts with 10. and ends with a space, CR or TAB. There is no length limit; it is just a DOI number, and 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 is a good example of one. Usually they are embedded in text, at the top or the bottom of a page. The %20 stands for a hex number: a % followed by at most 2 characters.

Are you trying to extract those hexadecimal values, or do you just want to validate that the string is a DOI number?

I found this page via a quick Google search: https://www.crossref.org/blog/dois-and-matching-regular-expressions/

The author implies that there may be several formats of DOI, each with their own unique contents.

I guess if you're looking for a quick-and-dirty DOI validation expression, use something like this: ^10.\d{4,9}[^\s]+$

Here's how I'd do it in Euphoria. (Sorry, I don't really use Phix.)

include std/regex.e 
include std/text.e 
 
constant re_doi = regex:new( `^10.\d{4,9}[^\s]+$` ) 
 
public function valid_doi( sequence doi_number ) 
    return regex:is_match( re_doi, text:trim(doi_number) ) 
end function 
 
? valid_doi( "10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7" ) 
-- prints 1 
 
? valid_doi( "help me, obi-wan kenobi" ) 
-- prints 0 

Edit: I should note, that it may not be helpful to consider a string valid if it ends in whitespace. Might want to trim() it first.

-Greg


7. Re: RegEx with phix - howto

Yes, thank you. I tried that and many others, but phix returns

^10.\d{4,9}[^\s]+$ 
              ^invalid escape (negation mismatch) 

I guess for now I have to go with Occam's razor. I can do SSNs, Dewey numbers, etc. with no problem, but DOIs #?*#s?!h=i?!t!. That sucks.

looks like i can post CDEEM tomorrow, might be interesting for some people for frequency analysis and statistical smoothing.

richard


8. Re: RegEx with phix - howto

Not doable; there needs to be some verification. Merde.

Does anybody have an idea?



10. Re: RegEx with phix - howto

I tried this using phix:

include builtins/regex.e 
regex_options( RE_PIKEVM ) 
regex_options( RE_EARLY_EXIT ) 
 
sequence line = `lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum.` 
 
sequence rx = `(10(\.)(.* ))\b` 
 
sequence DOI = regex( rx, line ) 
? DOI 
 
if DOI = {} then 
    puts(1, "not found..." ) 
else 
    puts(1, line[ DOI[1] .. DOI[2] ] ) 
end if 
 
puts(1, "\n\n" ) 
 
--> 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 l 
 
-- ? where does the trailing `l` come from? 

Removing the trailing ` l` noise may not be too difficult.

_tom


11. Re: RegEx with phix - howto

begin said...

Yes, thank you. I tried that and many others, but phix returns

^10.\d{4,9}[^\s]+$ 
              ^invalid escape (negation mismatch) 

That's probably a bug. Workaround below. I've not been well for the past few days (nothing serious, just a touch of man-flu, again) and my head's still groggy, but I'll take a closer look when my head clears.

_tom said...

? where does the trailing `l` come from?

As per the docs: "The result is pairs of start/end+1 indexes", plus your regular expression has a trailing space.

I tried this (in lieu of trying to fix anything):

include builtins\regex.e 
sequence line = "lorum ipsum lorum ipusm 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum.", 
         rx = `10.\d{4,9}\S+`, 
         res = regex(rx,line) 
integer {s,e} = res 
?line[s..e-1] 

ie: I used \S instead of [^\s], and you don't use the outer ^$ unless you need the whole string to match, and I got:

"10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7" 
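The workaround above can be wrapped in a small function that also covers the no-match case. This is only a sketch, not part of builtins\regex.e: find_doi is a hypothetical name, and it assumes regex() returns an empty sequence when nothing matches (the same assumption the `DOI = {}` test earlier in this thread makes). It also escapes the dot, since the unescaped `10.` in the patterns above accepts any character after the 10.

```phix
include builtins\regex.e

-- Hypothetical helper (not part of builtins\regex.e): return the first
-- DOI-like token found in txt, or "" when nothing matches.
function find_doi(string txt)
    -- `10\.` escapes the dot, so only a literal "10." prefix qualifies
    sequence res = regex(`10\.\d{4,9}\S+`, txt)
    -- assumption: an empty result means "no match"
    if length(res)=0 then return "" end if
    -- the first pair is the whole match, as start/end+1 indexes
    integer s = res[1], e = res[2]
    return txt[s..e-1]
end function

?find_doi("lorum ipsum 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 lorum ipsum.")
?find_doi("help me, obi-wan kenobi")
```

Returning "" rather than crashing on a line without a DOI makes it easier to call this once per line when scanning a whole page.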

Pete


12. Re: RegEx with phix - howto

Hi Pete,

I hope you are well. We had a flu epidemic like that here too, and my whole family went through it. Unfortunately a wave of TB is running now (migrants), and in our county they have already had 5 sick pupils in schools. The government now tells us there is nothing to worry about? Thanks for your help, it works well.

richard


13. Re: RegEx with phix - howto

I've just pushed a quick bugfix to builtins\regex.e, so that [^\s] is permitted and treated the same as \S

Pete


14. Re: RegEx with phix - howto

mille grazie
