Re: RegEx with phix - howto
- Posted by ghaberek (admin) Jul 03, 2018
First rule of regex: do not use regex.
You can take my regular expressions when you pry them from my cold dead hands.
the text to regex is: 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7
It always starts with 10. and ends with a space, CR, or TAB. There is no length limit; it is just a DOI number, and 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7 is a good example of one. Usually they are embedded in text, at the top or the bottom of a page. The %20 stands for a hex number: a % followed by at most 2 characters.
Are you trying to extract those hexadecimal values, or do you just want to validate that the string is a DOI number?
I found this page via a quick Google search: https://www.crossref.org/blog/dois-and-matching-regular-expressions/
The author implies that there may be several formats of DOI, each with their own unique contents.
I guess if you're looking for a quick-and-dirty DOI validation expression, use something like this: ^10\.\d{4,9}[^\s]+$
Here's how I'd do it in Euphoria. (Sorry, I don't really use Phix.)
include std/regex.e
include std/text.e

-- a literal "10.", then 4 to 9 digits, then any run of non-whitespace
constant re_doi = regex:new( `^10\.\d{4,9}[^\s]+$` )

public function valid_doi( sequence doi_number )
    -- trim() strips surrounding whitespace so a trailing CR or TAB cannot break the match
    return regex:is_match( re_doi, text:trim(doi_number) )
end function

? valid_doi( "10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7" ) -- prints 1
? valid_doi( "help me, obi-wan kenobi" ) -- prints 0
Edit: I should note that it may not be helpful to consider a string valid if it ends in whitespace. Might want to trim() it first.
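If the goal is extraction rather than validation, the same idea works without the ^ and $ anchors. Here is a rough sketch in Python rather than Euphoria or Phix, only because its re module is easy to check. The names extract_dois and percent_escapes are made up for illustration, and the patterns are the quick-and-dirty kind, not a full DOI grammar. The pattern syntax is ordinary Perl-style regex, so it should carry over to other engines, but I have not tried it in Phix.

```python
import re

# Unanchored variant of the validation pattern (an assumption, not a full DOI
# grammar): a token that starts with "10.", has 4 to 9 digits, and then runs
# until the next whitespace character (space, CR, LF or TAB).
# (?<!\S) makes sure the match starts at the beginning of a token.
DOI_IN_TEXT = re.compile(r"(?<!\S)10\.\d{4,9}\S+")

# A percent escape as described in the question: "%" plus one or two hex digits.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{1,2}")


def extract_dois(text):
    """Return every DOI-like token found in text, in order of appearance."""
    return DOI_IN_TEXT.findall(text)


def percent_escapes(doi):
    """Return the %xx escapes found inside a single DOI string."""
    return PERCENT_ESCAPE.findall(doi)


sample = "See 10.1016.12.31/nature.S0735%20-1097(98)2000/12/31/34:7-7\tfor details."
dois = extract_dois(sample)
print(dois)                      # the one DOI embedded in the sample line
print(percent_escapes(dois[0]))  # ['%20']
```

Note that because the match runs to the next whitespace, a DOI followed directly by a comma or full stop will keep that punctuation, so you may want to strip it afterwards.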
-Greg