1. Converting with Regular Expressions
- Posted by euphoric (admin) Jan 21, 2011
- 1496 views
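I need to convert something like this:
<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>
into
[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]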
There will be multiple instances, so I'd like to use a regex match.
Who can write the regex for this?!?! :)
2. Re: Converting with Regular Expressions
- Posted by _tom (admin) Jan 21, 2011
- 1435 views
You may avoid a regex if this works for you:
include std/sequence.e

sequence in = """<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>"""
sequence want = """[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]"""

sequence out = transmute( in,
    { {}, `<a href="http:` , `">A LINK</a>` } ,
    { {}, `[[http:` , ` -> A LINK]]` } )

puts(1, want)
puts(1, "\n" )
puts(1, out )
puts(1, "\ndone" )
tom
3. Re: Converting with Regular Expressions
- Posted by useless Jan 23, 2011
- 1287 views
Tom, that's a better example of replace_all(), I mean transmute(), than the one in the manual. Still, what does the first empty sequence in
{ {}, `<a href="http:` , `">A LINK</a>` } ,
mean?
useless
4. Re: Converting with Regular Expressions
- Posted by useless Jan 23, 2011
- 1328 views
Be aware that if you are parsing webpages out in the wild, the URLs you gave are legal, but these forms are also used (note I did not say they are legal):
<a href='http://www.mydomain.com/index.esp?var1=1&var2=2'>A LINK</a>
<a href=http://www.mydomain.com/index.esp?var1=1&var2=2>A LINK</a>
and then there are relative links following <base> links sprinkled around the page. I have also seen ' and " intermingled in the same <a../a>.
A cheap way of doing this if you don't have Eu available is to delete all the CR and LF from the text, then replace all "href=" and "</a" with CR, then sort alphabetically, and all the http entries will be grouped together almost like you asked for them. Me, I still use the strtok lib.
useless
5. Re: Converting with Regular Expressions
- Posted by ne1uno Jan 23, 2011
- 1285 views
--demo regex extracting html links & text
include std/io.e
include std/error.e
include std/regex.e as re
include std/sequence.e as seq

--put filename here to read html from file
-- may need to preprocess, split on <a>
sequence fname = ""

sequence html = seq:split(`
____<a href="http://careers.overflow.com">careers</a>
    <div id="question-header">
    <h1><a href="/questions/468/custom-regexp-function" class="question-hyperlink">Custom REGEXP Function</a></h1>
    </div>
    <a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>
`,'\n', 1)

--use object instead of regex,
-- to avoid typecheck error on bad regex to show error_message()
-- (?i) case insensitive

--extract rawlink <a..</a
object h_link = re:new("(?i)\\<a.*\\/a[ \t\\\\]*\\>")

--extract only href from rawlink, capture linkurl, linktext
object h_href = re:new("(?i)href[ \t]*=[ \t\"']*([^\"'>]*).*\\>([^<]*)\\<")

if atom(h_link) then
    crash("error h_link %s\n", {re:error_message(h_link)})
elsif atom(h_href) then
    crash("error h_href %s\n", {re:error_message(h_href)})
end if

--**
-- limitation to simplify, link is on one line,
-- but there can be other text/html on that line.
-- will mangle some/many url that misuse quotes or brackets
-- doesn't try to catch multiple links on one line
-- won't pick bare url
-- edit to suit
public function process_links(sequence lines)
    object result, resultn
    sequence line, pross = {}

    for x= 1 to length(lines) do
        if atom(lines[x]) or not length(lines[x]) then
            continue
        end if

        result = re:find(h_link, lines[x])
        if atom(result) then --no link
            continue
        end if

        -- depending too much on well formed html
        -- will miss more than one link on line
        -- all fixable with more time and error checking
        line = lines[x][result[1][1]..result[1][2]]
        resultn = re:find(h_href, line)
        if atom(resultn) then
            printf(2, "href re err=%d %s\n", {resultn, line })
            continue
        end if
        if length(resultn)<3 then
            printf(2, "href len err=%d %s\n", {length(resultn), line })
            continue
        end if

        pross = append(pross,{
            line[resultn[2][1]..resultn[2][2]], --link
            line[resultn[3][1]..resultn[3][2]], --linktext
            $
        })
        --?1/0
    end for

    return pross
end function

sequence links
if length(fname) then
    links = process_links(read_lines(fname))
else
    links = process_links(html)
end if

puts(1," Results:\n")
for i= 1 to length(links) do
    printf(1,"[[%s -> %s]]\n",{
        links[i][2],
        links[i][1],
        $
    })
end for

/*
* Posted by euphoric
I need to convert something like this:
<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>
into
[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]
I'd like to use a regex match.

note in euphoria creole it's
[[link text -> link]]
[[link | link text]]
[[linktext link]]
(hope I got that right)

Results:
[[careers -> http://careers.overflow.com]]
[[Custom REGEXP Function -> /questions/468/custom-regexp-function]]
[[A LINK -> http://www.mydomain.com/index.esp?var1=1&var2=2]]
*/
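The demo above extracts links; the original question asked for an in-place conversion of multiple instances. A minimal sketch of that, assuming find_replace() in std/regex.e replaces every match and accepts \1 and \2 back-references in the replacement text as its documentation describes. The pattern, the variable names (anchor, page) and the sample text are illustrative assumptions, not something posted in this thread. The pattern tolerates double-quoted, single-quoted and unquoted href values, but it is not a general HTML parser.

```euphoria
include std/error.e
include std/regex.e as re

-- Hypothetical pattern (an assumption, not from the thread):
--   group 1 = the href value, with or without surrounding quotes
--   group 2 = the link text up to the closing </a>
-- Any attributes after href (class=..., etc.) are skipped by [^>]*.
object anchor = re:new(`(?i)<a\s+href\s*=\s*["']?([^"'\s>]+)["']?[^>]*>([^<]*)</a>`)

-- Same error-check style as the demo above: new() yields an atom on failure.
if atom(anchor) then
    crash("error anchor %s\n", {re:error_message(anchor)})
end if

-- Made-up sample text with two links on one line.
sequence page = `see <a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a> and <a href='/faq'>FAQ</a>`

-- Rewrite each anchor as [[url -> text]], the order asked for in the first post.
puts(1, re:find_replace(anchor, page, `[[\1 -> \2]]`) & "\n")
```

Because the replacement is applied per match, this form does not depend on each link sitting on its own line, which is the main limitation the demo lists.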
6. Re: Converting with Regular Expressions
- Posted by useless Jan 24, 2011
- 1113 views
Tom, that's a better example of replace_all(), I mean transmute(), than the one in the manual. Still, what does the first empty sequence in
{ {}, `<a href="http:` , `">A LINK</a>` } ,
mean?
useless
7. Re: Converting with Regular Expressions
- Posted by _tom (admin) Jan 24, 2011
- 1099 views
{ {}, `<a href="http:` , `">A LINK</a>` } ,
I too am curious. I did notice that leaving out {} makes things not work.
Will the author of transmute() please help out?
8. Re: Converting with Regular Expressions
- Posted by mattlewis (admin) Jan 24, 2011
- 1103 views
By default, this routine operates on single elements from each of the arguments. That is to say, it scans source_data for elements that match any single element in current_items and when matched, replaces that with a single element from new_items.
For example, you can find all occurrences of 'h', 's', and 't' in a string and replace them with '1', '2', and '3' respectively.
transmute(SomeString, "hst", "123")
However, the routine can also be used to scan for sub-sequences and/or replace matches with sequences rather than single elements. This is done by making the first element in current_items and/or new_items an empty sequence.
Emphasis added.
I didn't write the routine, but I guess the author chose this somewhat awkward method over a separate routine or additional flag.
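To make the two modes from the manual excerpt concrete, here is a small sketch, assuming transmute() from std/sequence.e behaves as the excerpt says. The sample strings and variable names are made up for illustration.

```euphoria
include std/sequence.e

-- Element mode: no leading {}. Each single element of the second
-- argument is replaced by the element at the same position in the
-- third argument ('h' -> '1', 's' -> '2', 't' -> '3').
sequence digits = transmute("this hat", "hst", "123")
puts(1, digits & "\n")

-- Sub-sequence mode: the leading {} in each list is only a flag.
-- It tells transmute() that the remaining items are whole
-- sub-sequences to find and whole sequences to put in their place.
sequence wiki = transmute(`<a href="http://example.com">A LINK</a>`,
    { {}, `<a href="`, `">A LINK</a>` },
    { {}, `[[`,        ` -> A LINK]]` })
puts(1, wiki & "\n")
```

Without the leading {} the second call would be read in element mode, with each string treated as one element to look for, which would explain why leaving it out makes things not work.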
Matt