OpenEuphoria: Forum: Converting with Regular Expressions

1. Converting with Regular Expressions

Posted by euphoric (admin) Jan 21, 2011

I need to convert something like this:

<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>

into

[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]

There will be multiple instances, so I'd like to use a regex match.

Who can write the regex for this?!?! :)

2. Re: Converting with Regular Expressions

Posted by _tom (admin) Jan 21, 2011

You may avoid a regex if this works for you:

include std/sequence.e 
 
sequence in = 
"""<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>"""  
 
sequence want = 
"""[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]""" 
 
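-- Note: the link text is hardcoded, so this only fully converts
-- links whose text is exactly "A LINK".
sequence out = transmute( in,  
{ {},   `<a href="http:`    ,    `">A LINK</a>`  } , 
{ {},   `[[http:`           ,    ` -> A LINK]]`  }    ) 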
 
puts(1, want) puts(1, "\n" ) 
puts(1, out ) 
 
puts(1, "\ndone" )

tom

3. Re: Converting with Regular Expressions

Posted by useless Jan 23, 2011

Tom, that's a better example of replace_all(), I mean transmute(), than the one in the manual. Still, what does the first empty sequence in

{ {},   `<a href="http:`    ,    `">A LINK</a>`  } ,

mean?

useless

4. Re: Converting with Regular Expressions

Posted by useless Jan 23, 2011

euphoric said...

I need to convert something like this:

<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a>

into

[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]]

There will be multiple instances, so I'd like to use a regex match.

Who can write the regex for this?!?! :)

Be aware that if you are parsing web pages out in the wild, the URL form you gave is legal, but these forms are also used (note I did not say they are legal):

<a href='http://www.mydomain.com/index.esp?var1=1&var2=2'>A LINK</a>
<a href=http://www.mydomain.com/index.esp?var1=1&var2=2>A LINK</a>

And then there are relative links, which depend on <base> tags sprinkled around the page. I have also seen ' and " intermingled in the same <a../a>.

A cheap way of doing this if you don't have Eu available is to delete all the CR and LF from the text, then replace all "href=" and "</a" with CR, then sort alphabetically; all the http entries will be grouped together, almost like you asked for them. Me, I still use the strtok lib.

useless
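
The quoting variants listed in the post above (double quotes, single quotes, no quotes) can be covered by one pattern. What follows is only a minimal sketch, not anyone's posted solution: the pattern, the sample text and the names link_re, sample and converted are assumptions made for illustration. It expects href to be the first attribute and the link text to contain no nested tags, and it makes no attempt to resolve relative or <base> links.

```euphoria
include std/regex.e as re

-- Assumed pattern: optional quote, URL, optional quote, any other
-- attributes, then plain link text up to the closing tag.
object link_re = re:new(`(?i)<a\s+href\s*=\s*["']?([^"'\s>]+)["']?[^>]*>([^<]*)</a\s*>`)

if atom(link_re) then
	puts(2, "could not compile link_re\n")
	abort(1)
end if

-- Hypothetical sample with the three quoting styles on one line.
sequence sample = `<a href="http://a.example/x?p=1">One</a> <a href='http://b.example/'>Two</a> <a href=http://c.example/>Three</a>`

-- \1 is the captured URL, \2 the captured link text.
sequence converted = re:find_replace(link_re, sample, `[[\1 -> \2]]`)

puts(1, converted & "\n")
```

Because find_replace() rewrites every match in the text, multiple links are handled in one call. Links that mix ' and " inside one tag will still be mangled.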

5. Re: Converting with Regular Expressions

Posted by ne1uno Jan 23, 2011

--demo regex extracting html links & text 
 
 
include std/io.e 
include std/error.e 
include std/regex.e as re 
include std/sequence.e as seq 
 
--put filename here to read html from file 
-- may need to preprocess, split on <a> 
sequence fname = ""   
 
sequence html =  seq:split(` 

____<a href="http://careers.overflow.com">careers</a> 
     <div id="question-header"> 
	<h1><a href="/questions/468/custom-regexp-function" class="question-hyperlink">Custom REGEXP Function</a></h1> 
	</div> 
	<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a> 
      `,'\n', 1) 

 
--use object instead of regex, 
-- to avoid typecheck error on bad regex to show error_message() 
-- (?i) case insensitive 
 
--extract rawlink <a..</a 
object h_link = re:new("(?i)\\<a.*\\/a[ \t\\\\]*\\>") 
 
--extract only href from rawlink, capture linkurl, linktext 
object h_href = re:new("(?i)href[ \t]*=[ \t\"']*([^\"'>]*).*\\>([^<]*)\\<") 
 
if atom(h_link) then 
	crash("error h_link %s\n", {re:error_message(h_link)}) 
elsif atom(h_href) then 
	crash("error h_href %s\n", {re:error_message(h_href)}) 
end if 
 
--** 
-- limitation to simplify, link is on one line, 
-- but there can be other text/html on that line. 
-- will mangle some/many url that misuse quotes or brackets 
-- doesn't try to catch multiple links on one line 
-- won't pick bare url  
-- edit to suit 
 
public function process_links(sequence lines) 
	object result, resultn 
	sequence line, pross = {} 
 
	for x= 1 to length(lines) do 
		if atom(lines[x]) or not length(lines[x]) then continue end if 
 
		result = re:find(h_link, lines[x]) 
         
		if atom(result) then --no link 
			continue  
		end if 
		 
        -- depending too much on well formed html 
        -- will miss more than one link on line 
        -- all fixable with more time and error checking 
		 
		 
		line = lines[x][result[1][1]..result[1][2]] 
		resultn = re:find(h_href, line) 
         
		if atom(resultn) then  
			printf(2, "href re err=%d %s\n", {resultn, line }) 
			continue  
		end if 
		if length(resultn)<3 then 
			printf(2, "href len err=%d %s\n", {length(resultn), line }) 
			continue  
		end if 
		pross = append(pross,{ 
					line[resultn[2][1]..resultn[2][2]], --link 
					line[resultn[3][1]..resultn[3][2]], --linktext 
					$ 
					}) 
 		--?1/0 
	end for 
 
	return pross 
end function 
 
sequence links 
if length(fname) then 
	links = process_links(read_lines(fname)) 
else 
	links = process_links(html) 
end if 
 
puts(1," Results:\n") 
for i= 1 to length(links) do 
	printf(1,"[[%s -> %s]]\n",{ 
			links[i][2], 
			links[i][1], 
			$ 
			}) 
end for 
 
/* 

    *  Posted by euphoric  
I need to convert something like this: 
<a href="http://www.mydomain.com/index.esp?var1=1&var2=2">A LINK</a> 
into 
[[http://www.mydomain.com/index.esp?var1=1&var2=2 -> A LINK]] 
 
I'd like to use a regex match. 
 
note in euphoria creole it's 
 [[link text -> link]] 
 [[link | link text]] 
 [[linktext link]] 
(hope I got that right) 
 
 Results: 
[[careers -> http://careers.overflow.com]] 
[[Custom REGEXP Function -> /questions/468/custom-regexp-function]] 
[[A LINK -> http://www.mydomain.com/index.esp?var1=1&var2=2]] 
 
*/
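
The demo above notes that it does not try to catch multiple links on one line. One possible way around that, sketched here under assumptions rather than taken from the post, is to let all_matches() from std/regex.e return every match together with its captured groups. The pattern, the sample line and the names each_link, line, found and wiki are hypothetical; the output order is url -> text, as in the original question.

```euphoria
include std/regex.e as re

-- Assumed pattern: href may follow other attributes; [^<]* keeps each
-- match inside a single <a>...</a> pair.
object each_link = re:new(`(?i)<a\s[^>]*href\s*=\s*["']?([^"'\s>]+)[^>]*>([^<]*)</a\s*>`)
if atom(each_link) then
	puts(2, "could not compile each_link\n")
	abort(1)
end if

-- Hypothetical line holding two links.
sequence line = `<p><a href="/one">First</a> and <a class="x" href="/two">Second</a></p>`

sequence wiki = {}
object found = re:all_matches(each_link, line)

-- all_matches() returns an atom when nothing matched; otherwise each
-- entry is {whole match, group 1 (url), group 2 (link text)}.
if sequence(found) then
	for i = 1 to length(found) do
		wiki = append(wiki, sprintf("[[%s -> %s]]", {found[i][2], found[i][3]}))
	end for
end if

for i = 1 to length(wiki) do
	puts(1, wiki[i] & "\n")
end for
```

Link text containing nested tags is skipped by this pattern, so the "edit to suit" caveat still applies.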

6. Re: Converting with Regular Expressions

Posted by useless Jan 24, 2011

useless said...

Tom, that's a better example of replace_all(), I mean transmute(), than the one in the manual. Still, what does the first empty sequence in

{ {},   `<a href="http:`    ,    `">A LINK</a>`  } ,

mean?

useless

7. Re: Converting with Regular Expressions

Posted by _tom (admin) Jan 24, 2011

useless said...

{ {},   `<a href="http:`    ,    `">A LINK</a>`  } ,

I too am curious. I did notice that leaving out {} makes things not work.

Will the author of transmute() please help out?

8. Re: Converting with Regular Expressions

Posted by mattlewis (admin) Jan 24, 2011

_tom said...

useless said...

{ {},   `<a href="http:`    ,    `">A LINK</a>`  } ,

I too am curious. I did notice that leaving out {} makes things not work.

Will the author of transmute() please help out?

man:std_sequence#transmute

TFM said...

By default, this routine operates on single elements from each of the arguments. That is to say, it scans source_data for elements that match any single element in current_items and when matched, replaces that with a single element from new_items.

For example, you can find all occurrences of 'h', 's', and 't' in a string and replace them with '1', '2', and '3' respectively.

transmute(SomeString, "hst", "123")

However, the routine can also be used to scan for sub-sequences and/or replace matches with sequences rather than single elements. This is done by making the first element in current_items and/or new_items an empty sequence.

Emphasis added.

I didn't write the routine, but I guess the author chose this somewhat awkward method over a separate routine or additional flag.

Matt
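
The two modes described in the manual excerpt can be put side by side. This is a small sketch with made-up sample strings; the variable names by_char and by_word are hypothetical.

```euphoria
include std/sequence.e

-- Element mode: each character found in "hst" is replaced by the
-- character at the same position in "123" (h->1, s->2, t->3).
sequence by_char = transmute("the school shoes", "hst", "123")

-- Sub-sequence mode: the leading {} in each list tells transmute()
-- that the remaining items are whole sequences, not single elements.
sequence by_word = transmute("John Smith enjoys apples.",
		{{}, "John", "apple"},
		{{}, "Mary", "banana"})

puts(1, by_char & "\n")
puts(1, by_word & "\n")
```

Without the leading {}, a list such as {"John", "apple"} would be treated as two single elements to look for in the source, and neither equals any individual character of a string, which would explain why leaving out {} "makes things not work".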
