OpenEuphoria: Forum: Re: Stripping HTML Tags from a Text File

Re: Stripping HTML Tags from a Text File

Posted by achury May 30, 2009

I wrote this code to extract plain, unformatted text from MS Word 2007 OOXML (.docx) files. You can also use it for HTML: simply delete the system call to 7-Zip and open your HTML file directly.

 
include get.e 
 
integer infile 
integer char 
sequence out = "" 
sequence currtag ="" 
atom IsTag =0 
 
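-- The body text lives at word/document.xml inside the .docx archive; 
-- "e" extracts it into the current directory without the folder path 
system("7z e -y document.docx word/document.xml",2) 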
 
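infile = open("document.xml", "rb") 
if infile = -1 then  -- open() returns -1 on failure 
    puts(1, "Cannot open document.xml\n") 
    abort(1) 
end if 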
 
while 1 do  -- Loop forever 
    char = getc(infile) 
    if char=-1 then -- if end of file 
	exit        -- end main loop 
    else 
	if IsTag then 
	    currtag=currtag&char 
	    if char='>' then 
		IsTag=0 
		 
		-- Check for known tags 
 
	    else 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
	    end if 
	else 
	    if char = '<' then  --Begins a new tag! 
		IsTag=1 
		currtag={char}  -- start a fresh buffer so currtag holds only this tag 
	    else 
		out= out & char 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
		 
	    end if 
	end if 
	 
    end if 
     
end while 
 
close(infile) 
 
puts(1, out) 
 
if wait_key() then 
end if
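
The two placeholder comments in the loop ("Replace CR and/or LF with space", "eliminate double space") are left unimplemented above. One way to fill them in is sketched below, as a separate pass over the finished text rather than inside the loop. The name collapse_whitespace is hypothetical, it also treats tabs as whitespace (an assumption beyond the original comments), and it expects a flat string such as out.

```euphoria
-- Hypothetical helper, not part of the original program.
-- Turns CR, LF and TAB into spaces and squeezes runs of spaces into one.
function collapse_whitespace(sequence text)
    sequence result
    integer c, last_was_space

    result = ""
    last_was_space = 0
    for i = 1 to length(text) do
        c = text[i]
        if c = '\r' or c = '\n' or c = '\t' then
            c = ' '                     -- normalise line breaks and tabs
        end if
        if c = ' ' then
            if not last_was_space then
                result = result & c     -- keep only the first space of a run
            end if
            last_was_space = 1
        else
            result = result & c
            last_was_space = 0
        end if
    end for
    return result
end function

puts(1, collapse_whitespace("Hello\r\n  world") & "\n")
```

In the program above, the final output line could then become puts(1, collapse_whitespace(out)). Doing this as a second pass keeps the tag-stripping loop unchanged, at the cost of walking the text twice.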