Re: Stripping HTML Tags from a Text File

I wrote this code to extract plain, unformatted text from MS Word 2007 OOXML files, but you can also use it for HTML: simply delete the system call to 7-Zip and open your HTML file directly.

 
include get.e 
 
integer infile 
integer char 
sequence out = "" 
sequence currtag ="" 
atom IsTag =0 
 
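-- Extract the XML part from the .docx (a zip archive) with 7-Zip. 
-- Note: in a standard .docx the main part is usually stored as word/document.xml, 
-- so the member name below may need adjusting. 
system("7z x -y document.docx document.xml",2) 
 
infile = open("document.xml", "rb") 
if infile = -1 then  -- open() returns -1 on failure 
    puts(2, "Cannot open document.xml\n") 
    abort(1) 
end if 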
 
while 1 do  -- Loop forever 
    char = getc(infile) 
    if char=-1 then -- if end of file 
	exit        -- end main loop 
    else 
	if IsTag then 
	    currtag=currtag&char 
	    if char='>' then 
		IsTag=0 
		 
		-- Check for known tags 
 
	    else 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
	    end if 
	else 
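	    if char = '<' then  -- Begins a new tag! 
		IsTag=1 
		currtag={char}  -- start a fresh tag instead of appending to the previous one 
	    else 
		-- Replace CR and/or LF with space 
		if char=13 or char=10 then 
		    char=' ' 
		end if 
		-- Eliminate double space (leading spaces are dropped as well) 
		if char!=' ' then 
		    out= out & char 
		elsif length(out)>0 and out[length(out)]!=' ' then 
		    out= out & char 
		end if 
	    end if 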
	end if 
	 
    end if 
     
end while 
 
close(infile)  -- done reading 
 
puts(1, out) 
 
if wait_key() then 
end if 
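
For the HTML case, and for the "check for known tags" step that the code above leaves open, here is a minimal sketch of one way it might look. It assumes Euphoria 4 syntax. The names strip_tags and is_break_tag are hypothetical helpers, not part of the original code, and the choice of which tags end a line is only an assumption. Tags are compared exactly, so variants such as `<BR>` or `<p class="x">` would need extra handling.

```euphoria
-- Hypothetical helper: returns 1 if this tag should end a line of output.
-- The list of tags is an assumption made for illustration.
function is_break_tag(sequence tag)
    return equal(tag, "<br>") or equal(tag, "</p>") or equal(tag, "</w:p>")
end function

-- Hypothetical helper: returns text with everything between '<' and '>' removed.
function strip_tags(sequence text)
    sequence out = ""
    sequence currtag = ""
    integer in_tag = 0
    integer char

    for i = 1 to length(text) do
        char = text[i]
        if in_tag then
            currtag &= char
            if char = '>' then
                in_tag = 0
                -- A known tag: turn the end of a paragraph into a newline
                if is_break_tag(currtag) then
                    out &= '\n'
                end if
            end if
        elsif char = '<' then
            in_tag = 1
            currtag = "<"   -- start collecting a new tag
        else
            out &= char
        end if
    end for
    return out
end function

-- Prints "Hello world" and "Bye" on separate lines
puts(1, strip_tags("<p>Hello <b>world</b></p><p>Bye</p>"))
```

Working on a sequence rather than a file handle keeps the sketch easy to try. For a real HTML file the text could be loaded first, for example with read_file() from std/io.e in Euphoria 4, and then passed to strip_tags.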
 