Re: Stripping HTML Tags from a Text File
- Posted by achury May 30, 2009
- 1322 views
I wrote this code to extract plain, unformatted text from MS Word 2007 OOXML (.docx) files. You can also use it for HTML: simply delete the system call to 7-Zip and open your HTML file directly.
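include get.e  -- for wait_key()

integer infile
integer char
sequence out = ""
sequence currtag = ""
atom IsTag = 0

-- Unpack the document XML with 7-Zip (delete this call when reading an HTML file).
-- Note: in a standard .docx the main part is stored as word/document.xml,
-- so the path may need adjusting.
system("7z x -y document.docx document.xml", 2)

infile = open("document.xml", "rb")
if infile = -1 then
    puts(1, "Cannot open document.xml\n")
    abort(1)
end if

while 1 do  -- loop until end of file
    char = getc(infile)
    if char = -1 then  -- end of file
        exit  -- leave the main loop
    elsif IsTag then
        currtag = currtag & char
        if char = '>' then
            IsTag = 0
            -- TODO: check currtag for known tags
        end if
    elsif char = '<' then  -- begins a new tag
        IsTag = 1
        currtag = {char}  -- start a fresh tag instead of appending to the previous one
    else
        out = out & char
        -- TODO: replace CR and/or LF with a space
        -- TODO: eliminate double spaces
    end if
end while

close(infile)
puts(1, out)
if wait_key() then end if  -- wait for a keypress before exiting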
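The two whitespace steps that the code only mentions in comments (replace CR and/or LF with a space, eliminate double spaces) could be filled in roughly like this. It is only a sketch: the function name squeeze_spaces is made up for illustration, and treating tabs as whitespace is an extra assumption of mine, not something from the code above.

```euphoria
-- Hypothetical helper (not part of the original program):
-- turns CR, LF and tab into a space and collapses runs of spaces into one.
function squeeze_spaces(sequence text)
    sequence result
    integer c, last_was_space
    result = ""
    last_was_space = 1  -- starts true so leading whitespace is dropped
    for i = 1 to length(text) do
        c = text[i]
        if c = 13 or c = 10 or c = 9 then  -- CR, LF, tab (tab is an assumption)
            c = ' '
        end if
        if c = ' ' then
            if not last_was_space then
                result &= c  -- keep only the first space of a run
            end if
            last_was_space = 1
        else
            result &= c
            last_was_space = 0
        end if
    end for
    return result
end function

-- Example call on a short string
puts(1, squeeze_spaces("one\r\n two   three") & "\n")
```

You would call it on the collected text just before printing, e.g. puts(1, squeeze_spaces(out)). Doing it in one pass at the end keeps the main loop simple; a single trailing space can remain if the text ends in whitespace.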