Extract text from new MS-Word OOXML format


It is really easy to extract plain text from files in the new .docx format; the real complication is getting formatted text.

Example:
I made a two-line .bat file with:

7z e -so -y forma.doc > outfile2.xml
ex doc2txt.ex outfile2.xml

The code for doc2txt.ex is:

include get.e 
 
integer infile 
integer char 
sequence out = "" 
sequence currtag ="" 
atom IsTag =0 
 
 
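infile = open("outfile2.xml", "rb") 
if infile = -1 then  -- open() returns -1 on failure 
    puts(2, "Cannot open outfile2.xml\n") 
    abort(1) 
end if 
 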
while 1 do  -- Loop forever 
    char = getc(infile) 
    if char=-1 then -- if end of file 
	exit        -- end main loop 
    else 
	if IsTag then 
	    currtag=currtag&char 
	    if char='>' then 
		IsTag=0 
		 
		-- Check for known tags here 
		currtag = ""  -- reset so the next tag starts empty 
	    else 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
	    end if 
	else 
	    if char = '<' then  --Begins a new tag! 
		IsTag=1 
		currtag=currtag&char 
	    else 
		out= out & char 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
		 
	    end if 
	end if 
	 
    end if 
     
end while 
 
close(infile) 
puts(1, out) 
 
-- Wait for a key press before exiting; the empty if discards the return value 
if wait_key() then 
end if 
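
The placeholder comments above ("Check for known tags", "Replace CR and/or LF with space", "eliminate double space") could be filled in roughly as below. This is only a sketch, not a finished converter: the function name xml_to_text is made up for illustration, and it assumes the only tag worth reacting to is the closing paragraph tag </w:p>, which it turns into a line break. Empty paragraphs written as <w:p/>, tabs and breaks such as <w:tab/> and <w:br/>, and XML entities like &amp;amp; are not handled.

```euphoria
-- Sketch: strip tags, end each paragraph with a newline,
-- and collapse runs of whitespace into a single space.
-- xml_to_text is a hypothetical helper, not part of any library.
function xml_to_text(sequence xml)
    sequence out = ""
    sequence currtag = ""
    integer in_tag = 0
    integer c

    for i = 1 to length(xml) do
        c = xml[i]
        if in_tag then
            currtag &= c
            if c = '>' then
                in_tag = 0
                -- Known tag: a closing paragraph tag becomes a line break
                if equal(currtag, "</w:p>") then
                    out &= '\n'
                end if
                currtag = ""
            end if
        elsif c = '<' then
            -- Begins a new tag
            in_tag = 1
            currtag = "<"
        else
            -- Replace CR, LF and tab with a space
            if c = '\r' or c = '\n' or c = '\t' then
                c = ' '
            end if
            if c = ' ' then
                -- Keep a space only after visible text
                if length(out) > 0 then
                    if out[$] != ' ' and out[$] != '\n' then
                        out &= c
                    end if
                end if
            else
                out &= c
            end if
        end if
    end for
    return out
end function

puts(1, xml_to_text("<w:p><w:r><w:t>Hello  world</w:t></w:r></w:p>"))
```

To use it with the program above, the whole file would have to be read into one sequence first and passed to xml_to_text in place of the getc loop.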
 