Extract text from new MS-Word OOXML format
- Posted by achury Jan 26, 2009
It is really easy to extract plain text from files in the new .docx format; the real complication is getting formatted text.
Example:
I made a two-line .bat file with:
7z e -so -y forma.doc > outfile2.xml
ex doc2txt.ex outfile2.xml
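One thing to watch with the first command: a .docx file is a zip archive containing several XML parts, and `7z e -so` with no file filter writes every one of them to stdout, so text from the other parts (document properties, for example) can end up in outfile2.xml. The body text normally lives in word/document.xml. Below is a minimal sketch of doing the extraction step from Euphoria itself and asking 7z for that one part only. It assumes 7z is on the PATH and that the archive uses the usual word/document.xml layout; the path separator may need adjusting on your system.

```euphoria
-- Sketch only: run 7z from Euphoria and extract just the main document part.
-- Assumptions: 7z is on the PATH, the input is named forma.doc as in the
-- .bat file, and the body text is stored in word/document.xml.
sequence cmd
cmd = "7z e -so -y forma.doc word\\document.xml > outfile2.xml"
system(cmd, 2)   -- mode 2: do not clear the screen or restore graphics mode
```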
The code for doc2txt.ex is:
include get.e                   -- for wait_key()

integer infile, char, IsTag
sequence out, currtag

out = ""
currtag = ""
IsTag = 0

infile = open("outfile2.xml", "rb")
if infile = -1 then
    puts(1, "Cannot open outfile2.xml\n")
    abort(1)
end if

while 1 do                      -- loop until end of file
    char = getc(infile)
    if char = -1 then           -- end of file
        exit                    -- leave the main loop
    elsif IsTag then
        currtag = currtag & char
        if char = '>' then      -- the tag is complete
            IsTag = 0
            -- TODO: check currtag for known tags
        end if
    elsif char = '<' then       -- a new tag begins
        IsTag = 1
        currtag = "<"           -- start fresh, drop the previous tag
    else
        out = out & char
        -- TODO: replace CR and/or LF with a space
        -- TODO: eliminate double spaces
    end if
end while

close(infile)
puts(1, out)
if wait_key() then end if       -- pause until a key is pressed
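The code above leaves three things as comments: checking for known tags, replacing CR/LF with a space, and eliminating double spaces. Here is one possible way to fill them in, as a rough sketch rather than a finished converter. The function name strip_tags is made up for illustration, and the only "known tag" it handles is `</w:p>`, the closing tag of a WordprocessingML paragraph, which it turns into a newline.

```euphoria
-- Sketch: strip XML tags, end each paragraph with a newline,
-- and collapse runs of whitespace into a single space.
-- strip_tags is a hypothetical helper, not part of any library.

function strip_tags(sequence xml)
    sequence out, tag
    integer in_tag, c, last_space
    out = ""
    tag = ""
    in_tag = 0
    last_space = 1          -- 1 also suppresses spaces at the start of a line
    for i = 1 to length(xml) do
        c = xml[i]
        if in_tag then
            tag = tag & c
            if c = '>' then             -- the tag is complete
                in_tag = 0
                if equal(tag, "</w:p>") then
                    out = out & '\n'    -- paragraph end becomes a line break
                    last_space = 1
                end if
            end if
        elsif c = '<' then              -- a new tag begins
            in_tag = 1
            tag = "<"
        elsif c = '\r' or c = '\n' or c = '\t' or c = ' ' then
            if not last_space then      -- keep only the first space of a run
                out = out & ' '
                last_space = 1
            end if
        else
            out = out & c
            last_space = 0
        end if
    end for
    return out
end function

puts(1, strip_tags("<w:p><w:r><w:t>Hello   world</w:t></w:r></w:p>" &
                   "<w:p><w:r><w:t>Bye</w:t></w:r></w:p>"))
-- prints "Hello world" and "Bye" on separate lines
```

To use it on the real file, read outfile2.xml into a sequence with getc() as before and pass that sequence to the function. Note that this sketch does not decode XML entities such as `&amp;`, and empty self-closing paragraphs (`<w:p/>`) produce no line break.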