1. Stripping HTML Tags from a Text File

I've been asked to look into a tool that will strip out all HTML tags, leaving the 'plain text' which will then be loaded into a new system. Apparently the old system only produces HTML output and the vendor isn't interested in helping us leave their software system.

Does anyone have such a routine? I can use a Regex in several other languages but it leaves a lot of 'stuff' which will have to be cleaned up manually ... and there are going to be a lot of very large files to be processed.

Thanks in advance.

Tom

2. Re: Stripping HTML Tags from a Text File

This is easy to handle with a regex-capable text editor, or you can use a dedicated tool for it. Some are freely available and others are very cheap: V-Grep comes to mind for the former and RegexBuddy for the latter.

Both can use a regex like
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
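with the "cleaned" text captured in the second group (referenced as \2 in the replacement). The pattern needs case-insensitive matching to catch lowercase tag names, and because each pass only unwraps the outermost matching pair, you may need to run it several times on nested tags.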
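A minimal sketch of the "run this several times" idea, written in Python rather than in either tool mentioned above. The function name `unwrap_paired_tags` is made up for illustration, and the `IGNORECASE` and `DOTALL` flags are assumptions about how the pattern is meant to be applied.

```python
import re

# The pattern from the post; IGNORECASE and DOTALL are assumptions so that
# lowercase tags and tags whose content spans several lines also match.
PAIRED_TAG = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>",
                        re.IGNORECASE | re.DOTALL)


def unwrap_paired_tags(html: str) -> str:
    """Replace each <tag>...</tag> pair with its content until nothing changes."""
    while True:
        stripped = PAIRED_TAG.sub(r"\2", html)  # keep only the second group
        if stripped == html:                    # fixed point: no pairs left
            return stripped
        html = stripped
```

Note that unpaired tags such as `<br>` never match this pattern and are left in place, which may be part of the 'stuff' that still needs cleaning up afterwards.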

HTH

3. Re: Stripping HTML Tags from a Text File

html2txt:

Windows: http://www.bobsoft.com/html2txt/

Linux: http://www.icewalkers.com/Linux/Software/51170/html2txt.html

4. Re: Stripping HTML Tags from a Text File

I second the use of html2txt; it's a great product. If you just want a simple function to remove HTML tags, then something like this should do:

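function strip_html(sequence html)
	integer tag_start

	-- "with entry" starts the loop at the entry label, so tag_start
	-- is assigned before the condition is tested for the first time
	while tag_start != 0 with entry do
		-- look for the '>' that closes the tag opened at tag_start
		integer tag_end = find_from('>', html, tag_start)
		if tag_end = 0 then
			puts(1, "Malformed HTML, aborting\n")
			abort(1)
		end if

		-- cut out everything from '<' through '>' inclusive
		html = remove(html, tag_start, tag_end)
	entry
		-- find the next '<' in what is left; 0 means no more tags
		tag_start = find('<', html)
	end while

	return html
end function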

However, that simply strips the HTML tags; it does no conversion of HTML to text, which is what you seem to want. It also does not truly detect malformed HTML. For instance, this will pass: "<html Hello <b>World!</b>", because everything from <html through <b> is stripped as one tag.
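To illustrate the malformed-input caveat, here is a rough Python sketch of a stricter variant that refuses input in which a second '<' appears before the closing '>'. The name `strip_html_strict` is hypothetical, and the check is only a heuristic: it assumes '<' never legitimately occurs inside a tag, which is not true of every real page (quoted attribute values and comments may contain it).

```python
def strip_html_strict(html: str) -> str:
    """Strip <...> spans, rejecting a '<' that appears inside an open tag."""
    out = []
    pos = 0
    while True:
        tag_start = html.find("<", pos)
        if tag_start == -1:
            out.append(html[pos:])          # no more tags: keep the tail
            return "".join(out)
        tag_end = html.find(">", tag_start)
        if tag_end == -1:
            raise ValueError(f"'<' at offset {tag_start} is never closed")
        # Heuristic check (an assumption, see above): a nested '<' means
        # the previous tag was probably left unterminated.
        if html.find("<", tag_start + 1, tag_end) != -1:
            raise ValueError(f"second '<' before '>' after offset {tag_start}")
        out.append(html[pos:tag_start])     # keep the text before the tag
        pos = tag_end + 1                   # continue after the '>'
```

Collecting the kept pieces in a list and joining once avoids rebuilding the whole string for every tag, which should matter on very large files.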

Jeremy

5. Re: Stripping HTML Tags from a Text File

Tom:

Here is a shareware detagger that's try-before-you-buy.

http://www.jafsoft.com/products/

6. Re: Stripping HTML Tags from a Text File

Maybe Thomas Parslow (PatRat)'s XML Parser can help you: http://www.rapideuphoria.com/eebax.zip
The problems I see with this are img tags, which old HTML didn't require to be closed, and possibly raw < > characters inside the data you're parsing (I guess this parser expects &lt; or &gt;).
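For comparison, a lenient HTML parser (as opposed to a strict XML one) copes with unclosed img tags. The sketch below uses Python's standard html.parser module, not the eebax parser linked above; the class and function names are made up for illustration, and skipping script and style content is an added assumption about what 'plain text' should contain.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True turns &lt;, &gt;, &amp; etc. into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Because entities are decoded, escaped &lt; and &gt; in the data come out as literal characters in the result.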

7. Re: Stripping HTML Tags from a Text File

I searched Google for [remove html tags], [strip html] and [remove html script], and there are lots of things out there. The most popular seem to be JavaScript and C+.

Probably, you could make an include out of this:

Using web content in applications can be very annoying, since web content usually contains lots of HTML elements. With one simple action, using regular expressions, all of these HTML elements can be removed from the content. What's left is a clean string, without HTML formatting. Snippet:

using System.Text.RegularExpressions;
...

public static string RemoveHTML(string in_HTML) { return Regex.Replace(in_HTML, "<(.|\n)*?>", ""); }

Posted by Xander Zelders on http://www.dotnet4all.com/snippets/2007/04/how-to-remove-html-tags-from-web.html

If you do, could you please post it here? Thanks! ...Vern
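As a rough illustration of why the quoted pattern uses (.|\n) instead of a plain dot, here is the same idea as a Python sketch. The function names are hypothetical; re.DOTALL plays the role of (.|\n) by letting the dot match newlines, so a tag that is split across lines is still removed.

```python
import re

# Equivalent in spirit to "<(.|\n)*?>": DOTALL lets '.' match newlines too.
TAG_ANY_LINE = re.compile(r"<.*?>", re.DOTALL)

# Without DOTALL, a tag containing a line break is not matched at all.
TAG_ONE_LINE = re.compile(r"<.*?>")


def remove_html(text: str) -> str:
    """Remove every <...> span, including ones that contain newlines."""
    return TAG_ANY_LINE.sub("", text)


def remove_html_single_line(text: str) -> str:
    """Same, but only for tags that open and close on one line."""
    return TAG_ONE_LINE.sub("", text)
```

Generated HTML often wraps long tags over several lines, so the multi-line form is probably the safer default for bulk conversion.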

8. Re: Stripping HTML Tags from a Text File

This gives different methods for doing the same thing, from:

http://dotnetperls.com/Content/Remove-HTML-Tags.aspx

This is all in C#. Can someone tell me what's the difference between C, C# and C++? And is it true that it's an easy conversion from C# to C? -Thanks!

And he goes into detailed explanations:

Removing HTML from strings

First, here is a static class that tests three different ways of removing HTML tags and their contents. The methods receive string arguments and then process the string and return new strings that do not have the HTML tags. The methods have different performance characteristics. As a reminder, HTML tags start with < and end with >.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
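Since the original question mentions very large files, the character-by-character approach can be adapted to work on a stream instead of one big string. This is a Python sketch under that assumption; the function name and the chunk size are illustrative and are not part of the quoted article. The inside flag is kept across reads, so a tag that straddles two chunks is still handled.

```python
import io


def strip_tags_stream(src, dst, chunk_size=64 * 1024):
    """Copy text from src to dst, dropping everything between '<' and '>'."""
    inside = False                      # True while we are within a tag
    while True:
        chunk = src.read(chunk_size)
        if not chunk:                   # empty read means end of input
            break
        kept = []
        for ch in chunk:
            if ch == "<":
                inside = True
            elif ch == ">":
                inside = False
            elif not inside:
                kept.append(ch)
        dst.write("".join(kept))


if __name__ == "__main__":
    source = io.StringIO("<p>Hello <b>World</b></p>")
    target = io.StringIO()
    strip_tags_stream(source, target, chunk_size=4)
    print(target.getvalue())
```

Like the char-array method it mirrors, this drops a stray '>' that appears in ordinary text and does not decode entities.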

9. Re: Stripping HTML Tags from a Text File

I wrote this code to extract plain unformatted text from MS Word 2007 OOXML files, but you can use it for HTML as well: simply delete the system call to 7-Zip and open your HTML file directly.

 
include get.e 
 
integer infile 
integer char 
sequence out = "" 
sequence currtag ="" 
atom IsTag =0 
 
system("7z x -y document.docx document.xml",2) 
 
infile = open("document.xml", "rb") 
 
while 1 do  -- Loop forever 
    char = getc(infile) 
    if char=-1 then -- if end of file 
	exit        -- end main loop 
    else 
	if IsTag then 
	    currtag=currtag&char 
	    if char='>' then 
		IsTag=0 
		 
		-- Check for known tags here, then reset the buffer for the next tag
		currtag = ""
 
	    else 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
	    end if 
	else 
	    if char = '<' then  --Begins a new tag! 
		IsTag=1 
		currtag=currtag&char 
	    else 
		out= out & char 
		-- Replace CR and/or LF with space 
		-- eliminate double space 
		 
	    end if 
	end if 
	 
    end if 
     
end while 
 
puts(1, out) 
 
if wait_key() then 
end if 
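The two placeholder comments in the program above (replace CR and/or LF with a space, eliminate double spaces) are left unimplemented. One possible way to fill them in, sketched in Python and applied to the finished output rather than character by character, is shown below; the function name is hypothetical.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Turn CR/LF runs into one space, then collapse repeated spaces."""
    text = re.sub(r"[\r\n]+", " ", text)   # CR, LF or CRLF runs -> one space
    return re.sub(r" {2,}", " ", text)     # two or more spaces -> one space
```

Doing this after tag removal means spaces that only become adjacent once a tag has been stripped are collapsed as well.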
 