1. Stripping HTML Tags from a Text File
- Posted by tbohon May 28, 2009
- 1345 views
I've been asked to look into a tool that will strip out all HTML tags, leaving the 'plain text' which will then be loaded into a new system. Apparently the old system only produces HTML output and the vendor isn't interested in helping us leave their software system.
Does anyone have such a routine? I can use a Regex in several other languages but it leaves a lot of 'stuff' which will have to be cleaned up manually ... and there are going to be a lot of very large files to be processed.
Thanks in advance.
Tom
2. Re: Stripping HTML Tags from a Text File
- Posted by euler May 28, 2009
- 1419 views
This is very easy to handle with a regex-capable text editor, or you can use a tool built specifically for the job. Some are freely available and others are very cheap: V-Grep comes to mind for the former and RegexBuddy for the latter.
Both can use a regex like
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
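with the "cleaned" text captured in the second group (backreference \2). Each pass removes only one level of nesting, so you may need to run this several times. Note that the pattern spells tag names as [A-Z], so it needs case-insensitive matching to catch lowercase tags.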
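As a rough illustration of "run this several times", here is a minimal Python sketch (Python is my choice, not something from the post) that applies the pattern above repeatedly until the text stops changing. The function name `unwrap_paired_tags` and the sample string are made up for the example, and the two regex flags are my assumptions about how the pattern is meant to be used.

```python
import re

# The pattern from the post. IGNORECASE because it spells tag names as [A-Z];
# DOTALL (an assumption) so an element's text may span several lines.
PAIRED_TAG = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>",
                        re.IGNORECASE | re.DOTALL)

def unwrap_paired_tags(html: str) -> str:
    """Replace each <tag>...</tag> with its inner text until nothing changes."""
    while True:
        stripped = PAIRED_TAG.sub(r"\2", html)
        if stripped == html:
            return stripped
        html = stripped

print(unwrap_paired_tags("<p>Hello <b><i>World</i></b>!</p>"))
```

The sample needs three passes, one per nesting level. Tags without a closing partner, such as <br>, never match this pattern and are left in the text.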
HTH
3. Re: Stripping HTML Tags from a Text File
- Posted by irv May 28, 2009
- 1400 views
html2txt:
Windows: http://www.bobsoft.com/html2txt/
Linux: http://www.icewalkers.com/Linux/Software/51170/html2txt.html
4. Re: Stripping HTML Tags from a Text File
- Posted by jeremy (admin) May 28, 2009
- 1343 views
I second the use of html2txt; it's a great product. If you just want a simple function to remove HTML tags, then something like this should do:
function strip_html(sequence html)
    integer tag_start
    -- "with entry" jumps to the entry label first, so tag_start is set before the test
    while tag_start != 0 with entry do
        integer tag_end = find_from('>', html, tag_start)
        if tag_end = 0 then
            puts(1, "Malformed HTML, aborting\n")
            abort(1)
        end if
        -- cut out everything from '<' through the matching '>'
        html = remove(html, tag_start, tag_end)
    entry
        tag_start = find('<', html)
    end while
    return html
end function
However, that simply strips the HTML tags; it does no conversion of HTML to text, which is what you seem to want. Nor does it truly detect malformed HTML. For instance, "<html Hello <b>World!</b>" will pass, because everything from <html through the > of <b> is stripped as one tag.
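To make that caveat concrete, here is a small Python sketch of the same find-and-remove idea (Python and the name `strip_tags` are my choices, not part of the post). It collects the text between tags instead of removing from the string in place, which is an assumption on my part to avoid rescanning from the start on large inputs.

```python
def strip_tags(html: str) -> str:
    """Remove every <...> span; raise ValueError if a '<' is never closed."""
    pieces = []
    pos = 0
    while True:
        tag_start = html.find("<", pos)
        if tag_start == -1:
            pieces.append(html[pos:])        # no more tags: keep the tail
            return "".join(pieces)
        tag_end = html.find(">", tag_start)
        if tag_end == -1:
            raise ValueError("malformed HTML: '<' without a closing '>'")
        pieces.append(html[pos:tag_start])   # keep the text before the tag
        pos = tag_end + 1                    # continue after the '>'

print(strip_tags("<html Hello <b>World!</b>"))
print(strip_tags("<p>a &amp; b</p>"))
```

The first call shows the malformed input passing silently with "Hello" swallowed as part of the tag. The second shows that entities such as &amp; come through untouched, since nothing here converts HTML to text.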
Jeremy
5. Re: Stripping HTML Tags from a Text File
- Posted by bernie May 28, 2009
- 1398 views
Tom:
Here is a shareware detagger that's try-before-you-buy.
6. Re: Stripping HTML Tags from a Text File
- Posted by gbonvehi May 28, 2009
- 1361 views
Maybe Thomas Parslow (PatRat)'s XML Parser can help you: http://www.rapideuphoria.com/eebax.zip
The problems I see with this are img tags, which old HTML didn't require to be closed, and probably literal < and > characters inside the data you're parsing (I guess this parser expects &lt; or &gt;).
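For comparison only, and not a claim about how the parser linked above behaves: a lenient HTML parser avoids both problems, because it does not require tags to be closed and it decodes entities. A minimal sketch using Python's standard `html.parser` module, where the class name `TextExtractor` and the helper `html_to_text` are made up for the example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags; the tags themselves are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &lt; &gt; &amp; ...
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    # Collapse runs of whitespace left behind where tags used to be.
    return " ".join("".join(parser.parts).split())

print(html_to_text("<p>1 &lt; 2 <img src='x.png'> ok</p>"))
```

The unclosed img tag is simply skipped and &lt; comes out as a literal <. One limitation of this sketch: text inside <script> or <style> elements would still be collected.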
7. Re: Stripping HTML Tags from a Text File
- Posted by vmars May 30, 2009
- 1389 views
- Last edited May 31, 2009
I searched Google for [remove html tags], [strip html] and [remove html script], and there are lots of things out there. The most popular seem to be JavaScript and C++.
Probably, you could make an include out of this:
Using web content in applications can be very annoying, since web content usually contains lots of HTML elements. With one simple action, using regular expressions, all of these HTML elements can be removed from the content. What's left is a clean string, without HTML formatting.
Snippet:
using System.Text.RegularExpressions;
...
public static string RemoveHTML(string in_HTML)
{
    // Lazily match from each '<' to the next '>'; (.|\n) also matches line breaks.
    return Regex.Replace(in_HTML, "<(.|\n)*?>", "");
}
Posted by Xander Zelders on http://www.dotnet4all.com/snippets/2007/04/how-to-remove-html-tags-from-web.html
If you do, could you please post it here? Thanks! ...Vern
8. Re: Stripping HTML Tags from a Text File
- Posted by vmars May 30, 2009
- 1369 views
- Last edited May 31, 2009
This gives different methods for doing the same thing, from:
http://dotnetperls.com/Content/Remove-HTML-Tags.aspx
This is all in C#. Can someone tell me what's the difference between C, C# and C++? And is it true that it's an easy conversion from C# to C? Thanks!
And he goes into detailed explanations:
1. Removing HTML from strings
First, here is a static class that tests three different ways of removing HTML tags and their contents. The methods receive string arguments and then process the string and return new strings that do not have the HTML tags. The methods have different performance characteristics. As a reminder, HTML tags start with < and end with >.
using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                // Copy only characters that are outside a tag.
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
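One difference between the "<.*?>" pattern here and the "<(.|\n)*?>" pattern in the earlier snippet is how they treat a tag that is split across lines, since "." does not match a line break by default. The Python sketch below shows the same effect with Python's `re` module, where the DOTALL flag plays the role of (.|\n). The names `TAG_SAME_LINE` and `TAG_ANY_LINE` and the sample string are made up for the example.

```python
import re

TAG_SAME_LINE = re.compile(r"<.*?>")            # '.' stops at a newline
TAG_ANY_LINE = re.compile(r"<.*?>", re.DOTALL)  # '.' also matches '\n'

# An opening tag whose attributes continue on the next line.
sample = '<a\nhref="x">link</a>'

print(repr(TAG_SAME_LINE.sub("", sample)))
print(repr(TAG_ANY_LINE.sub("", sample)))
```

On this sample the first pattern removes only </a> and leaves the two-line opening tag in place, while the second leaves just the link text. A character-by-character scan does not have this problem, because it never looks at line breaks at all.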
9. Re: Stripping HTML Tags from a Text File
- Posted by achury May 30, 2009
- 1323 views
- Last edited May 31, 2009
I wrote this code to extract plain unformatted text from MS Word 2007 OOXML files, but you can use it for HTML: simply delete the system call to 7-Zip and open your HTML file directly.
include get.e

integer infile
integer char
sequence out = ""
sequence currtag = ""
atom IsTag = 0

system("7z x -y document.docx document.xml", 2)
infile = open("document.xml", "rb")
if infile = -1 then
    puts(1, "Cannot open document.xml\n")
    abort(1)
end if

while 1 do -- Loop forever
    char = getc(infile)
    if char = -1 then -- if end of file
        exit -- end main loop
    else
        if IsTag then
            currtag = currtag & char
            if char = '>' then
                IsTag = 0
                -- Check for known tags
            else
                -- Replace CR and/or LF with space
                -- eliminate double space
            end if
        else
            if char = '<' then -- Begins a new tag!
                IsTag = 1
                currtag = {char} -- start fresh so currtag holds only this tag
            else
                out = out & char
                -- Replace CR and/or LF with space
                -- eliminate double space
            end if
        end if
    end if
end while
close(infile)

puts(1, out)
if wait_key() then end if
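The two placeholder comments (replace CR/LF with a space, eliminate double spaces) can be folded into the same character loop. Here is one possible sketch in Python, not the author's implementation: it reads the input in chunks so very large files never have to fit in memory, and it carries the "inside a tag" state across chunk boundaries. The function name `strip_tags_stream` and the chunk size are assumptions made for the example.

```python
import io

def strip_tags_stream(infile, outfile, chunk_size=64 * 1024):
    """Copy text outside <...> from infile to outfile, collapsing whitespace runs."""
    inside_tag = False      # carried across chunk boundaries
    last_was_space = True   # True at start so leading whitespace is dropped
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        kept = []
        for ch in chunk:
            if inside_tag:
                if ch == ">":
                    inside_tag = False
            elif ch == "<":
                inside_tag = True
            elif ch in "\r\n\t ":
                # CR, LF, tab and space all become a single space
                if not last_was_space:
                    kept.append(" ")
                    last_was_space = True
            else:
                kept.append(ch)
                last_was_space = False
        outfile.write("".join(kept))

# Tiny chunk size on purpose, to show that tags split across chunks still work.
src = io.StringIO("<p>Hello\r\n   <b>big</b>\nworld</p>")
dst = io.StringIO()
strip_tags_stream(src, dst, chunk_size=5)
print(repr(dst.getvalue()))
```

With real files, `infile` and `outfile` would be objects returned by `open()` in text mode. Like the original, this does not decode entities or check for malformed tags.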