Hi,

I have a web application where users enter formatted text through an HTML editor, to be shown on the website. This text is stored, as HTML, in a database. After some time it has turned out that users sometimes enter whitespace (newlines, spaces, etc.) before and/or after their text, and this whitespace shows up in the rendered text on the page. Several of these texts are displayed one below the other, and because of the whitespace the spacing between them varies, which doesn't look very pretty.

So, I've been asked to get rid of the whitespace automatically.

It seemed like an easy task at first: I'd just call Trim on the string. But that won't work, because the 'whitespace' that gets rendered isn't actually whitespace characters at all. It's a bunch of <br />, <br> or <br></br> tags, non-breaking spaces such as &nbsp;, and sometimes even empty paragraphs (<p></p>). So calling Trim won't help; I need to get rid of all the markup that would render as whitespace.
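To illustrate the "manual" approach, here is a minimal sketch that strips only the simple tokens listed above from the start and end of the string. It is written in Python purely for illustration, and the names BLANK, LEADING, TRAILING and trim_simple are made up for this example. It assumes the tokens appear at the top level of the string, and it does nothing about blank markup nested inside other tags.

```python
import re

# One "blank" token: <br>, <br/>, <br />, a stray </br>, &nbsp;,
# an empty <p></p>, or a real whitespace character.
BLANK = r"(?:<br\s*/?>|</br\s*>|&nbsp;|<p>\s*</p>|\s)"
LEADING = re.compile(r"^" + BLANK + r"+", re.IGNORECASE)
TRAILING = re.compile(BLANK + r"+$", re.IGNORECASE)


def trim_simple(html: str) -> str:
    """Strip top-level blank tokens from both ends of an HTML fragment."""
    return TRAILING.sub("", LEADING.sub("", html))
```

With this, "<br /><p>Hi</p><br></br>&nbsp;" becomes "<p>Hi</p>", but a string such as "<p>Some text<br /><br /></p>" comes back unchanged, because its trailing breaks sit inside the <p> element.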

The HTML editor control I'm using may or may not have built-in support for this, but that doesn't matter: the database is already full of texts that contain these newline tags. So the best option I can see is to simply trim this whitespace before displaying the text on the page.


For clarification, the HTML stored in the database might look like this:
Code:
<br />
<p>This is some text<br /></p>
<p><b>Some bold text</b><br /><br />&nbsp;</p>
<br /><br />
As you can imagine, there will be an empty line at the top and four empty lines at the bottom of this text once it is rendered to the page. I want to get rid of these.

The problem is that I can already think of a great number of different combinations of tags that all render as whitespace. Removing them all manually would probably work, but I am sure I would forget some combinations. In time, someone would manage to create one of those, and I'd have to edit the website again.

For example, these are straightforward:
Code:
<br>
<br />
<br></br>
&nbsp;
<p></p>
These, however, would also produce whitespace (view each line separately):
Code:
<p>Some text<br /><br /></p>
Some <b>text<br /><br /><br /></b>
So simply stripping 'br' tags from the start and end of the string won't work. They can be inside other tags such as <b>, <p> and <i>, and as long as there is no more text after them, they will all render as whitespace at the end, which is what I want to get rid of.
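One way to handle the nested cases is to parse the HTML into a tree and prune blank nodes from both edges, descending into wrapper elements along the way. The following is only a rough sketch of that idea, in Python for illustration, using the standard library's html.parser. The names Node, TreeBuilder, trim_edge and trim_html, and the VOID and TRANSPARENT tag sets, are assumptions made for this example. It drops comments, does not repair badly nested markup, and treats any tag outside TRANSPARENT as real content.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}        # tags without an end tag
TRANSPARENT = {"p", "div", "span", "b", "i", "u", "em", "strong", "font"}


class Node:
    def __init__(self, tag, start_text):
        self.tag, self.start_text, self.children = tag, start_text, []


class TreeBuilder(HTMLParser):
    """Builds a very small tree; text and entities are kept as plain strings."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &nbsp; as written
        self.root = Node("", "")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.get_starttag_text())
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # covers <br /> and the odd <br></br>
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)

    def handle_entityref(self, name):
        self.stack[-1].children.append("&" + name + ";")

    def handle_charref(self, name):
        self.stack[-1].children.append("&#" + name + ";")


def is_blank(node):
    """True if the node would render as nothing but whitespace."""
    if isinstance(node, str):
        return node.replace("&nbsp;", "").replace("&#160;", "").strip() == ""
    if node.tag == "br":
        return True
    return node.tag in TRANSPARENT and all(is_blank(c) for c in node.children)


def trim_edge(children, index):
    """Prune blank nodes from one edge: index 0 is leading, -1 is trailing."""
    while children:
        node = children[index]
        if is_blank(node):
            children.pop(index)
            continue
        if not isinstance(node, str) and node.tag in TRANSPARENT:
            trim_edge(node.children, index)  # blanks may hide inside wrappers
        return


def serialize(node):
    if isinstance(node, str):
        return node
    if node.tag in VOID:
        return node.start_text
    inner = "".join(serialize(c) for c in node.children)
    return node.start_text + inner + "</" + node.tag + ">"


def trim_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    trim_edge(builder.root.children, 0)
    trim_edge(builder.root.children, -1)
    return "".join(serialize(c) for c in builder.root.children)
```

Under these assumptions, trim_html("Some <b>text<br /><br /><br /></b>") gives "Some <b>text</b>", and the database example above is reduced to its two paragraphs with the break inside the first paragraph left in place. A real HTML parsing library would cope with malformed input far better than this hand-rolled tree.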



There must be some kind of way to do this? Am I missing something obvious, or is it really as hard as I think it is?