
Thread: Trim 'empty' HTML 'whitespace'

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Apr 2007
    Location
    The Netherlands
    Posts
    5,070

    Trim 'empty' HTML 'whitespace'

    Hi,

    I have a web application where the user uses an HTML editor to input some formatted text to show on the website. This text is stored, as HTML, in a database. It turns out now, after some time, that users are sometimes entering whitespace (newlines, spaces, etc) before and/or after the text they are entering, and this whitespace shows up in the rendered text on the page. Multiple sets of this text are displayed top to bottom, and due to this whitespace the separation between bits of text varies, which doesn't look very pretty.

    So, I've been asked to get rid of the whitespace automatically.

    That seemed like an easy task at first: I'd just call Trim on the string. But that won't work, because the 'whitespace' that gets rendered isn't made of whitespace characters at all. It's a bunch of <br />, <br> or <br></br> tags, non-breaking spaces (&nbsp;), and sometimes even empty paragraphs (<p></p>). So calling Trim won't help; I need to get rid of all the markup that would render as whitespace.

    The HTML editor control I'm using may or may not have built-in support for this, but that doesn't matter: the database is already full of texts that contain these newline tags, so the best option I can see is to trim this whitespace just before displaying it on the page.


    For clarification, the HTML stored in the database might look like this
    Code:
    <br />
    <p>This is some text<br /></p>
    <p><b>Some bold text</b><br /><br />&nbsp;</p>
    <br /><br />
    As you can imagine, there will be an empty line at the top and 4 empty lines at the bottom of this text, once rendered to the page. I want to get rid of these.

    The problem is that I can already think of a great number of tag combinations that all render as whitespace. Removing them one by one would probably work, but I'm sure I'd forget some combinations, and in time someone would manage to create one of those and I'd have to edit the website again.

    For example, these are straightforward:
    Code:
    <br>
    <br />
    <br></br>
    &nbsp;
    <p></p>
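
    For these straightforward cases a single pattern is probably enough. Here is a minimal sketch, in Python purely for illustration (the name trim_simple and the exact pattern are assumptions, not something the editor control provides). It strips a run of 'blank' tokens from the start and from the end of the string, and nothing in between.

```python
import re

# One "blank" token: a <br> variant, &nbsp;, an empty <p></p>, or real whitespace.
BLANK = r"(?:<br\s*/?>(?:\s*</br>)?|&nbsp;|<p>\s*</p>|\s)"
LEADING = re.compile(rf"^{BLANK}+", re.IGNORECASE)
TRAILING = re.compile(rf"{BLANK}+$", re.IGNORECASE)


def trim_simple(html):
    """Remove blank tokens from both ends of the markup, leaving the middle alone."""
    return TRAILING.sub("", LEADING.sub("", html))


print(trim_simple("<br />\n<p>Hi</p>\n<br /><br />"))
```

    The pattern uses only basic regex constructs, so it should carry over to other regex engines, but it only sees blank tokens that sit at the very edge of the string.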
    These however would also produce whitespace (view each line separately):
    Code:
    <p>Some text<br /><br /></p>
    Some <b>text<br /><br /><br /></b>
    So simply stripping 'br' tags from the start and end of the string won't work; they can be inside other tags such as <b> and <p> and <i>, etc, and as long as there is no more text after those other tags, they will all render as whitespace at the end which I want to get rid of...



    There must be some kind of way to do this? Am I missing something obvious, or is it really as hard as I think it is?

  2. #2
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,687

    Re: Trim 'empty' HTML 'whitespace'

    It's going to take several layers....
    First I'd remove all CR/LFs ... since they don't render in HTML anyway, there's no sense in storing them.
    Then I'd replace all <br> with <br />
    Next I'd loop, replacing double breaks <br /><br /> with single breaks <br /> until there are no more double breaks
    Next I'd look for <br /></p> and replace them with </p>
    Next would be replacing <p>&nbsp;</p> with an empty string
    And lastly, replacing <p><br /></p> with an empty string as well...
    Unfortunately that won't prevent them from doing this: <p><br /><i>&nbsp;</i><br /></p> or some other junk...
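
    Taken literally, those layers might look like the sketch below (Python only for illustration; scrub_layers is a made-up name). Two things in it are assumptions beyond the steps listed: the normalising step also folds <br/> and <br></br> into <br />, and a final step removes <p></p>, because the <br /></p> replacement turns <p><br /></p> into <p></p> before the last replacement gets to see it.

```python
import re


def scrub_layers(html):
    # 1. Drop CR and LF characters.
    html = html.replace("\r", "").replace("\n", "")
    # 2. Normalise <br>, <br/> and <br></br> to <br /> (the last two are an added assumption).
    html = re.sub(r"<br\s*/?>(?:</br>)?", "<br />", html, flags=re.IGNORECASE)
    # 3. Collapse double breaks until none are left.
    while "<br /><br />" in html:
        html = html.replace("<br /><br />", "<br />")
    # 4. A break right before a closing paragraph tag adds nothing.
    html = html.replace("<br /></p>", "</p>")
    # 5. and 6. Remove paragraphs holding only &nbsp; or only a break.
    html = html.replace("<p>&nbsp;</p>", "").replace("<p><br /></p>", "")
    # Extra step (assumption): step 4 leaves <p></p> behind, so remove that too.
    html = html.replace("<p></p>", "")
    return html


print(scrub_layers("<p>a<br>\r\n<br></p>"))
print(scrub_layers("<p><br /><i>&nbsp;</i><br /></p>"))
```

    The second call shows the junk case: the trailing break goes, but the paragraph itself survives. Note also that step 1 joins words that were separated only by a line break, and step 3 collapses intentional blank lines in the middle of the text as well.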

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  3. #3

    Thread Starter
    PowerPoster
    Join Date
    Apr 2007
    Location
    The Netherlands
    Posts
    5,070

    Re: Trim 'empty' HTML 'whitespace'

    And that's the reason I don't like tackling this problem myself. I could theoretically try to catch every single possibility, but that's impossible; there are always going to be combinations of tags that I'm not catching. Of course, the chance of this HTML editor producing such a combination should become smaller and smaller, but still... I've seen these editors output some horrible HTML. Sometimes they don't even close their tags, resulting in the rest of the page suddenly being bold...

    That's why I am hoping there is some kind of built-in feature I'm missing. I'm displaying this HTML in a Literal control (ASP.NET), so perhaps I can do something clever with that to trim the rendered whitespace... Maybe I should have posted this in the ASP.NET section after all.

  4. #4
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,687

    Re: Trim 'empty' HTML 'whitespace'

    Yeah, I know... unfortunately there's only so much one can do... unless the editor you're using has some kind of built-in scrubber/cleaner that can recognize empty tags...
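
    Short of a built-in scrubber, recognising empty tags can be approximated without listing every combination: split the markup into tags and text, find the first and last piece of real content, and throw away breaks, blanks and empty wrapper elements outside that range. The sketch below is Python for illustration only, not a feature of any editor control. The names (trim_html, WRAPPERS) and the list of wrapper tags are assumptions, and it expects reasonably well-formed markup.

```python
import re

# Split markup into tags and text runs; a stray "<" is kept as its own token.
TOKEN = re.compile(r"<[^>]+>|[^<]+|<")
# Text made up only of whitespace and &nbsp; entities renders as nothing.
BLANK_TEXT = re.compile(r"(?:\s|&nbsp;)*", re.IGNORECASE)
# Assumption: containers that are safe to look inside and to drop when empty.
WRAPPERS = {"p", "b", "i", "u", "em", "strong", "span", "div", "font"}


def tag_name(token):
    """Lower-cased element name of a tag token, or None for text."""
    match = re.match(r"</?([a-z0-9]+)", token, re.IGNORECASE)
    return match.group(1).lower() if match else None


def is_content(token):
    """True for anything not known to be blank (unknown tags count as content)."""
    if token.startswith("<"):
        return tag_name(token) not in (WRAPPERS | {"br"})
    return BLANK_TEXT.fullmatch(token) is None


def drop_empty_pairs(tags):
    """Cancel an opening wrapper tag against the matching closing tag right after it."""
    kept = []
    for tag in tags:
        opens_before = kept and not kept[-1].startswith("</")
        if tag.startswith("</") and opens_before and tag_name(kept[-1]) == tag_name(tag):
            kept.pop()
        else:
            kept.append(tag)
    return kept


def trim_html(html):
    tokens = TOKEN.findall(html)
    content = [i for i, tok in enumerate(tokens) if is_content(tok)]
    if not content:
        return ""  # nothing but blank markup
    first, last = content[0], content[-1]

    def edge(run):
        # Outside the content only wrapper tags survive; breaks and blanks go.
        return drop_empty_pairs([t for t in run if tag_name(t) in WRAPPERS])

    body = tokens[first:last + 1]
    if not body[0].startswith("<"):
        body[0] = re.sub(r"^(?:\s|&nbsp;)+", "", body[0], flags=re.IGNORECASE)
    if not body[-1].startswith("<"):
        body[-1] = re.sub(r"(?:\s|&nbsp;)+$", "", body[-1], flags=re.IGNORECASE)
    return "".join(edge(tokens[:first]) + body + edge(tokens[last + 1:]))


print(trim_html("Some <b>text<br /><br /><br /></b>"))
```

    Breaks between the first and last piece of content are left alone, so only the leading and trailing blank lines disappear. Any tag outside the wrapper list (an image, a table, a comment) counts as content and stops the trim, which errs on the side of removing too little.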

    -tg

  5. #5
    Hyperactive Member
    Join Date
    Apr 2011
    Location
    England
    Posts
    421

    Re: Trim 'empty' HTML 'whitespace'

    You could look at the HTML source in sections e.g. Paragraphs and Non-Paragraphs:

    Say you had:

    <body>
    <br />
    <p>This is some text<br /></p>
    <a>Something Here</a>
    </body>

    If you split it as suggested you would look at the sections as follows:

    Section1:
    <body>
    <br />

    Section2:
    <p>This is some text<br /></p>

    Section3:
    <a>Something Here</a>
    </body>

    Then you can apply the relevant logic to each section e.g.

    1. If the section is a paragraph, then temporarily strip the p tags and keep checking whether it ends with a whitespace tag. If it does, remove the tag.

    2. If the section is not a paragraph, then just replace all occurrences of the whitespace tags with an empty string, provided it is safe to assume that this is OK (e.g. if <br /> is not freely used as part of the standard layout after images etc.)

    After manipulating each section rebuild the HTML in a StringBuilder until you have processed the whole file.
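
    This is not the working example mentioned below, just a rough sketch of the idea in Python for illustration. The names clean_sections and WS_TAG are made up, a plain list stands in for the StringBuilder, and dropping a paragraph that ends up empty is an extra assumption.

```python
import re

# A paragraph section: opening tag, inner markup, closing tag.
PARAGRAPH = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.IGNORECASE | re.DOTALL)
# The "whitespace tags": any <br> variant and &nbsp;.
WS_TAG = r"<br\s*/?>|</br>|&nbsp;"


def strip_ws_tags(section):
    """Rule 2: remove every whitespace tag from a non-paragraph section."""
    return re.sub(WS_TAG, "", section, flags=re.IGNORECASE)


def clean_sections(html):
    parts = []  # stands in for the StringBuilder
    pos = 0
    for match in PARAGRAPH.finditer(html):
        # Non-paragraph section in front of this paragraph.
        parts.append(strip_ws_tags(html[pos:match.start()]))
        # Rule 1: set the p tags aside and trim whitespace tags off the end.
        inner = re.sub(rf"(?:{WS_TAG}|\s)+$", "", match.group(2), flags=re.IGNORECASE)
        if inner:  # assumption: a paragraph left empty is dropped entirely
            parts.append(match.group(1) + inner + match.group(3))
        pos = match.end()
    parts.append(strip_ws_tags(html[pos:]))  # whatever follows the last paragraph
    return "".join(parts)


page = "<body>\n<br />\n<p>This is some text<br /></p>\n<a>Something Here</a>\n</body>"
print(clean_sections(page))
```

    A paragraph whose trailing breaks sit inside another tag, such as <p>Some <b>text<br /></b></p>, is not trimmed by this sketch, because the inner markup then ends with </b> rather than with a whitespace tag.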

    I made a working example that does all of the above. If you think it would be a practical solution for you then let me know and I will post it.
