Spiga

Cut HTML string without breaking the tags

by Gabi Solomon

On a recent project I had to take some HTML source code and break it into several pieces, all without breaking the HTML tags. After struggling for a while I started googling for a solution from the most inspired developers :D. After trying several different snippets I came upon the code below. It is a method from CakePHP and it did the job perfectly.

I hope you're lucky enough to find this article and save yourself some time.
Cheers

[php]
/**
 * Truncates text.
 *
 * Cuts a string to the length of $length and replaces the last characters
 * with the ending if the text is longer than length.
 *
 * @param string $text String to truncate.
 * @param integer $length Length of returned string, including ellipsis.
 * @param string $ending Ending to be appended to the trimmed string.
 * @param boolean $exact If false, $text will not be cut mid-word
 * @param boolean $considerHtml If true, HTML tags would be handled correctly
 * @return string Trimmed string.
 */
function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if ($considerHtml) {
        // if the plain text is shorter than the maximum length, return the whole text
        if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }

        // splits all html-tags to scannable lines
        preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);

        $total_length = strlen($ending);
        $open_tags = array();
        $truncate = '';

        foreach ($lines as $line_matchings) {
            // if there is any html-tag in this line, handle it and add it (uncounted) to the output
            if (!empty($line_matchings[1])) {
                // if it's an "empty element" with or without xhtml-conform closing slash (f.e. <br/>)
                if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
                    // do nothing
                // if tag is a closing tag (f.e. </b>)
                } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
                    // delete tag from $open_tags list
                    $pos = array_search($tag_matchings[1], $open_tags);
                    if ($pos !== false) {
                        unset($open_tags[$pos]);
                    }
                // if tag is an opening tag (f.e. <b>)
                } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
                    // add tag to the beginning of $open_tags list
                    array_unshift($open_tags, strtolower($tag_matchings[1]));
                }
                // add html-tag to $truncate'd text
                $truncate .= $line_matchings[1];
            }

            // calculate the length of the plain text part of the line; handle entities as one character
            $content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
            if ($total_length + $content_length > $length) {
                // the number of characters which are left
                $left = $length - $total_length;
                $entities_length = 0;
                // search for html entities
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
                    // calculate the real length of all entities in the legal range
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entities_length <= $left) {
                            $left--;
                            $entities_length += strlen($entity[0]);
                        } else {
                            // no more characters left
                            break;
                        }
                    }
                }
                $truncate .= substr($line_matchings[2], 0, $left + $entities_length);
                // maximum length is reached, so get off the loop
                break;
            } else {
                $truncate .= $line_matchings[2];
                $total_length += $content_length;
            }

            // if the maximum length is reached, get off the loop
            if ($total_length >= $length) {
                break;
            }
        }
    } else {
        if (strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = substr($text, 0, $length - strlen($ending));
        }
    }

    // if the words shouldn't be cut in the middle...
    if (!$exact) {
        // ...search the last occurrence of a space...
        $spacepos = strrpos($truncate, ' ');
        // strrpos() returns false when there is no space at all
        if ($spacepos !== false) {
            // ...and cut the text in this position
            $truncate = substr($truncate, 0, $spacepos);
        }
    }

    // add the defined ending to the text
    $truncate .= $ending;

    if ($considerHtml) {
        // close all unclosed html-tags
        foreach ($open_tags as $tag) {
            $truncate .= '</' . $tag . '>';
        }
    }

    return $truncate;
}
[/php]
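
To see how the method walks the markup, here is a minimal sketch of its first step on its own. It runs the same splitting pattern over a short string and collects the (tag, text) pairs the main loop then iterates over. The sample input and the `$pairs` and `$visible` variables are illustrative assumptions, not part of the original method.

```php
<?php
// Hypothetical sample input; the pattern is the one used in truncate().
$html = '<p>Hello <b>big</b> world</p>';

// Each match is an optional tag followed by the text up to the next tag.
preg_match_all('/(<.+?>)?([^<>]*)/s', $html, $lines, PREG_SET_ORDER);

$pairs = array();
$visible = 0;
foreach ($lines as $m) {
    // The pattern also matches one empty string at the very end; skip it.
    if ($m[0] === '') {
        continue;
    }
    $pairs[] = array($m[1], $m[2]);
    // Only the text part counts towards the length limit.
    $visible += strlen($m[2]);
}

// $pairs is now:
//   array('<p>',  'Hello ')
//   array('<b>',  'big')
//   array('</b>', ' world')
//   array('</p>', '')
// and $visible is 15, the length of "Hello big world".
```

Because tags are appended to the output without being counted, only the second element of each pair uses up the `$length` budget.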

  • fornetti

    I do not believe this

  • http://www.gsdesign.ro/ Gabi Solomon

    try it

  • Sean NIeuwoudt

    Thanks for this code, works like a charm

  • http://www.gsdesign.ro/ Gabi Solomon

    You're welcome.

  • http://xeoncross.com David

    Thanks for the code snippet! I just started printing out short excerpts from an HTML page and I didn’t want to strip_tags() just to get it to display right. ;)

  • Thomas

    Works fine. It does cut into UTF-8 encoded strings though, so an improvement would be to scan for UTF-8 sequences and not cut right into them…
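
A minimal sketch of the UTF-8 point raised above, assuming the input is valid UTF-8. The method counts and cuts bytes with `strlen()` and `substr()`, so a cut can land inside a multi-byte character. The helper name `utf8_cut` is hypothetical and not part of the original method; where the mbstring extension is installed, `mb_substr()` does the same job.

```php
<?php
/**
 * Hypothetical helper: returns the first $chars characters of a UTF-8
 * string without splitting a multi-byte sequence.
 */
function utf8_cut($text, $chars) {
    // The "u" modifier makes PCRE treat the input as UTF-8, so the split
    // happens between characters rather than between bytes.
    $characters = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($characters === false) {
        // Invalid UTF-8: fall back to the byte-based cut.
        return substr($text, 0, $chars);
    }
    return implode('', array_slice($characters, 0, $chars));
}

$text = 'Crème brûlée';
$byteCut = substr($text, 0, 3);  // "Cr" plus the first byte of "è"
$charCut = utf8_cut($text, 3);   // "Crè"
```

Swapping the `substr()` calls alone would not be enough: the `strlen()` calls that measure the text would need the same character-based treatment.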

  • lecone

    I had been looking for this kind of code for so long… Thanks for your code!

  • http://www.gsdesign.ro/ Gabi Solomon

    @lecone You're welcome.

  • phong-tt

    Thanks for your code, man. I had been searching for a while until now :)

  • http://www.gsdesign.ro/ Gabi Solomon

    @phong-tt
    You're welcome.

  • Pingback: How do I truncate an HTML string without breaking the HTML code? « Dodona gives you answers

  • Snypy

    Thumbs up! ;-)

    Thanks, this really helped me. Since I was using a cuttext method, I was wondering how to cut HTML code without losing tags. Now you've solved my problem :-)

  • http://www.kamiyeye.com kami

    What if you set $exact = false with text like "…the last space is before tag nospace</html>"? The closing tag will be lost, won't it?

    • http://intensedebate.com/people/solomongaby solomongaby

      @kami
      If the $exact parameter is true, the text will have an exact length and words may be cut. If it is false, the text is cut back at the last space, so no word is split and the result may be a few characters shorter.

  • Roly

    Thanks I was looking for something that does exactly this.

  • http://vanco.ordanoski.name/ Vanco

    Thanks!
    Your work saved me hours of work.

  • http://www.google.com/ Google

    This version is too old: on line 96 it cuts the HTML text '<a href' at the space and ruins the HTML.

    Google was here.

  • http://www.reallyeasycart.co.uk/ Andrew Stilliard

    This is brilliant cheers! Saved me hours!

  • yogix

    Nice one! Thank you. I've been searching for this for a few months already!

  • http://www.securityhacking.tk/ Vlad

    Hi,
    I've modified your version a little bit, because it was breaking tags if $exact=false and the last space falls in the middle of a tag. It is really simple, just adding another variable, $doingtag.
    You can check it out here http://www.securityhacking.tk/2010/02/cut-html-
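
The modified version linked above is not reproduced here. As a rough sketch of the same idea, assuming `<` and `>` only ever appear as tag delimiters, the last space can be searched for while tracking whether the scan is currently inside a tag. The function name `last_space_outside_tags` and the sample string are hypothetical.

```php
<?php
/**
 * Hypothetical helper: position of the last space that is NOT inside an
 * HTML tag, or false if there is none. Assumes "<" and ">" only appear
 * as tag delimiters (no raw angle brackets in text or attribute values).
 */
function last_space_outside_tags($html) {
    $inTag = false;
    $last = false;
    $len = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $char = $html[$i];
        if ($char === '<') {
            $inTag = true;
        } elseif ($char === '>') {
            $inTag = false;
        } elseif ($char === ' ' && !$inTag) {
            $last = $i;
        }
    }
    return $last;
}

$truncated = 'Read the <a href="page.html">manu';
$naive = strrpos($truncated, ' ');             // 11, the space inside "<a href"
$safe  = last_space_outside_tags($truncated);  // 8, the space before "<a"
$result = substr($truncated, 0, $safe);        // "Read the"
```

In the full method the list of open tags would also have to be kept in step with the text that survives the cut, otherwise a tag removed by the cut would still get a closing tag appended.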

    • tt

      tt

  • KDV

    This piece of code is awesome… it really saved me a day's work.

  • Pingback: Wordpress : the_content_limit() v.2 | Staicu Ionuţ-Bogdan - the Frontend Developer

  • Gonzalo

    Awesome! It works perfectly. Thanks so much.

  • http://www.gammelsaeter.com Paul G

    I googled and found exactly what I was looking for in this blog post. Thank you, saved me a lot of time! :)

  • http://www.innovavista.net Alejandro

    Worked for me, thank you very much!

  • http://twitter.com/brixterdeleon brixter

    Hey @Author, you might want to consider modifying it, because if there are HTML tags within the truncated part it'll break.

  • Jama211

    Thanks, this helped me a lot.

  • Alex Prokop

    If you are truncating a single paragraph, e.g. <p>…</p>, that contains no other opening or closing tags, and $length is less than the character count of the input text, the closing p tag will not be appended. This happens because you are unsetting the opening tag from $open_tags if you match a corresponding closing tag, but without checking if that closing tag will be truncated later.

  • Alex Prokop

    Apologies, it actually happens when $exact is set to false, as other people have commented it will cut off the closing tag if there is no following whitespace.

    A hacky fix is to add a space after each closing tag:

    if (!empty($line_matchings[1])) {
        // if it's an "empty element" with or without xhtml-conform closing slash (f.e. <br/>)
        if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
            // add html-tag to $truncate'd text
            $truncate .= $line_matchings[1];
        // if tag is a closing tag (f.e. </b>)
        } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
            // delete tag from $open_tags list
            $pos = array_search($tag_matchings[1], $open_tags);
            if ($pos !== false) {
                unset($open_tags[$pos]);
            }
            // unpleasant fix to prevent exact = false chopping off tag
            $truncate .= $line_matchings[1] . ' ';
        // if tag is an opening tag (f.e. <b>)
        } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
            // add tag to the beginning of $open_tags list
            array_unshift($open_tags, strtolower($tag_matchings[1]));
            $truncate .= $line_matchings[1];
        }
    }

  • Brennino

    Amazing! I love you (and the CakePHP team, by implication…) :-)

  • Hiper

    Great code, I especially like the tag closing ;)

  • Pingback: Tech Notes: A WordPress excerpt with HTML tags - Oikos

  • Pingback: Word Sensitive and also html tags aware version of PHP substr | Aminul Islam

  • GreenFin

    Is there a way to add a starting position for the string in this code, similar to the substr() function? i.e. truncate($text, $start, $length = 100, $ending = '...', $exact = true, $considerHtml = false)

    • micze

      Have you found anything? I've been trying to do that for about 8 hours with no luck.

  • LongHorn

    Is there any way to also get the other part? What I need is to insert {readmore} text between the two parts. I can add it to the end using $ending = '{readmore}', but now I need to append the remaining text after the {readmore} text… How can I do this?

  • http://www.binarytides.com/ Silver Moon

    I guess there must be a much shorter way of doing this using DOMDocument.
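
A rough sketch of what a DOMDocument-based variant might look like. It is not the CakePHP method and makes several assumptions: the DOM extension is loaded, the input is a UTF-8 fragment, `$length` is greater than zero and counts text characters only, with the ending added on top. The function names `truncate_dom`, `truncate_dom_walk` and `utf8_chars` are hypothetical.

```php
<?php
/** Hypothetical helper: splits a UTF-8 string into an array of characters. */
function utf8_chars($text) {
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Walks the tree, cuts the text node that crosses the limit, drops the rest. */
function truncate_dom_walk(DOMNode $node, &$remaining, $ending) {
    // Copy the child list first: removing nodes while iterating the live
    // DOMNodeList would skip entries.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            // Limit already reached: drop everything that follows.
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $chars = utf8_chars($child->nodeValue);
            if (count($chars) >= $remaining) {
                $kept = implode('', array_slice($chars, 0, $remaining));
                $child->nodeValue = $kept . $ending;
                $remaining = 0;
            } else {
                $remaining -= count($chars);
            }
        } elseif ($child instanceof DOMElement) {
            truncate_dom_walk($child, $remaining, $ending);
        }
    }
}

/** Hypothetical DOM-based truncation (a sketch under the assumptions above). */
function truncate_dom($html, $length, $ending = '...') {
    $doc = new DOMDocument();
    // Wrap the fragment in a known root element. The leading "<?xml" prefix
    // is a common trick to make loadHTML() read the input as UTF-8, and "@"
    // hides the warnings libxml raises on sloppy markup.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);

    // Nothing to do if the visible text already fits.
    if (count(utf8_chars($root->textContent)) <= $length) {
        return $html;
    }

    $remaining = $length;
    truncate_dom_walk($root, $remaining, $ending);

    // Serialise the children of the wrapper; the serialiser closes open tags.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$short = truncate_dom('<p>Hello <b>big wide</b> world</p>', 9);
```

It is not much shorter than the regex version, but the tag bookkeeping disappears: the parser tracks nesting and the serialiser closes whatever is still open. The trade-off is that the output is re-serialised markup, so it may differ in detail from the input.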