
PHP Regex to Find Data Not in HTML Tags, With the HTML Identified by &lt; and &gt;

I have the following data <div dir='ltr' style='text-align: left;' trbidi='on'><div class='MsoNormal'><i><span s

Solution 1:

That looks like an HTML fragment inside XML, more specifically inside the description of an RSS feed. If that is the case, you should parse the RSS using DOM; this will decode the entities along the way:

$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);

Iterate the items:

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {

The title of an item is only a text value, so it can be used directly:

echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";

The description in your example contains the HTML fragment in a text node with escaped entities; I have seen other examples that use a CDATA section instead. It doesn't really matter for the outer XML document. It is text, and if you read it as text, the entities are transformed back into their respective characters.

$description = $xpath->evaluate('string(description)', $rssItem);

So now $description contains < and > again. It can be loaded into a DOM with loadHtml() or just cleaned up with strip_tags().

echo 'Description: ', strip_tags($description), "\n\n";
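To illustrate the point about escaped entities versus CDATA, here is a minimal sketch. The two-item feed is made up for this illustration and is not taken from the question. Reading both descriptions as strings should yield the same HTML fragment:

```php
<?php
// Hypothetical feed: the same fragment once entity-escaped, once in CDATA.
$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
 <item>
  <title>Escaped</title>
  <description>Some &lt;b&gt;bold&lt;/b&gt; text</description>
 </item>
 <item>
  <title>CDATA</title>
  <description><![CDATA[Some <b>bold</b> text]]></description>
 </item>
</channel>
</rss>
RSS;

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);

// Collect the string value of each description node.
$descriptions = [];
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
    $descriptions[] = $xpath->evaluate('string(description)', $rssItem);
}

// Both entries hold the decoded fragment: Some <b>bold</b> text
var_dump($descriptions[0] === $descriptions[1]);
```

Either way, the XML parser hands back plain text, so the later strip_tags() or loadHTML() step does not need to know which form the feed used.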

A full example (RSS adapted from Wikipedia):

$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel> 
 <item>
  <title>Example entry</title>
  <description>Here is some &lt;b&gt;text&lt;/b&gt; containing an interesting &lt;i&gt;description&lt;/i&gt; with &lt;span class="important"&gt;html&lt;/span&gt;.</description>
 </item>
</channel>
</rss>
RSS;

$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
  echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
  $description = $xpath->evaluate('string(description)', $rssItem);
  echo 'Description: ', strip_tags($description), "\n\n";
}

Output:

Title: Example entry
Description: Here is some text containing an interesting description with html.
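If you need more than the plain text, the decoded description can be loaded into its own DOM instead of being passed to strip_tags(). The following is only a sketch: the wrapping div and the variable names are my own additions, and the sample string is ASCII only (for non-ASCII content you would likely have to take care of the encoding that loadHTML() assumes).

```php
<?php
// The decoded description, as returned by the XPath string() call.
$description = 'Here is some <b>text</b> containing an interesting '
    . '<i>description</i> with <span class="important">html</span>.';

$fragment = new DOMDocument();
// Wrap the fragment in a div so there is a single container to address.
$fragment->loadHTML('<div>' . $description . '</div>');
$fragmentXpath = new DOMXPath($fragment);

// Text content of the whole fragment, tags removed.
$text = $fragmentXpath->evaluate('string(//div)');
// Text of one specific element inside the fragment.
$important = $fragmentXpath->evaluate('string(//span[@class="important"])');

echo $text, "\n";
echo $important, "\n";
```

This costs a second parse, but it lets you select individual elements of the fragment, which strip_tags() cannot do.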

Solution 2:

For decoding, you can use htmlspecialchars_decode().

For more detail, see http://php.net/manual/en/function.htmlspecialchars-decode.php
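A short sketch of that approach, using a made-up sample string rather than the data from the question. Combining it with strip_tags() is my assumption about the intended next step:

```php
<?php
// Hypothetical input with the tags written as &lt; and &gt;.
$source = 'Here is some &lt;b&gt;text&lt;/b&gt; containing &lt;i&gt;markup&lt;/i&gt;.';

// Turn &lt; and &gt; (and the other special-character entities) back into characters.
$html = htmlspecialchars_decode($source);

// With real tags restored, strip_tags() can remove them.
$text = strip_tags($html);

echo $html, "\n"; // Here is some <b>text</b> containing <i>markup</i>.
echo $text, "\n"; // Here is some text containing markup.
```

Note that htmlspecialchars_decode() only handles the small set of special-character entities; for other named entities, html_entity_decode() would be the function to look at.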

Solution 3:

To quickly obtain the raw text (without tags), you can use this replacement on the still-escaped source:

$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);
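Applied to a made-up sample string (not the data from the question), the replacement behaves roughly as follows. The second step, collapsing whitespace, is an optional addition of mine:

```php
<?php
// Hypothetical input with the tags written as &lt; and &gt;.
$source = 'Here is some &lt;b&gt;text&lt;/b&gt; containing &lt;i&gt;markup&lt;/i&gt;.';

// Replace every escaped tag with a single space.
$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);

// Optional: collapse the runs of spaces left behind and trim the ends.
$collapsed = trim(preg_replace('~\s+~', ' ', $result));

echo $collapsed, "\n"; // Here is some text containing markup .
```

Because each tag becomes a space, a stray space can end up before punctuation, as in the output above. Replacing with an empty string avoids that but may glue words together when a tag was the only separator.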

Solution 4:

This gives you all the texts you're seeking as an array:

preg_match_all("/(?<=&gt;)(?!&lt;).*?(?=&lt;)/", $source, $result);
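For illustration, here is the same pattern run against a made-up sample string (not the data from the question):

```php
<?php
// Hypothetical input with the tags written as &lt; and &gt;.
$source = '&lt;div&gt;&lt;i&gt;first&lt;/i&gt; and &lt;b&gt;second&lt;/b&gt;&lt;/div&gt;';

// Match text that follows an escaped tag end and precedes the next escaped tag start.
preg_match_all("/(?<=&gt;)(?!&lt;).*?(?=&lt;)/", $source, $result);

// $result[0] holds the full matches: 'first', ' and ', 'second'
print_r($result[0]);
```

Two things to keep in mind: text before the first tag or after the last one is not matched, because the pattern requires &gt; on the left and &lt; on the right, and without the s modifier the dot does not cross line breaks.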

