
Rvest - Using A Dataframe Of Html Rather Than A Webpage - And Extracting Formatting Tags

I am trying to extract formatting tags from a column of HTML, and then go on to record whether each row is bold or italic, what colour it is, etc. I was trying to figure out whether to use

Solution 1:

One possible solution uses the XML package rather than rvest:

htmlstring <- '<div align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; font-size: 10pt; font-family: \'Times New Roman\', Times; color: #000000; background: #FFFFFF"> These forward-looking statements are also affected by the risk factors described below in Part I, Item 1A ("Risk Factors") and those set forth from time to time in our filings with the Securities and Exchange Commission ("SEC"), which are available through our website at <i>www.exterran.com </i>and through the SEC\'s Electronic Data Gathering and Retrieval System ("EDGAR") at <i><u>www.sec.gov</u></i>. Important factors that could cause our actual results to differ materially from the expectations reflected in these forward-looking statements include, among other things: </div>'

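# Parse the string as HTML; asText = TRUE makes sure it is not mistaken for a file name or URL
doc <- XML::htmlParse(htmlstring, asText = TRUE)

You can then use XPath to pull out whatever you need, e.g. the italicized parts:

# Returns every <i> node in the parsed document
XML::getNodeSet(doc, '//i')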
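Since the question is about a whole column of HTML rather than a single string, the same idea can probably be applied row by row. The following is only a minimal sketch under some assumptions: the data frame `df`, its `html` column, and the helpers `has_node` and `get_colour` are hypothetical names made up for illustration, and the colour is assumed to live in an inline `style` attribute, as it does in the example string above.

```r
library(XML)

# Hypothetical data frame with one HTML fragment per row (illustrative data only)
df <- data.frame(
  html = c(
    '<div style="color: #000000">Plain text with <i>italic</i> words</div>',
    '<div style="color: #FF0000"><b>Bold</b> and <i><u>underlined italic</u></i></div>',
    '<div>No formatting at all</div>'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the parsed fragment has at least one node matching the XPath
has_node <- function(doc, xpath) length(getNodeSet(doc, xpath)) > 0

# Hypothetical helper: read the "color" declaration from the first inline style attribute
get_colour <- function(doc) {
  styles <- xpathSApply(doc, "//*[@style]", xmlGetAttr, "style")
  if (length(styles) == 0) return(NA_character_)
  # Split "a: b; c: d" into single declarations and keep the one whose property is exactly "color"
  decls <- trimws(strsplit(styles[[1]], ";", fixed = TRUE)[[1]])
  hit <- decls[grepl("^color\\s*:", decls, perl = TRUE)]
  if (length(hit) == 0) return(NA_character_)
  trimws(sub("^color\\s*:", "", hit[1], perl = TRUE))
}

# Parse every row once, then record one formatting flag per column
docs <- lapply(df$html, htmlParse, asText = TRUE)
df$italic    <- vapply(docs, has_node, logical(1), xpath = "//i | //em")
df$bold      <- vapply(docs, has_node, logical(1), xpath = "//b | //strong")
df$underline <- vapply(docs, has_node, logical(1), xpath = "//u")
df$colour    <- vapply(docs, get_colour, character(1))

print(df[, c("italic", "bold", "underline", "colour")])
```

Note that these flags only say whether a tag occurs somewhere in the row, not whether the entire row is formatted that way. Formatting applied through CSS (for example `font-weight: bold` in a `style` attribute) would need additional checks along the lines of `get_colour`.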
