Play with YQL: HTML Scraping using YQL and PHP

Here is a quick post, the product of playing with YQL (Yahoo Query Language). YQL is very powerful for scraping HTML from any page on the web. YQL and Yahoo Pipes are both very powerful tools; Yahoo Pipes is especially good for merging feeds.

This YQL experiment scrapes data from the TOI (Times of India) list of Bihar election winners. It pulls the data from the page correctly. The only problem is that the scraping can break at any time if the developer of the page changes the HTML tags or the structure of the page.
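One way to soften that problem is to check the shape of the response before looping over it. Here is a minimal sketch, assuming the JSON layout that the script in this post relies on (`query->results->div`, each div with an optional `p`). The helper name `extractCells` and the sample responses are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original script): returns the text of
// every <p> found in the YQL response, or an empty array when the response
// does not have the expected query->results->div shape.
function extractCells($json)
{
	$arr = json_decode($json);
	if (!isset($arr->query->results->div)) {
		return array(); // invalid JSON, no match, or the page structure changed
	}
	$divs = $arr->query->results->div;
	if (!is_array($divs)) {
		$divs = array($divs); // assumption: a single match may arrive as one object
	}
	$cells = array();
	foreach ($divs as $data) {
		if (empty($data->p)) continue; // same skip rule as the main script
		$cells[] = $data->p;
	}
	return $cells;
}

// Made-up sample responses, only for illustration.
$good   = '{"query":{"results":{"div":[{"p":"Name A"},{"class":"x"},{"p":"Name B"}]}}}';
$broken = '{"query":{"results":null}}';

print_r(extractCells($good));
print_r(extractCells($broken));
```

With a check like this, a changed page gives an empty table instead of PHP warnings in the middle of the output.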


<?php
// A plain cURL helper: fetches $url and returns the body plus any error info.
function getSearchResult ( $url )
{
	$options = array(
		CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
		CURLOPT_HEADER         => false, // do not include response headers in the body
		CURLOPT_FOLLOWLOCATION => true,
		CURLOPT_ENCODING       => '',    // accept any encoding cURL supports
		CURLOPT_AUTOREFERER    => true,
		CURLOPT_MAXREDIRS      => 2,
		CURLOPT_SSL_VERIFYPEER => false  // skips certificate checks on https requests; leave it on if you care about them
	);
	$ch  = curl_init( $url );
	curl_setopt_array( $ch, $options );
	$content	= curl_exec( $ch );
	$err		= curl_errno( $ch );
	$errmsg		= curl_error( $ch );
	curl_close( $ch );
	$output = array(
		'errno'   => $err,
		'errmsg'  => $errmsg,
		'content' => $content
	);
	return $output;
}
// YQL code starts from here
$output = getSearchResult("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Ftimesofindia.indiatimes.com%2Fbiharpollresult.cms%22%20and%20xpath%3D'%2Fhtml%2Fbody%2Fcenter%2Fdiv%5B%40id%3D%22netspidersosh%22%5D%2Fdiv%2Fdiv%5B%40class%3D%22navlft%22%5D%2Fdiv%2Fdiv%5B%40class%3D%22maintable12%22%5D%2F%2Fdiv'&format=json&callback=");
$arr = json_decode($output['content']);
echo '<table border="1">
	<tr>' . "\n";
$i = 0;
foreach ($arr->query->results->div as $data) {
	// The <p> element holds our data (visible with print_r() on $data). Skip divs without it.
	if (empty ($data->p)) continue;
	echo "\n\t" . '<td>' . $data->p . '</td>';
	$i++;
	// only 4 cols. So after that new row
	if ($i >= 4) {
		$i = 0;
		echo "\n</tr>\n<tr>\n";
	}
}
// This part is not a YQL trick. $i is the number of cells already written
// to the open row (0 to 3), so pad the row with empty cells up to 4 columns
// and close it.
echo str_repeat('<td></td> ', 4 - $i) . '</tr>';
echo '</table>';
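
The long encoded URL in the script is hard to read and hard to edit. As a rough sketch, the same request can be built from its parts with `http_build_query()`. The query text below is simply the decoded form of the URL used in this post; the variable names are my own.

```php
<?php
// The YQL statement in readable form (decoded from the URL used in this post).
$query = 'select * from html'
	. ' where url="http://timesofindia.indiatimes.com/biharpollresult.cms"'
	. ' and xpath=\'/html/body/center/div[@id="netspidersosh"]/div'
	. '/div[@class="navlft"]/div/div[@class="maintable12"]//div\'';

$params = array(
	'q'        => $query,
	'format'   => 'json',
	'callback' => '',
);

// PHP_QUERY_RFC3986 encodes spaces as %20 instead of +.
$url = 'http://query.yahooapis.com/v1/public/yql?'
	. http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $url, "\n";
```

The resulting string is not byte-for-byte the same as the hand-written URL (for example `*` and `'` get percent-encoded), but it decodes to the same query, and changing the page or the XPath becomes a one-line edit.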


Go and play with YQL. Firebug can help you find the XPath of the element you want.
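
Once you have an XPath, you can also try it locally before handing it to YQL. Below is a minimal sketch using PHP's DOM extension. The helper name `xpathTexts` and the HTML fragment are invented for illustration; only the XPath is the one from this post, with `/p` appended to reach the paragraphs directly.

```php
<?php
// Hypothetical helper: runs an XPath expression against an HTML string and
// returns the trimmed text of every matching node.
function xpathTexts($html, $expression)
{
	// Real-world pages are rarely valid HTML, so keep parser warnings quiet.
	$previous = libxml_use_internal_errors(true);
	$doc = new DOMDocument();
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$xpath = new DOMXPath($doc);
	$nodes = $xpath->query($expression);
	$texts = array();
	if ($nodes === false) {
		return $texts; // the expression itself was invalid
	}
	foreach ($nodes as $node) {
		$texts[] = trim($node->textContent);
	}
	return $texts;
}

// Made-up fragment that mimics the nesting the XPath expects.
$sampleHtml = <<<HTML
<html><body><center>
<div id="netspidersosh"><div><div class="navlft"><div>
<div class="maintable12">
<div><p>Candidate A</p></div>
<div><p>Candidate B</p></div>
<div></div>
</div>
</div></div></div></div>
</center></body></html>
HTML;

$expression = '/html/body/center/div[@id="netspidersosh"]/div'
	. '/div[@class="navlft"]/div/div[@class="maintable12"]//div/p';

print_r(xpathTexts($sampleHtml, $expression));
```

To test against the real page, save its HTML to a file and pass the file contents as `$html`. If the result comes back empty, the XPath no longer matches the page.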

  • # 1 - by Buzzknow

    Hi currently i’m play with yql .. btw does yql support regex?

    i mean i want to scrape some dynamic xpath but with same prefix

    ex, <div id="post_xx"

    regards
