
How do I grab information off a webpage?


Post by MadMed » Thu Aug 08, 2002 4:02 pm

I am trying to make a function that would connect to a site and read the source code and return the description and the title of the page.

Example:

Let's say this is our function: get_description($url)
Then calling get_description("http://www.excite.com") should return the following text:
"Excite is the leading personalization Web portal, featuring world-class search, content and functionality. From financial portfolios to sports scores, local weather forecasts to movie listings, Excite gathers what matters most to you every day. It's like your very own online personal assistant."

Look at the description META tag at http://www.excite.com if this example does not make sense yet.

Your help is greatly appreciated, and working code for the function even more so.

Best Regards,
Med
madmed@dotnetos.com

Post by Jay » Thu Aug 08, 2002 4:18 pm

Read the manual section on Filesystem functions. You're looking for a function called fread(), which reads from a handle you have opened with fopen().

You'll also want to look up Regular Expressions in the manual, so you can work out how to filter out everything except what's between the tags you specifically want.
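
A minimal sketch of the regular-expression step, assuming the page source is already in a string. The helper names extract_description() and extract_title() are hypothetical, and get_description() is only the name proposed in the question. The sketch swaps the fopen()/fread() loop for file_get_contents(), which reads URLs only when allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper (not from the thread): returns the description META
// content found in $html, or null when there is none.
function extract_description($html)
{
    // Collect every <meta ...> tag, in any letter case.
    // Assumption: attribute values contain no ">" character.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Skip tags that are not name=description (quoted or unquoted).
        if (!preg_match('/\bname\s*=\s*["\']?description\b/i', $tag)) {
            continue;
        }
        // Pull out the quoted content="..." or content='...' value.
        if (preg_match('/\bcontent\s*=\s*(["\'])(.*?)\1/is', $tag, $m)) {
            return html_entity_decode($m[2]);
        }
    }
    return null;
}

// Hypothetical helper: returns the trimmed text of <title>, or null.
function extract_title($html)
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// The function named in the question; returns null when the fetch fails.
function get_description($url)
{
    $html = @file_get_contents($url);
    return $html === false ? null : extract_description($html);
}
```

Calling get_description("http://www.excite.com") would then return the description text, or null if the page could not be fetched or has no such tag. PHP's built-in get_meta_tags() may also cover the description part without any hand-written patterns.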

That should get you started.
Jay
 

Post by MadMed » Thu Aug 08, 2002 5:03 pm

I've got this so far:

Code:
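<?php
// Fetch the page over a raw socket, then pull the description META tag out of <head>.
$host = "www.excite.com" ;   // used for both the connection and the Host: header
$fp = fsockopen ($host, 80, $errno, $errstr, 30);
$sitecode = "" ;
$description = "" ;          // stays empty when no description tag is found
if (!$fp) {
    echo "$errstr ($errno)<br>\n";
} else {
    $document = "" ;
    fputs ($fp, "GET /" . $document . " HTTP/1.0\r\nHost: $host\r\n\r\n");
    // Read the whole response (headers and body) into $sitecode.
    while (!feof($fp)) {
        $sitecode .= fgets ($fp, 128);
    }
    fclose ($fp);

    // Keep only the text between <head> and </head>.
    // explode() is case-sensitive, so <HEAD> or <meta> would be missed here.
    $codepart = explode ( "</head>" , $sitecode ) ;
    $codepart = explode ( "<head>" , $codepart[0] ) ;
    $head     = isset ( $codepart[1] ) ? $codepart[1] : "" ;

    // Every element after the first starts just past a "<META".
    $meta = explode ( "<META" , $head ) ;

    for ( $i=1 ; $i<count($meta) ; $i++ ) {
        if ( stristr ( $meta[$i] , "name=description" ) or stristr ( $meta[$i] , "name=\"description\"" ) ) {
            // Cut the tag at its closing ">" and split it on "=".
            $metadesc = explode ( ">" , $meta[$i] ) ;
            $content  = explode ( "=" , $metadesc[0] ) ;
            for ( $j=0 ; $j<count($content) ; $j++ ) {
                // The piece after the one ending in "content" holds the quoted value.
                if ( stristr ( $content[$j] , "content" ) and isset ( $content[$j+1] ) ) {
                    $target = explode ( "\"" , $content[$j+1] ) ;
                    if ( isset ( $target[1] ) ) {
                        $description = $target[1] ;
                    }
                    break ;
                }
            }
        }
    }

    echo "<br>Description starts here <br>";
    echo $description;
    echo "<br>Description ends here <br>";
}
?>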


It sort of works, but it is fragile. I am not too concerned with the data or how to filter it; my main problem is handling sites that redirect you to others. I will mainly be using this code with "jump" pages.

Thanks for your help and inputs!

Med
madmed@dotnetos.com

Post by Jay » Thu Aug 08, 2002 6:36 pm

MadMed wrote: my main problem is handling sites that redirect you to others. I will be using this code with "jump" pages mainly.


That really depends on how the sites are redirecting you. A web page is just text, formatted to HTML standards and available at a specific URL, which is how you access and read it. Your browser reads the text, interprets the HTML markup (tags etc.), and displays it accordingly.

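If the sites are controlling the 'headers' to redirect you, there was never a page at that URL to begin with, so there is nothing there to parse; your script would have to request the address given in the Location header instead. But if they're using JavaScript or meta-refresh tags, then this shouldn't be a problem, because those are only acted on by your browser and your script still receives the page text.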
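
As a rough sketch of how a script might detect both kinds of redirect: a header redirect arrives as a 3xx status line with a Location header, and a meta-refresh is an ordinary tag in the page body. The function name find_redirect_target() is hypothetical, and the sketch assumes $response holds the complete raw response (headers plus body) as read from fsockopen() in the code above. JavaScript redirects are not handled.

```php
<?php
// Hypothetical helper (not from the thread): returns the URL a raw HTTP
// response redirects to, or null when no redirect is detected.
function find_redirect_target($response)
{
    // Headers end at the first blank line; the rest is the body.
    $parts   = preg_split("/\r?\n\r?\n/", $response, 2);
    $headers = $parts[0];
    $body    = isset($parts[1]) ? $parts[1] : '';

    // Header redirect: a 3xx status line plus a Location header.
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+3\d\d\b#', $headers)
        && preg_match('/^Location:\s*(\S+)/mi', $headers, $m)) {
        return $m[1];
    }

    // Meta-refresh: <meta http-equiv="refresh" content="0; url=...">
    if (preg_match('/<meta\b[^>]*http-equiv\s*=\s*["\']?refresh\b[^>]*>/i', $body, $tag)
        && preg_match('/url\s*=\s*["\']?([^"\'\s>]+)/i', $tag[0], $m)) {
        return $m[1];
    }

    return null;
}
```

A caller could request the returned URL and repeat, stopping after a small fixed number of hops so that two pages redirecting to each other cannot loop forever. Relative Location values would need resolving against the current URL, which this sketch leaves out.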
Jay
 

