July 21st, 2010, 02:37 AM   #1
Nagadurga
New Member
Join Date: Jul 2010
Location: India
Posts: 5
Extract needed details from website and store in database using PHP

Hi,
I am a beginner with PHP. I already have a script that extracts data from a web page, but it only works when the data is in tabular format. I want to extract more kinds of details: for example, if I search for hotels, it should display the name, address, phone number and email of each hotel, and store them in a database. Please suggest how I can solve this problem.

Thanks in advance.
July 27th, 2010, 11:14 AM   #2
Nagadurga
Re: Extract needed details from website and store in database using PHP

Hi friends,
I tried the following code to extract the needed information from the web pages.

Code:
<?php
class ContentExtractor {

    // Tags treated as content containers and tested against the heuristics.
    var $container_tags = array(
        'p', 'div'
    );
    // Tags removed outright. These entries are compared against
    // DOMNode::nodeName, which holds only the tag name, so entries that
    // include attributes (e.g. 'div id="hd"') never match anything.
    var $removed_tags = array(
        'div class="resultancy claearfix"',
        'div id="hd"', 'meta', 'link', 'title', 'script', 'a href', 'img', 'ul', 'li', 'form', 'input', 'label', 'strong', 'href',
        'noscript', 'iframe', 'h2', 'head', 'ul', 'span class="iconkey"'
    );
    var $ignore_len_tags = array(
        'span'
    );

    var $link_text_ratio = 0.04;
    var $min_text_len = 20;
    var $min_words = 0;

    var $total_links = 0;
    var $total_unlinked_words = 0;
    var $total_unlinked_text = '';
    var $text_blocks = 0;

    var $tree = null;
    var $unremoved = array();

    // Replace HTML and UTF-8 non-breaking/wide spaces with plain spaces.
    function sanitize_text($text){
        $text = str_ireplace('&nbsp;', ' ', $text);
        $text = html_entity_decode($text, ENT_QUOTES);

        $utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83",
            "\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");
        $text = str_replace($utf_spaces, ' ', $text);

        return trim($text);
    }

    function extract($text, $ratio = null, $min_len = null){
        $this->tree = new DOMDocument();

        if (!@$this->tree->loadHTML($text)) return false;

        $root = $this->tree->documentElement;
        $this->HeuristicRemove($root, ( ($ratio == null) || ($min_len == null) ));

        if ($ratio == null) {
            $this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);

            $words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
            $words = array_filter($words);
            #$words = strip_tags($words);
            $this->total_unlinked_words = count($words);
            unset($words);
            if ($this->total_unlinked_words > 0) {
                $this->link_text_ratio = $this->total_links / $this->total_unlinked_words;// + 0.01;
                $this->link_text_ratio *= 1.3;
            }

        } else {
            $this->link_text_ratio = $ratio;
        }

        if ($min_len == null) {
            // Guard against division by zero when no text blocks were found.
            if ($this->text_blocks > 0) {
                $this->min_text_len = strlen($this->total_unlinked_text) / $this->text_blocks;
            }
        } else {
            $this->min_text_len = $min_len;
        }

        $this->ContainerRemove($root);

        return $this->tree->saveHTML();
    }

    // Removes unwanted tags and, if $do_stats is set, gathers link/text statistics.
    function HeuristicRemove($node, $do_stats = false){
        if (in_array($node->nodeName, $this->removed_tags)){
            return true;
        }

        if ($do_stats) {
            if ($node->nodeName == 'a') {
                $this->total_links++;
            }
            $found_text = false;
        }

        $nodes_to_remove = array();

        if ($node->hasChildNodes()){
            foreach ($node->childNodes as $child){
                if ($this->HeuristicRemove($child, $do_stats)) {
                    $nodes_to_remove[] = $child;
                } else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
                    $this->total_unlinked_text .= $child->wholeText;
                    if (!$found_text){
                        $this->text_blocks++;
                        $found_text = true;
                    }
                }
            }
            foreach ($nodes_to_remove as $child){
                $node->removeChild($child);
            }
        }

        return false;
    }

    // Removes containers with too many links or too little text.
    // Returns array(link count, word count, text length, 'delete' => bool).
    function ContainerRemove($node){
        if (is_null($node)) return 0;
        $link_cnt = 0;
        $word_cnt = 0;
        $text_len = 0;
        $delete = false;
        $my_text = '';

        $ratio = 1;

        $nodes_to_remove = array();
        if ($node->hasChildNodes()){
            foreach ($node->childNodes as $child){
                $data = $this->ContainerRemove($child);

                if ($data['delete']) {
                    $nodes_to_remove[] = $child;
                } else {
                    $text_len += $data[2];
                }

                $link_cnt += $data[0];

                if ($child->nodeName == 'a') {
                    $link_cnt++;
                } else {
                    if ($child->nodeName == '#text') $my_text .= $child->wholeText;
                    $word_cnt += $data[1];
                }
            }

            foreach ($nodes_to_remove as $child){
                $node->removeChild($child);
            }

            $my_text = $this->sanitize_text($my_text);

            $words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
            $words = array_filter($words);
            $word_cnt += count($words);
            $text_len += strlen($my_text);

        }

        if (in_array($node->nodeName, $this->container_tags)){
            if ($word_cnt > 0) $ratio = $link_cnt / $word_cnt;

            if ($ratio > $this->link_text_ratio){
                $delete = true;
            }

            if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
                if ( ($text_len < $this->min_text_len) || ($word_cnt < $this->min_words) ) {
                    $delete = true;
                }
            }

        }

        return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
    }

}

$html = file_get_contents('http://www.local.ch/en/q/bar.html');

$extractor = new ContentExtractor();
$content = $extractor->extract($html);
echo $content;
?>
But I got the following output:
Results for in the current map areain
The number of results indicates how many listings correspond to your search.
To view these listings, click the Search button. Do you have any questions or
suggestions? Or maybe even come across a problem? Please let us know:
info@local.ch
Results for
You can choose if you want to print the map on this page,
by using the options (such as "No Map") which appear directly
above the map to display or hide it.

The Yellow Pages > Bar, Restaurant
Bleu Lézard

rue Enning 10, 1003 Lausanne
resultentry_06.63771746.520077Bleu Lézard
Bleu Lézard

rue Enning 10, 1003 Lausanne

Tel.: * 021 321 38 30
tel/search

The Yellow Pages > Bar, Restaurant, Events
Nordportal

Schmiedestrasse 12, 5400 Baden
resultentry_18.30031447.481186Nordportal
Nordportal

Schmiedestrasse 12, 5400 Baden

Tel.: * 056 221 15 72
tel/search
ADN Bar Café

rue de Lausanne 59, 1202 Genève
resultentry_26.14646746.215079ADN Bar Café
ADN Bar Café

rue de Lausanne 59, 1202 Genève

Tel.: * 022 731 40 18
tel/search
Bar Abdelmajid

Könizstrasse 3, 3008 Bern
resultentry_37.42168146.944324Bar Abdelmajid
Bar Abdelmajid

Könizstrasse 3, 3008 Bern

Tel.: 031 381 42 60
tel/search

The Yellow Pages > Hotel, Bar, Restaurant
Hotel SEDARTIS

Bahnhofstrasse 16, 8800 Thalwil
resultentry_48.56592247.295528Hotel SEDARTIS
Hotel SEDARTIS

Bahnhofstrasse 16, 8800 Thalwil

Tel.: 043 388 33 00
tel/search

The Yellow Pages > Club, Discotheque, Bar
Liquid

Genfergasse 10, 3011 Bern
resultentry_57.44115946.949633Liquid
Liquid

Genfergasse 10, 3011 Bern

Tel.: * 031 951 98 26
tel/search
Bar Amalfi

Spezialitäten aus dem Süden

Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH
resultentry_68.78198147.368167Bar Amalfi
Bar Amalfi

Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH

Tel.: * 043 535 90 05
tel/search
Bar Benjamin (-Gera)

Im Allmendli 11, 8703 Erlenbach ZH
resultentry_78.59953547.301096Bar Benjamin (-Gera)
Bar Benjamin (-Gera)

Im Allmendli 11, 8703 Erlenbach ZH

Tel.: * 076 232 23 21
tel/search

The Yellow Pages > Restaurant, Bar
Bohemia

Klosbachstrasse 2, 8032 Zürich
resultentry_98.55496947.364845Bohemia
Bohemia

Klosbachstrasse 2, 8032 Zürich

Tel.: 044 383 70 60
tel/search

Help local.ch improve this page
© 2010 local.ch ag
© 2010 local.ch ag - Terms of use

But I only need the name of the bar, the address and the phone number, and they should be stored in a database.
Can anyone please help me do this?
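
One possible direction, offered as a minimal sketch rather than a tested solution for local.ch: instead of stripping unwanted tags from the whole page, query each listing directly with DOMXPath and read only the wanted fields from it. The sample markup below is an assumption (one wrapper element with class "vcard" per listing, containing "fn", "adr" and "phonenr" elements); the helper names has_class, first_text and extract_listings are hypothetical, and the real page structure would need to be checked in the page source.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<div class="vcard"><span class="fn">Bar Abdelmajid</span>
  <span class="adr">Könizstrasse 3, 3008 Bern</span>
  <span class="phonenr">031 381 42 60</span></div>
<div class="vcard"><span class="fn">Bohemia</span>
  <span class="adr">Klosbachstrasse 2, 8032 Zürich</span></div>
</body></html>
HTML;

// XPath test that matches one class name inside a multi-valued class attribute.
function has_class(string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Text of the first descendant of $context carrying $class, or '' if none.
function first_text(DOMXPath $xpath, DOMNode $context, string $class): string {
    $nodes = $xpath->query('.//*[' . has_class($class) . ']', $context);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim(preg_replace('/\s+/u', ' ', $nodes->item(0)->textContent));
}

// Returns one associative array (name, address, phone) per listing.
function extract_listings(string $html): array {
    libxml_use_internal_errors(true);            // silence warnings on sloppy HTML
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html); // hint that the input is UTF-8
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//*[' . has_class('vcard') . ']') as $card) {
        $rows[] = [
            'name'    => first_text($xpath, $card, 'fn'),
            'address' => first_text($xpath, $card, 'adr'),
            'phone'   => first_text($xpath, $card, 'phonenr'),
        ];
    }
    return $rows;
}

$listings = extract_listings($html);
print_r($listings);
```

Because every field is read relative to its own listing, a listing without a phone number simply gets an empty string and the fields of different listings cannot be mixed up.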
July 28th, 2010, 02:10 AM   #3
Nagadurga
Re: Extract needed details from website and store in database using PHP

Can anyone please help me?
I tried another method to extract the data from the web page, but it retrieves only a single record. I also tried the preg_match_all function, but I couldn't get the needed output. Can anyone please show me how to modify the following code?

Code:
<?php

set_time_limit(360);

// Returns only the FIRST substring found between $start and $end
// (case-insensitive), wrapped in a one-element array.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, 0, $second_pos);
    $unit[] = trim($str_three);
    return $unit;
}

$text = file_get_contents("http://local.ch/en/q/bar.html");
$text1 = extract_unit($text, '<div class="hidden">', '</div>');
$unit[] = extract_unit($text, '<span class="head">', '</span>');
$unit[] = extract_unit($text, '<span class="street-address">', '</span>');
$unit[] = extract_unit($text, '<span class="postal-code">', '</span>');
$unit[] = extract_unit($text, '<span class="locality">', '</span>');
$unit[] = extract_unit($text, '<span class="label">', '</span>');
$unit[] = extract_unit($text, '<span class="tel">', '</span>');
print_r($unit);

?>
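
The code above returns a single record because stripos() is called once and so finds only the first occurrence of the start marker. A minimal sketch of one way to collect every occurrence is to keep searching from the end of the previous match. The function name extract_all_units and the sample string are hypothetical, not part of the original script.

```php
<?php
// Collects every substring found between $start and $end (case-insensitive),
// continuing the search after each match instead of stopping at the first.
function extract_all_units(string $string, string $start, string $end): array {
    $units = [];
    $offset = 0;
    while (($pos = stripos($string, $start, $offset)) !== false) {
        $from = $pos + strlen($start);
        $to = stripos($string, $end, $from);
        if ($to === false) {
            break;                       // start marker without a closing marker
        }
        $units[] = trim(substr($string, $from, $to - $from));
        $offset = $to + strlen($end);    // resume after the closing marker
    }
    return $units;
}

// Hypothetical sample standing in for the downloaded page.
$sample = '<span class="tel">022 731 40 18</span> <span class="tel"> 031 381 42 60 </span>';
$phones = extract_all_units($sample, '<span class="tel">', '</span>');
print_r($phones);
```

A preg_match_all() call with a non-greedy pattern between the same two markers would be another way to get all matches; string matching of either kind breaks as soon as the markup changes slightly, so a DOM parser is usually the more robust choice.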
July 30th, 2010, 02:40 AM   #4
Nagadurga
Re: Extract needed details from website and store in database using PHP

I still haven't received any suggestions here.
Please, can anyone leave a suggestion?
I tried the following method:
Code:
<html>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<table>
<tr>
<td>
<center><h3> Bar Name </h3></center>
<?php
$html = file_get_contents('http://tel.local.ch/en/q/bar.html?typeref=res');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$key1 = $xpath->query('//*[@class="fn"]');

foreach ($key1 as $keys) {
    echo $keys->nodeValue, "<br/> \n";
}

?>
</td>
<td>
<center><h3> Address </h3></center>
<?php
$html = file_get_contents('http://tel.local.ch/en/q/bar.html?typeref=res');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$key2 = $xpath->query('//*[@class="adr"]');
foreach ($key2 as $keys) {
    echo $keys->nodeValue, "<br/> \n";
}

?>
</td>
<td>
<center><h3> Conduct </h3></center>
<?php
$html = file_get_contents('http://tel.local.ch/en/q/bar.html?typeref=res');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$key3 = $xpath->query('//*[@class="phonenr"]');
foreach ($key3 as $keys) {
    echo $keys->nodeValue, "<br/> \n";
}

?>
</td>
</tr>
</table>
</html>

and I got the following output:
Bar Name              Address                                 Conduct

ADN Bar Café          rue de Lausanne 59, 1202 Genève         022 731 40 18
Bar Abdelmajid        Könizstrasse 3, 3008 Bern               031 381 42 60
Bar Amalfi            Turmstrasse 7, 8330 Pfäffikon ZH        043 535 90 05
Bar Benjamin (-Gera)  Im Allmendli 11, 8703 Erlenbach ZH      076 232 23 21
Bar Bistro Amigos     Bahnhofstrasse 2, 3360 Herzogenbuchsee  062 961 01 10
Bar Chez Franki       rue Victor-Tissot 4, 1630 Bulle         076 332 10 34
Bar Croce d'Oro       via Motta 3, 6900 Lugano                091 921 47 93
Bar Daniela (-Gera)   Im Allmendli 11, 8703 Erlenbach ZH      079 780 65 54
Bar Golf              via della Posta 2, 6900 Lugano          091 921 39 03
Bar Gufo              via Girella, 6814 Lamone                091 967 17 36


How can I store it in a MySQL database?
Please, can anyone give a suggestion?
Thanks in advance.
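
A minimal sketch of one way to store such results, using PDO with a prepared statement. Everything not shown earlier in the thread is an assumption: the table name bars, its columns, and the use of an in-memory SQLite database so the sketch runs without a server. For MySQL the DSN, user name and password would be replaced with real values (for example a DSN of the form mysql:host=...;dbname=...;charset=utf8mb4), and the id column would typically be declared as INT AUTO_INCREMENT PRIMARY KEY.

```php
<?php
// Parallel lists, as the three XPath queries would produce them
// (values copied from the output above).
$names     = ['ADN Bar Café', 'Bar Abdelmajid'];
$addresses = ['rue de Lausanne 59, 1202 Genève', 'Könizstrasse 3, 3008 Bern'];
$phones    = ['022 731 40 18', '031 381 42 60'];

// Assumption: in-memory SQLite stands in for the MySQL server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table layout.
$pdo->exec('CREATE TABLE bars (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    address VARCHAR(255),
    phone VARCHAR(32)
)');

// Prepared once, executed per row; values are bound, never concatenated into SQL.
$insert = $pdo->prepare(
    'INSERT INTO bars (name, address, phone) VALUES (:name, :address, :phone)'
);

$pdo->beginTransaction();
foreach ($names as $i => $name) {
    $insert->execute([
        ':name'    => $name,
        ':address' => $addresses[$i] ?? null,   // NULL if the list is shorter
        ':phone'   => $phones[$i] ?? null,
    ]);
}
$pdo->commit();

// Read the rows back to confirm what was stored.
$stored = $pdo->query('SELECT name, address, phone FROM bars ORDER BY id')
              ->fetchAll(PDO::FETCH_ASSOC);
print_r($stored);
```

Pairing the three lists by position assumes every listing has a name, an address and a phone number; if one listing lacks a field, the later rows shift and get paired with the wrong values.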