This is where the real work is done. The _harvest() function collects the keywords for one URL and stores them in the database. We start by calling the _checkURL() function to determine the validity of the URL, and then fetch the source of the page with the _getData() function.
Next we take the string that contains the source, split it into individual words, and store them in an array. We can do this fairly easily with the preg_split() function. We split the string at every run of whitespace characters, commas, or periods.
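As a small illustration of the split described above, here is a sketch that applies the same pattern to a made-up sentence (the sample text is an assumption, not data from the article). It also shows one detail worth knowing: when the text ends with a delimiter, preg_split() returns a trailing empty string unless the PREG_SPLIT_NO_EMPTY flag is given. Whether _prune() already discards such empty entries is not shown in this section.

```php
<?php
// Sample input (hypothetical); note that it ends with a period.
$data = "PHP is fun, fast.  And popular.";

// Same pattern as the article: split on runs of whitespace, commas, or periods.
$words = preg_split('/[\s,.]+/', $data);
// $words is ['PHP', 'is', 'fun', 'fast', 'And', 'popular', '']

// Optional variant: drop empty pieces while splitting.
$clean = preg_split('/[\s,.]+/', $data, -1, PREG_SPLIT_NO_EMPTY);
// $clean is ['PHP', 'is', 'fun', 'fast', 'And', 'popular']

print_r($clean);
```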
Then we use the array_walk() function and have it call the _prune() function for each array element. You may notice that this array_walk() call looks a little different from the ones you have seen in the past. For the second parameter, we have to pass an array that contains $this as the first element and the name of the function as the second element. This is needed because we are calling a method of the class rather than a plain function. The third argument, &$words, is passed along to _prune() on every call.
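The callback form described above can be sketched on its own. The class below and the body of its _prune() method are hypothetical stand-ins (the article's real _prune() is not shown in this section); the point is only the array($this, '_prune') callback. This sketch changes values in place and does not pass the third argument.

```php
<?php
class Harvester
{
    // Hypothetical stand-in for _prune(): strips surrounding punctuation
    // and lowercases the word. $word is taken by reference so the change sticks.
    public function _prune(&$word, $key)
    {
        $word = strtolower(trim($word, "\"'()!?;:"));
    }

    public function normalize(array $words)
    {
        // array($this, 'methodName') tells PHP to call a method on this object.
        array_walk($words, array($this, '_prune'));
        return $words;
    }
}

$harvester = new Harvester();
$result = $harvester->normalize(array('Hello!', '(World)', 'PHP'));
print_r($result); // hello, world, php
```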
After the array_walk() function completes its task, we run the sort() function on the array of words. We are not so concerned about actually sorting the array; what we want is to force our numerical keys to be sequential. Once _prune() has been applied to every element, it is very likely that we will have gaps in our enumerated array. As an example, we could have keys 0, 1, and 2, and then skip to key 6. To renumber the keys from 0, we can simply run the array through sort().
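The renumbering effect is easy to see in isolation. In this sketch the gaps are created with unset() on a made-up word list (an assumption for illustration). It also shows array_values(), a standard PHP function that renumbers the keys without changing the order, which may be preferable when the original word order matters.

```php
<?php
$words = array('zebra', 'the', 'apple', 'a', 'mango');

// Removing elements leaves gaps: the remaining keys are 0, 2, and 4.
unset($words[1], $words[3]);

// sort() reorders the values and assigns fresh keys 0, 1, 2.
$sorted = $words;
sort($sorted);
// $sorted is ['apple', 'mango', 'zebra']

// array_values() also assigns keys 0, 1, 2 but keeps the original order.
$reindexed = array_values($words);
// $reindexed is ['zebra', 'apple', 'mango']

print_r(array_keys($words));
```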
The next step is to insert the URL into the urls table. We first check whether it already exists, and if it does, we delete the keywords associated with it in the keywords table instead of inserting it again. Doing this allows us to refresh the information in our database from time to time. If we did not check for the existence of the URL, we could end up with the same URL indexed multiple times.
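The check-then-refresh logic above can be sketched with PDO and an in-memory SQLite database so that it runs without a server. Everything here is an assumption made for illustration: the article uses its own _db wrapper and MySQL, and the refreshUrl() helper, the column names, and the schema are hypothetical. The sketch assumes PHP 7 or later with the pdo_sqlite extension.

```php
<?php
// Hypothetical helper: returns the id for $url. If the URL is already
// indexed, its old keywords are deleted; otherwise the URL is inserted.
function refreshUrl(PDO $pdo, string $url): int
{
    $select = $pdo->prepare('SELECT id FROM urls WHERE url = ?');
    $select->execute([$url]);
    $id = $select->fetchColumn(); // false when no row matches

    if ($id !== false) {
        $delete = $pdo->prepare('DELETE FROM keywords WHERE url_id = ?');
        $delete->execute([$id]);
        return (int) $id;
    }

    $insert = $pdo->prepare('INSERT INTO urls (url) VALUES (?)');
    $insert->execute([$url]);
    return (int) $pdo->lastInsertId();
}

// Assumed schema, reduced to the columns the text mentions.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE)');
$pdo->exec('CREATE TABLE keywords (url_id INTEGER, keyword TEXT)');

$first = refreshUrl($pdo, 'http://example.com/');   // new URL: inserted
$pdo->prepare('INSERT INTO keywords VALUES (?, ?)')->execute([$first, 'php']);
$second = refreshUrl($pdo, 'http://example.com/');  // known URL: keywords cleared
```

Calling the helper twice for the same URL returns the same id and leaves a single row in urls, which is the behavior the paragraph above is after.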
The final step of the _harvest() function is to insert the keywords into the keywords table. Because we will have a variable number of keywords and those keywords will be ever changing, we need to construct the SQL query dynamically.
We will accomplish this by using the count() function to determine how many words are in the $words array and then adding each word to a variable called $values. We will add the first value to the $values variable outside of the loop so that we can format the SQL query properly with commas in the right places. The $url_id used in the $values variable is taken from the id of the URL in the urls table.
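For comparison, here is one possible variant of the query construction described above; it is not the article's own code. Instead of adding the first value outside the loop, it lets implode() place the commas, and instead of embedding the words in the SQL string it emits "?" placeholders for use with a prepared statement. The buildKeywordInsert() helper is hypothetical, and the sketch assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the article's class): builds one multi-row
// INSERT with a "(?, ?)" group per word, plus the matching flat parameter list.
function buildKeywordInsert(int $urlId, array $words): ?array
{
    if (count($words) === 0) {
        return null; // nothing to insert, so no query should be run
    }

    // implode() puts ", " between the groups, so the first element
    // needs no special handling.
    $groups = array_fill(0, count($words), '(?, ?)');
    $sql = 'INSERT INTO keywords VALUES ' . implode(', ', $groups);

    $params = [];
    foreach ($words as $word) {
        $params[] = $urlId;
        $params[] = $word;
    }

    return [$sql, $params];
}

list($sql, $params) = buildKeywordInsert(7, ['php', 'mysql']);
echo $sql, "\n"; // INSERT INTO keywords VALUES (?, ?), (?, ?)
```

The resulting pair could then be handed to a prepared statement, which keeps quote characters in a word from breaking the query.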
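The complete _harvest() function looks like this:
<?php
function _harvest($url)
{
    if (!$this->_checkURL($url)) {
        echo "URL is not valid ($url).<br />\n";
    } elseif ($data = $this->_getData($url)) {
        // Split the page source on whitespace, commas, and periods.
        $words = preg_split("/[\s,.]+/", $data);

        // Call-time pass-by-reference (&$words) is a fatal error as of PHP 5.4.
        array_walk($words, array($this, '_prune'), &$words);
        sort($words); // renumber the keys

        // $url and the words go into the SQL unescaped; they must already be safe.
        $url_id = $this->_db->getone("SELECT id FROM urls WHERE url='$url'");
        if ($url_id) {
            // Already indexed: drop the old keywords before re-inserting.
            $this->_db->query("DELETE FROM keywords WHERE url_id=$url_id");
        } else {
            $this->_db->query("INSERT INTO urls SET url='$url'");
            $url_id = mysql_insert_id(); // ext/mysql was removed in PHP 7
        }

        $numwords = count($words);
        if ($numwords > 0) { // skip the INSERT when no words are left
            $values = "($url_id, '$words[0]')";
            for ($i = 1; $i < $numwords; $i++) {
                $values .= ", ($url_id, '$words[$i]')";
            }
            $this->_db->query("INSERT INTO keywords VALUES $values");
        }
    }
}
?>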