| UNIX name | Owner | Status |
|---|---|---|
| crawler | Contextual Code | stable |
| Version | Compatible with |
|---|---|
| N/A | N/A |
The content import process is complicated and unpredictable, especially the crawling phase. The main reason for this complexity is the vast number of different possible scenarios:
It is clearly unrealistic to create a crawler that handles every possible case out of the box. That is why we focused on creating a flexible crawler: the provided configuration options make it possible to handle all of these scenarios.
This crawler stores its data in persistent storage (a database). It was designed to be used in content imports, but it is a separate component that can be used in any other context. Its only purpose is to crawl a site and store the metadata of its pages in persistent storage. That metadata can then be used for any purpose: import, analysis, or any custom functionality.
Require contextualcode/crawler via composer:
composer require contextualcode/crawler
Run the migration:
php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/crawler/src/Resources/config/doctrine_migrations.yaml --no-interaction
This section describes the basic usage concepts. Please check the usage example and reference pages for technical details.
The usage flow is the following:
Implement a crawler handler.
It should be a PHP class that extends ContextualCode\Crawler\Service\Handler. It has many flexible configuration options, which are described in the reference. The simplest crawler handler only needs to provide an import identifier and the site domain:
namespace App\ContentImport;
use ContextualCode\Crawler\Service\Handler as BaseCrawlerHandler;
class CrawlerHandler extends BaseCrawlerHandler
{
public function getImportIdentifier(): string
{
return 'unique-identifier';
}
public function getDomain(): string
{
return 'www.site-to-crawl.com';
}
}
Run the crawler:run command.
This command requires a single argument: the crawler identifier defined in the previous step. A more detailed description of this command is available in the reference:
php bin/console crawler:run unique-identifier
To watch the live logs, please run the following command in a new terminal:
tail -f /XXX/var/log/contextualcode-crawler.log
Running the crawler ...
=======================
Url: http://www.site-to-crawl.com/
Referer:
282/282 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 7 secs/7 secs
All links are processed:
* 281 valid links
* 1 invalid links
Analyze and use the crawled pages' metadata.
The command from the previous step populates ContextualCode\Crawler\Entity\Page entities in the database. They can be used for a content import or any other custom functionality. A detailed explanation of the data stored in those entities is available in the reference.
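As a sketch of this last step, the stored Page entities could be fetched through a standard Doctrine repository and iterated for a custom report. Note this is an illustration only: the getUrl() and getStatusCode() accessors below are assumptions, not confirmed parts of the Page entity's API; check the reference for the actual fields it exposes.

```php
<?php

namespace App\ContentImport;

use ContextualCode\Crawler\Entity\Page;
use Doctrine\ORM\EntityManagerInterface;

class CrawledPagesReport
{
    public function __construct(private EntityManagerInterface $entityManager)
    {
    }

    public function run(): void
    {
        // Fetch every crawled page stored by the crawler:run command.
        $pages = $this->entityManager->getRepository(Page::class)->findAll();

        foreach ($pages as $page) {
            // getUrl()/getStatusCode() are hypothetical accessors used for
            // illustration; see the reference for the real entity fields.
            echo $page->getUrl() . ' => ' . $page->getStatusCode() . PHP_EOL;
        }
    }
}
```

In a Symfony project the EntityManagerInterface dependency would normally be injected through autowiring, so a class like this could be registered as a regular service or wrapped in a console command.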