Building a crawler in PHP

When Spatie unleashes a new site on the web we want to make sure that all links, both internal and external, work. To facilitate that process, we released a tool that checks the status code of every link on a given website. It can easily be installed via Composer:
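Assuming the package is published as spatie/http-status-check, a global install would look like this:

```bash
composer global require spatie/http-status-check
```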

Let’s, for example, scan a domain:
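Assuming the CLI binary is called http-status-check and exposes a scan command, that would look something like this (with a placeholder domain):

```bash
http-status-check scan https://example.com
```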

Our little tool will spit out the status code of all links it finds:

[screenshot of the console output]

And once finished, a summary with the number of links per status code will be displayed.

The package uses a homegrown crawler. Sure, there are already many other crawlers available, but I built a custom one partly as a learning exercise and partly because the existing crawlers didn’t do exactly what I wanted.

Let’s take a look at the code that crawls all links for a piece of HTML:
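(A sketch rather than the package’s verbatim code: the method and helper names used here, and the collect() helper from illuminate/support, are assumptions.)

```php
// A sketch: getAllLinks, normalizeUrl, isEmailUrl and crawl are illustrative names;
// collect() assumes illuminate/support is installed.
protected function crawlAllLinks($html)
{
    $allLinks = $this->getAllLinks($html);

    collect($allLinks)
        ->filter(function (Url $url) {
            // Skip mailto-links.
            return ! $url->isEmailUrl();
        })
        ->map(function (Url $url) {
            // Turn relative and protocol independent links into absolute ones.
            return $this->normalizeUrl($url);
        })
        ->filter(function (Url $url) {
            // Let the crawl profile decide if the url should be crawled.
            return $this->crawlProfile->shouldCrawl($url);
        })
        ->each(function (Url $url) {
            $this->crawl($url);
        });
}
```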

So first we get all links. Then we filter out mailto-links. The next step normalizes all links. After that, a CrawlProfile determines whether each link should be crawled. And finally the link gets crawled.

Determining all links on a page
Determining which links there are on a page may sound quite daunting, but Symfony’s DomCrawler makes that very easy. Here’s the code:
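(A sketch: the Url::create() factory method is an assumption about the package’s Url class; the rest is the DomCrawler API.)

```php
use Symfony\Component\DomCrawler\Crawler as DomCrawler;

protected function getAllLinks($html)
{
    $domCrawler = new DomCrawler($html);

    // Grab the href attribute of every anchor tag and wrap it in a Url object.
    return array_map(function ($href) {
        return Url::create($href);
    }, $domCrawler->filterXPath('//a')->extract(['href']));
}
```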

The DomCrawler returns strings. Those strings get mapped to Url-objects to make it easy to work with them.

Normalizing links
On a webpage, protocol independent links (e.g. ‘//domain.com/contactpage’) and relative links (‘/contactpage’) may appear. Our little crawler needs absolute links (‘https://domain.com/contactpage’), so all links need to be normalized. The code to do that:
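(A sketch: the Url helper methods used here — isRelative, isProtocolIndependent, setScheme and setHost — are assumptions about the package’s Url class.)

```php
// A sketch: the Url helper methods are assumptions, not the package's exact API.
protected function normalizeUrl(Url $url)
{
    // Give relative links ('/contactpage') the scheme and host of the base url.
    if ($url->isRelative()) {
        $url->setScheme($this->baseUrl->scheme)
            ->setHost($this->baseUrl->host);
    }

    // Give protocol independent links ('//domain.com/contactpage') the scheme of the base url.
    if ($url->isProtocolIndependent()) {
        $url->setScheme($this->baseUrl->scheme);
    }

    return $url;
}
```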

$baseUrl in the code above contains the url of the site we’re scanning.

Determining if a url should be crawled
The crawler delegates the question of whether a url should be crawled to a dedicated class that implements the CrawlProfile-interface.
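In essence the interface is a single method (the signature below is an approximation, assuming the Spatie\Crawler namespace):

```php
namespace Spatie\Crawler;

interface CrawlProfile
{
    /*
     * Determine if the given url should be crawled.
     */
    public function shouldCrawl(Url $url);
}
```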

The package provides an implementation that will crawl all urls. If you want to filter out some urls, there’s no need to change the code of the crawler. Just create your own CrawlProfile-implementation.
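For example, a hypothetical profile that only crawls internal urls could look like this (the public $host property on Url is an assumption):

```php
use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

// A hypothetical example: only crawl urls that live on the host we're scanning.
class CrawlInternalUrls implements CrawlProfile
{
    protected $host;

    public function __construct($host)
    {
        $this->host = $host;
    }

    public function shouldCrawl(Url $url)
    {
        return $url->host === $this->host;
    }
}
```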

Crawling a url
Guzzle makes fetching the html of a url very simple.
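In essence it boils down to a single GET request; a minimal sketch (the real code passes more options and handles failing requests):

```php
use GuzzleHttp\Client;

$client = new Client();

// Don't throw on 4xx/5xx responses so their status codes can be reported too.
$response = $client->get((string) $url, ['http_errors' => false]);

$html = (string) $response->getBody();
$statusCode = $response->getStatusCode();
```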

There’s a little bit more to it, but the code above is the essential part.

Observing the crawl process
Again, you shouldn’t touch the code of the crawler itself to add behavior to it. When instantiating the crawler, you’re expected to pass it an implementation of CrawlObserver.

Looking at the interface should make things clear:
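(An approximation of the interface, again assuming the Spatie\Crawler namespace; treat the exact method signatures as assumptions.)

```php
namespace Spatie\Crawler;

interface CrawlObserver
{
    /*
     * Called before the crawler will crawl the given url.
     */
    public function willCrawl(Url $url);

    /*
     * Called after the crawler has crawled the given url.
     */
    public function hasBeenCrawled(Url $url, $response);

    /*
     * Called when the crawl has ended.
     */
    public function finishedCrawling();
}
```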

The http-status-check tool uses an implementation of this interface to display all found links in the console.

That concludes the little tour of the code. I hope you’ve seen that creating a crawler is not that difficult. If you want to know more, just read the code of the Crawler-class on GitHub.

My colleague Sebastian had the great idea of creating a Laravel package that provides an Artisan command to check all links of a Laravel application. You might see that appear amongst our Laravel packages soon.
