A client recently came to Avenue A | Razorfish with a major problem. Their site, containing over 150,000 URLs, had hundreds of non-functioning pages, and visitors who followed links to them often landed on a 404 error. With such a large site, the client had difficulty identifying every “broken” page. Even with a complete list of broken pages, they would not have been able to identify the pages that contained the dynamically generated links to them, which made correcting the issue very difficult.
The AA|RF Product Technology team in Philadelphia, tasked with the development and support of two search technologies (SEOsource and SEOdirect), has developed a spider called SiLC (Super Intelligent Link Crawler) that can efficiently crawl large, dynamically generated sites. In addition to tracking site errors, SiLC records the linking relationships between the pages it encounters during a crawl, resulting in a hierarchical view of the site’s structure. From this “link map”, AA|RF was able to provide the client with a complete list of broken pages, as well as the URL of every page that contained a link to each one.
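To make the link map idea concrete, here is a minimal sketch in Python. It is not SiLC's actual implementation, which this article does not describe. The names `crawl`, `broken_link_report`, `fake_fetch`, and the toy `SITE` dictionary are all hypothetical, and the `fetch` callable stands in for a real HTTP request so the example runs without a network. The key point is that the crawler records which page each link was found on, so every broken URL can be traced back to the pages that link to it.

```python
from collections import deque


def crawl(start_url, fetch):
    """Breadth-first crawl that records which pages link to each URL.

    `fetch(url)` is a hypothetical callable returning a tuple of
    (status_code, list_of_links_found_on_that_page).
    """
    status = {}    # URL -> HTTP status code returned for it
    link_map = {}  # target URL -> set of pages that contain a link to it
    seen = {start_url}
    queue = deque([start_url])

    while queue:
        url = queue.popleft()
        code, links = fetch(url)
        status[url] = code
        if code != 200:
            continue  # an error page has no links worth following
        for target in links:
            # Record the relationship even if the target was already seen,
            # so every referring page ends up in the link map.
            link_map.setdefault(target, set()).add(url)
            if target not in seen:
                seen.add(target)
                queue.append(target)

    return status, link_map


def broken_link_report(status, link_map):
    """Map each URL that returned 404 to the sorted list of pages linking to it."""
    return {
        url: sorted(link_map.get(url, ()))
        for url, code in sorted(status.items())
        if code == 404
    }


# Toy site used instead of real HTTP requests (an assumption for illustration).
SITE = {
    "/": (200, ["/products", "/about"]),
    "/products": (200, ["/products/widget", "/products/old-item"]),
    "/about": (200, ["/", "/products/old-item"]),
    "/products/widget": (200, []),
}


def fake_fetch(url):
    # Any URL missing from the toy site is treated as a 404.
    return SITE.get(url, (404, []))


status, link_map = crawl("/", fake_fetch)
report = broken_link_report(status, link_map)
for broken_url, referrers in report.items():
    print(broken_url, "is linked from", referrers)
```

In this toy site, `/products/old-item` does not exist, and the report lists both `/about` and `/products` as the pages that need fixing. A real crawler would replace `fake_fetch` with code that issues HTTP requests and extracts links from the returned pages.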
Since SiLC is designed to crawl and process content that standard search engine spiders have difficulty with, it can also be used to generate a list of pages that are not visible to search engines. When combined with a link map, this list lets us show visually which sections of a site are effectively invisible to spiders. By providing the list of broken pages and the link map to the client, AA|RF made it much easier for them to identify and fix the broken pages on their site. Future articles will explore additional ways that AA|RF is leveraging SiLC to solve client problems.