Scrapy: Fix wrong sitemap URLs with custom downloader middleware

On Stack Overflow, there was a discussion about how to handle sitemaps that contain absolute URLs without a scheme (for example //www.example.com/page). According to the URI RFC (RFC 3986) this is fine, but as the maintainers of Scrapy pointed out, the sitemap specification requires the contents of <loc> to include a scheme (called protocol in the sitemap specs).

So it is up to the author of a spider to fix this issue when encountering websites that use the wrong format.

Overriding the default spider does not work well in this case, because one would have to copy a lot of code. This seems like a good case for a middleware: change the response into a valid format without the spider noticing.

The downloader middleware allows us to change the response by returning either a modified or a totally new Response object from process_response.

Thus, it’s rather easy to implement a middleware that replaces wrongly formatted URLs with correct ones, at least for the simplest cases. I did not implement any sophisticated XML namespace parsing, nor support for Google’s alternate language pages. Namespace parsing only matters in theory (and for alternate language pages): the sitemap author could bind a prefix to what is normally the default namespace, and then the element would not be called <loc> but, for example, <sitemap:loc>.
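To illustrate the prefix problem, here is a minimal sketch (written for Python 3, not part of the middleware further down) of a pattern that tolerates an optional prefix in front of loc. The names LOC_WITHOUT_SCHEME and add_scheme and the sample string are made up for this example. It is still plain text matching, not real namespace parsing.

```python
import re

# Optional prefix such as "sitemap:" in front of "loc"; the backreference \1
# requires the closing tag to use the same (possibly prefixed) name.
# Surrounding whitespace inside the element is dropped.
LOC_WITHOUT_SCHEME = re.compile(r"<((?:[\w.-]+:)?loc)>\s*//(.+?)\s*</\1>")


def add_scheme(body, scheme):
    """Prefix scheme-relative URLs in (possibly prefixed) loc elements."""
    return LOC_WITHOUT_SCHEME.sub(
        lambda m: "<%s>%s://%s</%s>" % (m.group(1), scheme, m.group(2), m.group(1)),
        body,
    )


# Hypothetical input: one prefixed and one unprefixed element.
sample = (
    "<sitemap:loc>//www.example.com/a</sitemap:loc>"
    "<loc> //www.example.com/b </loc>"
)
result = add_scheme(sample, "https")
```

Note that this does not check which namespace the prefix is bound to, so it would also touch elements such as <image:loc> from other sitemap extensions.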

Ignoring these cases, a regular expression is enough to add the scheme where it is missing.

import re
import urlparse
from scrapy.http import XmlResponse
from scrapy.utils.gz import gunzip, is_gzipped
from scrapy.contrib.spiders import SitemapSpider

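# Downloader middleware. Written for Python 2 and an older Scrapy release:
# on Python 3 the module is urllib.parse and response.body is bytes, and
# newer Scrapy versions provide SitemapSpider in scrapy.spiders.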
class SitemapWithoutSchemeMiddleware(object):
    def process_response(self, request, response, spider):
        if isinstance(spider, SitemapSpider):
            body = self._get_sitemap_body(response)

            if body:
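                # Reuse the scheme the sitemap itself was fetched with.
                scheme = urlparse.urlsplit(response.url).scheme
                # Non-greedy match, so several <loc> elements on one line
                # are each fixed individually.
                body = re.sub(r'<loc>//(.+?)</loc>', r'<loc>%s://\1</loc>' % scheme, body)
                return response.replace(body=body)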

        return response

    def _get_sitemap_body(self, response):
        """Return the sitemap body contained in the given response, or None if the
        response is not a sitemap.
        """
        if isinstance(response, XmlResponse):
            return response.body
        elif is_gzipped(response):
            return gunzip(response.body)
        elif response.url.endswith('.xml'):
            return response.body
        elif response.url.endswith('.xml.gz'):
            return gunzip(response.body)
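
To see what the substitution does in isolation, here is a small sketch that runs outside of Scrapy (written for Python 3). The sitemap URL and body are made-up sample data, and add_missing_scheme is a hypothetical helper name. It uses a non-greedy group, which matters when several <loc> elements share one line.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample data, not taken from a real site.
SITEMAP_URL = "https://www.example.com/sitemap.xml"
SITEMAP_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>//www.example.com/a</loc></url>"
    "<url><loc>//www.example.com/b</loc></url>"
    "<url><loc>https://www.example.com/c</loc></url>"
    "</urlset>"
)


def add_missing_scheme(body, sitemap_url):
    """Prefix scheme-relative <loc> values with the scheme of sitemap_url."""
    scheme = urlsplit(sitemap_url).scheme
    # Non-greedy group: each <loc> element is matched on its own, even if
    # the whole sitemap is a single line.
    return re.sub(r"<loc>//(.+?)</loc>", r"<loc>%s://\1</loc>" % scheme, body)


fixed = add_missing_scheme(SITEMAP_BODY, SITEMAP_URL)
```

The first two entries receive the https scheme of the sitemap URL, while the third, which already has a scheme, is left untouched.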

The newly created middleware can then be added to your project through the settings file. The exact import path depends on where you saved the middleware.

DOWNLOADER_MIDDLEWARES = {
    'middlewares.SitemapWithoutSchemeMiddleware': 900
}
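
The key of that dictionary is the import path of the class, and the number is its position in the middleware chain. As a sketch, assuming a hypothetical project package called myproject with the class stored in myproject/middlewares.py, the entry would look like this:

```python
# settings.py of a hypothetical project package named "myproject"
DOWNLOADER_MIDDLEWARES = {
    # "<package>.<module>.<class>": order
    "myproject.middlewares.SitemapWithoutSchemeMiddleware": 900,
}
```

As far as I understand the Scrapy documentation, process_response is called on the enabled downloader middlewares in decreasing order of this number, so a high value such as 900 means this middleware sees the response before most of the built-in ones do.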

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail at blog@stefan-koch.name.