Scrapy: Fix wrong sitemap URLs with custom downloader middleware
A Stack Overflow question discussed how to deal with sitemaps that contain absolute URLs without a scheme. According to the RFC this is fine, but as the maintainers of Scrapy pointed out, the sitemap protocol requires the contents of <loc> to include a scheme (called protocol in the sitemap specs).
So it is left to the author of a spider to fix this issue when they encounter websites using the wrong format.
Overriding the default spider does not work well in this case, because one would have to copy a lot of code. So this seems like a good case for a middleware: change the response to a valid format without the spider noticing.
A downloader middleware allows us to change the response by returning either a modified or a totally new Response object from its response-processing method.
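To illustrate the mechanism, here is a minimal sketch of such a middleware. It assumes the classic Scrapy hook signature process_response(request, response, spider) and uses Response.replace() to build the modified copy. The class name SitemapFixMiddleware, the transform() hook and the URL-based sitemap check are hypothetical names and choices made for this example only.

```python
class SitemapFixMiddleware:
    """Sketch of a downloader middleware that rewrites sitemap responses."""

    def process_response(self, request, response, spider):
        # Assumption: anything with "sitemap" in its URL is a sitemap.
        if "sitemap" not in response.url:
            return response

        new_body = self.transform(response.body)
        if new_body == response.body:
            # Nothing to fix, hand the original response through.
            return response

        # Response.replace() returns a copy with the given attributes changed,
        # so the spider receives a regular response with the new body.
        return response.replace(body=new_body)

    def transform(self, body: bytes) -> bytes:
        # Hypothetical hook: a subclass would return the corrected XML here.
        return body
```

Because the class only relies on the url, body and replace() members of the response, it needs no Scrapy imports itself.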
Thus, it’s rather easy to implement a middleware which takes care of replacing wrongly formatted URLs with correct ones - at least for the most simplistic cases. I did not implement any sophisticated XML namespace parsing, nor did I implement support for Google’s alternate language pages. The XML namespace parsing would only be important in theory (and for alternate language pages), because the sitemap author could bind the normally default namespace to a prefix, and then the element would not be called <loc> but would carry that prefix instead.
Ignoring these things, one can just use regular expressions to add the scheme where it is missing.
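A minimal sketch of the regular expression approach could look like the following. It is not the original code of this post: the function name add_missing_scheme, the default scheme and the decision to also handle URLs starting with // are assumptions for illustration. It works on the raw bytes of the response body and, as described above, ignores namespace prefixes and alternate language links.

```python
import re

# Captures "<loc>" plus optional whitespace, then the URL text up to the
# next "<" or whitespace character. Namespace-prefixed tags are not matched.
LOC_RE = re.compile(rb"(<loc>\s*)([^<\s]+)", re.IGNORECASE)

# A URL scheme (letter, then letters, digits, "+", ".", "-") followed by "://".
SCHEME_RE = re.compile(rb"^[A-Za-z][A-Za-z0-9+.-]*://")


def add_missing_scheme(body: bytes, scheme: bytes = b"http") -> bytes:
    """Return the sitemap body with a scheme added to <loc> URLs lacking one."""

    def fix(match):
        prefix, url = match.group(1), match.group(2)
        if SCHEME_RE.match(url):
            # Already valid, keep the match untouched.
            return match.group(0)
        if url.startswith(b"//"):
            # Scheme-relative form: only the "scheme:" part is missing.
            return prefix + scheme + b":" + url
        return prefix + scheme + b"://" + url

    return LOC_RE.sub(fix, body)
```

Inside a downloader middleware, the result could be handed back with response.replace(body=add_missing_scheme(response.body)). Which scheme to add is a judgment call; taking it from the URL of the sitemap itself would be one option.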
The newly created middleware can then be added to your project through the settings file (exact setting of course depends on where you saved the middleware).
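As a rough example, the entry in settings.py might look like this. DOWNLOADER_MIDDLEWARES is Scrapy's setting for enabling downloader middlewares; the module path myproject.middlewares.SitemapFixMiddleware and the order value 543 are placeholders that have to be adapted to your project.

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical import path: adjust to where the middleware class lives.
    "myproject.middlewares.SitemapFixMiddleware": 543,
}
```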