This is an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.
In Scrapy, the download latency is measured as the time elapsed between establishing the TCP connection and receiving the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment, because Scrapy may be busy (processing a spider callback, for example) and unable to attend to downloads. However, these latencies should still give a reasonable estimate of how busy Scrapy (and ultimately, the server) is, and this extension builds on that premise.
The extension adjusts download delays and concurrency based on the following rules:
Note
The AutoThrottle extension honours the standard Scrapy settings for concurrency and delay. This means that it will never set a download delay lower than DOWNLOAD_DELAY or a concurrency higher than CONCURRENT_REQUESTS_PER_DOMAIN (or CONCURRENT_REQUESTS_PER_IP, depending on which one you use).
The settings used to control the AutoThrottle extension are:
For more information see Throttling algorithm.
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
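The bounds described in this section can be pictured as a simple clamp. The following is a minimal sketch, not Scrapy's actual implementation: the function `clamp_delay` and its parameter names are hypothetical, and it only illustrates that a proposed delay is kept between the configured DOWNLOAD_DELAY and the maximum delay (60.0 seconds by default).

```python
def clamp_delay(candidate: float, min_delay: float, max_delay: float = 60.0) -> float:
    """Keep a proposed download delay within [min_delay, max_delay].

    candidate: the delay (in seconds) a throttling rule would like to use.
    min_delay: the lower bound, standing in for DOWNLOAD_DELAY.
    max_delay: the upper bound, standing in for the maximum-delay setting.
    """
    # Cap the candidate at the upper bound first, then raise it to the lower bound.
    return max(min_delay, min(candidate, max_delay))


# Hypothetical values, chosen only for illustration.
too_low = clamp_delay(0.1, min_delay=0.5)     # raised to the lower bound
too_high = clamp_delay(120.0, min_delay=0.5)  # capped at the upper bound
in_range = clamp_delay(3.0, min_delay=0.5)    # left unchanged
```

In this sketch a very slow site cannot push the delay past the cap, and a very fast one cannot pull it below the delay you configured yourself.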
AUTOTHROTTLE_DEBUG
Default: False
Enables AutoThrottle debug mode, which displays stats for every response received, so you can see how the throttling parameters are being adjusted in real time.
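As an illustration, a project's settings module might combine these options as sketched below. The setting names AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_MAX_DELAY and AUTOTHROTTLE_DEBUG do not appear in the text above and are assumed here to be the corresponding Scrapy setting names. The values for DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are arbitrary examples, not recommendations.

```python
# settings.py (sketch; values other than the documented defaults are examples)

# Assumed setting name: switches the AutoThrottle extension on.
AUTOTHROTTLE_ENABLED = True

# Upper bound on the download delay, in seconds (60.0 is the documented default).
AUTOTHROTTLE_MAX_DELAY = 60.0

# Show throttling stats for every response received (the documented default is False).
AUTOTHROTTLE_DEBUG = True

# Standard Scrapy limits that AutoThrottle honours:
# it never sets a delay lower than DOWNLOAD_DELAY ...
DOWNLOAD_DELAY = 0.5
# ... or a concurrency higher than CONCURRENT_REQUESTS_PER_DOMAIN.
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```

Debug mode is mainly useful while tuning a crawl, since it produces output for every response; consider switching it back off once the parameters look reasonable.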