From around 3 PM to 9:30 PM on Friday, July 18th, our systems experienced abnormally high scanning volume. At 9 PM it became apparent that something was not right, and upon investigation the issue was fixed and scanning volumes returned to normal before 9:30 PM. This is a write-up of the cause of the increased scanning volume, its impact, the solution, and what we are doing to prevent it from happening again in the future.
We noticed that the vast majority of the websites being scanned during this period were those with very low scanning (billing) limits. Interestingly, these sites were scanning above their currently billed limit. For example, if an account had a scanning limit of 1000 scans, we were seeing scans taking place that were scan #1400 of the 1000… which should not be happening.
To understand how something that should not happen did happen, you need to understand how we keep track of usage.
We have two mechanisms for keeping track of usage: immutable records of each scan occurrence, and real-time counters in Redis. When a scan is committed, an immutable record of that event is added and the counter is incremented. The counter allows us to efficiently query scan counts for various functions without actually counting database rows, which would be a big drain given how frequently we check this information.
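As a rough sketch of that dual write (the table, key names, and client code here are illustrative, not our actual implementation), recording a scan looks something like this:

```python
import redis

r = redis.Redis()  # hypothetical connection; real settings differ


def record_scan(db, account_id, month):
    """Record one completed scan in both tracking mechanisms."""
    # 1. The immutable record: the source of truth.
    #    (Table and columns are made up for illustration.)
    db.execute(
        "INSERT INTO scan_events (account_id, occurred_at) VALUES (%s, NOW())",
        (account_id,),
    )
    # 2. The real-time counter the scheduler reads, e.g. one key
    #    per account per month.
    r.incr(f"scan_count:{account_id}:{month}")
```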
When we say that we were observing scan #1400 of 1000 occurring, we mean that our immutable records counted 1400 scans while the account had a 1000-scan limit. Our first instinct was to check the Redis count to see if it also showed 1400, and lo and behold, in this example, we saw a count of 400. This told us that for some reason these websites had effectively had their Redis counts cleared at some point, which signaled to our scheduling service that these websites had plenty of utilization left should they need to perform more automated scans.
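To illustrate why a cleared counter looks like available capacity to the scheduler, here is a hypothetical sketch using the same illustrative key scheme:

```python
def remaining_scans(r, account_id, month, limit):
    """How much automated-scan headroom the scheduler believes an account has."""
    raw = r.get(f"scan_count:{account_id}:{month}")
    # If the counter key disappears, GET returns None and usage reads as zero,
    # so an account at 1000/1000 suddenly looks like it has 1000 scans left.
    used = int(raw) if raw is not None else 0
    return max(limit - used, 0)
```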
Immediately upon identifying this as the cause of the problem, we re-seeded our Redis counts from our immutable scan records. This put a stop to scanning for accounts that had already hit their limit this month, and scanning volumes returned to expected levels. We also credited the affected accounts with the appropriate number of scans so users would not see any overages inside their accounts.
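Conceptually, the re-seed is just counting the immutable records and overwriting the counter; something along these lines (the database handle, query, and key names are illustrative):

```python
def reseed_counter(db, r, account_id, month_key, month_start, next_month_start):
    """Rebuild one Redis usage counter from the immutable scan records."""
    # Count the authoritative rows for this account within the month
    # (table and column names are illustrative).
    true_count = db.execute(
        "SELECT COUNT(*) FROM scan_events "
        "WHERE account_id = %s AND occurred_at >= %s AND occurred_at < %s",
        (account_id, month_start, next_month_start),
    ).fetchone()[0]
    # Overwrite whatever Redis currently holds with the true count.
    r.set(f"scan_count:{account_id}:{month_key}", true_count)
    return true_count
```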
How did this happen? Why would some (but not all) usage counters get cleared in Redis?
Earlier in the day, we were investigating a memory leak in Redis. Nothing that had any effect on the service, but a slow creep in utilization that would one day become an issue if we didn't find the place in our application causing it. We identified a few problematic keys and thought we had fixed the issue.
However, during this investigation we misused a particular Redis command pertaining to the streams data type: we inadvertently tried loading the entirety of a very large stream into a single response. We received an error, alongside a few errors in other services indicating we had just (very temporarily) locked up our Redis instance. This was bad, but it was seen as a temporary problem, and certainly not one with any effect on data durability. A few slow responses, or response retries, seemed to be all that we had caused.
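For context, reading a Redis stream without bounding the result is an easy mistake to make. Something like the first call below asks Redis to materialize the whole stream in one reply, while the second pages through it safely (the stream name is illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

# Unbounded: asks Redis to return every entry in the stream in one reply.
# On a very large stream this stalls the server while it builds the response.
entries = r.xrange("scan:event-stream", min="-", max="+")

# Bounded: page through the stream a few hundred entries at a time instead.
cursor = "-"
while True:
    batch = r.xrange("scan:event-stream", min=cursor, max="+", count=500)
    if not batch:
        break
    for entry_id, fields in batch:
        ...  # process the entry
    # Resume just after the last entry seen (exclusive ranges need Redis >= 6.2).
    cursor = "(" + batch[-1][0]
```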
After going through a list of other potential causes for these count resets, this error from earlier in the day came to mind. Was it possible that, as benign as it seemed, something else had happened here too? Researching this led us to our Redis configuration, in particular our maxmemory policy. This policy dictates how Redis behaves when it runs up against its memory limit. Our policy dictated that when Redis needed more memory to store data or to fulfill a request (like accidentally reading in an entire stream), it would automatically evict the least-read keys. What we thought was a harmless error had actually evicted some of our older usage count keys.
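The relevant setting is Redis's maxmemory-policy. Ours was an evict-under-pressure variant (along the lines of allkeys-lru); a quick way to see what is in force, sketched with the Python client:

```python
import redis

r = redis.Redis(decode_responses=True)

# Inspect the memory ceiling and the eviction policy currently in force.
print(r.config_get("maxmemory"))
print(r.config_get("maxmemory-policy"))
# Under an allkeys-* policy, any key can be evicted when Redis needs memory,
# including long-lived usage counters that are rarely read.
```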
This error occurred at 12:30 PM.
Our scheduler immediately, but slowly, started picking up on these freshly unrestricted accounts and scheduling scans accordingly. Because of the gradual nature of this increase in volume, it was not apparent that something had just happened until much later.
The impact to affected customers (outside of some free scanning) was the potential for duplicate notifications. When customers hit or approach their scanning limit, we send email notifications such as "You have hit your limit this month". Customers that hit their limit earlier this month received this email at that time, as expected. Because of this incident, however, the affected subset of these customers that hit their limit again received these emails again.
To reiterate what was NOT impacted:
The most concrete change we will be making here is updating our Redis configuration during our next maintenance period. While the large majority of our Redis usage is automatically recoverable caching or short-lived values, we do have a few pieces of data we need to hold onto more durably. These scan counts, for example, need to be extremely durable for their lifetime, which is the entire month they refer to. We will configure Redis more appropriately for this use case.
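As a sketch of the direction this could take (the exact policy and key layout are still to be decided, so treat this as an assumption rather than our final configuration):

```python
import redis

r = redis.Redis(decode_responses=True)

# One possible direction: only let keys with an explicit TTL be evicted,
# so durable counters (stored without a TTL) are never eviction candidates.
r.config_set("maxmemory-policy", "volatile-lru")

# Cache-style values opt in to eviction by carrying an expiry...
r.set("cache:scan-report:123", "...", ex=3600)

# ...while usage counters are written without one and survive memory pressure.
r.incr("scan_count:123:example-month")
```

Other options include keeping durable keys on a separate Redis instance that never evicts; the trade-off is that cache memory then has to be managed more deliberately.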
Besides this, some other improvements we will be exploring: