Page Monitoring Data Inconsistencies

Incident Report for Accessible Web

Postmortem

This bug was discovered after our team noticed a URL mismatch on an accessibility result object. After a quick investigation we found that, while the problem was rare (a 0.07% incidence rate among pages scanned during the affected period), we were able to find a few additional occurrences.

When our monitoring system runs an accessibility scan on a page, it saves the results in an object containing some high-level keys such as the date, the duration, and the URL of the page that was originally meant to be scanned (to name a few). It also contains a key that stores the actual accessibility results we receive from running axe-core on the page. Part of axe-core's results is the URL of the page that the results came from. This is where we initially noticed the mismatch. At the top level, the result object was telling us the results were for page A, but the actual results from the accessibility scan appeared to be (and were) from an unrelated page B. Somewhere along the line, the accessibility scan for one page was being attached to the result object for another. This presented itself in RAMP as completely unrelated accessibility results from a random URL in our system.
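As an illustration of the mismatch described above, here is a minimal Python sketch of how a result object could be checked. The key names `url` and `axe_results` on our result object are assumptions made for this example; the `url` field inside the axe-core results is the one axe-core itself reports. The normalization is deliberately light (case of scheme and host, trailing slash), and a real check would also need to account for legitimate redirects.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple:
    """Reduce a URL to a comparable form (assumed normalization rules)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def has_url_mismatch(result: dict) -> bool:
    """True when the top-level URL disagrees with the URL axe-core reported."""
    return normalize(result["url"]) != normalize(result["axe_results"]["url"])


def incidence_rate(results: list) -> float:
    """Fraction of result objects whose two URLs disagree."""
    if not results:
        return 0.0
    mismatched = sum(1 for r in results if has_url_mismatch(r))
    return mismatched / len(results)
```

Run over every result stored during a given period, a check like this would yield an incidence rate comparable to the figure quoted above.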

To understand how this might have happened and how we have fixed it in the interim, it’s helpful to understand how our system goes about running scans on websites. To perform accessibility scans, our monitoring system reads the websites it needs to scan at a particular time from a queue. A large number of workers read from this queue, ready to do the actual scanning. Each worker is capable of grabbing 2 items from the queue and processing them at the same time.

Each item grabbed gets its own web browser in which to perform its scans. What we saw is that, under very specific conditions, the results of pages being scanned at the same time in the 2 web browsers of a single worker were being mixed. Every URL mismatch we found appears to have involved 2 websites that the same worker was processing at the same time.
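The worker model described above can be sketched roughly as follows. This is a simplified simulation, not our scanner's code: `scan_page` is a hypothetical stand-in for launching a browser and running axe-core, and the `concurrency` parameter models how many queue items a worker grabs at once (2 during the affected period).

```python
import asyncio


async def scan_page(url: str) -> dict:
    """Hypothetical stand-in for one browser running axe-core on one page."""
    await asyncio.sleep(0.01)  # simulate the time a scan takes
    return {"url": url, "axe_results": {"url": url, "violations": []}}


async def worker(queue: asyncio.Queue, results: list, concurrency: int = 2) -> None:
    """Grab up to `concurrency` items at a time and scan them together."""
    while not queue.empty():
        batch = []
        while len(batch) < concurrency and not queue.empty():
            batch.append(queue.get_nowait())
        # Each item gets its own (simulated) browser; scans in a batch overlap.
        results.extend(await asyncio.gather(*(scan_page(u) for u in batch)))


async def run(urls, workers: int = 3, concurrency: int = 2) -> list:
    """Fill the queue, start the workers, and collect every result object."""
    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)
    results = []
    await asyncio.gather(*(worker(queue, results, concurrency) for _ in range(workers)))
    return results
```

In this simulation the results never mix, since each scan returns its own URL. The point is only to show where two scans overlap inside one worker, and that calling `run(urls, concurrency=1)` removes that overlap.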

While we don’t yet know why messages passed back from these browsers were, on rare occasions, mixed, we have a high level of confidence that our interim solution of limiting each worker to grabbing 1 item from the queue at a time has fixed the issue. We also promptly removed the results from the affected pages and triggered recalculations for the appropriate pages, websites, etc.
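The cleanup step could look something like the sketch below. It is illustrative only: the keys `url`, `axe_results`, and `website_id` are assumed names, and the actual removal and recalculation jobs are not shown.

```python
def split_affected(results: list) -> tuple:
    """Separate trustworthy results from those whose two URLs disagree."""
    kept, affected = [], []
    for r in results:
        if r["url"] != r["axe_results"]["url"]:
            affected.append(r)
        else:
            kept.append(r)
    return kept, affected


def recalculation_targets(affected: list) -> dict:
    """Collect the pages and websites whose statistics need recomputing."""
    return {
        "pages": sorted({r["url"] for r in affected}),
        "websites": sorted({r["website_id"] for r in affected}),
    }
```

The affected results would then be deleted, and each page and website listed by `recalculation_targets` would be queued for recalculation.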

Regarding the time frame of this incident: On February 21st, we rolled out some significant changes to our scanner's internal architecture. Combined with the fact that we would probably have run into this bug earlier had it already existed, this leads us to believe that February 21st is the date the bug was introduced.

Posted Mar 10, 2023 - 12:07 EST

Resolved

A bug was discovered that assigned the wrong results to some page scans. During the affected period (2/21 - 3/10), only 0.07% of all scanned pages were affected. Once this was discovered, we promptly removed the results from the affected pages and triggered recalculations for the appropriate pages, websites, etc. As such, you may see a small deviation from your website's typical statistics in this date range.

Posted Feb 21, 2023 - 09:00 EST