Point and Change Point Anomaly Detection

Types of Anomalies, why you should care, and the finer points of Point and Change Point Anomaly Detection

Now that we have tackled the Dos, Don’ts, and Forecasting, it’s time to cover the next part of the line graph-based UX for AI design patterns: Anomalies.

Andrew Maguire writes that time series (read: line graph) Anomalies come in five flavors, and that is an exceptionally useful and wise classification from the UX standpoint. In this article, we will review the use cases for Anomaly detection in general and dig into the details of Point and Change Point Anomaly Detection.

After reading this article, you (whether a UX Designer or a Product Manager) should be well equipped to have high-quality, detailed conversations with your Data Science and Engineering colleagues and to discuss important considerations of the interface design. You will also have access to UI best practices for fine-tuning your system to optimize usability and avoid false positives and false negatives.

Why is Detecting Anomalies Important?

The specific use case might differ slightly for each anomaly type; however, all anomaly detection shares a few similarities and is useful in a wide range of situations:

  1. Identification of Critical Production Issues:

    A sudden and significant drop in signal strength at a specific telecommunications tower likely points to a critical issue, such as equipment failure. Anomaly detection helps engineers catch the problem early and fix it before it impacts the quality of service.

  2. Quality Control and Assurance:

    In manufacturing (particularly in 6-Sigma shops), the detection of an anomaly in some measurement of a gadget on a production line may signal a quality control issue. Identifying this anomaly helps the manufacturer identify the source of the problem, improve the manufacturing process, and ensure the production of high-quality gadgets.

  3. Security and Fraud Detection:

    In the financial industry, the sudden use of a credit card for multiple high-value transactions in different countries within a short time frame can be indicative of fraudulent activity. Early detection of such anomalies allows the bank to block the card, limiting its liability and stopping the fraud.

  4. Early Warning System:

    Anomalies play a crucial role in predictive maintenance for machinery. For example, an unusual increase in vibration or temperature readings from a specific component of an industrial machine may indicate impending failure. Detecting anomalies early allows maintenance teams to schedule timely repairs or replacements, preventing unexpected downtime.

  5. Improving Decision-Making:

    In an e-commerce platform, a sudden surge in website traffic beyond normal patterns during a specific time period (e.g., due to a marketing campaign) can be considered a “happy” anomaly. Successfully recognizing this anomaly allows the marketing team to adjust strategies in real-time to capitalize on the increased interest and potentially boost sales.

  6. Compliance and Regulation:

    In the pharmaceutical industry, the manufacturing process is subject to particularly strict controls designed to avoid product contamination. Detecting anomalies in the manufacturing process (such as longer-than-expected wait times) is crucial for complying with regulatory standards. Identifying and addressing these anomalies ensures that the company meets quality and safety regulations and avoids expensive fines and lawsuits.

Now, let’s take a closer look at the considerations associated with each type of anomaly.

1. Detecting Point Anomalies

Point anomalies occur whenever the value briefly “spikes” and exceeds some predetermined static or dynamic threshold. A classic example might be a computer’s CPU Busy Percent metric that spikes because a rogue process is taking too much processing capacity.

How do we determine what constitutes a spike? Broadly speaking, Point Anomaly detection falls into two categories: static thresholds and dynamic thresholds.

Static thresholds are exactly what they sound like: the system or the user sets a fixed threshold (like 90%), and exceeding it signals an anomaly. In the picture below, note that the threshold is set too low (at only 10%), so the blue line constantly strays into the red “anomaly territory.” Such a detection monitor would be very noisy and would be in an “anomaly state” most of the time. (That would be weird, because the anomaly would become the normality… but that seems to be what the world is coming to.)

As you can see from this example from DataDog, not much AI machinery is involved, and the algorithm is pretty straightforward:

If value is > threshold, then value is an anomaly.
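
In code, that rule really is a one-liner. Here is a minimal Python sketch; the 90% default threshold, the function name, and the sample CPU readings are purely illustrative:

```python
# A minimal static-threshold check. The 90% threshold and the sample
# readings are made up for illustration.
def is_point_anomaly(value: float, threshold: float = 90.0) -> bool:
    """Flag a single reading as anomalous if it exceeds a static threshold."""
    return value > threshold

cpu_busy_readings = [12.0, 35.5, 91.2, 47.0]  # hypothetical CPU Busy % samples
anomalies = [v for v in cpu_busy_readings if is_point_anomaly(v)]
print(anomalies)  # -> [91.2]
```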

Dynamic Thresholds are significantly more interesting. The classic dynamic threshold method for detecting anomalies is Bollinger Bands. The idea is simple: first, we calculate a Simple Moving Average (SMA) over the last N days; then, we compute the standard deviation of the values over the same window. The SMA plus 2 standard deviations forms the upper band, while the SMA minus 2 standard deviations forms the lower band.

Source: Commodity.com
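
To spell out the mechanics, here is a minimal Python sketch of the Bollinger Bands calculation described above, using pandas. The 20-sample window and 2-standard-deviation width are common defaults rather than requirements, and the helper names are invented for this example:

```python
import pandas as pd

def bollinger_bands(values: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """Compute the SMA plus upper/lower Bollinger Bands over a rolling window."""
    sma = values.rolling(window).mean()
    std = values.rolling(window).std()
    return pd.DataFrame({
        "value": values,
        "sma": sma,
        "upper": sma + num_std * std,  # dynamic upper threshold
        "lower": sma - num_std * std,  # dynamic lower threshold
    })

def flag_band_breaches(bands: pd.DataFrame) -> pd.Series:
    """A reading is a (dynamic-threshold) anomaly when it strays outside the bands."""
    return (bands["value"] > bands["upper"]) | (bands["value"] < bands["lower"])
```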

Bollinger Bands are very useful for determining price anomalies, and day traders swear that whenever a stock price “pushes through” the band, it is because something significant is happening to the price of the stock, and so one should either buy or sell the stock. (Personally, I rather think that this kind of prediction is tantamount to a forecasting process using chicken entrails. If you want a refresher on forecasting, check out our article on the topic here.)

Regardless of whether you agree with the magical stock price predictive value of Bollinger Bands, the fact remains that they do help decrease the number of false positives, especially for measurements that can rise or fall dramatically and then stay at a new level for a time, or for measurements without preset bounds.

Here is a Dynamic Threshold example that is again from DataDog, so you can more easily compare it with the static threshold above:

You might wonder: are Bollinger Bands all that sophisticated? Where exactly does AI enter the picture here? And you would be right: just as we covered in the forecasting article, simple statistical methods are often very effective, and AI is sometimes overkill for detecting point anomalies. However, as you can see, the interface for fine-tuning dynamic threshold settings can get pretty hairy, and having users adjust this sort of thing manually for hundreds or thousands of metrics is just not feasible:

That is why you may want to design the system so that the AI algorithm selects the best automatic presets based on some learning signals (like the number of false positives and false negatives). AI/ML methods can help choose the right number of standard deviations (bands) for a particular measurement (“2” above) and adjust the SMA interval (“1 week” above) in a way that minimizes the number of false negatives and false positives. You may also wish to have the AI select from a number of alternative detection algorithms (labeled “agile” above; the algorithm itself is proprietary, but you can read what DataDog says about the various selections here).
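
To make the idea of having AI pick the presets a bit more concrete, here is a rough Python sketch that simply grid-searches the SMA window and band width to minimize false positives plus false negatives on a stretch of history with human-labeled anomalies. It reuses the hypothetical bollinger_bands and flag_band_breaches helpers from the earlier sketch; the candidate values and the unweighted score are assumptions, and real products (DataDog included) use their own proprietary algorithms:

```python
import pandas as pd

def tune_band_settings(values: pd.Series,
                       labeled_anomalies: pd.Series,  # boolean Series: True where a human confirmed an anomaly
                       windows=(7, 14, 30),
                       band_widths=(1.5, 2.0, 2.5, 3.0)):
    """Pick the (window, num_std) pair that minimizes false positives + false negatives."""
    best, best_score = None, float("inf")
    for window in windows:
        for num_std in band_widths:
            bands = bollinger_bands(values, window, num_std)
            flagged = flag_band_breaches(bands)
            false_pos = int((flagged & ~labeled_anomalies).sum())
            false_neg = int((~flagged & labeled_anomalies).sum())
            score = false_pos + false_neg  # could be weighted per use case
            if score < best_score:
                best, best_score = (window, num_std), score
    return best, best_score
```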

WARNING: Dynamic thresholds are not suitable for every measurement.

For example, something like CPU Busy Percent or the number of transaction errors may not be a great candidate for dynamic threshold detection: any unexpected spike in those metrics is a critical condition that will degrade performance, and we don’t want the system to “learn” to ignore larger and larger numbers of errors over time. Other measurements not suited to dynamic thresholds include compliance, quality control, and SLA (Service Level Agreement) metrics: for example, required system uptime, the maximum PPM of lead in drinking water, or the maximum number of minutes raw chicken can sit at room temperature in an industrial kitchen that makes baby food. (Gross. I know. But someone’s got to track it. So it might as well be designers helping build a usable interface that hard-codes the FDA anomaly guidelines, which do not dynamically change over time.)

On the other hand, something like network traffic volume, number of transactions, order volume, session length, or revenue would be a perfect candidate for dynamic threshold monitoring because, just like stock prices, those metrics tend to go up or down and stay in a particular range for some time due to external factors. For example, network traffic volume might be very low during the weekend, and a system that adjusts to a lower threshold on Saturday and Sunday would be just what is needed to detect a security breach. Come Monday morning, however, the threshold needs to go back up automatically; otherwise, network traffic from legitimate work activity would set off the alarms. The choice between a static and a dynamic threshold is not always intuitive, and some UX research and in-depth conversations are always a great idea.

This is why you, as a UX person, will need to interview Engineers, Data Scientists, Customers, and SMEs to figure out which type of dynamic threshold applies to your particular measurement.

Lastly, it is important to mention that a point anomaly is often not significant in and of itself. Rather, it is the count of spikes within a certain time period (like a minute or an hour) that is much more significant. An occasional CPU Busy Percent spike is not that impactful to performance, but multiple spikes within the same minute will rapidly degrade the responsiveness and performance of the machine. Thus, most monitoring systems differentiate clearly between an Anomaly and an Alert. Again, in-depth research via quality conversations with users, SMEs, and Data Scientists is always an excellent idea.

For your homework, figure out which metrics in your system need to be monitored for anomalies and why. In which use cases would a static threshold apply? Is there a use case for a dynamic threshold, and if so, what is it? Can some variation of the Bollinger Bands algorithm be used? If so, what time period should the SMA (Simple Moving Average) be based on, and how many standard deviations (bands) need to be utilized? Who should you be talking to in order to find out the details?

2. Change Point Anomaly Detection

Change Point Anomalies are similar to Point Anomalies in most respects. Both can be considered “amplitude” anomalies, i.e., they occur whenever the observed value unexpectedly breaches a certain static or dynamic threshold. The one big difference is that Point Anomalies are “spikes” that quickly come back down to the baseline, whereas a Change Point Anomaly is an unexpected change that remains sustained over time. Therefore, your approach to the UI for tuning the system will be slightly different for each type of anomaly.

When you construct the interface for tuning Point Anomaly detection to decide whether to generate an alert, you want to provide a way for users to indicate how many times the point anomaly must occur within a certain period of time; for example, if there are 3 or more anomalous CPU Busy Percent spikes in 1 minute, the system triggers an alert:

In contrast, when you construct the interface for tuning Change Point Anomaly detection, you want the interface to indicate how long the reading must sustain an abnormal value; for example, if the value stays above a certain threshold for over 1 minute, the system triggers the alert:

Differentiating between Change Point and Point Anomalies helps you fine-tune the system for the specific use case, but in many cases you will not need to make this distinction. It is common to see a UI that states something like:

“Trigger a Critical ▾ alert whenever the value exceeds 90% for a period of 1 minute ▾ OR 3 times in 1 minute ▾”

Thus, you are essentially covering both types of anomalies with a single statement. Dynamic and static thresholds should work along similar principles for both of these anomaly types.
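
Under the hood, that single Mad Lib statement boils down to two checks over a sliding window: a breach count (Point) and a breach duration (Change Point). Here is a hedged Python sketch; the 90% threshold and 1-minute window echo the example above, while the class and parameter names are invented for illustration:

```python
from collections import deque
import time

class AmplitudeAlertRule:
    """Alert when a value exceeds `threshold` either `max_spikes` times within
    `window_seconds` (Point Anomaly) or continuously for `window_seconds`
    (Change Point Anomaly)."""

    def __init__(self, threshold: float = 90.0, window_seconds: int = 60, max_spikes: int = 3):
        self.threshold = threshold
        self.window = window_seconds
        self.max_spikes = max_spikes
        self.breaches = deque()     # timestamps of individual threshold breaches
        self.breach_started = None  # start time of the current sustained breach

    def observe(self, value, now=None):
        now = time.time() if now is None else now
        if value > self.threshold:
            self.breaches.append(now)
            if self.breach_started is None:
                self.breach_started = now
        else:
            self.breach_started = None
        # Keep only breaches inside the sliding window.
        while self.breaches and now - self.breaches[0] > self.window:
            self.breaches.popleft()
        point_alert = len(self.breaches) >= self.max_spikes
        change_point_alert = (self.breach_started is not None
                              and now - self.breach_started >= self.window)
        return point_alert or change_point_alert
```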

It is really helpful to the user if the system fine-tunes these values automatically using AI/ML methods and only “asks” the user to respond to alerts by designating them as “true positive” or “false positive.” That way, the system learns over time and fine-tunes the alert-generation values for the best performance. In all cases, providing a manual way to override the AI values is essential (more on this critical UX feature in future columns).
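
One extremely simplified way to picture that feedback loop: every time the user labels an alert, nudge the alert sensitivity in the corresponding direction. The single parameter, step size, and bounds below are invented for illustration; a real system would learn much more than one number:

```python
def adjust_band_width(num_std: float, label: str, step: float = 0.1,
                      lo: float = 1.0, hi: float = 4.0) -> float:
    """Widen the bands after a false positive (alerting was too sensitive);
    narrow them after a false negative (alerting was not sensitive enough)."""
    if label == "false_positive":
        num_std += step
    elif label == "false_negative":
        num_std -= step
    return max(lo, min(hi, num_std))
```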

How are anomalies shown in the line graph?

As shown above, anomalies are usually indicated by a red circle drawn around the first occurrence of the anomaly. Red dots marking anomalous data point readings are also quite common:

Sometimes the part of the line graph that falls outside of the expected range is itself colored bold red:
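
If you are prototyping this kind of view, a minimal matplotlib sketch might look like the following; the helper name, colors, and data handling are illustrative, not a standard:

```python
import matplotlib.pyplot as plt

def plot_with_anomalies(timestamps, values, is_anomaly):
    """Draw the metric as a line and overlay anomalous readings as red dots."""
    fig, ax = plt.subplots()
    ax.plot(timestamps, values, color="steelblue", label="metric")
    anomalous = [(t, v) for t, v, a in zip(timestamps, values, is_anomaly) if a]
    if anomalous:
        xs, ys = zip(*anomalous)
        ax.scatter(xs, ys, color="red", zorder=3, label="anomaly")
    ax.legend()
    return fig
```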

Design Principles for Fine-Tuning the Point and Change Point Anomaly Detection UI

The concept of Point and Change Point “amplitude” Anomalies is pretty straightforward: “An anomaly occurs when the value of a variable you are interested in tracking exceeds a certain threshold.” However, the UI for fine-tuning alerting based on anomaly detection can get somewhat involved. Here are five key “design principles” that are helpful to keep in mind (let’s call them “design principles,” as these are not yet established enough to call them “design patterns”).

The DataDog screens earlier in the article provide an excellent example of these principles. Here are those screens again for easy reference:

Here are the UX principles for designing the Point and Change Point Anomaly Fine-Tuning UIs:

  1. A single occurrence of an anomaly is not necessarily cause for alarm. Depending on the use case, if we alert users of the system about every instance of an anomaly, there will be too many false positives. So your UI often needs some kind of occurrence counter or timer: either a count of occurrences (for Point Anomalies) or the duration of the change (for Change Point Anomalies).

  2. Make it a priority to provide a line graph (for the past 15 minutes/1 hour/24 hours, etc.) as a model to allow users to preview in real time the effects their changes will have on a decent-sized sample of historical data. This helps users avoid inadvertently creating too many false positives or missing signals of interest (and creating false negatives).

  3. For complex dynamic algorithm tuning, try using a Mad Lib “fill-in-the-blank” UI design pattern so that the entire tuning UI reads like an English sentence, with drop-downs and text fields providing the inputs embedded in the text. This will help even users unfamiliar with the subject to understand what the tuning parameters do in the context of other settings.

  4. Use AI/ML methods to create good defaults for the fine-tuning UI such that users do not need to change it often. Use learning algorithms so that, as the system is being used, it will self-tune appropriately and save users the work, especially in cases where many hundreds or thousands of variables need to be monitored on an ongoing basis.

  5. Even when the AI/ML is doing the fine-tuning, never disable manual tuning methods.

Happy Anomaly Detecting,

Greg & Daria
