Monitoring for Large-Scale Networks

UNX-OBP

Error Handling and Monitoring

General Error Handling

Some error handling is done by the Simple Queue Service (SQS) queue configuration, namely the use of a Dead Letter Queue (DLQ) and a redrive policy. If the optimized integration between Lambda and SQS (in our case between unx-obp-process-incoming-sqs-lambda-function and unx-obp-incoming-queue) fails to process messages for whatever reason, it will retry processing those messages up to a configurable Max Receive Count (default: 5). After that, failed messages are placed on the DLQ. See https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling for more information.
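As a rough illustration of the queue-side configuration, the sketch below builds the RedrivePolicy attribute that ties an incoming queue to its DLQ. The ARN and the helper name build_redrive_policy are hypothetical and not taken from the UNX-OBP deployment; the only part that reflects SQS itself is the attribute shape (a JSON-encoded string with deadLetterTargetArn and maxReceiveCount).

```python
import json


def build_redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Return SQS queue attributes that attach a DLQ through a redrive policy."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    # SQS expects the RedrivePolicy attribute as a JSON-encoded string.
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        )
    }


# Hypothetical DLQ ARN, for illustration only.
attributes = build_redrive_policy(
    "arn:aws:sqs:us-east-1:123456789012:example-dlq", max_receive_count=5
)
print(attributes["RedrivePolicy"])

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```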

Some error handling is done in the various Lambda functions, generally in order to shape the logic and continue execution, provide informational messages, and/or to raise significant errors. Most of the Lambda functions are invoked in the context of the state machine, and failures in those Lambda functions will essentially get handled by the state machine’s error handling. Either way, Lambda function issues should reveal themselves in each function’s CloudWatch Log Group.

Most of the overall error handling is done by the defined state machine through the use of Retry and Catch clauses configured for each Task state. You can configure multiple Retry clauses to handle different errors in different ways by defining the errors you want to handle, how many times to retry the Task, and how long to wait between retries. See https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html for more information.

Most of the Task states invoke a Lambda function. As a best practice, the state machine retries some common Lambda service exceptions with one set of Retry parameters and all other exceptions with slightly different Retry parameters. Finally, if a Task continues to fail, the state machine will place a message on the DLQ and end the execution gracefully. DLQ messages sent by the state machine include the relevant fields of the original message as originally sent to the incoming queue, plus some information about the error in a field called error_info.
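The Retry and Catch arrangement described above might look roughly like the following Task state, built here as a Python dictionary and printed as Amazon States Language JSON. This is a sketch, not the actual UNX-OBP definition: the function ARN, the state names, the retry numbers, and the choice to write the caught error to $.error_info via ResultPath are all assumptions made for illustration.

```python
import json

# Lambda service exceptions that are commonly retried in Task states.
LAMBDA_SERVICE_ERRORS = [
    "Lambda.ServiceException",
    "Lambda.AWSLambdaException",
    "Lambda.SdkClientException",
    "Lambda.TooManyRequestsException",
]


def build_task_state(function_arn: str, next_state: str, failure_state: str) -> dict:
    """Build a Task state with two Retry clauses and a catch-all Catch clause."""
    return {
        "Type": "Task",
        "Resource": function_arn,
        "Retry": [
            # Transient Lambda service problems: retry more patiently.
            {
                "ErrorEquals": LAMBDA_SERVICE_ERRORS,
                "IntervalSeconds": 2,
                "MaxAttempts": 6,
                "BackoffRate": 2.0,
            },
            # Everything else: fewer attempts. States.ALL must appear alone
            # and must be the last Retry clause.
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 1,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            },
        ],
        "Catch": [
            # After retries are exhausted, keep the original input and add the
            # error details under error_info (assumed layout), then move to
            # the state that writes to the DLQ.
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error_info",
                "Next": failure_state,
            }
        ],
        "Next": next_state,
    }


# Hypothetical ARN and state names.
state = build_task_state(
    "arn:aws:lambda:us-east-1:123456789012:function:example-function",
    next_state="NextStep",
    failure_state="SendToDLQ",
)
print(json.dumps(state, indent=2))
```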

Monitoring

The most straightforward way to get a sense of how the analytic is getting on is via the AWS Management Console. Of course, you can always set up more automated monitoring that sends notifications when certain things occur, but this section is more about where and what to look for.

Viewing the SQS main console page and refreshing the number of messages in each queue is an easy way to see if things are moving along or if something has gone wrong. Any messages appearing in the DLQ indicate that those messages could not be processed at all (i.e., did not even make it to the state machine execution) or were not processed successfully by the state machine (i.e., something failed in the state machine execution). DLQ messages will remain there until they are either purged or “redriven” back to the incoming queue. Redriving is easy to do in the console, but it is deliberately a manual step because you want to first identify what went wrong, fix it, and only then redrive the messages. It is recommended to set up alarms and/or automated notifications for when there are any messages available in the DLQ.
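One possible way to get the recommended DLQ notification is a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric. The sketch below only builds the alarm parameters; the queue name, SNS topic ARN, alarm name, and evaluation settings are placeholders chosen for illustration.

```python
def build_dlq_alarm(dlq_name: str, sns_topic_arn: str) -> dict:
    """Build CloudWatch alarm parameters that fire when the DLQ is non-empty."""
    return {
        "AlarmName": f"{dlq_name}-messages-visible",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any message at all
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical queue name and topic ARN.
alarm = build_dlq_alarm(
    "example-dlq", "arn:aws:sns:us-east-1:123456789012:example-alerts"
)
print(alarm["AlarmName"])

# With boto3 installed and credentials configured, this could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```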

DLQ messages sent by the state machine are easy to identify - each will have an error_info field with information about the error. From this field you can usually tell what happened and in which Task (i.e., typically which Lambda function) it happened. Typically, there will be only one or two things causing an error, with many or all of the DLQ messages showing the same error, so you generally only need to look at a few messages to determine what’s wrong and where to fix it. An easy way to view DLQ messages and the error_info is to poll for and look at the content of a few available messages in the DLQ console page. An alternative place to view state machine errors is the CloudWatch Log Group for the state machine. These logs tend to provide slightly less information, but you may still be able to determine what’s wrong and where to fix it.
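If you have copied a handful of DLQ message bodies out of the console, a small script can confirm that they share the same cause. The sketch below assumes the bodies are JSON objects and makes no assumption about what error_info contains; the sample bodies and their field names are invented for illustration.

```python
import json
from collections import Counter


def summarize_dlq_messages(bodies) -> Counter:
    """Count DLQ message bodies by their error_info content."""
    counts = Counter()
    for body in bodies:
        try:
            message = json.loads(body)
        except json.JSONDecodeError:
            counts["<unparseable body>"] += 1
            continue
        error_info = message.get("error_info") if isinstance(message, dict) else None
        if error_info is None:
            # No error_info: the message did not come from the state machine.
            counts["<no error_info>"] += 1
        else:
            # Serialize with sorted keys so identical errors group together.
            counts[json.dumps(error_info, sort_keys=True)] += 1
    return counts


# Invented sample bodies; real messages will have different fields.
samples = [
    '{"id": 1, "error_info": {"Error": "KeyError", "Cause": "missing field"}}',
    '{"id": 2, "error_info": {"Cause": "missing field", "Error": "KeyError"}}',
    '{"id": 3}',
]
for error, count in summarize_dlq_messages(samples).most_common():
    print(count, error)
```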

DLQ messages sent by the Lambda-SQS optimized integration will not have an error_info field - each will just be the original message. This usually occurs due to Lambda service exceptions, such as Lambda.TooManyRequestsException, which can occur if Lambda cannot keep up with the rate of messages on the incoming queue and overall Lambda use is brushing up against the concurrency limits of the AWS account and region being used. See https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html for more information. For this reason, it is recommended to keep an eye on Throttles and ConcurrentExecutions metrics across all functions, which can be found in the AWS/Lambda namespace of the CloudWatch Metrics page. See https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html for more information.
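To watch these two metrics outside the console, one option is the CloudWatch GetMetricData API. The sketch below builds the query list for the account-wide (dimensionless) Throttles and ConcurrentExecutions metrics in the AWS/Lambda namespace; the query ids, period, and choice of statistics are illustrative assumptions.

```python
def build_lambda_metric_queries(period_seconds: int = 60) -> list:
    """Build GetMetricData queries for account-wide Lambda throttling metrics."""

    def query(query_id: str, metric_name: str, stat: str) -> dict:
        return {
            "Id": query_id,  # must start with a lowercase letter
            "MetricStat": {
                # No Dimensions key: the metric is aggregated across all functions.
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": metric_name},
                "Period": period_seconds,
                "Stat": stat,
            },
        }

    return [
        query("throttles", "Throttles", "Sum"),
        query("concurrency", "ConcurrentExecutions", "Maximum"),
    ]


queries = build_lambda_metric_queries()
print([q["Id"] for q in queries])

# With boto3 installed and credentials configured, this could be run with:
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=queries, StartTime=start, EndTime=end)
```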

In any case, it is highly recommended to request an increase to your Lambda concurrency limit (for example, to 2000-5000 to start), especially as you onboard additional protocols to this capability and/or if Lambda is used for other things in your environment.

Additionally, you can control the rate of messages on the incoming queue by adjusting the properties of the ControlRate processor(s) in NiFi.

Extra Tips

Debugging State Machine Execution

Express type state machines only allow you to log and debug via CloudWatch Logs. A nicer way to see the state machine execution is to create a Standard type state machine with the same definition and IAM role. Then use the Test Execution feature in the state machine’s console to provide the message input that you’re interested in debugging. It will visually show you where the execution failed and provide the full execution data, including inputs, outputs, and any errors, all within the same interface.

Redriving Messages from the DLQ

Once you think you have fixed the issue(s), select the DLQ from the main SQS console page. Click the ‘Start DLQ redrive’ button at the top right of the page. The only change you need to make is to choose ‘Redrive to a custom destination’ and then select the UNX-OBP Incoming Queue from the dropdown menu. Finally, click ‘DLQ redrive’ at the bottom right of the page.
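The same redrive can also be started programmatically through the SQS StartMessageMoveTask API, which may be convenient if you redrive often. The following is a minimal sketch: the helper name and the ARNs are hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def start_dlq_redrive(sqs_client, dlq_arn: str, incoming_queue_arn: str,
                      max_per_second=None) -> str:
    """Start moving messages from the DLQ to the incoming queue.

    sqs_client is expected to behave like boto3.client("sqs").
    Returns the task handle reported by SQS.
    """
    kwargs = {"SourceArn": dlq_arn, "DestinationArn": incoming_queue_arn}
    if max_per_second is not None:
        # Optional throttle so the redrive does not flood the incoming queue.
        kwargs["MaxNumberOfMessagesPerSecond"] = max_per_second
    response = sqs_client.start_message_move_task(**kwargs)
    return response["TaskHandle"]


# Example call (requires boto3 and credentials; ARNs are placeholders):
#   import boto3
#   handle = start_dlq_redrive(
#       boto3.client("sqs"),
#       "arn:aws:sqs:us-east-1:123456789012:example-dlq",
#       "arn:aws:sqs:us-east-1:123456789012:example-incoming-queue",
#       max_per_second=50,
#   )
```

Limiting the move rate is optional, but it can soften the burst of Lambda invocations that a large redrive would otherwise trigger.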

Pausing Processing

If DLQ messages continue to pile up as you are troubleshooting, you may want to temporarily pause the processing. You can do this in one of two ways:

If NiFi queues can handle the backpressure for a little while, pause one of the processors before it writes to the incoming queue. Make sure to run it again when you’re ready to resume processing. This is the preferable way to temporarily pause processing.
Otherwise, you can stop the automatic processing of incoming messages by Lambda by disabling the event source mapping associated with the unx-obp-process-incoming-sqs-lambda-function. Simply disable/deactivate it in that function’s Configuration > Triggers tab, then enable it again when you’re ready to resume processing. Note that re-enabling the event source mapping may cause an initial spike in concurrent executions as Lambda ramps up to handle the messages that accumulated on the incoming queue.
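The second option can also be scripted. The sketch below toggles every SQS event source mapping on a function through the Lambda ListEventSourceMappings and UpdateEventSourceMapping APIs. The helper name is hypothetical, the client is passed in so the logic can be tested without AWS access, and pagination of the mapping list is ignored for brevity.

```python
def set_sqs_trigger_enabled(lambda_client, function_name: str, enabled: bool) -> list:
    """Enable or disable the SQS event source mappings of a Lambda function.

    lambda_client is expected to behave like boto3.client("lambda").
    Returns the UUIDs of the mappings that were updated.
    """
    response = lambda_client.list_event_source_mappings(FunctionName=function_name)
    updated = []
    for mapping in response["EventSourceMappings"]:
        # Leave non-SQS triggers untouched.
        if ":sqs:" not in mapping.get("EventSourceArn", ""):
            continue
        lambda_client.update_event_source_mapping(UUID=mapping["UUID"], Enabled=enabled)
        updated.append(mapping["UUID"])
    return updated


# Example calls (require boto3 and credentials):
#   import boto3
#   client = boto3.client("lambda")
#   set_sqs_trigger_enabled(client, "unx-obp-process-incoming-sqs-lambda-function", False)  # pause
#   set_sqs_trigger_enabled(client, "unx-obp-process-incoming-sqs-lambda-function", True)   # resume
```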

Enrichment Errors

Besides fatal errors in the enrichments function, which will show up in function or state machine logs and error handling mechanisms, API lookups may encounter various errors and/or quota limits but still receive valid responses. Typically these manifest as non-200 HTTP response status codes, and the enrichments function handles these without failing or disrupting overall processing. Importantly, the capability will continue to function in these cases, just without the analytical benefit of having additional context for alerts.

You will likely not know this has occurred unless analysts bring it to your attention or you periodically check for signs this is occurring. You can discover this in the alert records, specifically if the following field is present:

enrichments.censys.services.error_code

Also, if the above field is present, there will be a tag appended to the tags field indicating the same. Lastly, you should see messages in the function’s CloudWatch Log Group starting with “[!] Saw unexpected status code:”.
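A periodic check over a sample of alert records could look like the sketch below. It assumes alert records are available as Python dictionaries and does not know the real record layout: it accepts either a flat key containing dots or nested objects, and it descends into lists in case services holds several entries. The sample record is invented.

```python
ERROR_FIELD = "enrichments.censys.services.error_code"


def find_values(record: dict, dotted_path: str) -> list:
    """Collect every value stored at dotted_path, descending into lists."""
    if dotted_path in record:  # flat key such as "a.b.c"
        return [record[dotted_path]]
    nodes = [record]
    for key in dotted_path.split("."):
        next_nodes = []
        for node in nodes:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    return nodes


def has_enrichment_error(record: dict) -> bool:
    """Return True if the alert record carries an enrichment error code."""
    return len(find_values(record, ERROR_FIELD)) > 0


# Invented sample record; real alert records will have more fields.
sample = {"enrichments": {"censys": {"services": [{"port": 443}, {"error_code": 429}]}}}
print(has_enrichment_error(sample), find_values(sample, ERROR_FIELD))
```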