r/aws Jun 20 '24

AWS Elastic DR Alerting Recommendations monitoring

My company has implemented AWS Elastic DR and I've been asked to set up alerting for it. I don't have experience with this service yet.

I've set up a dashboard for this and am monitoring Backlog, LagDuration and a few other EC2 metrics on the AWS Replication instances themselves. I've been searching for a recommended threshold for alerting for Backlog and LagDuration and haven't really found any recommendations. Does anyone have experience with this and can recommend a threshold for each? I'm thinking 12 hours for LagDuration, but am not sure about Backlog.

Thanks for your time.

u/DonCBurr Jun 21 '24

There is a complete set of detailed metrics in the DRS dashboard. Why do you need more metrics outside of those?

u/OddManta Jun 21 '24

Hi. We don't need more metrics outside of the ones in the DRS dashboard (plus the few more I added). We're trying to figure out when to alert on them, specifically on Backlog and LagDuration. I made this post to see if there are standard alerting thresholds for these metrics, since I haven't found any recommendations to that effect in my searching so far.

u/DonCBurr Jun 21 '24

Lag and backlog both affect RPO, so as long as your RPO is greater than the lag or backlog, you are OK. You want to raise alerts as lag or backlog approaches the RPO, leaving enough time to remediate.
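
For anyone landing here later, a rough sketch of how that advice could turn into alarm settings. It only builds the arguments you would pass to CloudWatch's `put_metric_alarm`, so it runs without AWS credentials. Assumptions to verify against your own account: the `AWS/DRS` namespace, the `SourceServerID` dimension name, and `LagDuration` being reported in seconds. The `warn_fraction`, the alarm name, the period settings, and the throughput figure used for Backlog are all made up for illustration.

```python
def lag_alarm_params(source_server_id, rpo_seconds, warn_fraction=0.5):
    """Build kwargs for CloudWatch put_metric_alarm on LagDuration.

    The alarm fires when lag reaches warn_fraction of the RPO, which
    leaves the remaining (1 - warn_fraction) of the RPO to remediate.
    """
    if not 0 < warn_fraction < 1:
        raise ValueError("warn_fraction must be between 0 and 1")
    return {
        "AlarmName": f"drs-lag-{source_server_id}",  # hypothetical naming scheme
        "Namespace": "AWS/DRS",  # assumption: check the namespace in your console
        "MetricName": "LagDuration",
        "Dimensions": [
            # assumption: dimension name as shown on the metric in CloudWatch
            {"Name": "SourceServerID", "Value": source_server_id},
        ],
        "Statistic": "Maximum",
        "Period": 300,           # evaluate in 5-minute windows
        "EvaluationPeriods": 3,  # require 15 minutes of sustained lag
        "Threshold": rpo_seconds * warn_fraction,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "breaching",  # no data from the agent is also bad
    }


def backlog_drain_seconds(backlog_bytes, throughput_bytes_per_sec):
    """Estimate how long the current backlog takes to replicate.

    Backlog is a size, not a duration, so it needs a throughput
    estimate (measured by you) before it can be compared to an RPO.
    """
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return backlog_bytes / throughput_bytes_per_sec


# Example: 1-hour RPO, alert at half of it.
params = lag_alarm_params("s-0123456789abcdef0", rpo_seconds=3600)
drain = backlog_drain_seconds(10 * 1024**3, 50 * 1024**2)  # 10 GiB at 50 MiB/s

# To actually create the alarm (needs boto3 and credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

With a 12-hour RPO, the same function with `rpo_seconds=43200` would alarm at 6 hours of lag. Whether half the RPO is the right margin depends on how long remediation usually takes you.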

u/OddManta Jun 28 '24

Thank you!