Let us explore this case in detail:
What is the architecture?
A simple multi-tiered architecture: a load balancer deployed at the front, a Web/App server that hosts the health check script/program, and a MySQL database deployed in master + slave mode.
What is a deeper health check?
The script/program deployed on the Web/App server is slightly intelligent: when the load balancer calls it, it performs a few simple operations and also checks the status of the database. So when the load balancer gets a response back from the health check script/program, it is verifying that both the database and the Web/App server are healthy, all from the load balancer tier.
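A minimal Python sketch of such a deeper health check follows. The function name `deep_health_check` and the `check_database` callable are hypothetical; the callable stands in for whatever the real script does against MySQL (for example, running a trivial query). The returned pair is the HTTP status and body that the load balancer would see.

```python
from typing import Callable, Tuple


def deep_health_check(check_database: Callable[[], bool]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a load balancer probe.

    check_database is a hypothetical callable that returns True when the
    database answers; it may also raise on a timeout or connection error.
    """
    try:
        db_ok = check_database()
    except Exception:
        # A timeout or connection error counts as a failed dependency.
        db_ok = False

    if db_ok:
        return 200, "OK"
    # The Web/App server itself is fine here, but the probe still fails.
    return 503, "database unreachable"
```

Note that the 503 branch is reached even though the Web/App server answered the probe, which is exactly the coupling discussed in the rest of this post.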
What is the problem scenario?
Imagine that, when migrating this infrastructure to AWS, you adopted the standard architecture pattern consisting of:
- Amazon ELB as the load balancer
- Web/App servers in Auto Scaling mode
- MySQL moved to Amazon RDS Multi-AZ with read replicas (RR)
Now let us explore this problem in detail:
- Imagine either of the following conditions in production: the network between the database and the Web/App servers is down intermittently for a few minutes, or RDS MySQL is promoting the hot standby to be the new master. In such scenarios the health check actually times out at the database level, yet the load balancer marks even healthy App servers as unhealthy because of the deeper health check. This is especially harmful in Auto Scaled setups: Amazon ELB marks the Web/App EC2 instances as unhealthy because of the deeper health check, and Amazon Auto Scaling keeps replacing them automatically to maintain the minimum healthy fleet. This unwanted effect can cascade and reduce overall availability, which is clearly not good for production in AWS. In short, deeper health checks are not recommended for complex N-tier systems that follow Auto Scaling/self-healing and service-oriented architecture patterns in AWS.
- The usual purpose of a health check is to check the status of the next tier or service consumed by a particular tier. Deeper health checks are heavyweight and usually take much longer to respond because most of your tiers are exercised in the process. If the check frequency is set too aggressively, the health checks themselves will consume a lot of CPU, so the check interval and the response timeout have to be set considerably large. Also, under heavy traffic such heavyweight calls can be queued, and the deeper health check may not respond quickly.
- Deeper health checks are usually suitable for simple, fixed infrastructures. When your infrastructure is not elastic, decisions are taken manually by the ops team after analyzing the particular failing tier. For elastic, Auto Scaled workloads in AWS, it is better to keep the load balancer health checks separate from the deeper health checks, which can still be used for assessing the availability of the infrastructure.
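One way to picture this separation is the Python sketch below. It is only an illustration under stated assumptions: `shallow_health_check`, `deep_health_report`, and the dependency names are hypothetical and not part of any AWS API. The shallow check is what the load balancer would probe; the deep report is meant for monitoring and the ops team, so a database outage does not cause healthy Web/App instances to be replaced.

```python
from typing import Callable, Dict, Tuple


def shallow_health_check() -> Tuple[int, str]:
    """Probe target for the load balancer (hypothetical).

    It only proves that this Web/App process can answer a request;
    it deliberately does not touch the database or other tiers.
    """
    return 200, "OK"


def deep_health_report(
    dependency_checks: Dict[str, Callable[[], bool]]
) -> Dict[str, str]:
    """Availability report for monitoring, not for the load balancer.

    dependency_checks maps a dependency name to a hypothetical callable
    that returns True when the dependency answers, or raises on failure.
    """
    report: Dict[str, str] = {}
    for name, check in dependency_checks.items():
        try:
            report[name] = "up" if check() else "down"
        except Exception:
            # Timeouts and connection errors are reported, not propagated.
            report[name] = "down"
    return report


# Example: the database is failing over, the Web/App server is fine.
status, body = shallow_health_check()
report = deep_health_report({"database": lambda: False})
```

In this example the load balancer still sees a 200 from the shallow check, while the deep report records the database as down so that monitoring or the ops team can act on the failing tier.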