Updating Amazon VPC Subnets - and some Pitfalls
Today I updated the subnet layout in the VPC of a service running on AWS. When the service first launched, it ran on only one subnet per availability zone. I wanted to change this to a cleaner architecture with one public and one private subnet per availability zone, where each private subnet reaches the public internet through a NAT gateway.
Setting up the subnets is actually quite simple if you have some unused IP address space left in your VPC. First of all, I created all the required subnets, started the NAT gateways and set up route tables for all private subnets so that internet-bound traffic is routed through a NAT gateway.
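To check up front whether the remaining address space is large enough, the layout can be planned offline. The following is a minimal sketch using only Python's standard `ipaddress` module. The function name `plan_subnets`, the /24 subnet size and the example CIDRs and zone names are my assumptions for illustration, not the values of the actual setup.

```python
import ipaddress


def plan_subnets(vpc_cidr, used_cidrs, zones, new_prefix=24):
    """Assign one public and one private CIDR block per availability zone.

    Blocks that overlap an already existing subnet are skipped, so only
    unused address space of the VPC is handed out.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    used = [ipaddress.ip_network(cidr) for cidr in used_cidrs]

    # Lazily walk all blocks of the requested size that are still free.
    free = (
        block
        for block in vpc.subnets(new_prefix=new_prefix)
        if not any(block.overlaps(existing) for existing in used)
    )

    plan = {}
    for zone in zones:
        try:
            plan[zone] = {"public": str(next(free)), "private": str(next(free))}
        except StopIteration:
            raise ValueError("not enough free address space in the VPC") from None
    return plan


if __name__ == "__main__":
    # Hypothetical example: a /16 VPC with one old /24 subnet per zone.
    layout = plan_subnets(
        "10.0.0.0/16",
        ["10.0.0.0/24", "10.0.1.0/24"],
        ["eu-west-1a", "eu-west-1b"],
    )
    for zone, blocks in layout.items():
        print(zone, blocks)
```

The sketch only computes CIDR blocks. Creating the subnets, NAT gateways and route tables still has to happen in AWS itself.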
Next, all services using the old subnets needed to be migrated to the new subnets.
For auto scaling groups this was quite simple - but mind one pitfall. Each auto scaling group has a list of subnets in which it may launch instances. The pitfall is: as soon as you update this list, the auto scaling group starts to replace the servers, and it does not check whether this leaves zero instances in the group. In my case, AWS Auto Scaling removed an instance before starting its replacement, instead of starting a new one, waiting until it was fully ready and only then deleting the old one (I would really like to know the reason for this design decision, because it has cost me some downtime). So before you update the subnets of an auto scaling group, make sure that it has at least two fully healthy instances.
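The "at least two healthy instances" rule can be turned into a small pre-check. This is only a sketch: it assumes the group is given as a dictionary shaped like one entry of the `AutoScalingGroups` list returned by boto3's `describe_auto_scaling_groups`, and the function names and the threshold parameter are my own.

```python
def ready_instance_count(group):
    """Count instances that are both in service and reported healthy."""
    return sum(
        1
        for instance in group.get("Instances", [])
        if instance.get("LifecycleState") == "InService"
        and instance.get("HealthStatus") == "Healthy"
    )


def safe_to_change_subnets(group, minimum_ready=2):
    """Return True if the group can lose one instance and keep serving."""
    return ready_instance_count(group) >= minimum_ready


if __name__ == "__main__":
    # Hypothetical group with one ready instance and one still starting.
    group = {
        "AutoScalingGroupName": "web",
        "Instances": [
            {"InstanceId": "i-1", "LifecycleState": "InService", "HealthStatus": "Healthy"},
            {"InstanceId": "i-2", "LifecycleState": "Pending", "HealthStatus": "Healthy"},
        ],
    }
    print(safe_to_change_subnets(group))
```

If the check fails, raising the desired capacity first and waiting for the new instance to become healthy avoids ending up with an empty group.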
Next, there were some individual EC2 instances. An EC2 instance can only be “moved” to another subnet by creating an image and launching a new instance from that image. This is easy enough, though. Create an image from the instance, start a new instance from the image into the new subnet, and re-assign the Elastic IP address to the new instance. Then remove the temporary image and stop and/or terminate the old instance.
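The steps above can be sketched as one function. It expects an EC2 client object as its first argument, for example the one returned by `boto3.client("ec2")`. The function name, the image naming scheme and all parameters are assumptions of mine, and the sketch leaves out security groups, key pairs and tags, which a real launch would need to carry over.

```python
def move_instance_to_subnet(ec2, instance_id, new_subnet_id, instance_type,
                            eip_allocation_id):
    """Recreate an instance in another subnet and move its Elastic IP."""
    # 1. Create a temporary image of the old instance.
    #    Note: by default this reboots the instance for a consistent image.
    image_id = ec2.create_image(
        InstanceId=instance_id, Name=f"move-{instance_id}"
    )["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    # 2. Launch a new instance from the image into the new subnet.
    launched = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        SubnetId=new_subnet_id,
        MinCount=1,
        MaxCount=1,
    )
    new_instance_id = launched["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])

    # 3. Re-assign the Elastic IP address to the new instance.
    ec2.associate_address(
        AllocationId=eip_allocation_id,
        InstanceId=new_instance_id,
        AllowReassociation=True,
    )

    # 4. Clean up: remove the temporary image and stop the old instance.
    #    Deregistering does not delete the underlying snapshots.
    ec2.deregister_image(ImageId=image_id)
    ec2.stop_instances(InstanceIds=[instance_id])
    return new_instance_id
```

The old instance is only stopped here, not terminated, so there is still a way back if the new instance turns out to be broken.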
Then came the real pitfalls: don’t forget that some other services also require subnet associations, because they run managed servers for you. For example, an Elastic File System (EFS) has mount targets, and each mount target lives in a specific subnet. Changing these is quite tricky, as you can only have one mount target per availability zone. If you want to move a mount target to another subnet in the same availability zone, you first have to delete it and only then can you create the new one. In between, no running instance in that availability zone can connect to the EFS. In my case I had to make sure that all auto scaling groups temporarily only used instances in some of the subnets, then update the mount targets belonging to the other subnets, switch the auto scaling groups over to those other subnets, update the remaining mount targets on the EFS and finally assign all subnets to all auto scaling groups again. Quite some work.
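Because the order of these steps matters, it helps to write the plan down before touching anything. The sketch below only produces the ordered list of steps for a set of availability zones. It does not call AWS, and the function name, the step labels and the split into two halves are my own assumptions about how to batch the zones.

```python
def plan_efs_migration(zones):
    """Return the ordered steps for moving EFS mount targets without
    cutting off instances that are still in use.

    The zones are split into two batches. While the mount targets of one
    batch are replaced, the auto scaling groups run only in the other.
    """
    if len(zones) < 2:
        raise ValueError("need at least two availability zones to avoid downtime")

    half = len(zones) // 2
    first, second = tuple(zones[:half]), tuple(zones[half:])

    return [
        ("limit_asgs_to_zones", first),
        ("replace_mount_targets_in_zones", second),
        ("limit_asgs_to_zones", second),
        ("replace_mount_targets_in_zones", first),
        ("limit_asgs_to_zones", first + second),
    ]


if __name__ == "__main__":
    # Hypothetical zone names.
    for action, affected in plan_efs_migration(["eu-west-1a", "eu-west-1b", "eu-west-1c"]):
        print(action, ", ".join(affected))
```

Each "replace" step stands for deleting the old mount target and creating a new one in the new subnet of the same zone, and each "limit" step should only be considered done once the instances in the remaining zones are healthy.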
Another service with subnet associations is the Elastic Load Balancer. Make sure that you move your public ELBs into the public subnets and the internal ELBs into the private subnets.
I assume that an RDS database also has subnet associations, but this is something I will check tomorrow.
To sum up, it’s quite some work to switch from a single subnet setup to a public/private subnet setup. But I still think that it was OK to start with a single subnet considering the uncertainty and the size at which the project initially started and the speed at which it has grown. When you change your subnets make sure that you don’t forget all those services that don’t directly expose their servers as EC2 instances to you, but still run servers and thus require a subnet.
I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.