Worker Node Not Ready – EKS Managed Node Group

To be honest, this is my first time setting up an EKS cluster. I had some experience with self-hosted Kubernetes, but as you can imagine, I quickly realised that AWS-managed EKS is a different ball game altogether.

I am deploying the whole cluster using Terraform with the AWS EKS module. Using the module is pretty easy; below is the main resource code.

module "eks_cluster" {
  depends_on = [module.s3-bucket]
  source     = "terraform-aws-modules/eks/aws"
  version    = "~> 20.0"

  cluster_name                             = var.name
  cluster_version                          = var.cluster_version
  subnet_ids                               = [var.subnet_az1, var.subnet_az2]
  vpc_id                                   = data.aws_ssm_parameter.vpc_id.value
  tags                                     = merge(local.tags,var.tags, { Service = "EKS" })
  cluster_endpoint_public_access           = true
  cluster_endpoint_private_access          = true
  enable_cluster_creator_admin_permissions = true

  cluster_addons = {
    vpc-cni                = {
      service_account_role_arn = aws_iam_role.vpncni_role.arn
    }
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    aws-efs-csi-driver     = {}
    aws-mountpoint-s3-csi-driver = {
      service_account_role_arn = aws_iam_role.s3_role.arn
    }
  }

  eks_managed_node_group_defaults = {
    instance_types = ["t3.large"]
  }
  eks_managed_node_groups = {
    proxy_apps = {
      # Starting on 1.30, AL2023 is the default AMI type for EKS managed node groups
      ami_type                   = "AL2023_x86_64_STANDARD"
      instance_types             = ["t3.large"]
      # use_custom_launch_template = true
      min_size                   = 2
      max_size                   = 4
      desired_size               = 2
      iam_role_additional_policies = {
        "s3_policy" = aws_iam_policy.s3_policy.arn
      }
      tags = merge(local.tags,var.tags, { Service = "EKS EC2 Worker Nodes" })
      vpc_security_group_ids = [aws_security_group.node_sg.id]
      # remote_access = {
      #   ec2_ssh_key = join("-", ["dsaws", local.name, "pub-key", data.aws_region.current.name])
      # }
    }
  }
}

I did face a couple of issues though, which might help you if you're setting up something similar.

Error #1

While setting up the EKS managed node group, the worker nodes never became ready. The error itself was pretty basic: the EC2 worker nodes were failing to pull container images. But then again, I am running everything on a private network behind a hub NAT gateway using Cloud WAN, so it didn't make sense, since internet access is allowed.

Solution

To troubleshoot, I first enabled remote access to my worker nodes and redeployed the config. Below is the specific part I enabled.

      remote_access = {
        ec2_ssh_key = join("-", ["mykey", local.name, "pub-key", data.aws_region.current.name])
      }

Once a worker node was redeployed, I was able to SSH into the machine and confirmed that it indeed could not download any images, even though internet access was there. Next I tried to install Docker to run some tests pulling images manually. What I realised was that any AWS-backed repository was inaccessible.

EKS managed worker nodes run Amazon Linux, so their package repository is also different: it is backed by S3. Finally, I realised that a new S3 endpoint policy had recently been rolled out in the VPC, which only permits S3 traffic originating from within our organizational unit and drops everything else. Well, that was it: relaxing that policy made S3 reachable again, and with it the default AWS repositories and registries, in this case https://602401143452.dkr.ecr.eu-west-2.amazonaws.com/amazon-k8s-cni-init:v1.19.2-eksbuild.1
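
For reference, this is roughly what a relaxed S3 gateway endpoint would look like in Terraform. This is only a minimal sketch: the resource name, the var.private_route_table_id variable and the wide-open policy are my assumptions, not the exact policy that was deployed in my VPC, and in practice you would scope the statement down rather than allow everything.

# Hypothetical sketch of an S3 gateway endpoint whose policy no longer blocks
# the Amazon Linux repos and S3-backed image layers used by EKS worker nodes.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_ssm_parameter.vpc_id.value
  service_name      = "com.amazonaws.eu-west-2.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [var.private_route_table_id] # assumed variable

  # Allow all S3 access again; a tighter policy should whitelist the
  # Amazon Linux repo and ECR layer buckets instead of using "*".
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = "*"
        Action    = "s3:*"
        Resource  = "*"
      }
    ]
  })
}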

Error #2

The second error I got was around my Network Load Balancer, created as the front end of the application. I had a strict requirement that my end devices must be able to reach the pods via a static IP or FQDN, which is not possible if I rely only on pod IPs. I created the Network Load Balancer with the config below.

apiVersion: v1
kind: Service
metadata:
  name: ds-cfps-nlb
  namespace: chronicle
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
    service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-0377e86c456a7a71c,subnet-0819a2e6f328da824"
spec:
  # allocateLoadBalancerNodePorts: false
  ports:
    - name: syslog
      port: 15514
      targetPort: 15514
      protocol: TCP
    - name: winlog
      port: 14514
      targetPort: 14514
      protocol: TCP
  type: LoadBalancer
  selector:
    app: ds-cfps

Yet somehow, my load balancer was always registering targets via node port and not pod IP. Below is the background on what I was trying to achieve, quoted from the AWS Load Balancer Controller documentation.

IP mode
NLB IP mode is determined by the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type annotation. If the annotation value is ip, the NLB will be provisioned in IP mode. Here is the manifest snippet:

    metadata:
      name: my-service
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"

Backwards compatibility
For backwards compatibility, the controller still supports nlb-ip as the type annotation. For example, if you specify

    service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip

the controller will provision the NLB in IP mode, and the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type annotation is ignored.

Instance mode
Similar to IP mode, instance mode is selected when the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type annotation value is instance.

Going back and reading the guidelines again, I realised I had missed deploying the AWS Load Balancer Controller. It's written specifically in the prerequisites here:

https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html

I installed the controller and voilà, after redeployment my load balancer's target group was pointing directly at the pod IPs.

https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html
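
Since the rest of the cluster is managed in Terraform, the controller could also be installed there with a helm_release resource instead of running helm by hand. A minimal sketch, assuming a configured helm provider and an existing IRSA role for the controller's service account (the aws_iam_role.lbc_role name is my assumption):

# Hypothetical sketch: install the AWS Load Balancer Controller chart via Terraform.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"

  set {
    name  = "clusterName"
    value = module.eks_cluster.cluster_name
  }
  set {
    name  = "serviceAccount.create"
    value = "true"
  }
  set {
    name  = "serviceAccount.name"
    value = "aws-load-balancer-controller"
  }
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.lbc_role.arn # assumed IRSA role for the controller
  }
}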

Like I said, pretty straightforward, but only if you know what you're doing 😉
