Impact of temporary disconnection from Google Cloud

Google Kubernetes Engine (GKE) Enterprise edition is the Google Cloud application modernization platform. It is based on Kubernetes and can be deployed on Google Cloud, on other clouds, and on-premises with Google Distributed Cloud (both on VMware and on bare metal servers). Even when a GKE Enterprise-managed cluster runs on-premises, it is designed to have a permanent connection to Google Cloud for a number of reasons, including monitoring and management. However, you might need to know what would happen if, for any reason, the connection to Google Cloud is lost (for example, because of a technical problem). This document outlines the impact of a loss of connectivity for clusters in a Google Distributed Cloud software-only deployment (on bare metal or on VMware), and which workarounds you can use in this event.

This information is useful for architects who need to prepare for an unplanned or forced disconnection from Google Cloud and understand its consequences. However, you should not plan to use a software-only Google Distributed Cloud deployment disconnected from Google Cloud as a nominal working mode. Remember that we design GKE Enterprise to take advantage of the scalability and availability of Google Cloud services. This document is based on how the various GKE Enterprise components are designed and architected to behave during a temporary interruption. We cannot guarantee that this document is exhaustive.

This document assumes that you are familiar with GKE Enterprise. If that's not the case, we recommend that you first read the GKE Enterprise technical overview.

GKE Enterprise license validation and metering

If you have enabled GKE Enterprise, which means the Anthos API (anthos.googleapis.com) is enabled in your Google Cloud project, the GKE Enterprise metering controller, running in the cluster, generates and refreshes the GKE Enterprise entitlement periodically. The tolerance for disconnection is 12 hours. Additionally, metering and billing are managed through the connection.

This table lists the behavior of features related to licensing and metering in case of temporary disconnection from Google Cloud.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
GKE Enterprise license validation | The GKE Enterprise metering controller generates and refreshes the GKE Enterprise entitlement custom resource periodically, as long as anthos.googleapis.com is enabled in the Google Cloud project. | The components that consume the entitlement custom resource support a grace period: they continue to function as long as the entitlement custom resource is refreshed within the grace period. | Currently unlimited. After the grace period expires, components start to log errors. You cannot upgrade your cluster anymore. | None
Metering and billing | The GKE Enterprise metering controller reports the vCPU capacity of the cluster to the Google Cloud Service Control API for billing purposes. | There is an in-cluster agent that persists billing records in the cluster when disconnected, and the records are retrieved once the cluster re-connects to Google Cloud. | Unlimited. However, GKE Enterprise metering information is required for compliance as stated in the Service Specific Terms for "Premium Software". | None

Cluster lifecycle

This section covers scenarios such as creating, updating, deleting, and resizing clusters, as well as monitoring the status of these activities.

For most scenarios, you can use CLI tools such as bmctl, gkectl, and kubectl to perform operations during a temporary disconnection. You can also monitor the status of these operations with these tools. Upon reconnection, the Google Cloud console updates to display the results of operations performed during the disconnected period.
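As an illustration of monitoring a cluster locally with kubectl while the Google Cloud console is unavailable, here is a minimal Python sketch. It assumes kubectl is installed and that you have a kubeconfig file for the cluster; the function names and the sample data are hypothetical and exist only for this example. The parsing relies on the standard Kubernetes node object layout, where each node reports a condition of type Ready.

```python
import json
import subprocess


def node_readiness(node_list: dict) -> dict:
    """Map each node name to True when its Ready condition has status "True"."""
    readiness = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        readiness[name] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness


def fetch_node_list(kubeconfig: str) -> dict:
    """Query the cluster API directly with kubectl (no Google Cloud access needed)."""
    result = subprocess.run(
        ["kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical sample shaped like the output of `kubectl get nodes -o json`.
SAMPLE_NODE_LIST = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "node-b"},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
    ]
}

if __name__ == "__main__":
    print(node_readiness(SAMPLE_NODE_LIST))
```

Against a real cluster, you would call node_readiness(fetch_node_list(path_to_kubeconfig)) instead of using the sample data.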

Action | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Cluster creation | You use the bmctl or gkectl CLI tools to create clusters. This operation requires a connection to Google Cloud. | You cannot create clusters. | Zero | None
Cluster upgrade | You use the bmctl or gkectl CLI tools to upgrade clusters. This operation requires a connection to Google Cloud. | You cannot upgrade clusters. | Zero | None
Cluster deletion | You use the bmctl or gkectl CLI tools to delete clusters. This operation does not require a connection to Google Cloud. | You can delete clusters. | Unlimited | -
Viewing cluster status | You can see information about your clusters in the Google Cloud console, in the list of Google Kubernetes Engine clusters. | Cluster information is not shown in the Google Cloud console. | Unlimited | Use kubectl to directly query your clusters and get the information you need.
Removing nodes from a cluster | You do not need a connection to Google Cloud to remove nodes from a cluster. | You can remove nodes from a cluster. | Unlimited | -
Adding nodes to a cluster | The new node pulls container images from Container Registry to properly work. A preflight check runs to validate that there is connectivity to Google Cloud. | The preflight checks that run when adding a new node validate that there is connectivity to Google Cloud. Therefore, you cannot add a new node to a cluster when disconnected. | Zero | None
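Because cluster creation, cluster upgrades, and node additions all fail without connectivity, it can help to probe reachability yourself before starting one of these operations. The following Python sketch is not the preflight check that bmctl or gkectl runs; it is a simple TCP probe, and the function name, default port, and timeout are assumptions made for this example.

```python
import socket


def can_reach(host: str, port: int = 443, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_reach("gcr.io") probes the Container Registry host over HTTPS. If your cluster uses an egress proxy, a direct TCP probe like this one does not reflect the path that the cluster components actually take.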

Application lifecycle

Managing your applications running in an on-premises cluster is mostly unaffected by a temporary disconnection from Google Cloud. Only the Connect Gateway is impacted. If you are using Container Registry, Artifact Registry, Cloud Build, or Cloud Deploy to manage your container images or CI/CD pipelines in Google Cloud, they are not available anymore in case of disconnection. Strategies to deal with disconnection for those products are outside of the scope of this document.

Action | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Application deployment | Done locally using kubectl, through CI/CD tooling, or using the Connect Gateway. | The Connect Gateway is not available. All other methods of deployments still work as long as they connect directly to the Kubernetes API. | Unlimited | If you were using the Connect Gateway, switch to using kubectl locally.
Application removal | Done locally using kubectl, through CI/CD tooling, or using the Connect Gateway. | The Connect Gateway is not available. All other methods of deployments still work as long as they connect directly to the Kubernetes API. | Unlimited | If you were using the Connect Gateway, switch to using kubectl locally.
Application scale-out | Done locally using kubectl, through CI/CD tooling, or using the Connect Gateway. | The Connect Gateway is not available. All other methods of deployments still work as long as they connect directly to the Kubernetes API. | Unlimited | If you were using the Connect Gateway, switch to using kubectl locally.

Logging and monitoring

Auditability helps your organization meet its regulatory requirements and compliance policies. GKE Enterprise helps with auditability by offering application logging, Kubernetes logging, and audit logging. Many customers choose to leverage Google's Cloud Logging and Cloud Monitoring to avoid managing a logging and monitoring infrastructure on-prem. Other customers prefer to centralize their logs into an on-prem system for aggregation. To support these customers, GKE Enterprise provides direct integration to services such as Prometheus, Elastic, Splunk, or Datadog. In this mode, during temporary disconnection from Google Cloud, there is no impact on logging or monitoring functionality.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Application logging using Cloud Logging | Logs are written to Cloud Logging. | Logs are buffered to the local disk. | 4.5h or 4GiB local buffer per node. When the buffer fills or the disconnection lasts 4.5 hours, the oldest entries are dropped. | Use a local logging solution.
System/Kubernetes logging using Cloud Logging | Logs are written to Cloud Logging. | Logs are buffered to the local disk. | 4.5h or 4GiB local buffer per node. When the buffer fills or the disconnection lasts 4.5 hours, the oldest entries are dropped. | Use a local logging solution.
Audit logging using Cloud Audit Logs | Logs are written to Cloud Logging. | Logs are buffered to the local disk. | 10GiB local buffer per control plane node. When the buffer fills, the oldest entries are dropped. | Set up log forwarding to a local logging solution.
Application logging using other provider | You can use different third-party providers like Elastic, Splunk, Datadog, or Loki. | No impact | Unlimited | -
System/Kubernetes logging using other provider | You can use different third-party providers like Elastic, Splunk, or Datadog. | No impact | Unlimited | -
Application and Kubernetes metrics written to Cloud Monitoring | The metrics are written to Cloud Monitoring. | Metrics are buffered to the local disk. | 24h or 6GiB local buffer per node for system metrics and 1GiB local buffer per node for application metrics. When the buffer fills or the disconnection lasts 24 hours, the oldest entries are dropped. | Use a local monitoring solution.
Accessing and reading monitoring data from Kubernetes and application workloads | All metrics are available in the Google Cloud console and through the Cloud Monitoring API. | Metrics are not updated in Cloud Monitoring during the disconnection. | 24h or 6GiB local buffer per node for system metrics and 1GiB local buffer per node for application metrics. When the buffer fills or the disconnection lasts 24 hours, the oldest entries are dropped. | Use a local monitoring solution.
Alerting rules and paging for metrics | Cloud Monitoring supports alerting. You can create alerts for any metric. Alerts can be sent through different channels. | Alerts are not triggered while disconnected. Alerts are only triggered from metrics data already sent into Cloud Monitoring. | | Use a local monitoring and alerting solution.
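To get a feel for the buffer limits in this table, the following Python sketch estimates how long a node can stay disconnected before the oldest buffered entries start to be dropped. It assumes a constant data rate and treats the limit as whichever of the time limit or the size limit is reached first. The function name and the example rates are hypothetical; only the limits (4.5 hours or 4 GiB for application and system logs, 10 GiB for audit logs) come from the table.

```python
import math

GIB = 1024 ** 3


def hours_until_drop(bytes_per_hour: float, max_hours: float, max_bytes: int) -> float:
    """Hours of disconnection before the oldest buffered entries are dropped.

    Assumes a constant write rate. The result is the smaller of the time
    limit and the time needed to fill the buffer.
    """
    if bytes_per_hour <= 0:
        return max_hours
    return min(max_hours, max_bytes / bytes_per_hour)


# Application or system logs: 4.5 hours or 4 GiB per node.
busy_node = hours_until_drop(2 * GIB, 4.5, 4 * GIB)     # size limit reached first
quiet_node = hours_until_drop(0.5 * GIB, 4.5, 4 * GIB)  # time limit reached first

# Audit logs: 10 GiB per control plane node, with no time limit in the table.
audit = hours_until_drop(1 * GIB, math.inf, 10 * GIB)

if __name__ == "__main__":
    print(busy_node, quiet_node, audit)
```

With these example rates, the busy node starts dropping log entries after 2 hours, the quiet node after 4.5 hours, and the control plane node after 10 hours.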

Config and policy management

Config Sync and Policy Controller let you manage configuration and policies at scale, across all of your clusters. You store configurations and policies in a Git repository, and they are synchronized automatically to your clusters.

Config Sync

Config Sync uses in-cluster agents to connect directly to a Git repository. You can manage changes to the repository URL or the synchronization parameters with the gcloud or kubectl tools.

During temporary disconnection, the synchronization is unaffected if the in-cluster agents can still reach the Git repository. However, if you change the synchronization parameters with the Google Cloud CLI or the Google Cloud console, they are not applied to the cluster during the disconnection. You can temporarily overwrite them locally using kubectl. Any local changes are overwritten on reconnection.
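As a sketch of what a temporary local override with kubectl could look like, the following Python helper builds a kubectl patch command that changes the Git branch that Config Sync follows. Several details are assumptions that are not stated in this document: that the cluster uses a RootSync object named root-sync in the config-management-system namespace, and that the branch is set in the spec.git.branch field. Check these against your own installation before using anything like this. The helper name is hypothetical, and the function only builds the command; it does not run it.

```python
import json
from typing import List


def rootsync_patch_command(
    branch: str,
    name: str = "root-sync",                      # assumed RootSync object name
    namespace: str = "config-management-system",  # assumed namespace
) -> List[str]:
    """Build a kubectl command that merge-patches the Git branch of a RootSync."""
    patch = {"spec": {"git": {"branch": branch}}}
    return [
        "kubectl", "patch", "rootsyncs.configsync.gke.io", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(rootsync_patch_command("hotfix")))
```

Printing the command rather than executing it leaves the decision to an operator, which suits a change that is overwritten on reconnection.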

Policy Controller

Policy Controller enables the enforcement of fully programmable policies for your clusters. These policies act as "guardrails" and prevent any changes that violate security, operational, or compliance controls that you have defined.

Action | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Syncing configuration from a Git repository | In-cluster agents connect directly to the Git repository. You can change the repository URL or synchronization parameters with a Google Cloud API. | Syncing of configurations is unaffected. If you change the synchronization parameters with gcloud or in the Google Cloud console, they are not applied to the cluster during the disconnection. You can temporarily overwrite them locally using kubectl. Any local changes are overwritten on reconnection. | Unlimited | Never use the Fleet API for Config Sync, and only configure it by using the Kubernetes API.
Enforcing policies on requests to the Kubernetes API | The in-cluster agent enforces constraints thanks to its integration with the Kubernetes API. You manage policies using the local Kubernetes API. You manage the system configuration of Policy Controller with a Google Cloud API. | Policy enforcement is unaffected. Policies are still managed using the local Kubernetes API. Changes to the Policy Controller system configuration using the Google Cloud API are not propagated to the cluster, but you can temporarily overwrite them locally. Any local changes are overwritten on reconnection. | Unlimited | Never use the Fleet API for Policy Controller, and only configure it by using the Kubernetes API.
Installing, configuring, or upgrading Config Sync using the Google Cloud API | You use the Google Cloud API to manage the installation and upgrade of in-cluster agents. You also use this API (or gcloud, or the Google Cloud console) to manage the configuration of these agents. | In-cluster agents continue to operate normally. You cannot install, upgrade, or configure in-cluster agents using the Google Cloud API. Any pending installations, upgrades, or configurations done using the API proceed upon reconnection. | Zero | Never use the Fleet API for Policy Controller, and only configure it by using the Kubernetes API.
Viewing system or sync status in the Google Cloud console | You can view the health of the in-cluster agents and the synchronization status using a Google Cloud API or the Google Cloud console. | Status information in the Google Cloud API or Google Cloud console becomes stale. The API shows a connection error. All the information remains available on a per-cluster basis using the local Kubernetes API. | Zero | Use the nomos CLI or the local Kubernetes API.

Security

Identity, authentication, and authorization

GKE Enterprise can connect directly to Cloud Identity for application and user roles, to manage workloads using Connect, or for endpoint authentication using OIDC. In case of disconnection from Google Cloud, the connection to Cloud Identity is also severed, and those features are not available anymore. For workloads that require additional resiliency through a temporary disconnection, you can use GKE Identity Service to integrate with an LDAP or OIDC provider (including ADFS) to configure end-user authentication.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Cloud Identity as identity provider, using the Connect gateway | You can access GKE Enterprise resources using Cloud Identity as the identity provider, and connecting through the Connect gateway. | The Connect gateway requires a connection to Google Cloud. You are not able to connect to your clusters during the disconnection. | Zero | Use GKE Identity Service to federate with another identity provider.
Identity and authentication using a third-party identity provider | Supports OIDC and LDAP. You use the gcloud CLI to first log in. For OIDC providers, you can use the Google Cloud console to log in. You can then authenticate normally against the cluster API (for example, using kubectl). | As long as the identity provider remains accessible to both you and the cluster, you can still authenticate against the cluster API. You can't log in through the Google Cloud console. You can only update the OIDC or LDAP configuration of your clusters locally; you cannot use the Google Cloud console. | Unlimited | -
Authorization | GKE Enterprise supports role-based access control (RBAC). Roles can be attributed to users, groups, or service accounts. User identities and groups can be retrieved from the identity provider. | The RBAC system is local to the Kubernetes cluster and is not affected by disconnection from Google Cloud. However, if it relies on identities coming from Cloud Identity, then they are not available in case of disconnection. | Unlimited | -

Secret and key management

Secret and key management is an important part of your security posture. The behavior of GKE Enterprise in case of disconnection from Google Cloud depends on which service you are using for those features.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Secret and key management using Cloud Key Management Service and Secret Manager | You directly use Cloud Key Management Service for your cryptographic keys, and Secret Manager for your secrets. | Both Cloud Key Management Service and Secret Manager are not available. | Zero | Use local systems instead.
Secret and key management using Hashicorp Vault and Google Cloud services | You configure Hashicorp Vault to use Cloud Storage or Spanner to store secrets, and Cloud Key Management Service to manage keys. | If Hashicorp Vault runs on your Anthos cluster and is also impacted by the disconnection, then secret storage and key management are not available during the disconnection. | Zero | Use local systems instead.
Secret and key management using Hashicorp Vault and on-premises services | You configure Hashicorp Vault to use an on-premises storage backend for secrets, and an on-premises key management system (such as a hardware security module). | Disconnection from Google Cloud has no impact. | Unlimited | -

Networking and network services

Load Balancing

To expose Kubernetes Services hosted in an on-premises cluster to users, you can use either the bundled load balancer (MetalLB on bare metal, Seesaw or MetalLB on VMware) or your own load balancer, external to GKE Enterprise. Both options keep working in case of a disconnection from Google Cloud.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
L4 bundled load-balancer | Provides L4 load balancing entirely locally with no dependency on Google Cloud APIs or network. | No change | Unlimited | -
Manual or integrated load balancer | Supports F5 BIG-IP and others that are also hosted on-premises. | No change | Unlimited | -

Cloud Service Mesh

You can use Cloud Service Mesh to manage, observe, and secure communications across your services running in an on-premises cluster. Not all Cloud Service Mesh features are supported on Google Distributed Cloud: see the list of supported features for more information.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Deploying or updating policies (routing, authorization, security, audit, etc.) | You can use the Google Cloud console, kubectl, asmcli, or istioctl to manage Cloud Service Mesh policies. | You can only use kubectl or istioctl to manage Cloud Service Mesh policies. | Unlimited | Use kubectl or istioctl.
Certificate authority (CA) | You can use either the in-cluster CA or the Cloud Service Mesh certificate authority to manage the certificates used by Cloud Service Mesh. | There is no impact if you are using the in-cluster CA. If you are using the Cloud Service Mesh certificate authority, then certificates expire after 24 hours. New service instances cannot retrieve certificates. | Unlimited for in-cluster CA. Degraded service during 24h, and no service after 24h for Cloud Service Mesh certificate authority. | Use the in-cluster CA.
Cloud Monitoring for Cloud Service Mesh | You can use Cloud Monitoring to store, explore and exploit HTTP-related metrics coming from Cloud Service Mesh. | Metrics are not stored. | Zero | Use a compatible local monitoring solution such as Prometheus.
Cloud Service Mesh audit logging | Cloud Service Mesh relies on the local Kubernetes logging facilities. The behavior depends on how you configured logging for your GKE Enterprise cluster. | Depends on how you configured logging for your GKE Enterprise cluster. | - | -
Ingress gateway | You can define external IPs with the Istio Ingress Gateway. | No impact | Unlimited | -
Istio Container Network Interface (CNI) | You can configure Cloud Service Mesh to use the Istio CNI instead of iptables to manage the traffic. | No impact | Unlimited | -
Cloud Service Mesh end-user authentication for web applications | You can use the Cloud Service Mesh ingress gateway to integrate with your own identity provider (through OIDC) to authenticate and authorize end-users on web applications that are part of the mesh. | No impact | Unlimited | -

Other network services

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
DNS | The Kubernetes DNS server runs inside the cluster. | The Kubernetes DNS service works normally as it runs inside the cluster itself. | Unlimited | -
Egress proxy | You can configure GKE Enterprise to use a proxy for egress connections. | If your proxy runs on-premises, GKE Enterprise is still able to use it during a temporary disconnection. However, if the proxy loses the connection to Google Cloud, then all the scenarios from this document still apply. | Unlimited | -

Google Cloud Marketplace

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Deploying and managing applications and services from the Cloud Marketplace | The Cloud Marketplace is available in the Google Cloud console, and you can use it to discover, acquire, and deploy solutions. | You cannot use the Cloud Marketplace. Some solutions from the Cloud Marketplace might have their own connectivity requirements which are not documented here. | Zero | None

Support

This section covers the scenarios that you might have to go through while interacting with Google Cloud support or your operating partner for a case related to your GKE on GDC clusters.

Feature | Connected behavior | Temporary disconnection behavior | Maximum disconnection tolerance | Loss of connectivity workaround
Sharing a cluster snapshot with the support team | You can create a cluster snapshot locally using the bmctl check cluster or gkectl diagnose snapshot commands. You share this snapshot through the normal support process. | You can still generate the snapshot as it is a local operation. If you lost access to Google Cloud and its support web interfaces, you can phone the support team provided you have subscribed to the Enhanced or Premium support plans. | Unlimited | -
Sharing relevant log data with the support team | You can collect logs locally from your cluster and share them through the normal support process. | You can still collect logs from your cluster. If you lost access to Google Cloud and its support web interfaces, you can phone the support team provided you have subscribed to the Enhanced or Premium support plans. | Unlimited | -