Feed aggregator
AI & Humans: Making the Relationship Work
Leaders of many organizations are urging their teams to adopt agentic AI to improve efficiency, but are finding it hard to achieve any benefit. Managers attempting to add AI agents to existing human teams may find that bots fail to faithfully follow their instructions, return pointless or obvious results or burn precious time and resources spinning on tasks that older, simpler systems could have accomplished just as well.
The technical innovators getting the most out of AI are finding that the technology can be remarkably human in its behavior. And the more groups of AI agents are given tasks that require cooperation and collaboration, the more those human-like dynamics emerge...
API Security Features
Kubernetes v1.35: A Better Way to Pass Service Account Tokens to CSI Drivers
If you maintain a CSI driver that uses service account tokens,
Kubernetes v1.35 brings a refinement you'll want to know about.
Since the introduction of the TokenRequests feature,
service account tokens requested by CSI drivers have been passed to them through the volume_context field.
While this has worked, it's not the ideal place for sensitive information,
and we've seen instances where tokens were accidentally logged in CSI drivers.
Kubernetes v1.35 introduces a beta solution to address this:
CSI Driver Opt-in for Service Account Tokens via Secrets Field.
This allows CSI drivers to receive service account tokens
through the secrets field in NodePublishVolumeRequest,
which is the appropriate place for sensitive data in the CSI specification.
Understanding the existing approach
When CSI drivers use the TokenRequests feature,
they can request service account tokens for workload identity
by configuring the TokenRequests field in the CSIDriver spec.
These tokens are passed to drivers as part of the volume attributes map,
using the key csi.storage.k8s.io/serviceAccount.tokens.
The volume_context field works, but it's not designed for sensitive data.
Because of this, there are a few challenges:
First, the protosanitizer tool that CSI drivers use doesn't treat volume context as sensitive,
so service account tokens can end up in logs when gRPC requests are logged.
This happened with CVE-2023-2878 in the Secrets Store CSI Driver
and CVE-2024-3744 in the Azure File CSI Driver.
Second, each CSI driver that wants to avoid this issue needs to implement its own sanitization logic, which leads to inconsistency across drivers.
The CSI specification already has a secrets field in NodePublishVolumeRequest
that's designed exactly for this kind of sensitive information.
The challenge is that we can't just change where we put the tokens
without breaking existing CSI drivers that expect them in volume context.
How the opt-in mechanism works
Kubernetes v1.35 introduces an opt-in mechanism that lets CSI drivers choose how they receive service account tokens. This way, existing drivers continue working as they do today, and drivers can move to the more appropriate secrets field when they're ready.
CSI drivers can set a new field in their CSIDriver spec:
#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
name: example-csi-driver
spec:
# ... existing fields ...
tokenRequests:
- audience: "example.com"
expirationSeconds: 3600
# New field for opting into secrets delivery
serviceAccountTokenInSecrets: true # defaults to false
The behavior depends on the serviceAccountTokenInSecrets field:
When set to false (the default), tokens are placed in VolumeContext with the key csi.storage.k8s.io/serviceAccount.tokens, just like today.
When set to true, tokens are placed only in the Secrets field with the same key.
About the beta release
The CSIServiceAccountTokenSecrets feature gate is enabled by default
on both kubelet and kube-apiserver.
Since the serviceAccountTokenInSecrets field defaults to false,
enabling the feature gate doesn't change any existing behavior.
All drivers continue receiving tokens via volume context unless they explicitly opt in.
This is why we felt comfortable starting at beta rather than alpha.
Guide for CSI driver authors
If you maintain a CSI driver that uses service account tokens, here's how to adopt this feature.
Adding fallback logic
First, update your driver code to check both locations for tokens. This makes your driver compatible with both the old and new approaches:
const serviceAccountTokenKey = "csi.storage.k8s.io/serviceAccount.tokens"
func getServiceAccountTokens(req *csi.NodePublishVolumeRequest) (string, error) {
// Check secrets field first (new behavior when driver opts in)
if tokens, ok := req.Secrets[serviceAccountTokenKey]; ok {
return tokens, nil
}
// Fall back to volume context (existing behavior)
if tokens, ok := req.VolumeContext[serviceAccountTokenKey]; ok {
return tokens, nil
}
return "", fmt.Errorf("service account tokens not found")
}
This fallback logic is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35.
Rollout sequence
CSI driver authors need to follow a specific sequence when adopting this feature to avoid breaking existing volumes.
Driver preparation (can happen anytime)
You can start preparing your driver right away by adding fallback logic that checks both the secrets field and volume context for tokens. This code change is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35. We encourage you to add this fallback logic early, cut releases, and even backport to maintenance branches where feasible.
Cluster upgrade and feature enablement
Once your driver has the fallback logic deployed, here's the safe rollout order for enabling the feature in a cluster:
- Complete the kube-apiserver upgrade to 1.35 or later
- Complete kubelet upgrade to 1.35 or later on all nodes
- Ensure CSI driver version with fallback logic is deployed (if not already done in preparation phase)
- Fully complete CSI driver DaemonSet rollout across all nodes
- Update your CSIDriver manifest to set
serviceAccountTokenInSecrets: true
Important constraints
The most important thing to remember is timing.
If your CSI driver DaemonSet and CSIDriver object are in the same manifest or Helm chart,
you need two separate updates.
Deploy the new driver version with fallback logic first,
wait for the DaemonSet rollout to complete,
then update the CSIDriver spec to set serviceAccountTokenInSecrets: true.
Also, don't update the CSIDriver before all driver pods have rolled out. If you do, volume mounts will fail on nodes still running the old driver version, since those pods only check volume context.
Why this matters
Adopting this feature helps in a few ways:
- It eliminates the risk of accidentally logging service account tokens as part of volume context in gRPC requests
- It uses the CSI specification's designated field for sensitive data, which feels right
- The
protosanitizertool automatically handles the secrets field correctly, so you don't need driver-specific workarounds - It's opt-in, so you can migrate at your own pace without breaking existing deployments
Call to action
We (Kubernetes SIG Storage) encourage CSI driver authors to adopt this feature and provide feedback on the migration experience. If you have thoughts on the API design or run into any issues during adoption, please reach out to us on the #csi channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).
You can follow along on KEP-5538 to track progress across the coming Kubernetes releases.
The Wegman’s Supermarket Chain Is Probably Using Facial Recognition
The New York City Wegman’s is collecting biometric information about customers.
HolmesGPT: Agentic troubleshooting built for the cloud native era
If you’ve ever debugged a production incident, you know that the hardest part often isn’t the fix, it’s finding where to begin. Most on-call engineers end up spending hours piecing together clues, fighting time pressure, and trying to make sense of scattered data. You’ve probably run into one or more of these challenges:
- Unwritten knowledge and missing context:
You’re pulled into an outage for a service you barely know. The original owners have changed teams, the documentation is half-written, and the “runbook” is either stale or missing altogether. You spend the first 30 minutes trying to find someone who’s seen this issue before — and if you’re unlucky, this incident is a new one. - Tool overload and context switching:
Your screen looks like an air traffic control dashboard. You’re running monitoring queries, flipping between Grafana and Application Insights, checking container logs, and scrolling through traces — all while someone’s asking for an ETA in the incident channel. Correlating data across tools is manual, slow, and mentally exhausting. - Overwhelming complexity and knowledge gaps:
Modern cloud-native systems like Kubernetes are powerful, but they’ve made troubleshooting far more complex. Every layer — nodes, pods, controllers, APIs, networking, autoscalers – introduces its own failure modes. To diagnose effectively, you need deep expertise across multiple domains, something even seasoned engineers can’t always keep up with.
The challenges require a solution that can look across signals, recall patterns from past incidents, and guide you toward the most likely cause.
This is where HolmesGPT, a CNCF Sandbox project, could help.
HolmesGPT was accepted as a CNCF Sandbox project in October 2025. It’s built to simplify the chaos of production debugging – bringing together logs, metrics, and traces from different sources, reasoning over them, and surfacing clear, data-backed insights in plain language.
What is HolmesGPT?
HolmesGPT is an open-source AI troubleshooting agent built for Kubernetes and cloud-native environments. It combines observability telemetry, LLM reasoning, and structured runbooks to accelerate root cause analysis and suggest next actions.
Unlike static dashboards or chatbots, HolmesGPT is agentic: it actively decides what data to fetch, runs targeted queries, and iteratively refines its hypotheses – all while staying within your environment.
Key benefits:
- AI-native control loop: HolmesGPT uses an agentic task list approach
- Open architecture: Every integration and toolset is open and extensible, works with existing runbooks and MCP servers
- Data privacy: Models can run locally or inside your cluster or on the cloud
- Community-driven: Designed around CNCF principles of openness, interoperability, and transparency.
How it works
When you run:
holmes ask “Why is my pod in crash loop back off state”
HolmesGPT:
- Understands intent → it recognizes you want to diagnose a pod restart issue
- Creates a task list → breaks down the problem into smaller chunks and executes each of them separately
- Queries data sources → runs Prometheus queries, collects Kubernetes events or logs, inspects pod specs including which pod
- Correlates context → detects that a recent deployment updated the image
- Explains and suggests fixes → returns a natural language diagnosis and remediation steps.
Here’s a simplified overview of the architecture:
Extensible by design
HolmesGPT’s architecture allows contributors to add new components:
- Toolsets: Build custom commands for internal observability pipelines or expose existing tools through a Model Context Protocol (MCP) server.
- Evals: Add custom evals to benchmark performance, cost , latency of models
- Runbooks: Codify best practices (e.g., “diagnose DNS failures” or “debug PVC provisioning”).
Example of a simple custom tool:
holmes:
toolsets:
kubernetes/pod_status:
description: "Check the status of a Kubernetes pod."
tools:
- name: "get_pod"
description: "Fetch pod details from a namespace."
command: "kubectl get pod {{ pod }} -n {{ namespace }}"
Getting started
- Install Holmesgpt
There are 4-5 ways to install Holmesgpt, one of the easiest ways to get started is through pip.
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
The detailed installation guide has instructions for helm, CLI and the UI.
- Setup the LLM (Any Open AI compatible LLM) by setting the API Key
In most cases, this means setting the appropriate environment variable based on the LLM provider.
- Run it locally
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5"
- Explore other features
- GitHub: https://github.com/robusta-dev/holmesgpt
- Docs: holmesgpt.dev
How to get involved
HolmesGPT is entirely community-driven and welcomes all forms of contribution:
Area How you can help Integrations Add new toolsets for your observability tools or CI/CD pipelines. Runbooks Encode operational expertise for others to reuse. Evaluation Help build benchmarks for AI reasoning accuracy and observability insights. Docs and tutorials Improve onboarding, create demos, or contribute walkthroughs. Community Join discussions around governance and CNCF Sandbox progression.All contributions follow the CNCF Code of Conduct.
Further Resources
- GitHub Repository
- Join CNCF Slack → #holmesgpt
- Contributing Guide
The year in review: Kubewarden's progress in 2025
A Cyberattack Was Part of the US Assault on Venezuela
We don’t have many details:
President Donald Trump suggested Saturday that the U.S. used cyberattacks or other technical capabilities to cut power off in Caracas during strikes on the Venezuelan capital that led to the capture of Venezuelan President Nicolás Maduro.
If true, it would mark one of the most public uses of U.S. cyber power against another nation in recent memory. These operations are typically highly classified, and the U.S. is considered one of the most advanced nations in cyberspace operations globally.
Red Hat Hybrid Cloud Console: Your questions answered
Kubernetes v1.35: Extended Toleration Operators to Support Numeric Comparisons (Alpha)
Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to optimize costs while maintaining reliability for critical workloads. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt-in with explicit thresholds like "I can tolerate nodes with failure probability up to 5%".
Today, Kubernetes taints and tolerations can match exact values or check for existence, but they can't compare numeric thresholds. You'd need to create discrete taint categories, use external admission controllers, or accept less-than-optimal placement decisions.
In Kubernetes v1.35, we're introducing Extended Toleration Operators as an alpha feature. This enhancement adds Gt (Greater Than) and Lt (Less Than) operators to spec.tolerations, enabling threshold-based scheduling decisions that unlock new possibilities for SLA-based placement, cost optimization, and performance-aware workload distribution.
The evolution of tolerations
Historically, Kubernetes supported two primary toleration operators:
Equal: The toleration matches a taint if the key and value are exactly equalExists: The toleration matches a taint if the key exists, regardless of value
While these worked well for categorical scenarios, they fell short for numeric comparisons. Starting with v1.35, we are closing this gap.
Consider these real-world scenarios:
- SLA requirements: Schedule high-availability workloads only on nodes with failure probability below a certain threshold
- Cost optimization: Allow cost-sensitive batch jobs to run on cheaper nodes that exceed a specific cost-per-hour value
- Performance guarantees: Ensure latency-sensitive applications run only on nodes with disk IOPS or network bandwidth above minimum thresholds
Without numeric comparison operators, cluster operators have had to resort to workarounds like creating multiple discrete taint values or using external admission controllers, neither of which scale well or provide the flexibility needed for dynamic threshold-based scheduling.
Why extend tolerations instead of using NodeAffinity?
You might wonder: NodeAffinity already supports numeric comparison operators, so why extend tolerations? While NodeAffinity is powerful for expressing pod preferences, taints and tolerations provide critical operational benefits:
- Policy orientation: NodeAffinity is per-pod, requiring every workload to explicitly opt-out of risky nodes. Taints invert control—nodes declare their risk level, and only pods with matching tolerations may land there. This provides a safer default; most pods stay away from spot/preemptible nodes unless they explicitly opt-in.
- Eviction semantics: NodeAffinity has no eviction capability. Taints support the
NoExecuteeffect withtolerationSeconds, enabling operators to drain and evict pods when a node's SLA degrades or spot instances receive termination notices. - Operational ergonomics: Centralized, node-side policy is consistent with other safety taints like disk-pressure and memory-pressure, making cluster management more intuitive.
This enhancement preserves the well-understood safety model of taints and tolerations while enabling threshold-based placement for SLA-aware scheduling.
Introducing Gt and Lt operators
Kubernetes v1.35 introduces two new operators for tolerations:
Gt(Greater Than): The toleration matches if the taint's numeric value is less than the toleration's valueLt(Less Than): The toleration matches if the taint's numeric value is greater than the toleration's value
When a pod tolerates a taint with Lt, it's saying "I can tolerate nodes where this metric is less than my threshold". Since tolerations allow scheduling, the pod can run on nodes where the taint value is greater than the toleration value. Think of it as: "I tolerate nodes that are above my minimum requirements".
These operators work with numeric taint values and enable the scheduler to make sophisticated placement decisions based on continuous metrics rather than discrete categories.
Note:
Numeric values for Gt and Lt operators must be positive 64-bit integers without leading zeros. For example, "100" is valid, but "0100" (with leading zero) and "0" (zero value) are not permitted.
The Gt and Lt operators work with all taint effects: NoSchedule, NoExecute, and PreferNoSchedule.
Use cases and examples
Let's explore how Extended Toleration Operators solve real-world scheduling challenges.
Example 1: Spot instance protection with SLA thresholds
Many clusters mix on-demand and spot/preemptible nodes to optimize costs. Spot nodes offer significant savings but have higher failure rates. You want most workloads to avoid spot nodes by default, while allowing specific workloads to opt-in with clear SLA boundaries.
First, taint spot nodes with their failure probability (for example, 15% annual failure rate):
apiVersion: v1
kind: Node
metadata:
name: spot-node-1
spec:
taints:
- key: "failure-probability"
value: "15"
effect: "NoExecute"
On-demand nodes have much lower failure rates:
apiVersion: v1
kind: Node
metadata:
name: ondemand-node-1
spec:
taints:
- key: "failure-probability"
value: "2"
effect: "NoExecute"
Critical workloads can specify strict SLA requirements:
apiVersion: v1
kind: Pod
metadata:
name: payment-processor
spec:
tolerations:
- key: "failure-probability"
operator: "Lt"
value: "5"
effect: "NoExecute"
tolerationSeconds: 30
containers:
- name: app
image: payment-app:v1
This pod will only schedule on nodes with failure-probability less than 5 (meaning ondemand-node-1 with 2% but not spot-node-1 with 15%). The NoExecute effect with tolerationSeconds: 30 means if a node's SLA degrades (for example, cloud provider changes the taint value), the pod gets 30 seconds to gracefully terminate before forced eviction.
Meanwhile, a fault-tolerant batch job can explicitly opt-in to spot instances:
apiVersion: v1
kind: Pod
metadata:
name: batch-job
spec:
tolerations:
- key: "failure-probability"
operator: "Lt"
value: "20"
effect: "NoExecute"
containers:
- name: worker
image: batch-worker:v1
This batch job tolerates nodes with failure probability up to 20%, so it can run on both on-demand and spot nodes, maximizing cost savings while accepting higher risk.
Example 2: AI workload placement with GPU tiers
AI and machine learning workloads often have specific hardware requirements. With Extended Toleration Operators, you can create GPU node tiers and ensure workloads land on appropriately powered hardware.
Taint GPU nodes with their compute capability score:
apiVersion: v1
kind: Node
metadata:
name: gpu-node-a100
spec:
taints:
- key: "gpu-compute-score"
value: "1000"
effect: "NoSchedule"
---
apiVersion: v1
kind: Node
metadata:
name: gpu-node-t4
spec:
taints:
- key: "gpu-compute-score"
value: "500"
effect: "NoSchedule"
A heavy training workload can require high-performance GPUs:
apiVersion: v1
kind: Pod
metadata:
name: model-training
spec:
tolerations:
- key: "gpu-compute-score"
operator: "Gt"
value: "800"
effect: "NoSchedule"
containers:
- name: trainer
image: ml-trainer:v1
resources:
limits:
nvidia.com/gpu: 1
This ensures the training pod only schedules on nodes with compute scores greater than 800 (like the A100 node), preventing placement on lower-tier GPUs that would slow down training.
Meanwhile, inference workloads with less demanding requirements can use any available GPU:
apiVersion: v1
kind: Pod
metadata:
name: model-inference
spec:
tolerations:
- key: "gpu-compute-score"
operator: "Gt"
value: "400"
effect: "NoSchedule"
containers:
- name: inference
image: ml-inference:v1
resources:
limits:
nvidia.com/gpu: 1
Example 3: Cost-optimized workload placement
For batch processing or non-critical workloads, you might want to minimize costs by running on cheaper nodes, even if they have lower performance characteristics.
Nodes can be tainted with their cost rating:
spec:
taints:
- key: "cost-per-hour"
value: "50"
effect: "NoSchedule"
A cost-sensitive batch job can express its tolerance for expensive nodes:
tolerations:
- key: "cost-per-hour"
operator: "Lt"
value: "100"
effect: "NoSchedule"
This batch job will schedule on nodes costing less than $100/hour but avoid more expensive nodes. Combined with Kubernetes scheduling priorities, this enables sophisticated cost-tiering strategies where critical workloads get premium nodes while batch workloads efficiently use budget-friendly resources.
Example 4: Performance-based placement
Storage-intensive applications often require minimum disk performance guarantees. With Extended Toleration Operators, you can enforce these requirements at the scheduling level.
tolerations:
- key: "disk-iops"
operator: "Gt"
value: "3000"
effect: "NoSchedule"
This toleration ensures the pod only schedules on nodes where disk-iops exceeds 3000. The Gt operator means "I need nodes that are greater than this minimum".
How to use this feature
Extended Toleration Operators is an alpha feature in Kubernetes v1.35. To try it out:
-
Enable the feature gate on both your API server and scheduler:
--feature-gates=TaintTolerationComparisonOperators=true -
Taint your nodes with numeric values representing the metrics relevant to your scheduling needs:
kubectl taint nodes node-1 failure-probability=5:NoSchedule kubectl taint nodes node-2 disk-iops=5000:NoSchedule -
Use the new operators in your pod specifications:
spec: tolerations: - key: "failure-probability" operator: "Lt" value: "1" effect: "NoSchedule"
Note:
As an alpha feature, Extended Toleration Operators may change in future releases and should be used with caution in production environments. Always test thoroughly in non-production clusters first.What's next?
This alpha release is just the beginning. As we gather feedback from the community, we plan to:
- Add support for CEL (Common Expression Language) expressions in tolerations and node affinity for even more flexible scheduling logic, including semantic versioning comparisons
- Improve integration with cluster autoscaling for threshold-aware capacity planning
- Graduate the feature to beta and eventually GA with production-ready stability
We're particularly interested in hearing about your use cases! Do you have scenarios where threshold-based scheduling would solve problems? Are there additional operators or capabilities you'd like to see?
Getting involved
This feature is driven by the SIG Scheduling community. Please join us to connect with the community and share your ideas and feedback around this feature and beyond.
You can reach the maintainers of this feature at:
- Slack: #sig-scheduling on Kubernetes Slack
- Mailing list: [email protected]
For questions or specific inquiries related to Extended Toleration Operators, please reach out to the SIG Scheduling community. We look forward to hearing from you!
How can I learn more?
- Taints and Tolerations for understanding the fundamentals
- Numeric comparison operators for details on using
GtandLtoperators - KEP-5471: Extended Toleration Operators for Threshold-Based Placement
Telegram Hosting World’s Largest Darknet Market
Wired is reporting on Chinese darknet markets on Telegram.
The ecosystem of marketplaces for Chinese-speaking crypto scammers hosted on the messaging service Telegram have now grown to be bigger than ever before, according to a new analysis from the crypto tracing firm Elliptic. Despite a brief drop after Telegram banned two of the biggest such markets in early 2025, the two current top markets, known as Tudou Guarantee and Xinbi Guarantee, are together enabling close to $2 billion a month in money-laundering transactions, sales of scam tools like stolen data, fake investment websites, and AI deepfake tools, as well as other black market services as varied as ...
Friday Squid Blogging: Squid Found in Light Fixture
Probably a college prank.
As usual, you can also use this squid post to talk about the security stories in the news that I haven’t covered.
Kubernetes v1.35: New level of efficiency with in-place Pod restart
The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of the Pod. This feature, Restart All Containers (alpha in 1.35), allows for an efficient way to reset a Pod's state compared to resource-intensive approach of deleting and recreating the entire Pod. This feature is especially useful for AI/ML workloads allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With RestartAllContainers and other planned enhancements, Kubernetes continues to add building blocks for creating the most flexible, robust, and efficient platforms for AI/ML workloads.
This new functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.
The problem: when a single container restart isn't enough and recreating pods is too costly
Kubernetes has long supported restart policies at the Pod level (restartPolicy) and, more recently, at the individual container level. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:
- An init container prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.
- A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.
- A sidecar that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.
In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.
This inefficiency becomes even worse when handling large-scale AI/ML workloads (>= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset the state before training can resume, even if all the other Pods were not directly affected by the failure. Deleting, creating and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead of this failure could cost $100,000 per month in wasted resources.
Handling these failures for AI/ML training jobs requires a complex integration touching both the training framework and Kubernetes, which are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.
Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.
Introducing the RestartAllContainers action
To address this, Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in-place restart of the Pod.
This in-place restart is highly efficient because it preserves the Pod's most important resources:
- The Pod's UID, IP address and network namespace.
- The Pod's sandbox and any attached devices.
- All volumes, including
emptyDirand mounted volumes from PVCs.
After terminating all running containers, the Pod's startup sequence is re-executed from the very beginning. This means all init containers are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers—including those that previously succeeded or failed—will be restarted, regardless of their individual restart policies.
Use cases
1. Efficient restarts for ML/Batch jobs
For ML training jobs, rescheduling a worker Pod on failure is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste over $100,000 in compute resources monthly.
With RestartAllContainers actions you can address this by enabling a much faster, hybrid recovery strategy: recreate only the "bad" Pods (e.g., those on unhealthy Nodes) while triggering RestartAllContainers for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead from minutes to a few seconds.
With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.
Read more details about future development and JobSet features at KEP-467 JobSet in-place restart.
apiVersion: v1
kind: Pod
metadata:
name: ml-worker-pod
spec:
restartPolicy: Never
initContainers:
# This init container will re-run on every in-place restart
- name: setup-environment
image: my-repo/setup-worker:1.0
- name: watcher-sidecar
image: my-repo/watcher:1.0
restartPolicy: Always
restartPolicyRules:
- action: RestartAllContainers
onExit:
exitCodes:
operator: In
# A specific exit code from the watcher triggers a full pod restart
values: [88]
containers:
- name: main-application
image: my-repo/training-app:1.0
2. Re-running init containers for a clean state
Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the init container to rerun.
By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.
3. Handling high rate of similar tasks execution
There are cases when tasks are best represented as a Pod execution. And each task requires a clean execution. The task may be a game session backend or some queue item processing. If the rate of tasks is high, running the whole cycle of Pod creation, scheduling and initialization is simply too expensive, especially when tasks can be short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.
How to use it
To try this feature, you must enable the RestartAllContainersOnContainerExits feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.
Once enabled, you can add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.
The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run preStop hooks. This means containers must be designed to handle abrupt termination without relying on preStop hooks for graceful shutdown.
Observing the restart
To make this process observable, a new Pod condition, AllContainersRestarting, is added to the Pod's status. When a restart is triggered, this condition becomes True and it reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod's state.
All containers restarted by this action will have their restart count incremented in the container status.
Learn more
- Read the official documentation on Pod Lifecycle.
- Read the detailed proposal in the KEP-5532: Restart All Containers on Container Exits.
- Read the proposal for JobSet in-place restart in JobSet issue #467.
We want your feedback!
As an alpha feature, RestartAllContainers is ready for you to experiment with and any use cases and feedback are welcome. This feature is driven by the SIG Node community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!
You can reach SIG Node through:
- Slack: #sig-node
- Mailing list
The Kimwolf Botnet is Stalking Your Local Network
The story you are reading is a series of scoops nestled inside a far more urgent Internet-wide security advisory. The vulnerability at issue has been exploited for months already, and it’s time for a broader awareness of the threat. The short version is that everything you thought you knew about the security of the internal network behind your Internet router probably is now dangerously out of date.

The security company Synthient currently sees more than 2 million infected Kimwolf devices distributed globally but with concentrations in Vietnam, Brazil, India, Saudi Arabia, Russia and the United States. Synthient found that two-thirds of the Kimwolf infections are Android TV boxes with no security or authentication built in.
The past few months have witnessed the explosive growth of a new botnet dubbed Kimwolf, which experts say has infected more than 2 million devices globally. The Kimwolf malware forces compromised systems to relay malicious and abusive Internet traffic — such as ad fraud, account takeover attempts and mass content scraping — and participate in crippling distributed denial-of-service (DDoS) attacks capable of knocking nearly any website offline for days at a time.
More important than Kimwolf’s staggering size, however, is the diabolical method it uses to spread so quickly: By effectively tunneling back through various “residential proxy” networks and into the local networks of the proxy endpoints, and by further infecting devices that are hidden behind the assumed protection of the user’s firewall and Internet router.
Residential proxy networks are sold as a way for customers to anonymize and localize their Web traffic to a specific region, and the biggest of these services allow customers to route their traffic through devices in virtually any country or city around the globe.
The malware that turns an end-user’s Internet connection into a proxy node is often bundled with dodgy mobile apps and games. These residential proxy programs also are commonly installed via unofficial Android TV boxes sold by third-party merchants on popular e-commerce sites like Amazon, BestBuy, Newegg, and Walmart.
These TV boxes range in price from $40 to $400, are marketed under a dizzying range of no-name brands and model numbers, and frequently are advertised as a way to stream certain types of subscription video content for free. But there’s a hidden cost to this transaction: As we’ll explore in a moment, these TV boxes make up a considerable chunk of the estimated two million systems currently infected with Kimwolf.

Some of the unsanctioned Android TV boxes that come with residential proxy malware pre-installed. Image: Synthient.
Kimwolf also is quite good at infecting a range of Internet-connected digital photo frames that likewise are abundant at major e-commerce websites. In November 2025, researchers from Quokka published a report (PDF) detailing serious security issues in Android-based digital picture frames running the Uhale app — including Amazon’s bestselling digital frame as of March 2025.
There are two major security problems with these photo frames and unofficial Android TV boxes. The first is that a considerable percentage of them come with malware pre-installed, or else require the user to download an unofficial Android App Store and malware in order to use the device for its stated purpose (video content piracy). The most typical of these uninvited guests are small programs that turn the device into a residential proxy node that is resold to others.
The second big security nightmare with these photo frames and unsanctioned Android TV boxes is that they rely on a handful of Internet-connected microcomputer boards that have no discernible security or authentication requirements built-in. In other words, if you are on the same network as one or more of these devices, you can likely compromise them simultaneously by issuing a single command across the network.
THERE’S NO PLACE LIKE 127.0.0.1
The combination of these two security realities came to the fore in October 2025, when an undergraduate computer science student at the Rochester Institute of Technology began closely tracking Kimwolf’s growth, and interacting directly with its apparent creators on a daily basis.
Benjamin Brundage is the 22-year-old founder of the security firm Synthient, a startup that helps companies detect proxy networks and learn how those networks are being abused. Conducting much of his research into Kimwolf while studying for final exams, Brundage told KrebsOnSecurity in late October 2025 he suspected Kimwolf was a new Android-based variant of Aisuru, a botnet that was incorrectly blamed for a number of record-smashing DDoS attacks last fall.
Brundage says Kimwolf grew rapidly by abusing a glaring vulnerability in many of the world’s largest residential proxy services. The crux of the weakness, he explained, was that these proxy services weren’t doing enough to prevent their customers from forwarding requests to internal servers of the individual proxy endpoints.
Most proxy services take basic steps to prevent their paying customers from “going upstream” into the local network of proxy endpoints, by explicitly denying requests for local addresses specified in RFC-1918, including the well-known Network Address Translation (NAT) ranges 10.0.0.0/8, 192.168.0.0/16, and 172.16.0.0/12. These ranges allow multiple devices in a private network to access the Internet using a single public IP address, and if you run any kind of home or office network, your internal address space operates within one or more of these NAT ranges.
However, Brundage discovered that the people operating Kimwolf had figured out how to talk directly to devices on the internal networks of millions of residential proxy endpoints, simply by changing their Domain Name System (DNS) settings to match those in the RFC-1918 address ranges.
“It is possible to circumvent existing domain restrictions by using DNS records that point to 192.168.0.1 or 0.0.0.0,” Brundage wrote in a first-of-its-kind security advisory sent to nearly a dozen residential proxy providers in mid-December 2025. “This grants an attacker the ability to send carefully crafted requests to the current device or a device on the local network. This is actively being exploited, with attackers leveraging this functionality to drop malware.”
As with the digital photo frames mentioned above, many of these residential proxy services run solely on mobile devices that are running some game, VPN or other app with a hidden component that turns the user’s mobile phone into a residential proxy — often without any meaningful consent.
In a report published today, Synthient said key actors involved in Kimwolf were observed monetizing the botnet through app installs, selling residential proxy bandwidth, and selling its DDoS functionality.
“Synthient expects to observe a growing interest among threat actors in gaining unrestricted access to proxy networks to infect devices, obtain network access, or access sensitive information,” the report observed. “Kimwolf highlights the risks posed by unsecured proxy networks and their viability as an attack vector.”
ANDROID DEBUG BRIDGE
After purchasing a number of unofficial Android TV box models that were most heavily represented in the Kimwolf botnet, Brundage further discovered the proxy service vulnerability was only part of the reason for Kimwolf’s rapid rise: He also found virtually all of the devices he tested were shipped from the factory with a powerful feature called Android Debug Bridge (ADB) mode enabled by default.

Many of the unofficial Android TV boxes infected by Kimwolf include the ominous disclaimer: “Made in China. Overseas use only.” Image: Synthient.
ADB is a diagnostic tool intended for use solely during the manufacturing and testing processes, because it allows the devices to be remotely configured and even updated with new (and potentially malicious) firmware. However, shipping these devices with ADB turned on creates a security nightmare because in this state they constantly listen for and accept unauthenticated connection requests.
For example, opening a command prompt and typing “adb connect” along with a vulnerable device’s (local) IP address followed immediately by “:5555” will very quickly offer unrestricted “super user” administrative access.
Brundage said by early December, he’d identified a one-to-one overlap between new Kimwolf infections and proxy IP addresses offered for rent by China-based IPIDEA, currently the world’s largest residential proxy network by all accounts.
“Kimwolf has almost doubled in size this past week, just by exploiting IPIDEA’s proxy pool,” Brundage told KrebsOnSecurity in early December as he was preparing to notify IPIDEA and 10 other proxy providers about his research.
Brundage said Synthient first confirmed on December 1, 2025 that the Kimwolf botnet operators were tunneling back through IPIDEA’s proxy network and into the local networks of systems running IPIDEA’s proxy software. The attackers dropped the malware payload by directing infected systems to visit a specific Internet address and to call out the pass phrase “krebsfiveheadindustries” in order to unlock the malicious download.
On December 30, Synthient said it was tracking roughly 2 million IPIDEA addresses exploited by Kimwolf in the previous week. Brundage said he has witnessed Kimwolf rebuilding itself after one recent takedown effort targeting its control servers — from almost nothing to two million infected systems just by tunneling through proxy endpoints on IPIDEA for a couple of days.
Brundage said IPIDEA has a seemingly inexhaustible supply of new proxies, advertising access to more than 100 million residential proxy endpoints around the globe in the past week alone. Analyzing the exposed devices that were part of IPIDEA’s proxy pool, Synthient said it found more than two-thirds were Android devices that could be compromised with no authentication needed.
SECURITY NOTIFICATION AND RESPONSE
After charting a tight overlap in Kimwolf-infected IP addresses and those sold by IPIDEA, Brundage was eager to make his findings public: The vulnerability had clearly been exploited for several months, although it appeared that only a handful of cybercrime actors were aware of the capability. But he also knew that going public without giving vulnerable proxy providers an opportunity to understand and patch it would only lead to more mass abuse of these services by additional cybercriminal groups.
On December 17, Brundage sent a security notification to all 11 of the apparently affected proxy providers, hoping to give each at least a few weeks to acknowledge and address the core problems identified in his report before he went public. Many proxy providers who received the notification were resellers of IPIDEA that white-labeled the company’s service.
KrebsOnSecurity first sought comment from IPIDEA in October 2025, in reporting on a story about how the proxy network appeared to have benefitted from the rise of the Aisuru botnet, whose administrators appeared to shift from using the botnet primarily for DDoS attacks to simply installing IPIDEA’s proxy program, among others.
On December 25, KrebsOnSecurity received an email from an IPIDEA employee identified only as “Oliver,” who said allegations that IPIDEA had benefitted from Aisuru’s rise were baseless.
“After comprehensively verifying IP traceability records and supplier cooperation agreements, we found no association between any of our IP resources and the Aisuru botnet, nor have we received any notifications from authoritative institutions regarding our IPs being involved in malicious activities,” Oliver wrote. “In addition, for external cooperation, we implement a three-level review mechanism for suppliers, covering qualification verification, resource legality authentication and continuous dynamic monitoring, to ensure no compliance risks throughout the entire cooperation process.”
“IPIDEA firmly opposes all forms of unfair competition and malicious smearing in the industry, always participates in market competition with compliant operation and honest cooperation, and also calls on the entire industry to jointly abandon irregular and unethical behaviors and build a clean and fair market ecosystem,” Oliver continued.
Meanwhile, the same day that Oliver’s email arrived, Brundage shared a response he’d just received from IPIDEA’s security officer, who identified himself only by the first name Byron. The security officer said IPIDEA had made a number of important security changes to its residential proxy service to address the vulnerability identified in Brundage’s report.
“By design, the proxy service does not allow access to any internal or local address space,” Byron explained. “This issue was traced to a legacy module used solely for testing and debugging purposes, which did not fully inherit the internal network access restrictions. Under specific conditions, this module could be abused to reach internal resources. The affected paths have now been fully blocked and the module has been taken offline.”
Byron told Brundage IPIDEA also instituted multiple mitigations for blocking DNS resolution to internal (NAT) IP ranges, and that it was now blocking proxy endpoints from forwarding traffic on “high-risk” ports “to prevent abuse of the service for scanning, lateral movement, or access to internal services.”

An excerpt from an email sent by IPIDEA’s security officer in response to Brundage’s vulnerability notification. Click to enlarge.
Brundage said IPIDEA appears to have successfully patched the vulnerabilities he identified. He also noted he never observed the Kimwolf actors targeting proxy services other than IPIDEA, which has not responded to requests for comment.
Riley Kilmer is founder of Spur.us, a technology firm that helps companies identify and filter out proxy traffic. Kilmer said Spur has tested Brundage’s findings and confirmed that IPIDEA and all of its affiliate resellers indeed allowed full and unfiltered access to the local LAN.
Kilmer said one model of unsanctioned Android TV boxes that is especially popular — the Superbox, which we profiled in November’s Is Your Android TV Streaming Box Part of a Botnet? — leaves Android Debug Mode running on localhost:5555.
“And since Superbox turns the IP into an IPIDEA proxy, a bad actor just has to use the proxy to localhost on that port and install whatever bad SDKs [software development kits] they want,” Kilmer told KrebsOnSecurity.

Superbox media streaming boxes for sale on Walmart.com.
ECHOES FROM THE PAST
Both Brundage and Kilmer say IPIDEA appears to be the second or third reincarnation of a residential proxy network formerly known as 911S5 Proxy, a service that operated between 2014 and 2022 and was wildly popular on cybercrime forums. 911S5 Proxy imploded a week after KrebsOnSecurity published a deep dive on the service’s sketchy origins and leadership in China.
In that 2022 profile, we cited work by researchers at the University of Sherbrooke in Canada who were studying the threat 911S5 could pose to internal corporate networks. The researchers noted that “the infection of a node enables the 911S5 user to access shared resources on the network such as local intranet portals or other services.”
“It also enables the end user to probe the LAN network of the infected node,” the researchers explained. “Using the internal router, it would be possible to poison the DNS cache of the LAN router of the infected node, enabling further attacks.”
911S5 initially responded to our reporting in 2022 by claiming it was conducting a top-down security review of the service. But the proxy service abruptly closed up shop just one week later, saying a malicious hacker had destroyed all of the company’s customer and payment records. In July 2024, The U.S. Department of the Treasury sanctioned the alleged creators of 911S5, and the U.S. Department of Justice arrested the Chinese national named in my 2022 profile of the proxy service.
Kilmer said IPIDEA also operates a sister service called 922 Proxy, which the company has pitched from Day One as a seamless alternative to 911S5 Proxy.
“You cannot tell me they don’t want the 911 customers by calling it that,” Kilmer said.
Among the recipients of Synthient’s notification was the proxy giant Oxylabs. Brundage shared an email he received from Oxylabs’ security team on December 31, which acknowledged Oxylabs had started rolling out security modifications to address the vulnerabilities described in Synthient’s report.
Reached for comment, Oxylabs confirmed they “have implemented changes that now eliminate the ability to bypass the blocklist and forward requests to private network addresses using a controlled domain,” the company said in a written statement. But it said there is no evidence that Kimwolf or other other attackers exploited its network.
“In parallel, we reviewed the domains identified in the reported exploitation activity and did not observe traffic associated with them,” the Oxylabs statement continued. “Based on this review, there is no indication that our residential network was impacted by these activities.”
PRACTICAL IMPLICATIONS
Consider the following scenario, in which the mere act of allowing someone to use your Wi-Fi network could lead to a Kimwolf botnet infection. In this example, a friend or family member comes to stay with you for a few days, and you grant them access to your Wi-Fi without knowing that their mobile phone is infected with an app that turns the device into a residential proxy node. At that point, your home’s public IP address will show up for rent at the website of some residential proxy provider.
Miscreants like those behind Kimwolf then use residential proxy services online to access that proxy node on your IP, tunnel back through it and into your local area network (LAN), and automatically scan the internal network for devices with Android Debug Bridge mode turned on.
By the time your guest has packed up their things, said their goodbyes and disconnected from your Wi-Fi, you now have two devices on your local network — a digital photo frame and an unsanctioned Android TV box — that are infected with Kimwolf. You may have never intended for these devices to be exposed to the larger Internet, and yet there you are.
Here’s another possible nightmare scenario: Attackers use their access to proxy networks to modify your Internet router’s settings so that it relies on malicious DNS servers controlled by the attackers — allowing them to control where your Web browser goes when it requests a website. Think that’s far-fetched? Recall the DNSChanger malware from 2012 that infected more than a half-million routers with search-hijacking malware, and ultimately spawned an entire security industry working group focused on containing and eradicating it.
XLAB
Much of what is published so far on Kimwolf has come from the Chinese security firm XLab, which was the first to chronicle the rise of the Aisuru botnet in late 2024. In its latest blog post, XLab said it began tracking Kimwolf on October 24, when the botnet’s control servers were swamping Cloudflare’s DNS servers with lookups for the distinctive domain 14emeliaterracewestroxburyma02132[.]su.
This domain and others connected to early Kimwolf variants spent several weeks topping Cloudflare’s chart of the Internet’s most sought-after domains, edging out Google.com and Apple.com of their rightful spots in the top 5 most-requested domains. That’s because during that time Kimwolf was asking its millions of bots to check in frequently using Cloudflare’s DNS servers.

The Chinese security firm XLab found the Kimwolf botnet had enslaved between 1.8 and 2 million devices, with heavy concentrations in Brazil, India, The United States of America and Argentina. Image: blog.xLab.qianxin.com
It is clear from reading the XLab report that KrebsOnSecurity (and security experts) probably erred in misattributing some of Kimwolf’s early activities to the Aisuru botnet, which appears to be operated by a different group entirely. IPDEA may have been truthful when it said it had no affiliation with the Aisuru botnet, but Brundage’s data left no doubt that its proxy service clearly was being massively abused by Aisuru’s Android variant, Kimwolf.
XLab said Kimwolf has infected at least 1.8 million devices, and has shown it is able to rebuild itself quickly from scratch.
“Analysis indicates that Kimwolf’s primary infection targets are TV boxes deployed in residential network environments,” XLab researchers wrote. “Since residential networks usually adopt dynamic IP allocation mechanisms, the public IPs of devices change over time, so the true scale of infected devices cannot be accurately measured solely by the quantity of IPs. In other words, the cumulative observation of 2.7 million IP addresses does not equate to 2.7 million infected devices.”
XLab said measuring Kimwolf’s size also is difficult because infected devices are distributed across multiple global time zones. “Affected by time zone differences and usage habits (e.g., turning off devices at night, not using TV boxes during holidays, etc.), these devices are not online simultaneously, further increasing the difficulty of comprehensive observation through a single time window,” the blog post observed.
XLab noted that the Kimwolf author “shows an almost ‘obsessive’ fixation on Yours Truly, apparently leaving “easter eggs” related to my name in multiple places through the botnet’s code and communications:

Image: XLAB.
ANALYSIS AND ADVICE
One frustrating aspect of threats like Kimwolf is that in most cases it is not easy for the average user to determine if there are any devices on their internal network which may be vulnerable to threats like Kimwolf and/or already infected with residential proxy malware.
Let’s assume that through years of security training or some dark magic you can successfully identify that residential proxy activity on your internal network was linked to a specific mobile device inside your house: From there, you’d still need to isolate and remove the app or unwanted component that is turning the device into a residential proxy.
Also, the tooling and knowledge needed to achieve this kind of visibility just isn’t there from an average consumer standpoint. The work that it takes to configure your network so you can see and interpret logs of all traffic coming in and out is largely beyond the skillset of most Internet users (and, I’d wager, many security experts). But it’s a topic worth exploring in an upcoming story.
Happily, Synthient has erected a page on its website that will state whether a visitor’s public Internet address was seen among those of Kimwolf-infected systems. Brundage also has compiled a list of the unofficial Android TV boxes that are most highly represented in the Kimwolf botnet.
If you own a TV box that matches one of these model names and/or numbers, please just rip it out of your network. If you encounter one of these devices on the network of a family member or friend, send them a link to this story and explain that it’s not worth the potential hassle and harm created by keeping them plugged in.

The top 15 product devices represented in the Kimwolf botnet, according to Synthient.
Chad Seaman is a principal security researcher with Akamai Technologies. Seaman said he wants more consumers to be wary of these pseudo Android TV boxes to the point where they avoid them altogether.
“I want the consumer to be paranoid of these crappy devices and of these residential proxy schemes,” he said. “We need to highlight why they’re dangerous to everyone and to the individual. The whole security model where people think their LAN (Local Internal Network) is safe, that there aren’t any bad guys on the LAN so it can’t be that dangerous is just really outdated now.”
“The idea that an app can enable this type of abuse on my network and other networks, that should really give you pause,” about which devices to allow onto your local network, Seaman said. “And it’s not just Android devices here. Some of these proxy services have SDKs for Mac and Windows, and the iPhone. It could be running something that inadvertently cracks open your network and lets countless random people inside.”
In July 2025, Google filed a “John Doe” lawsuit (PDF) against 25 unidentified defendants collectively dubbed the “BadBox 2.0 Enterprise,” which Google described as a botnet of over ten million unsanctioned Android streaming devices engaged in advertising fraud. Google said the BADBOX 2.0 botnet, in addition to compromising multiple types of devices prior to purchase, also can infect devices by requiring the download of malicious apps from unofficial marketplaces.
Google’s lawsuit came on the heels of a June 2025 advisory from the Federal Bureau of Investigation (FBI), which warned that cyber criminals were gaining unauthorized access to home networks by either configuring the products with malware prior to the user’s purchase, or infecting the device as it downloads required applications that contain backdoors — usually during the set-up process.
The FBI said BADBOX 2.0 was discovered after the original BADBOX campaign was disrupted in 2024. The original BADBOX was identified in 2023, and primarily consisted of Android operating system devices that were compromised with backdoor malware prior to purchase.
Lindsay Kaye is vice president of threat intelligence at HUMAN Security, a company that worked closely on the BADBOX investigations. Kaye said the BADBOX botnets and the residential proxy networks that rode on top of compromised devices were detected because they enabled a ridiculous amount of advertising fraud, as well as ticket scalping, retail fraud, account takeovers and content scraping.
Kaye said consumers should stick to known brands when it comes to purchasing things that require a wired or wireless connection.
“If people are asking what they can do to avoid being victimized by proxies, it’s safest to stick with name brands,” Kaye said. “Anything promising something for free or low-cost, or giving you something for nothing just isn’t worth it. And be careful about what apps you allow on your phone.”
Many wireless routers these days make it relatively easy to deploy a “Guest” wireless network on-the-fly. Doing so allows your guests to browse the Internet just fine but it blocks their device from being able to talk to other devices on the local network — such as shared folders, printers and drives. If someone — a friend, family member, or contractor — requests access to your network, give them the guest Wi-Fi network credentials if you have that option.
There is a small but vocal pro-piracy camp that is almost condescendingly dismissive of the security threats posed by these unsanctioned Android TV boxes. These tech purists positively chafe at the idea of people wholesale discarding one of these TV boxes. A common refrain from this camp is that Internet-connected devices are not inherently bad or good, and that even factory-infected boxes can be flashed with new firmware or custom ROMs that contain no known dodgy software.
However, it’s important to point out that the majority of people buying these devices are not security or hardware experts; the devices are sought out because they dangle something of value for “free.” Most buyers have no idea of the bargain they’re making when plugging one of these dodgy TV boxes into their network.
It is somewhat remarkable that we haven’t yet seen the entertainment industry applying more visible pressure on the major e-commerce vendors to stop peddling this insecure and actively malicious hardware that is largely made and marketed for video piracy. These TV boxes are a public nuisance for bundling malicious software while having no apparent security or authentication built-in, and these two qualities make them an attractive nuisance for cybercriminals.
Stay tuned for Part II in this series, which will poke through clues left behind by the people who appear to have built Kimwolf and benefited from it the most.
Flock Exposes Its AI-Enabled Surveillance Cameras
404 Media has the story:
Unlike many of Flock’s cameras, which are designed to capture license plates as people drive by, Flock’s Condor cameras are pan-tilt-zoom (PTZ) cameras designed to record and track people, not vehicles. Condor cameras can be set to automatically zoom in on people’s faces as they walk through a parking lot, down a public street, or play on a playground, or they can be controlled manually, according to marketing material on Flock’s website. We watched Condor cameras zoom in on a woman walking her dog on a bike path in suburban Atlanta; a camera followed a man walking through a Macy’s parking lot in Bakersfield; surveil children swinging on a swingset at a playground; and film high-res video of people sitting at a stoplight in traffic. In one case, we were able to watch a man rollerblade down Brookhaven, Georgia’s Peachtree Creek Greenway bike path. The Flock camera zoomed in on him and tracked him as he rolled past. Minutes later, he showed up on another exposed camera livestream further down the bike path. The camera’s resolution was good enough that we were able to see that, when he stopped beneath one of the cameras, he was watching rollerblading videos on his phone...
Credential Stuffing Attacks Vs. Brute Force Attacks - What is the difference?
Kubernetes 1.35: Enhanced Debugging with Versioned z-pages APIs
Debugging Kubernetes control plane components can be challenging, especially when you need to quickly understand the runtime state of a component or verify its configuration. With Kubernetes 1.35, we're enhancing the z-pages debugging endpoints with structured, machine-parseable responses that make it easier to build tooling and automate troubleshooting workflows.
What are z-pages?
z-pages are special debugging endpoints exposed by Kubernetes control plane components. Introduced as an alpha feature in Kubernetes 1.32, these endpoints provide runtime diagnostics for components like kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and kube-proxy. The name "z-pages" comes from the convention of using /*z paths for debugging endpoints.
Currently, Kubernetes supports two primary z-page endpoints:
/statusz- Displays high-level component information including version information, start time, uptime, and available debug paths
/flagz- Shows all command-line arguments and their values used to start the component (with confidential values redacted for security)
These endpoints are valuable for human operators who need to quickly inspect component state, but until now, they only returned plain text output that was difficult to parse programmatically.
What's new in Kubernetes 1.35?
Kubernetes 1.35 introduces structured, versioned responses for both /statusz and /flagz endpoints. This enhancement maintains backward compatibility with the existing plain text format while adding support for machine-readable JSON responses.
Backward compatible design
The new structured responses are opt-in. Without specifying an Accept header, the endpoints continue to return the familiar plain text format:
$ curl --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
https://localhost:6443/statusz
kube-apiserver statusz
Warning: This endpoint is not meant to be machine parseable, has no formatting compatibility guarantees and is for debugging purposes only.
Started: Wed Oct 16 21:03:43 UTC 2024
Up: 0 hr 00 min 16 sec
Go version: go1.23.2
Binary version: 1.35.0-alpha.0.1595
Emulation version: 1.35
Paths: /healthz /livez /metrics /readyz /statusz /version
Structured JSON responses
To receive a structured response, include the appropriate Accept header:
Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz
This returns a versioned JSON response:
{
"kind": "Statusz",
"apiVersion": "config.k8s.io/v1alpha1",
"metadata": {
"name": "kube-apiserver"
},
"startTime": "2025-10-29T00:30:01Z",
"uptimeSeconds": 856,
"goVersion": "go1.23.2",
"binaryVersion": "1.35.0",
"emulationVersion": "1.35",
"paths": [
"/healthz",
"/livez",
"/metrics",
"/readyz",
"/statusz",
"/version"
]
}
Similarly, /flagz supports structured responses with the header:
Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz
Example response:
{
"kind": "Flagz",
"apiVersion": "config.k8s.io/v1alpha1",
"metadata": {
"name": "kube-apiserver"
},
"flags": {
"advertise-address": "192.168.8.4",
"allow-privileged": "true",
"authorization-mode": "[Node,RBAC]",
"enable-priority-and-fairness": "true",
"profiling": "true"
}
}
Why structured responses matter
The addition of structured responses opens up several new possibilities:
1. Automated health checks and monitoring
Instead of parsing plain text, monitoring tools can now easily extract specific fields. For example, you can programmatically check if a component has been running with an unexpected emulated version or verify that critical flags are set correctly.
2. Better debugging tools
Developers can build sophisticated debugging tools that compare configurations across multiple components or track configuration drift over time. The structured format makes it trivial to diff configurations or validate that components are running with expected settings.
3. API versioning and stability
By introducing versioned APIs (starting with v1alpha1), we provide a clear path to stability. As the feature matures, we'll introduce v1beta1 and eventually v1, giving you confidence that your tooling won't break with future Kubernetes releases.
How to use structured z-pages
Prerequisites
Both endpoints require feature gates to be enabled:
/statusz: Enable theComponentStatuszfeature gate/flagz: Enable theComponentFlagzfeature gate
Example: Getting structured responses
Here's an example using curl to retrieve structured JSON responses from the kube-apiserver:
# Get structured statusz response
curl \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz" \
https://localhost:6443/statusz | jq .
# Get structured flagz response
curl \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz" \
https://localhost:6443/flagz | jq .
Note:
The examples above use client certificate authentication and verify the server's certificate using--cacert.
If you need to bypass certificate verification in a test environment, you can use --insecure (or -k),
but this should never be done in production as it makes you vulnerable to man-in-the-middle attacks.
Important considerations
Alpha feature status
The structured z-page responses are an alpha feature in Kubernetes 1.35. This means:
- The API format may change in future releases
- These endpoints are intended for debugging, not production automation
- You should avoid relying on them for critical monitoring workflows until they reach beta or stable status
Security and access control
z-pages expose internal component information and require proper access controls. Here are the key security considerations:
Authorization: Access to z-page endpoints is restricted to members of the system:monitoring group, which follows the same authorization model as other debugging endpoints like /healthz, /livez, and /readyz. This ensures that only authorized users and service accounts can access debugging information. If your cluster uses RBAC, you can manage access by granting appropriate permissions to this group.
Authentication: The authentication requirements for these endpoints depend on your cluster's configuration. Unless anonymous authentication is enabled for your cluster, you typically need to use authentication mechanisms (such as client certificates) to access these endpoints.
Information disclosure: These endpoints reveal configuration details about your cluster components, including:
- Component versions and build information
- All command-line arguments and their values (with confidential values redacted)
- Available debug endpoints
Only grant access to trusted operators and debugging tools. Avoid exposing these endpoints to unauthorized users or automated systems that don't require this level of access.
Future evolution
As the feature matures, we (Kubernetes SIG Instrumentation) expect to:
- Introduce
v1beta1and eventuallyv1versions of the API - Gather community feedback on the response schema
- Potentially add additional z-page endpoints based on user needs
Try it out
We encourage you to experiment with structured z-pages in a test environment:
- Enable the
ComponentStatuszandComponentFlagzfeature gates on your control plane components - Try querying the endpoints with both plain text and structured formats
- Build a simple tool or script that uses the structured data
- Share your feedback with the community
Learn more
- z-pages documentation
- KEP-4827: Component Statusz
- KEP-4828: Component Flagz
- Join the discussion in the #sig-instrumentation channel on Kubernetes Slack
Get involved
We'd love to hear your feedback! The structured z-pages feature is designed to make Kubernetes easier to debug and monitor. Whether you're building internal tooling, contributing to open source projects, or just exploring the feature, your input helps shape the future of Kubernetes observability.
If you have questions, suggestions, or run into issues, please reach out to SIG Instrumentation. You can find us on Slack or at our regular community meetings.
Happy debugging!
LinkedIn Job Scams
Interesting article on the variety of LinkedIn job scams around the world:
In India, tech jobs are used as bait because the industry employs millions of people and offers high-paying roles. In Kenya, the recruitment industry is largely unorganized, so scamsters leverage fake personal referrals. In Mexico, bad actors capitalize on the informal nature of the job economy by advertising fake formal roles that carry a promise of security. In Nigeria, scamsters often manage to get LinkedIn users to share their login credentials with the lure of paid work, preying on their desperation amid an especially acute unemployment crisis...
Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager
Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM) implementations built using the k8s.io/cloud-provider library reconciles routes at a fixed interval. This causes unnecessary API requests to the cloud provider when there are no changes to routes. Other controllers implemented through the same library already use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.
What's new?
The feature gate CloudControllerManagerWatchBasedRoutesReconciliation has been
introduced to k8s.io/cloud-provider in alpha stage by SIG Cloud Provider.
To enable this feature you can use --feature-gate=CloudControllerManagerWatchBasedRoutesReconciliation=true
in the CCM implementation you are using.
About the feature gate
This feature gate will trigger the route reconciliation loop whenever a node is
added, deleted, or the fields .spec.podCIDRs or .status.addresses are updated.
An additional reconcile is performed in a random interval between 12h and 24h, which is chosen at the controller's start time.
This feature gate does not modify the logic within the reconciliation loop. Therefore, users of a CCM implementation should not experience significant changes to their existing route configurations.
How can I learn more?
For more details, refer to the KEP-5237.
Using AI-Generated Images to Get Refunds
Scammers are generating images of broken merchandise in order to apply for refunds.
Happy 16th Birthday, KrebsOnSecurity.com!
KrebsOnSecurity.com celebrates its 16th anniversary today! A huge “thank you” to all of our readers — newcomers, long-timers and drive-by critics alike. Your engagement this past year here has been tremendous and truly a salve on a handful of dark days. Happily, comeuppance was a strong theme running through our coverage in 2025, with a primary focus on entities that enabled complex and globally-dispersed cybercrime services.

Image: Shutterstock, Younes Stiller Kraske.
In May 2024, we scrutinized the history and ownership of Stark Industries Solutions Ltd., a “bulletproof hosting” provider that came online just two weeks before Russia invaded Ukraine and served as a primary staging ground for repeated Kremlin cyberattacks and disinformation efforts. A year later, Stark and its two co-owners were sanctioned by the European Union, but our analysis showed those penalties have done little to stop the Stark proprietors from rebranding and transferring considerable network assets to other entities they control.
In December 2024, KrebsOnSecurity profiled Cryptomus, a financial firm registered in Canada that emerged as the payment processor of choice for dozens of Russian cryptocurrency exchanges and websites hawking cybercrime services aimed at Russian-speaking customers. In October 2025, Canadian financial regulators ruled that Cryptomus had grossly violated its anti-money laundering laws, and levied a record $176 million fine against the platform.

In September 2023, KrebsOnSecurity published findings from researchers who concluded that a series of six-figure cyberheists across dozens of victims resulted from thieves cracking master passwords stolen from the password manager service LastPass in 2022. In a court filing in March 2025, U.S. federal agents investigating a spectacular $150 million cryptocurrency heist said they had reached the same conclusion.
Phishing was a major theme of this year’s coverage, which peered inside the day-to-day operations of several voice phishing gangs that routinely carried out elaborate, convincing, and financially devastating cryptocurrency thefts. A Day in the Life of a Prolific Voice Phishing Crew examined how one cybercrime gang routinely abused legitimate services at Apple and Google to force a variety of outbound communications to their users, including emails, automated phone calls and system-level messages sent to all signed-in devices.
Nearly a half-dozen stories in 2025 dissected the incessant SMS phishing or “smishing” coming from China-based phishing kit vendors, who make it easy for customers to convert phished payment card data into mobile wallets from Apple and Google.
In January, we highlighted research into a dodgy and sprawling content delivery network called Funnull that specialized in helping China-based gambling and money laundering websites distribute their operations across multiple U.S.-based cloud providers. Five months later, the U.S. government sanctioned Funnull, identifying it as a top source of investment/romance scams known as “pig butchering.”

Image: Shutterstock, ArtHead.
In May, Pakistan arrested 21 people alleged to be working for Heartsender, a phishing and malware dissemination service that KrebsOnSecurity first profiled back in 2015. The arrests came shortly after the FBI and the Dutch police seized dozens of servers and domains for the group. Many of those arrested were first publicly identified in a 2021 story here about how they’d inadvertently infected their computers with malware that gave away their real-life identities.
In April, the U.S. Department of Justice indicted the proprietors of a Pakistan-based e-commerce company for conspiring to distribute synthetic opioids in the United States. The following month, KrebsOnSecurity detailed how the proprietors of the sanctioned entity are perhaps better known for operating an elaborate and lengthy scheme to scam westerners seeking help with trademarks, book writing, mobile app development and logo designs.
Earlier this month, we examined an academic cheating empire turbocharged by Google Ads that earned tens of millions of dollars in revenue and has curious ties to a Kremlin-connected oligarch whose Russian university builds drones for Russia’s war against Ukraine.

An attack drone advertised the website hosted on the same network as Russia’s largest private education company — Synergy University.
As ever, KrebsOnSecurity endeavored to keep close tabs on the world’s biggest and most disruptive botnets, which pummeled the Internet this year with distributed denial-of-service (DDoS) assaults that were two to three times the size and impact of previous record DDoS attacks.
In June, KrebsOnSecurity.com was hit by the largest DDoS attack that Google had ever mitigated at the time (we are a grateful guest of Google’s excellent Project Shield offering). Experts blamed that attack on an Internet-of-Things botnet called Aisuru that had rapidly grown in size and firepower since its debut in late 2024. Another Aisuru attack on Cloudflare just days later practically doubled the size of the June attack against this website. Not long after that, Aisuru was blamed for a DDoS that again doubled the previous record.
In October, it appeared the cybercriminals in control of Aisuru had shifted the botnet’s focus from DDoS to a more sustainable and profitable use: Renting hundreds of thousands of infected Internet of Things (IoT) devices to proxy services that help cybercriminals anonymize their traffic.
However, it has recently become clear that at least some of the disruptive botnet and residential proxy activity attributed to Aisuru last year likely was the work of people responsible for building and testing a powerful botnet known as Kimwolf. Chinese security firm XLab, which was the first to chronicle Aisuru’s rise in 2024, recently profiled Kimwolf as easily the world’s biggest and most dangerous collection of compromised machines — with approximately 1.83 million devices under its thumb as of December 17.
XLab noted that the Kimwolf author “shows an almost ‘obsessive’ fixation on the well-known cybersecurity investigative journalist Brian Krebs, leaving easter eggs related to him in multiple places.”

Image: XLab, Kimwolf Botnet Exposed: The Massive Android Botnet with 1.8 million infected devices.
I am happy to report that the first KrebsOnSecurity stories of 2026 will go deep into the origins of Kimwolf, and examine the botnet’s unique and highly invasive means of spreading digital disease far and wide. The first in that series will include a somewhat sobering and global security notification concerning the devices and residential proxy services that are inadvertently helping to power Kimwolf’s rapid growth.
Thank you once again for your continued readership, encouragement and support. If you like the content we publish at KrebsOnSecurity.com, please consider making an exception for our domain in your ad blocker. The ads we run are limited to a handful of static images that are all served in-house and vetted by me (there is no third-party content on this site, period). Doing so would help further support the work you see here almost every week.
And if you haven’t done so yet, sign up for our email newsletter! (62,000 other subscribers can’t be wrong, right?). The newsletter is just a plain text email that goes out the moment a new story is published. We send between one and two emails a week, we never share our email list, and we don’t run surveys or promotions.
Thanks again, and Happy New Year everyone! Be safe out there.