When Elasticsearch Is Not the Problem: How a Backup Mount Froze an Entire Cluster

At first, the symptoms looked like a familiar Elasticsearch outage. One node stopped responding, cluster health dropped to yellow, replica shards became unassigned, HTTP requests returned nothing, and standard diagnostic tools painted an increasingly confusing picture. It felt like Elasticsearch had failed. In reality, Elasticsearch was only the victim. The real problem was much lower in the stack: a hanging network mount used only for backups.

This case is worth documenting because it shows how easily infrastructure teams can lose hours investigating the wrong layer. The cluster was unhealthy, but Elasticsearch was not the root cause. The real issue was an I/O and mount problem that blocked normal system calls and prevented the Java process from completing startup.

The Setup

The environment consisted of multiple Elasticsearch 2.3.5 clusters running on Ubuntu-based servers. One of the affected clusters used two nodes, with primary shards on one machine and replica shards intended to live on the other. A second cluster served a different market and operated independently. This distinction became important during troubleshooting, because some symptoms belonged to one cluster and some to the other.

At one point, cluster health looked like this:

{
  "cluster_name" : "vindazofr",
  "status" : "yellow",
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 30,
  "active_shards" : 30,
  "unassigned_shards" : 30
}

That output suggested a simple story: one node was gone, and all replica shards were left unassigned. But the deeper problem was still hidden.

The Misleading Symptoms

Several things made this incident difficult to diagnose. Systemd reported Elasticsearch as running, yet the service did not answer on port 9200. Port 9300 was not listening either. Manual curl requests to the node returned nothing. The Elasticsearch process was visible in ps, but memory consumption stayed surprisingly low for a node that was supposedly alive. Standard tools also behaved strangely. Commands such as df -h and lsof became unusually slow, and at times appeared to hang.

That combination matters. When Elasticsearch is genuinely broken because of index corruption, bad cluster state, or allocation failures, the operating system itself normally remains responsive. In this case, basic filesystem inspection was slow. That was the clue that changed the direction of the investigation.

The Real Cause: A Hanging CIFS Backup Mount

The turning point came when the system showed warnings related to a CIFS mount used only for backup storage. This mount was not part of Elasticsearch data storage. It was not hosting shards. It was not used for the live index path. And yet it was enough to stall normal filesystem operations.

That is the trap. On Linux, an unresponsive network filesystem can affect unrelated processes because common system calls such as stat, directory traversal, or mount enumeration may block while the kernel waits for the remote filesystem. Elasticsearch performs a substantial amount of filesystem work during startup: checking data paths, reading metadata, inspecting locks, and validating directories. If the system is hanging on an unrelated mount, Elasticsearch can appear to freeze before it ever binds to HTTP or transport ports.
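
One quick way to confirm blocked I/O at the operating system level, independent of Elasticsearch, is to look for processes stuck in uninterruptible sleep and for CIFS complaints in the kernel log. A minimal sketch:

# processes in state D (uninterruptible sleep) usually indicate blocked I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# the cifs kernel module logs warnings when the remote server stops responding
dmesg | grep -i cifs | tail -n 20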

In other words, this was not an Elasticsearch configuration failure and not a shard corruption issue. It was an operating system I/O problem caused by a hanging backup mount.

Why the Cluster Turned Yellow

Once the affected node stopped working correctly, the cluster could continue with its primary shards on the healthy node, but all replica shards became unavailable. That led to a yellow cluster rather than a red one. The distinction is important.

A yellow cluster means the data is still available because primary shards are active, but redundancy has been lost because replica shards are not assigned. A red cluster would indicate missing primary shards and active data loss or unavailability. In this case, the cluster stayed yellow because the healthy node still held the primaries.
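
If it is unclear which indices are affected, the health API can report per-index status. This is a standard parameter rather than anything specific to this setup:

curl -s 'http://localhost:9200/_cluster/health?level=indices&pretty'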

After the failing node’s data directory was reset, the node had to rejoin as a clean participant and rebuild its shard copies from the healthy node. That is why the cluster remained yellow for a while even after Elasticsearch came back: the node was rebuilding replicas, not serving them yet.
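
The rebuild itself can be observed while it runs. A small sketch, using the same localhost endpoint as the other commands in this post; filtering out completed recoveries leaves only the shards still being copied:

curl -s 'http://localhost:9200/_cat/recovery?v' | grep -v done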

How We Confirmed It Was Not Really Elasticsearch

Several checks pointed away from Elasticsearch as the primary cause:

  • df -h was abnormally slow. That immediately suggested a filesystem-level issue.
  • ss -ltnp showed no Elasticsearch ports open, even though the Java process existed.
  • strace showed the JVM waiting internally, consistent with blocked initialization rather than normal network activity.
  • Foreground startup produced no useful Elasticsearch output, which often happens when the process is blocked before normal logging becomes active.
  • Unmounting the backup share restored normal system responsiveness and allowed Elasticsearch to start correctly.

That final point settled the case. Once the problematic mount was lazily unmounted, the system behaved normally again, and Elasticsearch resumed startup. The cluster then moved into a normal recovery state.
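
For completeness, the resolution boiled down to a short sequence, using the same mount point and commands shown later in this post:

umount -l /mnt/backupbox
systemctl restart elasticsearch
ss -ltnp | egrep '9200|9300'
curl -s http://localhost:9200/_cluster/health?pretty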

The Commands That Helped Diagnose Cluster Health and Startup Problems

These are the commands that proved most useful during the investigation.

Basic Cluster Health

curl -s http://localhost:9200/_cluster/health?pretty
curl -s http://138.201.8.XXX:9200/_cluster/health?pretty
curl -s http://159.69.65.XXX:9200/_cluster/health?pretty

Node Visibility

curl -s http://localhost:9200/_cat/nodes?v
curl -s http://144.76.157.XX:9200/_cat/nodes?v
curl -s http://159.69.65.XXX:9200/_cat/nodes?v

Shard Allocation and State

curl -s http://localhost:9200/_cat/shards?v
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason'
curl -s http://localhost:9200/_cat/allocation?v
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

Service and Process Checks

systemctl status elasticsearch --no-pager -l
ps -ef | grep '[e]lasticsearch'
pgrep -af elasticsearch
ss -ltnp | egrep '9200|9300'

Disk and Filesystem Checks

df -h
df -hT
df -i
du -sh /var/lib/elasticsearch/*
namei -l /var/lib/elasticsearch
find /var/lib/elasticsearch -maxdepth 3 -ls | head -n 80

Mount and CIFS Investigation

mount | egrep 'cifs|backupbox'
timeout 5 ls /mnt/backupbox
timeout 5 stat /mnt/backupbox
umount -l /mnt/backupbox

Deeper Startup Analysis

journalctl -u elasticsearch -n 200 --no-pager
tail -n 100 /var/log/elasticsearch/*.log
strace -tt -s 200 -f -p <PID>
jstack <PID>
lsof -p <PID>

Minimal Logging Configuration for Elasticsearch

One of the side problems in this case was log growth. At one stage, Elasticsearch logs had exploded to hundreds of gigabytes, which created additional disk pressure and made the problem harder to reason about. The safest short-term response was to move to minimal logging.
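
Before changing any configuration, it is worth confirming which files are actually consuming the space. Truncating an open log file frees the space without restarting the service; the file name below is a placeholder for whichever log turns out to be the offender:

du -sh /var/log/elasticsearch/* | sort -h | tail
# <large_log_file> is a placeholder, not a real file name from this environment
truncate -s 0 /var/log/elasticsearch/<large_log_file>.log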

This configuration keeps logging extremely limited while still leaving enough visibility for serious errors:

# /etc/elasticsearch/logging.yml

es.logger.level: ERROR
rootLogger: ${es.logger.level}, console

logger:
  action: ERROR
  deprecation: ERROR
  com.amazonaws: ERROR
  com.amazonaws.jmx.SdkMBeanRegistrySupport: ERROR
  com.amazonaws.metrics.AwsSdkMetrics: ERROR
  org.apache.http: ERROR

additivity:
  index.search.slowlog: false
  index.indexing.slowlog: false
  deprecation: false

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %.10000m%n"

  deprecation_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_deprecation.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  index_search_slow_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_index_search_slowlog.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  index_indexing_slow_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_index_indexing_slowlog.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

If file logging is needed temporarily for troubleshooting, rootLogger can be changed to include file as well. During emergencies, however, keeping only console logging can reduce disk churn.
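
For reference, that variant of the root logger line, matching the file appender defined above, would be:

rootLogger: ${es.logger.level}, console, file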

Cluster Configuration Notes for Elasticsearch 2.3.5

Another issue uncovered during the investigation was configuration drift. Some nodes had newer-style Elasticsearch settings mixed into an Elasticsearch 2.3.5 environment. That is risky: the older version silently ignores settings it does not recognize, and the mixed syntax makes it hard to tell which values are actually in effect while debugging.

A cleaner node configuration for Elasticsearch 2.3.5 looks like this:

cluster.name: vindazofr
node.name: vindazofrDb

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

network.host: 159.69.65.XXX
http.port: 9200

discovery.zen.ping.unicast.hosts: ["159.69.65.XXX", "144.76.157.XX"]
discovery.zen.minimum_master_nodes: 2

node.master: true
node.data: true

cluster.routing.allocation.enable: all

For Elasticsearch 2.x, it is best to avoid mixing in newer syntax such as node.roles or discovery.seed_hosts.
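
A quick way to spot this kind of drift is to ask each node which settings it actually loaded. The nodes info API supports this in 2.x:

curl -s 'http://localhost:9200/_nodes/settings?pretty'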

What Happened After the Data Directory Was Reset

Once it became clear that the problematic node would be harder to repair than to rebuild, the fastest option was chosen: the local Elasticsearch data directory on the affected node was moved aside, the service was restarted, and the node was allowed to rejoin empty. Elasticsearch then began to recover replicas from the healthy node.
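
The reset itself was straightforward. A minimal sketch, assuming the default data path from the configuration above and the standard elasticsearch service user; it deliberately discards the node's local shard copies and is only safe because the other node still held all primaries:

systemctl stop elasticsearch
# move the old data directory aside (path.data from the config above)
mv /var/lib/elasticsearch /var/lib/elasticsearch.broken-$(date +%F)
mkdir -p /var/lib/elasticsearch
# ownership assumes the standard elasticsearch service user from the package
chown elasticsearch:elasticsearch /var/lib/elasticsearch
systemctl start elasticsearch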

That process immediately explained the new health output:

{
  "status" : "yellow",
  "number_of_nodes" : 2,
  "active_primary_shards" : 30,
  "active_shards" : 41,
  "initializing_shards" : 2,
  "unassigned_shards" : 17
}

This was expected and healthy. The node was no longer failing to start. It was simply rebuilding. Yellow was not a sign of a new error in this phase. It was the normal intermediate state between a clean rejoin and a fully green cluster.
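
Progress toward green can be polled with the same health endpoint; wait_for_status blocks until the requested state is reached or the timeout expires:

curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty'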

The Most Important Lesson

The biggest lesson from this incident is simple: when Elasticsearch behaves strangely, do not stop at Elasticsearch. If the operating system itself is slow to answer basic filesystem commands, the real failure may be at the mount, storage, or kernel level.

In this case, a backup disk that “did nothing” in daily Elasticsearch operations still managed to freeze startup and destabilize cluster recovery. That is what makes this class of problem so dangerous. It is indirect. It is misleading. And it can consume hours if all attention remains fixed on Elasticsearch alone.

Recommendations for Production Systems

To avoid repeating this kind of incident, several operational improvements are worth considering.

  • Do not keep fragile CIFS backup mounts permanently attached on production search nodes unless absolutely necessary.
  • Prefer autofs or on-demand mounts over always-on network shares (see the fstab sketch after this list).
  • Use rsync or pull-based backup jobs instead of mounted remote storage where possible.
  • Keep Elasticsearch logging conservative unless actively debugging.
  • Monitor not only Elasticsearch health, but also basic host responsiveness. A slow df command is itself a critical signal.
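
One way to implement the on-demand recommendation is a systemd automount entry in /etc/fstab combined with a soft CIFS mount, so an unreachable server eventually returns I/O errors instead of hanging forever. The server name, share, and credentials path below are placeholders, not values from this environment:

# /etc/fstab -- server, share, and credentials file are placeholders
//backupserver/backups  /mnt/backupbox  cifs  credentials=/etc/cifs-credentials,_netdev,noauto,x-systemd.automount,x-systemd.idle-timeout=60,soft  0  0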

Final Thought

It is easy to blame Elasticsearch when shards are unassigned, nodes disappear, and recovery stalls. But not every Elasticsearch outage starts in Elasticsearch. Sometimes the search engine is healthy enough to recover, and the real fault lies in the surrounding operating system.

That was the case here. The cluster was noisy. The symptoms were dramatic. But the root cause was much simpler:

This was not an Elasticsearch problem. It was an I/O and mount problem.
