A “failed disk” revealed a nine-month silent Exadata failure.

Most service organizations treat datacenter infrastructure problems as hardware problems.

A disk fails? Replace it. A controller misbehaves? Swap it.

Oracle Engineered Systems like Exadata need to be treated differently. Their real complexity, and the majority of their harder failures, lives above the hardware layer, inside the interlocking software stack that Oracle never publicly documents.

A recent incident with a current customer (a top 100 US retailer), whose Exadata X6-2 half rack supports a large portion of their in-store revenue, demonstrates exactly why Exadata requires more than hardware break/fix support.

And why Natrinsic is the only third-party support provider with the experience to diagnose and repair these issues.


A Routine Ticket That Quickly Became Anything But

The customer opened a ticket for a predictive drive failure. Standard. Routine. The new disk was installed and the system attempted to rebalance.

It failed.

And failed again.

That was the moment Perry West, Natrinsic’s Director of Engineered Systems Support, knew something wasn’t right.

This wasn’t a disk problem.

The Discovery: An Entire Exadata Storage Cell Had Been Failing for Nine Months

As Perry dug deeper, the real story emerged:

  • All 12 disks in the affected cell were in a dropped or attempting-to-drop state

  • This condition had persisted for nine months

  • More than 30 known recovery commands had failed

  • Several disks showed negative utilization, meaning Exadata had dipped into its emergency space

  • The RAID controller was running at 115°C, high enough to interrupt rebalances

  • The cell was essentially offline, yet the customer didn’t know

Why?
Because Exadata is so resilient that losing an entire cell often doesn’t cause an immediate outage.

The system silently absorbed the failure.

The Root Cause: A Patching Mistake

The turning point came when Perry asked a simple question:

“What happened on this date in February?”

The customer answered:
“Oracle patched our systems.”

And then:

“On that same day, the patching team found a bad controller, replaced it, and continued the patch.”

There it was.

The controller was replaced while the cell was in the middle of a rebalance.
Rebalancing and patching at the same time is a forbidden combination. The result is exactly what we found:

All disks in the cell were severed from the configuration.

And the failure remained hidden for nine months.
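
The rule behind this failure is simple enough to automate as a pre-maintenance gate. What follows is a minimal sketch in Python, not Oracle tooling and not the check Natrinsic runs: the `RebalanceOp` record, its field names, and the "REBAL" and "DONE" labels are assumptions made for illustration. A real gate would have to fill these records from whatever the environment reports about in-flight rebalance operations.

```python
from dataclasses import dataclass


@dataclass
class RebalanceOp:
    """One storage operation as seen by a hypothetical monitoring feed."""
    disk_group: str
    operation: str  # assumed label, e.g. "REBAL" for a rebalance
    state: str      # assumed label, e.g. "RUN", "WAIT", "DONE"


def blocking_operations(ops):
    """Return every rebalance that has not finished yet."""
    return [
        op for op in ops
        if op.operation == "REBAL" and op.state != "DONE"
    ]


def safe_to_start_maintenance(ops):
    """Allow patching or part swaps only when no rebalance is pending."""
    return not blocking_operations(ops)


if __name__ == "__main__":
    in_flight = [RebalanceOp("DATA", "REBAL", "RUN")]
    if not safe_to_start_maintenance(in_flight):
        for op in blocking_operations(in_flight):
            print(f"Hold maintenance: {op.disk_group} rebalance is {op.state}")
```

The gate fails closed: a rebalance in any state other than finished blocks the work, so a stalled or waiting rebalance stops the patch just as a running one does.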


The Fix: Deep Engineered-Systems Expertise

To resolve the issue, Natrinsic had to go far beyond hardware break-fix:

  • Performed a full undrop of all 12 disks — a recovery method most teams never use and few even know about

  • Cleared out KEP files, unused databases, and stale allocations so the undrop could succeed

  • Replaced the overheating controller

  • Restored the ASM and cell disk configuration

  • Completed the rebalance for the first time since February

  • Delivered new monitoring logic so this condition can never go unnoticed again
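
To give a feel for what monitoring of this kind might check, here is a minimal sketch in Python. It is not the logic Natrinsic delivered. The `DiskStatus` record, the state labels, and the 100°C threshold are all assumptions chosen for illustration. It looks for the three symptoms from this incident: every disk in a cell out of its normal state, negative utilization, and an overheating controller.

```python
from dataclasses import dataclass

HEALTHY_STATE = "NORMAL"         # assumed label for a healthy disk
MAX_CONTROLLER_TEMP_C = 100.0    # hypothetical alert threshold


@dataclass
class DiskStatus:
    """One disk as reported by a hypothetical collector."""
    cell: str
    name: str
    state: str              # assumed labels, e.g. "NORMAL", "DROPPED"
    utilization_pct: float  # negative values were a symptom in this case


def cell_alerts(disks, controller_temp_c):
    """Return human-readable alerts, grouped per storage cell."""
    by_cell = {}
    for disk in disks:
        by_cell.setdefault(disk.cell, []).append(disk)

    alerts = []
    for cell, members in sorted(by_cell.items()):
        unhealthy = [d for d in members if d.state != HEALTHY_STATE]
        if unhealthy and len(unhealthy) == len(members):
            alerts.append(f"{cell}: all {len(members)} disks are out of service")
        elif unhealthy:
            alerts.append(
                f"{cell}: {len(unhealthy)} of {len(members)} disks are not healthy"
            )

        negative = [d.name for d in members if d.utilization_pct < 0]
        if negative:
            alerts.append(f"{cell}: negative utilization on {', '.join(negative)}")

        temp = controller_temp_c.get(cell)
        if temp is not None and temp > MAX_CONTROLLER_TEMP_C:
            alerts.append(f"{cell}: controller at {temp:.0f}°C")
    return alerts
```

The check that matters most is the first one. A resilient system keeps serving data when a whole cell drops out, so the alert has to look at cell state directly instead of waiting for an outage.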

This wasn’t reading an Oracle support article.
This was experience.

Why This Matters for Every Exadata Customer

This case teaches a simple truth:

Exadata requires more than hardware support.
It requires a team that understands the entire engineered system — software, metadata, ASM, on-disk behavior, and patching interactions — at expert depth.

A break/fix hardware provider cannot solve these issues.
A ticket-taker cannot diagnose them.
And even Oracle, according to Perry, would have needed weeks.

This customer had Natrinsic. The issue was resolved in days.

Final Thoughts

Exadata is one of the most resilient platforms in the world — but also one of the most complex.
When something goes wrong, it rarely presents where the real problem lies.

That’s why Natrinsic invests so heavily in engineered-systems expertise.
It’s why customers trust us with the platforms that run their business.
And it’s why something as “simple” as a disk replacement revealed a nine-month silent failure that only specialists could see.

If you run Exadata, and you want real support — not hardware swaps and ticket responders — we should talk.
