Most service organizations treat datacenter infrastructure problems as hardware problems.
A disk fails? Replace it. A controller misbehaves? Swap it.
Oracle Engineered Systems like Exadata need to be treated differently. Their real complexity, and the majority of their most difficult failures, lives above the hardware layer, inside the interlocking software stack that Oracle never publicly documents.
A recent incident with a current customer (a top 100 US retailer), whose Exadata X6-2 half rack supports a large portion of its in-store revenue, demonstrates exactly why Exadata requires more than hardware break/fix support.
And why Natrinsic is the only third-party support provider with the experience to diagnose and repair these issues.
The customer opened a ticket for a predictive drive failure. Standard. Routine. The new disk was installed and the system attempted to rebalance.
It failed.
And failed again.
That was the moment Perry West, Natrinsic’s Director of Engineered Systems Support, knew something wasn’t right.
This wasn’t a disk problem.
As Perry dug deeper, the real story emerged:
- All 12 disks in the affected cell were in a dropped or attempting-to-drop state
- The condition had started nine months earlier, while the system was still under Oracle support
- Over 30 known recovery commands failed
- Several disks showed negative utilization, meaning Exadata had dipped into its emergency reserve space (a quick way to spot this is sketched below)
- The RAID controller was running at 115°C, hot enough to interrupt rebalances
- The cell was essentially offline, yet the customer didn’t know
Why?
Because Exadata is so resilient that losing an entire cell often doesn’t cause an immediate outage.
The system silently absorbed the failure.
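For readers who want to see what “silent” looks like: the damage never shows up in the places most teams watch, but it is plainly visible in ASM’s own bookkeeping. Below is a minimal sketch of that kind of check, not Natrinsic’s actual tooling. It assumes sqlplus is on the PATH and that "/ as sysasm" OS authentication works on an ASM node; the queries are illustrative.

```python
"""Illustrative ASM-side health check (a sketch, not Natrinsic's production tooling).
Assumes sqlplus is on the PATH and "/ as sysasm" OS authentication works on this node."""
import subprocess


def asm_query(sql: str) -> list[str]:
    """Run one query through sqlplus -s and return stripped, non-empty output lines."""
    script = f"set pagesize 0 feedback off heading off\n{sql};\nexit"
    result = subprocess.run(["sqlplus", "-s", "/ as sysasm"],
                            input=script, capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Negative USABLE_FILE_MB means the disk group has eaten into the space ASM
    # reserves so it can re-mirror after a failure (the "emergency space" above).
    for row in asm_query("select name || '|' || usable_file_mb from v$asm_diskgroup"):
        name, usable_mb = row.split("|", 1)
        if float(usable_mb) < 0:
            print(f"WARNING: disk group {name} has negative usable space ({usable_mb} MB)")

    # Disks that are offline, dropping, or otherwise unhealthy show up here long
    # before the databases running on top of them notice anything.
    for row in asm_query("select name || '|' || mode_status || '|' || state from v$asm_disk "
                         "where mode_status <> 'ONLINE' or state <> 'NORMAL'"):
        print(f"WARNING: unhealthy ASM disk: {row}")
```

A few lines like these, run on a schedule, would almost certainly have raised a flag back in February instead of letting a dead cell ride for nine months.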
The turning point came when Perry asked a simple question:
“What happened on this date in February?”
The customer answered:
“Oracle patched our systems.”
And then:
“On that same day, the patching team found a bad controller, replaced it, and continued the patch.”
There it was.
The controller was replaced while the cell was in the middle of a rebalance.
Rebalance + patching is a forbidden combination — the result is exactly what we found:
All disks in the cell were severed from the configuration.
And it had been building silently for nine months, all while the system was under Oracle support.
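It is also a failure you can gate against. Before any patching window or controller swap, confirm that no ASM rebalance is in flight. The sketch below shows one way to do that, using the same sqlplus/SYSASM access assumed above; the non-zero exit code is just a convention for wiring it into a maintenance runbook.

```python
"""Illustrative pre-maintenance gate: refuse to proceed while ASM has work in flight.
Same assumption as before: sqlplus on the PATH with "/ as sysasm" OS authentication."""
import subprocess
import sys

SQL = ("set pagesize 0 feedback off heading off\n"
       "select inst_id || '|' || group_number || '|' || operation || '|' || state "
       "from gv$asm_operation;\nexit")


def inflight_operations() -> list[str]:
    """GV$ASM_OPERATION only returns rows while an operation (such as a rebalance) is running."""
    result = subprocess.run(["sqlplus", "-s", "/ as sysasm"],
                            input=SQL, capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    ops = inflight_operations()
    if ops:
        print("ASM operations still in flight; do NOT patch or swap hardware yet:")
        for op in ops:
            print("  inst|group|operation|state:", op)
        sys.exit(1)  # non-zero exit so automation stops here
    print("No ASM operations in flight; clear to proceed with the maintenance window.")
```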
To resolve the issue, Natrinsic had to go far beyond hardware break/fix:
- Performed a full undrop of all 12 disks, a recovery method most teams never use and few even know exists
- Cleared out KEP files, unused databases, and stale allocations so the undrop could succeed
- Replaced the overheating controller
- Restored the ASM and cell disk configuration
- Completed the rebalance for the first time since February
- Delivered new monitoring logic so this condition can never go unnoticed again (a minimal sketch of that kind of check follows)
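On that last point: the monitoring Natrinsic delivered is specific to this customer, but the core idea is simple enough to sketch. The example below polls every storage cell for grid disks whose ASM-side status is anything other than ONLINE. It assumes dcli is available on a compute node, that /root/cell_group lists the storage cells, and that the script runs as root; those are illustrative defaults, not requirements.

```python
"""Illustrative grid disk monitor (a sketch of the idea, not the delivered monitoring).
Assumes dcli on a compute node, a /root/cell_group file listing the cells, and root access."""
import subprocess

CMD = ["dcli", "-g", "/root/cell_group", "-l", "root",
       "cellcli -e list griddisk attributes name, status, asmmodestatus"]


def unhealthy_griddisks() -> list[str]:
    """Return one line per grid disk whose ASM-side status is anything but ONLINE."""
    out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
    bad = []
    for line in out.splitlines():
        fields = line.split()
        # dcli prefixes each line with "cellname:", so we expect:
        # cell, griddisk name, status, asmmodestatus
        if len(fields) >= 4 and fields[-1] != "ONLINE":
            bad.append(line.strip())
    return bad


if __name__ == "__main__":
    problems = unhealthy_griddisks()
    if problems:
        print("ALERT: grid disks not ONLINE in ASM (an entire cell can sit here silently):")
        for p in problems:
            print("  ", p)
    else:
        print("All grid disks report ONLINE.")
```

Feed that into whatever alerting the customer already runs, and a dropped cell stops being silent.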
This wasn’t reading an Oracle support article.
This was experience.
This case teaches a simple truth:
Exadata requires more than hardware support.
It requires a team that understands the entire engineered system — software, metadata, ASM, on-disk behavior, and patching interactions — at expert depth.
A break/fix hardware provider cannot solve these issues.
A ticket-taker cannot diagnose them.
And even Oracle, according to Perry, would have needed weeks.
This customer had Natrinsic. The issue was resolved in days.
Exadata is one of the most resilient platforms in the world — but also one of the most complex.
When something goes wrong, the symptom rarely appears where the real problem lies.
That’s why Natrinsic invests so heavily in engineered-systems expertise.
It’s why customers trust us with the platforms that run their business.
And it’s why something as “simple” as a disk replacement revealed a nine-month silent failure that only specialists could see.
If you run Exadata, and you want real support — not hardware swaps and ticket responders — we should talk.
---
Experience is everything when it comes to supporting Oracle Engineered Systems.
Paul Anderson, Senior Director, Engineered Systems
Perry West, Director of Engineered Systems