An outage can begin with something as ordinary as a single relay falling silent in the field. Then alarms pour in, data flow stops, and decisions are delayed. In sectors like energy, water, wastewater, oil and gas, and transportation, that delay translates into real-world costs. This is precisely where redundancy comes into play. An architecture using dual SCADA servers, dual RTUs, and dual SIMs maintains data and control through hot standby, failover, and high availability. The goal is clear: operations should not stop, no data should be lost, and the disaster recovery plan should be ready at all times.
In this article, we address the fundamental building blocks of this architecture and practical implementation points in a simple and focused manner. We explain the path to establishing a culture of high availability, from the field to the center, and from hardware to connection.
Why is High Availability Critical? Risks, Costs, and Goals
Downtime in a SCADA environment is not just the screen going dark. A pump might stop, a valve might remain in the wrong position, or alarms might be delayed. This means production loss, risk of environmental discharge, security vulnerabilities, and regulatory violations. Scenarios such as loss of pressure in water distribution, uncontrolled switching in a substation, or a leak detected late in a pipeline can escalate quickly.
Uptime percentage is a simple way to illustrate the business impact. Over the course of a year, the difference between 99.95% and 99.999% is the difference between more than four hours of downtime and about five minutes.
| Uptime Percentage | Estimated Annual Downtime |
| --- | --- |
| 99.95% | Approx. 4 hours 22 minutes |
| 99.99% | Approx. 52 minutes |
| 99.999% | Approx. 5 minutes |
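The figures in the table follow from simple arithmetic. As a minimal sketch, assuming a 365-day year (the function name `annual_downtime_minutes` is illustrative, not taken from any SCADA product):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(uptime_percent: float) -> float:
    """Return the downtime per year, in minutes, implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


for pct in (99.95, 99.99, 99.999):
    minutes = annual_downtime_minutes(pct)
    hours, rem = divmod(minutes, 60)
    print(f"{pct}% uptime -> {int(hours)} h {rem:.1f} min of downtime per year")
```

The same function can be used in reverse during planning: pick the downtime the business can tolerate, then check which availability class it implies.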
RTO (Recovery Time Objective) is the maximum acceptable time to restore the system. RPO (Recovery Point Objective) is the maximum acceptable window of data loss. In a SCADA context, RTO is typically expressed in minutes and RPO in seconds, because alarm and event data must stay timestamped and consistent through recovery. Security is a separate concern: especially in critical infrastructure, incorrect data is as dangerous as an incorrect command. Environmental and compliance risks are managed through reportable data and archive continuity.
Therefore, the goals must be clear: high availability, reliable failover, consistent data, and rapid disaster recovery. Redundancy is not an option; it is the foundation of sustainable operation.
Dual SCADA Server: Hot Standby and Fast Failover Design
Consider two SCADA servers: one Active, the other Hot Standby. The Active server runs, handles all sessions, and generates alarms. The Standby server keeps the same database, alarm status, and user sessions synchronized. If the Active server fails, failover occurs automatically and quickly. The goal is for the operator to continue their work without noticing the change.
Critical components of this design:
- Heartbeat: The servers check each other at regular intervals. Tune the interval and the missed-beat threshold so that ordinary packet loss and latency do not trigger a false failover.
- Quorum: In a multi-node setup, decisions are made by majority. This prevents a single node from deciding unilaterally.
- Split-brain prevention: A witness node or tie-breaker ensures that two servers never become active at once during a network partition.
- Database replication: Synchronous or semi-synchronous replication keeps data up to date. The RPO target determines which mode to use.
- Session and alarm sync: Operator screens, alarm flow, and acknowledgment information must remain consistent.
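The interplay of heartbeat, quorum, and split-brain prevention can be sketched as a promotion decision on the standby server. This is a simplified illustration, not any vendor's algorithm: it assumes three voters (active, standby, witness), and the names `HeartbeatMonitor`, `should_promote`, and the threshold of three missed beats are hypothetical. A production cluster would also have the witness grant an exclusive lock so that only one side can win.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats from the peer server."""

    max_missed: int = 3  # assumed threshold; tune to the network's loss and latency
    missed: int = 0

    def record(self, received: bool) -> None:
        # Any received heartbeat resets the counter; a miss increments it.
        self.missed = 0 if received else self.missed + 1

    @property
    def peer_suspected_down(self) -> bool:
        return self.missed >= self.max_missed


def should_promote(monitor: HeartbeatMonitor, witness_reachable: bool) -> bool:
    """Promote the standby only if the peer looks dead AND a majority agrees.

    The standby's own vote plus the witness's vote form 2 of 3. If the standby
    cannot reach the witness, it may be the isolated side of a partition, so it
    stays passive rather than risk two active servers.
    """
    votes = 1 + (1 if witness_reachable else 0)
    has_quorum = votes >= 2
    return monitor.peer_suspected_down and has_quorum


monitor = HeartbeatMonitor(max_missed=3)
for received in (True, False, False, False):
    monitor.record(received)

print(should_promote(monitor, witness_reachable=True))   # peer silent, quorum held
print(should_promote(monitor, witness_reachable=False))  # peer silent, no quorum
```

Requiring both conditions is the design choice that matters: a heartbeat timeout alone cannot distinguish a dead peer from a broken link between the two servers.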
To mitigate risks during testing and maintenance:
- Conduct planned failover drills, and pair the observations with an automated report.
- Apply software updates in phases: first the standby, then the active server.
- Regularly run backup and restore scenarios.
On the security side, role-based access, multi-factor authentication, and network segmentation are basic requirements. For license management, clarify the HA licensing model and the activation rules for the second node in advance.
To quickly recall SCADA and RTU concepts, this guide offers a useful summary: What is an RTU and how does it work with SCADA.
Field Security with Dual RTU: Reliable I/O and Control
RTU devices are the heart of field control. In a redundant RTU architecture, two devices can share the same I/O. One performs active control, and the other monitors and remains synchronized. If the active device fails, the second device takes over control without interruption.
How it works:
- I/O sharing can be active-passive or active-active. Active-passive is preferred in most SCADA environments.
- The primary RTU is selected as the leader during commissioning. The secondary RTU stays synchronized in passive mode.
- A fault is detected through a lost watchdog signal, communication loss, or a power drop.
- Time is synchronized via NTP or GPS, so event and trend data carry accurate timestamps.
- Protocol support matters. Protocols such as IEC 60870-5-104, DNP3, Modbus TCP, and MQTT are selected for both the connection to the center and inter-station communication.
- Prefer devices rated for the site's environmental conditions: temperature, EMC, and vibration.
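The active-passive takeover described above can be sketched as a small state machine running on the secondary RTU. This is an illustrative model only: the class `SecondaryRtu`, the 2-second watchdog timeout, and the sticky takeover (no automatic failback) are assumptions, not the behavior of a specific device. Timestamps are passed in explicitly so the logic stays testable; on real hardware they would come from a monotonic clock.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class SecondaryRtu:
    """Passive RTU that takes over when the primary appears to have failed."""

    def __init__(self, start_s: float, watchdog_timeout_s: float = 2.0) -> None:
        self.role = Role.PASSIVE
        self.watchdog_timeout_s = watchdog_timeout_s  # assumed value
        self.last_pulse_s = start_s  # treat commissioning time as the first pulse

    def on_primary_pulse(self, now_s: float) -> None:
        """Call whenever the primary's watchdog signal is seen."""
        self.last_pulse_s = now_s

    def poll(self, now_s: float, comms_ok: bool = True, power_ok: bool = True) -> Role:
        """Check the three fault conditions and take over if any is present."""
        if self.role is Role.ACTIVE:
            return self.role  # takeover is sticky; failback is a manual step here
        watchdog_expired = now_s - self.last_pulse_s > self.watchdog_timeout_s
        if watchdog_expired or not comms_ok or not power_ok:
            self.role = Role.ACTIVE
        return self.role


rtu = SecondaryRtu(start_s=0.0, watchdog_timeout_s=2.0)
rtu.on_primary_pulse(1.0)
print(rtu.poll(now_s=2.5))  # last pulse is 1.5 s old, within the timeout
print(rtu.poll(now_s=3.5))  # last pulse is 2.5 s old, timeout exceeded
```

Keeping the takeover sticky avoids control bouncing between the two devices if the primary recovers intermittently.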
Good practices for power redundancy and field cabling:
- Use dual power feeds and an external UPS.
- Segregate I/O cabling; route input and output groups through separate channels.
- Follow line termination and shielding rules.
- Define a safe shutdown scenario with watchdog relays.
For those who want to examine RTU examples that support redundancy, two different product families offer a good reference: DM100 RTU redundant SCADA solution and DM500 RTU with redundant CPU modules. This document provides a practical resource for detailed programming and protocol blocks: Mikrodev DCS programming guide.
Dual SIM and Multiple Connections: Seamless Data Communication
Dual SIM makes a big difference in sites relying on cellular infrastructure. Two operators, one goal: connection continues without interruption. The basic logic is to use the primary line as long as it is healthy, and automatically switch to the secondary upon detecting a problem.
Practical settings:
- Switchover rules: Trigger the switch based on signal level, packet loss, an RTT threshold, and the number of consecutive errors.
- Data quota: Monitor the monthly limit, and define the rule for activating the backup line.
- Health check: Test reachability of the actual endpoint with keepalives and periodic pings.
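The switchover rule can be sketched as a policy that scores each health probe and triggers only after several consecutive bad results. All threshold values below (-100 dBm, 20% loss, 800 ms RTT, three failures) and the names `LinkSample` and `SwitchoverPolicy` are placeholders chosen for illustration; real values depend on the modem, the operator, and the polling protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSample:
    """One health probe of the primary SIM's link."""

    rssi_dbm: float
    packet_loss_pct: float
    rtt_ms: float


@dataclass
class SwitchoverPolicy:
    min_rssi_dbm: float = -100.0       # assumed threshold
    max_packet_loss_pct: float = 20.0  # assumed threshold
    max_rtt_ms: float = 800.0          # assumed threshold
    max_consecutive_failures: int = 3  # assumed threshold
    failures: int = 0

    def is_healthy(self, sample: LinkSample) -> bool:
        return (
            sample.rssi_dbm >= self.min_rssi_dbm
            and sample.packet_loss_pct <= self.max_packet_loss_pct
            and sample.rtt_ms <= self.max_rtt_ms
        )

    def observe(self, sample: LinkSample) -> bool:
        """Feed one probe result; return True when the switch should trigger."""
        self.failures = 0 if self.is_healthy(sample) else self.failures + 1
        return self.failures >= self.max_consecutive_failures


policy = SwitchoverPolicy()
good = LinkSample(rssi_dbm=-85.0, packet_loss_pct=1.0, rtt_ms=120.0)
bad = LinkSample(rssi_dbm=-108.0, packet_loss_pct=35.0, rtt_ms=1500.0)

for sample in (good, bad, bad, bad):
    switch_now = policy.observe(sample)
print(switch_now)  # three consecutive unhealthy probes after one healthy probe
```

Counting consecutive failures rather than reacting to a single bad probe keeps the link from flapping between operators on a momentary signal dip.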
Alternative path options:
- Ethernet or fiber can be used as the primary path if feasible in the field.
- Industrial radio links provide low-latency backup connections over short distances.
- MPLS or SD-WAN solutions offer intelligent routing with central policies.
Security topics:
- Private APN provides isolation in the cellular network.
- VPN tunnel protects data with encryption and authentication.
- Certificate management and device identity prevent unauthorized access.
Hot standby and failover concepts are not only for the server; they are also applied at the network layer.
Disaster Recovery Plan and Continuous Improvement
The disaster recovery plan is not a single document; it is a living process. But it can be managed with simple steps.
- Determine goals: Define RTO and RPO values based on business impact. RPO can be seconds for critical alarms, and minutes for reporting.
- Backup strategy: Use a combination of full, incremental, and continuous backups. Keep backups offline and geographically separated.
- Switchover to the secondary center: Write the procedure down step by step in the runbook. Include DNS, connection tunnels, SCADA license migration, operator access, and the rollback plan.
- Drills: Supplement planned drills with surprise tests. Measure results, and record RTO and RPO deviations.
- Observation and root cause analysis: Define permanent corrective actions after an incident. Avoid repeating errors through configuration management and versioning.
- Documentation and training: Prepare short, visual, and role-based guides for operator, maintenance, and network teams. Avoid knowledge loss when personnel change.
- Change management: Every patch, device replacement, or architectural update must pass through impact analysis. Approval and a rollback plan are mandatory.
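Recording RTO and RPO deviations after a drill can be as simple as comparing three timestamps against the targets. The sketch below is one possible way to do it; the names `DrillResult` and `evaluate_drill` and the example targets are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    failure_at: datetime           # moment the fault was injected
    service_restored_at: datetime  # moment operators regained control
    last_replicated_at: datetime   # newest record present on the surviving node


def evaluate_drill(drill: DrillResult, rto_target: timedelta, rpo_target: timedelta) -> dict:
    """Compare measured recovery time and data-loss window against the targets."""
    rto_actual = drill.service_restored_at - drill.failure_at
    # Data written after the last replicated record and before the failure is lost.
    rpo_actual = max(drill.failure_at - drill.last_replicated_at, timedelta(0))
    return {
        "rto_actual": rto_actual,
        "rto_deviation": rto_actual - rto_target,  # positive means target missed
        "rto_met": rto_actual <= rto_target,
        "rpo_actual": rpo_actual,
        "rpo_deviation": rpo_actual - rpo_target,  # positive means target missed
        "rpo_met": rpo_actual <= rpo_target,
    }


drill = DrillResult(
    failure_at=datetime(2024, 1, 1, 12, 0, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 3, 0),
    last_replicated_at=datetime(2024, 1, 1, 11, 59, 58),
)
report = evaluate_drill(drill, rto_target=timedelta(minutes=5), rpo_target=timedelta(seconds=5))
for key, value in report.items():
    print(f"{key}: {value}")
```

Storing these reports drill after drill makes trends visible, which is what turns a one-off test into continuous improvement.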
This cycle strengthens the redundancy culture. High availability is sustainable not just with equipment, but with process.
Conclusion
When dual SCADA servers, dual RTUs, and dual SIMs are used together, a backbone is established that maintains control and data from the field to the center. Hot standby, failover, high availability, and disaster recovery disciplines should be considered under one roof. Take action now: clarify your goals, rank risks, test with a small pilot, and then gradually expand. Plan a controlled and measurable journey, not a problem-free one. If you have a scenario you would like to share, leave it as a comment, and let’s clarify it together.











