rubicon44TechBlog

    Reliability Design.

    created_at:August 27, 2022

    updated_at:August 27, 2022

    Today, I am going to explain the reliability design process.

    Purposes of this article creation

    • To be able to create DB tables in MySQL in Ruby on Rails on my own.
    • To be able to do reliability design.

    Reliability Design Process

    In this case, I will create a physical data model for an SNS task management application on Cacoo as a deliverable.

    【What is Reliability Design?】

    • A method of designing a system, device, or component so that it can perform its expected functions throughout the period from the start of use to the end of its service life, i.e., to prevent failures and performance degradation.

    • Software reliability is also closely related to “quality sub-characteristics”: fault tolerance (whether users are spared the impact when the software fails), recoverability (whether the software can recover quickly from a failure), and maturity (how rarely the software fails).

    【How to Ensure Software Reliability】

    • Define indicators to measure reliability
      • Mean time between failures (MTBF)
      • Mean time to repair (MTTR)
      • System uptime (availability)

    ※Indicators are defined after actual operation.
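
    The indicators listed above are commonly computed as mean time between failures (MTBF), mean time to repair (MTTR), and availability. The following is a minimal sketch of that arithmetic, assuming a hypothetical incident log; the `Incident` class, the function name, and the numbers are illustrative only and are not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One outage: start and end, in hours since go-live (hypothetical format)."""
    failed_at: float
    restored_at: float


def reliability_metrics(incidents, observed_hours):
    """Return (MTBF, MTTR, availability) for one observation window."""
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTBF and MTTR")
    downtime = sum(i.restored_at - i.failed_at for i in incidents)
    uptime = observed_hours - downtime
    mtbf = uptime / len(incidents)    # mean operating time per failure
    mttr = downtime / len(incidents)  # mean time needed to repair
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical log: two outages during 1,000 hours of operation.
log = [Incident(100.0, 102.0), Incident(600.0, 606.0)]
mtbf, mttr, availability = reliability_metrics(log, observed_hours=1000.0)
print(f"MTBF={mtbf:.1f} h, MTTR={mttr:.1f} h, availability={availability:.3%}")
```

    With this sample log, the 8 hours of downtime give an MTBF of 496 hours, an MTTR of 4 hours, and an availability of 99.2%.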

    • Put in place mechanisms to ensure reliability
      • Watch for small errors from the stage of writing design documents and specifications onward, and keep a close eye not only on software developed in-house but also on software procured from outside.
      • It is also important to inspect and verify program behavior with tools that suit the development format and methodology.
        • e.g.: Keep up to date on tools, bugs, and updates so that, if a failure occurs, the cause can be identified immediately and resolved quickly.
          • Record this information on a regular basis (daily, weekly, monthly, etc.), at the organizational level as well as the individual level.
      • Software testing is essential to ensure software reliability.
        • Actions to reduce bugs include adopting IaC (Infrastructure as Code) for the infrastructure, writing sufficient tests, and automating them.

    【Reliability Design】

    • Typical Reliability Design Methods
    • System Redundancy
    • RAID
    • Specific AWS Design Techniques/Methods For System Redundancy
    • Minimum System Configuration For Startup

    Typical Reliability Design Methods

    • Fault tolerance

      • A design method in which the system as a whole covers for a failure in one of its parts, so that the function does not stop.
      • It is typically achieved through redundancy, in which other systems take over when one system goes down.
        • Commonly used redundancies in AWS include Web server and DB server redundancy.
    • Fault avoidance

      • The idea is to lower the failure probability of each individual component, and thereby increase the reliability of the whole.
    • Fail-safe

      • A method of steering the system to the safe side when it fails.
      • For example, when a traffic signal fails, it turns every light red for the time being, so that the failure does not cause further failures.
      • In some cases, processing is halted altogether.
    • Fail-soft

      • A method of keeping the system running at a minimum level when it fails, by isolating the part where the failure occurred.
        • Continuing operation with limited functions in this way is called fallback (degraded operation).
    • Fault Masking

      • A method of hiding an equipment failure so that it has no effect outside the failed part.
      • Specifically, equipment redundancy is used to prevent the entire system from being affected in the event of a single device failure.
    • Foolproof

      • A design method that keeps users out of danger even if they make a mistake, or prevents them from making the mistake in the first place.
      • For example, disabling on-screen buttons that should not be pressed.
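
    Two of the methods above, fail-soft and fail-safe, can be sketched in a few lines. This is only an illustration under assumed names: `fetch_recommendations`, `load_tasks`, `render_dashboard`, and `next_signal_state` are hypothetical stand-ins, not functions of the application described in this article.

```python
def fetch_recommendations(user_id):
    """Stand-in for an optional subsystem that is currently failing (hypothetical)."""
    raise ConnectionError("recommendation service is down")


def load_tasks(user_id):
    """Stand-in for the core function, which still works (hypothetical)."""
    return ["write design doc", "review pull request"]


def render_dashboard(user_id):
    """Fail-soft: isolate the failed part and keep serving the core function."""
    page = {"tasks": load_tasks(user_id), "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except ConnectionError:
        # Fallback operation: drop the optional feature instead of failing the page.
        page["recommendations"] = []
        page["degraded"] = True
    return page


def next_signal_state(sensor_ok, requested_state):
    """Fail-safe: on a fault, force the state that is on the safe side."""
    return requested_state if sensor_ok else "red"


print(render_dashboard(user_id=1))
print(next_signal_state(sensor_ok=False, requested_state="green"))
```

    The dashboard still returns the task list with the `degraded` flag set, and the signal falls back to red regardless of the requested state.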

    System Redundancy

    There are two main ways to make a system redundant.

    • Dual System

      • A method of preparing two or more systems, running the same process in parallel, and comparing the results.
      • High reliability can be obtained by comparing results.
      • Also, even if one system fails, processing can continue on the other system.
    • Duplex System

      • Two or more systems are prepared, but usually only one system is operated and the others are kept on standby.
        • In this case, the system in operation is called the main system (active system) and the system on standby is called the secondary system (standby system).
      • Standby methods (divided into three types according to how the secondary system is kept on standby, which determines the recovery time)
        • Hot Standby
          • Keep the secondary system always ready for operation.
          • Specifically, start up the server and keep all applications and OS running in the same manner as the main system.
          • Therefore, if a failure occurs in the main system, it is possible to switch to the secondary system immediately.
            • Automatically switching to the secondary system and continuing processing when a failure occurs is called failover.
        • Warm Standby
          • Prepare the secondary system with the same configuration as the main system, but keep it in a state from which it cannot take over immediately.
          • Specifically, the server is running, but the application is not running or is doing something else, so it takes some time to switch over.
        • Cold Standby
          • Keep the secondary system as equipment only, without running it.
          • Specifically, spare equipment is prepared but left powered off; when a failure occurs, it is powered on, started up, and then takes over from the main system.
          • Of the three methods, this one takes the longest to switch from the main system to the secondary system.
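
    The difference between a dual system and a duplex system with hot standby can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: `Node`, `dual_run`, and `Duplex` are hypothetical names, a "failure" is just a flag, and real failover also involves health checks and data synchronization that are not modeled here.

```python
class Node:
    """A processing system that can be marked as failed (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def process(self, x):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return x * 2


def dual_run(a, b, x):
    """Dual system: run the same process on both nodes and compare the results."""
    results = []
    for node in (a, b):
        try:
            results.append(node.process(x))
        except RuntimeError:
            pass  # one node failed; continue on the other
    if not results:
        raise RuntimeError("both systems are down")
    if len(results) == 2 and results[0] != results[1]:
        raise ValueError("results disagree; at least one system is faulty")
    return results[0]


class Duplex:
    """Duplex system with hot standby: only the active node processes requests."""

    def __init__(self, active, standby):
        self.active, self.standby = active, standby

    def process(self, x):
        try:
            return self.active.process(x)
        except RuntimeError:
            # Failover: the standby is already running, so it takes over at once.
            self.active, self.standby = self.standby, self.active
            return self.active.process(x)


main, secondary = Node("main"), Node("secondary")
duplex = Duplex(main, secondary)
main.healthy = False  # simulate a failure of the main system
print(duplex.process(21), duplex.active.name)
print(dual_run(main, secondary, 21))
```

    After the simulated failure, the duplex system answers from the secondary node, and the dual system keeps processing on the one node that is still healthy.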

    ※Reliability design needs to be considered together with the AWS design.

    RAID

    RAID (Redundant Array of Inexpensive Disks, now usually read as Redundant Array of Independent Disks) is a technology that combines multiple hard disks and treats them as a single storage device. RAID has two main purposes: to speed up disk access and to improve tolerance of disk failures. There are several RAID levels, and depending on the level, combining multiple disks increases performance, reliability, or both.

    ※RAID-based mechanisms such as “triple-record mirroring” support the 99.999999999% (eleven nines) durability of cloud services provided by AWS. When using AWS cloud services, there are basically few opportunities to implement a RAID configuration yourself, so only basic RAID is introduced here.

    ※Because “triple-record mirroring” alone may not be enough to protect data completely, it is important to also adopt the AWS-recommended method of “distributing data to different regions”, which yields two sets of triple-mirrored data (3 × 2 copies).

    RAID0

    • Speeds up access by distributing data across multiple hard disks (this is called striping).
    • Performance is increased, but reliability is reduced compared to a single disk, because the failure of any one disk loses the data.

    RAID1

    • Writes the same data to multiple hard disks (this is called mirroring).
    • With two disks, one serves as a complete backup of the other, so reliability increases, but performance does not particularly improve.
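
    Striping and mirroring can be sketched with plain Python lists standing in for disks. This is a toy model under stated assumptions: the function names and the 4-byte stripe size are hypothetical, and a failed disk is represented by `None`. It does not reflect how a real RAID controller is implemented.

```python
def raid0_write(data, disk_count, stripe_size=4):
    """RAID0 (striping): spread consecutive chunks across the disks in turn."""
    disks = [[] for _ in range(disk_count)]
    chunks = [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]
    for index, chunk in enumerate(chunks):
        disks[index % disk_count].append(chunk)
    return disks


def raid0_read(disks):
    """Reassemble the data; every disk is required, so losing one loses everything."""
    if any(disk is None for disk in disks):
        raise IOError("a disk failed; striped data cannot be rebuilt")
    total = sum(len(disk) for disk in disks)
    return b"".join(disks[i % len(disks)][i // len(disks)] for i in range(total))


def raid1_write(data, disk_count):
    """RAID1 (mirroring): every disk holds a full copy of the data."""
    return [data for _ in range(disk_count)]


def raid1_read(disks):
    """Any surviving disk can serve the read."""
    for disk in disks:
        if disk is not None:
            return disk
    raise IOError("all mirrors failed")


payload = b"reliability design"
striped = raid0_write(payload, disk_count=2)
mirrored = raid1_write(payload, disk_count=2)
assert raid0_read(striped) == payload

mirrored[0] = None  # simulate the failure of one disk
print(raid1_read(mirrored))
```

    The mirrored set still returns the full payload after one disk is lost, whereas setting any entry of `striped` to `None` makes `raid0_read` raise an error.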

    Specific AWS Design Techniques/Methods For System Redundancy

    【Main Server Configuration】

    • Operate Web and DB on a single server

    • Web server ×1, DB server ×1

      • Move the DB to a separate server when the specs of a single server are not enough.
    • Web server ×2, DB server ×1

      • Use multiple Web servers for redundancy and load balancing on the Web side when Web performance is insufficient.
    • Web server ×2, DB server ×2

      • Make the DB redundant with a master-slave configuration (this gives Web redundancy and load balancing plus DB redundancy).

    Basic Point Before Redundancy

    • Be Aware Of Fault-Tolerance

      • To ensure that operations continue in the event of a failure without service stoppage or performance loss.
    • Aim For High Availability

      • Aim to minimize system downtime.
      • Multi-AZ, Multi-Region, etc.
      • Determine how much downtime the system can be allowed.
    • Eliminate Single Point Of Failure (SPOF)

      • The key point when considering a redundant configuration is eliminating “SPOF”.
      • A “SPOF” is a component whose failure stops the entire system.

    ※The above corresponds to the “Reliability” pillar of the AWS Well-Architected Framework.
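
    The effect of removing a SPOF, and the question of how much downtime is allowed, can be estimated with simple availability arithmetic. The sketch below assumes that failures are independent, that a redundant pair fails over instantly, and that each server is available 99% of the time. All of these are illustrative assumptions, not measured values.

```python
import math


def serial(*availabilities):
    """All components are required, so each one is a single point of failure."""
    return math.prod(availabilities)


def parallel(availability, count):
    """Redundant copies: the group is down only if every copy is down."""
    return 1 - (1 - availability) ** count


def yearly_downtime_hours(availability):
    """Expected downtime per year (365 days) for a given availability."""
    return (1 - availability) * 365 * 24


# Hypothetical figure: every server is available 99% of the time.
single = 0.99
configs = {
    "Web x1, DB x1": serial(single, single),
    "Web x2, DB x1": serial(parallel(single, 2), single),
    "Web x2, DB x2": serial(parallel(single, 2), parallel(single, 2)),
}
for name, availability in configs.items():
    hours = yearly_downtime_hours(availability)
    print(f"{name}: {availability:.4%} available, about {hours:.1f} h down per year")
```

    Under these assumptions, the expected downtime falls from roughly 174 hours per year with no redundancy, to roughly 88 hours while the DB remains a SPOF, to under 2 hours once both tiers are redundant.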

    DB Server Redundancy

    【Methods】

    • Replication of MySQL database

      • In the event of a failure of the master server, a system shutdown can be avoided by promoting the slave server to the master.
    • RDS Multi-AZ Configuration

      • In a single-instance configuration, RDS can experience short outages due to backups and security patching; a Multi-AZ configuration greatly reduces this impact.
      • Multiple database instances are launched across Availability Zones (AZs) and automatically kept in sync.
      • System outages are avoided by automatically switching to the instance in the other AZ when one instance, or an entire AZ, fails.

    Minimum System Configuration For Startup

    • RDS Multi-AZ Configuration

      • DB server x 2 (DB redundancy)
      • Hot Standby configuration
    • Load balancing by creating RDS Read Replicas
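
    Load balancing with Read Replicas means that the application sends writes to the primary and spreads reads across the replicas. The following is a minimal sketch of that routing decision. `ReplicaRouter` and the endpoint names are hypothetical, and the "starts with SELECT" check is a deliberate simplification of what a real framework or proxy does.

```python
import itertools


class ReplicaRouter:
    """Send writes to the primary and spread reads across read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads when no replica exists yet.
        self._readers = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql):
        """Pick an endpoint based on whether the statement only reads data."""
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._readers) if is_read else self.primary


# Hypothetical endpoint names, not real RDS hosts.
router = ReplicaRouter("primary.example", ["replica-1.example", "replica-2.example"])
for statement in ("SELECT * FROM tasks", "INSERT INTO tasks (title) VALUES ('x')",
                  "SELECT * FROM users"):
    print(router.endpoint_for(statement), "<-", statement)
```

    Because Read Replicas are updated asynchronously, a read routed this way may briefly return stale data, so reads that must see a write that was just made are better sent to the primary.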

    Deliverables

    • RDS Multi-AZ Configuration
      • DB server x 2 (DB redundancy)
      • Hot Standby configuration

    ※RDS provides redundant DBs in a Hot Standby configuration when “Multi-AZ” is enabled. In addition, in the event of a failure of the master, the system will automatically fail over.

    ※This time, a minimal system configuration will be used in order to keep the price as low as possible.

    ※Load balancing with Read Replicas will be considered if DB performance proves insufficient (to be determined by performance measurement during prototyping).

    ※Design documents will be created during the “Infrastructure Design (Web server and DB server)” stage.

    Word Explanation

    [Redundancy]

    • To provide multiple machines with the same role so that the entire system will not stop even if a failure occurs in one part of the system.

    References

    [Typical Reliability Design Methods]

    [System Redundancy]

    [RAID]

    → The official documentation states that there is no need to consider RAID.

    [Specific AWS Design Techniques/Methods For System Redundancy]

    [RDS Reliability Assurance]

    [Amazon RDS Multi AZ Deployments]

    [Others]

    Summary/What I learned this time

    This time, I learned about “Typical Reliability Design Methods,” “System Redundancy Methods,” and “Main Redundancy Methods in AWS.”
    I am very interested in redundancy for small-scale services, especially at start-ups, and I would like to be able to build various redundant services properly on my own.

    © 2022, rubicon44TechBlog All rights reserved.