Reliability Design.

created_at:August 27, 2022

updated_at:August 27, 2022

Table-of-contents

Purposes of this article creation
Reliability Design Process
Word Explanation
References
Summary/What I learned this time

Today, I am going to explain the reliability design process.

Purposes of this article creation

To be able to create DB tables in MySQL in Ruby on Rails on my own.
To be able to do physical design.

Reliability Design Process

In this case, I will create a physical data model for an SNS task management application on Cacoo as a deliverable.

【What is Reliability Design?】

Reliability design is a method of designing a system, device, or component so that it performs its expected functions from the start of use to the end of its service life, i.e., so that failures and performance degradation are prevented.
Software reliability is also closely related to the following “quality sub-characteristics”: fault tolerance (whether users are spared the impact when the software fails), recoverability (whether the software can recover quickly from a failure), and maturity (how often the software fails).

【How to Ensure Software Reliability】

Define indicators to measure reliability
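- Mean time between failures (MTBF): the average operating time until a failure occurs
- Mean time to repair (MTTR): the average time needed to repair a failure
- Availability: the proportion of time the system is up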

※Indicators are defined after actual operation.
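
As a rough illustration of how these three indicators relate, the sketch below derives them from a small operation log. The function names and the sample numbers are assumptions made for illustration, not measurements from a real system. It uses the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the average time to failure and MTTR is the average repair time.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of durations (in hours)."""
    return sum(values) / len(values)


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical operation log: hours of uptime before each failure,
# and hours spent repairing each of those failures.
uptimes = [300.0, 500.0, 400.0]
repairs = [2.0, 6.0, 4.0]

mtbf = mean(uptimes)  # average time to failure
mttr = mean(repairs)  # average repair time
print(f"MTBF={mtbf:.1f}h MTTR={mttr:.1f}h availability={availability(mtbf, mttr):.4f}")
```

Because the inputs come from an operation log, these values can only be filled in once the system has actually been running for a while.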

Put in place mechanisms to ensure reliability
- Take care to avoid even small errors from the stage of writing design documents and specifications, and keep a close eye not only on software developed in-house but also on software procured from outside sources.
- It is also important to inspect and verify program behavior using inspection and verification tools that match the format and methodology.
  - e.g.: Keep up with the latest information on tools, bugs, and updates so that, if a failure occurs, the cause can be identified immediately and resolved quickly.
    - This information should be collected and recorded regularly (daily, weekly, monthly, etc.), by the organization as well as by individuals.
- Software testing is essential to ensure software reliability.
  - Actions to reduce bugs include adopting IaC (Infrastructure as Code) for the infrastructure, and writing sufficient tests and automating them.

【Reliability Design】

Typical Reliability Design Methods
System Redundancy
RAID
Specific AWS Design Techniques/Methods For System Redundancy
Minimum System Configuration For Startup

Typical Reliability Design Methods

Fault tolerance
- A design method in which the system as a whole covers for a failure in one of its parts, so that functions are not interrupted.
- Redundancy is a typical technique: when one system goes down, other systems take over for it.
  - Commonly used redundancies in AWS include Web server and DB server redundancies.
Fault avoidance
- The idea is to reduce the probability of individual equipment failures themselves and increase reliability as a whole.
Fail-safe
- A method of steering the system to the safe side when it fails.
- For example, when a traffic signal fails, it is switched to red for the time being so that the failure does not cause further failures.
- In some cases, processing is halted.
Fail-soft
- A method of keeping the system running with minimum functionality when a failure occurs, by isolating the part where the failure occurred.
  - Continuing operation with limited functions in this way is called fallback.
Fault Masking
- A method of keeping an equipment failure from affecting anything outside the failed equipment.
- Specifically, equipment redundancy is used so that the failure of a single device does not affect the entire system.
Foolproof
- A design method that keeps users out of danger even if they make a mistake, or that prevents them from making the mistake in the first place.
- For example, buttons that should not be pressed are made impossible to press on the screen.
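
Two of the methods above, fail-soft and fail-safe, can be sketched in a few lines. This is only a minimal illustration: the function names, the recommender feature, and the signal values are hypothetical and are not taken from the application described in this article.

```python
def recommend_tasks(user_id, recommender):
    """Fail-soft: if the optional recommender fails, keep serving the page
    with an empty recommendation list instead of failing entirely."""
    try:
        return {"degraded": False, "items": recommender(user_id)}
    except Exception:
        # Isolate the failed feature and continue in fallback mode.
        return {"degraded": True, "items": []}


def signal_color(requested):
    """Fail-safe: any invalid request drives the signal to the safe state (red)."""
    if requested not in ("red", "yellow", "green"):
        return "red"
    return requested


def broken_recommender(user_id):
    """Hypothetical dependency that is currently down."""
    raise RuntimeError("recommendation service is down")


print(recommend_tasks(1, broken_recommender))
print(signal_color(None))
```

In the fail-soft case the caller can check the `degraded` flag to tell users that a feature is temporarily limited. In the fail-safe case the system gives up normal operation and moves to the state that cannot cause harm.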

System Redundancy

There are two main ways to make a system redundant.

Dual System
- A method of preparing two or more systems, running the same process in parallel, and comparing the results.
- High reliability can be obtained by comparing results.
- Also, even if one system fails, processing can continue on the other system.
Duplex System
- Two or more systems are prepared, but usually only one system is operated and the others are kept on standby.
  - In this case, the system to be operated is called the main system (active system) and the system on standby is called the secondary system (standby system).
- Standby methods (divided into three types according to how the secondary system is kept on standby, which determines the recovery time)
  - Hot Standby
    - Keep the secondary system always ready for operation.
    - Specifically, start up the server and keep all applications and OS running in the same manner as the main system.
    - Therefore, if a failure occurs in the main system, it is possible to switch to the secondary system immediately.
      - In the event of a failure, the system automatically switches to the secondary system and continues processing, which is called Failover.
  - Warm Standby
    - Prepare the secondary system with the same configuration as the main system, but keep it on standby in a state where it cannot take over immediately.
    - Specifically, the server is running, but the application is not running or is doing something else, so it takes some time to switch over.
  - Cold Standby
    - Only the equipment for the secondary system is prepared; it is not kept in operation.
    - Specifically, spare equipment is kept powered off. When a failure occurs, it is powered on and started up so that it can replace the main system.
    - This method takes the longest to switch from the main system to the secondary system.
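
The failover behavior of a duplex system with a hot standby can be sketched as follows. This is a simplified model under stated assumptions: the `Server` and `DuplexSystem` classes are hypothetical, a failure is simulated with a flag, and a real system would detect failures with health checks rather than by catching an exception on a request.

```python
class Server:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class DuplexSystem:
    """One active server; the standby takes over when the active one fails."""

    def __init__(self, main, standby):
        self.active = main
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Failover: promote the standby and retry the request once.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


main = Server("main")
system = DuplexSystem(main, Server("standby"))
print(system.handle("req-1"))  # served by the main system
main.healthy = False           # simulate a failure of the main system
print(system.handle("req-2"))  # served by the standby after failover
```

The three standby types differ in how long the retry step would take in practice: almost immediately for hot standby, after starting the application for warm standby, and after powering on the equipment for cold standby.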

※Reliability design needs to be considered together with the AWS design.

RAID

RAID (Redundant Array of Inexpensive Disks) is a technology that connects multiple hard disks and treats them as a single storage device. RAID has two main purposes: to speed up disk access and to improve tolerance of disk failures. There are several RAID levels, and whether performance, reliability, or both improve depends on the level used.

※RAID-based mechanisms such as “triple-record mirroring” support the 99.999999999% (eleven nines) durability of cloud services provided by AWS. When using AWS cloud services, there are basically few opportunities to implement RAID configurations yourself, so only basic RAID is introduced here.

※Because “triple-record mirroring” alone may not be enough to fully protect data, it is important to also adopt the AWS-recommended practice of distributing data across different regions, so that data is stored in two triple-mirrored sets (3 × 2 copies).

RAID0

Speeds up access by distributing data across multiple hard disks (this is called striping).
Performance increases, but reliability is lower than with a single disk, because the failure of any one disk loses data.

RAID1

Writes the same data to multiple hard disks (this is called mirroring).
With two disks, one is a complete copy of the other, so reliability increases, but performance is not particularly improved.
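
The difference in reliability between striping and mirroring can be checked with a small calculation. The sketch below assumes that disks fail independently and that each disk fails with probability `p` in some period. Both the value of `p` and the function names are assumptions for illustration.

```python
def raid0_loss_probability(p, disks):
    """Striping: data is lost if ANY one of the disks fails."""
    return 1 - (1 - p) ** disks


def raid1_loss_probability(p, disks):
    """Mirroring: data is lost only if ALL of the disks fail."""
    return p ** disks


p = 0.02  # assumed probability that a single disk fails in the period
print(f"single disk: {p:.4f}")
print(f"RAID0 x2:    {raid0_loss_probability(p, 2):.4f}")
print(f"RAID1 x2:    {raid1_loss_probability(p, 2):.4f}")
```

Under these assumptions, two striped disks are roughly twice as likely to lose data as a single disk, while two mirrored disks are far less likely to lose it.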

Specific AWS Design Techniques/Methods For System Redundancy

【Main Server Configuration】

Operate Web and DB on a single server
Web server ×1, DB server ×1
- Move the DB to a separate server when the specs of a single server are not enough.
Web server ×2, DB server ×1
- Use multiple Web servers for redundancy and load balancing on the Web side when Web-side performance is insufficient.
Web server ×2, DB server ×2
- Make the DB redundant with a master-slave configuration (this gives Web redundancy, Web load balancing, and DB redundancy).
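
How two Web servers provide both load balancing and redundancy can be sketched with a simple round-robin balancer. This is an illustrative model only: the `RoundRobinBalancer` class and the server names are hypothetical, and on AWS this role is normally played by a managed load balancer rather than by application code.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distributes requests across Web servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.health = {name: True for name in self.servers}
        self._ring = cycle(self.servers)

    def mark(self, name, healthy):
        """Record the result of a health check for one server."""
        self.health[name] = healthy

    def next_server(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy web server available")


lb = RoundRobinBalancer(["web-1", "web-2"])
first_three = [lb.next_server() for _ in range(3)]
lb.mark("web-1", False)  # simulate a failure of web-1
after_failure = [lb.next_server() for _ in range(2)]
print(first_three, after_failure)
```

While both servers are healthy the requests alternate between them (load balancing). Once one is marked unhealthy, all requests go to the remaining server (redundancy).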

Basic Point Before Redundancy

Be Aware Of Fault-Tolerance
- Ensure that operation continues in the event of a failure, without service stoppage or performance loss.
Aim For High Availability
- Aim to minimize system downtime.
- Multi-AZ, Multi-Region, etc.
- Determine how much time (downtime) is allowed for the system to be down.
Eliminate Single Point Of Failure (SPOF)
- The most important point when considering a redundant configuration is the elimination of “SPOFs”.
- A “SPOF” is a component whose failure stops the entire system.

※The above corresponds to the “Reliability Pillar” of the AWS Well-Architected Framework.
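
As a small aid for spotting SPOFs, the sketch below applies the idea to the four server configurations listed earlier. It is deliberately simplified: the `find_spofs` helper is hypothetical, it only counts instances per tier, and it ignores other potential SPOFs such as the load balancer, the network, or a single Availability Zone.

```python
def find_spofs(tiers):
    """Return the tiers that have only one instance: if that instance
    stops, the whole request path stops."""
    return [name for name, count in tiers.items() if count < 2]


# The four configurations as {tier: instance count}. A single shared
# server is modeled as one "web+db" tier.
configs = {
    "single server": {"web+db": 1},
    "web x1, db x1": {"web": 1, "db": 1},
    "web x2, db x1": {"web": 2, "db": 1},
    "web x2, db x2": {"web": 2, "db": 2},
}

for label, tiers in configs.items():
    print(label, "->", find_spofs(tiers) or "no SPOF among these tiers")
```

Only the last configuration has no single-instance tier, which is why DB redundancy is treated as the next step after Web redundancy.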

DB Server Redundancy

【Methods】

Replication of MySQL database
- In the event of a failure of the master server, a system shutdown can be avoided by promoting the slave server to the master.
RDS Multi-AZ Configuration
- In a single-instance configuration, RDS can experience short outages due to backups and security patching; a Multi-AZ configuration greatly reduces this concern.
- Multiple database instances are launched across Availability Zones (AZs) and automatically synchronized.
- System outages are avoided by automatically switching to the instance in the other AZ if one instance or an entire AZ fails.
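
For reference, a Multi-AZ MySQL instance can be requested through boto3's RDS `create_db_instance` call by setting `MultiAZ` to `True`. The sketch below only builds the arguments. The identifier, instance class, storage size, credentials, and region are placeholder assumptions, and the function that actually sends the request is defined but not called, since it needs boto3 and AWS credentials.

```python
def multi_az_mysql_params(identifier, username, password):
    """Keyword arguments for boto3's RDS create_db_instance call.
    All concrete values below are placeholders for illustration."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "mysql",
        "DBInstanceClass": "db.t3.micro",  # assumed small instance class
        "AllocatedStorage": 20,            # GiB, assumed
        "MasterUsername": username,
        "MasterUserPassword": password,
        "MultiAZ": True,                   # request a standby in another AZ
    }


def create_instance(params, region="ap-northeast-1"):
    """Send the request. Requires boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    rds = boto3.client("rds", region_name=region)
    return rds.create_db_instance(**params)


params = multi_az_mysql_params("taskapp-db", "admin", "change-me")
print(params["MultiAZ"])
```

In a real project the password would come from a secrets store rather than from source code, and the same setting could instead be expressed in an IaC template.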

Minimum System Configuration For Startup

RDS Multi-AZ Configuration
- DB server x 2 (DB redundancy)
- Hot Standby configuration
Load balancing by creating RDS Read Replicas
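
Load balancing with Read Replicas means sending writes to the primary DB and spreading reads over the replicas. The sketch below shows that routing decision in isolation. The `ReadWriteRouter` class and the host names are hypothetical, and it classifies a statement only by its first keyword, which is a simplification.

```python
from itertools import cycle


class ReadWriteRouter:
    """Sends SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas, reads also go to the primary.
        self._readers = cycle(replicas or [primary])

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self._readers)
        return self.primary


router = ReadWriteRouter(
    "primary.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
print(router.route("SELECT * FROM tasks"))
print(router.route("INSERT INTO tasks (title) VALUES ('x')"))
print(router.route("select id from users"))
```

Because replicas are updated asynchronously, a read sent to a replica may briefly return older data than the primary holds. Reads that must see the latest write are usually sent to the primary.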

Deliverables

RDS Multi-AZ Configuration
- DB server x 2 (DB redundancy)
- Hot Standby configuration

※RDS provides redundant DBs in a Hot Standby configuration when “Multi-AZ” is enabled. In addition, in the event of a failure of the master, the system will automatically fail over.

※This time, a minimal system configuration will be used in order to keep the price as low as possible.

※Load balancing by creating Read Replicas will be considered if DB performance is insufficient (to be determined during performance measurement at the time of prototyping).

※Design documents will be created during the “Infrastructure Design (Web server and DB server)” stage.

Word Explanation

[Redundancy]

Preparing multiple machines that play the same role, so that the entire system does not stop even if a failure occurs in one part of it.

References

[Typical Reliability Design Methods]

「徹底攻略データベーススペシャリスト教科書令和3年度」：https://www.amazon.co.jp/%E5%85%A8%E6%96%87PDF%E3%83%BB%E5%8D%98%E8%AA%9E%E5%B8%B3%E3%82%A2%E3%83%97%E3%83%AA%E4%BB%98-%E5%BE%B9%E5%BA%95%E6%94%BB%E7%95%A5-%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%82%B9%E3%83%9A%E3%82%B7%E3%83%A3%E3%83%AA%E3%82%B9%E3%83%88%E6%95%99%E7%A7%91%E6%9B%B8-%E4%BB%A4%E5%92%8C3%E5%B9%B4%E5%BA%A6/dp/4295009903/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=
「ソフトウェアの信頼性とは？」：https://www.softwarejobs.jp/media/000037

[System Redundancy]

「徹底攻略データベーススペシャリスト教科書令和3年度」：https://www.amazon.co.jp/%E5%85%A8%E6%96%87PDF%E3%83%BB%E5%8D%98%E8%AA%9E%E5%B8%B3%E3%82%A2%E3%83%97%E3%83%AA%E4%BB%98-%E5%BE%B9%E5%BA%95%E6%94%BB%E7%95%A5-%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%82%B9%E3%83%9A%E3%82%B7%E3%83%A3%E3%83%AA%E3%82%B9%E3%83%88%E6%95%99%E7%A7%91%E6%9B%B8-%E4%BB%A4%E5%92%8C3%E5%B9%B4%E5%BA%A6/dp/4295009903/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=

[RAID]

「【後編】AWS/Azureの高耐久性/高持続性を支える「3重記録ミラーリング」～クラウドの耐久性とは～」：https://joho-manage.com/article/049/
「AWS のストレージサービス入門」：https://d1.awsstatic.com/events/jp/2017/summit/slide/D2T2-8.pdf

→ The official source states that “there is no need to think about RAID.”

「RAIDがなかなか覚えられなかったので軽く実践までしてみた」：https://qiita.com/tomiyama0119/items/d70861b4634378d763fb
「AWS における OSのソフトウェア RAID」：https://plaza.rakuten.co.jp/toshiba1/diary/201808240000/
「なぜEBSで RAID5/RAID6が推奨されていない(非推奨)のか。」：https://awsjp.com/AWS/Faq/c/RAID5-RAID6-not-recommended-AC10.html
「AWS EC2インスタンスのRAID化（Windows Server編）」：https://waku.nagoya/blog/2019/11/07/post-647/

[Specific AWS Design Techniques/Methods For System Redundancy]

「AWSでよく使われる冗長化構成の例を紹介｜データベースサーバーを冗長化する方法とは？」：https://www.fenet.jp/aws/column/aws-beginner/790/
★「[AWS]Webレイヤを冗長化する方法」：https://qiita.com/kono-hiroki/items/1362e9bdc37a321301dc
「AWSの大規模障害を受けて冗長化を再考する」：https://qiita.com/tnoce/items/313a9aaa65206a3638f8
「リージョン、アベイラビリティーゾーン、および Local Zones」：https://docs.aws.amazon.com/ja_jp/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html

[RDS Reliability Assurance]

「【AWS】知識ゼロから理解するRDS超入門」：https://miyabi-lab.space/blog/31
「【AWS】RDSのレプリケーションについて解説します。」：https://www.acrovision.jp/service/aws/?p=2462
★「多分誰でも作れるAWSのコンソール画面でマルチAZ環境に踏み台サーバー、RDS（マスター・スレーブ）の環境を構築する」：https://zenn.dev/dodonki1223/articles/1d121c2ff7b70b6873ff

[Amazon RDS Multi AZ Deployments]

「AWSでELBを利用しWEBサーバーを冗長化、ついでにRDSもマルチAZ構成にしてみました。」：https://tech.hippo-lab.com/post-17985/
「【AWS】知識ゼロから理解するRDS超入門」：https://miyabi-lab.space/blog/31
「【AWS】RDSのレプリケーションについて解説します。」：https://www.acrovision.jp/service/aws/?p=2462
「RDSのフェイルオーバー時の挙動を理解してみる」：https://lab.taf-jp.com/rds%E3%81%AE%E3%83%95%E3%82%A7%E3%82%A4%E3%83%AB%E3%82%AA%E3%83%BC%E3%83%90%E3%83%BC%E6%99%82%E3%81%AE%E6%8C%99%E5%8B%95%E3%82%92%E7%90%86%E8%A7%A3%E3%81%97%E3%81%A6%E3%81%BF%E3%82%8B/
「ELB+EC2+RDS(Multi-AZ)構成をあえてActive-Standbyにする」：https://dev.classmethod.jp/articles/multi-az-changed-active-standby/
「フェイルオーバー」：https://wa3.i-3-i.info/word12588.html

[Others]

「Reliability Pillar」：https://docs.aws.amazon.com/wellarchitected/latest/high-performance-computing-lens/reliability-pillar.html
「システムの構成―Web、クラウドなどシステムの要素、全般的な知識を解説。」：https://learning.zealseeds.com/contents/text/IPA/technology/compositionOfSystem/index.html
「第13回信頼できるWebシステムを目指す（前編）」：https://xtech.nikkei.com/it/article/COLUMN/20071120/287654/

Summary/What I learned this time

This time, I learned about “Typical Reliability Design Methods,” “System Redundancy Methods,” and “Main Redundancy Methods in AWS.”
I am very interested in redundancy for small-scale services, especially at start-ups, and I would like to be able to build various redundant services properly on my own.