rubicon44TechBlog

    Operation and Maintenance Design.

    created_at:September 10, 2022

    updated_at:September 10, 2022

    Today, I am going to explain the operation and maintenance design process.

    Purpose of this article

    • To be able to create DB tables in MySQL in Ruby on Rails on my own.
    • To be able to do physical design.

    Operation and Maintenance Design Process

    In this case, I will create a physical data model for an SNS task management application on Cacoo as a deliverable.

    【What is Operation and Maintenance Design?】

    • To define information such as operation rules, operation methods, and failure handling methods in advance in order to ensure stable operation of the system after delivery.

    • In the case of DB Design, the database should be designed so that it can be operated comfortably without any failures.

    • The key mindset for operation is to be proactive, acting before problems occur, rather than reactive, responding only after problems arise.

    【Merits of Operation and Maintenance Design】

    • Start system operation smoothly.
    • Streamline system operations.
    • Prevent failures and trouble before they occur.
    • Deal promptly with failures and problems when they occur.
    • Share system operation know-how and workflows.

    Considerations in Operation and Maintenance Design

    • Monitoring
    • Maintenance
    • Failure Recovery

    Monitoring

    • Database failures and performance degradation do not always show up as obvious failures.
    • Therefore, it is important to monitor logs and DBMS performance regularly to confirm the actual state of the system.

    【Items to be considered】

    • Process and Alive Monitoring
      • Background Processes
    • Resource and Performance Monitoring
      • Threshold, bottleneck, response, throughput, and Resource Monitoring
    • Application and Log Monitoring
      • Trace files, alert logs, listener logs
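
As a minimal sketch of the threshold part of resource and performance monitoring, the following Python compares one sample of metrics against upper limits. The metric names and limits are hypothetical values chosen for illustration; they do not come from any particular DBMS.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Upper limit for one metric (names and limits are hypothetical)."""
    metric: str
    limit: float


def find_violations(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the names of metrics whose sampled value exceeds its limit."""
    return [t.metric for t in thresholds
            if sample.get(t.metric, 0.0) > t.limit]


# Hypothetical thresholds and one hypothetical sample.
THRESHOLDS = [
    Threshold("cpu_percent", 80.0),
    Threshold("connections", 100.0),
    Threshold("avg_response_ms", 500.0),
]
sample = {"cpu_percent": 91.5, "connections": 42.0, "avg_response_ms": 120.0}

violations = find_violations(sample, THRESHOLDS)
print(violations)  # only cpu_percent is above its limit
```

In a real system the sample would come from the monitoring service rather than a hard-coded dictionary.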

    Maintenance

    • As a database continues to be used, its processing performance tends to deteriorate.
    • The hardware also deteriorates, so periodic inspection and maintenance are important.
    • In addition, repeated insertions and deletions of data often result in inefficient disk usage and fragmented indexes.
    • It is also important to periodically review data placement by reorganizing disk storage and rebuilding indexes.

    【Items to be considered】

    • Performance Tuning
      • CPU, memory, network, disk, programs (SQL)
    • Capacity Planning
      • CPU, memory, network, disk

    Failure Recovery

    • Consider how to recover in the event of a failure.
    • Take backups and logs in advance and use rollback and roll forward to recover.

    【Items to be considered】

    • Failure Types
      • Transaction failure, software failure, hardware failure
    • Recovery Point
      • Recovery point (CRUD diagram), JOB flow
    • Backup Types
      • Full backup, differential backup, incremental backup
    • Multiplexing Redundancy set
      • Data files, control files, redo log files, initialization parameter files, flashback recovery
    • System Recovery Requirements/Action
      • Physical backup, logical backup
    • Backup/recovery plan
      • Consistent backup (offline backup), Inconsistent backup
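
The difference between the backup types becomes clearer when you look at which backups are needed for a restore. The sketch below assumes the usual definitions: a differential backup holds all changes since the last full backup, and an incremental backup holds only the changes since the previous backup. It also assumes that a schedule uses either differential or incremental backups, not both. The backup names are hypothetical.

```python
def restore_chain(backups):
    """Return the backups needed to restore the latest state.

    `backups` is a chronological list of (name, kind) tuples, where kind is
    "full", "differential", or "incremental". Assumes a schedule does not mix
    differential and incremental backups.
    """
    # Start from the most recent full backup.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    later = backups[last_full + 1:]

    differentials = [b for b in later if b[1] == "differential"]
    incrementals = [b for b in later if b[1] == "incremental"]

    if differentials:
        # Each differential contains everything since the full backup,
        # so only the newest one is needed.
        chain.append(differentials[-1])
    # Every incremental since the full backup is needed, in order.
    chain.extend(incrementals)
    return chain


differential_schedule = [("sun", "full"), ("mon", "differential"), ("tue", "differential")]
incremental_schedule = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental")]

print(restore_chain(differential_schedule))
print(restore_chain(incremental_schedule))
```

With the differential schedule the restore needs two backups. With the incremental schedule it needs all three, which is the trade-off for smaller daily backups.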

    Schedule Design

    • In normal operations, it is necessary to determine the operation method for each stage, such as daily, weekly, and monthly processing.
    • For operational jobs, the schedule should be designed in consideration of the job network (job linkage) in relation to the preceding jobs.
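
A job network can be treated as a dependency graph, where each job lists the jobs that must finish before it starts. The following is a minimal sketch using Python's standard `graphlib` module. The job names are hypothetical examples, not jobs defined in this article.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must complete before it runs.
# All job names here are hypothetical.
job_network = {
    "nightly_backup": set(),
    "reorganize_indexes": {"nightly_backup"},
    "collect_metrics": set(),
    "daily_report": {"reorganize_indexes", "collect_metrics"},
}

# static_order() yields every job after all of its preceding jobs.
run_order = list(TopologicalSorter(job_network).static_order())
print(run_order)
```

The exact order of independent jobs may vary, but each job always appears after the jobs it depends on. A circular dependency raises `graphlib.CycleError`, which is a useful check when designing the schedule.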

    BCP (Business Continuity Plan)

    • In the event of a failure, switching to a standby system, data recovery, etc., must be designed in advance based on BCM (Business Continuity Management).
    • The basic plan for a company to continue its business, which is developed as part of BCM, is called a BCP (Business Continuity Plan).
    • In formulating a BCP, the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) should be determined.
    • Disaster recovery refers to recovering a system from a catastrophic failure caused by a disaster, or to the measures prepared in advance for such a recovery.
    • A system that must continue to operate 24 hours a day, 365 days a year is called a mission-critical system because a failure would have a serious impact on corporate activities.
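
As a rough worked example of how RTO and RPO constrain the design, the sketch below checks a backup interval and an estimated restore time against the objectives. It assumes recovery from backups only, with no roll forward from logs, so the worst-case data loss equals one full backup interval. All of the numbers are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure just before the next backup loses one full interval."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The service must be restored within the RTO."""
    return estimated_restore_time <= rto


# Hypothetical objectives and plan.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)
daily_backup = timedelta(hours=24)
restore_time = timedelta(hours=2)

print(meets_rpo(daily_backup, rpo))   # False: a daily backup can lose up to 24 hours of data
print(meets_rto(restore_time, rto))   # True: a 2-hour restore fits within 4 hours
```

In this hypothetical case the plan satisfies the RTO but not the RPO, so it would need more frequent backups or log-based roll forward.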

    Operation and Maintenance Design Process

    • Determine monitoring methods, work items/procedures, and schedules
    • Decide maintenance methods, work items/procedures, and schedule
    • Determine failure recovery methods, work items/procedures, and schedule

    Determine monitoring methods, work items/procedures, and schedules

    [Monitoring methods]

    • Daily monitoring and reporting using monitoring service (Amazon CloudWatch)
      • Performance trends for the day
      • Lowest, highest, average performance
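
The daily report items above can be sketched as a small summary function. This example works on a plain list of response times. The sample values are hypothetical, and it does not call Amazon CloudWatch itself.

```python
from statistics import mean


def summarize(samples: list[float]) -> dict[str, float]:
    """Return the lowest, highest, and average value of one day's samples."""
    return {
        "lowest": min(samples),
        "highest": max(samples),
        "average": mean(samples),
    }


# Hypothetical response times in milliseconds collected during one day.
response_ms = [120.0, 95.0, 310.0, 150.0]
report = summarize(response_ms)
print(report)  # {'lowest': 95.0, 'highest': 310.0, 'average': 168.75}
```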

    [Work items/Procedures]
    ※This will be decided when the time comes to actually operate the system.

    ※At the same time, I will also gain a deep understanding of Amazon CloudWatch.

    [Schedules]

    • Daily.

    Decide maintenance methods, work items/procedures, and schedule

    [Maintenance methods]

    • Review performance on a weekly basis, and perform tuning and capacity planning if requirements are not met.

    [Work items/Procedures]

    1. Check the monitoring service
    2. Check numbers against requirements
    3. Perform tuning and planning

    [Schedules]

    • Weekly.

    Determine failure recovery methods, work items/procedures, and schedule

    [Failure recovery methods]

    • Take backups and logs in advance.
      • If using Amazon RDS, you can create a snapshot and restore from it.
      • Find out the reason for the failure from Amazon CloudWatch logs.
    • Use rollback and roll forward to recover from the failure.
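
To illustrate the rollback half of this, here is a minimal sketch of recovering from a transaction failure. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, and the `tasks` table is a hypothetical example. Roll forward, which reapplies committed changes from logs onto a restored backup, is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute("INSERT INTO tasks (title) VALUES ('write design doc')")
conn.commit()

try:
    # Both inserts belong to one transaction.
    conn.execute("INSERT INTO tasks (title) VALUES ('review design doc')")
    conn.execute("INSERT INTO tasks (title) VALUES (NULL)")  # violates NOT NULL
    conn.commit()
except sqlite3.IntegrityError:
    # The transaction failed, so undo every change it made.
    conn.rollback()

titles = [row[0] for row in conn.execute("SELECT title FROM tasks ORDER BY id")]
print(titles)  # ['write design doc']
```

Because the second insert fails, the rollback also discards the first insert of the same transaction. The table returns to its last committed state.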

    [Work items/Procedures]
    ※This will be decided when the time comes to actually operate the system.

    [Schedules]

    • When a failure alert occurs.

    Deliverables(Physical Data Model)

    • URL:NONE.

    Word Explanation

    [Capacity planning]

    • To estimate the processing capacity and quantity of system resources based on the service demand/service level required for the IT system under planning/development or in operation, and to plan the optimal system configuration, including resource procurement and system enhancement/relocation planning.
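
As a very simple worked example of capacity planning, the sketch below estimates how many days remain before a disk fills up. It assumes linear growth, which is a simplification, and all of the numbers are hypothetical.

```python
from typing import Optional


def days_until_full(capacity_gb: float, used_gb: float,
                    growth_gb_per_day: float) -> Optional[float]:
    """Estimate the days left before the disk is full, assuming linear growth.

    Returns None when usage is not growing.
    """
    if growth_gb_per_day <= 0:
        return None
    return (capacity_gb - used_gb) / growth_gb_per_day


# Hypothetical disk: 100 GB capacity, 40 GB used, growing by 0.5 GB per day.
remaining = days_until_full(100.0, 40.0, 0.5)
print(remaining)  # 120.0
```

An estimate like this gives a deadline for procuring more resources, and the same idea applies to CPU, memory, and network.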

    Summary/What I learned this time

    【Summary】
    This time, only the general flow is summarized. Procedures for failure countermeasures, etc., will be decided just prior to actual operation.

    【What I learned this time】
    If you are new to engineering, you will often be involved in the downstream process, but you should always consider the upstream process as you work. I believe this will lead to quick growth.
    In addition, I want to be an engineer who not only performs assigned tasks, but also always asks “Are there any areas for improvement?”, “Is there anything that can be made more efficient?”, “Is it possible to automate the process?”, and “Can another technology be used instead?”

    © 2022, rubicon44TechBlog All rights reserved.