Key Accountabilities
1. First and foremost, build and own operational excellence.
2. Own production issues and resolve them within the agreed SLA and to user satisfaction.
3. Manage all major incidents and resolve or recover the system with minimal or no impact to the business, including capacity.
4. Produce batch and incident trending and measure system performance against KPIs. Track SLIs, SLOs and Error Budget.
5. Overcome any functional limitations and provide the most fit-for-purpose solutions.
6. Ensure strong, clear and effective communication across all release stakeholders.
Job Duties & Responsibilities
1. Facilitate, drive and lead recovery calls for major incidents, coordinating with multiple teams to drive resolution.
2. Communicate on major incidents and provide regular updates to stakeholders.
3. Plan and manage personnel across shifts optimally.
4. Ensure preventive and detective measures for the applications are identified and implemented.
5. Automate manual activities, processes and system health checks for Production teams (automation experience required), ensuring SLIs and SLOs are met.
6. Identify persistent or recurring problems and recommend creative solutions.
7. Apply strong people skills to build and manage a performing team.
8. Apply strong communication skills; understand and work well within a global team, ensuring proper handoff of incidents and details.
9. Ensure incidents are escalated and facilitated to enable efficient and timely service restoration.
10. Drive Root Cause Analysis (RCA) with technology partners after incident resolution and facilitate RCA reviews.
11. Work with the Risk team to respond to Audit & Risk RFIs in a timely manner. Manage audit walkthroughs.
12. Build and practice DevOps and SRE practices. Implement Site Reliability Engineering principles with regard to performance, reliability, monitoring, alerting and maintenance in the Production environment.
13. Proactively monitor capacity and maintain observability of production infrastructure, including automated alerting, performance monitoring and reporting tools.
14. Automate manual tasks in a Core Banking ecosystem.
15. Build and maintain Production monitoring and automation solutions.
16. Build and implement service improvements.
17. Identify, measure and periodically report performance trends (SLIs, SLOs, SLAs), and improve system performance and the associated performance KPIs.
18. Trend production batch and incidents and measure system performance against KPIs; manage SLIs, SLOs and Error Budget.
19. Provide continuous monitoring and improvement of systems: job automation, performance tuning, capacity planning.
20. Review production changes and approve fit-for-purpose changes.
21. Manage product risk proactively across all countries.
Primary Location: India
Job: Technology
Schedule: Regular
Employee Status: Full-time
Job Posting: Oct 6, 2021, 12:26:08 AM