Join Our Team as a Major Incident Manager
💰 $110,000 - $165,000
🎯 Role Definition
As a Major Incident Manager, you are the cornerstone of our operational stability and service resilience. You will take decisive ownership of our most critical IT incidents from detection through resolution, recovery, and review. This high-visibility role requires a unique blend of technical acumen, leadership, and exceptional communication skills. You will orchestrate the efforts of diverse technical teams, making critical decisions under pressure to restore services rapidly. Your primary objective is to minimize business disruption, protect revenue, and maintain customer trust by expertly managing the entire lifecycle of major incidents.
📈 Career Progression
Typical Career Path
Entry Point From:
- Senior NOC (Network Operations Center) Analyst
- IT Support Team Lead or Manager
- Senior Systems/Network Administrator with incident response duties
- Problem Manager
Advancement To:
- Director of IT Operations / Head of Operations
- Senior Manager, Service Delivery
- Head of Incident and Problem Management
- Director of Site Reliability Engineering (SRE)
Lateral Moves:
- Service Delivery Manager
- IT Change Manager
- Business Continuity / Disaster Recovery Planner
- Senior Problem Manager
Core Responsibilities
Primary Functions
- Assume complete command and control of any Major Incident, acting as the single point of contact and authority to orchestrate the recovery process.
- Rapidly assess the business impact and technical severity of an outage, declare a Major Incident when warranted, and trigger the appropriate response protocols.
- Establish and facilitate technical and management "war room" bridge calls, ensuring the right technical resources are engaged and working cohesively towards resolution.
- Develop and disseminate clear, concise, and timely communications to all levels of the organization, from technical teams to executive leadership, regarding incident status, impact, and progress.
- Maintain a detailed and accurate timeline of all incident-related activities, communications, and key decisions within the ITSM platform for post-incident analysis.
- Make critical, time-sensitive decisions to steer the incident response, including authorizing emergency changes or invoking disaster recovery procedures.
- Act as the primary liaison between technical teams and business stakeholders, translating complex technical information into understandable business-centric updates.
- Drive the post-incident review (PIR) process by chairing post-mortem meetings to identify the root cause, contributing factors, and lessons learned.
- Author and publish comprehensive Root Cause Analysis (RCA) and Post-Incident Reports, ensuring they are distributed to relevant stakeholders and leadership.
- Track and drive the completion of preventative actions and follow-up tasks identified during the PIR to mitigate the risk of recurrence.
- Manage and maintain the on-call rotation schedules for incident response teams, ensuring adequate coverage and readiness across all shifts.
- Continuously refine and improve the Major Incident Management process, documentation, and communication templates based on feedback and lessons learned.
- Collaborate closely with the Problem Management team to transition resolved incidents for in-depth root cause analysis and long-term remediation.
- Ensure all incident management Service Level Agreements (SLAs) and Operational Level Agreements (OLAs) for response and resolution are tracked, measured, and reported on.
- Provide coaching, mentoring, and training to technical support teams and other stakeholders on the Major Incident Management process and their respective roles.
- Analyze incident trend data to proactively identify recurring issues, service degradation patterns, and opportunities for systemic improvements.
- Engage and manage third-party vendors and service providers during incidents involving their services, holding them accountable for their resolution efforts.
- Develop a deep understanding of the company's critical business services and the underlying technology stack to effectively assess incident impact.
- Ensure the quality and integrity of all data logged for major incidents within the ITSM tool (e.g., ServiceNow, Jira Service Management), facilitating accurate reporting and metrics.
- Serve as a subject matter expert on ITIL best practices for Incident Management, promoting a culture of service excellence and continuous improvement across the organization.
Secondary Functions
- Participate in Change Advisory Board (CAB) meetings to provide an operational perspective on potential risks associated with proposed changes.
- Contribute to the development and testing of Business Continuity and Disaster Recovery (BC/DR) plans and procedures.
- Collaborate with application and infrastructure teams during the service transition process to ensure new services are onboarded with robust support and incident response plans.
- Mentor junior members of the IT Operations and Service Desk teams, fostering their skills in incident triage and response.
Required Skills & Competencies
Hard Skills (Technical)
- ITIL v3/v4 Foundation Certification: A strong, demonstrable understanding of ITIL principles is essential, particularly in Incident, Problem, and Change Management.
- ITSM Tool Proficiency: Extensive hands-on experience with enterprise-grade ITSM platforms such as ServiceNow, Jira Service Management, or BMC Remedy.
- Monitoring & Observability Tools: Familiarity with modern monitoring solutions like Datadog, New Relic, Splunk, Dynatrace, or Prometheus to interpret alerts and data.
- Cloud & Infrastructure Knowledge: Solid understanding of public cloud platforms (AWS, Azure, GCP) and traditional on-premises infrastructure (servers, networks, storage, databases).
- Technical Acumen: Ability to grasp complex technical concepts related to application architecture, networking, and databases to facilitate technical discussions effectively.
- Reporting & Analytics: Competency in creating and analyzing incident metrics, KPIs, and trend reports to drive data-informed decisions.
Soft Skills
- Leadership Under Pressure: The ability to remain calm, command respect, and lead decisively in high-stress, ambiguous situations.
- Exceptional Communication: Excellent verbal and written communication skills, with the ability to tailor messages for technical, business, and executive audiences.
- Analytical & Problem-Solving Mindset: A methodical approach to troubleshooting, with a strong ability to guide others in diagnosing complex issues.
- Stakeholder Management & Influence: Proven ability to build relationships, manage expectations, and influence outcomes with stakeholders at all levels without direct authority.
- Ownership & Accountability: A relentless drive to see incidents through to resolution and ensure preventative measures are implemented successfully.
- Empathy and Composure: The capacity to understand the customer and business impact while maintaining a composed and professional demeanor.
Education & Experience
Educational Background
Minimum Education:
- Bachelor's Degree in a relevant field or equivalent combination of professional experience and certifications (e.g., ITIL, PMP).
Preferred Education:
- Bachelor's or Master's Degree in a technology-focused discipline.
Relevant Fields of Study:
- Computer Science
- Information Technology
- Management Information Systems (MIS)
- Business Administration
Experience Requirements
Typical Experience Range:
- 7+ years of experience within IT, including at least 3 years in a dedicated Major Incident Management, Crisis Management, or senior-level IT Operations role.
Preferred:
- Demonstrated track record of managing high-severity (P1/S1) incidents in a large-scale, 24/7/365 global enterprise environment (e.g., SaaS, e-commerce, financial services).
- Experience working in an Agile/DevOps environment and familiarity with Site Reliability Engineering (SRE) principles.