Senior Software Engineer - L3 Support
About the Role
The Senior Software Engineer (L3 Support) leads production reliability efforts for KAST's core financial platforms, partnering closely with engineering teams. This role combines deep engineering insight with hands-on operational ownership of production systems, directly protecting customer trust as we scale globally.
Responsibilities
- Lead production incident response for KAST's core platforms, ensuring fast resolution and minimal customer impact.
- Drive deep root cause analysis for high-severity incidents and turn learnings into lasting reliability improvements.
- Debug and resolve issues across application, data, and infrastructure layers in distributed, cloud-based systems.
- Use logs, metrics, and traces to understand system behavior, identify failure patterns, and improve observability.
- Partner closely with engineering and platform teams to resolve defects and raise the overall reliability bar.
- Lead incident reviews and contribute to improving how we prevent, detect, and respond to production issues.
- Execute configuration changes, hotfixes, and rollbacks safely while protecting system availability.
- Improve operational readiness by evolving runbooks, SOPs, alerts, and dashboards as systems scale.
- Ensure production systems consistently meet availability, performance, security, and compliance expectations.
- Participate in on-call rotations and take ownership during live incidents when reliability matters most.
- Proactively identify operational risks, technical debt, and system stability gaps before they impact users.
Requirements
- 5+ years of experience in application development, site reliability engineering, or similar roles, including handling high-impact production incidents.
- A strong foundation in supporting cloud-hosted systems on platforms such as AWS and GCP, with an understanding of how to keep them reliable as they scale in real-world environments.
- Hands-on experience working with containerized, microservices-based architectures and the challenges that come with operating them in production.
- Confidence debugging and troubleshooting both front-end and back-end applications, including systems built with modern programming languages and frameworks such as Go, Python, JavaScript, Next.js, Flutter (Dart), and related ecosystems.
- Solid understanding of CI/CD pipelines, deployment strategies, and release management practices.
- Experience using monitoring, logging, and alerting tools to diagnose production issues and improve system observability over time.