American Express Careers
Site Reliability Engineer
Are you someone who says, “Why not?” rather than “Why?” Are you someone who lays down new paths? Do you love to dream boldly, explore and discover new experiences?
The success of our entire company rests on our systems, networks, and people. Ours is a team of highly skilled DevOps, ProdOps and SRE engineers who strongly advocate automation and monitoring across all the applications and platforms we support. In the last few years, with innovation at its core and a never-say-die attitude, this team has been shaping the digital future at AmEx while becoming the poster child for defining the art of the possible!
As a Site Reliability Engineer (SRE), you will be responsible for a broad range of activities. You will work closely with application development teams to build standards that drive the highest levels of availability across our critical Servicing, Messaging and Marketing portfolios. You will join a team that provides 24/7 support, and you will be expected to develop solutions that improve production support and monitoring services while responding to incidents to keep application availability high. You will also drive engineering work, including infrastructure automation, tool design and development, and code that supports our application teams.
In this role, you will be responsible for (but not limited to) the following:
Lead a team of DevOps, ProdOps and SRE engineers in supporting critical Servicing, Messaging and Marketing applications.
Work closely with our application engineering teams to launch and maintain applications both on-premises and in hybrid-cloud environments.
Act as the primary escalation point for our L1 support team, helping make decisions to restore service and minimize impact to availability.
Provide production support and respond to production incidents as the first line of defense for the organization.
Diagnose intricate software problems and provide solutions and workarounds to ensure the highest level of reliability and availability for critical applications.
Facilitate the resolution of non-application issues (third-party upstream issues, infrastructure issues, storage, database, network, file transfer, etc.).
Debug network and performance issues in large-scale distributed systems.
Provide consultation and strategic recommendations by quickly assessing and remediating complex availability issues.
Participate in and oversee upgrades or migrations of platforms and applications to production, as well as other planned maintenance activities.
Drive monitoring requirements to ensure business-service-level visibility for all support teams.
Introduce new and impactful technologies to the production support tool chain to minimize friction in production releases and to speed up diagnosis of, and recovery from, production incidents.
Challenge the status quo, identify opportunities to adopt innovative technologies that enable business capabilities, and generate creative ideas and solutions to difficult problems.
Have an “Automation First” mindset so that repetitive tasks are not handled manually.
Be highly influential at all levels, including peers, leaders and key stakeholders. Distill complex ideas and concepts into clear, structured, easy-to-understand language.
- 8+ years of software development experience, including experience in a DevOps environment
- Experience with Java/J2EE/UI applications
- BS degree in Computer Science, Computer Engineering, or another technical discipline, or equivalent work experience.
- Experience supporting a 24/7 enterprise environment with on-call responsibilities for production support
- Broad technical field exposure, with preference for the following skills: cloud infrastructure, VMs, load balancing, containers, JVMs, web servers, application debugging, queuing technologies, caching technologies, databases, routing and switching, etc.
- Knowledge of Linux internals and experience managing Linux systems in high-traffic environments.
- Experience managing relational and NoSQL databases such as Oracle and Couchbase.
- Hands-on experience with enterprise tools such as Splunk, Grafana, Dynatrace, and AppDynamics.
- Strong interpersonal communication skills and the ability to work well in a diverse team-focused environment
- Experience with Google Cloud, Python, Hive, and Hadoop is a plus
Schedule (Full-Time/Part-Time): Full-time
Date Posted: Jun 3, 2019, 9:48:05 AM