Site Reliability Engineer

Job Description

Why American Express?

There’s a difference between having a job and making a difference.

American Express has been making a difference in people’s lives for over 160 years, backing them in moments big and small, granting access, tools, and resources to take on their biggest challenges and reap the greatest rewards.
We’ve also made a difference in the lives of our people, providing a culture of learning and collaboration, and helping them with what they need to succeed and thrive. We have their backs as they grow their skills, conquer new challenges, or even take time to spend with their family or community. And when they’re ready to take on a new career path, we’re right there with them, giving them the guidance and momentum into the best future they envision.

Because we believe that the best way to back our customers is to back our people.


The powerful backing of American Express.
Don’t make a difference without it.
Don’t live life without it.



Purpose of the Role:


We're looking for a Site Reliability Engineer responsible for web application performance, availability, and reliability. The candidate will provide consultation and strategic recommendations by quickly assessing and remediating complex platform availability issues.


This role will drive the DevOps mindset, which strives to use software engineering to build and run better production systems. You will write software to optimize day-to-day work through better automation, monitoring, alerting, testing, and deployment.


You'll be expected to work with several Technology partners to identify areas of opportunity within the availability platform and to build solutions that automate monitoring for the next-generation platform, with constant innovation to drive efficiencies. You will be responsible for implementing tracing, monitoring, and tooling solutions to maximize the performance and availability of our web applications.




        ○ Ability to collaborate with high-performing teams and individuals throughout the firm to accomplish common goals.


        ○ Strong analysis, research, investigation and evaluation skills, with a structured approach to problem solving


        ○ Ability to work and effectively prioritize in a highly dynamic work environment that includes a global focus.


        ○ Exposure to ITIL processes is preferred.


        ○ Focus on improvements in automation, logging, and monitoring.


        ○ Build monitoring and alerting tools to help SRE and Operations teams to quickly pinpoint, isolate and resolve issues related to infrastructure, platform services and applications.


        ○ Produce weekly, monthly, and quarterly uptime and status reports for production and critical internal infrastructure and applications.


        ○ Be the first line of defense for the development team: analyze outages, drive the RCAs, and subsequently deliver the product enhancements that the RCAs call for.


        ○ Financial Services background or experience preferred.





Critical Factors to Success:

Experience supporting a 24/7 enterprise environment with on-call responsibilities for production support

Experience working in a distributed team model with daily handoff of issues at shift change and at close of business

Broad technical exposure, with preference for the following skills: cloud infrastructure, VMs, load balancing, containers, Kubernetes, JVMs, web servers, application debugging, caching technologies, databases, routing and switching, etc.

Familiarity with financial services and authorization systems is a plus.

Understanding of Agile practices in Operations teams

Past Experience:

0-2 years of work experience in DevOps with Java/J2EE/React JS applications

0-2 years of work experience supporting 3-tier architecture, including exposure to IBM DB2 and Oracle

Hands-on experience with enterprise tools such as Grafana, Dynatrace, AppDynamics, and Jenkins.

Analytical knowledge of, and exposure to, root cause identification using analyzer tools such as IBM Support Assistant, Splunk, etc.

Hands-on experience configuring Splunk dashboards and alerts

Understanding of cloud technologies such as Kubernetes and OpenShift



Academic Background:

A BS in Computer Science, Computer Engineering, or another technical discipline, or equivalent work experience.

Functional Skills/Capabilities:

Technical Skills/Capabilities:

Experience in the design and development of SRE capabilities such as self-healing, advanced predictive analytics, and monitoring.

        ○ In-depth knowledge of one or more real-time analytics and monitoring solutions, such as Splunk, ELK, or Kafka

        ○ Hands-on experience in the design and development of fast, scalable applications using machine learning algorithms, tools, and frameworks.

        ○ Knowledge of server-side technologies such as WebSphere, JBoss, NodeJS.


Experience with scripting languages such as Python, Perl, and Unix shell.

        ○ Experience with Docker, microservices, container-based deployment, and service orchestration using Kubernetes.



Knowledge of Platforms:

○ Experience with private cloud platforms and public clouds (AWS, GCP, Azure, OpenStack, etc.)

Knowledge of network rule creation, load balancer configuration, network packet analysis, and network protocols.


Behavioral Skills/Capabilities:

Enterprise Leadership Behaviors

            Set The Agenda: Define What Winning Looks Like, Put Enterprise Thinking First, Lead with an External Perspective

            Bring Others With You: Build the Best Team, Seek & Provide Coaching Feedback, Make Collaboration Essential

            Do It The Right Way: Communicate Frequently, Candidly & Clearly, Make Decisions Quickly & Effectively, Live the Blue Box Values, Great Leadership Demands Courage  

ReqID: 19018355
Schedule (Full-Time/Part-Time): Full-time
Date Posted: Dec 11, 2019, 5:05:58 AM