American Express Careers

Site Reliability Engineer

Fort Lauderdale, Florida
Digital Commerce Technology


Job Description


The primary focus of this role is to provide technical expertise, education, and tooling to ensure the highest level of reliability and availability for critical applications. The engineer provides consultation and strategic recommendations by quickly assessing and remediating complex availability issues, and is responsible for driving automation and efficiencies that increase quality, availability, and security.

Organizational Context

Technical individual contributor integrated with technology and business partners to drive efficiencies that increase the quality, availability, and security of technical platforms. Works individually and with teams to drive reliability goals and objectives across platforms.

Key Responsibilities

Product Development

       Enable creation and updating of logging standards to streamline dashboard creation and ensure usability of logging repository

       Drive monitoring requirements to ensure business-service level visibility for all support teams

       Participate in architectural decisions to ensure software transaction flows are appropriately supported and designed

       Is an IT infrastructure Subject Matter Expert (SME) and works with Development teams to build to standards that drive the highest levels of availability

       Provides guidance to software engineers on design patterns that are resistant to failure

       Communicates effectively with Development and Operation teams to align on requirements, driving SDLC requirements, capabilities, and limitations pertinent to delivering highly resilient applications



       Responsible for evaluating and implementing orchestration, automation, and tooling solutions so that processes are consistent and repetitive tasks are performed with greater accuracy and fewer defects

       Build, implement and advise on recovery tooling to adhere to enterprise standards and/or frameworks

       Introduce new and impactful technologies to the production support tool chain that help minimize friction for production releases and support, and to more quickly diagnose and recover from production incidents


Operational Readiness

       Responsible for availability, proactive monitoring and alerting, capacity planning, and performance (reducing latency and increasing efficiency), including testing, for technical platforms

       Partner with the appropriate supporting teams to ensure operational readiness throughout the application lifecycle


Production Support

       Ensure application data flows are accurate and up to date, with the objective of increasing the knowledge base of all support teams and driving reliability

       Facilitates the resolution of non-application issues (third-party upstream issues, infrastructure issues, storage, database, network, file transfer, etc.)

Scope of Impact/Influence

       Consults with teams to build standards that drive the highest levels of availability

       Mentors teams through ongoing development efforts

       Partner with development teams to adhere to SDLC standards

       Center of Enablement – coach and advise various teams on the SRE function, providing real-life examples when necessary


       Bachelor’s degree in a related field preferred; relevant industry experience can substitute

       8+ years of engineering and/or architecture experience in a complex environment, such as a large-scale web infrastructure or development team

       Experience supporting a 24/7 enterprise environment with on-call responsibilities for production support

       Experience in a broad range of software development and operations technologies such as infrastructure, virtualization, load balancing, containers, JVMs, web servers, application debugging, queueing technologies, caching technologies, databases (RDBMS and NoSQL), routing and switching, etc.

       Experience with high-transaction-volume OLTP sites or in the financial services industry is preferred

High-performing Behaviors

       Has an “Automation First” mindset – fundamentally will not accept doing things over and over by hand

       Combines deep technical expertise, a continuous improvement and automation mindset, and systematic and rational root cause analysis to identify opportunities to make things faster and better

       Challenges the status quo, identifies opportunities to adopt innovative technologies to enable business capabilities, generates creative ideas and solutions to difficult problems

       A recognized expert and highly sought-after consultant who is knowledgeable about current research and technology in the industry and uses that knowledge to continually improve the function

       Highly influential at all levels, including peers, leaders, and key stakeholders, distilling complex ideas and concepts into clear, structured, easy-to-understand language

       Adapts to change quickly and easily and helps others adjust to changes through effective communication

       Knows when to escalate decisions and when to make on-the-spot decisions


       Knowledge of DevOps-related practices such as CI/CD, canary pushes/blue-green deployments, software-defined infrastructure and tools, etc.

       Ability to guide and implement the scripting or development of production support tooling that can be leveraged by your team and others

       Understanding of multi-tier application architectures and related development technologies in support of service virtualization and API implementation/support

       Ability to write and build code and/or interpret and understand code

       Experience with helping drive story/non-functional/functional test planning

       Ability to create and update operating procedures on an ongoing basis

       Ability to assess logging and understand its value

       Knowledge and understanding of the SDLC principles and key controls

       Understand problem and incident management processes

       Ability to recommend the appropriate database technology based on applied knowledge and the given software needs

       Working knowledge of operations, including certificate management, firewall rules, websites, XaaS, load balancer configuration, website virtualization (VMs and containers), etc.

       Knowledge of financial industry standards and business practices

       Large scale application support experience

       Deep understanding of *nix technologies, e.g. AIX, Red Hat, and CentOS

       Exposure to web frameworks

       Knowledge of configuration management, release automation, and orchestration technologies

       Experience working with enterprise applications, including queueing or shared services technologies

       Understanding of monitoring technologies, focused on logging, time-series, or machine-learning products, from a product owner’s perspective

Technology Core Competencies

       Adaptive Communication

       Agile Practices

       Industry and Company Knowledge

       Organizational Change Management

       Technical Acumen

       Technology Industry Trends

Game Changers


       Collaboration & Teamwork

       Continuous Improvement




       Servant Leadership


Role Core Competencies


·         Programming Languages & Frameworks

·         Programming/Software Development


·         Business Analysis

·         IT Infrastructure

·         Network Support

·         Release & Deployment


·         Coaching & Mentoring

·         Consultancy

·         Decision-Making

·         Influencing & Negotiation

·         Relationship Management

·         Strategy Formulation


·         Facilitation

·         Problem Solving


ReqID: 18001600
Schedule (Full-Time/Part-Time): Full-time