Open Position -

Site Reliability Engineer

REF. NO:
SpM - SRE

SKILL SET

  • Typically 4-6 years of relevant experience
  • In-depth understanding of software engineering and cloud operations
  • Familiar with cloud automation concepts, tools and processes
  • Experience in designing large-scale distributed information systems and server load-balancing architectures
  • Working experience with Ansible
  • Professional experience with container-related technologies such as Kubernetes, Helm and Docker
  • Solid work experience with cloud platforms such as AWS, IBM or Azure
  • Solid understanding of networking concepts and the TCP/IP stack
  • Programming experience in at least one of the following languages: C, C++, Java, Python, Perl, or Ruby
  • Practical experience with Linux administration (Debian is a plus), monitoring tools, troubleshooting and performance tuning
  • JavaScript experience desired
  • Experience deploying and maintaining Erlang/OTP-based systems is an asset
  • Strong analytical and problem-solving skills
  • Excellent written and verbal communication skills; mastery of English
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent

RESPONSIBILITIES

  • Work with engineering teams to design and build a scalable platform that provides mission-critical services to our end customers and users
  • Participate in the design and development of internal tooling and scripts to monitor and automate our infrastructure-related processes
  • Implement automated, failsafe platform deployment concepts, typically canary releases
  • Define the deployment strategy and tools to ensure smooth service operation through resistance to failure, automatic upscaling and downscaling, and zero-downtime deployments
  • Solve issues across the entire stack, whether software- or hardware-related
  • Work with architects to help define new system architectures in order to achieve high availability and failsafe services
  • Maintain and support internal tools on an ongoing basis, and improve system health and reliability
  • Document and provide cross-training to peers for projects and products worked on