Work in concert with engineering teams to evolve services for better scalability, reliability, and development velocity.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, with a focus on improving reliability.
Define, measure, and tune key performance indicators and metrics to ensure a seamless experience.
Develop tools to improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments.
Practice sustainable incident response and blameless postmortems.
Expertise in the configuration and maintenance of common applications and services such as Apache, Tomcat, Nginx, Memcached, Squid, OAuth, NFS, DHCP, DNS, and SNMP.
Thorough knowledge of deployment, management, and cost optimization techniques for machine instances on public clouds (AWS, GCP, or Azure).
Experience designing monitoring, logging, and reliability processes for systems at scale.
Design and develop solutions for cloud security, secrets management, and key rotations.
Provide on-call support on a rotation basis for services running on the Mindtickle platform, manage incidents, and work with application developers on root cause analysis.
Ability to quickly learn new processes, applications, and tools as required.
Maintain, review, propose, and implement improvements to existing infrastructure, tools, and processes.
Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.