Employer: Box
WHAT IS BOX
Box is the market leader for Cloud Content Management. Our mission is to power how the world works together. Box is partnering with enterprise organizations to accelerate their digital transformation by creating a single platform for secure content management, collaboration and workflow. We have an amazing opportunity to further establish ourselves as leaders in the space, and we need strong advocates to help us achieve that goal.
By joining Box, you will have the unique opportunity to help capture a majority of this developing market and define what content management looks like for the digital enterprise. Today, Box powers over 97,000 businesses, including 70% of the Fortune 500 who trust Box to manage their content in the cloud.
WHY BOX NEEDS YOU
Box is looking for a dynamic Technical Duty Officer to help lead our Global Technical Operations Center (GTOC) and support an industry-leading platform. It is the responsibility of the GTOC team to monitor, troubleshoot, and resolve issues that affect the availability and quality of the Box platform. The team is the front line of defense in making sure our customers like GE, Pandora, Apple and Gap have a seamless experience when accessing their content on Box.
This is an integral job function within the GTOC that ensures overall production site health and the performance of core customer-facing journeys. This role will help maintain total site awareness by detecting metric and service deviations, monitoring changes, and proactively identifying potential issues and resolving them before they escalate to customer-impacting levels.
We are building a world-class Operations Center and need the best talent possible to get us there. That’s where you come in!
WHAT YOU’LL DO:
- Own live-site Incident Management from triage to resolution of customer impact.
- Drive customer-impacting events and lead a cross-functional group of teams to quickly mitigate the problem and restore service
- Ensure accurate, valid and timely communication to key stakeholders and business entities.
- Operate across organizational boundaries to protect our customers, their data, and the availability of all Box services
- Troubleshoot critical problems through applications, systems, clouds, and networks
- Provide technical leadership and key insights to improve Box’s Reliability Engineering capabilities
- Lead daily review of planned changes; accountable for minimizing change risk
- Contribute to post-mortem process, driving prioritization of action items related to site reliability and resiliency
- Lead projects to improve tools and processes related to manageability, observability, and resiliency
WHO YOU ARE:
- You are confident and comfortable communicating from the individual-contributor level up through C-level staff
- You have a rock-solid command presence and are calm and collected in stressful situations, such as a major service outage
- You’re driven to learn new skills and technologies
- You have 5+ years of large-scale production operations or development experience and enjoy talking reliability engineering
- You have a Bachelor’s degree in Computer Science, Information Systems, or an equivalent technical field, or similar work experience in a large-scale 24/7 production environment supporting critical, real-time applications
- Remote Friendly!
Required Skills:
- Solid grasp of Red Hat/Ubuntu Linux and shell scripting
- Experience with bare metal, OpenStack, and Kubernetes platforms
- Experience with Content Delivery Networks (Cloudflare, Akamai)
- Experience working in virtualized environments and cloud implementations (GCP preferred, AWS)
- Solid understanding of TCP/IP, BGP, IP Anycast, and DNS
- Experience with message bus technology (Kafka, RabbitMQ, MQS)
- Experience with relational and non-relational databases (MySQL, HBase, Elasticsearch)
- Experience with caching (Redis, Memcached)
- Experience with service mesh technologies in a hybrid-cloud environment (ZooKeeper, SmartStack)
- Experience with observability tools in a large-scale environment (Splunk, Datadog, Wavefront, Catchpoint, ThousandEyes, Sensu, Distributed Tracing, RUM)
- Understanding of CI/CD pipelines
- Outstanding interpersonal and communication skills.
- Incident management in a large-scale, high-uptime environment
- Flexibility to work in a shift model