Lead Site Reliability Engineer

Employer: Ontada

Ontada is an oncology technology and insights business dedicated to transforming the fight against cancer. Part of McKesson Corporation, Ontada was founded on the core belief that precise insights – delivered exactly at the point of need – can save more patients’ lives. We connect the full patient journey by combining technologies used by The US Oncology Network and other community oncology providers with real-world data and research relied on by all top 15 global life sciences companies. Our work helps accelerate innovation and power the future of cancer care.

Ontada is an oncology data science and technology business. We specialize in real-world data and evidence generation that accelerates life science research, clinical technologies that support community providers with precise care, and provider engagement channels that enable education and insights.

We’re looking for a driven, enthusiastic technology leader to join the team that’s transforming the fight against cancer and improving the lives of patients, backed by the strength of a Fortune 7 company.

Position Summary:

Ontada’s Site Reliability Engineering (SRE) team works with operations and development teams to ensure the monitoring and alerting of oncology-focused software systems. Additionally, the SRE team administers AWS for development teams, and shares CentOS administration responsibilities with an on-premises systems administration team. We apply software development principles to the automation of manual tasks.

Frequently used technologies:

We use Ansible playbooks for running tasks, provisioning servers, updating configuration, and installing software
Often, these playbooks are run from Groovy-based pipelines in Jenkins, which itself is used as a task scheduler (in addition to CI/CD)
If a task fails, or if a system is reporting abnormal metrics, Prometheus notices this and raises the issue to Alertmanager
Depending on how we’ve configured its PromQL-based matchers, Alertmanager may send this information on to PagerDuty, which notifies the SRE on-call
We use tools like CloudFormation, Troposphere, and Boto3 to provision AWS infrastructure
Some administration of Kubernetes instances is done in EKS and AKS
We also work with a mix of other tools and tasks: internal utilities written in Python and Go, WildFly server administration, a handful of assets in Azure, and automation of Oracle RMAN

Working environment:

SREs communicate with each other and with other geographically dispersed teams through Slack, WebEx, Teams, and Outlook. Long-lived documentation is stored in Confluence, Jira, and often in Markdown files in the repositories they describe. The main recurring meeting is a planning meeting once every two weeks. Other meetings are scheduled independently between an SRE and the teams (development/operational/business) that are working on specific tasks. The rest of your time is fairly flexible; we help each other out to ensure PagerDuty coverage.

Who we’re looking for:

You like to find the gaps in software systems and suggest fixes for them. You see unmonitored systems as a challenge to be brought up to a more transparent state. When you see a manual task being done, you naturally begin thinking of ways to automate it.

If you had worked here last week, you would have helped us:

Update a data export process to use different SQL tables and different file stores
Increase the frequency of Elasticsearch snapshots and their associated alerting
Add new AWS users to our SSO provider and configure their permissions
Research installation of a new Jenkins instance with the latest version
Review Python code that adds Slack notifications to a Django-based admin site

This description is general in nature and is not intended to be an exhaustive list of all responsibilities. Other duties may be assigned as needed to meet company goals.

Typical Minimum Requirements

10+ years of experience building and supporting infrastructure at scale
All candidates must be authorized to work in the U.S. No sponsorship or relocation is available for this position

Required Technical Experience:

Ansible/Python
AWS/GCP/Azure Experience (preferably AWS)
Monitoring Experience
CI/CD Experience
Kubernetes Administration

Critical Requirements

A history of implementing and troubleshooting large-scale distributed systems
Experience troubleshooting complex systems, including the operating system, network, and application code
Significant experience with Docker and Kubernetes
Experience with the ELK stack
Proficiency in Java, Python, Perl, Ruby or another high-level programming language
Experience implementing and troubleshooting Linux systems

Additional Skills:

Health care (EHR) familiarity
Working in regulated environments
Hands-on experience with public cloud infrastructure (AWS and Azure)
Proficiency in Java, Scala, Groovy or another JVM-based language
Hands-on experience with Hadoop and related technologies
Experience implementing and troubleshooting large-scale distributed systems

Education/Training

4-year degree in Computer Science / Computer Engineering or related field, or equivalent experience

Working Conditions

Remote office location
General Office Duties

McKesson Total Rewards

McKesson believes superior performance – individual and team – that helps us drive innovations and solutions to promote better health should be recognized and rewarded. We provide a competitive compensation and benefits program to attract, retain and motivate a high-performance workforce, and it’s flexible enough to meet the different needs of our diverse employee population.

This is a full-time, salaried position with an expected salary range of $180,000 to $240,000. A competitive salary is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as alignment with market data.

Other Compensation: This position may be eligible to participate in the annual management incentive plan
Paid time off subject to eligibility, including paid parental leave, vacation, sick, and bereavement
Other benefits, subject to elections, eligibility, and collective bargaining agreement terms: Medical, Dental, Vision, Disability, Health and Dependent Care Reimbursement Accounts, Employee Assistance Program (EAP), Insurance (Accident, Group Legal, Life), 401k and Stock Purchase Programs

McKesson is an Equal Opportunity/Affirmative Action employer.

All qualified applicants will receive consideration for employment without regard to race, color, religion, creed, sex, sexual orientation, gender identity, national origin, disability, or protected Veteran status. Qualified applicants will not be disqualified from consideration for employment based upon criminal history.

McKesson is committed to being an Equal Employment Opportunity Employer and offers opportunities to all job seekers, including job seekers with disabilities. If you need a reasonable accommodation to assist with your job search or application for employment, please contact us by sending an email to McKessonTalentAcquisition@mckesson.com. Resumes or CVs submitted to this email box will not be accepted.

Current employees must apply through the internal career site.

Join us at McKesson!
