Job Description
Invest Super Family is looking for a Site Reliability Lead to drive its reliability engineering initiatives. Invest Super Family products and journeys directly enable Vanguard’s engine #1 (mutual funds & ETFs) and engine #2 (advice) and underpin the client experience for all clients (advised and self-directed). The Invest Super Family, in conjunction with the brokerage platform, is poised to execute on a series of initiatives that will power RIG into the future.
Are you an engineer who loves to solve complex, impactful problems? Are you passionate about finding opportunities to improve system performance and efficiency, scalability, fault tolerance, and self-healing capabilities? Do you want to apply chaos engineering principles and creatively experiment with our systems to discover hidden weaknesses and make our systems resilient? Are you obsessed with understanding a system's inner state, the interactions between systems, or observability-driven development? If so, this role may be for you!
As a hands-on technical leader, you will work closely with product teams, the Retail core SRE team, architecture, and the CTO/GTO resiliency org to solve reliability problems across Invest and improve our systems’ “-ilities”. You will also help define, maintain, and enforce subdivisional reliability engineering standards, contribute to enterprise-wide libraries for reliability, and influence both product teams and SRE practitioners.
The ideal candidate will be a strong conceptual and systems thinker, familiar with resiliency architecture principles, observability frameworks, AWS services, and Vanguard platform abstractions. This person will be expected to roll up their sleeves to solve localized resiliency and scalability problems, contribute to shared libraries, and build automation to remove repetitive tasks. This person will influence and help shape the Invest resiliency strategy to make the Invest Super Family reliable and resilient!
Core Responsibilities:
- Drives observability maturity.
- Partners with key stakeholders and collaborates with internal teams to evaluate the health, stability and reliability of systems/platforms. Provides subject matter expertise and consultation on complex architecture and programming design decisions related to availability and resiliency.
- Leads localized failure mode analysis when new features and architecture patterns are introduced. Facilitates post-incident reviews for any client-impacting events local to the product family. Develops and aggregates data to report back to senior leadership.
- Leads the planning and execution of high-impact chaos experiments to meet the development and maintenance requirements of complex systems/platforms for the product family or families. Coordinates performance tests for the product family or families.
- Leads product teams in triage and troubleshooting during complex client-impacting incidents. Regularly attends and contributes to Reliability Engineering and Resilience communities of practice. Remains informed about site reliability engineering activities happening within the subdivision.
- Ensures alignment between service level indicators and objectives within the product family. Develops and communicates new standards and newly available tools and frameworks across subdivisions. Enforces reliability standards.
- Develops and maintains product-level runbooks for incident response, in collaboration with SRE Practitioners on each product team, to document the step-by-step process to recover from failures of specific components within a system. Makes final decisions regarding usage of tools, libraries, and standards for SRE in situations where multiple options have been provided by SRE.
- Participates in special projects and performs other duties as assigned.
What it Takes:
- Minimum of 8 years of related work experience, with at least 5 years of software development experience in one or more programming languages.
- Minimum of 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems and 2 years of experience with data structures or algorithms.
- Undergraduate degree in Computer Science or equivalent combination of training and experience. Graduate degree preferred.
- Full stack development: JDK 8+ preferred, with Spring Boot, REST APIs, multithreaded and multiprocessing applications, and GraphQL. Experience with UI development (familiarity with Angular, TypeScript, Node.js, etc.) is a plus.
- Ability to diagnose and resolve problems in high-throughput applications.
- Experience with one or more observability frameworks or tools, such as OpenTelemetry (Java, JavaScript, etc.), CloudWatch, Grafana, or Splunk.
- Exposure to *nix environments including some shell script development and basic command execution.
- Strong understanding of database principles and working knowledge of distributed storage and infrastructure solutions.
- Experience with microservices architectures and container management (such as Docker) in cloud and on-premises infrastructure.
- Working knowledge of AWS network foundations, application networking, edge, and network security.
- Passionate about building and fostering good engineering practices and processes.
- Excellent communication and documentation skills.
Special Factor
Vanguard is not offering visa sponsorship for this position.
About Vanguard
We are Vanguard. Together, we’re changing the way the world invests.
For us, investing doesn’t just end in value. It starts with values. Because when you invest with courage, when you invest with clarity, and when you invest with care, you can get so much more in return. We invest with purpose – and that’s how we’ve become a global market leader. Here, we grow by doing the right thing for the people we serve. And so can you.
We want to make success accessible to everyone. This is our opportunity. Let’s make it count.
Inclusion Statement
Vanguard’s continued commitment to diversity and inclusion is firmly rooted in our culture. Every decision we make to best serve our clients, crew (internally employees are referred to as crew), and communities is guided by one simple statement: “Do the right thing.”
We believe that a critical aspect of doing the right thing requires building diverse, inclusive, and highly effective teams of individuals who are as unique as the clients they serve. We empower our crew to contribute their distinct strengths to achieving Vanguard’s core purpose through our values.
When all crew members feel valued and included, our ability to collaborate and innovate is amplified, and we are united in delivering on Vanguard’s core purpose.
Our core purpose: To take a stand for all investors, to treat them fairly, and to give them the best chance for investment success.
Future of Work
During the pandemic, we transitioned to a work-from-home model for the majority of our crew, and we continue to interview, hire, and onboard future crew remotely.
As we have developed the path forward, we have taken a thoughtful approach that maximizes both the advantages of working remotely and the many benefits of coming together and collaborating in a shared workspace. We believe that in-person interactions among our crew are important for preserving our unique culture and advantageous for the personal development of our crew.
When our crew return to the office, many will work in our hybrid model. A smaller proportion of our crew will operate in the Work from Home model (for example, field sales crew) or in the Work from Office model (for example, portfolio managers).
The working model that your role falls into will be communicated to you in the interview process – please do ask if you are unsure. We encourage you to make the decision regarding your job interview and offer knowing which model your role will fall into. We will test and learn as our ways of working evolve and will continue to evaluate working models along the way.
Vanguard, one of the world’s largest investment management companies, serves individual investors, institutions, employer-sponsored retirement plans, and financial professionals. We have a diverse and talented crew with a culture that promotes teamwork, along with an unwavering focus on serving our clients’ best interests.