Site Reliability Engineer (SRE)

Engineering  |  Berkeley, San Jose  |  Full Time

 

Nefeli Networks is an award-winning early-stage startup in Silicon Valley working to bring distributed cloud and Lean NFV to market. This is an opportunity to get in early as we begin to ramp revenue and expand our offering. You'll be working with a talented team of innovative developers, making key new contributions to the cloud and networking fields.

We are looking for a highly motivated DevOps/Site Reliability Engineer to join our exceptional team. The right candidate is ready to design, automate and support our cloud infrastructure and back-end systems, and to handle technical integration with our partners. The ideal candidate has some experience operating and supporting networking solutions and is familiar with automation tools and processes.

As an SRE at Nefeli, you will play a critical role in helping us shape our software stack and hardware infrastructure. Your knowledge of design, analytics, development, coding, testing and application programming will help our development team meet customers' business and functional requirements. You will also be instrumental in deploying systems at customer sites.


Responsibilities:

  • Improve the whole product lifecycle through inception, design, deployment, operation and refinement
  • Design, build and operate cloud infrastructure to enable reliable and rapid deployment of microservices with effective monitoring and resilient operations
  • Work with development teams to make sure applications are production ready, scalable and reliable from the ground up
  • Identify and drive opportunities to improve automation for code deployment, management and visibility of application services
  • Develop tools and frameworks to automate operational tasks and the deployment of machines, services and applications
  • Write automation code for provisioning and operating infrastructure at massive scale
  • Establish end-to-end monitoring and alerting on all critical components of the applications, including availability, latency and overall system health
  • Participate in the on-call rotation supporting the platform and/or the production application
  • Lead root-cause and corrective-action analysis of critical business and production issues
  • Develop standard methodologies for infrastructure orchestration and for troubleshooting application services in production
  • Represent DevOps/SRE in design reviews and work with Engineering teams on operational readiness

Technical Qualifications:

    • BS in Computer Science, Engineering or a related field, or equivalent professional experience
    • Experience with Unix/Linux operating systems internals and administration
    • Good understanding of networking technologies such as SDN, NFV and SD-WAN, and sound knowledge of Ethernet switching and routing technologies
    • Good understanding of server and network virtualization, global infrastructure, distributed systems, load balancing and security
    • Experience with at least one configuration management solution, and hands-on experience with server virtualization (e.g., VMware ESXi, KVM, Hyper-V)
    • Expertise in configuration management with a framework such as Ansible, Chef, Puppet or Terraform
    • Experience in AWS, Azure or GCP cloud computing and its related services
    • Strong fundamentals in HTTP, including HTTP headers and Process and System API services; experience working with third-party RESTful APIs
    • Experience with Python, Go and/or C++
    • Experience with CI/CD pipelines, GitHub and Jenkins
    • Ability to debug and optimize code
    • Passion for automation and monitoring instrumentation in the code
    • Knowledge of best practices related to security, performance, and disaster recovery

Other Qualifications:

  • Ability to communicate effectively and succinctly
  • Strong, systematic problem-solving skills and the ability to work amid ambiguity
  • Excellent written and verbal communication skills, with the ability to collaborate and rally support
  • Excellent interpersonal skills and the ability to work well in a team
  • Self-disciplined, self-managed and self-motivated, with a strong sense of ownership, urgency and drive; a positive attitude and the ability to quickly learn new technologies and effectively manage parallel projects
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
  • Passion for learning, understanding and dissecting new technologies quickly and independently

Preferred Qualifications:

  • 5+ years of related experience
  • Experience with modern logging/reporting tools such as Prometheus
  • Experience with networking (e.g., TCP/IP, routing, network topologies and hardware, SDN, NFV)
  • Experience implementing monitoring tools such as Grafana, collectd and Zabbix
  • Experience with etcd, NoSQL and time-series databases
  • Proven experience working with customers and vendors
  • Proven leadership of small informal teams


Ready to get started? Contact us today.