Comcast Sr. Site Reliability Engineer in Philadelphia, Pennsylvania
Comcast brings together the best in media and technology. We drive innovation to create the world's best entertainment and online experiences. As a Fortune 50 leader, we set the pace in a variety of innovative and fascinating businesses and create career opportunities across a wide range of locations and disciplines. We are at the forefront of change and move at an amazing pace, thanks to our remarkable people, who bring cutting-edge products and services to life for millions of customers every day. If you share in our passion for teamwork, our vision to revolutionize industries and our goal to lead the future in media and technology, we want you to fast-forward your career at Comcast.
Comcast's Technology & Product organization works at the intersection of media and technology. Our innovative teams are continually developing and delivering products that transform the customer experience. We work every day to make a positive impact through innovation in the pursuit of building amazing products that are enjoyable, easy to use, and accessible across all platforms. The team also develops and supports our evolving network architecture, including next-generation consumer systems and technologies, infrastructure and engineering, network integration and management tools, and technical standards.
As a member of the Services Engineering and Delivery (SED) team you will work on a team of multidisciplinary engineers to produce mission-critical infrastructure, tools, and processes that enable our systems to scale at a rapid pace. One day might involve performance tuning of a Java web application; the next may be building tools to enable continuous delivery. You'll investigate and create new systems for scaling development and production. As a senior member of the team you will be expected to work with management, peers, and customers to define and implement the technical vision of the team. You will also work directly with Software Engineering teams to build our next-generation, cloud-based microservice architecture.
You're right for the job if you're comfortable with deep technical topics in Linux, networking, and distributed architectures. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization.
Where are we headed?
Our goal is to build, scale, and guard the systems that delight customers. To do so, you will need strong skills in the following areas:
- Building tools and alarms that warn of potential problems or customer issues
- Adapting what exists and building what doesn't to scale the system
- Building tools and developing processes for continuous integration and delivery of infrastructure services
- Developing automation systems that maintain system health
Site Reliability Engineering / Operations
- Root-cause analysis of complex problems involving multiple parties, networks, hardware and software that relate to scaling and performance
- Participating in an on-call rotation
- Engendering reliability and availability starting with metrics and measurements
- Enabling scaling by providing tools, developing training, or augmenting processes
- Securing systems and platforms from issues, be they real, perceived or notional
- Obsessing over collecting and digesting metrics
- Working with configuration management tools such as Ansible, Chef or Puppet
- Building tools for automation (building, testing, releasing, monitoring and alarming)
- Leveraging container orchestration systems such as Docker Swarm, Mesos/Marathon, Kubernetes, or Amazon EC2 Container Service
- Operating distributed systems such as Consul, etcd, ZooKeeper, Elasticsearch, or Cassandra
Additional responsibilities may include:
- Responsible for building, managing, operating, and continuously improving Systems, Storage, Database, and/or Tools Infrastructure that support Comcast's customer facing applications, back office, and provisioning infrastructure in a 24/7 environment.
- Focuses on architecting, building, deploying, and stabilizing code, services, systems, and tools.
- Drives standardization and service focused instrumentation.
- Provides subject matter expertise.
- Resolves break/fix scenarios, engaging broader teams as necessary, and partners with or leads vendors to achieve continuous improvement.
- Contributes to command-and-control activities focused on rapid restoration of complex outages and communication across Comcast.
- Works with and directly leads external vendors, third parties, and associated agencies when necessary to address issues across the infrastructure.
- May participate on 24/7 on-call rotation.
- Acts as a technical expert in own area within the organization.
- May work independently or as part of a team on more complex projects.
- Provides mentoring and guidance to more junior team members.
- Develops solutions for very complex and wide reaching systems engineering problems.
- Sets new policies and procedures to handle future issues.
- Creates systems engineering and architectural documentation to be used by others to build and maintain systems.
- Operating Systems and Disk Management responsibilities (if applicable): Provides in-depth knowledge of Operating System internals to aid in troubleshooting complex problems. Acts as an expert on at least one supported Operating System. Mentors and trains more junior team members on Operating System concepts, configuration, tuning, and troubleshooting techniques. Creates complex automation scripts in Bash, Python, Go, or similar. May manage servers remotely in a distributed environment.
- Database Platform Management responsibilities (if applicable): Masters understanding of database concepts, availability, performance, usage and configuration. Sets up, troubleshoots, and tunes complex standard and non-standard replication. Uses knowledge of existing database platforms to evaluate and recommend new technologies. Uses database knowledge to solve issues on unfamiliar products. Creates and maintains database policies, standards, and overall documentation including availability, replication, and backup and recovery policy, service level agreement, baseline architecture, change management, access to production, unsupported HW/SW, security and audit violations, and risk acceptance.
- Storage and Backup responsibilities (if applicable): Masters understanding of storage concepts, availability, performance, usage and configuration. Sets up, troubleshoots, and tunes complex SAN software issues. Uses knowledge of existing storage platforms to evaluate and recommend new technologies. Uses storage knowledge to solve issues on unfamiliar products. Creates and maintains policies, standards, and overall documentation including availability, backup and recovery, service level agreement, baseline architecture, change management, access to production, unsupported HW/SW, security and audit violations, and risk acceptance.
- Scripting and Development responsibilities (if applicable): Expertly develops software in several modern languages. Develops large/complex database-backed systems and has a solid understanding of DB schema and query performance. Given a broad set of goals, can create detailed requirements, technical design specifications, and LOE analysis. Designs modular systems to be co-developed by teams of less experienced developers. Designs horizontally-scalable solutions with innovative use of storage and networking including solid APIs for integration with other systems. Utilizes professional best practices in day-to-day work, such as revision control and unit testing. Applies statistical data analysis techniques.
- Networking responsibilities (if applicable): Recommends or helps architect an entire system, including network design and topology. Acts as an expert in capturing and interpreting traffic with tcpdump, snoop, and other network sniffers. Understands and applies knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).
- Application Technologies (Web Servers, J2EE, Applications Servers) responsibilities (if applicable): Provides expert recommendations and advice to the team and/or department in the areas of web services, OS, and storage, including being an active liaison to Development, QA and the Business. Provides scaling, design, costing, troubleshooting, and impact analysis consultation.
- Analyzes systems and makes recommendations to prevent possible problems. Takes lead on issue resolution activities using knowledge of complex and company-wide systems.
- Leads end-to-end audit of monitors and alarms based on subsystem knowledge. Takes the lead on defining the requirements for new tools required for operations.
- Utilizes time management and project management skills to lead the resolution of issues in a timely and organized manner, effectively communicating necessary information. May consult directly with clients or third party vendors; provides subject matter expertise.
- Consistent exercise of independent judgment and discretion in matters of significance.
- Regular, consistent and punctual attendance.
- Other duties and responsibilities as assigned.
- 7 years in a software development role, operations role, or closely related position
- Experience administering Linux systems in a production environment
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
- Bachelor's Degree in Computer Science or a related field, or relevant work experience
- Excellent problem solving skills with a strong attention to detail
- Experience with distributed version control like Git or Mercurial
- Ability to dive deep into complex technical problems
- Experience with IaaS and PaaS providers such as AWS, OpenStack, Heroku, or CloudFoundry
- A sense of ownership, initiative, and drive
- Experience with enterprise monitoring solutions like AppDynamics, Graphite, InfluxDB, Prometheus, or Splunk
Comcast is an EOE/Veterans/Disabled/LGBT employer