Existing Skills, New Frontier: How Ops Pros Will Thrive in the Age of AI

The headlines are buzzing with Machine Learning (ML) and Generative AI (GenAI), and it’s easy to think you need a whole new toolkit to stay relevant. While these technologies are transformative, the good news is that many of your existing skills in technology operations are not just relevant, they’re crucial. The game is changing, but the foundations you’ve built are more important than ever. Let’s explore how.

What is it?

Technology systems operations are the backbone of any digital endeavour. Think of it as the foundation of a building: it keeps everything built on top of it safe, steady, and secure. This spans deploying and maintaining infrastructure, managing networks, ensuring data integrity, and responding to incidents. Why is it important? Without robust operations, even the most groundbreaking AI models would falter: they would lack the stable, scalable, and secure environment they need to perform, learn, and deliver value. In essence, technology operations ensures the technology actually works for the business, day in and day out.

What does it mean from a business perspective?

The rise of ML and GenAI presents both exciting opportunities and new challenges for current operations teams. It’s not about throwing out old skills but adapting and expanding them. Here’s what it means:

  • Understand Costs by Evolving Capacity Management: Remember when CPU, RAM, disk, and network were your primary capacity concerns? While those still matter, ML and GenAI workloads often demand significant Graphics Processing Unit (GPU) power. Capacity planning now needs to incorporate GPU availability, utilisation monitoring, and an understanding of the specific hardware requirements of different AI models. You’re not just managing servers; you’re managing highly specialised compute resources.
  • Streamline Operations by Giving CI/CD Pipelines a Makeover: Continuous Integration and Continuous Deployment (CI/CD) pipelines are still vital, but they need to adapt. Your pipelines will now manage not just code, but also data versions, model training, model versioning, and continuous monitoring of model performance in production, evolving into CI/CT/CD pipelines (where CT stands for Continuous Training). It’s about ensuring reproducibility and reliability for data-driven applications.
  • Manage Risk as Security Adapts to New Threats: Security remains paramount, but the attack surface expands with AI. You’ll be looking at securing training data to prevent poisoning, protecting intellectual property embedded in models, ensuring ethical AI use, and managing novel vulnerabilities. Your security lens needs to widen to encompass the unique aspects of AI.
  • Understand Operations as Monitoring and Observability Go Deeper: Traditional monitoring of CPU, memory, disk and network is still necessary, but for AI, you also need to monitor model drift (how a model’s performance changes over time as new data comes in), prediction accuracy, and data pipeline integrity. Observability needs to provide insights into the “why” behind a model’s behaviour, not just its operational status.
  • Train Your Team – Upskilling and Cross-Skilling are Key: Your team’s foundational knowledge in networking, infrastructure, and security is invaluable. The next step is to build upon this with AI-specific operational knowledge. This could involve training on MLOps tools, understanding the basics of machine learning workflows and GenAI, and learning about new security considerations for AI.
  • Ensure That Governance and Compliance Adapt: AI-specific regulatory requirements are still emerging, so stay on top of them and start building systems and processes with them in mind, even if they are not fully in force today.
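To make the GPU capacity point concrete, here is a minimal sketch of how existing monitoring skills carry over. It assumes an NVIDIA driver with `nvidia-smi` on the PATH (the query flags shown are standard for that tool, but verify them against your driver version); the parsing and alerting logic is ordinary ops code, and the 90% thresholds are illustrative, not recommendations.

```python
import csv
import io
import subprocess


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu` CSV output into per-GPU dicts."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        util, mem_used, mem_total = (float(x) for x in row)
        stats.append({
            "util_pct": util,
            "mem_used_mib": mem_used,
            "mem_total_mib": mem_total,
            "mem_pct": 100.0 * mem_used / mem_total,
        })
    return stats


def saturated_gpus(stats, util_threshold=90.0, mem_threshold=90.0):
    """Indices of GPUs over either threshold -- candidates for a capacity alert."""
    return [i for i, s in enumerate(stats)
            if s["util_pct"] >= util_threshold or s["mem_pct"] >= mem_threshold]


def sample_gpus() -> list[dict]:
    """Shell out to nvidia-smi (only works on a host with an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

The point is less the code than the shape: the same collect-parse-threshold-alert loop you already run for CPU and disk applies directly to GPUs.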
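The CI/CT/CD idea above can be sketched in a few lines. This is a toy illustration, not a real pipeline: `train_fn` and `eval_fn` are hypothetical stand-ins for your training and evaluation steps, and the regression tolerance is an arbitrary example value. It shows the two additions CT brings over classic CI/CD: versioning the data alongside the code, and gating deployment on model quality rather than just passing tests.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Stable content hash for a dataset or config -- the 'data version' in CI/CT/CD."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]


def ct_gate(new_metric: float, baseline_metric: float, tolerance: float = 0.01) -> bool:
    """Deployment gate: promote the retrained model only if it does not
    regress more than `tolerance` against the current production model."""
    return new_metric >= baseline_metric - tolerance


def pipeline_run(dataset, train_fn, eval_fn, baseline_metric):
    """One CI/CT/CD cycle: version the data, retrain, evaluate, gate."""
    data_version = fingerprint(dataset)   # reproducibility: tie the run to its inputs
    model = train_fn(dataset)             # CT: training is a pipeline stage
    metric = eval_fn(model)               # evaluation replaces 'tests passed'
    return {"data_version": data_version,
            "metric": metric,
            "deploy": ct_gate(metric, baseline_metric)}
```

Real MLOps platforms wrap exactly this loop in richer tooling, which is why the skills transfer so directly from traditional release engineering.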
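Model drift, mentioned above, can also be measured with familiar tools. A minimal sketch of one widely used drift signal, the Population Stability Index (PSI), computed over a model's prediction scores; note the binning scheme here is simplified and the 0.1/0.25 interpretation mentioned below is a common rule of thumb, not a standard.

```python
import math


def psi(expected: list[float], observed: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample (e.g. scores at
    training time) and a recent production sample. Higher = more drift."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # small floor avoids log(0) for empty buckets
        return [max(c / len(values), 1e-6) for c in counts]

    e, o = bucket_fractions(expected), bucket_fractions(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))
```

A common rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as worth watching, and above 0.25 as significant drift, at which point the retraining and incident-response muscles your team already has come into play.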

What do I do with it?

Here are some concrete actions you can take to navigate this evolving landscape:

  • Assess Your Current Skillset: Identify the strengths within your existing operations team. Where do your current skills in capacity management, CI/CT/CD, and security naturally align with the needs of ML and GenAI?
  • Invest in Targeted Training: Look for training programs and certifications focusing on MLOps, AI infrastructure management (including GPU-based systems), and AI security. Encourage your team to explore these areas.
  • Start Small and Iterate: You don’t need to transform everything overnight. Begin by operationalising smaller, less critical AI projects. This allows your team to learn and adapt in a lower-risk environment.
  • Foster Collaboration: Break down silos between your operations teams, data scientists, and ML engineers. A collaborative approach is essential for successfully deploying and managing AI systems.
  • Explore New Tooling: Familiarise yourself with platforms and tools designed to automate and manage the AI lifecycle. Many of these build upon concepts you already know from traditional DevOps (and in most cases are likely just extensions to the systems you are already familiar with).
  • Stay Curious and Adaptable: This goes without saying – the world of AI is moving fast. Cultivate a mindset of continuous learning, be prepared to adapt your strategies as the technology and best practices evolve, and lay the groundwork now for legislative compliance.

The integration of Machine Learning and Generative AI into mainstream business operations isn’t a signal to discard your hard-earned operational expertise. Instead, it’s an invitation to enhance and evolve it. The skills that keep systems secure, reliable, and efficient are more critical than ever. By understanding the nuances of AI workloads and proactively upskilling, you and your team can become indispensable leaders in this exciting new era.

Let’s not just support AI, let’s operationalise it.


Further Reading

Machine learning operations (Microsoft)

The Backbone of AI Evolution: CI/CD Pipelines for Generative Models (Naman Vyas)