29 March 2023
Meet Uptime Labs, the startup recreating realistic IT system failures

As part of our quick-fire questions series – or QFQs – we spoke to Hamed Silatani, cofounder and CEO of Uptime Labs, about using AI to recreate realistic IT system failures, chain link logic and the importance of hiring people that are better than yourself.

The real-life challenge at work. For over a decade, I was responsible for uptime and operational resiliency in a global financial firm in a heavily regulated environment. Every minute of downtime had a significant financial, reputational, and regulatory cost for us. As a result, there was massive pressure on the engineers responsible for keeping the trading system running. Like many other engineers in charge of the availability of mission-critical systems, I remember freezing completely under pressure, not being able to do the simplest tasks, feeling guilty about letting down the business because I couldn’t fix the systems quickly enough, crying on a couple of occasions, and suffering from symptoms of stress in the form of physical pain. Then, by coincidence, I observed medical staff preparing a patient for an emergency operation. The level of coordination and pace under such high-stress circumstances (someone’s life at stake) blew my mind. I thought the IT industry had something to learn from other safety-critical industries. The second aha moment for me was when I realised that in safety-critical industries, a vast amount of investment goes into keeping staff prepared and ready to respond to emergencies.

In contrast, IT engineers have to learn incident response on the job at the cost of their employer, their customers, and their mental health. In the IT industry, a lot of money goes into monitoring and troubleshooting tools, but engineers still learn how to troubleshoot on the job. So I thought, in my line of work, it was time to invest in people to match the advancement in tooling.

My observations were a topic of discussion with two other colleagues, who later became founding members of Uptime Labs, both passionate and experienced in IT incident response. I clearly remember it was a Thursday evening in July 2021 when we decided to take 3 days off from work and run a design sprint to work out what a solution to this problem for the IT industry could look like. The 27th, 28th, and 29th of July 2021 then became the genesis of Uptime Labs.

The third element of the catalyst was the considerable support and encouragement from a friend who had a successful DevOps consultancy. We took the idea to him to get his views. He could immediately see our vision and joined the journey as a founding member.

Tell me about the business – what it is, what it aims to achieve, who you work with, how you reach customers and so on?

Uptime Labs is a realistic IT incident simulation platform. We leverage conversational AI and generative AI to recreate realistic IT system failures with all the socio-technical parameters that happen in real life, such as anxious executives, inexperienced engineers, and pressure from customer services. Our platform allows users to customise the tooling in the simulation to match what they use at work in real life. Each incident drill takes about 30 minutes, and users can play independently or as a team at whatever time suits them. We’ve worked hard to make the experience engaging and fun, so users see it more as an entertaining pastime than a work duty. Once users achieve a certain level of competency, they get a license-to-operate certificate, giving them confidence that they are ready to tackle complex IT failures at work; it also assures senior IT managers that their team is prepared. However, license-to-operate is not a permanent accreditation; as in sports, users need to practise continuously to maintain their level of readiness.

We aim to mitigate IT system failures’ financial, reputational, regulatory and human costs. We keep IT teams at the highest level of readiness to respond to inevitable IT system failures and support senior IT managers to benchmark, develop, and prove their organisation’s IT incident response capability.

We work with senior IT managers who bear the heavy responsibility of availability, frontline engineers who save the day when things break, and IT executives who are affected by IT system failures’ financial, reputational, and regulatory impact. Any business that is heavily dependent on technology is our customer. We are currently focusing on FinTech and SaaS.

At this early stage, we are keen to work with visionary customers who believe what we believe. We see them as partners to build an awesome product from which the rest of the market will draw value. So we figured the best way to find visionary people, who are passionate about IT incident response and have a problem to solve, is by creating a community for them to share their views and learn from each other. Therefore, we started a community called OOPS (Outage OPerationS meetup). We also regularly write about the challenges of IT incident response and trigger deep conversations.

How has the business evolved since its launch? When was this?

We incorporated in November 2021 with three part-time employees (weekends and evenings alongside a full-time job). We closed our pre-seed round in May 2022 and created our core full-time team. By July, we had launched our basic working product and started a busy campaign of field trials. The product had improved enough by September to give us the confidence to go out to market and try to sell it. We signed our first customer (a consumer brand) through the OOPS community in November. We onboarded our second customer (a global forex trading firm) in December. In February, the third customer (a global FTSE250 financial trading platform) started using our platform. In January and February, we had exciting breakthroughs with our AI models, promising to significantly increase our level of automation for designing and delivering incident drills. This is excellent news for our scalability.

Tell us about the working culture at Uptime Labs

  • We do what we love with people we like working with.
  • Hire people that are kinder, smarter, more ambitious, and better learners than us.
  • Always keen to learn from customers.
  • We put a lot of thinking and effort into identifying the biggest and most urgent challenge we face, working with our advisors to develop policies to tackle the challenge and using OKRs to deliver outcomes in line with our policies. I find this approach very powerful; it gives a lot of freedom to the team to use their creativity to overcome challenges and draw satisfaction.
  • We drive our decisions using data and experiments.
  • Love-by-design culture (inspired by the work of Alan Mulally – The Knowledge Project podcast, episode 151) – an environment that is obsessed with delivery and, at the same time, celebrates accountability and asking for help.
  • We care about quality in everything we do, not limited to product.
  • Learning and a growth mindset are hugely important to us, so we have a book club and a dedicated personal development opportunity.

How are you funded?

We approached IT engineers and senior managers to validate our product idea early on. Many of them loved the vision and asked to invest. We closed our pre-seed round in May 2022 without running any fundraising campaign. Most people on our cap table are from the IT industry. We also have a revenue stream from customers.

What has been your biggest challenge so far and how have you overcome this?

Identifying the most urgent and the most critical challenge to focus on. There are many important and valid activities, but not all are urgent and relevant to our business stage. This is a considerable risk because you can quickly spread yourself and the team thinly, lose focus, and burn your limited capital inefficiently.

The most crucial step to overcoming this challenge is being aware of it. So we talk to founders who are ahead of us, read failure stories, take advice from mentors, read books, and constantly analyse the realities of the market to identify our most urgent challenge. Four books that really helped us are Radical Focus (Christina Wodtke), Crossing the Chasm (Geoffrey A. Moore), Good Strategy/Bad Strategy (Richard Rumelt), and Find Your Market (Étienne Garbugli).

How does Uptime Labs answer an unmet need?

Chain link logic is in play when recovering from complex IT system failures. Highly skilled engineers must operate advanced monitoring and troubleshooting tools to restore service quickly. While there is heavy investment in the tools, the people aspect is largely ignored. Uptime Labs focuses on the workforce’s skillset, building their incident response muscle memory and their ability to use tools effectively and work as a team during high-stress situations. IT executives spend a lot of money on monitoring tools ($3.5B global spend on infrastructure monitoring in 2022, according to a report) only to see engineers struggle to use the tools effectively and work as a team during high-stress incident response situations. The reason is that IT engineers, unlike their counterparts in other safety-critical industries, are still learning incident response on the job.

We also give senior IT managers a unique insight into the team’s skillset and ability to recover from complex IT incidents. Senior management can benchmark, develop, and prove their teams’ incident response capabilities for the first time.

What’s in store for the future?

At an unprecedented pace, every business is becoming an IT business. IT failures impact our money, commute, communications, health, etc. We see regulators, starting with financial services, recognise the huge cost of IT system failures. Regulators and industry are increasingly aware that human factors are vital to the operational resiliency of IT systems. The EU recently passed the Digital Operational Resilience Act (DORA), which holds managers accountable for taking a holistic view of resiliency, including people and processes. It is only a matter of time before IT operators, like pilots and firefighters, are expected to maintain a high level of incident response competency by obtaining and keeping a license-to-operate.

Uptime Labs is committed to the human element of IT incident response. We want IT engineers and senior IT managers to have confidence and assurance that they are well-equipped to respond to IT system outages. In addition, we want our customers’ senior executives and customers’ customers to be proud and pleased with the effectiveness and professionalism of our customers’ handling of outages.

We have a long way to go; globally, over 28M IT engineers and operators can be called in to fix outages. We have to help every one of them become world-class troubleshooters. Hundreds of thousands (and counting) of businesses also run mission-critical IT systems, and we need to support them in adopting a human-centric approach to IT incident response.

Our conversational AI and generative AI models are very young. There is so much more we can do to recreate more complex failure scenarios and human interactions to enhance the depth and speed of skill acquisition for our end users. We are teaming up with world-class universities to solve these hard problems.

The IT industry’s understanding of human actors’ behaviour during high-stress, high-paced IT outages is very basic. Simple questions, like what makes a good incident responder, are still a huge point of debate. Answering critical questions like this requires research and investment.

What one piece of advice would you give other founders or future founders?

I’m unsure I can give advice, but I’ll share what helped me.

Future founders: It’s a small world; help people around you as much as possible without any expectations. You’ll get a helping hand from the most unexpected people when you need it the most. In particular, help young businesses; some of these young businesses will become giants and investors of tomorrow and will hold your hand when you are in need.

Other founders: entrepreneurship is a long journey and can sometimes be lonely and challenging. Invest time in your family (at an early stage, this includes the team; we are family) and do not defer it to “once I’ve raised my Series A”… You are investing in a strong base that will keep you going.

And finally, a more personal question! What’s your daily routine and the rules you’re living by at the moment?

I wake up early for a couple of hours of focused work before my son wakes up. I can’t focus without music, so I use the time to listen to decent rock music from the 70s, 80s, and 90s.

8am to 9am is quality time with my wife and son in the run-up to school time.

I have a theme for each day of the week: orientation, big thinking, sales, product, and fundraising. The themes change every quarter depending on our strategic focus. In startup life, you are never short of things to do; without discipline, I feel I can quickly lose focus.

I switch to family time after school pickup; most of the time, I take my son to some sort of after-school activity.

I get to spend a couple of hours working in the late evening, and I use the time to make sure learnings from the day are captured and follow-up actions are done.

I love reading paper books. So I dedicate 2 hours a week to reading books away from the screen.

Rules I live by:

  • Have fun. We started Uptime Labs to do the work we love with people we enjoy working with. Both parts are very important to me, and I make sure the rest of the team feels the same.
  • Learn all the time. I read one book each quarter. I also take every opportunity to learn from others’ experiences.
  • Inertia kills innovation. I actively break routines, clear our work backlog, and delete recurring calendar invites at the end of each quarter. Then, as a team, we decide what routines we want to keep and what meetings we need to support our goals.
  • Build an awesome product that meets customer needs. We are obsessed with understanding our customers. In practice, that means creating opportunities to interact with customers, capturing insights from those interactions, and translating them into product experiments.
Hamed Silatani is cofounder and CEO of Uptime Labs.