With Mike Knoop, Co-Founder & Head of AI at Zapier
Key Takeaways
LLMs are already transforming Zapier's business. They're seeing 100x productivity improvements on internal processes that drive user acquisition, and deflecting a large volume of customer support requests by improving the handling of user-facing error messages. But most interesting is that Zapier views LLMs as the "escape hatch" UX paradigm that will enable it to grow from 10M users today to 100M users over the next decade.
Topics Covered
- Mike's journey to taking the role of Head of Zapier AI
- In the 10 years since co-founding Zapier, I've done every role, across engineering, design, management, etc.
- Throughout the whole journey, it's always been trying to focus on "what are the areas of the company that nobody else is focusing on right now?"
- First encounter with LLMs was with GPT-2, a few years ago
- Did an internal presentation on GPT-3 possibilities/use cases, but we didn't think it was quite useful enough, or we didn't use our imagination well enough
- However, a year ago last summer, I read the chain-of-thought paper, and it was the first thing that convinced me that these models have some pretty large capability overhangs -- capabilities that have been sitting there since GPT-3 with nobody knowing about them
- Then spent the last 12 months trying to understand the potential: what else can they do that nobody knows they can do?
- First thing you encounter: how do you equip them with tools? This fits very well into the Zapier paradigm -- we have 50k actions in the Zapier platform (see the sketch after this list)
- Effectively quit their day jobs (Mike and co-founder Bryan) and spent 6 months trying to prototype LLM-based product experiences; moved through ideas pretty quickly
- After touring many ideas, ultimately settled on the intersection of models and tools; that was the most obvious way for us to participate in the ecosystem
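As a rough illustration of the tool-equipping idea above: a minimal sketch using the OpenAI function-calling API, where `send_slack_message` and its schema are hypothetical stand-ins for one of Zapier's ~50k platform actions, not Zapier's actual implementation.

```python
# Minimal tool-calling sketch: describe an action as a JSON-schema "tool" and
# let the model decide when to call it. Action name and schema are made up.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "send_slack_message",  # hypothetical Zapier-style action
        "description": "Send a message to a Slack channel.",
        "parameters": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"},
            },
            "required": ["channel", "text"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell #support we shipped the fix."}],
    tools=tools,
)

# The model returns a structured tool call instead of free text; the caller is
# responsible for actually executing the underlying action.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```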
- Fast internal LLM adoption at Zapier
- Had a big "code red" moment, where we said language models will change the game for what automation means, and we need to figure this out for ourselves
- We had a one-week hackathon where everyone had a free week to go figure out what these models can do for each person's jobs
- Investigated what came of that and saw an "S-curve" graph: really fast adoption that then tailed off at about 1/3 of Zapier; that level is, I think, higher than at any other company I've seen
- Two observations: 1) wow that's fast adoption 2) why only 1/3
- What were the specific use cases driving the growth?
- First place we saw the adoption curve was in the marketing partnership organization -- with Zap Templates
- Zap Templates are outlines of Zaps; they describe the apps involved and what the triggers are, e.g. "when a new lead fills out a form in Google Forms, create a contact in HubSpot"
- That's a very functional description, but not very inspiring; you'd ask "why is that useful for me?" -- you want to know what it helps you with
- Those descriptions around use cases / value we were writing manually, interviewing users and turning those into short descriptions; they're very important for our SEO -- we have a few million on the website
- Problem is Zapier launches 2 integrations per day -- you want a Zap Template between each newly added app and ALL the others in the library, so the number of templates you want grows combinatorially with the size of the library
- So we started testing this out; implemented it using Zapier's (not as well known) ability to integrate with internal APIs/databases to watch for new apps getting launched; a model now generates content ideas with an associated JSON structure, and humans select which is best and publish
- We went from humans building 10/day to 1000/day -- you just review and no longer need to build them (sketched below)
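A compressed sketch of that pipeline, assuming the human-review step described above; the prompt wording, field names, and model are illustrative, not Zapier's actual implementation.

```python
# Sketch: when a new app launches, ask a model for Zap Template ideas as JSON,
# keep only well-formed candidates, and queue them for a human to pick from.
import json
from openai import OpenAI

client = OpenAI()
REQUIRED_KEYS = {"title", "description", "trigger_app", "action_app"}

def propose_templates(new_app: str, existing_app: str) -> list[dict]:
    prompt = (
        f"Propose 3 Zap Template ideas connecting {new_app} and {existing_app}. "
        'Reply with only a JSON array of objects with keys "title", '
        '"description", "trigger_app", and "action_app". Each description '
        "should explain the user benefit, not just the mechanics."
    )
    raw = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output is simply dropped
    # The structural check is mechanical; judging the writing stays human.
    return [c for c in candidates if REQUIRED_KEYS <= set(c)]

review_queue = propose_templates("Google Forms", "HubSpot")
```

The key design point is that the model only proposes; a human still selects and publishes, which is what contains the hallucination risk.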
- If AI/bots are the ones trying to do tool discovery, will that change the role of SEO?
- We've thought about this a lot because we get a lot of our user acquisition through search
- It's been interesting to see that companies/websites that are heavily weighted in search engines typically get better representation in the language models, because the models learn from scrapes of the internet
- We did set up a system to measure how LLMs recommend products over time, e.g. ask "what CRM should I use?" and track trends in the response; not sure yet what we will do with that, but it's an interesting idea
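A minimal sketch of such a measurement system, with an illustrative question list and brand list: query the model on a schedule and log which products it mentions.

```python
# Sketch: snapshot which brands a model mentions for fixed buying questions,
# so recommendation trends can be tracked over time. Lists are examples only.
import datetime
from openai import OpenAI

client = OpenAI()
QUESTIONS = ["What CRM should I use?", "What form builder should I use?"]
BRANDS = ["HubSpot", "Salesforce", "Typeform", "Google Forms"]

def snapshot() -> list[dict]:
    rows = []
    for question in QUESTIONS:
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
        rows.append({
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "question": question,
            "mentions": [b for b in BRANDS if b.lower() in answer.lower()],
        })
    return rows  # in practice, append these rows to a store and chart trends
```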
- Is it also important for vendors to optimize their apps to ensure Zapier picks them to accomplish a given task?
- I do think this is going to happen; it feels like SaaS is going to bifurcate into "do you own the experience" or "are you a capability provider"
- Today it seems there are many companies in the middle ground, and it feels like language models are going to put a lot of pressure onto this group
- Use case of handling customer interactions; what have you seen there?
- We see many use cases, but this area is where we see the "fat head" of the distribution
- Gong call transcripts, Zendesk tickets, Typeform product surveys, etc.; a lot of unstructured data to do something with
- Language models are really good at extracting information out of streams of data
- We have one use case in support where we're doing bug triage; you can start to get visibility into "what are the top five issues affecting customers this week"
- Then we have another language model doing categorization to actually route issues to product teams; that team then gets its own slice of data
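A rough sketch of that triage-and-route flow: classify each ticket into one of a fixed set of categories, tally the top issues, and map categories to owning teams. The categories, routing table, and prompt are made-up examples, not Zapier's actual taxonomy.

```python
# Sketch: LLM-based bug triage. Classify tickets into fixed categories so the
# top issues can be counted and routed to product teams.
from collections import Counter
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["auth", "billing", "editor", "zap-runs", "other"]
ROUTES = {"auth": "#team-identity", "billing": "#team-billing",
          "editor": "#team-editor", "zap-runs": "#team-runtime"}

def categorize(ticket_text: str) -> str:
    prompt = (
        "Classify this support ticket into exactly one of: "
        f"{', '.join(CATEGORIES)}. Reply with the category only.\n\n{ticket_text}"
    )
    label = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"  # guard against drift

def top_issues(tickets: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Answer 'what are the top N issues affecting customers this week?'"""
    return Counter(categorize(t) for t in tickets).most_common(n)
```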
- Summarization of customer info... will this be owned by Intercom/Zendesk, or should it exist as an independent capability?
- I would be shocked if these platforms don't add in these features
- What is Zapier's reason for existence? Users want a lot of choice
- Software makers need to make a lowest-common-denominator decision about where they spend engineering time, and they often have thousands of users/buyers
- That experience is always going to have rough edges around it; it's never going to integrate with every tool a given user cares about
- How do you think about the vision for Zapier in this new world?
- The core way by which the product provides value is not something we see changing
- The question is: we've taken Zapier to 10 million users over the last 10 years with the current paradigm, where Zapier wraps UX around writing code under the hood, but I think that paradigm is hitting diminishing returns
- Large language models offer an escape hatch / an answer to "how do we get to 100 million users"
- When LLMs came along 12 months ago, I thought: we are either going to accelerate and hit escape velocity or we're going to decay into irrelevance, and we're trying very hard for it to be the former
- Between the problems of "discovery of use cases" and "inability to use Zapier because it's overly complex", where do you see LLMs adding most value?
- Only 20% of our users today would self-identify as an engineer or developer, but you have to come in with a mindset of "how do I plug these systems together"; that is where Zapier is too hard today
- Breaking down the funnel, the majority of customers who come to Zapier have an understanding of a problem they want to solve; the question is how do I convert this business problem into software
- It's hard to get people to think that way, but it's something language models are really good at
- The ChatGPT plugin we launched actually does this... it allows you to take any action with Zapier's platform (send a Slack message, create a contact, etc.), but it also helps you build a workflow for a given problem
- That all being said, to get to 100 million users, we also need to help people discover ways to use automation, and I think language models can be helpful there as well
- Do you think you need to capture users earlier in the funnel?
- Higher-intent users convert better; users coming to a search landing page are 2x likelier to convert
- Homepage/brand traffic converts worse; we need to do a better job of taking advantage of that curiosity and intent
- What problems have you encountered building these LLM-powered features, solved or unsolved?
- The place you want to go look for applications of the tech is the list of unsolved problems you've had from the last 10 years, because they were too expensive or too complex
- An example for us is error messages; we get a ton of random error messages, which are often very esoterically written and hard for users to interpret
- One of the biggest support challenges is explaining and triaging error messages
- We tried that before and it didn't work, but now we're coming back to it, with the added question of whether you can make recommendations on how to solve the error (e.g. the common problem of missing fields)
- Once you have a problem, start with the most capable model; don't worry about anything else; if the most capable model can't solve your problem, no amount of fine-tuning or novel architecture is likely to solve it either, at least not easily
- If you can narrowly solve the problem, then you can start understanding latency, cost, etc.
- Then you can apply general engineering principles e.g. evals, fine tuning, and we've started to do that
- The hallucination problem is the one we've structured around least well from an engineering perspective
- The way we solve it internally: if you want constrained generation, you need a human in the loop; for us that was Zap Templates
- The structured-generation part of the problem is easier to solve because you can check whether the generated JSON is good or not (see the validation sketch after this list)
- Otherwise you need a human in the loop to judge the creative components
- We're now starting to deploy fine-tunes and open source models, and you'll start to see a lot of other companies adopting fine tuning in order to drive down latency and cost per token
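On the structured-generation point above: a minimal sketch of the mechanical half of the check, using the `jsonschema` library to validate model output before a human judges the creative half. The schema is an illustrative example.

```python
# Sketch: validate the structural half mechanically; anything that passes
# still goes to a human for the creative judgment call.
import json
from jsonschema import ValidationError, validate

TEMPLATE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["title", "description"],
}

def structurally_valid(raw_output: str) -> dict | None:
    """Return parsed output if it is valid JSON matching the schema, else None."""
    try:
        parsed = json.loads(raw_output)
        validate(parsed, TEMPLATE_SCHEMA)
        return parsed
    except (json.JSONDecodeError, ValidationError):
        return None

candidate = structurally_valid('{"title": "New lead alert", "description": "..."}')
if candidate is not None:
    pass  # enqueue for human review; only a person can judge the writing
```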
- Have you looked at applying the Code Interpreter style approach of iteratively trying to auto-fix problems?
- We do have a few spots where we are trying this; the places where you generally want to use that technique, though, are offline
- Users won't tolerate 60 seconds to get an output, but you can use those techniques offline... we have a pretty complex cascade of language model steps in e.g. the Zap Templates use case
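A minimal sketch of that iterative auto-fix loop, assuming JSON validity as the check; the model name and retry budget are arbitrary choices.

```python
# Sketch: generate, validate, and on failure feed the error back to the model
# and retry -- best suited to offline pipelines given the added latency.
import json
from openai import OpenAI

client = OpenAI()

def generate_with_repair(prompt: str, max_attempts: int = 3) -> dict | None:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        raw = client.chat.completions.create(
            model="gpt-4o", messages=messages
        ).choices[0].message.content
        try:
            return json.loads(raw)  # success: structurally valid output
        except json.JSONDecodeError as err:
            # Feed the failure back so the model can correct itself.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user",
                             "content": f"That was not valid JSON ({err}). Try again."})
    return None  # give up after max_attempts; the caller decides the fallback
```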
- Any tools you've found valuable to handle fine tuning, quality management, etc.?
- Everyone's first language model product is "shipped on vibes"
- You then throw it out into the world, and immediately you get user feedback that "they tried to do X and it didn't give the right response"; an engineer then goes back and tries to adjust the prompt, or writes deterministic code to catch the scenario and do something better
- Then you realize: if I make this change, how do I know all the other inputs people have typed are still going to work?
- Then you immediately realize you need an eval / test suite, and you can't build that the traditional way with natural language because there are a lot of right answers
- The issue is the stuff in the middle, not what's obviously good or obviously bad; you need a system to grade that, and then backtest it
- To backtest, you need feedback data in the first place; usually I first recommend "thumbs up / thumbs down"
- Then your engineers can understand, when you deploy a change, whether the % of right answers goes up, goes down, or holds the same -- there are a bunch of cool companies now working on this (a minimal backtest harness is sketched after this list)
- This extends into fine-tuning as well; you're not reaching for more accuracy with fine tuning; you're not going to take it from 50% to 80% right
- What fine-tuning will do though is increase the determinism of that 50%, and ultimately allow you to achieve better latency/performance/cost
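A skeleton of the backtesting loop described above: replay previously thumbed-up examples against a new prompt version and compare pass rates before deploying. `run_prompt` and `grade` are placeholders for your own generation call and grader.

```python
# Sketch: backtest a new prompt version against feedback data collected via
# thumbs up/down, and report what % of right answers it produces.
from dataclasses import dataclass

@dataclass
class Example:
    user_input: str
    reference_output: str  # a response a user previously thumbed up

def run_prompt(prompt_version: str, user_input: str) -> str:
    raise NotImplementedError  # call your model with the candidate prompt

def grade(output: str, reference: str) -> bool:
    raise NotImplementedError  # e.g. embedding similarity above a threshold

def backtest(prompt_version: str, examples: list[Example]) -> float:
    passed = sum(
        grade(run_prompt(prompt_version, ex.user_input), ex.reference_output)
        for ex in examples
    )
    return passed / len(examples)  # compare against the current version's score
```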
- How do you actually evaluate the outputs / measure similarity?
- That's the biggest problem; you need a similarity score with acceptable ranges
- Most of our stuff, and where we started, is embeddings-based similarity (sketched after this list); or you can use k-means clustering to say "we accept anything sufficiently close to these responses"
- This evals problem gets harder the longer the output; you want to break your problem into small components
- It's hard to get a sense of accuracy if you only get a thumbs up/down on a long output; it's ideal to break feedback down into smaller parts
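A minimal sketch of the embeddings-based similarity grading, using OpenAI embeddings and cosine similarity; the model name and the 0.85 threshold are assumptions you would tune against your own feedback data.

```python
# Sketch: grade an output by whether it lands close enough (in embedding
# space) to any previously accepted reference response.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding
    return np.asarray(vec)

def close_enough(output: str, references: list[str], threshold: float = 0.85) -> bool:
    out = embed(output)
    out /= np.linalg.norm(out)
    for ref in references:
        r = embed(ref)
        r /= np.linalg.norm(r)
        if float(out @ r) >= threshold:  # cosine similarity of unit vectors
            return True
    return False
```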
- Have you looked into generating a grammar for every Zapier action?
- Saw the paper on this just a few weeks ago; we need to go try this
- It's still a latency-reducing thing, not a capability-adding thing
- Maybe the one spot where that's not true is small/local models; 7B and 15B models are not quite as good out of the box at producing structured, correct JSON 90% of the time
- Do AI agents necessitate lower latency?
- Yes; we realized reliability perception is more influenced by Zapier's front-end website than by the back-end runner
- We have a much higher standard on data loss and zap execution for the back-end runner, but that's not how users think about it; user psychology is important
- One of the first things we realized is how important it is to get time-to-first-visual-feedback down; you need to stream output back to the user, which also buys the system some time to think (see the streaming sketch below)
- Ultimately speed matters a lot; I keep going back to Code Interpreter because it's faster (vs the other startups doing this)
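A minimal sketch of that streaming technique, using the OpenAI streaming API; the model and prompt are placeholders.

```python
# Sketch: print tokens as they arrive so the user gets visual feedback
# immediately, instead of waiting for the full response.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain this Zap error to me."}],
    stream=True,  # yields incremental chunks instead of one final message
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```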