With Michelangelo D'Agostino, VP of Machine Learning at Tegus
Key Takeaways
Few companies are better positioned to benefit from LLMs than Tegus; the company has 60k transcripts of calls between investors and experts, each many pages long and rich with interesting information. Michelangelo outlines several interesting product ideas, many of which are replicable in other scenarios.
Topics Covered
- Background on Tegus
- Tegus's stated goal is to build the leading research platform for fundamental financial investors, starting by innovating on the expert research model
- Expert research consists of a network connecting you to a series of experts on a given company (e.g. often customers, former employees, or partners)
- The Tegus innovation was to turn this into a SaaS business, reducing the per-call fee in return for being able to publish these calls to the Tegus library (after a period of time) for other subscribers to reference
- Tegus really changes the usage patterns of these calls; you're able to use them in many other areas of your investment workflow because you're not being charged per-call 
- You can start to use Tegus as part of your initial research, not only when you're in the context of a deal
- Also after you've made an investment, you can now start to monitor the calls others are doing
- You can see that ongoing use case being valuable in many other areas e.g. tracking early companies, or seeing how sentiment changes over time on a company you're curious about
 
 
- The lifecycle of a Tegus transcript and where AI plays a role today 
- All calls are conducted on Twilio, and then go through both an instant transcription (which you get immediately as the person conducting the call) and a human-reviewed transcription (including a strict compliance review); only the latter gets posted to the platform
- For calls going onto the platform, we use an AWS API for entity recognition and linkage, which we then use to link back to a company record in the database
- We also do text-to-speech if you prefer to listen to the call instead of reading the whole transcript
- Custom-tuning entity recognition models is something we've thought about, but the AWS API works well for now
 
- Today the platform has 60k transcripts, and we're adding 2k per month, so it's a lot of data
 
- What are you most focused on in making better use of this large library? 
- As additional important context, Tegus made two acquisitions that are now part of the platform as well: BamSEC (a database of SEC filings) and Canalyst (a library of pre-built public company financial models)
- Canalyst has 4k+ company models with proprietary data behind them
- There are two broad buckets 
- The first bucket is focused on the transcripts themselves, which are very long, often 20-30 pages and you might have 100 of them on a company
- The second bucket is connections between different data sources; how do we build new experiences by linking together insights across transcripts and e.g. BamSEC's SEC filings?
 
- The simplest first question: "what kinds of things can we extract from a single transcript that would be useful?"; those became auto-generated summaries and tagging transcripts with topics  
- This topic modeling combined with summaries becomes useful when you want to find transcripts talking about a specific product (e.g. within Amazon, you may just want AWS) 
- We also want to build the idea of cross-company topic exploration; right now the experience is very company-specific
 
 
- Users loved the summaries product we built
- The problem with summaries is there's not one universal definition of a "good" summary; some people want them short, some want them long, some want bullets, some want a bull vs bear case 
- So you then think about, how do you let people customize?
 
- The other area we're actively working on is semantic search 
- You can issue a query like "show me transcripts related to vendor selection processes that involve Datadog"
 
- Once the data is embedded with an LLM, you can then use it for other use cases beyond search (e.g. comparing qualitative commentary from a set of relevant transcripts) 
- E.g. you can see that people are often talking about Datadog negatively as it relates to cost, and this other vendor most positively
- You can then imagine some really interesting ideas of fully automating market-level analysis of any given company
 
- Then there's a big incentive for people to upload transcripts outside of the platform as well, which really expands the value/role of a platform like Tegus
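
A minimal sketch of the embed-then-search idea above, not Tegus's implementation. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the passage texts and ids are invented for illustration; only the overall shape (embed the query, embed each passage, rank by cosine similarity) is the point.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, passages: dict[str, str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k passage ids ranked by similarity to the query."""
    q = embed(query)
    scored = [(pid, cosine(q, embed(text))) for pid, text in passages.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


# Invented example passages (hypothetical, not real transcript text).
passages = {
    "t1": "We ran a vendor selection process and compared Datadog with two other monitoring tools",
    "t2": "Churn rose last quarter after the pricing change",
    "t3": "Datadog cost was the main complaint from our engineering team",
}
results = search("vendor selection process involving Datadog", passages)
```

With a real embedding model in place of `embed`, the same stored vectors could be reused for the other use cases mentioned above, such as grouping commentary about a vendor by theme.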
 
- Internal use cases
- Product features though are just one opportunity for ML at Tegus; there's also a lot of value in internally-facing applications
- One is our operations team, the people that source these experts
- Our compliance team is another
- On the operations side, the big opportunity is around search; how do we enable these operations folks to find the right expert across our internal systems?
- Another one is how we auto-generate biographies of experts from all the data we have on them; there's a lot of manual work involved there today
 
- Asking questions of transcripts; more value than people think
- As a final point on the transcripts bucket, chat as a UI actually has a really interesting use here; being able to ask a question of a transcript and get back an unbiased answer, which you can then refine with follow-up questions, is very valuable
- Often when you do these calls, your hypothesis may bias you away from the "correct" takeaway, often far more than you think
 
- This UI is also a great way to just be more efficient; Tegus customers have very valuable time and already consume a huge amount of information; delivering more learnings with less word-by-word reading is key 
- It also allows you to refine your question; sometimes your intent and what you asked for are misaligned; it's helpful to have that allowance for iteration
- And you can do this quickly; you don't have to send the intern away for another day of work
 
 
- Long-term vision: creating connections between data sources 
- One avenue is a semantic search / chat experience that can answer with information across multiple data sources (e.g. what experts have said, what the CEO has said on an earnings call, etc.)
- If you ask a quantitative question, maybe we want to pull from the Canalyst API to get an authoritative data point 
- That's the holy grail
 
- What makes it challenging is you're mixing different kinds of content, so how do you deal with relative levels of "correctness" or "authoritativeness" (e.g. CEO comment vs a customer opinion) 
- The structured Canalyst data is a whole different beast; calling structured APIs with LLMs is still an emerging field
 
- The other interesting application in creating connections is an inline navigation experience, where you can link across to related commentary in other transcripts or SEC filings 
- E.g. you're looking at Q3 sales data, and you can link to where a customer is talking about problems related to churn; that would be super powerful
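
One way to picture the "authoritativeness" problem above is as a re-ranking step over retrieved passages. This is a sketch under stated assumptions only: the source types, the weights in `SOURCE_WEIGHT`, and the rule of applying them only to quantitative questions are all hypothetical choices made for illustration, not something described in the conversation.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical trust weights per source type for factual, numeric claims.
SOURCE_WEIGHT = {
    "financial_model": 1.0,
    "sec_filing": 0.9,
    "earnings_call": 0.7,
    "expert_transcript": 0.5,
}


@dataclass
class Hit:
    source: str       # one of the keys in SOURCE_WEIGHT
    text: str
    relevance: float  # similarity score from retrieval, in [0, 1]


def rank(hits: list[Hit], quantitative: bool) -> list[Hit]:
    """Order hits by relevance, scaled by source weight for quantitative questions."""
    def score(h: Hit) -> float:
        return h.relevance * SOURCE_WEIGHT[h.source] if quantitative else h.relevance
    return sorted(hits, key=score, reverse=True)


# Invented hits for illustration.
hits = [
    Hit("expert_transcript", "A customer recalls sales growing quickly", 0.9),
    Hit("financial_model", "Reported Q3 sales figure from the model", 0.8),
]
quant_order = [h.source for h in rank(hits, quantitative=True)]
qual_order = [h.source for h in rank(hits, quantitative=False)]
```

For a quantitative question the structured source wins despite lower raw relevance (0.8 × 1.0 beats 0.9 × 0.5); for a qualitative one the expert commentary stays on top.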
 
 
- You really just don't want to give someone a blank box 
- In the box, give someone great ideas about where to start
- Then once you have a sense for where the user is going, give them some next logical questions
- We are most focused right now on the beginning of the process 
- E.g. if we're looking across our corpus of transcripts on Snowflake, what are the most common or important questions investors have been asking?
- That should give you an idea for the 5 most important diligence questions you should wrap your mind around as you start to think about Snowflake
 
 
- Build vs buy? 
- On the foundation model, OpenAI's APIs work well and are easy to get started with, but long term that will change with e.g. open source, other LLM vendors catching up, etc.
- Separately, no technologist likes being locked in to a closed-source system; it would be irresponsible for me to only think about building on OpenAI
- The growing number of LLMs may lead to a middle layer that sits between you and the LLM vendor and lets you easily swap providers in and out
 
- Then there's the open source ecosystem, which is a big deal for us 
- There's a lot that's attractive about deploying something yourself
- There are a couple reasons you'd want to do that: throughput (e.g. for us we need to process a lot of documents all at once), and consistency in outputs, now that people have seen OpenAI's outputs drift over time
 
- That gets to another interesting set of emerging tool vendors: experiment management, where you can test a set of prompts across various providers
- There's also observability and monitoring; how do you log against a model in production, how do you store inputs and outputs
- Vector DBs and other components of infrastructure have raised a lot of money, but it's unclear whether that market remains an independent product category
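
The "middle layer" and experiment-management ideas above can be sketched as a small provider interface plus a loop that runs each prompt against each provider and logs inputs and outputs. Everything here is hypothetical: `LLMProvider`, `FakeProvider`, and `run_experiment` are illustrative names, and the fake providers simply echo the prompt where a real one would call a vendor SDK or a self-hosted model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class LLMProvider(Protocol):
    """Minimal interface the application codes against, whatever the vendor."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeProvider:
    """Stand-in provider that echoes the prompt with a style tag."""
    name: str
    style: str

    def complete(self, prompt: str) -> str:
        return f"[{self.style}] {prompt}"


def run_experiment(prompts: list[str], providers: list[LLMProvider]) -> list[dict]:
    """Run every prompt against every provider, logging inputs and outputs."""
    log = []
    for provider in providers:
        for prompt in prompts:
            log.append({
                "provider": provider.name,
                "prompt": prompt,
                "output": provider.complete(prompt),
            })
    return log


providers = [FakeProvider("hosted-api", "terse"), FakeProvider("self-hosted", "verbose")]
log = run_experiment(["Summarize this transcript in 3 bullets"], providers)
```

Because the application depends only on `complete`, swapping a hosted API for a self-hosted open source model becomes a change to the provider list, and the stored log doubles as a simple record for comparing outputs over time.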
 
- What is the top-level unifying goal for you and your team? 
- Our founders talk about wanting to build an "over my dead body product"; to get there, we need to be making all our data across all our data sources as valuable to our users as possible