
Data & AI Trends in 2024, with Tom Tunguz, General Partner at Theory Ventures

Richie and Tom explore trends in generative AI, the impact of AI on professional fields, cloud+local hybrid workflows, data security, the future of business intelligence and data analytics, the challenges and opportunities surrounding AI in the corporate sector and much more.
May 2024

Guest
Tom Tunguz

Tomasz Tunguz is a General Partner at Theory Ventures, a $235m early-stage venture capital firm. He blogs at tomtunguz.com and co-authored Winning with Data. He has worked or works with Looker, Kustomer, Monte Carlo, Dremio, Omni, Hex, Spot, Arbitrum, Sui and many others.

He was previously the product manager for Google's social media monetization team, including the Google-MySpace partnership, and managed the launches of AdSense into six new markets in Europe and Asia. Before Google, Tunguz developed systems for the Department of Homeland Security at Appian Corporation.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

I think the productivity expectations of every white collar worker will now go through the roof. I think everyone will be expected to be 50 to, I don't know, 250% more productive than they have been in the past because they have these tools at their disposal. And so being familiar with all these different systems and understanding when to use them will be absolutely essential.

I think today it's all about experimentation and really understanding where AI projects work and where they don't. The ecosystem is changing really fast. I think the leaders of the next wave will be conversant in AI; they will know how to speak to their leaders and educate them on where the company should be spending time, where the company should be investing in terms of software. Because the reality is, every board and every C-suite is saying we need AI, we need AI, we're starting to see the productivity improvements from our competitors.

They're looking to the leaders within each individual team and department to educate themselves and then ultimately develop a strategy on how to leverage this technology internally. And so being an expert in that domain, I think will lead to lots of promotions. Because if you can cut two thirds of a workforce or if you can increase revenue by a third, by making sales team that just that much more effective through automation, there's a lot of value to be created.

Key Takeaways

1

Data quality issues can derail AI projects, so prioritize data governance to improve the reliability and accuracy of AI outputs while implementing strict data security measures to prevent breaches and misuse.

2

A hybrid approach combining generative AI and traditional deterministic methods is essential to reduce errors and enhance application robustness, especially in complex data processing tasks.

3

As AI systems enter production, data teams need to adopt engineering practices for code quality and collaboration to ensure seamless integration into the software development lifecycle.

Transcript

Richie Cotton: Welcome to DataFramed. This is Richie. Since the data and AI ecosystem of tools is changing rapidly, it can be tricky to keep up with what is available and what the most important ideas are. Fortunately, it's the job of venture capitalists to stay on top of such things, so I've invited one to share his thoughts.

I want to know what he considers to be the most important data and AI trends and how you can go about taking advantage of them. That guest is Tom Tunguz, a general partner at Theory Ventures. Tom was a longtime VC at Redpoint Ventures, rising to become a managing director before leaving to found his own fund, focusing on early stage data and machine learning companies.

He's also a prolific writer at tomtunguz.com, and since he's not shy about prognosticating on the future of the data space, I'm very excited to hear his opinions.

Hi, Tom. Thank you for joining me.

Tom Tunguz: Pleasure to be here, Richie. Thanks for inviting me.

Richie Cotton: Excellent. So we're going to talk a bit about trends. We'll start with the big one. So generative AI has obviously been the big story of the last few years and it seems to be changing pretty fast. So, what trends are you seeing at the moment?

Tom Tunguz: I think you're right. I mean, it feels like crypto did maybe two or three years ago, where every morning you'd wake up and there'd be a new paper and the world would have changed. And I definitely feel that way. I think we're all waiting with bated breath for the next generation of models, the ones that Facebook, Meta has promised, and GPT-5, which seem like they'll be able to work on much longer duration tasks. So until then we don't know what that looks like. But in the meantime, I think there's definitely been a shift, I would say from last year to this year, from preferring copilots, which complete your sentences, to agents, which will fully automate work, or at least some fraction of the rote work.

Richie Cotton: Okay, no, that's very interesting because yeah, there's been a big refrain over the last year or two like, AI's not going to take your job, it's going to augment it, but agents are promised to just, like, automate things and completely take away that human from the loop. So, yeah, tell me more about that.

Tom Tunguz: Yeah, for sure. So, I mean, I think what we've seen, some of the productivity stats on copilots are about 50 to 75%. That's Microsoft and ServiceNow, or the other way around; Microsoft is 75%. If the LLM agents are anything like mechanical robots, the promise is to increase productivity by about 250%. So one robot would take the place of two to two and a half humans, which is what's happened on automotive manufacturing lines.

We're still really early there. I would say we're starting to see the very first applications of this in security automation or customer support, where Klarna cut two thirds of their customer support staff. And we can debate the relative effectiveness of the bots, but they did it, and they saved a lot of money doing it.

And so I think what we're seeing across the industry, people are really excited about this trend. And yeah, the nature of entry level tasks will change. There was a really interesting Twitter thread about this, where they were talking about how robots have taken off in the last 10 years.

There's robotic surgery, particularly in areas like urology, or areas in the body where there's just a very difficult space to navigate. And as a result, robots can move in ways that humans can't, because their fingers, so to speak, are much narrower. There's now this training gap where the senior surgeons are no longer teaching the junior surgeons side by side.

The junior surgeons are now just watching the robot operate. And so as a result, there'll be this generation of surgeons, or security analysts, or customer support reps, who are experts in the domain, and then the younger ones don't receive the benefit of the training. And so that's an interesting dynamic that we're paying attention to.

But in terms of the technology itself, we're looking for tasks where there's maybe some initial customization. Some of these agents are actually starting to write their own queries to hit external databases to enrich the data, and then produce a summary output, and then that goes to a classifier that says, is this worth a human's attention or not?

Richie Cotton: So, on the other side of AI, so what about non generative AI? Are you seeing any changes there?

Tom Tunguz: I think it's not moving as fast. That's not to say that it's not important. We're starting to see combinations of generative AI and what we'll call classical methods. One reason for this is that generative AI mechanisms tend to be chaotic and non-deterministic. If I ask, "How do I reset my password" followed by a period, and then I ask, "How do I reset my password" followed by a space and a period, those are different questions that will elicit different answers from a generative AI model.

And if you start to chain these inputs and outputs across generative AI steps, you might have error that explodes. And so, and this is very early on, but we're starting to see companies take the output of a generative AI step and then apply a classical machine learning method to it to classify it, just to constrain the error, and then pass it on to the next step, and the next step.
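The constrain-the-error idea Tom describes can be sketched in a few lines: a deterministic step snaps a generative step's free-form output onto a fixed label set before it flows downstream, so phrasing differences can't compound. The intent labels and keyword lists here are invented for illustration, not from the episode.

```python
import re

# Illustrative intent labels and keyword lists -- purely hypothetical.
INTENT_KEYWORDS = {
    "password_reset": {"password", "reset", "login", "locked"},
    "billing": {"invoice", "charge", "refund", "billing"},
}

def constrain(llm_output: str) -> str:
    """Deterministically snap free-form LLM text onto one known intent."""
    tokens = set(re.findall(r"[a-z]+", llm_output.lower()))
    best, best_score = "other", 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best, best_score = intent, score
    return best
```

Two differently phrased generative outputs about the same request collapse to the same label, so the next step in the chain sees a stable, bounded input.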

We've met a company recently where there are three or four different kinds of machine learning being used all in one system. Generative AI is really good at generating ideas: give me four to five great titles for this blog post that I wrote just now.

And then the deterministic methods, the classical methods, are the ones that are really good at picking. And so I think you'll see this nice marriage between the two, where most generative AI applications will not be entirely gen AI with just thin wrappers. And as we get to the next step, you'll see what we call constellations of models.

Just to talk about it in one level more detail: input from the user comes in, and there's a classifier, which could be a classical classifier or a gen AI classifier, that says, what kind of query is this? Have I seen this before? And then, given that output, it passes it either to a small language gen AI model that knows what it's talking about, or, if it's a completely new query, it goes to something like GPT-4 that can handle the whole universe, or it goes to a classical method.
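That "constellation of models" routing step can be sketched as follows. The known-query set, the routing rule, and the model names are all illustrative assumptions, not a real product architecture.

```python
# Hypothetical set of query types the system has a known playbook for.
KNOWN_QUERY_TYPES = {"reset password", "cancel order", "update billing"}

def classify(query: str) -> str:
    """First-stage classifier: what kind of query is this? Seen before?"""
    return "known" if query.lower().strip() in KNOWN_QUERY_TYPES else "novel"

def route(query: str) -> str:
    """Known queries go to a cheap small language model; novel ones go to
    a frontier model that can handle the whole universe."""
    return "small-domain-model" if classify(query) == "known" else "frontier-model"
```

In a real system the first stage would itself be a trained classifier or a gen AI model, but the shape is the same: cheap triage first, expensive generalist only when needed.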

Richie Cotton: Okay. So that hybrid approach is really interesting. So it's maybe like using the generative AI stuff as the human interface, because humans like chatting, and then you've got something more deterministic underneath, so you've not got the error propagation problems. Okay, I see. One trend I've seen just from going to a lot of data conferences:

Last year, everyone was like, hey, you must build stuff with generative AI, it's gonna be amazing. And then just in the last few months, it's changed to: all those generative AI prototypes you built last year failed because your data quality sucks, and now you need to think about data governance and how you improve your data quality.

So are there any things interesting you're seeing in that space?

Tom Tunguz: Well, yeah, I mean, there's a lot around data security. I think you have two different kinds of problems with generative AI security. The first is what happened with the Canadian national airline, when a passenger asked about a bereavement policy and the generative AI created a bereavement policy that the airline was forced to adhere to, even though it had nothing to do with the real policy.

So there's this hallucination problem, but there are other challenges with it. To give you an example, a friend of mine started running a company. One of his employees installed an LLM on top of the cloud data warehouse and asked, what is this employee's social security number? And out popped the social security number.

So you have data security issues. We've identified five different kinds of security that will be needed around large language models, whether it's securing the developer's environment, so that they're building a large language model that's secure, or ensuring that there's no data loss, like the social security issue.

There's having the right access permissions to the database. Are you getting the right Snowflake connections? Do they have the right user account? There are data poisoning attempts. So if I'm downloading an open source model, am I downloading the right one? Has it been tampered with? That's sort of a software supply chain issue.

And there are others, but this whole space is new. And the challenge is, historically, the CISO, the Chief Information Security Officer, has been predominantly responsible for securing data. The average tenure there is about 14 months because of breaches, so it's a very challenging job, because the surface area continues to increase.

But now the data person, the head of data, is basically responsible, maybe even the head of engineering, for building the systems that use that data and generative AI. And so both of these roles have to unite in terms of securing and making a hermetic seal around the data.

And those workflows have not yet been created. Our sense is that for many of the largest companies in the world, who might have hundreds of thousands of LLM-enabled applications on their roadmaps, security is the single biggest blocking feature, because nobody wants to be fired for the leak. But there's no playbook yet.

There are no standard tools yet, and that's really slowing a lot of enterprise adoption.

Richie Cotton: Okay, so it seems like, making LLMs or making generative AI applications more secure could be a big growth area in the next few months or years then. Okay.

Tom Tunguz: Yeah,

Richie Cotton: I have to say, I liked your example of the Canadian airline. I believe the courts ruled that they had to uphold that bereavement policy that the AI told them about. Yeah, and I heard another example of this where someone was chatting with a car dealership, and it was an AI sales rep, and they persuaded the AI to sell a car for a dollar. And I have no idea whether that's been upheld or not, but that seems like a good wheeze if you can pull it off.

Okay, so, related to this, it seems like a lot of the cloud data warehouses, your Snowflakes and Databricks and all this, are rapidly adding AI features. And so how do you see data warehouses changing at the moment with this rise of AI?

Tom Tunguz: They're at the core, right? A lot of the data that's being fed into these large language models comes from the cloud data warehouses themselves. And so typically, in the pre-AI era, the map of the modern data stack was: you'd have the source and ETL platform, the cloud data warehouse, and then you would have three different consumers: BI, exploratory analytics, and then machine learning. But machine learning was always post hoc.

It was never in the path of production. It was always customer segmentation, trend prediction, revenue maximization exercises. Now what's happening is, the cloud data warehouse is basically in the path of production, so to speak, where extracts of that data are being fed into a machine learning pipeline. It's cutting up all the data, calculating the different vectors, potentially storing them in a vector database.

And then at inference time, the query, the documents, and the vectors are all going in; the vectors lead to the documents that go to the large language model. And so these cloud data warehouses suddenly have to rethink where they sit in the stack. Many of the historical ones have not been architected to serve as production databases.

So what's happening is that there are these new pipelines, right? There's Spark being pushed in to calculate the different vectors, particularly at large scale. And then you have this category of vector databases, where we see a lot of the calculations on similarity search. There's a strategic question about how much of the vector database should be a standalone database and how much of the vector database can actually exist within a cloud data warehouse.

The core functions there are relatively straightforward. It's clustering with k-means or another form of clustering, and then cosine similarity: how similar are two vectors, or three vectors, or four vectors. And so I think, you know, we'll start to see it. Some of the bigger database companies have announced vector database initiatives.

So that's an important competitive dynamic that we'll see play out, and that'll be really critical. I think the other dynamic within all of cloud data warehousing is the separation of compute and storage, but not in the way that Snowflake talks about it. I literally mean the separation of the query engine from where the data is stored.

And you saw that in Snowflake's most recent earnings, where a lot of the bigger customers are asking for the data to be stored in Iceberg tables in S3. And so it's very possible that that core centralized data may actually be hit by a different query engine in order to power a generative AI pipeline.
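The cosine similarity Tom names as a core vector database function fits in a few dependency-free lines, plus a toy nearest-neighbor lookup of the kind a similarity search performs at scale:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, candidates):
    """Index of the candidate vector most similar to the query vector."""
    return max(range(len(candidates)),
               key=lambda i: cosine_similarity(query, candidates[i]))
```

A real vector database wraps exactly this comparison in an approximate index so it scales to billions of vectors, but the math underneath is this simple.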

Richie Cotton: Okay, so you might be skirting around the data warehouse entirely. So you go straight from your cloud storage in S3 or whatever, and that's going straight to the LLM, or into your vector database, and you're just cutting out that middle part. Is that right?

Tom Tunguz: Yeah, it could be. I mean, it depends on the sizes of the data. And again, one of the challenges is formatting the data in the right way. A lot of the time, if anybody has used LangChain and you're trying to process documents, how you chunk the documents, how you cut them up, how you structure them in order to feed the model is really critical, and the sequencing is also really important.

And so you may need to pre-process using a data warehouse or some other tool, and then write to Iceberg tables with Aura files underneath, and then that will be consumed. But, at least our perception is, there's no real standard design pattern quite yet; everybody's building these flows in a very customized way today.

And two, three years from now, there will be, just like what we had in the modern data stack, right? Source system, Fivetran, Snowflake, Looker: there's your stack. We don't have that yet.

Richie Cotton: Okay, that's very interesting. It seemed like there are a lot of new companies in this area and they're all doing overlapping parts of the puzzle. So there are probably lots of different ways to get your complete flow at the moment, but do you think we're gonna end up with these companies becoming bigger and sort of overlapping each other to converge on something? Or is it gonna be that there'll be one winner somewhere that does everything?

Tom Tunguz: I don't think there'll be one winner. I mean, the modern data stack had many different approaches. And, you know, you could look at Databricks versus Snowflake. You could look at Fivetran versus Airbyte, Looker versus, I mean, any number of the BI products. So I think there will be many different approaches.

What's different within the world of AI is that, unlike business intelligence, where everybody wanted a dashboard, the output is different. A company might want to build a gen AI pipeline to summarize text. Another company might want to build a recommendation system that combines both

textual information about, like, a video, but also statistics about how long the user is watching those videos, in order to show the right next video. So that's a different kind of pipeline, where you're computing multimodal vectors. You may have a third company that's trying to build

production-grade video generation. That's a third, completely different kind of pipeline. You might have legal documents, or let's say you were looking to analyze the financials of a company. Let's take a look at an income statement. You might want a classical method there to understand exactly what's going on inside of the P&L, because you can't accept any error.

And then the way that a venture capitalist looks at a P&L is very different than the way that an auditor would look at a P&L. Anyway, my point is, there are so many different outputs that are necessary that you may actually see a much broader diversity of different data pipelines and data pipeline vendors to support all these different use cases.

Richie Cotton: Okay, so I guess once we get a modern AI stack, it's going to be a lot more rich and complex compared to the modern data stack that was talked about a few years ago.

Tom Tunguz: I think so.

Richie Cotton: So going back to the idea of data quality, I know one of the buzziest terms at the moment is the idea of data contracts.

Can you just talk me through, like, what are these and when would you need one?

Tom Tunguz: Sure. Yeah, data quality. So you can imagine data is like a manufacturing line, right? We talked about it: you have raw ingredients that then go through a processing facility, and then they're packaged up. And data quality, and Monte Carlo is the leader here, is really about understanding how effectively the data is coming through.

Are there changes in the upstream sources? Is the distribution of the data changing? Is the volume of the data changing? Is the shape of the data changing? And if there are, everybody should be alerted, because that's very likely a problem. And so that's what the data quality movement is about.

And so you have a company like Monte Carlo, which is using machine learning to understand exactly what's happening. And then you also have a test-based approach, where you assert different conditions: there should never be a zero in this particular column, or it should never be blank, for example.
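The test-based approach Tom mentions can be sketched as assertions over a batch of column values. The check names and conditions here are hypothetical stand-ins for what a tool like dbt tests or Great Expectations would run.

```python
def check_column(values):
    """Assert conditions on one column of a batch; return the failed checks."""
    failures = []
    if any(v is None for v in values):
        failures.append("column contains nulls")
    if any(v == 0 for v in values if v is not None):
        failures.append("column contains zeros")
    return failures
```

In a real pipeline, a non-empty failure list would page the team before the bad batch propagates downstream.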

Once a data pipeline is working and you have an observability layer, kind of like a Datadog for data, you're in a really good place. The next theme within the modern data stack world is this idea of a data contract. And it's part of this broader theme that parallels what's happened in software engineering.

So let's take a step back. The way that we used to build software is, we used to build on one very large code base. All the engineers would work on one really large code base at the same time. And there are some advantages to that, but what we found is, by cutting it up into small pieces and having small teams of engineers build all those different microservices, as we call them, it was much easier for everybody to collaborate.

The one requirement was that each team declared to the rest of the world: these are the kinds of inputs my system expects, these are the kinds of outputs my system will produce, and these are the guarantees I'll give you about how fast it will do that and how often we'll update our code.

And so this is exactly what's happening in the data world. 20 years ago, MicroStrategy and Business Objects and Cognos were all controlled by a centralized data team. And access to data was really limited, in order to make sure that it was highly controlled and the data was good and accurate.

And now what's happened is we've had a democratization of data, where the marketing team has its own analyst and might have its own cloud data warehouse. The same for the sales team, and finance might have the same thing. And so now it's been distributed, just in the way of microservices. And what data contracts promise is: let's encode the inputs, the outputs, and then the SLA, the expectations around performance.

Encode it in software, so that we can manage all of this effectively. Today, for many of the largest companies, this doesn't exist. And so it's about having a software platform, and the change management associated with encoding that, so that if the marketing team is consuming product data, the product team doesn't just change the format of the data without alerting the marketing team and breaking a whole bunch of downstream systems.

That's right.
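The inputs/outputs/SLA contract Tom describes could be encoded as something like the sketch below. The field names and SLA numbers are invented purely to show the shape; real implementations use schema registries or tools built for the purpose.

```python
# A hypothetical contract the producing team publishes for consumers.
CONTRACT = {
    "fields": {"user_id": int, "signup_date": str, "plan": str},
    "sla": {"freshness_hours": 24, "schema_change_notice_days": 30},
}

def conforms(record, contract=CONTRACT):
    """True if a record matches the declared field names and types."""
    fields = contract["fields"]
    return set(record) == set(fields) and all(
        isinstance(record[name], typ) for name, typ in fields.items()
    )
```

A consumer (say, the marketing team's pipeline) validates each record against the contract, so an unannounced schema change by the producing team fails loudly at the boundary instead of silently breaking downstream systems.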

Richie Cotton: Okay, so this sounds like, a really effective way for different teams who are working on either similar bits of data or one bit of data where it's downstream from another team, they can work together more effectively because they're essentially guaranteed like, what they're getting from the other team.

Tom Tunguz: It's this idea of data mesh, which is data distributed all throughout the organization: people are each consuming and producing data for each other, as opposed to from a centralized node.

Richie Cotton: So, I'd also like to talk a bit about Cloud computing. So it seems like moving everything from working locally to moving to working in the cloud has been a big trend over the last, well, more than a decade. Is this something you see continuing or is the pendulum going to swing the other way?

Tom Tunguz: So it's absolutely true. Everyone was moving from on-prem to the cloud. Now what we're starting to see is hybrid execution. And there are two different parts to hybrid execution. The first is that customers want to hold on to their own data. So we talked a little bit about Snowflake and Iceberg before.

Many of the largest companies, privacy-centric companies, want their own data stored in their S3 buckets, or whatever buckets they have. And what they want to do is bring the software and the compute to that data, as opposed to sending that data to Salesforce or sending it to Marketo, having the output stay there, and then somehow having to pay a third-party vendor to get it back. Instead: bring all the data here, we'll bring your software in, we'll compute, and then the software can leave if ever we decide to change it.

That is becoming much, much more important than it has been in the past. The other part of hybrid execution, and this is why the term is a little bit overloaded, is that there's a new wave of building applications where some of the processing is done in the cloud, maybe through that architecture, and some of the processing is done in the browser.

And so if we look at technologies like DuckDB, you can run a DuckDB instance inside of a Wasm container, a WebAssembly container. And so let's say you have this really huge data set, like a 100 gigabyte data set: pre-process some of it in the cloud, take 2 gigabytes, put it locally, and then all of the analysis and visualization on that 2 gigabyte data set is actually done on the user's computer.

And so as a result, it's much faster, and it's much more capital efficient. If you're a vendor with this kind of architecture, you're able to operate with significantly better margins, because you don't pay for as much compute as you used to in the past.
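The hybrid pattern in miniature: the "cloud" side reduces a large raw table to a small extract, and the "local" side runs the interactive query against just that extract. Here sqlite3 stands in for DuckDB-in-Wasm purely so the sketch is dependency-free; the table and column names are invented.

```python
import sqlite3

# "Cloud" side: a big raw table, reduced to a small pre-aggregated extract.
cloud = sqlite3.connect(":memory:")
cloud.execute("CREATE TABLE events (region TEXT, amount REAL)")
cloud.executemany("INSERT INTO events VALUES (?, ?)",
                  [("emea", 10.0), ("emea", 5.0), ("apac", 7.0)])
extract = cloud.execute(
    "SELECT region, SUM(amount) AS total FROM events GROUP BY region"
).fetchall()

# "Local" side: load only the tiny extract and explore it interactively.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE region_totals (region TEXT, total REAL)")
local.executemany("INSERT INTO region_totals VALUES (?, ?)", extract)
top = local.execute(
    "SELECT region FROM region_totals ORDER BY total DESC LIMIT 1"
).fetchone()[0]
```

Only the two-row aggregate crosses the network; every subsequent slice-and-dice query hits the local copy, which is what makes the browser-side analysis fast and cheap for the vendor.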

Richie Cotton: So certainly saving money sounds like a good idea. And it sounds like raw data sets are generally biggest, so they're going to be somewhere centralized. And then by the time you've got something processed, that's going to be much smaller, so that arguably makes more sense to be done locally.

Okay I suppose the trick is just to be able to move fluidly from local to cloud and back again. So you're not worrying too much about that.

Tom Tunguz: That's the hard part.

Richie Cotton: Okay. All right. And so another big trend is that business intelligence platforms have been conquering everything, certainly the data analytics space, over the last few years.

And you mentioned Looker, and then there's something like Power BI and Tableau and all the rest. Do you see these BI platforms changing at all, or are they mature?

Tom Tunguz: I do think these BI platforms will change. The way that we think about BI over the last 20 years, and we talked a bit about this: in the early two thousands, BI was really centralized and controlled by a small number of people in order to ensure accuracy. That was in the year 2000, let's say. And then Tableau was formed in 2004 and really hit its stride over the next

five to ten years. That was about enabling an analyst to take control and analyze their own data, a completely bottoms-up strategy, with no centralized control, at the beginning. And so you have a huge pendulum swing from the center to the edge. Then Looker came and said, well, there are these next-generation cloud databases, these cloud data warehouses like Snowflake and BigQuery.

Why don't we try to exert some more control while giving some flexibility? And they deployed a language called LookML, which allowed the data team to define a metric and everyone in the organization to use that definition of revenue. And now I think where we're going is continuing to swing the pendulum back.

So the next generation of companies, like Omni, what they're trying to do is allow the metrics definitions to be created at the edge and then promoted all the way through with the right kinds of workflow. The big question, you know, the one that you asked, is, well, where does AI fit into this? There's a challenge with AI.

I think, at least in the BI world, AI has a role with SQL query completion. That seems to be a great use case. I may not know exactly the right syntax to do a windowing function, or I may not know exactly the right syntax for a UNION across two different tables, or a CTE. And so I can complete the code, complete the query, using AI.

The big question is whether or not people will trust these AI systems to answer questions like, what is the company's revenue broken up by region and territory, or by region and product? Because if I ask that question in a slightly different way, if I flip the GROUP BYs, effectively, I might get a different answer.

And so until we really solve that problem, the lack of trust, I think, will be a pretty significant barrier to fully automated, just-ask-a-question-and-trust-the-answer BI. And then, talking to different data teams, there's another nuance here, which is the interpretation of the data. Even if you have the right data, the interpretation is still the really hard part.

Is there a statistically significant difference between two averages, and then how does that impact what the business ultimately does? So there, I'm a little more cautious, because of some of the chaotic nature of the AI, let's say.
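The windowing-function syntax Tom says people rarely remember looks like this, runnable here via Python's built-in sqlite3 (SQLite has supported window functions since 3.25). The table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("emea", "a", 100.0), ("emea", "b", 60.0), ("apac", "a", 80.0),
])
# Rank products within each region by revenue: the kind of query an AI
# assistant can complete, but a human should still sanity-check.
rows = con.execute("""
    SELECT region, product,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
```

Note how flipping what goes in `PARTITION BY` versus `ORDER BY` would give an entirely different answer to a superficially similar question, which is exactly the trust problem Tom is pointing at.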

Richie Cotton: There's two parts then. So you've got the AI helping you write the code in order to actually do the analysis, and then the interpretation separately. And it seems like, for the simple queries, the SQL generation is pretty good. And I agree with you that it's impossible to remember the syntax for window functions.

So, yeah, better to get the AI to write that. But for more complex queries, I can certainly see how there'd be some trust issues. So, in that case, do you think we're going to need to have humans in the loop for a long time, just to sanity check the SQL?

Tom Tunguz: I think our bet is that the number of people working in data will actually increase by a big multiple, because many, many more functions and many more businesses will now become reliant on data, because they need it. And so what will change is the kinds of tasks that analysts are doing, but we'll need many, many more of them.

Richie Cotton: And on the subject of BI and AI, obviously the biggest generative AI application is ChatGPT. Do you think that chat interface is going to replace the BI point-and-click interface, or are the two things complementary?

Tom Tunguz: I think, so there's BI and then there's exploratory analytics. My sense is, in the BI landscape, it probably won't have that much of an impact. I mean, maybe it will help you find a dashboard or might point you to a data point. I think in the exploratory use case, there's much greater value, because let's imagine I'm Coca-Cola: which one of the bottlers has seen the greatest amount of volume growth in the last 12 months?

And then you might want to dig into that in a bunch of different dimensions. Let's say we wanted to tie that to contract structures. And so we would jump from the world of structured data, which is what's happening in the core, and we want to tie it to: is there a particular provision within the contract terms associated with these kinds of bottlers?

There I could see generative AI being really helpful, because its ability to classify and its ability to retrieve different kinds of information that might be written differently across different contracts would be really useful. And it's still early, so I might be completely wrong, but I think it will probably have more of an impact on exploratory analytics than it will on the core dashboarding and reporting of business metrics.

Richie Cotton: Yeah, so if you're trying to ask lots of questions quickly, like you do in EDA, then you probably want a chat interface. If you know what you want to build, then just do it: build the dashboard using the traditional tools and then you're done. All right. So this leads to something you mentioned: data scientists' jobs are going to change, so what new skills do they need in order to be able to cope with this new world?

Tom Tunguz: Well, I think the productivity expectations of every white collar worker will now go through the roof. I think everyone will be expected to be 50 to, I don't know, 250 percent more productive than they have been in the past, because they have these tools at their disposal. And so being familiar with all these different systems and understanding when to use them will be absolutely essential.

There's this professor at Duke who requires all his English students to use ChatGPT when they write. And it's a bit of a divisive perspective, but I'm very much in agreement with it. The reason is, when those students graduate, they'll be at a huge advantage if they understand how to use ChatGPT in a very sophisticated way to write.

The trade that he makes with the students is if there's a single grammatical error, you fail. And so just as much as the student can benefit from the technology, they also need to assume some responsibility for it. So I think the role will become less about data munging, data movement, data management, and much more about understanding:

Is the data correct? What is the right interpretation? How does this apply to the business? Which I think we would all agree is a far more interesting part of the job than reshaping a data frame from wide to long, for example.
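The reshape Tom refers to, wide to long, is what pandas calls a melt. As a minimal sketch of the operation in plain Python, with hypothetical column names, showing what that rote task actually does:

```python
# Wide format: one row per region, one column per month's revenue.
wide = [
    {"region": "east", "jan": 100, "feb": 150},
    {"region": "west", "jan": 80, "feb": 120},
]

def melt(rows, id_col, value_cols):
    """Reshape wide rows into long: one row per (id, variable) pair."""
    return [
        {id_col: row[id_col], "variable": col, "value": row[col]}
        for row in rows
        for col in value_cols
    ]

long = melt(wide, "region", ["jan", "feb"])
# long[0] -> {"region": "east", "variable": "jan", "value": 100}
```

In practice a tool like pandas' `melt` (or an AI assistant) handles this in one call, which is Tom's point: the mechanical reshape is the part worth automating.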

Richie Cotton: That's definitely a task that I think no data analyst or data scientist appreciates. It's always a pain when the data is in the wrong shape. Okay, so we've been talking about changes in skills. At the level of whole roles or jobs, have you seen any roles becoming more popular or less popular?

Like, are job titles going to change?

Tom Tunguz: I don't think the job titles will change. We haven't seen the impact broadly yet. I think that the main difference here is the structure of the organizations, where a lot of the core machine learning and data science functions are now being pushed to fuse with the core engineering teams. And this is because the machine learning systems, the AI systems, are now being put into production.

And so that's a very different place than where the data team used to live, which was downstream, analytical, post hoc, not in the path of production. And so one of the broader themes we're wondering about is: does the data team actually start to live underneath engineering? It's starting to happen, particularly in the smaller companies where AI is a core part of the feature and there's an output from a cloud data warehouse, or some kind of aggregation, that's being funneled into production. Because all of a sudden, if you think about the classic modern data stack, that whole tool chain, the whole value chain to deliver that data, needs to be production grade by the definition of a site reliability engineer working on a core website: it needs to have three or four nines of uptime reliability.

There needs to be alerting and monitoring around those core systems. And so there's this cultural fusion that I think will need to happen between the classic AI, data science, and machine learning teams and the core engineering teams. We're starting to see that broadly today, but it will take time.

Richie Cotton: What sort of culture clashes are you expecting there between data and engineering teams?

Tom Tunguz: Well, the engineering teams, like I said, they're accustomed to being on call; many of them will carry pagers for when these systems break. They're accustomed to using libraries, they think about shipping the product really fast, they code in a very particular way. So if we were to compare the Python of a web application engineer and the Python in an IPython notebook of a data scientist, one has nothing to do with the other, right?

In fact, if the core software engineering team looks at a Python notebook, it's like, what is this? I can't take that code and put it into production. I need to actually re-implement it, or rewrite it in a different language, so that it goes into the CI/CD flow, the continuous integration, continuous delivery flow,

and the code path of production. It's completely different than the way that data science has been accustomed to building. And so even the core workflows, like how to commit the code, what does the code format look like, is it packaged, is it a library, how do I make sure it has tests and checks and performs in a similar way to the other kinds of Python code within the code base: that needs to come together.
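As a hypothetical illustration of the gap Tom describes, here is the same logic in notebook style versus the way an engineering team would commit it: typed, documented, and with a check that CI can run (the function and its column names are invented for illustration):

```python
# Notebook style: globals, no tests, hard to import elsewhere.
#   df = load("sales.csv"); total = df["revenue"].sum(); print(total)

# Production style: a typed, documented, testable unit.
from typing import Iterable

def total_revenue(amounts: Iterable[float]) -> float:
    """Sum revenue amounts, rejecting negative entries."""
    total = 0.0
    for amount in amounts:
        if amount < 0:
            raise ValueError(f"negative revenue: {amount}")
        total += amount
    return total

# The kind of check a CI pipeline runs on every commit.
assert total_revenue([100.0, 150.0]) == 250.0
```

The substance is identical; what changes is that the production version can be imported, reviewed, and regression-tested like any other code in the code base.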

Richie Cotton: Okay, so it sounds like a lot of data scientists need to be able to write a great function, create great packages, look at style guides and how to structure their code. Are there any other skills along those lines that you think are going to be important for data people?

Tom Tunguz: Well, actually there is another skill: this idea of the data product. So I was a product manager at Google, and what we would do before deciding to build a product was create a product requirements document, a PRD, which described exactly what it is that we were going to build.

We would socialize it within the rest of the company, we would understand the dependencies, and then once that was complete, it would pass to the engineering team for implementation; obviously there'd be back and forth. I think what we're starting to see in some of the more sophisticated data engineering and AI organizations is they're starting to create data PRDs, and they're starting to think about tables or APIs as data products, treated just the way a regular production-grade API would be.

So they're starting to look like their own sort of engineering teams, with a data product manager, a data tech lead, and then a bunch of data engineers who are building this and maintaining it. So I think that formal way of building products will come to data.

Richie Cotton: Okay, so it's going from that sort of scrappy "I'll just do my analysis in a notebook" to "okay, actually, you have to think about who else is reading the code, who else is consuming this, and make sure it's good for public consumption." So for companies who are thinking, okay, there are all these new tools available, what do you need to do to take advantage of them?

Tom Tunguz: I think today it's all about experimentation and really understanding where AI projects work and where they don't. The ecosystem is changing really fast. I think the leaders of the next wave will be conversant in AI; they will know how to speak to their leaders and educate them on where the company should be spending time, where the company should be investing in terms of software. Because the reality is, every board and every C-suite is saying we need AI, we need AI, we're starting to see the productivity improvements from our competitors.

They're looking to the leaders within each individual team and department to educate themselves and then ultimately develop a strategy on how to leverage this technology internally. And so being an expert in that domain, I think, will lead to lots of promotions, because if you can cut two thirds of a workload, or if you can increase revenue by a third by making the sales team that much more effective through automation, there's a lot of value to be created.

Richie Cotton: So if you take advantage of this, you could be the real hero or heroine of your company then. Okay.

Tom Tunguz: Absolutely.

Richie Cotton: And are there any particular areas where you think data teams should be focusing their attention or companies should be focusing their attention to improve their data capabilities?

Tom Tunguz: I mean, I think the first is the data pipelines that are gonna be necessary to power sales optimization and customer support optimization. Those seem to be the two areas across companies where, for example, we're starting to see startups that are building fully automated sales development reps.

So instead of a human sending 10 emails per day, these systems are sending 1,000 or 2,000 emails a week. In order for those programs to be effective, the data pipelines to inform those outbound campaigns will need to be there. Same on customer support: for those chatbots that are able to deflect two thirds or more of the inbound customer support queries, the richer the context is, the more data those robots have about the customer in particular, or the FAQs, or the new product features, the better and more effective they'll be.

And so I think that's where you'll see a lot of effort and energy, because those are the two levers: to make a lot of money in software, you either increase the revenue of your customer or you materially reduce their cost. Sales is where you're going to increase revenue, and the customer support team is typically where a pretty significant cost comes from.

So I would expect that focusing on those data pipelines, enabling that to happen, will be great. The third-order priority is around marketing, because there's a technique in marketing called account-based marketing: building a website for a particular buyer, like a Coca-Cola or a Procter & Gamble. And historically, it's been really difficult to scale; as you can imagine, you need a lot of data to be able to do that.

Now we're starting to see the next generation of account-based marketing companies do this for every single customer in the universe, because they're automating it with machines. And there again, the data pipelines provide the context required.

Richie Cotton: Okay, that's really interesting. So start with stuff that is really going to have a direct impact on your revenue or your costs, and then go towards better marketing through personalization. Okay, those seem like pretty strong areas to target. Just before we wrap up, are there any companies that you are particularly excited about right now?

Tom Tunguz: Within the data ecosystem, you know, MotherDuck, who we talked about: I think they have the potential to really change the cloud data warehouse ecosystem with their unique hybrid architecture. Jordan, the founder and CEO there, was the tech lead at BigQuery, so he understands the domain. The other realization they've had as a company is that more than 80 percent of cloud data warehouse workloads are small enough to process on a modern Mac.

And so having this hybrid architecture allows companies to do that. The other one is Omni. This is the ex-Looker team, a key part of them, who got together with Chris Merritt, one of the architects of dbt. They're building a modern BI system that balances centralized control and metrics definitions with enabling an individual marketer to create a metric around cost of customer acquisition, and building the workflow to have that move all the way up to the centralized data team.

And they're having a lot of success. Those are the two I'd like to highlight today.

Richie Cotton: Okay, MotherDuck and Omni. Yeah, companies to watch out for then. And do you have any final advice for companies wanting to make better use of data?

Tom Tunguz: Keep going. It can be hard; there tends to be a lot of resistance associated with it. But what I've seen in my career, time and time again, is that the companies that move really fast and experiment with next-generation technologies are often able to find a lot of alpha and develop competitive advantage through it.

Richie Cotton: All right. Super. Great advice there. Thank you so much, Tom.

Tom Tunguz: Oh, pleasure was mine. Thank you, Richie.
