Allium’s CEO, Ethan Chan, reveals how Confluent’s data streaming platform helps them deliver clean, real-time blockchain data across nearly 100 chains, and why building your own infrastructure just doesn’t scale.
Blockchain may be decentralized, but reliable access to its data is anything but simple. In this episode, Ethan Chan, Co-Founder & CEO of Allium, shares how his team transforms blockchain firehoses into clean, queryable, real-time data feeds.
From the pitfalls of hosting your own data streaming infrastructure to the business advantages of Confluent Cloud, Ethan reveals the strategic decisions that helped Allium scale from 3 to nearly 100 blockchains, without burning out their engineering team.
Whether you’re building real-time monitoring tools or reconciling petabytes of cross-chain data, this episode offers a blueprint for scaling with Apache Kafka® and Confluent.
About the Guest:
Ethan Chan is the CEO and Co-Founder of Allium. With over six years of experience in the AI and natural language processing space, Ethan has come to believe that high-quality data is the key constraint in building the best-performing models. This insight drives Allium’s mission: to provide the highest-quality blockchain data infrastructure for crypto analytics and engineering teams at institutions and funds. Allium focuses on delivering clean, reliable data in a simple, accessible format—data served on a silver platter.
Before founding Allium, Ethan was the Director of Engineering at Primer, where he founded and led the Primer Command product and team, and re-architected the core machine learning pipelines powering Primer Analyze. Earlier in his career, he lectured at Stanford University and conducted research at the Stanford Intelligent Systems Lab as a graduate student.
Guest Highlight:
“We definitely always wanted to go for a managed service from the get-go… If you break [streaming], every other engineer at the company is going to complain and you become infamous, for no reason.”
Episode Timestamps:
(01:10) - Allium’s Data Streaming Strategy
(05:20) - Data Streaming Goodness
(26:50) - The Runbook: Tools & Tactics
(29:55) - Data Streaming Street Cred: Improve Data Streaming Adoption
(32:10) - Quick Bytes
(36:40) - Joseph’s Top 3 Takeaways
Our Sponsor:
Your data shouldn’t be a problem to manage. It should be your superpower. The Confluent data streaming platform transforms organizations with trustworthy, real-time data that seamlessly spans your entire environment and powers innovation across every use case. Create smarter, deploy faster, and maximize efficiency with a true data streaming platform from the pioneers in data streaming. Learn more at confluent.io.
0:00:06.4 Joseph Morais: Welcome to Life Is But A Stream, the web show for tech leaders who need real-time insights. I'm Joseph Morais, technical champion and data streaming evangelist here at Confluent. My goal? Helping leaders like you harness data streaming to drive instant analytics, enhance customer experiences, and lead innovation. Today I'm talking to Ethan Chan, co-founder and CEO of Allium. In this episode, we'll find out what it takes to build a business on real-time data. We'll break down how data streaming has become a core part of Allium's product strategy, why partnering with vendors turned out to be a smarter, more cost-effective solution, and how Ethan rallied both his engineers and customers around a real-time vision. But first, a quick word from our sponsor.
0:00:44.4 Announcer: Your data shouldn't be a problem to manage. It should be your superpower. The Confluent data streaming platform transforms organizations with trustworthy, real-time data that seamlessly spans your entire environment and powers innovation across every use case. Create smarter, deploy faster, and maximize efficiency with a true data streaming platform from the pioneers in data streaming.
0:01:12.7 Joseph Morais: Welcome back. Joining me now is Ethan Chan, co-founder and CEO of Allium. How are you today, Ethan?
0:01:17.8 Ethan Chan: Doing great. Thanks for having me on board. Yeah.
0:01:20.3 Joseph Morais: Oh, it's my pleasure. Let's jump right into it. What do you and your team do at Allium?
0:01:25.5 Ethan Chan: We are a blockchain data platform company. We take in all the blockchain data, ingest it, normalize it, standardize it, map it to different use cases, and serve it to our customers. If you think about what Bloomberg did for financial data, or what Google did for public web page data, we are doing the same thing for blockchain data. We are organizing the world's blockchain data and making it accessible: for customers in the blockchain and crypto space who need to understand what activity is happening, for financial institutions like Visa who need to understand stablecoin movements, and also for the public sector. How do you put good policy in place? How do you regulate this digital asset industry? All of them have the same underlying need for really good, high-performing blockchain data.
0:02:11.2 Joseph Morais: And is it specific to cryptocurrency, or are there applications of blockchain that are outside of that?
0:02:15.9 Ethan Chan: Think of a blockchain as a Turing machine, as a computer. A currency on a blockchain is just a smart contract that someone defines, and that a group of people agree, has value. A smart contract is just a small subset of it, and a lot of our customers are building their applications on top of this globally distributed computer. We work very closely with them, because blockchains are mainly optimized for writing data to the blockchain, not reading data from the blockchain, and that's really where Allium plays. Anytime your application needs to read state, like my wallet or my balance, it can read the data from Allium.
0:02:53.8 Joseph Morais: That's fantastic. And just to kind of tie it together, for me personally, all of this data, because it's on a blockchain, is out there publicly. But being able to consume all that data across many blockchains, across many different currencies, and turning it into meaningful information, that's what Allium does?
0:03:10.6 Ethan Chan: Yes, that's correct. So I think there are a couple hundred blockchains out there already. We have nearly a hundred of them. There are probably more than a hundred million different tokens out there with different prices, and there are petabytes and petabytes of data, and it's only growing over time. Someone has to go in and organize all of that and make it manageable.
0:03:30.0 Joseph Morais: So who are your customers and who aren't your customers?
0:03:32.7 Ethan Chan: Software engineers who need to read really high-performance data to power their applications like wallets, trading apps, or even real-time monitoring systems. We also serve analysts, who want to understand things like relative market share: is a token trending up or trending down, should I buy or should I not buy? And thirdly, we serve a lot of accounting and auditing use cases, where people who are trading on the blockchain want to make sure that they don't go to jail. So they have all their data in one place, so that they can reconcile their finances in one single place, in one single ledger.
0:04:08.4 Joseph Morais: Fantastic. So I know that you guys are all in on data streaming, and I know today's conversation we're really gonna get in depth, but at a high level, what is your company's product strategy around data streaming, or how is data streaming involved in your product strategy?
0:04:23.6 Ethan Chan: Internally, data streaming allows us to pass all the data around within our own data systems. On the business side, growing revenue and growing the customer base, it's very crucial because a lot of our customers demand real-time data, and they want to control the logic they put on top of the blockchain data. How it relates to the business is that our customers who need data streaming want to know: give me the latest balances, give me the latest transactions for these sets of wallet addresses I care about. That's where the real-time use case, and being able to run their own if-this-then-that rules on top of the data streams, is very important for them, and for us as well.
0:05:03.9 Joseph Morais: Right. So internally it's about decoupling microservices, producers and consumers, but externally to your customers, it's about how do I provide that data to them?
0:05:12.3 Ethan Chan: Yes.
0:05:12.7 Joseph Morais: That's a really good strategy.
[music]
0:05:22.5 Joseph Morais: So we've set the stage; let's dive deeper into the heart of your data streaming journey in our first segment. So Ethan, tell me, what have you built or are currently building with data streaming?
0:05:31.8 Ethan Chan: When we first came into this industry three to four years ago, having spent a lot of time in the data infrastructure and machine learning space, what we saw of how people accessed real-time data in this industry was mostly polling. The limitation of continuous polling and webhooks is that they just don't offer the level of enterprise guarantees this industry expects as it matures and grows. The problem with webhooks is that whenever your webhook goes down, you don't know what messages you missed, right?
0:06:05.9 Joseph Morais: Sure.
0:06:06.5 Ethan Chan: And so, now let's say you're a bank, you're a custodian custodying all these assets, and you're building these risk monitoring rules. If you missed a transaction, you don't even know that you missed it. That's not very good. So that's really our journey into this space, and that was when we were one of the earlier companies in the space that said, "Hey, we're not gonna reinvent the wheel." There's a very well-known framework that does data streaming and real-time data delivery very well. And that's how we ended up on the Kafka choice, yeah.
0:06:40.8 Joseph Morais: Gotcha. So you have mechanisms to go to those webhooks to grab that data, but then when you're presenting it, you're using data streaming as that intermediate layer that has all those enterprise scaling and consistency features?
0:06:53.8 Ethan Chan: On the one side, we actually do work very closely with the RPC nodes to get the data out as soon as possible. And then yes, in terms of fanning it out to different destinations and sharing it with our customers, or even the internal microservices you mentioned, that's what we use data streaming for.
0:07:11.3 Joseph Morais: So what inspired you to originally use data streaming? Was there a specific tipping point?
0:07:16.1 Ethan Chan: Specifically on data streaming, at the previous company I was at, I wasn't a founder, but I was one of the earlier engineers there. I remember I had a teammate, a colleague, who was managing all the data streaming services. It was very painful. And that was the tipping point: I always knew that if we ever did any data streaming, we would try to work with a provider that we could trust and rely on. Yeah.
0:07:39.9 Joseph Morais: That's great, 'cause my follow-up is gonna be why Confluent? [laughter]
0:07:44.2 Ethan Chan: I would definitely say, in my own personal journey as a developer and engineer, in the early days I always said, I can build that myself. Ah, this is an open source library, I can host it. Ah, don't pay that vendor. I used to be like that: I can build it, I know how to build it. And then you realize that you have to do migrations, you have to maintain stuff 24/7, and there are a lot of random edge cases you don't know about until you deploy into production. Like, what happens if your node goes down? Do you have redundancy? All these little things that honestly burn your weekends away, especially if you're breaking such a critical system, because streaming happens almost at the extreme left of the pipeline.
0:08:25.1 Ethan Chan: So if you break that system, every other engineer at the company is gonna complain and you'll just become infamous for no reason. And a lot of that normally happens when you try to host it yourself. And again, I'm sure there are experts that could do it pretty easily, but most people aren't experts. And that's why we definitely always wanted to go for a managed service from the get-go.
0:08:49.4 Joseph Morais: Absolutely. Especially if it becomes your critical system of record. Some people aren't that far into data streaming yet, but when they are, they find it becomes something that can't break. So why not pass that off to somebody who's made it as unbreakable as it possibly can be? And then of course, there's that level of support if things do go awry, which I hope they never do. Now, I know Allium has become a huge standout in the Confluent for Startups program. The way I understand it, you guys discovered Confluent Cloud and just started with our free trial, and then eventually were issued $20,000 worth of credits as part of that program. And I know you guys have grown to a six-figure commitment with Confluent, which is unbelievably exciting. So tell me how you got started in the program, and what drove that level of investment and growth?
0:09:37.9 Ethan Chan: Confluent had a lot of connectors already built, and one of the critical connectors was with Snowflake. So the question is not whether you can do it, it's whether you should do it. That's really the driver behind a lot of the decisions that we make now. Sure, we could write our own connector, but do we wanna maintain it 24/7? Do we wanna do that? With all the various connectors Confluent already had, we connect to Snowflake, BigQuery, and Databricks as well. With the brand and the long tail of integrations, I never have to worry about that. So that was a big plus. But what also drove more usage is that, two years ago, we probably only had, I want to say, three blockchains on our platform: Ethereum, Polygon, maybe Solana to some extent. And now we have close to 100. I think we are at 85 or 86 right now, and we'll probably hit 100 pretty soon. So just to give you a sense, we have 10, 20X the number of... Actually 30X the number of blockchains since we first started on Confluent.
0:10:48.6 Ethan Chan: And then the third phase is that a lot of customers started to realize that, look, webhooks are not gonna cut it. I get it: I'm a developer, I know how to use a webhook, really quickly get stuff going. But if something happened on a blockchain, I need to at least know it happened. If my system goes down, I need to be able to replay the history so I can reconcile my data again. Those are the very valuable pieces of what Confluent and Kafka provide, and that's what really drove a lot of the growth. So to recap: connectors, the explosion of data size and the number of chains, and also just the industry maturing and saying that, look, we cannot build mission-critical systems on top of webhooks anymore.
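To make the replay point concrete, here is a minimal sketch of the pattern Ethan describes: a consumer that rewinds to an earlier point in the stream and reprocesses history after an outage. It uses the confluent-kafka Python client; the topic name, group id, and the 24-hour window are hypothetical placeholders, not Allium's actual setup.

```python
# Minimal sketch: rewind a Kafka consumer and reprocess history after an outage.
# Topic, group id, and credentials are hypothetical placeholders.
import time
from confluent_kafka import Consumer, TopicPartition

conf = {
    "bootstrap.servers": "<confluent-cloud-bootstrap>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
    "group.id": "balance-reconciler",
    "auto.offset.reset": "earliest",  # start from the beginning if no committed offset
}
consumer = Consumer(conf)
topic = "ethereum.transactions"  # hypothetical topic name


def rewind(consumer, topic, hours=24):
    """Assign every partition at the offset closest to `hours` ago."""
    start_ms = int((time.time() - hours * 3600) * 1000)
    metadata = consumer.list_topics(topic, timeout=10)
    wanted = [TopicPartition(topic, p, start_ms)
              for p in metadata.topics[topic].partitions]
    offsets = consumer.offsets_for_times(wanted, timeout=10)
    consumer.assign(offsets)  # consumption resumes from the resolved offsets


rewind(consumer, topic)
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    # Reconcile each replayed event against the local ledger here.
    print(msg.key(), msg.value())
```

A webhook offers no equivalent: once a delivery is missed, there is nothing to seek back to, which is the gap Ethan is pointing at.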
0:11:29.4 Joseph Morais: Before they were your customers, what were these businesses, what were the real challenges that you said, you know what, we need to build a company around this?
0:11:38.6 Ethan Chan: Imagine that everything the blockchain spits out is structured, but also that anyone can write any dumb smart contract into the blockchain. Anyone. Again, there are 85-plus different blockchains, of which there are maybe 10 bigger ecosystems. Think of each ecosystem as a different operating system: Windows, Linux. Even within the Linux ecosystem there are different forks, you have Red Hat, different variations there. So imagine you have that many different operating systems, different types of blockchains out there, and there's some sort of social agreement that certain smart contracts mean a transfer of real money. But then there's a huge problem of fragmentation: understanding what happened across all the different ecosystems, and normalizing and standardizing that. And so, the key insight we saw is that people don't want to go in and become an expert in every single operating system, every blockchain system, then parse the data, and then map it to the use case they're eventually trying to solve. So let me talk about what these real-world use cases are.
0:12:46.2 Ethan Chan: For example, Visa is one of our customers. They just want to know the stablecoin transfer volumes across different blockchains and across the different stablecoins out there. A stablecoin is a currency backed by real money, real U.S. dollars held by some bank somewhere. They just want to know, for the 50 different blockchains out there, how is money getting transferred around? They want to know what the activity is, of course, because Visa is one of the biggest payment networks in the world. And for a lot of customers like that, the thesis, the insight, was that there was no way these folks would want to come in, hire 10 or 20 engineers, parse everything out, normalize it, and pay, you said six figures, but the total spend is way more than seven figures across all the infrastructure services, just to be able to answer a very simple question like, hey, how much volume was transferred in the past 24 hours?
0:13:43.8 Ethan Chan: Just to get to that very simple answer, there is so much data munging that goes on behind it. My thesis was very simple: if you build a very efficient company and you can execute well, you can do this once and sell it multiple times. And really, that was the core insight. At the emotional heart of why we did this is that I used to be a data scientist at some point as well, and the bane of that job is cleaning the data. No one wants to clean the data.
0:14:16.1 Joseph Morais: No, they want the good stuff.
0:14:17.4 Ethan Chan: They want the good stuff. They wanna just run their machine learning models, present the really nice metrics, and then show they grew revenue by 5% and then get a promotion. That's really what you care about, but then 95% of the day is actually figuring out where the data is and how to clean it.
0:14:35.2 Joseph Morais: Right. What does this field mean?
0:14:38.4 Ethan Chan: What does this field mean? And for the blockchain it's very similar, except maybe it's not just the data scientist but a financial analyst, a product manager, a growth engineer, whoever wants to understand more about what's happening; they have to go through all of the same steps. And I think if we can do it better, faster, and cheaper, why should someone not use us? So that was really the key insight. Yeah.
0:15:03.9 Joseph Morais: It makes a lot of sense. So you're like, I realized that in order for these other financial institutions to do this, they're gonna have to build these teams and it's gonna be this many people. And this is just to get to the dirty data, and then to clean it and to get insights. You're like, I bet if we build something like this, all these other institutions would use it, and they don't have to have their own teams of 10 people just to get the same thing that we're gonna try and build. I'm glad you had this insight because it sounds like you're making it a lot easier for these institutions to get the quality data that they need. And that's something we talk about here is this idea of data products. It's not just having that raw data that's hard to classify, doesn't have any metadata. It's about taking that, making sure it's properly classified, making sure that the events are correct, and then taking multiple streams of data, and then converting them or combining them or filtering them or modifying them so that they're usable.
0:15:54.3 Joseph Morais: And then you have that great downstream data product. So I know you've already talked about integration a bit, especially with Snowflake and Databricks, but what outcomes have you seen or aiming for specifically with stream processing, and then any other integrations that you may not have mentioned yet?
0:16:09.2 Ethan Chan: We do a lot of real-time processing. We use Apache Beam today for a lot of our real-time stream processing. So imagine, the moment data comes up fresh from the blockchain, we also do a lot of processing. We have a Lambda architecture where we use Snowflake and Databricks for the dbt hourly builds, but we also have the real-time system for the more mission-critical data schemas we have to parse out, like real-time balances, real-time transactions, real-time NFT trades, DEX swaps. Just to give you a sense, those are the types of schemas we have. Our customers don't want just the raw data. I like to joke that even if we just delivered raw data to our customers, it's the start of their problems. Even if I gave you the entire corpus of petabytes of raw data for free, you would still spend, I don't know, a couple hundred thousand just to store the data on your side. So you don't even wanna keep it yourself.
0:17:10.5 Ethan Chan: Who wants to keep that much data? I'd want to double-check this, but I think we're doing about 120 megabytes per second of data through Confluent today. Per second. It also shows where the entire blockchain industry is; it's kind of where broadband was about 10 years ago, give or take. But let's take it one step further: there's no humanly possible way for Allium to fit every one of the million use cases that the blockchain can spit out, because anyone can publish any use case, any smart contract, out there. So how do we allow our customers to build on top of the already enriched data that we have? We give them the flexibility to bring their own transformations and shift left, so to speak. I know that's the hot word right now, shift left. And so, we are also helping a lot of our customers who say, hey, I just want to filter on a subset of the data, do some simple transformations, and get my answer quicker. I only care about a sliver of it.
0:18:13.4 Ethan Chan: And then you extend it further: I want to build an alerting workflow, a monitoring workflow, on top of the enriched data. Again, that's another downstream step where you feed it another set of conditions. We want to become not just the data platform for this industry, but the data operating system for this industry. So we want to be the operating system for accounting, for analytics, for building parts of your applications; how can we become that operating system for your company? And it's very apt, because the blockchain is an operating system. So we are the data operating system for that layer of this industry.
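The "filter on a sliver, then alert" workflow Ethan describes is, at its simplest, a consume-filter-produce loop running against an already-enriched topic. Here is a minimal sketch of that idea with the confluent-kafka Python client; the topic names, JSON fields, wallet addresses, and threshold are hypothetical illustrations, not Allium's actual schemas.

```python
# Minimal sketch of a customer-side "shift left" workflow: subscribe to an
# enriched stream, keep only the events you care about, and emit alerts.
# Topic names, JSON fields, addresses, and the threshold are hypothetical.
import json
from confluent_kafka import Consumer, Producer

WATCHED_WALLETS = {"0xabc...", "0xdef..."}  # wallets this customer monitors
ALERT_THRESHOLD_USD = 1_000_000             # alert on transfers over $1M

consumer = Consumer({
    "bootstrap.servers": "<bootstrap>",
    "group.id": "wallet-monitor",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "<bootstrap>"})
consumer.subscribe(["enriched.transfers"])  # hypothetical enriched topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # "If this, then that": only the sliver of events this customer cares about.
    if event.get("to_address") in WATCHED_WALLETS and event.get("usd_value", 0) >= ALERT_THRESHOLD_USD:
        alert = {
            "wallet": event["to_address"],
            "usd_value": event["usd_value"],
            "tx_hash": event.get("tx_hash"),
        }
        producer.produce("wallet-alerts", key=event["to_address"], value=json.dumps(alert))
        producer.poll(0)  # serve delivery callbacks
```

The same logic could live in a stream processor rather than a plain consumer; the point is that the condition runs close to the enriched source instead of in a downstream warehouse job.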
0:18:47.3 Joseph Morais: For the audience, shift left, in case you've never heard it, the idea is about bringing stream processing, or processing of data, closer to the source. So in this case with Allium, they're grabbing all their data and putting it into data streaming. In a lot of ways, that's as close to the source as they can get when manipulating that data; they're not gonna manipulate it before that. The idea is that a lot of people nowadays are sending their operational data to the analytical systems, doing a lot of that data processing there, building their data products, and then maybe doing something like reverse ETL back. But the idea of shifting left is to do more of that processing closer to the source. And there are a lot of advantages to that. One, the people close to the source, the people producing the data, are usually the ones that know it best. So that's a good thing. You can reuse those data products in your operational estate. And it also reduces a lot of duplication of processing downstream. So I'm curious about another aspect of the DSP, something we talk about here a lot, the data streaming platform. How do you approach data governance?
0:19:47.1 Ethan Chan: We are the data governance team for a lot of this industry. Not a lot, but a lot of important teams in this industry.
0:19:53.6 Joseph Morais: Sure.
0:19:54.5 Ethan Chan: And a lot of it comes down to data verification.
0:19:56.6 Joseph Morais: Yes.
0:19:57.1 Ethan Chan: So we made those investments very early on, very, very, very early on. And people also forget that verifying your data is almost as expensive as ingesting the data.
0:20:07.7 Joseph Morais: Yes.
0:20:08.5 Ethan Chan: People don't know that. Knowing where you messed up costs about as much as ingesting the data again. There are clever ways to optimize here and there, but conceptually, that's the maintenance you have to do. Right now, every few minutes we have an Airflow DAG that runs all the checks. Do all the blocks exist? Do all the transactions exist? Are we missing anything? Does the current block point back to the previous block? Are we missing any transaction hashes? We do a COUNT(*) for a certain time window and then cross-check it against other schemas to see whether they all join up nicely. It's a lot of work, and we do it for every single one of our 85 blockchains out there. That's the first piece of data governance: verifying that your primary copy of the data is correct. Here's where it gets even more complicated: we have customers all over the world. We have customers in Asia, in U.S. Central, U.S. East, U.S. West, on Databricks, on Snowflake, on BigQuery in Europe.
0:21:13.8 Ethan Chan: We have to verify that when we replicate our data from our main copy across the world, we're not losing any data. My thesis is that anytime we move data from point A to point B, you have to re-verify it at the destination. You have to run the checks again. That is also very expensive. And I like to joke that we are sort of like a CDN for the data layer of this industry.
0:21:39.4 Joseph Morais: Yeah. Interesting.
0:21:40.4 Ethan Chan: Because we actually have to replicate the data across a lot of different data regions and providers around the world. Again, bring it back to the question of which customer wants to store one petabyte of data? So internally we had this project that we call Data 360, a cooler name than data governance, to verify that the data is complete in every single region. Also, we have tens of thousands of schemas, and we don't wanna replicate everything, so keeping track of what we replicate is another headache. For us, data governance takes that multifaceted global approach. And I do like to joke that maybe we centralize decentralized data, and then we decentralize it again by spreading it across the world. So anyway.
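The checks Ethan lists, gaps, parent-block continuity, and cross-schema counts, map naturally onto a scheduled set of SQL assertions. Below is a minimal sketch of that idea using Airflow's TaskFlow API with a Snowflake connection; the table names, columns, and specific queries are hypothetical illustrations, not Allium's actual DAG.

```python
# Minimal sketch of a recurring data-verification DAG in the spirit of the
# checks Ethan describes. Table and column names are hypothetical.
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def run_scalar(sql: str):
    """Run a query against the warehouse and return its single value.
    Assumes a configured `snowflake_default` connection; any DbApiHook works."""
    return SnowflakeHook(snowflake_conn_id="snowflake_default").get_first(sql)[0]


@dag(schedule=timedelta(minutes=10), start_date=datetime(2024, 1, 1), catchup=False)
def verify_ethereum_blocks():

    @task
    def check_no_block_gaps():
        # Every block height between min and max in the window must exist.
        missing = run_scalar("""
            SELECT (MAX(number) - MIN(number) + 1) - COUNT(DISTINCT number)
            FROM ethereum.blocks
            WHERE block_timestamp >= DATEADD('hour', -1, CURRENT_TIMESTAMP)
        """)
        assert missing == 0, f"{missing} block heights missing in the last hour"

    @task
    def check_parent_hash_continuity():
        # Each block's parent_hash must match the hash of the previous block.
        broken = run_scalar("""
            SELECT COUNT(*)
            FROM ethereum.blocks b
            JOIN ethereum.blocks prev ON prev.number = b.number - 1
            WHERE b.parent_hash <> prev.hash
        """)
        assert broken == 0, f"{broken} blocks do not point back to their parent"

    @task
    def check_transaction_counts():
        # Per-block transaction counts must reconcile with the transactions table.
        diff = run_scalar("""
            SELECT SUM(b.transaction_count) - COUNT(t.hash)
            FROM ethereum.blocks b
            LEFT JOIN ethereum.transactions t ON t.block_number = b.number
            WHERE b.block_timestamp >= DATEADD('hour', -1, CURRENT_TIMESTAMP)
        """)
        assert diff == 0, f"block/transaction counts differ by {diff}"

    check_no_block_gaps()
    check_parent_hash_continuity()
    check_transaction_counts()


verify_ethereum_blocks()
```

The same assertions would then be re-run against each replicated copy, which is the "re-verify at the destination" cost Ethan describes.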
0:22:31.5 Joseph Morais: Yeah. Well, especially with all that replication and distribution of data, you wanna make sure you have it right from the beginning. That makes a lot of sense. So let's talk about data sharing. I know that is part of the way you deliver to your customers using stream sharing. Can you talk about that a little bit?
0:22:46.0 Ethan Chan: Yeah, we use data sharing and stream sharing. FYI, we also do a lot of data sharing in the data warehouses; every major data lake and data warehouse provider has their own form of bulk data sharing already, and we do all of that as well. But let's talk about the real-time stream sharing piece. We have, I wanna say, close to a thousand different topics that we share across different organizations, across different blockchains, across different schemas today. A lot of our customers depend on our real-time data streams and feeds to power their own apps, build their own monitoring systems, or even reconcile their own internal data systems. Sometimes they already have their own system running, and they just want to use our data streams to double-check their work, like checking their homework, whether or not they're correct. It's been very easy to use stream sharing through Confluent, very, very easy. In fact, you can ask my customers, I demo Confluent all the time, because I say, hey, look at all these real-time data feeds streaming in front of you. One-click share. You give me your email, and you're good to go. You could push into production today if you wanted to, right after this call.
0:23:55.2 Joseph Morais: Yeah, I'm glad that feature is of so much use to you. Again, for the audience, stream sharing is something that is exclusive to Confluent. We make it extremely simple for anyone with a Confluent Cloud account to share a data stream, a topic, with anyone else with a Confluent Cloud account. And as Ethan mentioned, some of our partners in the analytics space have very similar features as well, but it's a pretty exciting feature and I'm so excited Allium gets to use it to its fullest. And that's really the benefit of the Confluent data streaming platform: it's not just the data streams or just the stream processing or just the support or the connectors. It's all these other value-add features that you can really build your business around. And that's why I get so excited talking about the DSP. So tell me, I'm sure you never expected a question like this, but what's the future of data streaming and AI at Allium?
0:24:46.4 Ethan Chan: So I spent six years in NLP, AI, and machine learning before starting Allium. That's actually my background. For the first couple of years at Allium, people always asked me, why don't you go back to AI? What is your AI strategy on top of the crypto data, or in this space? And I always said, I'm gonna wait and see, wait for a lot of the infrastructure scaffolding to be built, and then I will save a lot of time by buying, not building, the same approach we took with streaming. I did enough of the machine learning stuff myself to know that there's a lot of it that, if I don't wanna do it, I don't wanna do it. Please let someone else do it. The good news is that there's been a Cambrian explosion of 10,000 startups that you can work with and partner with to build a lot of the AI infrastructure already. And so, how we're looking at it is that we already have an AI assistant on top of our dataset, because a lot of our customers don't want to understand what schemas we have, they just want the answer.
0:25:38.4 Ethan Chan: So the AI assistant has been very good at showing them which schemas to use and crafting the right queries. And then we wanna take that one step further, because I mentioned we wanna be the data operating system. We want people to build their workflows on top of our data. So we're building a lot of these primitives for people who, say, want to reconcile all their balances and audits in one single place. How do you design the right primitives so that an AI agent can come in and automate a lot of that? And we think it starts with the data, because the data is the hard part. I think we have a very strong foundation and we're gonna keep building layer by layer. And I'm very pragmatic; as much as investors wanna hear me say we are AI-first and everything, the bottleneck to the best AI models, the best AI outcomes, is the best datasets. And that is what I've been focusing on since the beginning. Yeah.
0:26:27.7 Joseph Morais: I think it's a really good approach, Ethan, honestly. I know I'm certainly biased, but I really feel like all emerging tech has a single crux, and that's the data. If your data's not ready, presentable, and easily accessible, it doesn't matter what newfangled thing you're gonna introduce, you're gonna be limited by the data.
0:26:44.5 Ethan Chan: Yeah.
[music]
0:26:52.0 Joseph Morais: Our next segment is The Runbook, where we break down strategies to overcome common challenges and set your data in motion. So Ethan, tell me, other than Kafka, what is the top tool Allium relies on for data streaming today?
0:27:05.1 Ethan Chan: We use Apache Beam today, and we run it on Dataflow. We use Dataflow right now for a lot of the streaming pieces; we write all these workers just to extract the right fields, do all the custom smart contract parsing, and then send the data on its merry way. So that's one of the bigger tools we use for the real-time piece. Yeah.
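For readers unfamiliar with Beam, the "read, extract the right fields, send it on its way" step is essentially a streaming pipeline like the minimal sketch below, written with the Apache Beam Python SDK. The Kafka topics, field names, and decoding logic are illustrative assumptions, not Allium's actual code, which could just as easily read from another source.

```python
# Minimal sketch of a streaming Beam pipeline: read raw blockchain events,
# extract the fields a downstream schema needs, and write the result back out.
# Topics, field names, and the parsing logic are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions


def extract_transfer(raw_value: bytes) -> dict:
    """Hypothetical parser: pull out the fields a 'transfers' schema cares about."""
    event = json.loads(raw_value)
    return {
        "tx_hash": event["hash"],
        "from_address": event["from"],
        "to_address": event["to"],
        "value_wei": int(event["value"], 16),  # assumes a hex-encoded value field
        "block_number": event["block_number"],
    }


options = PipelineOptions(streaming=True)  # pass --runner=DataflowRunner etc. on the CLI

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRaw" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "<bootstrap>",
                             "group.id": "beam-transfer-parser"},
            topics=["ethereum.raw_transactions"],
        )
        | "Parse" >> beam.Map(lambda kv: extract_transfer(kv[1]))
        | "Encode" >> beam.Map(lambda t: (t["tx_hash"].encode(), json.dumps(t).encode()))
        | "WriteEnriched" >> WriteToKafka(
            producer_config={"bootstrap.servers": "<bootstrap>"},
            topic="ethereum.enriched_transfers",
        )
    )
```

Running the same file on Dataflow is mostly a matter of pipeline options; the transform logic stays identical, which is part of Beam's appeal for this kind of always-on parsing.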
0:27:28.4 Joseph Morais: Excellent, Beam. So are there any tools or approaches that you actively avoid? And that could be specific to data streaming, a vendor, or an architecture.
0:27:37.3 Ethan Chan: Okay, so we have a lot of dbt models running on Snowflake already, all the usual dbt transformations. You know how to use them, but they don't perform in near-real-time. That's why you have to use something like Beam. But ideally, why do I need to replicate my code somewhere else in a different language and a different framework? So it's not really a tool to avoid, but there have been many companies, the Modern Data Stack wave maybe four or five years ago, trying to tackle this problem. I really wish someone would solve it overnight. I still haven't really found it, at least not solved at our scale, to be fair, because it would save us a lot of time. We've tried a number of them, and we're still looking for that holy grail, but I actually don't think it exists, because I've been hearing about it for years. I used to work on Apache Spark back when I was at Cloudera; I built an auto-config tool to keep jobs from going out of memory, managing the Java heap so there were no OOMs, and I worked on some of the Spark Streaming tuning. But Spark Streaming back then was really mini-batching. So yeah, I'm still waiting for that holy grail.
0:28:56.8 Joseph Morais: Right, so avoid completely translating all of your code and your data to another system; make it all interoperable, or find things that are interoperable. I think that's a really good pattern. It's really good advice for anyone.
0:29:08.3 Ethan Chan: Yeah. And also, just one nuance here, at least for the blockchain space specifically: the skill set needed to understand the smart contracts in this industry and to write high-performance real-time code, there aren't many people with that intersection of skills. So it's actually a very labor-intensive task and pretty expensive, because a lot of people in the field, even if they know the crypto data, only know SQL; they don't know how to build an engineering system. Those are actually two different skill sets. So if we can unify that, it changes the way we operate significantly, instead of having two separate teams, almost. Yeah.
0:29:46.1 Joseph Morais: Makes a lot of sense. You need like a multimodal tool that can welcome anybody.
0:29:51.1 Ethan Chan: Yeah.
[music]
0:29:58.5 Joseph Morais: For a lot of your customers to even adopt your services, they have to start with Confluent Cloud. So I'm curious, how do you get your customers to buy into data streaming? Now I know you actually take them through the UI, but have you ever had any pushback and were you able in those scenarios to convince them yes, this is the right way to consume our data streams?
0:30:17.7 Ethan Chan: Through any sales process, you always have to educate them and tell them why it's good. For most people, it's human nature: even if you know something is better, it doesn't mean you'll do it. Like, I should not drink, I should run more, I should exercise more. Everybody knows that, I need to be more hardworking or something. But no one actually does it. I know I need to do it, but I'm not gonna do it.
0:30:37.4 Joseph Morais: Right.
0:30:38.6 Ethan Chan: So that's a lot of...
0:30:38.7 Joseph Morais: We're battling what we should do versus what we wanna do? [laughter]
0:30:41.7 Ethan Chan: Yes, yes, yes. So on that regard, signing up for Confluent, obviously people are very scared about new vendors. Like, oh no, that's another part of my bill. I thought I'm only paying Allium, why do I now have to pay Allium plus someone else? So there's a lot of that friction involved, and a lot of it involves educating the customer on how to actually make their lives easier. But also, if someone really wanted webhooks, we could do it. It's almost like we'll do it just to prove a point. And then they're like, okay, you know what? Let's use an event bus instead. So I think that's really part of the friction. And of course, some of the bigger institutions can't just sign up for a new organization ID overnight. If it's a startup, yeah, they can do it within one minute, but the bigger ones need to get IT approval, and that can take months. So obviously there's some friction there. And so, a lot of it is, how do we meet them where they are?
0:31:35.3 Ethan Chan: Maybe they're not using Confluent, so how do we map it directly to them? Maybe there are other event buses; we can fan the data out to another event bus in their own local environment. We actually do a lot of those workarounds. Obviously it takes more time and effort, but you do what you have to do. Because at the end of the day, I strongly believe that not only are we the blockchain data experts, but we also want to be like FedEx. We live to deliver. We will deliver data wherever you are.
0:32:02.9 Joseph Morais: Yeah, wherever it is.
0:32:03.0 Ethan Chan: Wherever it is. Like we'll meet you where you are, whatever country you're in, whatever form factor you're in, whatever stack you wanna use, we have to meet you where you're at. Yeah.
0:32:11.9 Joseph Morais: I like that. You just get it done.
[music]
0:32:19.4 Joseph Morais: All right. So now let's shift gears and dive into the real hard-hitting content, the data streaming meme of the week. And this is a unique one because usually, when I do a data streaming meme of the week, the person I'm interviewing is not the creator of the meme. See, I love this meme. So tell me what inspired you to make this?
0:32:36.2 Ethan Chan: Okay. So for this one, it's David Beckham and Posh Spice, "tell me really the truth." It starts with, it's free, be honest. No, it's open source, it's free. Really, you can run it yourself for free. Be honest. And then it ends with, no, you really need three full-time engineers, pay for infra, pay for data, pay for storage, pay for networking, certification, trainings, an SRE team around it, and then it's really "free." I may have actually borrowed this from someone who talked about Confluent, to be honest. I don't take credit; you can see there's actually a URL on it. I didn't create this myself, I just reposted it. But I reposted it because I'm also in the business of selling infrastructure, and I see this all the time, because people always tell me that blockchain data is free: I can hit it myself, I can get it myself, I have a computer science degree, I can figure this out really easily. Blockchain is easy. It's smart contracts, whatever. It's easy.
0:33:46.1 Ethan Chan: And it always goes back to the fact that people never really take into account the total cost of ownership of building and maintaining a system. If you're doing a hackathon project, it doesn't matter. But when you're an enterprise, you're a serious business, you wanna build for the long term. When you take in the total cost of ownership, you see why the stuff that is "free" is free, because it's not really free. And again, as I mentioned at the start of our interview, as I matured, I would definitely say, for my first couple of years I was probably one of those engineers who just said, "I can do it myself, we should be doing it ourselves. This doesn't make any sense. Why are we paying this vendor? Pay me more money instead and I'll do it." That type of thing. But as I matured, thankfully, I moved away from that.
0:34:35.4 Joseph Morais: Well, a couple comments there. One, you're a startup, you guys are doing great. But I think other startups maybe, and you already figured this out, is that when you start working and your customers are gonna be enterprises, they're gonna have a certain level of expectation of uptime, of consistency. And that's where, using managed services can kind of really provide that to you, because maybe your team is still growing, and your sales is outgrowing your engineering, your SREs. That managed service really kind of helps take away that consternation around, well, is this system gonna be scalable? Do I need to worry about hiring 10 more people to scale at the growth we have? And the meme is really funny. And I realize you weren't the originator, but you're the spreader of it, which makes you just as important to the meme world. Someone once described this to me as free, like a puppy. Like I'm gonna give you a puppy, but guess what? You got to feed the puppy, you got to take it to the vet, you got to house the puppy. So like free, free could actually mean very expensive when you take everything into account.
0:35:28.0 Ethan Chan: Exactly. And you learn the term TCO, total cost of ownership. People don't realize that. People sometimes don't wanna come to terms with it, because there's also a lot of sunk cost fallacy; they're already one foot in, one foot out. I'm sure you face it at Confluent. And every startup founder in the B2B enterprise SaaS or infra space faces it too: how do you position it so that it makes sense for people to build stuff on top of your service? Yeah.
[music]
0:36:05.0 Joseph Morais: Before we let you go, we're gonna do a lightning round. Byte-sized questions, byte-sized answers. And that is B-Y-T-E. So it's like hot takes but schema backed and serialized. Are you ready?
0:36:14.8 Ethan Chan: Sure. Yeah, let's go.
0:36:16.6 Joseph Morais: All right, Ethan. What's something you hate about IT?
0:36:18.8 Ethan Chan: The word "IT"?
0:36:20.2 Joseph Morais: [laughter] Just the word, you're not the first person to give that answer, and I actually love that answer. What is the last piece of media you streamed?
0:36:28.7 Ethan Chan: The last piece of media I streamed? Probably air traffic control for New York, 'cause I'm gonna fly soon, so I'm kind of concerned about that. Yeah.
0:36:36.2 Joseph Morais: Okay, awesome. Hopefully that comforted you. [laughter] What's a hobby you enjoy that helps you think differently about working with data across a large enterprise?
0:36:46.3 Ethan Chan: Funnily enough, not quite a hobby, but I like reading a lot about fashion and enjoying fashion stuff. So how does that relate to data? I care a lot about this thing called data UX. If I design a data schema for a customer and you have to do one more left join than you need to, and you get some nulls in there, as a data scientist or a data analyst you will hate it; a little part of you kind of dies inside. So I care a lot about the final presentation of the tables and schemas we deliver, and that's something that drives me. Because if you can design a schema that has all the right information, I wouldn't even say in proper normalized form, but in a form that's genuinely useful for the use cases, people will love it. And at the end of the day, you reduce the cost of curiosity and you get people to explore more. A lot of that comes from a sense of aesthetics and design. The data just has to fit, it just has to work, yeah.
0:37:48.2 Joseph Morais: Yeah, that's good, I like it. That eye for detail that you get from the fashion industry. Can you name a book or resource that's influenced your approach to building an event-driven architecture or implementing data streaming?
0:37:58.9 Ethan Chan: This is one of my favorite books that I like to recommend as well: A Philosophy of Software Design by John Ousterhout.
0:38:05.0 Joseph Morais: Okay, great.
0:38:06.9 Ethan Chan: One of the chapters talks about deep API design versus shallow API design. In terms of influencing the data piece: sometimes customers don't want to know everything. You want to design an API or a schema that shows just the right number of fields to answer the question. Everything else, don't even tell them about it. If they really need it, you can open the hood for them. That concept, which I learned from that book many years ago, I really enjoy, and it relates to data, yeah.
0:38:42.6 Joseph Morais: No, it's great. What's your advice for a first-time chief data officer or somebody else with an equivalent impressive title?
0:38:49.2 Ethan Chan: So number one, I think I should patent this name, but my title on LinkedIn is chief data plumber.
0:38:55.3 Joseph Morais: Oh, I like that. That's really good.
0:38:57.1 Ethan Chan: So that is my title. Please do not take it; I am the chief data plumber already. But the advice, really, is that at the end of the day, what a lot of data leaders face is alignment. You can build the best data system in the world, but you exist to fit a business need, whether that's BI or data powering an application. So you really have to make sure you align on that. If you're not aligned, it doesn't matter if you had the best shift-left system, optimized, whatever; it doesn't matter. At the end of the day, are you driving the business results? That's really the core there. Yeah.
0:39:28.5 Joseph Morais: That's good. That's really good advice. Now, Ethan, any final thoughts or anything to plug?
0:39:33.3 Ethan Chan: Two parts. We are always hiring amazing engineers who are interested in data infrastructure in general. And if anyone out there ever needs blockchain data, just DM me, call me, I'm pretty responsive. But we're always hiring. What we like to say is that, unfortunately, our infrastructure budget per engineer is higher than what we pay the engineer, so that's the scale we work with. On the flip side, if you join the company, you have a lot of stuff to learn, every tool, every platform, and obviously Confluent, and you'll be using a lot of it. Yeah.
0:40:11.9 Joseph Morais: Excellent. Well, thank you so much for joining me today, Ethan. It was absolutely a pleasure discussing this with you and having a chat. So for the audience, stick around because after this, I'm gonna give you my three top takeaways in two minutes.
[music]
0:40:31.9 Joseph Morais: Wow, what a fantastic conversation with Ethan. Let's talk about those takeaways. The first one, and I'm gonna be thinking about this quite a bit, is how Allium is using data streaming to read once and deliver many times. As Ethan mentioned, the way you query a blockchain is not super performant, it's prone to failure, and it's just not ideal for somebody building a real-time delivery system. Allium built something where they can retry and reliably read the data they need, and then have it delivered many times through data streaming, a system that is specifically built for many consumers to read many times, whether through transactions, consumer groups, et cetera. A really interesting, not novel, but really powerful way of using data streaming to serve their downstream customers. Ethan also asked a question that I think is really important.
0:41:23.5 Joseph Morais: You can build it, but should you? I think this is something that everyone should think about. Whether you're a startup or an enterprise, if there are well-established providers of a technology that have great uptime, great total cost of ownership when you do the analysis, and all the extra bells and whistles that make a system enterprise-ready, like the Confluent data streaming platform, you should really consider them. Yeah, I can lift really heavy rocks, but should I? Is this where I wanna spend my time? And as it pertains to an enterprise, is this where I want my engineers spending their time? Or should they be building things that are specific to my business and my business logic, not doing undifferentiated heavy lifting? So I absolutely love that. Another thing Ethan said is that you could build a system yourself like Allium's, but all you get is the raw data. And he said, "Delivering raw data is just the start of your problem." I couldn't agree more. So again, it comes back to that idea of shifting left, of getting as close as you can to your data source, taking this uncleansed raw data and turning it into a data product, and doing that as early as you can, so those data products can be used internally at your business, or maybe externally by your customers.
0:42:37.6 Joseph Morais: So, wow. Just some really fantastic takeaways in terms of how to use data streaming, and some things you should be thinking about as you start your data streaming journey. That's it for this episode of Life Is But A Stream. Thanks again to Ethan for joining us, and thanks to you for tuning in. As always, we're brought to you by Confluent. The Confluent data streaming platform is the data advantage every organization needs to innovate today and win tomorrow. Your unified platform to stream, connect, process, and govern your data starts at confluent.io. If you'd like to connect, find me on LinkedIn. Tell a friend or coworker about us, and subscribe to the show so you never miss an episode. We'll see you next time.