Life Is But A Stream

Ep 3 - The Connective Tissue: Shift Left to Turn Data Chaos into Clarity

Episode Summary

We wrap our 3-part series on data streaming with a deep dive into data integration, exploring key topics like data governance, data quality, and system connectivity.

Episode Notes

In the final episode of our 3-part series on the basics of data streaming, we take a deep dive into data integration—covering everything from data governance to data quality.

Our guests, Mike Agnich, General Manager of Data Streaming Platform, and David Araujo, Director of Product Management at Confluent, explain why connectors are must-haves for integrating systems. 

You’ll learn:

About the Guests:

Mike Agnich is the General Manager and VP of Product for Confluent’s Data Streaming Platform (DSP). Mike manages a product portfolio that includes stream processing, connectors and integrations, governance, partnerships, and developer tooling. Over the last six years at Confluent, Mike has held various product leadership roles spanning Apache Kafka®, Confluent Cloud, and Confluent Platform, working closely with customers, partners, and R&D to drive adoption and execution of Confluent products. Prior to his work at Confluent, Mike was the founder and CEO of Terrain Data (acquired by Confluent in 2018).

David Araujo is a Director of Product Management at Confluent, focusing on data governance with products such as Schema Registry, Data Catalog, and Data Lineage. He previously held positions at companies like Amobee, Turn, WeDo Technologies Australia, and Saphety, where he worked on various aspects of data management, analytics, and infrastructure. With a background in Computer Science from the University of Évora, David brings a strong foundation of technical expertise and a track record of leadership roles in the tech industry.

Guest Highlights:

"If a ton of raw data shows up on your doorstep, it's like shipping an unlabeled CSV into a finance organization and telling them to build their annual forecast. By shifting that cleaning and structure into streaming, we remove a massive amount of toil for our organizations… Instead of punting the problem down to our analytics friends, we can solve it because we're the ones that created the data." - Mike Agnich

"We've had data contracts in Kafka long before it became a buzzword—we called them schemas… But more recently, we've evolved this concept beyond just schemas. In streaming, a data contract is an agreement between producers and consumers on both the structure (schema) and the semantics of data in motion. It serves as a governance artifact, ensuring consistency, reliability, and quality while providing a single source of truth for understanding streaming data." - David Araujo

Links & Resources:

Episode Timestamps:

*(02:00) - Mike and David’s Journey in Data Streaming
*(13:55) - Data Streaming 101: Data Integration
*(40:06) - The Playbook: Tools & Tactics for Data Integration
*(53:25) - Voices from the World of Data Streaming
*(59:33) - Quick Bytes
*(1:05:20) - Joseph’s Top 3 Takeaways

Our Sponsor:  

Your data shouldn’t be a problem to manage. It should be your superpower. The Confluent Data Streaming Platform transforms organizations with trustworthy, real-time data that seamlessly spans your entire environment and powers innovation across every use case. Create smarter, deploy faster, and maximize efficiency with a true Data Streaming Platform from the pioneers in data streaming. Learn more at confluent.io

Episode Transcription

0:00:08.1 Joseph Morais: Welcome to Life Is But A Stream, the web show for tech leaders who need real time insights. I'm Joseph Morais, technical champion and data streaming evangelist here at Confluent. My goal, helping leaders like you harness data streaming to drive instant analytics, enhance customer experiences and lead innovation. Today I'm talking to Mike Agnich and David Araujo. Mike is the GM and VP of Product Management at Confluent and David is one of our directors of product management. Our first three episodes of Life Is But a Stream explore the fundamentals of data streaming. If you haven't checked out episode one or two yet, we highly recommend you pause this episode and come back after you've listened. Trust me, you don't want to miss out.

0:00:45.1 Joseph Morais: So in episode one we talked about data streaming and then in episode two we talked about stream processing. In this episode we're talking about integration and governance. And I like to think about each of those as one of the cats from Voltron. All very powerful on their own, very intimidating, but when they come together, they form the very intimidating Voltron who kicks butt. And I'd like to think of the DSP or the data streaming platform as that, the Voltron of quality data. In this episode we'll explore how to integrate systems with data streaming, plus how connectors and direct integrations make it easy to connect with all kinds of data sources and sinks. We're also going to talk about some top tools and more. But first, a quick word from our sponsor.

0:01:25.6 Ad: Your data shouldn't be a problem to manage, it should be your superpower. The Confluent data streaming platform transforms organizations with trustworthy real time data that seamlessly spans your entire environment and powers innovation across every use case. Create smarter, deploy faster and maximize efficiency with a true data streaming platform from the pioneers in data streaming.

0:02:00.9 Joseph Morais: Welcome back. Joining me now is Mike and David from Confluent. Mike is the GM and VP of Product management and David is one of our directors of product management here at Confluent. How are you both today? 

0:02:11.2 David Araujo: Doing great. It's a nice day today.

0:02:13.5 Mike Agnich: Doing great.

0:02:13.9 Joseph Morais: Well, real quick, there's something I kind of do is I immediately kind of jump away from the questions when I start these episodes and I just want to give a little bit of praise. So I know you guys are both in product, which is a very hard thing to do, especially considering, you know, we have, you know, we have to take requirements from open source because we have a bunch of committers. We have all of our own, you know, field feedbacks. Just the amount of requirements you both have to go through that I see, it's just really impressive. So I just want to thank you both for all of your effort and building out a great cloud product and also a great on premise product. So my thanks to both of you. So let's introduce you both to the audience. Mike, we'll start with you. What do you oversee at Confluent? 

0:03:00.6 Mike Agnich: My responsibilities at Confluent include ownership of a lot of the, what I would call the non core Kafka stuff. So that means things like connectors, things like governance, where I think David is really the resident expert. I work a lot with our partners. We've got a team that focuses on that. Some developer experience work. Yeah. So I would say a lot of the growth areas for Confluent that are kind of layered on top of the streaming foundation.

0:03:30.4 Joseph Morais: Right. So just a few things that you manage. That's just a very small bucket.

[laughter]

0:03:35.6 Mike Agnich: It's a good group. The cool stuff. The cool stuff as the kids say.

0:03:40.1 Joseph Morais: Yeah. How about you, David? 

0:03:43.5 David Araujo: So my name is David Araujo and I'm also part of the product team here at Confluent and I oversee our data governance efforts for streaming data.

0:03:53.9 Joseph Morais: Excellent. We're going to dive all into exactly what that means in the following segment. So in your words, Mike, who are our customers and who aren't our customers? 

0:04:04.4 Mike Agnich: Well, I think our customers, Confluent customers are customers that I think generally have mission critical applications and organizations that have real time or fast time data requirements and they generally have reached a level of complexity internally where there's just a ton of pressure on their businesses to do things quickly and to coordinate different app teams and teams that have, maybe there's some legacy data and some modern data. They need these things to work together. They could have dozens or hundreds or thousands of applications built by different teams and these things need to communicate with each other. So that would be at the company level. And then from the persona side, I say we really like working with architects who look at technology end to end in organizations from application into distributed systems for kind of storage and reliability, all the way out to kind of analytics and business intelligence. So we work a lot with architects. I would say they are kind of one of our core personas. The second group that's really important to us are developers. So all those folks building applications, the developers, they love Kafka. They want access to data in real time.

0:05:23.2 Mike Agnich: They want to be able to build applications that scale gracefully in real time. So this kind of core developer is number two, and then number three is the operator, because once you get this stuff at scale, Kafka is a very difficult system to manage. So we built a lot of technology to make that easy, possible to observe what's going on, seamlessly upgrade or if you're using our cloud product, just don't worry about upgrades at all. So to remove the knobs and levers that don't need to be there so that operating a system at scale can just happen seamlessly. So it's the architect, it's the operator, it's the developer. And then there's this new set of personas that I think David is going to talk a lot more about. Folks like data stewards, for whom it's not enough just to operate the data: you need to know what the data is, what data is sensitive, what data can be shared more freely, where do we want to apply things like encryption and masking to data? How do we create audits of all the stuff that's flowing through every application in an organization to know who created what and when and why? How has it changed over time? So it's a complicated set of folks, but I would say that the core of our business is kind of this operational estate of data, which is really where I think all data originates; it's all created out of these applications. Everybody who cares about that is ultimately a streaming customer.

0:06:47.0 Joseph Morais: Yeah, I really like that answer because I know we have customers in almost every vertical across the globe, and so it's really kind of hard to figure out what is the common theme. And that theme is, do you really care about your operational data? And I think it would be shocking to people how many industries there are where data in real time isn't important, or at least hasn't been important, and how they're now responding to their competitors, who may be upstarts that started with event driven architecture, and what that means for them. And also, you know, focusing on the architect persona because, you know, what we do here is kind of fundamentally shifting how people build applications, especially if you've never built towards event driven applications before. So, you know, kind of...

[overlapping conversation]

0:07:30.9 Mike Agnich: If we're going to layer on that comment a little bit, I think there were like two really critical moments in the last... Well, I guess we've been at this for a while now, so probably the last 15 years maybe, but the first one when it came to real time data I think was actually kind of like the, call it the gig economy boom and specifically ride sharing and like when you think about like what happened when Uber showed up on the scene, the consumer expectation of an application went way up. You think about like what you can do when you're in an application like Uber, you have this expectation that you can not only see kind of where you are and when your ride is going to arrive, but you can actually see where all of the local cars are and you could see how they're moving and the apps are responding in real time to traffic pattern changes. So that kind of application, I don't think it was possible to build that off of a relational database. You needed a system like Kafka in order to be able to handle the volume of real time requirements, certainly at scale.

0:08:30.5 Mike Agnich: You know, if you have five people using an app, we could use whatever we want. But if you're doing it at scale, the world needed Kafka. So that was, I think, moment one. Moment two I think is happening now, which is actually, I would say this Gen AI movement because when you think about the needs of Gen AI, it's conversational in nature. It's like this conversation. I say something to you, you take it in. You need to react in real time. And in order to have a highly intelligent response to whatever I'm saying, you need to bring to bear, an agent is going to need to bring to bear all the data in the enterprise. If that's a flight booking system, it needs to be aware of inventory, it needs to be aware of delayed flights and canceled flights, and then that stuff, all that moves at real time. So to me those are these two things that have changed or are changing consumer expectations about how apps would respond. I think both are really big tailwinds for streaming.

0:09:27.9 Joseph Morais: Right. And all it takes is that one shift in perception or requirements and now everyone's like, well, why do other apps work like Uber, right? And now that real time expectation I think is implied for almost every service. So moving on, when were both of you introduced to data streaming and event driven architecture? And a bonus, have you or your past teams ever had the pleasure, like myself, of running open source Kafka? 

0:09:54.0 Mike Agnich: Yeah, David and I will have slightly different answers here, I think. So the... I think, you know, for me, I was actually running a startup. This would have been like, I don't know, seven, eight years ago. And we were doing a lot of ETL and actually our team was a small team. There was 10 of us and we were figuring out the best way to do that. We had built a system to combine lots of data sources into kind of a, at the time we were looking at elastic indexes for things like search and recommendations and we were using kind of traditional batch ingest and we were kind of looking at what was coming on the scene. And we started looking at Kafka, playing with it and saw that it had a role to play. And some of the sources we were trying to join happened to have streams available to us. So that was the first time I started to use it and kind of saw the magic of how much easier ETL could be when you adopt a streaming paradigm. And then from there kind of came into Confluent, I guess about six and a half years ago and obviously got much deeper into it at that point. David? 

0:10:55.0 David Araujo: Yeah, my experience is a little bit different. I was introduced to data streaming in my previous company that ran an advertising platform at Internet scale. So we were running open source Kafka and it was being used in what we call the real time bidding system that basically delivers ads in real time to users on the Internet. And one of the main use cases was to use Kafka to replicate data, meaning users' data and ads data across regions using a tool called MirrorMaker that many, well, know about. So it basically replicates topic data across two Kafka clusters. And at the time the goal was to comply with what we call ad campaign strategy in terms of how many ads and how many times and at which time a user should see an ad on the Internet. And since users could move across different geo regions, we needed all these regions to have the most up to date counts of the ads served to a particular user so that the ad server in the end could control when to show and not to show ads to users.

0:12:10.4 David Araujo: And to have an idea of the scale, at the time we were serving around 7 million queries per second on the ads platform. And Kafka was really critical for everything to work well. Yeah, and it's funny, at the time I remember we had a DevOps guy and he was the guy responsible for Kafka, open source Kafka. And this guy was always, I remember this guy was always stressed, he was always on the edge, right? So this guy was like 30 years old, and I used to hang out with him a lot because we had hobbies in common. This guy was like 30 years old. He seemed to be like 50, right? And since Kafka was so mission critical, right, because any hiccups there could result in overspending advertisers' budgets on impressions on ads. Right? And so it was very, very critical.

0:13:08.2 Joseph Morais: You know, I'm laughing because I was that DevOps guy. So my first exposure to Kafka was at an ad tech startup and our stack was basically Samza and Kafka. And I was 35 at the time and clearly it didn't stress me out as much, but it did keep me up at night because I was the front line. And I would have to keep the Kafka clusters running, or rather the Kafka clusters we would bounce between, blue and green, and I remember that pain. So, you know, being able to work with something like Confluent cloud and how easy it is, I would never go back. I'm obviously biased though. So that's funny, I didn't realize you started with an ad tech. That's a fascinating topic we could talk about for hours. But let's move on to our first segment today. So now that we've gotten to know you a bit better, Mike, let's jump into how to integrate systems with data streaming and talk about how connectors make it easy to connect with all kinds of data sources and sinks. So can you start by giving us a quick overview of what connectors are and why they're so important for data integration as it pertains to data streaming? 

0:14:12.5 Mike Agnich: Great. Sure. So I think of connectors as really opinionated clients. And so they are built many times by the community. There's a massive community of open source connectors that are out there and then Confluent has built a big ecosystem as well. It's been a big investment for us. And the reason we've done this is, I talked a little bit about how I got into Kafka. It was really an integration. It was really integrating into other sources. And in that product we were building, there was a ton of this integration work. And the integration work is, it's hard for a bunch of reasons. It's difficult because you not only have to understand your system, but you've got to also understand external systems if you want to build really good connectors. And usually the developers being asked to do this integration, they usually don't have a deep background in integration. It's just kind of something they're trying to do to get their application built. And so the integration ends up actually being a tremendous amount of the work of getting an end to end thing working. And if you do it yourself, it slows you down.

0:15:16.5 Mike Agnich: And not only does it slow you down in kind of getting your product to market, but you're then also taking on actually a tremendous amount of maintenance, not only of the thing that you've coded, but actually if any changes happen to these external systems, you're also taking on the debt of having to change your code to adjust to what these external systems do. So it has been a big benefit to, I would say streaming users in general that Connect exists, and that there's this ecosystem that gets maintained for them and operated for them at scale. And so I think obviously Kafka is a very open protocol. There's nothing stopping any of us from going and building a connector on our own and running it ourselves if we want. Despite that, Connect has been a very, very popular product at Confluent. And it's because of this stuff. It's not the work that, I would say, data streaming engineers want to do; it's not how they want to spend their time. They want to spend their time building more real time applications, not spending 30, 40, 50, 60% of their time building integrations and then maintaining those integrations over time.

0:16:27.0 Mike Agnich: I would say then, if I kind of finish the thought a little bit, there are really, and maybe this is a slightly different tangent, but I see there are like three or four different kinds of connectors and they're actually quite different. There's connectors to legacy systems, things like MQ and maybe old databases and mainframes. And the reason why we need those is anytime the goal of our application is to basically free data that's very hard to access, or maybe it's to get data from a legacy system into a modern system. Modern developers frequently don't have any interest in going and learning anything about mainframes. So if you're going to connect to a mainframe, you really don't want to build that connector yourself. Trust me, you don't want to build a mainframe connector. They've been there for a long time. It's really critical technology, it's really important.

0:17:15.9 Mike Agnich: So I would say that's pattern one. Pattern two, I would actually call it CDC specifically. CDC is used slightly differently and CDC connectors, I think it's like one of the really great ways to move into an event based architecture. CDC to me, which is like looking at change logs in databases, I think it's originating events. To me it's less about... Sometimes it's used for database replication. But I would say actually the more interesting use case with CDC is usually about what I would call event origination. You're looking at a table in let's say an Oracle database. And you're effectively looking at certain changes that are important business events and you want to make those available without anybody having any idea that this came from an Oracle thing. So CDC turns out to be a great way to do that.

0:18:05.6 Mike Agnich: And then I would say the next thing is, I would say SaaS connectivity, which is things into Salesforce and ServiceNow or whatever else. There's just critical business events that are coming out of these all the time. Unless you want to go and really learn those systems and code up a connector yourself, that's number three. And then four, I would say is everything on the sink side. So connectors can pull data in, those would be sources, they can push data out, those would be sinks. And so the sinks are very different in their nature. If you get really good at building sources, you're actually going to need a different skill set to go build really good sinks. The systems are different, they have different expectations. You want to architect them in different ways so that you have good failover characteristics. It's just very different. So those are kind of the four types, we talk about the Connect ecosystem, but actually they're very different to engineer and design. And so I think the benefit of the Connect framework and the ecosystem is you don't have to do that stuff. You can lean on open source or ideally lean on Confluent. We have the fully managed stuff too, and then you don't have to worry about that stuff at all.
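To make the connector idea concrete, here is a minimal sketch of standing up a source connector by posting a configuration to the Kafka Connect REST API. The Connect worker URL, database coordinates, and table and column names are placeholders, and exact property names vary by connector, so treat this as an illustration rather than a recipe; a true CDC setup (for example, Debezium) would use a different connector class and different properties.

```python
# Minimal sketch: registering a JDBC source connector with the Kafka Connect
# REST API. Endpoints, credentials, and table/column names are assumed
# placeholders; check the docs of the specific connector you deploy.
import json
import requests

CONNECT_URL = "http://localhost:8083"  # assumed self-managed Connect worker

connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "1",
        "connection.url": "jdbc:mysql://db.internal:3306/shop",
        "connection.user": "connect",
        "connection.password": "********",
        # Poll for new rows by watching an auto-incrementing key, which turns
        # table inserts into events ("event origination").
        "mode": "incrementing",
        "incrementing.column.name": "order_id",
        "topic.prefix": "shop.",  # rows from table "orders" land in topic "shop.orders"
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```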

0:19:14.5 Joseph Morais: Well, two responses or two takeaways I think that are important is, one, it allows you to treat, you know, your systems, whether they're legacy or they might be modern, it might be some brand new ISV SaaS solution that came up in the last year that we happen to build a connector for, it allows you to treat all of them like producers and consumers, which is great, especially from the lens of somebody who's focused on, you know, data streaming, the data streaming engineers. But also one that I think is really important, when you think about like CDC and that could be any flavor of database, could be Oracle, could be MySQL, could be Postgres, is it allows you to integrate with data streaming without changing what is already existing. Right? There are things that are already consuming from that Oracle database. They're already inserting records, they're doing queries, and you can leave that alone. Set up a CDC connector and now you have those events and you could build brand new things on top of that and then maybe you migrate, maybe you don't, maybe you have parallel systems, but it really kind of, it gives you a lot of freedom to innovate without breaking anything, which is really incredible in my opinion. So I do think you touched on this, but how do connectors fit into the broader data streaming ecosystem? 

0:20:19.6 Mike Agnich: Yeah, so I think it really breaks down into those kind of different types of connectors. Right? So it's not... The connectors aren't just one thing. It's actually multiple different things for different purposes. You see them used, at least the use cases that excite me the most are event origination. So you've got a system that is probably speaking batch and connectors and CDC specifically end up being a really good way to kind of create events by looking at these change logs. But yeah, ultimately it just speeds up the integration into the overall ecosystem. Whether you are, you know, trying to connect something from a mainframe into a modern analytics system that might be, you know, Snowflake or Databricks or BigQuery or what have you, it's just we've got hundreds of these connectors and I would say the open source ecosystem, the number is in the thousands. So there's just this huge catalog of these connectors that just won't slow you down from achieving your task. So whatever you're trying to connect to, there's a solution there. We've also invested a lot in what I would call these generic connectors. So things like HTTP, JDBC, working on things like webhooks. And so the nice thing there is, you know, we're never going to have a connector for every system in the world. There's too many systems. They're being created all the time.

0:21:42.4 Joseph Morais: Right.

0:21:42.8 Mike Agnich: But with these generic connectors, which are some of our most popular ones, generally you can solve the problem you need to solve. Most things have rest endpoints, most things have HTTP connectivity. They've got something. And so it's like, we're invested in both, like, where we see demand pile up, we'll go and like build a connector that's very bespoke to that pattern and bespoke to that system. But we're also investing in these more generic connectors that you can use. If you've got something that only your organization is using, maybe it's even something you coded yourself and you want a connector to run against that, we can do that kind of stuff too.

0:22:18.4 Joseph Morais: Great. And as you were talking, there was something else I already thought of. Connectors in combination with some of our replication technologies, you know, specifically calling out something like cluster linking really helps our customers in their journey to modernization or their journey to the cloud. Because they ask themselves, well, how do I bridge data from the mainframe to Snowflake? And the answer is through a set of connectors and through replication technologies that becomes a, you know, I don't want to call it trivial, but a much easier pipeline than you would have to do, especially if you want...

0:22:47.0 Mike Agnich: I mean, you can imagine a very, what I would say a very complicated mission critical type of project might be, let's say to migrate data from some mission critical on premise system. Say you're running a bunch of MySQL databases on premise. They're their own things and you want to get those into the cloud on something. With connectors and cluster linking, you do that without writing a single line of code, right? You can use connectors to get data, let's say MySQL or Postgres or whatever, into Kafka, into Confluent on premise. Cluster linking can then get the data from on premise into cloud and then you can connect out to whatever the modern system is in the cloud. That would be one pattern you could do. You wouldn't necessarily have to cluster link, but it's usually a very common pattern because you really want to make this move to cloud and it's a very popular pattern. And honestly with connectors and cluster linking, you do it without writing any code. It can be done very, very quickly.

0:23:45.1 Joseph Morais: Yeah, so this perfectly kind of segues into my next question because on this show we're all about use cases. So can you share some additional examples of how organizations are using connectors to streamline their data flows? 

0:23:57.8 Mike Agnich: Yeah, so I've hit on a couple of them already, but to me I would say event origination. That's one. So you are an organization that has been all batch and you are trying to move into events. You want your entire business to become event centric. Connectors are a great way to do that. It's a great, like, first step or even a later step. It's a great use case. We also see migration use cases, meaning I want to take a, or you might even say like re-materialization. We definitely see this where I've got an on premise database and I want to modernize onto something in the cloud. Connectors can play a really critical role there. The third thing I would say is event sourcing out of the really long tail of systems that exist in the world. Again, Salesforce, ServiceNow, Jira, go down the entire list. Anything you might want to... Twitter. Like anything you might want to pull data from and you don't want to go hand code bespoke integrations, the connectors are generally there for you. And so I would say those are three of my favorites. And then I would say there's also the, I would say the legacy mainframe stuff, like being able to free data from the legacy, I guess would be the fourth I'd mention.

0:25:15.2 Joseph Morais: That's great. And for the audience, I mean, the number of connectors is staggering. It's not just, you know, big data systems like databases. It could be something as simple as fetching a file from SFTP or even ingesting a file locally. Like there's connectors for almost everything. It really is impressive. But with that said, are connectors the only way to integrate systems with data streaming? 

0:25:38.7 Mike Agnich: No. And I would say there's actually lots of... Like I said, this is the advantage of, Kafka is an open protocol. I think connectors are an opinionated way to do it. It's also a way where we provide a fully managed solution. Right? But there's, at the end of the day, because Kafka is an open protocol, there's a lot of, I would say custom clients built in the world, especially if your application is talking to applications, that's very, very common. So there's a lot of Java code that's doing that kind of connectivity. So there's some of that stuff. But then there's also the direction we're taking this platform, I would say, deeper and deeper into the world of native integration. So we really like open standards. So we've talked about Kafka as an open standard. We see open standards emerging in the analytics world that have never existed before. Specifically things like Iceberg and Databricks is making Delta more and more open now too.

0:26:35.4 Mike Agnich: And so those worlds used to be pretty closed, but now you've got some open standards that we can really build into. So what you're seeing from Confluent now are more and more products that won't even need connectors. If what you're trying to do is get your data into a modern analytics system, we are going to integrate at a much deeper level, which actually is a great segue into some of the stuff that David manages, where we're managing not only the data movement but the metadata syncing so that you don't have to do any of the work to... Take today with connectors, what frequently happens is data will come out of streaming and it might land in the analytics estate, but then the structure, the schema and the metadata has to be rebuilt in the analytics estate before it becomes useful to the business.

0:27:22.6 Mike Agnich: Because we're unifying these worlds, I think what's coming from Confluent really imminently is the removal of the need to do that. Every single topic you have in Confluent will just be available with an interface that is what is natively expected from data engineers, data scientists and the tools they use, Snowflake, BigQuery, Databricks, et cetera. It's just going to work and you won't have to do any of the schema mapping or normalization or de-normalization with the combination of Flink and this product called Table Flow that we've talked about. And it is coming soon. You just don't have to do any of that. It just gets easier and easier and easier and you won't even need connectors. So I would say when I think about connectors will always have a really critical role because it addresses the really long tail of needs of systems that you need to connect to. But when we can find these really big patterns, and one of those patterns I would say is getting data into an open table format for analytics, we can go much deeper and we can actually do something a lot more elegant, scalable, low cost than what you're going to be able to achieve with the connector.

0:28:40.9 Joseph Morais: Right. You want to make your data easy, whether it's operational or analytical. And we're working to make that simple from the start, which is fantastic. And just again for the audience, an example of a native integration. So again, my background here at Confluent is for a very long time I managed our partnership with AWS. So one of the first native integrations I could think of is with AWS Lambda, where you can set up an event source mapping and then they spin up a consumer and you can start consuming right from your data streams and kick off Lambda functions, which is pretty cool. So in the last episode we talked to Anna and Abhishek about stream processing, the world of stream processing, kind of building up the DSP. So let's talk about data integration and stream processing together. Are there any particular patterns you can share that feature both technologies? 

0:29:26.3 Mike Agnich: Well, I mean, we've already been talking about what I would call, you can call it modern ETL if you want, but it's, if you are... One of the very common things we see is getting data that is sourced from all these applications that are built, which I would say is like that streaming home turf, the operational domain of data. That data needs to get out to analytics systems and it needs to be done in a way where those analytic systems can operate on it instantly. So this work we're doing to map the schema and make it usable is critical, but you also want the data to land in a clean state. You don't want it to land dirty because then you're really punting a problem downstream to people who are well intended, but they don't have the tools to understand the data. If a ton of raw data shows up on your doorstep, it's kind of like shipping an unlabeled CSV into a finance organization and telling them to build their annual forecast. So they have to like dig through the data, figure out what's what, reconstruct it, usually get things wrong just because it's nearly impossible to do.

0:30:32.0 Mike Agnich: But by shifting that cleaning and that structure, and the cleaning and structure I would say is processing and governance into streaming, we remove a massive amount of toil from our organizations. And so then the data lands clean, it lands filtered, it lands normalized or denormalized, it lands joined, it lands with context because it's classified, it's got tags, it's discoverable. And instead of punting the problem down to our analytics friends, we can solve it because we're the ones that created the data. David talks a lot about the value of source aligning your governance. You want to do your governance, your schema, your cleansing as close to the source of data as possible. That's just by far the most efficient way to do it. And so I would say that's a very common pattern. And Flink is a very critical part and stream processing is a critical part of making this work. It's the engine to do the filtering, the joining, frequently the mapping. It can be used for classification, it can be used to build very complicated business logic or really simple SQL calls that are very high value. So it's a very powerful engine and a critical... That's why we kind of put this DSP stuff, this data streaming platform, it all fits together. It's processing, it's governance, it's connectors all in one thing. And that's how Confluent sees the world.
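To ground that shift-left pattern, here is a minimal PyFlink sketch, assuming self-managed Flink with the Kafka SQL connector available; topic names, fields, and broker addresses are placeholders, and a managed Flink service would supply the table definitions through its own catalog. It filters and reshapes events while they are still in the stream so the data lands clean downstream.

```python
# Minimal "shift left" sketch with Flink SQL via PyFlink: clean and reshape
# events in motion. Requires the Kafka SQL connector jar on the classpath;
# names and options below are illustrative placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw orders exactly as the application produced them.
t_env.execute_sql("""
    CREATE TABLE orders_raw (
        order_id STRING,
        amount   DOUBLE,
        currency STRING,
        ts       TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'shop.orders',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a cleaned topic that analytics systems can consume directly.
t_env.execute_sql("""
    CREATE TABLE orders_clean (
        order_id   STRING,
        amount_usd DOUBLE,
        ts         TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'shop.orders.clean',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")

# The "T" in ETL, done in-stream: drop malformed rows, keep one currency.
t_env.execute_sql("""
    INSERT INTO orders_clean
    SELECT order_id, amount AS amount_usd, ts
    FROM orders_raw
    WHERE amount IS NOT NULL AND currency = 'USD'
""")
```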

0:32:05.0 Joseph Morais: Fantastic. And, you know, I think one thing to really call out is, you know, you mentioned ETL, right? And that T is for transformation. And I think it's really important. If you're already doing the extraction, why not do the transformation before you do the loading? ELT never quite made sense to me. But the key is these streaming pipelines can give you real time ETL and kind of save you from handling a lot of that pain downstream. You know, processing closer to the source is definitely better. I agree with you.

0:32:32.8 Mike Agnich: I would say ELT became popular in the era... In a sense it became popular in the ZIRP era. Why did it become popular in the ZIRP era? I think a lot of customers really wanted to get onto data lakes and data warehouses quickly. And I love data lakes and data warehouses. But the fastest way to get live on a data lake or data warehouse is just to dump all of the raw data there and then worry about it later. And I think a lot of customers did that. And to me that would be the ELT pattern. Problem is like it is actually extremely expensive and as I mentioned before, it's not source aligned. So you actually lose the meaning of the data frequently. And the amount of duplication and 10xing of data work that has to happen because you punt, I would say, uncleaned, non-understood data into that estate is I think pretty staggeringly high. A lot of customers are struggling with it. So I think we have a really great solution for shifting that processing, shifting that governance, making it more source aligned, shifting it left into streaming. And I would say this has been a very popular thing to do as customers are reconciling the investments they've made in the data lake and the data warehouse to get really efficient. And I think it's been something that we've been really successful with for our customers.

0:33:55.2 Joseph Morais: Yeah, because, you know, it's not just the cost of the bill of your data lake or warehouse provider, it's also the time to knowledge. Then there's a cost to that because if you have that clean data entering your analytics system, now you can start to do those queries right away instead of waiting for rounds of processing. You've already done that and you can also of course use those great tools.

0:34:14.4 Mike Agnich: It's the time to knowledge and it's the removal, I would say, of the break fix of pipelines. So there's, you've just got these incredibly complicated multi step batch pipelines and when any one piece in that chain breaks, the whole thing breaks and it creates a science project almost every night for organizations to go figure out where it broke, fix it, regenerate it and then also regenerate the dependent things all the way down the chain. So yeah, I mean, sure, saving money is great, but actually I think the primary benefits here to me, to your point, it's making the data faster, fresher and unified so that everybody's speaking the same language. That would be one, and two is like removing these tasks that just nobody wants to do. And it's pagers going off at 2:00 AM, 4:00 AM, it's dashboards not showing up on our CFO's doorstep at 8:00 AM when she expects it to be there. And so those are the big benefits. There happens to be an ancillary benefit of if you shift this stuff left, clean it early, it ends up saving a tremendous amount of money as well.

0:35:15.1 Joseph Morais: I'm smiling because I'm having flashbacks of having to reprocess hours of data because something was wrong in some build and running that manually and that is not something that anyone wants to do whether it's early in the morning or in the afternoon. So you already touched on it, Mike, but what does the future of data streaming integration look like? I heard it might be getting a little icy.

0:35:35.8 Mike Agnich: Yeah. So I think, I mean, Iceberg is a big investment that we've chosen to make, going back about two years. It was really actually exciting for those of us working in streaming because there wasn't an open format in analytics and that meant the best we could do was build really good connectors, which we did, and they have been very popular. And users have been using those connectors for a long time to integrate into those systems, and actually to shift processing and governance left, for years already. But with an open standard emerging, there's a lot more we can do. What we mean by that is not only move the data but actually integrate the metadata. And so we have a product we've already announced called Table Flow which is coming soon. And Table Flow, the way to think of it is, it's a very deep integration between Kafka, Flink and Iceberg as like, think of it as the three legged stool where you have one data artifact but you have all those interfaces. You've got the Kafka API for your developers, you've got the Flink API, SQL, Java, Python for those that want to work with data while it's moving.

0:36:45.2 Mike Agnich: All of those interfaces are now available and now with the Iceberg interface, what that means is you've got the preferred interface of the data engineer and the data scientist and you've even got a format which integrates so natively well into the downstream systems like Snowflake, Databricks, BigQuery, Fabric, go down the list, because everybody's adopting Iceberg, where those users won't even know they're using streaming data. It's just going to show up to them as a table and so they won't even know. They don't have to learn anything about Kafka. All of a sudden they're just going to get great real time structured data showing up in their interface of choice with the metadata integrated automatically for them. So the metadata means things like security controls, it means things like tags, it means things like schema so you don't have to do this additional mapping and all of this syncing between the metadata systems, which I think has been the primary pain point. And we've been able to move data for a long time. Moving metadata in a really good consistent way has been much, much harder and hasn't been there. And I think now that's really what we're delivering.
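For a sense of what that looks like from the analytics side, here is a minimal PyIceberg sketch of reading a topic that has been materialized as an Iceberg table. The catalog endpoint, credentials, and table name are placeholders, and the specifics of how the catalog is exposed are not shown here.

```python
# Minimal sketch: a data engineer reads streaming data as an Iceberg table,
# with no Kafka client or broker knowledge required. Catalog URI, credentials,
# and the table identifier are assumed placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "analytics",
    **{
        "uri": "https://iceberg-rest-catalog.example.com",  # assumed REST catalog endpoint
        "token": "<access-token>",
    },
)

orders = catalog.load_table("shop.orders")   # the topic, seen simply as a table
batch = orders.scan().to_arrow()             # Arrow table, ready for pandas, DuckDB, etc.
print(batch.num_rows, batch.schema)
```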

0:37:55.8 Joseph Morais: Yeah, a lot of the analytics folks, they don't want to learn data streaming and that's not their core competency. What they want is they want to access the data and if we can make that easier for them, you know, the more wonderful things that they'll be able to glean from that data.

0:38:08.6 Mike Agnich: Yeah, I'd say over the years we tried very hard to get, you know, data analysts to come and like learn streaming. I think now with this, I totally agree with you, we can kind of move a different direction, say, now just keep using the stuff you're using, don't worry about it.

0:38:21.8 Joseph Morais: Right. We'll meet you there.

0:38:23.7 Mike Agnich: The platform teams will keep driving the streaming stuff. It's just going to show up to you in the format that you're already used to using.

0:38:29.4 Joseph Morais: So one more question for you, Mike. Should data streaming be used exclusively for passing data between sources and destinations? 

0:38:36.8 Mike Agnich: No. I would say the original use case of data streaming is, you know, apps talking to apps and sources of truth talking to apps. So I would say that that is, and I think that's misunderstood frequently. Like this integration piece has become a very highly visible use case for streaming. And it's great, but actually the home turf of streaming is actually about aligning all of our operational systems. Sources of truth and applications talking to one another is the reason why Kafka became popular. And that is still, I would say, the core seminal use case for streaming. And I would say that's a lot of Java code doing that, and it's stuff that we built, but connectors don't really play in the app to app communication pattern. That's a bunch of other technologies that we have helped build and customers are using today.

0:39:38.6 Joseph Morais: Yeah, I don't want to discourage any of our audience. If you're trying to use data streaming and connectors to integrate some source with another destination, and that's exclusively your first use case, that's fine. But really think about that data that you've exposed into your data streams, what it could be used for, how you could, you know, pass it between business units and build microservices, build operational systems. You know, if you're building a pipeline, that's great. But don't forget about the producer consumer model.
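As a reminder of what that producer/consumer model looks like in code, here is a bare-bones sketch with the confluent-kafka Python client; the broker address, topic, and group id are placeholders.

```python
# Bare-bones producer/consumer sketch: streaming as app-to-app communication,
# not just a pipeline between a source and a destination. Broker, topic, and
# group id are placeholders.
from confluent_kafka import Producer, Consumer

conf = {"bootstrap.servers": "broker:9092"}

# One service publishes a business event...
producer = Producer(conf)
producer.produce("orders", key="order-42", value='{"order_id": "42", "amount": 19.99}')
producer.flush()

# ...and any number of other services subscribe to it independently.
consumer = Consumer({**conf, "group.id": "billing-service", "auto.offset.reset": "earliest"})
consumer.subscribe(["orders"])
try:
    msg = consumer.poll(timeout=5.0)
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
finally:
    consumer.close()
```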

[music]

0:40:13.7 Joseph Morais: All right, thank you so much. So now in our next segment, the Playbook, where we have our guests dish out winning strategies for getting old, tired, unmoving data into motion. David, for this segment, we're going to focus on data governance. So in simple terms, what is data governance? 

0:40:30.8 David Araujo: All right, great question. In simple terms, I'll try. All right, so governance is really this critical aspect of managing and using data effectively within the organization, right? It's establishing these processes, these policies, these guidelines to ensure kind of like data integrity, quality of data, usability, even security, and doing all of this while maximizing access to data across the org, right? That is very important. And while data governance relies on these processes and the people, I've seen over and over again that successful data governance programs require very strong data governance tooling, right? The technology to assist the people and the processes. So such tools, they enable things like data discovery, data cataloging, data classification, managing data quality, data security, access controls, automating all these governance processes. So in more simple terms, maybe I could say that it's making sure that data is well organized, trustworthy, and then it's following the legal and the ethical guidelines. Right? And another way that I like to put this is that governance is kind of like this balancing act. And let me explain what I mean by that. So for companies to get the most from their data, that data needs to be accessible, right? It needs to be discoverable, it needs to be trusted.

0:42:08.5 David Araujo: On the other hand, it also needs to be used in compliance with things like data regulations and protected against misuse. And talking about data regulations, things like GDPR, CCPA, and a lot of other regulations that are popping up across the world, they really changed how companies have to treat their customers' data. And really, data governance became a very important topic for the majority of these companies. So great data governance systems have to combine all these aspects that I talked about by making it easy to discover and understand the data, use the metadata to demonstrate things like lineage and reliability, but also ensure that the policies and the best practices are being followed. So yeah, so that's basically the simple terms of data governance for me. So in conclusion, I think it's very, very important and I would say it's a must have these days for data companies. And these days every company is a data company, right? So by investing in good data governance practices, companies really benefit from increased productivity, data democratization, reduced costs, reduced risks and obviously increased regulatory compliance. That is very important.

0:43:31.2 Joseph Morais: Something you touched on, David, is lineage. And lineage allows us to kind of visualize data flows and at scale this becomes really important, right, because, you know, we have some customers that could have, excuse me, dozens or hundreds of topics, you know, many Flink statements, and how do you kind of trace all of that? This stream merging from that source, this transform being applied, lineage allows us to visualize all of that. It's really impressive.

0:43:56.1 David Araujo: Yeah.

0:43:57.2 Joseph Morais: So, you know, I think you touched on why data governance is important. But when you're working with our customers, who is it most important to? Is it to compliance people? Is it to architects, is it to directors of data strategy? Who is asking for governance? 

0:44:15.4 David Araujo: I think it cuts across different personas, to be honest. I've talked with obviously many, many companies across the world about data governance in streaming and from architects that want to open the platform, right, want to deliver a self service platform to their developers. And there's obviously data stewards that have to be in control of how data is being used across the org and tech executives, right? Like these data regulations, they're no joke, right? So companies have to be on top of it. So I would say across the board, but one thing that has also surprised me is that developers are also interested in and asking about data governance. Okay? They want to make sure that the data products that they're building, right, they comply with the norms of the company and that they're doing the right thing. So it really cuts across, but I would say architects, data stewards, developers, tech executives, it just goes everywhere.

0:45:17.6 Joseph Morais: Makes a lot of sense. So how does data governance integrate with our data streaming and stream processing services? 

0:45:24.7 David Araujo: Yeah, so Confluent's data governance solution, we call it Stream Governance, and it's really foundational for our data streaming platform. It's a layer that is basically cutting across all the products and services that we provide. So I'll give you an example. The Confluent Schema Registry, obviously a very popular tool in our governance portfolio. It's tightly integrated across everything. Right? So Connect, which we've been talking about, and Flink, just to name a few. Right? And this is an area that we at Confluent, we take it very seriously. We are making sure that it's natively integrated with everything that we've built in the past and everything that we are building towards the future. Right? So again, very foundational, tightly integrated with everything that we do here. So in some sense, it's kind of like the backbone of this DSP, this data streaming platform, since it really enables customers to trust, scale and innovate with their streaming data. Without governance they get blocked. Right? So really the governance tooling that we've been working on, it's been about this idea of opening the platform to everyone across the company.
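As one concrete illustration of that client-side integration, here is a minimal sketch of a producer that serializes through Schema Registry with the confluent-kafka Python client, so every record is checked against a registered schema before it reaches the topic. The URLs, topic name, and schema are placeholders.

```python
# Minimal sketch: a producer wired to Schema Registry. Serialization fails fast
# if the event does not match the registered schema, keeping malformed data out
# of the stream at the source. URLs, topic, and schema are placeholders.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"}
  ]
}
"""

sr = SchemaRegistryClient({"url": "https://schema-registry.example.com"})
serialize = AvroSerializer(sr, schema_str)

producer = Producer({"bootstrap.servers": "broker:9092"})
event = {"user_id": "u-123", "url": "/checkout"}

producer.produce(
    topic="pageviews",
    value=serialize(event, SerializationContext("pageviews", MessageField.VALUE)),
)
producer.flush()
```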

0:46:39.9 Joseph Morais: Yeah. And I think there's like a theme there, right? The fact that our data governance is directly and easily integrated with our data streams and the stream processing. Because doing that on your own is not trivial, right? Because you first have to stand up something like Schema Registry, you have to make sure that, you know, your Kafka clients and your producers, consumers are using it. And if you're using something like Flink and you're hosting that yourself, you have to make sure Flink is using it. And all of those pieces are just opportunities for failure. But if you have a product that has all of that and you can't not implement governance, you can't not use Schema Registry, it saves you from a lot of those pitfalls that I think many have gone through trying to build this themselves. So there's a term we use pretty often around here, data contracts. How do they relate to governance and what are they? 

0:47:29.1 David Araujo: So data contracts, they're actually very central to our data governance vision here at Confluent. And the funny thing is that we had data contracts in Kafka for a long time, long before data contracts were a buzzword. We call them schemas, right? And they're basically the implementation of a data contract for streaming. Now, more recently, we've been investing in this area and we've been evolving our concept of a data contract to be more than just the schema. So the way that I like to describe data contracts in streaming, in modern data streaming, is that it's this agreement between the data producer and the consumers on the structure. So the schema as well as the semantics of the data that is in motion. Right? So this data contract, basically it becomes this governance artifact that governs the way that the data is exchanged on the data streaming platform, what is allowed in, in what shape, with which policies, et cetera, et cetera. Right? So these data contracts are great since they really help with the consistency of the data, the reliability, the quality of the data, and they provide this single source of truth for understanding the data that is in motion.

0:48:52.6 Joseph Morais: Yeah. And just to kind of put it in a different way, a data contract ensures that what the person or the thing producing the data sends matches what the thing consuming the data expects. Because if those don't match, that could create all types of havoc and of course reduce your data quality. So, speaking of data quality, does governance help ensure only quality data ends up in my streams? And if so, how? 

0:49:19.9 David Araujo: Yeah, so a key aspect of data governance is data quality, and the data contracts in particular, they really have this power to express the data quality expectations for the streams. Right? And then you combine that with enforcements on the producer, on the platform and on the consumer, and you're basically preventing bad quality data from ending up in your streams. I'll give you an example. Data contracts could define that a social security number needs to be a string and it needs to match a certain regular expression for it to be valid. And additionally, you can even go further and say, look, the social security number is also PII, and so it should be encrypted. Right? So yes, data quality is very important. And data contracts, they become a very big part of our data quality story.
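Here is a minimal sketch of that idea as a data contract expressed with JSON Schema: the SSN field must match a nine-digit pattern, and the contract is registered in Schema Registry so the expectation is explicit for every producer and consumer. The URL, subject name, and field names are placeholders, and Confluent's richer data-contract features such as tags and encryption rules would be layered on top and are not shown here.

```python
# Minimal sketch: a data contract as a JSON Schema with a quality constraint
# (SSN must be exactly nine digits), registered in Schema Registry. URL,
# subject, and field names are placeholders; tagging the field as PII and
# encrypting it would be configured on top of this and is not shown.
import json
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

customer_contract = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Customer",
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "ssn": {
            "type": "string",
            "pattern": "^[0-9]{9}$",  # exactly nine digits, never alphanumeric
        },
    },
    "required": ["customer_id", "ssn"],
    "additionalProperties": False,
}

sr = SchemaRegistryClient({"url": "https://schema-registry.example.com"})
schema_id = sr.register_schema(
    subject_name="customers-value",
    schema=Schema(json.dumps(customer_contract), schema_type="JSON"),
)
print(f"registered contract under id {schema_id}")
```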

0:50:19.8 Joseph Morais: Yeah, social security numbers are a really good one. So for anyone who's not from the US, social security numbers are always nine digits. They're never alphanumeric, they're never less, they're never more. So that's a perfect one that you can say, hey, if I see anything in this field, not only tag it PII, if it's less than nine digits, more than nine digits, alphanumeric, I'm going to flag that because I know that cannot be right. And, you know, this kind of goes back to what Mike was talking about with, you know, ETL. If the idea is that all your data is flowing from your operational estate, which is your data streams, why would you not want to ensure, before it ever ingresses into your analytics pipeline, that that data is of the highest quality? 

0:51:00.2 David Araujo: Yeah, yeah. And this is a pattern that we see a lot in data engineering. Right? And Mike talked about this extensively, but we see this pattern a lot of data consumers being tasked with handling data inconsistencies, incompatible changes, and expensive and complicated transformations in order to be able to process the data effectively. Right? And in particular for analytics. And this problem really has led to this effort to shift these responsibilities to the source of the data. Right? The data producer. This is normally referred to as shift left. Right? And this effort of shifting left has led to an emphasis on the data contract, because the data contract, it's owned by the data producer. Right? So it should be source aligned. So quality, contracts, governance, they're all very important and all tightly coupled together.

0:51:56.4 Joseph Morais: Yeah, I couldn't agree with you more. So one more question on this topic, David. What does the future of data governance look like? 

0:52:06.2 David Araujo: Yeah, I have to bring up AI, if that's okay.

[laughter]

0:52:09.4 Joseph Morais: That's part of it for sure.

0:52:11.5 David Araujo: I think that the future of data governance is very tied to AI. Right? Today a lot of data governance requires and relies on these manual processes. And I think there's a huge opportunity here to really leverage AI to help humans with data governance activities. So, for example, use AI to make sure that sensitive customer data is being protected and in compliance with data regulations, for example. Or go even one step further, and you can think about AI governance helping ensure that the broader AI is developed and deployed responsibly. It's balancing innovation with ethical considerations, for example. So yeah, I think the future of data governance really relies on AI.

0:52:58.9 Joseph Morais: Fantastic. So that was really great information from Mike and David. I thought that was fun, but let's get into some more fun with our next segment. Next we're going to watch a quick clip from a real-world user of data governance, and I'd like to get both of your reactions. In this clip we'll be hearing from booking.com and how they use data streaming to put travel decisions in the customer's hands.

[music]

[video playback]

0:53:33.1 Maxim Foursa: The mission of booking.com is to make it easy for people to experience the world, and Confluent is making it easy for us to deploy data streaming. Hello, I'm Maxim Foursa. I'm leading the application data services department at booking.com. This is where we have all operational databases, data streaming services and application services. Data streaming plays a very important role in booking.com. Booking.com is using data streaming in multiple applications. The key reason we use data streaming is to minimize time to insight. It enables us to collect relevant data and analyze it and get the insight in minimal time. We use it in the context of security to understand fraud cases. We use it to power our experimentation platform to collect all events and decide what functionality we want to deploy for our customers.

0:54:23.6 Maxim Foursa: We use it also in the order context to understand all the changes that customers make on our platform. In general, data streaming allows us to be more efficient in software engineering. It allows us to have less custom code, build more efficient products, and also develop faster for our customers. Data streaming is an important enabler for event-driven architecture. We are currently in a major transformation related to our infrastructure. We are looking to leverage managed Flink for our real-time analytics cases, and we are looking to leverage the data governance functionality to make our workflow more streamlined. We are happy to work with Confluent and use Confluent products. If it works for us, it can work for many more customers.

0:55:10.8 Joseph Morais: Great. So booking.com, going back to your example, Mike, of ride shares, what is more timely than booking some travel? You have to have all this different data coming in, whether it's hotel or flight data, and all of it needs to be handled in real time, because it doesn't really help me to find out, as I'm looking at the screen, that my hotel already booked up before I could actually accept my reservation. They specifically called out governance, but I just want to get your reaction to that video. Let's start with you, Mike.

0:55:43.7 Mike Agnich: Yeah, I've been lucky enough to spend some time with the Booking folks in Amsterdam a few times, and I've been really impressed with how they approach streaming. They ask really tough questions, but they've always been interested in doing more and more with streaming. The scale of what they do is very large, and I would say the sophistication of what they do is very high. And to the point you made, this whole area of travel and hospitality ends up being a giant real-time data challenge because of the nature of what I would think of as inventory in that entire industry. Things becoming bookings, bookings getting canceled, new inventory becoming available, new hotels opening, hotels closing. And if you get it wrong, it really can cause a terrible user experience. I think in that area, the ability to move in real time has become a big differentiator. And Booking doesn't own that inventory. The inventory is owned all over the world. It's the travel world, so it's hotels, it's motels, it's houses, it's everything.

0:57:00.0 Mike Agnich: And so the challenge is really big. And when you think about it, you start to realize why the entire travel and hospitality industry has invested so heavily in streaming. Not just streaming to move data, but streaming to do processing, to get intelligent with the data. When you need your experiences to operate in real time like they do, really the only option at scale is streaming and stream processing. And if you don't have governance, none of that's going to work. If you don't have an understanding of the data, of where these bookings are located and whether they're up to date and accurate, it breaks down. And this can never go down. If streaming has any blips, if producers and consumers aren't aligned, Booking stops functioning. So that's my reaction, I guess. Having spent some time with them, they're really awesome to work with, and I think they have been very smart about investing heavily in more than just streaming 101. They are streaming 400, or whatever academic analogy we want to make. They are really advanced and have been great partners for us. I've learned a lot from them.

0:58:21.6 Joseph Morais: That's really great context. And David, I know that in the video they called out governance, I believe for streamlining adoption, which is interesting. I'm curious about your perspective on that.

0:58:31.5 David Araujo: Yeah, yeah. I think one key point this customer calls out that I think is super critical is efficiency. I wrote down efficiency for developers and building products faster for their customers. Right? This is basically what sets Booking apart from their competitors: they move faster and they provide their customers with these experiences in real time. So there's a huge competitive advantage for Booking. And I was obviously very happy to see that they're using the full power of the data streaming platform, using Flink to process and react to streams in real time and using governance to scale the use of the platform safely and with control across Booking. So a very interesting use case and obviously an awesome customer.

0:59:19.4 Joseph Morais: I like the way you put that, David. Governance allows you to move quickly, but also confidently, because you know your data is good. Excellent. That was a great video. So we discussed what data streaming is, we talked about some strategies, we talked about governance and integration, and we heard from one of our customers on how data streaming is transforming their organization. Now it's time for the real hard-hitting stuff: our data streaming meme of the week.

[music]

[video playback]

1:00:16.9 Joseph Morais: This actually fits really nicely into our conversation, because we were talking about how, say, an analytics person might be really excited to utilize data coming from data streaming, but they may not necessarily want to become a data streaming engineer or learn Kafka. And I think this meme fits perfectly into that. What are your thoughts? 

1:00:34.9 Mike Agnich: Yeah. Well, I think what Steve is hitting on here is that these are complicated systems to manage. When I first started working with Apache Kafka, I would frame it like this: on a scale of 0 to 10, how difficult is it to operate and manage? At scale it was either an 8 out of 10 or a 9 out of 10. Very complicated. Much harder than a relational database. Much harder than an analytics system. And then when you layer in trying to self-manage Flink with Kafka, you're adding another distributed system, and your difficulty just went to 13 out of 10. It is extraordinarily difficult to run at scale. To me this meme is really speaking to the benefits of fully managed, cloud-native, auto-scaling, low-touch streaming. I would argue streaming is one of the best use cases for cloud, because streaming is one of the hardest things to self-manage.

1:01:31.9 Joseph Morais: Yeah, layers of distributed systems on top of distributed systems. Each one gets exponentially more difficult as you layer it on. David, what are your thoughts on that meme? 

1:01:40.9 David Araujo: Well, this reminds me of the DevOps guy at my previous company, my friend. Really.

1:01:51.1 Joseph Morais: I love it.

1:01:51.8 David Araujo: Yeah, yeah, you're going full circle. If he's listening, I really hope he's doing better these days and that the company decided to get some help from Confluent.

1:02:00.8 Joseph Morais: Oh, that's great.

[music]

1:02:08.7 Joseph Morais: All right, so before we let you go, we're going to do a lightning round. So think of these as like hot takes, right? Bite-sized questions, bite-sized answers. But these are schema backed and serialized. Are you ready? 

1:02:20.0 Mike Agnich: Let's do it.

1:02:20.6 David Araujo: Yep.

1:02:21.1 Joseph Morais: All right. So starting with you Mike, what's something you hate about IT? 

1:02:26.7 Mike Agnich: I don't like the letters IT. I graduated with a technical degree in 2000, and for us it was engineering. It's engineering. To me that's better. IT feels static, it feels frozen, whereas engineering... I just never use IT. I never use those letters. That's what I don't like.

1:02:55.8 Joseph Morais: That's a good one. I think a lot of people associate IT with just a help desk guy, and it's much bigger than that. And hey, they're engineers too, so I don't want to exclude them. That's certainly where I started my career. David, what's the last piece of media you streamed? 

1:03:13.1 David Araujo: Yesterday I streamed a soccer game from my club, and we won 3-1, so I'm happy.

[laughter]

1:03:21.8 Joseph Morais: Congratulations. Mike, what's a hobby that you enjoy that helps you think differently about working with data across a large enterprise? 

1:03:32.6 Mike Agnich: You know, so I have three kids ranging from 17 years old to 4 years old.

1:03:37.3 Joseph Morais: Oh, wow.

1:03:37.6 Mike Agnich: So I would say my hobbies are working and kids; there's not that much time for other stuff right now. But I will say, the same issues of coordinating groups come up with my kids and my kids' sports teams. Every time I look at how different groups try to work together, so much gets lost in communication. I would say it's a ubiquitous problem. But I don't know how else to answer that.

1:04:13.8 Joseph Morais: I love that answer. Honestly raising children is just as hard or harder than running layers of distributed systems. I think that's very relevant. David, can you name a book or resource that has influenced your approach to building event driven architecture or implementing data streaming? 

1:04:32.4 David Araujo: Yeah, one article that heavily influenced my views on streaming and this whole data governance area was Zhamak's article on the data mesh principles. The whole decentralization and distribution of responsibilities to the people who are closest to the data really resonates with me. And in particular, the aspects around the self-serve platform and data products heavily influenced my views on governance and streaming in general.

1:05:04.6 Joseph Morais: Excellent. Mike, what's your advice for a first-time chief data officer or somebody with an equivalently impressive title? 

1:05:11.6 Mike Agnich: Well, I would say sometimes I get asked by customers who are just getting into streaming, hey, what advice do you have, what can I do now? And I say, all right, there are three really important things you need to do if you're getting into streaming. Three things you just can't skip. The first thing is schema. The second thing is schema. The third thing is schema.

[laughter]

1:05:33.0 Mike Agnich: It goes back to the governance stuff David was talking about, but you need to understand the structure of the data and never lose it. I can't say enough about how much waste and re-engineering is happening today because the meaning, which is understood by our source-aligned developers, is lost in the middle and needs to be regenerated for lines of business to use the data at the end of the day. Putting schema in streaming solves that problem better than anything else I've ever seen. So that tends to be the advice I give: don't skip schema. Never lose the context and meaning of data as it moves through your system end to end.
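As a concrete starting point for that advice, here is a minimal sketch of schema-first producing with the confluent-kafka Python client and Schema Registry, assuming a local broker and registry and the confluent-kafka[avro] extras installed; the Order schema and topic name are made up for illustration.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Illustrative Avro schema: the structure travels with the data end to end.
ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed registry
serializer = AvroSerializer(schema_registry, ORDER_SCHEMA)
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

# Records that don't match the schema fail here, at the source,
# instead of surfacing later in an analytics pipeline.
value = serializer({"order_id": "o-1", "amount": 12.50},
                   SerializationContext("orders", MessageField.VALUE))
producer.produce("orders", key="o-1", value=value)
producer.flush()
```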

1:06:12.1 Joseph Morais: That's a really good answer. So this is for both of you. Any final thoughts or anything to plug? 

1:06:18.7 David Araujo: From my side, use schemas. That's my final thought.

[laughter]

1:06:23.8 David Araujo: Just to reiterate Mike's advice.

1:06:27.4 Joseph Morais: That's quite important. [laughter] Clearly, use schemas. We definitely know that now. How about you, Mike? 

1:06:37.2 Mike Agnich: I would just say, the technologies that we're unifying, Kafka, Flink, Iceberg, are where it's happening if you are a developer interested in working on the cutting edge of what the world is doing. The usage of Flink is just skyrocketing across the community, the use cases are extremely interesting, and you've got these open standards emerging everywhere. I think it's very exciting. So if you're going to read books on weekends and you want to do technical stuff, I would really recommend getting to know more about Iceberg and more about Flink. I think these are really important open standards as we head into this AI wave, which is just getting started.

1:07:30.6 Joseph Morais: Well, excellent final thought. Mike and David, thank you so much for your time today. I appreciate all your efforts here at Confluent in helping build really great data streaming products and building up the DSP to make it easier for everybody to use their quality data and get it where it needs to go. So for the audience, please stick around, because I'm going to give you my three top takeaways in two minutes.

[music]

1:08:00.4 Joseph Morais: Great conversation with Mike and David. Here are my three top takeaways. First, ETL versus ELT, right? Extract, transform, load versus extract, load, transform. Clearly we're a bit biased here, but we really believe that real-time ETL is the way to move forward. Getting that transformation as close as you can to the extraction is important, because closer to the source is where you have ownership of the data. The people who generated that data know what it is. And it's really interesting to think about how often people are sending data down to their data warehouses and data lakes without any knowledge of what that data is. The people who run those systems are not the generators of that data, so should they be the custodians of it? My argument is they shouldn't be. Which brings me to my second point, and that's the idea of shifting left. Between that combination of governance, connectors, and of course stream processing, you can take away a lot of the processing that you're doing down in those data lakes and data warehouses and bring it up into your data streams, which is where the data comes from anyway.

1:09:10.9 Joseph Morais: So why not clean that data up and make it useful in your operational state? Because, as Mike mentioned, you're going to save money and time, but you're also going to be able to take those data products and expose them operationally and use them in other interesting ways, without having to do something called reverse ETL, which is kind of a weird thing: if the data already started here and you want it back here, why not make it good at the start, instead of sending it all the way down just for it to come back anyway? And that brings me to my last top takeaway, and that's schema, schema, schema. I thought this was excellent advice for anyone, especially a chief data officer: you've got to have that schema backing your data.

1:09:55.5 Joseph Morais: You want to understand the schema, what that data should look like and how it should evolve, and you want to make sure that you're doing that from the very start. That leads into the theme of ensuring you have quality data as it enters your system. Quality data at the start means you have quality data at the end, and you don't want to be catching problems just before you need to use the data. So those are my three top takeaways. That's it for this episode of Life Is But a Stream. Thanks again to Mike and David for joining us, and thanks for tuning in. As always, we're brought to you by Confluent. The Confluent data streaming platform is the data advantage every organization needs to innovate today and win tomorrow. Your unified solution to stream, connect, process, and govern your data starts at confluent.io. If you'd like to connect, find me on LinkedIn. Tell a friend or coworker about us, and subscribe to the show so you never miss an episode. We'll see you next time.