Mr. Debezium on Pinot, Flink, CDC & Decodable | Ep. 4: Gunnar Morling | Real-Time Analytics Podcast
- Gunnar Morling is pretty well known as the founder and maintainer of Debezium, a popular open-source change data capture tool. And he's recently joined Decodable, which brings him more broadly into the world of stream processing and integration and all these things. And I have him on the show today to talk about that transition and really the place of change data capture in the broader world of stream processing and real-time analytics. Gunnar's a great guy, talking to him is always fun. So listen in on today's episode of the "Real-Time Analytics" podcast. Hello and welcome to another episode of the "Real-Time Analytics" podcast.
I am your host, Tim Berglund, and I am very excited to be joined today by my friend Gunnar Morling. Gunnar is a software engineer at Decodable, a change he's made recently. - Right. Gunnar, welcome to the show. - Tim, thank you so much for having me, really excited to be here and having, again, the chance to talk to you. I'm really looking forward to it. - Yeah, it's great to be back on a different podcast, but a podcast.
- Right? - So you are fairly well known. If there's anybody in the audience who doesn't know your name, I'll just give a little bit of preview. I mean, I think you're most famous as the Debezium guy. - That's what I say. (laughs) - Yes, yes. Hi, I'm Gunnar. I'm the Debezium guy.
(Gunnar laughs) So Debezium and change data capture, I'll let you give a little bit of an intro to that. And recently, you've made a move to Decodable. I would love - Exactly.
- To hear about that, but a brief couple minutes on your background. - Right. - You're the Debezium guy. - How do you get to be the Debezium guy? - Yeah, that's actually an interesting story on its own right. Yes, that's a good question.
So, yes, as you mentioned, I recently moved to Decodable, and we can talk about what we do there is managed stream processing. And before that I was exactly, up to the day exactly, for 10 years at Red Hat. And at Red Hat, the project I was doing for the last five years was Debezium, a tool for change data capture, which means it gets change events out of your database. Whenever something gets inserted or updated or deleted, then Debezium will get this event from the transaction log of a database and propagate it to all kinds of consumers, typically via Kafka.
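To make the idea of a change event concrete, here is a minimal Python sketch of reading one. It assumes a simplified version of the Debezium event envelope (the `before`, `after`, `source`, `op`, and `ts_ms` fields); the `customers` table, its columns, and the `describe_change` helper are made up for illustration, and real events carry more metadata than shown here.

```python
import json

# Simplified Debezium-style change event payload. The table and column
# names are hypothetical; real events include more metadata.
raw_event = json.dumps({
    "before": {"id": 42, "email": "old@example.com"},
    "after": {"id": 42, "email": "new@example.com"},
    "source": {"connector": "postgresql", "table": "customers"},
    "op": "u",  # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1679000000000,
})

OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def describe_change(raw: str) -> str:
    """Summarize one change event as a short human-readable string."""
    event = json.loads(raw)
    kind = OP_NAMES.get(event["op"], "unknown")
    # Deletes carry no "after" state, so fall back to "before" for the key.
    row = event.get("after") or event.get("before")
    table = event["source"]["table"]
    return f"{kind} on {table} (id={row['id']})"


print(describe_change(raw_event))
```

A consumer downstream of Kafka would typically apply this kind of logic per message; the point is only that each event describes one row-level change together with its old and new state.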
And before that, I was working on Hibernate related projects at Red Hat. And now as you ask, how did I come to join Debezium? Yes, Randall Hauch. Hey Randall, if you hear this, he was the original founder of the Debezium project at Red Hat.
And then, at some point, he decided to leave, and he went to Confluent. And, well, at this point, I was looking for a new role, for something new and exciting. There was this opportunity, and this is how I came to be the Debezium guy, I guess. - There you go. There you go, capturing changes in all of the databases and, you said, propagating them to different kinds of clients. As you were saying that, I realized I've only ever thought of Kafka and like it could be Pulsar, or it could be Kinesis, it could be some other place for event.
- Right, exactly. - Is there any other kind of realistic sink that those go to? I mean, where else would you put them? - Right, so the thing is, I wouldn't even consider Kafka or Pulsar or Kinesis or any of those things as a sink, per se. They're more like the fabric of transporting your data, right? - There you go. - And I would say, most of the times, people use Debezium indeed with Apache Kafka. As you very well know, it's a set of Kafka Connect connectors, and this is how most of the Debezium users are using Debezium. But actually there's also other ways how you can use Debezium.
And one of them is what we call Debezium Server. And Debezium Server, essentially, in an entire architecture, it has the same role as Kafka Connect. So it's a runtime for Debezium in this case, but then it gives connectivity for all kinds of data streaming platforms other than Kafka. So you can use Debezium Server to stream your change events into Pulsar or Kinesis, Google Cloud Pub/Sub, NATS, Redis Streams, whatever there is. So this opens up the capabilities of Debezium for all the users of those platforms, because we realize, okay, people like Debezium. And, I mean, many people also like Kafka, right? - Sure. - But then there's people
who are on other platforms, and we thought, also we want to give them the ability to use Debezium. - Right. - And this is how Debezium Server came to be. - Nice. Okay, that's right. - Right.
- How did you come to Decodable? This takes us from raw change data capture kind of closer to the world of what computations we do, you know, the (indistinct) world. - Right. Absolutely, absolutely. And you touched exactly on the right thing. So yes, as I mentioned, I was working for roughly five years on Debezium. And I would say it's really a wildly successful project. It's kind of the de facto standard for change data capture, I would say. - 100%.
- And even other vendors come and they support now the Debezium change event format. So if you look at companies like Yugabyte or ScyllaDB or even Google Cloud Spanner, so they implemented a CDC connector based on the Debezium framework, and now they contribute it to Debezium. So it's really- - That's not your bias speaking. That is true. - It's totally, totally a neutral description of the situation, exactly. (Tim laughing) (Gunnar laughing) - From a completely neutral observer with nothing invested - Exactly.
- In any of this. But I'm over here going, nah, that's just what it is. - No, so but, now the but comes. So yes, it does that part well, I would say.
So it takes your data, your change events out of a database like Postgres, MySQL, Oracle, SQL Server, and so on into something like Kafka or Pulsar and so on. But then, I mean, you don't wanna have your data there for its own sake, right? You wanna do something with this data. You wanna propagate it into something like Pinot maybe, or StarTree Cloud.
And now the question is, - Love it, love it. how do you do this? - Huh? - I said I love it. Pinot or StarTree Cloud. - Yeah, yeah. - Just brilliant, brilliant suggestion. - Just naturally,
I manage to mention that. - More of that. Yeah, you're just doing (Gunnar laughing) for example. - For example, right? I mean there's also like search indexes and cloud data warehouses, all this kind of stuff. So people want to take those change events out of Kafka, out of Pulsar into those external systems. And Debezium itself doesn't really concern itself with that aspect.
So that's the one thing. And then there's this entire notion of processing the data, right? So in some cases, yes, it's fine to take your change events and just take them as is and put them into another database, but then very often you wanna modify and process this data. You wanna filter it. Maybe you have like sensitive data. You wanna remove parts of your stream. Maybe it's like tenant-specific data, which you wanna route to different sink databases.
Maybe you wanna do more complex transformations like joining or grouping, aggregating your data, all this kind of stuff. And again, this is something which Debezium doesn't do, and it's just not in the scope of the project. So that's perfectly fine. But I felt I wanna look at those data pipelines really from end to end and be able to help people with a system which does all of that. And this is exactly what we do at Decodable. So it's a managed stream processing platform based on Apache Flink as the stream processing layer, and then it also integrates with Debezium.
It integrates with connectors, again, from Pinot, Elasticsearch, all those kinds of systems, Snowflake. And it gives you all those things, all this data, end-to-end connectivity in a fully managed way. And that was something which I found and still find very exciting to work on. - Nice. Yeah, expanding the scope to include Compute. - Exactly. - And all the things that... - Exactly right.
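The filtering, masking, and tenant routing described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Flink or Decodable implement it: the event shape, the `status` and `tenant` columns, and the `sink_<tenant>` naming are all assumptions made up for the example.

```python
from collections import defaultdict


def mask_email(value: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def process(events):
    """Filter, mask, and route change events to per-tenant sinks."""
    routed = defaultdict(list)
    for event in events:
        row = event["after"]
        if row is None:                  # deletes have no new row state; skip them here
            continue
        if row.get("status") == "test":  # filter: drop test records
            continue
        cleaned = dict(row, email=mask_email(row["email"]))  # mask sensitive data
        routed[f"sink_{row['tenant']}"].append(cleaned)      # route by tenant
    return dict(routed)


events = [
    {"op": "c", "after": {"id": 1, "tenant": "acme", "email": "ann@acme.io", "status": "active"}},
    {"op": "c", "after": {"id": 2, "tenant": "globex", "email": "bob@globex.io", "status": "test"}},
    {"op": "u", "after": {"id": 3, "tenant": "globex", "email": "cy@globex.io", "status": "active"}},
    {"op": "d", "after": None},
]

for sink, rows in process(events).items():
    print(sink, rows)
```

In a stream processor the same three steps would usually be expressed declaratively, for example as a SQL query with a `WHERE` clause and a masking expression, with one output per destination.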
Right, so there's this. Then, also, I mean, I've always worked at large companies before. I mean, when I left Red Hat, it was like 20K employees.
And I really felt, "Hey man, I'm also getting older, right? And I wanna have, at least once in my life, I wanna have this startup experience and see how this is. So do I like it or do I not like it?" So far, I love it. But I thought, "You know, I wanna give this a try." And then, Decodable, this opportunity came up, and I thought, "Okay, this is something I wanna try out and see whether it's fine for me." - Awesome.
Yeah, I was very happy to see the news when I did. It's nice to have you - Cool, thank you. - Over here on the startup end of things. - Exactly. And so far I like it. You know, it's more, as you are well aware of, right, it is agile and you make quick decisions.
You try out stuff quickly. Does it fly? Cool, we keep to it. Does it not fly? Okay, we drop it.
So you have very fast feedback cycles, and that is one thing I enjoy. - Absolutely. And it's fun to grow with a company, you know, if you've got a good growth arc. - Exactly.
- Companies go through stages, just like people do. Just like there are kind of phases of development as a growing person. - Exactly.
- There are phases of startups, and it's fun to navigate them. - Right. - I mean, it can be frustrating to navigate them too, right, 'cause you can suddenly be like, "Well that thing that I liked isn't here anymore.
And now it's, - Right. - Well, he was cuter when he was a toddler. And now he's a teenager. - Yes. (both laughing) - No, I totally see what you mean.
I mean, so far I really enjoy it. I really love the energy. Like, everybody is pulling in one direction. And if something comes up, like everybody comes together and helps to sort it out.
And I really love the energy. I love the vibe of this. So it's really cool. - Love it. - Now this is a fairly young podcast, the "Real-Time Analytics" podcast. I'm not sure of exactly what order things are gonna come. - Oh, I was about to ask you,
which episode is this even? (chuckles) - This might be number four, so yeah. - Okay, cool. - So this is early on in our history and, I mean, I certainly have got sort of a vision for where things are going and what we want to be about. But you know, in the minds and the ears of our listeners, and by the way, thanks for being here if you're listening, there isn't a lot of brand established yet. It's not a lot of expectations based on a long tradition.
So I need to say some things. For example, this isn't just a Pinot podcast. Now I work with Pinot, I work for StarTree. This podcast is a thing that StarTree does.
And so, you know, there'll probably be certain gravitational pull towards Pinot and StarTree things. - Right. Which makes sense. - StarTree customers and partners.
That's how that kind of thing works. But it is not called the Pinot podcast, it's called the "Real-Time Analytics" podcast for a reason. And so the interplay here of Flink, and I've got this bad habit of thinking of Decodable as the Flink company. Well, you know, that's a neat implementation detail. - Right.
- And early on in a product's history, implementation details usually matter, like that matter more. But that's not what it's about. It's this storage agnostic stream processing and integration layer.
- Exactly, right. Exactly. - Yeah, so that makes for some interesting questions to kind of bat around. Like, I'll take it, all that to say, now I'm gonna talk about Pinot just 'cause it is concrete. (Gunnar laughs) But from the standpoint of a real-time Pinot table where you're connected to, say, a Kafka topic, and you're ingesting messages from that Kafka topic and making them queryable. And that's like a core Pinot use case. It's not necessarily likely that the schema that you want in that table is lying around in some topic.
You know, what would be the odds, it's just there and everything's great. So that introduces the idea of stream processing like you were talking. - Totally. Yes, totally. So this would complement each other in this situation then, right? - Absolutely. You need that somewhere.
So that's gonna be Flink. That's gonna be Kafka Streams. That's gonna be some other kind of thing that you do. - Right, right, right. - And, oh, the whole Flink/Kafka Streams, that'll be a later episode.
- Oh yeah, I'm curious about that one. (laughs) - Yeah, now we're gonna- - Tell me about this, and I will grab some popcorn. - Well, you will know. And I'll say right now, Anna McDonald, you're not scheduled yet, but that's just because I haven't texted you to ask. (Gunnar laughs) So you are the person that I would like to begin that conversation with.
Yeah, Anna. Anna, I hope we'll be on the show very, very soon. - Awesome. - But there's that, right? Like, okay, I want to get this stream of events. Maybe it started partially with some CDC happening, maybe it didn't.
I want to get it into my real-time database - Right. - Or my whatever database for query. - It could be streams, right? It could be multiple ones, and you wanna join them. - It could be multiple ones. - And put them - That you have - Into one database. - To join,
because you don't wanna do that. And like Pinot is rapidly growing an arbitrary join capability, but joins are still expensive, even if you're stupid fast at them. And you'd rather do them before you get them into the database and indexed. So like, yeah, it could be streams, could be transformations you have to do that are stateless, whatever. That's the one thing.
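As a rough illustration of joining before the data reaches the database, the following Python sketch enriches order events with the latest known customer row and emits one flat, denormalized record per order. The `customer` and `order` record shapes are hypothetical, and a real stream processor would also deal with out-of-order events, state retention, and fault tolerance, none of which is modeled here.

```python
def enrich_orders(events):
    """Join orders against the latest customer state seen so far."""
    customers = {}  # join state: latest customer row per id
    for kind, record in events:
        if kind == "customer":
            customers[record["id"]] = record
        elif kind == "order":
            customer = customers.get(record["customer_id"])
            if customer is None:
                continue  # no matching customer yet; this sketch simply drops the order
            yield {
                "order_id": record["id"],
                "amount": record["amount"],
                "customer_name": customer["name"],
                "country": customer["country"],
            }


events = [
    ("customer", {"id": 1, "name": "Ann", "country": "DE"}),
    ("order", {"id": 100, "customer_id": 1, "amount": 25.0}),
    ("order", {"id": 101, "customer_id": 2, "amount": 9.5}),
    ("customer", {"id": 1, "name": "Ann", "country": "US"}),
    ("order", {"id": 102, "customer_id": 1, "amount": 40.0}),
]

for row in enrich_orders(events):
    print(row)
```

The flat rows that come out are what would be loaded into the analytics table, so queries there can filter or group by `country` without performing a join at query time.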
But how about, and I know you're the CDC guy and I'm asking you about Flink, but I just wanna kinda go there. - Right, let's see what I've learned over the last few months. - Exactly. - This is the test. (Gunnar laughing)
He's only been a few months, folks, at this, at the time of this recording. So when do you wanna not fuss with the database? Like, when do you want to just make things happen in Flink that are immediately queryable? Talk to us about how Flink supports that. - Right, I mean, so there is a few facets to that, right? I mean, the first thing which I would mention is there is all this connector and connectivity stories. So Flink has a rich ecosystem of connectors which allow you to ingest data from all kinds of systems, like there's the CDC-based connectors, there's REST-based input, there is all kinds of other sources, Kafka sources of course, Pulsar.
So you can ingest your data from a variety of systems, right? And you also can put your data into a variety of systems. So that's what I really like about this, is it's like this, let's say in mathematical terms, I guess, or modeling terms, it's like an m to n kind of relationship, right? So you have many things on the source side, you have many things on the sink side, and you can take those streams, process them, mesh them, and put them maybe to multiple systems. So maybe you have use case where, yes, you wanna have this large real-time query capability from Pinot, but then at the same time, I don't know, maybe you also wanna do full text search on the data. I don't know about Pinot. Maybe you also can do full text search there, - You can. - But maybe you wanna use, you can, okay.
- Well, within limitations. - Right, okay. - Maybe. - So maybe have another... - Maybe that text indexed
isn't right, and... - Exactly. So you wanna maybe use another system and put your data in Pinot and this external search service, right? - Yes. So this could be something and then, or maybe you just have, like, you also wanna put your data into S3. So you have like a long-term storage archive for data. - Call it a lake,
but for data. - Exactly right. - Yes, yes. - And this is, I believe, that's a very nice aspect of Flink and this category of systems that you can put your data into multiple sources, take from multiple sources, put them into multiple things, and have your data in all those places. - Yeah. And for the uninitiated, I mean, Flink is not providing storage, it's providing a compute layer, right? - Exactly right. So it does that.
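The m-to-n shape of such a pipeline, with several sources, one compute step, and several sinks, can be sketched as follows. In-memory lists stand in for the real systems (an analytics table, a search index, an archive); the `run_pipeline` function and the record layout are assumptions for illustration only, and the sources are read one after another rather than concurrently.

```python
from typing import Callable, Iterable, List, Optional

Record = dict
Sink = Callable[[Record], None]


def run_pipeline(
    sources: Iterable[Iterable[Record]],
    transform: Callable[[Record], Optional[Record]],
    sinks: List[Sink],
) -> int:
    """Read every source, transform each record, and write it to every sink."""
    written = 0
    for source in sources:
        for record in source:
            out = transform(record)
            if out is None:      # the transform may drop a record
                continue
            for sink in sinks:   # fan out to all destinations
                sink(out)
            written += 1
    return written


def normalize(record: Record) -> Optional[Record]:
    """Drop records with empty text and upper-case the rest."""
    if not record["text"]:
        return None
    return dict(record, text=record["text"].upper())


# Stand-ins for an analytics table, a search index, and a long-term archive.
analytics_table, search_index, archive = [], [], []

sources = [
    [{"id": 1, "text": "hello"}],
    [{"id": 2, "text": "world"}, {"id": 3, "text": ""}],
]

count = run_pipeline(sources, normalize, [analytics_table.append, search_index.append, archive.append])
print(count, analytics_table)
```

Note that the pipeline itself keeps nothing after the run: all results live in the sinks, which mirrors the point that the compute layer is not the long-term storage.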
And then, yeah, typically you will, I mean, you can do SQL queries against it, right? So yes, you can do compute, and there's like state store of course, so it's like Reliant and so on, but it's not long term storage by itself. - Right, you've got, let's say in the typical case, messages in Kafka topics. You'll consume them.
You'll do compute on them, including possibly SQL queries and yeah. - Exactly. - So your thoughts there though, and if I'm pressing into areas of Flink that just aren't you, that's a different separate podcast and we talk about other things, but. - Right. - I've got, and there's this blog post draft that Mark Needham, a colleague of mine and I are working on. - Oh, yeah, yeah. I love Mark.
Say hi to Mark. He's awesome. - Mark is amazing. He'll be a guest on this podcast occasionally, I hope. - Totally, yeah. - Yeah, his work is fantastic. But the idea is querying streams, and, Gunnar, this might have come from a Twitter thread with you, if I recall correctly.
There was a Twitter thread about how is it that you query a stream and how do you make the decision about which system to use? Like, suppose you've got Kafka Streams in your life, maybe you want a KTable. Maybe that's completely unsuitable, and you want something like Pinot. Maybe you want to use Flink SQL. So how do you think about that taxonomy? And another, before I let you go on that, the question comes up, you said these are complementary technologies, and of course the opposite of that is competitive. And anybody, for example, who would say that Flink and Pinot are competitive technologies would be a person who didn't really understand very well what those two things do, right? That would be kind of a disqualifying statement. You would think, "Oh, okay, they're maybe not very technical."
- Right, yeah, exactly. - Need to kind of bone up on these systems. - Exactly.
- And in the boundaries between things, you have decisions about, are you gonna do it this way? Are you gonna do it that way? - Right. - And you have these little interesting areas of overlapping functionality where you're never gonna say, well, we don't need Flink because we have Pinot. That makes no sense.
But here's a feature, here's a use case, here's a way I want to access my data, and it could be one or the other. And I'm interested in exploring those boundaries. Not because I care about which one wins in those cases, really. - Right.
- I think we just get a better understanding of what the things do. So can you riff on that? - Right. - I think the Twitter thread was you, but I don't know. - Uh, maybe, I don't know. It could be. I feel like I do too many Twitter stuff, and I don't remember it. (chuckles)
- Right. (chuckles) It does not get better. - It definitely is a topic I at least might have joined this conversation.
Yeah, I mean, yes, I would agree there is an overlap. And I feel there probably is a category of use cases where you would be fine to implement it either in Flink or in Pinot. Maybe it would be a, you know, it would both work. I mean, some of the things to consider, and I'm just thinking out loud, I'm curious what you think, but some of the things to consider are, I would say, is, again, this question of what source connectors are available. So for instance, what you can do with Flink is you can directly ingest data from Debezium CDC connectors without going through Kafka at first.
So they can be embedded within Flink, and then you have, let's say, a Flink native CDC source. But no, I'm not sure whether this is doable with Pinot. In general, you could build this. I don't know whether you have it. - You could. Pinot recently has grown the ability to ingest natively from Flink. - Right.
From Flink, okay. - But I don't think from Debezium. I should know. - Right, but maybe then you could even use the Flink CDC connector, which is based on Debezium to do this. Actually, that would be very interesting to explore too. - Ahh. - Right, so I think that's one axis of thinking you should have.
The other question there is around the query capabilities. Right now, that's a little bit above my pay grade, but what kinds of query, primitive query capabilities do you have? I guess maybe there's some stuff in Pinot, specific types of queries which you couldn't do in Flink or the other way around. So I feel that is an area which you probably should look at when you make this kind of decision. - Yeah, I like it. Another thought that's come up and kind of one of the criteria we're playing with is if there's some, you're making some queryable chunk of state based on this input stream of events, a topic. It's called a topic.
And maybe you're okay. You just have one dimension, one key that you need to query on. So like, from an indexing standpoint, you don't need anything fancy. But a thing that pushes you away, say from Pinot or another database proper, is do the results of this query need to go back into a stream somewhere? So Pinot is a terminus for your stream processing pipeline.
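The simple case mentioned here, a queryable chunk of state with a single lookup key, can be pictured as a latest-value-per-key view over the stream. The sketch below is a plain Python stand-in for that idea, similar in spirit to a KTable but not the Kafka Streams API; the `LatestByKey` class and the convention that a `None` value deletes the key are assumptions for illustration.

```python
class LatestByKey:
    """Materialize a stream of (key, value) updates into a point-lookup view."""

    def __init__(self):
        self._state = {}

    def apply(self, key, value):
        """Apply one update; a value of None removes the key (a tombstone)."""
        if value is None:
            self._state.pop(key, None)
        else:
            self._state[key] = value

    def get(self, key):
        """Point lookup by key; returns None if the key is not present."""
        return self._state.get(key)


view = LatestByKey()
for key, value in [
    ("user-1", {"clicks": 1}),
    ("user-2", {"clicks": 5}),
    ("user-1", {"clicks": 2}),  # a later update overwrites the earlier one
    ("user-2", None),           # tombstone removes the key
]:
    view.apply(key, value)

print(view.get("user-1"), view.get("user-2"))
```

Anything beyond lookups on that one key, such as filtering or aggregating across other dimensions at low latency, is where indexes in a database like Pinot come into play.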
- Exactly, yes. - And then in return for that little bit of inflexibility, it gives you the ability to have concurrency and latency. - Right, yes. - Characteristics that are hard to have otherwise. - Right, yes. Exactly. - But you don't - Yeah, see, that's what I was - Change things there and throw them back.
That's just not, that's not how OLAP works. - I mean, you could, maybe you at some point you have a Debezium connector for Pinot, which takes data out of Pinot. (chuckles) I don't know, it could be an option. - Somebody's gonna do it, and I'm not gonna like it.
But you know what, I was not brought upon this earth to like it in all cases, so. - Yeah, so. But yes, you're right. So your data goes into Pinot and then you query it, but it doesn't move on, right, and that's what I was trying to get at when I said you can, with the Flink approach, you can put your data, your streaming query results into multiple places at once, right? So it could be Pinot and something else, and maybe in the simplest case, just another Kafka topic and you stream it - Exactly. - Somewhere else.
- Which in the case of your system, your whatever it is that you're building, the architecture, not terribly unlikely that you're gonna wanna go to several places. Like, you know, - Exactly. - Maybe it needs to go to Pinot because you've got features in your user interface that need 25 second, or 25 millisecond latency on. - 25 seconds? - 25 seconds, great.
We can do that now, yeah. - That's pretty impressive. - That's great, yeah. On the order of- - It's just a question of the size of the data set, so you can't impress me with that. (chuckles) - That's right, that's right. If it's all of the data in the known universe, then that's good.
No, on a realistic, large data set, if you need millisecond latencies, then you need to go to Pinot. - Yeah, that's awesome. - And that's for a real-time user facing thing, but it needs to go other places too. Like maybe it does still need to go to a data lake, because there's things happening in that data lake. I mean, that's the state of the world, so that makes a lot of sense. - Absolutely. I mean, people generally, I feel also like having optionality, right? They don't like being locked into something.
It's like that's how it is today, right? They don't wanna commit. You wanna keep the options on the table. And I feel that's probably also part of the consideration. - Yeah, yeah. So what are you working on, if you can talk about that? What's your- - Oh yeah, that's actually very interesting.
- You're here in the space now, so yeah. - Right. Actually, I just talked to Eric Sammer, our CEO at Decodable today about this. So, yes, I guess I am having multiple hats right now, which is a bit owed to the fact that we are just a small team and there is not like the person who just does this. So it's a function of the team size that I do several things. And also it's a function of that I'm just interested in many things.
So sometimes I have a hard time to say, "Okay, this is what I'm doing and nothing else." I also feel like, "Hey, I would like to do this. I would like to do that." So to make it a bit more specific, part of my responsibility is to do DevRel work in the wider sense, but I'm not a full-time DevRel person.
- Okay. - I'm definitely a member of the engineering organization, but I like, as you know, I like going out, doing talks, doing a podcast like this. - You're good at it. - Just talking about the stuff I build. And I like doing that, and I like doing that as the engineer. Actually, I think that's good, because you are very hands-on, so you are very familiar with the subject matter.
So I like doing that. And that's definitely my part, part of my job. I am, and you are going to hear it here for the first time, I'm going to do a video series on our Decodable YouTube channel, which is going to be called the "Data Streaming Quick Tips." It's the first time ever I'm mentioning this, and there we'll do like short episodes, 10 minutes or so, 5 to 10 minutes.
How do you do joins with Flink? How do you put your data into Snowflake? This kind of stuff. So I'm doing all this DevRel stuff and some blogging and so on. - So good, so good. - You got a good background there. I like the camera set up, the microphone. You're ready to be a YouTuber. - Yes, exactly.
I enjoy spending way too much money on this kind of stuff. Exactly, so, and that's just one part. So I do that. I'm also then involved with engineering. And there, I mean, I work on specific projects.
So there's a key project, which I just cannot mention yet. It will be out soon, but so we have a huge thing there where I'm involved with, but then also I'm just trying to help with general guidance on how should we do certain things and help others to come to decisions, this kind of stuff. So I do this and then, well, I also just like being involved with the product, so I'm preparing a demo for next week for Data Council in Austin.
So I played with the product and I felt, damn it, this button should be like this or this feature should be like that. And so I also like to be involved with that. And I'm not a product manager, but I like providing input on the product and hopefully form the way how it should look like. So I'm a bit between all those kind of things, if that makes sense. - Love it. You're not a product manager, but you do have opinions.
- Exactly. (laughs) Exactly, I do have opinions. - And I love- - I like voicing them. (chuckles) - Yeah. And by the way, I'm excited that you'll be doing some more DevRel type stuff.
That sounds great. I think bringing your engineer status to that role, I think very, very true. I mean, my background and my training is as an engineer, but I haven't had an engineering role in years. So there's just a different kind of thing that you can bring. And that's why I always covet participation of engineering in our effort, because there are things you can do that a developer advocate usually can't, so. - Yeah, I mean, and also the other way around.
I mean, it's- - Very much. No, they're both specialties for a reason, but. - Exactly right.
- Yeah. - And I mean, also, to be honest, there was a time where I was considering to do this full-time, but then I figured, no, my heart really is for the engineering side of things. And I like doing the DevRel side stuff a little bit on the side, also, as much or as less as I want to. (soft jazz music) So it's not like my key responsibility.
And I feel that's the sweet spot for me. - It is, as they say, hard for you to kick against the goads. So maybe we'll see you on this side of the fence permanently at some point. - (chuckles) Who knows, yeah. Never say no, right? (chuckles) - My guest today has been Gunnar Morling.
Gunnar, thanks for being a part of the "Real-Time Analytics" podcast. - Tim, thank you so much. This was quite fun. Talk to you soon. - [Tim] And there you have it. If you feel compelled to help us spread the word and grow the "Real-Time Analytics" community, you can give us a rating on Spotify or Apple Podcasts or wherever fine podcasts are sold.
If you're watching us on YouTube, hey, subscribe and, of course, hit that notification bell. And you can always share your favorite episodes on LinkedIn or Twitter or wherever it is you do social media. Thanks, and I look forward to talking to you in the next episode.