What is Apache Kafka “In Simple English”

Yann Mulonda · Published in Dev Genius · 5 min read · Aug 8, 2023

Introduction to Event Stream-processing (ESP) & Kafka

Let’s start with a scenario to lay out a fundamental understanding of our topic: something most of us are familiar with these days, what we call “loyalty or rewards programs.”

The customer makes financial transactions using their credit/debit card to buy groceries, a t-shirt, a book… or to book a flight and hotel room for vacation… just about any purchase using a dedicated payment method.

The companies then offer points, mileage, cash-back, or other benefits to customers in proportion to the amount of money spent. Customers can redeem those points/miles/cash-back rewards for discounts, free products, or insider perks. Businesses do this to motivate repeat purchases and build trust with their customers.

What is Event Stream-processing?

Now, how does this happen? How is my credit card company able to match every dollar I spend to the proper expense category and award me miles that I can use to book a hotel room or flight tickets? This is where “Event Stream-processing (ESP)” comes in. ESP is a technology that processes a continuous flow of data (streams of events) as soon as an event or change happens. By processing single points of data rather than an entire batch, event streaming platforms provide an architecture that enables software to understand, react to, and operate as events occur.

Image Source: tibco.com

ESP Platform

Let’s think of this process from a data integration perspective: we have one event that starts in a “source system,” which holds the data about a new transaction, and connects to a “target system” where that event is loaded, analyzed, and transformed into the desired outcome. Simple software with a few lines of code can do this operation:

Image source: Learn Apache Kafka for Beginners by Stéphane Maarek
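To see what such a one-off integration might look like in code, here is a deliberately simplified Java sketch. TransactionSource and RewardsTarget are hypothetical interfaces invented purely for this illustration:

```java
// A hypothetical point-to-point integration: one source system, one target system.
// TransactionSource and RewardsTarget are invented for illustration; every new
// source/target pair would need another piece of glue code like this.
public class DirectIntegration {

    interface TransactionSource {
        String nextTransaction(); // returns null when no more events are available
    }

    interface RewardsTarget {
        void load(String transaction);
    }

    static void sync(TransactionSource source, RewardsTarget target) {
        String tx;
        while ((tx = source.nextTransaction()) != null) {
            target.load(tx); // extract from the source, load into the target
        }
    }
}
```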

Data integration challenges grow with the number of source and/or target systems.

Image source: Learn Apache Kafka for Beginners by Stéphane Maarek

So as you can see, it’s not so easy to integrate anymore. The higher the number of source and/or target systems, the more integrations need to be set up, making the architecture extremely complex. Furthermore, each source system might get overloaded by the high number of requests and connections from target systems. Each integration also comes with difficulties around the protocol, data format, data schema, and its evolution.

This is where the Event Stream-processing platform comes in. As we discussed above, ESP Platforms provide an architecture that enables software to understand, react to, and operate as events occur.

What is Apache Kafka?

Kafka is a popular Event Stream-processing platform. Like many ESP platforms, it solves the data integration challenge by introducing decoupling between source and target systems:

Image source: Learn Apache Kafka for Beginners by Stéphane Maarek

Now, keeping in mind our example of loyalty and reward programs, our source and target systems could be these:

Image source: Learn Apache Kafka for Beginners by Stéphane Maarek

Apache Kafka will collect, categorize, and store all the data coming from your source systems, such as websites, pricing data, financial transactions, user interactions, etc. These source systems are referred to as “Producers” because they produce the Kafka data stream.

When the target systems need to receive data, they simply tap into the Kafka data stream to get it. The target systems are therefore referred to as “Consumers.” Kafka now sits in between, receiving data from producers and sending data to consumers.

How does it work?

Kafka functions very similarly to a message queue (e.g., RabbitMQ) but has a few enhanced capabilities. Kafka has the notion of producers and consumers, as discussed. A producer pushes messages to Kafka, while a consumer fetches them. Many messages might be passing through Kafka, so to distinguish between them and let you isolate different processing contexts, Kafka groups messages into topics.

Every producer that wants to publish something has to provide “a topic name.” On the other hand, consumers subscribe to a set of topics (there might be many of them at the same time) and then consume messages from those topics.

Image source: hevodata.com
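To make this concrete, here is a minimal producer sketch using Kafka’s official Java client. The broker address (localhost:9092), the topic name (“transactions”), and the sample key/value are illustrative assumptions, not part of the article’s setup:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every message is published to a named topic; "transactions" is illustrative.
            producer.send(new ProducerRecord<>("transactions", "card-1234", "{\"amount\": 42.50}"));
        }
    }
}
```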

To summarize, these are the key important things to retain about Kafka:

  • Producers publish messages to the queue and consumers fetch them for processing.
  • Consumers and producers work on groups of messages called topics, which let you isolate different kinds of messages.
  • Consumers are organized into consumer groups, which let you spread the workload across multiple instances of a consumer in the same group.
  • Consumers are ordinary client applications (Kafka ships a Java client, and clients exist for many other languages) and can be scaled to provide more (or less) processing power.
  • Every topic is divided into partitions — separate chunks of messages with order guarantees within one partition. The number of partitions can be configured as desired.
  • Each message is uniquely identified by topic name, partition number, and offset.
  • An offset is a message’s sequential number within its partition, counted from the first message ever written to it.
  • The committed offset is the offset stored in Kafka and used to resume processing after a consumer crash or restart. Think of it as a checkpoint (the consumer sketch after this list commits one explicitly).
  • The consumer position is an offset used internally by consumers to track which messages to fetch on the next poll.
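Tying several of these points together, here is a minimal consumer sketch, again using Kafka’s Java client. The group id (“rewards-service”) and topic (“transactions”) are assumed for illustration; the loop prints each message’s partition and offset, then commits its offset as a checkpoint:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class RewardsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "rewards-service");         // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");         // we commit offsets ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each message is identified by topic, partition, and offset.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // checkpoint: store the committed offset in Kafka
            }
        }
    }
}
```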

Why use Apache Kafka?

Kafka is an open-source project. It has a distributed, resilient, and fault-tolerant architecture (you can patch and maintain it without taking the whole system down), and it scales horizontally. The project aims to provide a unified, high-throughput, low-latency platform (less than 10 ms) for handling real-time data feeds.

Kafka is used by many organizations (such as Netflix, Uber, LinkedIn, etc.) and IT teams as a messaging system, an activity-tracking system, for stream processing, micro-service publication/subscription, application log gathering, metrics collection, decoupling of system dependencies, and integration with other Big Data technologies.

  • Netflix uses Kafka to apply recommendations in real time while a user is watching TV shows on their application.
  • Uber uses Kafka to gather user and trip data in real time to compute and forecast demand as well as surge pricing. That’s how prices on your Uber app change all the time, even for the same trip.

Kafka is a pretty cool platform. You can easily set up a single-node Kafka cluster on your laptop using Docker. To learn more, I’d recommend checking out the LinkedIn course Learn Apache Kafka for Beginners by Stéphane Maarek and Kafka 101 by Lukasz Chrzaszz.

If you enjoyed this article, you might also like What is Docker? ”In Simple English”

Cheers!!!
