What is Kafka?
This article dives into ZooKeeper, Kafka Connect, producers, consumers, and the basic architecture of Kafka.
What is Kafka?
Kafka is a streaming platform used for storing, reading and processing data. It was created by three engineers at LinkedIn who needed something faster and more scalable than traditional message queues (MQs).
Later in the article we will see why Kafka is faster than MQs.
Kafka topic
A Kafka topic is the feed to which producers push messages (records) and to which consumers subscribe in order to consume them. A topic is similar to a message queue, where data is pushed and popped; the difference is that messages are never popped from a Kafka topic. A topic can have a retention period configured, after which messages older than that period are automatically deleted.
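For illustration, here is a hedged sketch of creating a topic with a retention period using Kafka's Java AdminClient; the topic name, partition count, replication factor and broker address are placeholder assumptions, not values from this article.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2, and a 7-day retention period.
            NewTopic topic = new NewTopic("transactions", 3, (short) 2)
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```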
Commit log
A commit log is an append-only data structure that Kafka uses to store records. This makes the messaging system durable, persistent and sequential, and gives it better scalability and built-in fault tolerance, since the log can be replicated and retained even after a consumer has consumed it.
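As a purely conceptual sketch (not Kafka's actual implementation), an append-only log can be pictured as a structure where records are only ever appended and reads never remove anything:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of an append-only log: records are appended, never removed by reads. */
class ToyCommitLog {
    private final List<String> records = new ArrayList<>();

    /** Append a record and return its offset (its position in the log). */
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Read the record at a given offset; reading does not delete anything. */
    String read(long offset) {
        return records.get((int) offset);
    }
}
```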
Producer
A producer is an application that acts as the source of the data pushed to one or more Kafka topics. For example, it might push the details of every transaction that occurs in your application to Kafka so that other systems that need transaction data can consume it from the topic's log.
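A minimal producer sketch in Java, assuming a local broker and a hypothetical transactions topic, might look like this:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (here a transaction id) is optional; it influences partition assignment.
            producer.send(new ProducerRecord<>("transactions", "txn-42", "{\"amount\": 99.5}"));
            producer.flush();
        }
    }
}
```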
Consumer
A consumer is a client application that fetches data from a Kafka topic. For example, if another application needs the transaction data produced above, say to perform analysis on it, it will use consumer code to fetch that data from the Kafka cluster. Unlike traditional MQs, multiple consumers can consume from the same topic, each with its own offsets. Multiple instances of a consumer application are grouped together as a consumer group. The group is assigned a consumer group id, and the topic partition offset, i.e. the point up to which the application has consumed data, is saved against this id.
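A minimal consumer sketch, again assuming a local broker and the hypothetical transactions topic, shows how the group id ties instances of an application together:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TransactionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "transaction-analytics");   // instances sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transactions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```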
Kafka Brokers
Brokers are the servers that together form a Kafka cluster by sharing information with each other. Brokers hold the topics, partitions and commit logs. ZooKeeper runs alongside the Kafka brokers to maintain configuration and coordinate the cluster.
Zookeeper
Zookeeper is Apache software that provides a centralized service for maintaining naming and configuration data and for providing flexible and robust synchronization within distributed systems. It keeps track of the status of Kafka cluster nodes as well as Kafka topics, partitions and so on.
Zookeeper is used by Kafka for:
- Electing a controller: In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions.
- Keeping track of which brokers are alive.
- Keeping topics’ metadata, i.e., which topics exist, how many partitions each topic has, where the replicas are, which broker is the preferred leader, and what configuration overrides are set for each topic.
- Keeping quota data, i.e., how much data each client is allowed to read and write.
- Tracking who is allowed to read and write to which topic, which consumer groups exist, who their members are, and the latest offset each consumer group has for each partition.
Zookeeper uses ZAB (Zookeeper Atomic Broadcast), which is the brain of the whole system. In a system of multiple processes, all correct processes receive broadcasts of the same set of messages. It is termed atomic because a broadcast either eventually completes at all correct participants or all participants abort without side effects. An atomic broadcast should guarantee that:
- If one correct server broadcasts a message, then all correct servers will eventually receive it.
- If one correct server receives a message, then all correct servers will eventually receive that message.
- A message is received by each server at most once, and only if it was previously broadcast.
- The messages are totally ordered i.e. if one correct server receives message 1 first and then message 2, then every other server participant must receive message 1 before message 2.
- When nodes fail and restart, they should catch up with others.
Kafka Streams
Kafka Streams is an API built on top of the producer and consumer APIs and is more powerful than the plain producer and consumer clients. Kafka Streams helps with more complex processing of records.
With Kafka Streams, along with the messaging service capability, Kafka also handles the streaming logic for you, so that you can concentrate on your business logic.
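As a hedged sketch of the Streams API (topic names are placeholders), the following topology reads from one topic, transforms each record, and writes the result to another topic:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transactions-uppercase");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions = builder.stream("transactions");
        // Transform each record value and write the result to another topic.
        transactions.mapValues(value -> value.toUpperCase())
                    .to("transactions-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```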
Kafka Connect
Kafka Connect is a tool for streaming data between Kafka and other data stores. It makes it easier to move large datasets, such as an entire database or metrics collected from an app server, into and out of Kafka topics. There are 2 types of connectors:
- Sink connector: streams data from Kafka to other data stores such as Elasticsearch or Hadoop
- Source connector: delivers data from other data stores into Kafka
Check out the list of available connectors here.
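As a rough illustration, a source connector is typically configured with a small properties file; the example below uses the FileStreamSource connector that ships with Kafka, and the file path and topic name are placeholder assumptions:

```properties
# connect-file-source.properties (illustrative values)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/transactions.txt
topic=transactions
```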
Inner Workings of Kafka:
Since Confluent's distribution is the most widely used, this article uses its setup to explain the architecture.
Let's understand the basic workings of Kafka. There are 3 main components: the Kafka cluster, the producer and the consumer. We will explore the roles of these components step by step.
Producer’s role:
- A producer creates data, typically in JSON format. The message is pushed to the Kafka cluster using the producer API. We can add a key to a message, which is used to assign the message to a topic partition. For example, if GUIDs are used as keys, Kafka will use each message's GUID to decide which partition it goes to. Keys are optional; when a key is not present, the producer's partitioner assigns a partition to the first message at random and then uses round robin from the second message onwards.
- Look at this exhaustive list of configurations to customise the producer according to the needs of the application. All of them have default values, so if your application only needs a very basic implementation, you need not worry about the configurations. But if you care about persistence, ordering, timeouts, etc., it is recommended that you go through them.
Zookeeper’s role:
- A Kafka cluster can be configured with a certain number of partitions and replica partitions. The produced message reaches one of the brokers in the Kafka cluster. Using the key, a partition is then assigned to this message.
- The key to understanding the essence of Kafka is to look at the importance of Zookeeper.
- Once a partition is assigned, the message is stored in the broker’s I/O buffer. If the producer requested one, an acknowledgement (ACK) is sent back to it. Kafka relies on the operating system to flush messages to permanent storage.
- Since an ACK can be sent before the data moves from the I/O buffer to persistent storage, there is a chance of losing the data if, say, the broker crashes after sending the ACK but before saving it. This is where replica partitions come into the picture. Once the leader partition receives a message, it forwards it to the follower replica partitions. If the producer requests an ACK only after the data is saved to all replicas’ I/O buffers, the leader waits for acknowledgements from the followers and, once it receives them from all active followers, sends an ACK back to the producer. It is highly unlikely that all the replicas go down before pushing the data to permanent storage. Replicas and ACKs are not mandatory, but they are highly recommended (see the configuration sketch after this list).
- A leader epoch number is used to make sure that follower replicas only accept messages from the current leader. The leader epoch is a 32-bit number that increments every time a leader goes down and/or a new leader is elected. It is attached to the messages the leader sends to the followers, so that replicas ignore messages from an outdated leader once it comes back up.
- Any new message is added to the end of the log. The offset up to which messages have been copied to all the replicas is called the high watermark offset. A consumer can read only up to the high watermark offset.
- Partitions truncate their data to the high watermark offset. If the leader goes down before sending data to the followers, the high watermark isn’t updated; once a new leader is elected, all the followers, including the previous leader, truncate their logs to the high watermark offset.
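As a hedged configuration sketch for the ACK and replica behaviour described above (values are illustrative assumptions, not recommendations):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // acks=0   : don't wait for any acknowledgement
        // acks=1   : wait for the leader only (data can be lost if the leader crashes)
        // acks=all : wait until all in-sync replicas have the record
        props.put("acks", "all");

        // On the topic side, the topic-level config min.insync.replicas (e.g. 2) would
        // additionally require that many replicas to have the record before the leader
        // acknowledges a write made with acks=all.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as in the earlier producer sketch
        }
    }
}
```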
Consumer’s role:
- The consumer subscribes to the Kafka topic it needs to consume from. The consumer can also be configured using this exhaustive list of configurations.
- Auth can be added to Kafka to make sure consumers can only read the topics they have access to. Enabling auth is not mandatory.
- Kafka uses the consumer group id to store consumer-related data, for example the offset up to which the group has consumed or which topics it is authorized to access. Kafka consumers belonging to the same consumer group share a group id. The consumers in a group then divide the topic partitions amongst themselves so that each partition is consumed by only a single consumer from the group.
- A topic’s partition can have only one consumer instance per consumer group attached to it. Therefore, there should only be as many consumers under a consumer group id as there are topic partitions, otherwise the remaining consumers will stay idle. For example, if there are 3 topic partitions and the consumer application is scaled to 5 instances, there are now 5 consumers under the consumer group id trying to consume from the topic; since only 3 will be allowed to consume, the remaining 2 instances will stay idle.
- Consumer should subscribe to the topics that it wants to consume.
- Once the topic is subscribed, whenever any message is pushed into that topic it will be consumed by the consumer.
Why is Kafka faster than MQs
- Sequential I/O: Kafka uses a segmented, append-only log structure for reads and writes, which makes them sequential. The perception that “disks are slow” is misleading; sequential I/O can be up to 6,000 times faster than random I/O. A modern OS provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes.
- Record batching and compression: Kafka batches the records to be read or written and sends them over the network together, avoiding the network overhead of sending each record separately. This is further improved by compressing the batches before sending them over the network (see the configuration sketch after this list).
- Sticky partitioning: when a Kafka producer sends a record to a topic, it needs to decide which partition to send it to. Records that are not sent to the same partition cannot form a batch together. The Partitioner class provided by Kafka uses keys to decide partitions, so what about records whose keys are null? Records with a null key used round robin, as mentioned earlier, which adds latency. To improve this, a sticky partitioning strategy is used from Apache Kafka version 2.4 onwards: the sticky partitioner picks a random partition and sends all records to it; once the batch at that partition is filled or otherwise completed, it randomly chooses and “sticks” to a new partition.
- Consuming: unlike traditional MQs, Kafka doesn’t pop or delete records as soon as they are read, so they can be consumed by multiple consumers at the same time. In the background, records can still be deleted by setting a retention period.
- Buffer writes: Kafka sends an ACK once a message is written to the I/O buffer and doesn’t wait for it to be flushed to disk. This is why it is recommended to use Kafka with replicas.
- Zero copy: Kafka typically uses zero-copy optimization for transferring data from the file system cache to the network socket when sending messages from brokers to consumers
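As a hedged illustration of the batching, compression and sticky-partitioning points above (values are examples, not tuned recommendations), these behaviours are controlled through producer configuration; since Apache Kafka 2.4 the default partitioner already applies the sticky strategy to records without keys:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BatchingProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        props.put("batch.size", "32768");     // bytes collected per partition before a batch is sent
        props.put("linger.ms", "10");         // wait up to 10 ms to let batches fill up
        props.put("compression.type", "lz4"); // compress whole batches before they go on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With Kafka 2.4+, records with null keys are grouped onto one partition per
            // batch by the default (sticky) partitioner, so batches fill faster.
        }
    }
}
```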