Apache Kafka

Apache Kafka® runs as a cluster of Kafka servers. A Kafka cluster stores streams of records in categories called topics. These streams of records - event streams - form a continuously updated data set of unknown or unbounded size, reliably published as a stream of messages.

The following shows a completed pipeline in Xapix using a Kafka event stream and several other data sources. It contains the usual elements of a pipeline built for a REST endpoint and is built with the same techniques.

Kafka pipeline example

Topics

The basic unit of data within Kafka is a message. In a Kafka cluster, streams of messages are categorized into topics, which are analogous to a database table or a file-system folder. The topic is the key abstraction of a Kafka event stream: each topic is a stream of "related" messages.

A topic is divided into ordered partitions. Within a given Consumer Group, each partition is consumed by ONLY one Consumer.

Each message contains metadata and a body. The body contains a key/value pair, while the metadata includes information such as a timestamp.
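As a sketch of that message structure, the snippet below publishes one keyed message using the confluent-kafka Python client; the broker address, topic name, and payload are illustrative assumptions.

```python
from confluent_kafka import Producer

# Hypothetical broker address - adjust for your cluster.
producer = Producer({"bootstrap.servers": "localhost:9092"})

# The body is a key/value pair; Kafka attaches metadata such as
# a timestamp to the message when it is produced.
producer.produce("orders", key="order-42", value='{"item": "book"}')

# Block until delivery to the broker is confirmed.
producer.flush()
```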

Producers and Consumers

A Kafka event stream contains a single topic, which is streamed between two types of Kafka clients - Producers and Consumers. Producers create new messages while Consumers read them. For a Consumer to read the published messages, it must be subscribed to the topic.

For example, a Producer writes data as a message and publishes the message to a specific topic. A Consumer that is subscribed to that topic will receive (or consume) all messages published to the topic.

Kafka event stream

Kafka event streams have an N-to-N relation to Producers: one or many Producers can publish to one or many topics. Likewise, one or many Event Producers can be related to one or many Event Consumers.

Consumers pull (fetch) messages from one or many topics by repeatedly requesting the next message. If the Consumer receives a null response, then no new message is available, and the Consumer simply requests again, forming a continuous fetch loop. A Consumer Offset indicates which message is to be read next within a partition.
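A minimal sketch of that fetch loop, using the confluent-kafka Python client with assumed broker, group, and topic names:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "order-readers",            # hypothetical Consumer Group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

# Continually request the next message; poll() returns None
# when no new message is available within the timeout.
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue  # null response - no new message yet
    if msg.error():
        raise RuntimeError(msg.error())
    print(msg.offset(), msg.key(), msg.value())
```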

Consumer Groups

Each Consumer can be part of a Consumer Group. This provides scalability for consumption and load balancing. With Consumer Groups, multiple Consumers divide the topic's partitions among themselves, so that N Consumers can read N messages at the same time. The Consumer Group receives messages from the topics that the Consumers in the Group are subscribed to. This leads to high throughput and low latency.
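As a sketch of how a Group is formed in client configuration: every Consumer started with the same group.id (a hypothetical name below) joins one Consumer Group, and Kafka divides the topic's partitions among the members.

```python
from confluent_kafka import Consumer

# Each process started with this group.id joins the same
# Consumer Group; the partitions are split among the members.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed
    "group.id": "order-readers",            # shared Group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

# Running this script in two terminals yields two Group members,
# each consuming a disjoint subset of the partitions in parallel.
```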

Initial Offset (Position)

The initial offset defines the Consumer behavior when a Consumer Group is initialized. It is a configurable policy that sets the initial position in an assigned partition from which a Consumer begins to read and process messages. It is a required setting: consumption starts at either the earliest or the latest offset.
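In Kafka consumer clients this policy is exposed as the auto.offset.reset configuration property; a minimal sketch with assumed names:

```python
from confluent_kafka import Consumer

# When the Group has no saved offset for a partition,
# auto.offset.reset decides where consumption starts:
#   "earliest" - begin at the oldest available message
#   "latest"   - begin at the next message produced
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed
    "group.id": "order-readers",            # assumed
    "auto.offset.reset": "earliest",
})
```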

Kafka must keep track of what has been consumed. A Consumer's position in a partition is marked by a single integer: the offset of the next message that needs to be consumed.

Earliest

Earliest tells a Consumer that it must:

  1. Read messages.

  2. Process messages.

  3. Save its position in the log.

When set to Earliest, the Consumer process could crash after processing messages but before saving its position. After a restart, it would then reprocess a few messages it had already processed. This corresponds to “at-least-once” semantics in the case of Consumer failure. In many cases messages have a primary key, so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
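A sketch of this at-least-once pattern with manual commits (confluent-kafka; the names and the handle_message processing step are assumptions):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-readers",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # we save the position ourselves
})
consumer.subscribe(["orders"])

def handle_message(msg):
    """Hypothetical processing step, e.g. an idempotent upsert by key."""
    print(msg.key(), msg.value())

while True:
    msg = consumer.poll(timeout=1.0)       # 1. read
    if msg is None or msg.error():
        continue
    handle_message(msg)                    # 2. process
    # 3. Save the position last. A crash between processing and
    #    this commit means the message is reprocessed on restart.
    consumer.commit(message=msg, asynchronous=False)
```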

Latest

Latest tells a Consumer that it must:

  1. Read messages.

  2. Save its position in the log.

  3. Process messages.

When set to Latest, the Consumer process could crash after saving its position in the log but before saving the output of its message processing. After a restart, it would then begin at the saved position even though some messages before that position were never processed. This corresponds to “at-most-once” semantics: in the case of a Consumer failure, some messages may not be processed.
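The mirror-image sketch for at-most-once: the position is saved before processing (same assumed names as above):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed
    "group.id": "order-readers",            # assumed
    "auto.offset.reset": "latest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(timeout=1.0)                  # 1. read
    if msg is None or msg.error():
        continue
    consumer.commit(message=msg, asynchronous=False)  # 2. save position
    # 3. Process. A crash here means the message is never
    #    processed after restart: at-most-once.
    print(msg.key(), msg.value())  # hypothetical processing step
```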

Sinks and pipelines

In a Kafka event stream, a Sink exports data to an external system. In Xapix, however, the Sink terminates the pipeline.

Kafka event stream in Xapix

In a Xapix pipeline, an Event unit connects to a Kafka event stream. The topic data is then orchestrated and transformed based on the design of the pipeline, and the transformed data is made available to external applications via webhooks.