kafka-0.10.0 Official Website Translation (1) Getting Started Guide


1.1 Introduction

Kafka is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

- It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.

What is Kafka good for? It gets used for two broad classes of application:

- Building real-time streaming data pipelines that reliably get data between systems or applications.
- Building real-time streaming applications that transform or react to the streams of data.

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First, a few concepts:

- Kafka is run as a cluster on one or more servers.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
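As a quick, hedged illustration of the Producer API (this example is not from the original post; the broker address, topic name, and key/value are assumptions), here is a minimal sketch with the 0.10 Java client that publishes a single record:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Publish one record (key "user-1", value "hello") to the hypothetical topic "my-topic".
        producer.send(new ProducerRecord<>("my-topic", "user-1", "hello"));
        producer.close();
    }
}
```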

  In Kafka the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
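If you want to follow along with the Java client, it ships as the kafka-clients artifact on Maven Central; a dependency sketch matching the release this post covers:

```xml
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
</dependency>
```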

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records: the topic. A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

  For each topic, the Kafka cluster maintains a partitioned log that looks like this:

[Figure: anatomy of a topic; each partition is an ordered, append-only log of records]
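As a sketch of how such a partitioned topic is created, you can use the kafka-topics.sh tool that ships with 0.10.0 (the topic name and partition count here are arbitrary, and a local ZooKeeper on its default port is assumed):

```sh
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 3 --topic my-topic
```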

  Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

  The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.

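The two-day policy in the example above corresponds to the broker's log.retention.hours setting in server.properties; a minimal sketch:

```properties
# Retain log segments for two days (48 hours) before discarding them to free space.
log.retention.hours=48
```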

  In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records but, since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now".

  This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

  The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.
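Because the consumer controls its own position, "resetting to an older offset" as described above is just a seek. A minimal sketch with the 0.10 Java consumer that rewinds one partition to the beginning and reprocesses it (broker address, group id, topic, and partition number are all assumptions for illustration):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "rewind-demo");             // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        TopicPartition partition = new TopicPartition("my-topic", 0);
        consumer.assign(Collections.singletonList(partition));
        // Reset the offset to the start of the partition to reprocess old data.
        consumer.seekToBeginning(Collections.singletonList(partition));

        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset=%d key=%s value=%s%n",
                    record.offset(), record.key(), record.value());
        }
        consumer.close();
    }
}
```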

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a second!
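To make the two strategies concrete, here is a hedged sketch with the 0.10 Java producer: an unkeyed record, which the default partitioner spreads across partitions (round-robin in this client version), versus a keyed record, which is hashed by its key so that all records for that key land in the same partition. The topic, key, and values are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // No key: the default partitioner balances records across partitions.
        producer.send(new ProducerRecord<>("my-topic", "load-balanced value"));

        // With a key: every record for "user-42" hashes to the same partition,
        // which preserves the relative order of that user's records.
        producer.send(new ProducerRecord<>("my-topic", "user-42", "keyed value"));

        producer.close();
    }
}
```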

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
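A minimal sketch of a group consumer with the 0.10 Java client (the group id and topic are hypothetical): run several copies with the same group.id and the topic's partitions are divided among them; give each copy its own group.id and every record is delivered to each of them:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "logical-subscriber-A");    // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));

        // Each instance in the group is assigned a "fair share" of the topic's partitions.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```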

[Figure: A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.]

  More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

  The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

  Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records, this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees

At a high level, Kafka gives the following guarantees:

- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.