Kafka


Basics

  • Apache Kafka is a distributed streaming platform
  • centralized system for publishing and subscribing to data
  • allows for reading and re-reading data
  • "native" implementation written in Scala (and by extension, Java)

Common Uses

  • log aggregation
  • can support ML tasks (with help from third-party tools)
  • durable storage allows it to serve as a reliable data lake
  • storing events for event-sourced systems
  • external distributed commit log
  • stream-processing framework (Kafka Streams)
    • its newest capability; a minimal sketch follows this list
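
The "stream-processing framework" refers to Kafka Streams. A minimal sketch, assuming String-serialized input and output topics named raw-events and upper-events (both topic names and the localhost broker address are illustrative):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // app id, also used as the consumer group
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("raw-events", Consumed.with(Serdes.String(), Serdes.String()))
                   .mapValues(value -> value.toUpperCase())   // simple per-message transformation
                   .to("upper-events", Produced.with(Serdes.String(), Serdes.String()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();                                  // runs until the JVM is stopped
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }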

Architecture Overview

Broker

  • "the heart of Kafka"
  • where data resides
  • several servers running Brokers that coordinate together as a single unit form a cluster

Producer

  • produced data sent as messages
    • byte arrays with no specific format
    • can optionally be paired together with keys
    • typeless by default
  • can attach schema to messages
    • can be verified against and typecast
    • schema can be backed by JSON, XML, Avro, etc.
    • Avro tends to be the standard
  • Avro
    • serialization framework originally built for Hadoop
    • provides decent level of compression
    • schema is decoupled from the data itself
    • meaning code is (usually) not required to be regenerated upon schema change
    • provides support for backwards and forwards compatibility (a minimal schema sketch follows this list)
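
A minimal sketch of defining an Avro schema and building a record against it; the PageView record and its fields are purely illustrative, and actually publishing Avro-encoded messages to Kafka would additionally require an Avro-aware serializer:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSketch {
        // Hypothetical schema; the nullable "referrer" field (with a default)
        // is what makes adding or removing it a backwards/forwards-compatible change.
        private static final String SCHEMA_JSON =
            "{ \"type\": \"record\", \"name\": \"PageView\", \"fields\": ["
          + "  { \"name\": \"userId\", \"type\": \"string\" },"
          + "  { \"name\": \"url\", \"type\": \"string\" },"
          + "  { \"name\": \"referrer\", \"type\": [\"null\", \"string\"], \"default\": null }"
          + "] }";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord record = new GenericData.Record(schema);
            record.put("userId", "user-42");
            record.put("url", "/docs/kafka");
            System.out.println(record);   // prints the record as JSON
        }
    }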

Topic

  • specific name to which a message is assigned
  • scalability achieved through splitting topic into partitions
    • each partition is basically a separate and single log
  • consumption of a topic can scale out to a maximum equal to the topic's partition count
    • the cap is necessary to guarantee that messages within a partition are replayed in the same order in which they were received
      • to guarantee that, only one reader within a single consuming group (signified by ID) can be attached to any given partition
      • since there can only ever be one reader per partition at a time, the total partition count becomes the maximum parallelism capability
  • fault tolerance:
    • partitions handle parallelism factor
    • replication factor handles redundancy (see the topic-creation sketch after this list)
    • the cluster's metadata manager, ZooKeeper, handles leader election if a leader in the cluster goes down
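
A minimal sketch of creating a topic with an explicit partition count and replication factor via the Java AdminClient; the topic name, counts, and broker address are assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions = up to 6 consumers in one group can read in parallel;
                // replication factor 3 = each partition is stored on 3 brokers.
                NewTopic topic = new NewTopic("page-views", 6, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }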

Data

  • data backing is basically commit log (an append-only data structure)
    • guarantees ordering
  • each message appended to the log is given an offset
    • the offset allows consumers to replay data from any given point in the log (see the sketch after this list)
    • it is also what allows different reader applications to consume the same data at their own pace
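
A minimal sketch of replaying a partition from a chosen offset with the Java consumer's seek; the topic name, partition number, offset, and broker address are illustrative:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // no offsets committed for a one-off replay

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("page-views", 0);
                consumer.assign(Collections.singleton(partition)); // manual assignment, no consumer group needed
                consumer.seek(partition, 42L);                     // jump to offset 42 and replay from there

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }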

Consumers

  • consume messages from a topic
  • typically scaled out via grouping of consumer apps, known as consumer groups
  • each consumer
    • is assigned subset of topic(s) via its partitions
    • moves at own pace, keeping track of last processed message's offset separately

Stream

  • generally, movement of data from end to end: producer > broker > consumer

Brokers

  • hub/heart for all data, managing:
    • how it's stored
    • in which partition it's stored
    • other metadata-based tasks (including dealing with offsets)
  • usually exists as group of Brokers as a cluster (for scalability)
  • number of Brokers should be based on:
    • storage needs and growth
    • replication requirements (since that effectively serves as a multiplier)
      • for example:
        • expectation = store 100TB of data across 2 servers of 50TB each
        • actual = a replication factor of 2 doubles that to 200TB, meaning you need 4 servers (a sizing sketch follows this list)
    • amount of network traffic expected to be hitting Brokers
      • namely, how many expected Consumers and how often are you expecting them to be hitting the servers?
      • since each replica acts as a Consumer, that should also be considered in this estimate
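
A back-of-the-envelope sizing sketch using the numbers from the example above (disk only, ignoring network headroom):

    public class BrokerSizing {
        public static void main(String[] args) {
            double dataTb = 100;          // raw data to retain
            int replicationFactor = 2;    // each message is stored this many times
            double perBrokerTb = 50;      // usable disk per Broker

            double totalTb = dataTb * replicationFactor;           // 200 TB
            int brokers = (int) Math.ceil(totalTb / perBrokerTb);  // 4 Brokers
            System.out.printf("need %.0f TB total across %d Brokers%n", totalTb, brokers);
        }
    }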

Zookeeper

  • high-level coordinator/controller of the cluster, responsible for assigning out topic partitions and monitoring for failures
  • typically run as a cluster, known as an ensemble
    • should be an odd number of servers
      • 3 is ok
      • 5 is optimal
      • more than 7 likely results in degradation
    • can opt to run multiple Kafka clusters on same ensemble, using chroot to delineate each cluster:
      # KafkaCluster1 is `chroot`
      zookeeper.connect = host:port/KafkaCluster1
    • not suggested to share ensemble with other non-Kafka apps
      • sensitive to latency and timeouts
      • connection hiccups can make brokers act unpredictably
  • role in Kafka
    1. when Broker starts up, registers itself in ZooKeeper by creating ephemeral node using its unique ID
    2. the controller is simply the first Broker to create an ephemeral node named /controller
    3. subsequent Brokers will attempt to do the same, but will instead create a watch on the current controller node (a standalone sketch of this ephemeral-node pattern follows)
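
This is not Kafka's internal code, just a standalone illustration of the ephemeral-node-plus-watch pattern using the plain ZooKeeper client; the connect string, session timeout, and broker ID are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ControllerElectionSketch {
        public static void main(String[] args) throws Exception {
            // Plain ZooKeeper client; 6000ms session timeout, no default watcher.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 6000, event -> { });

            try {
                // The first client to create the ephemeral node "wins"; the node
                // disappears automatically if this client's session dies.
                zk.create("/controller", "broker-1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("this broker is now the controller");
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is already controller: watch the node so we get
                // notified (and can retry) when it goes away.
                zk.exists("/controller",
                          event -> System.out.println("controller changed: " + event.getType()));
                System.out.println("watching the current controller");
            }

            Thread.sleep(60_000);  // keep the session (and the ephemeral node / watch) alive briefly
        }
    }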

Producers

  • responsible for creating messages and sending them to a Broker
  • when possible, will wait to send messages in batches by Topic and Partition to reduce network overhead
    • NOTE: when no partition is assigned or calculated, it will be assigned using round-robin algorithm
  • managing speed of data production
    • can increase throughput by increasing number of threads used by Producer
    • if more is needed, can spin up more Producers (a minimal producer sketch follows this list)
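
A minimal Java producer sketch sending a keyed message; the topic name, key, value, and broker address are assumptions, and batching plus partition assignment happen inside the client:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);  // wait up to 10ms so sends can be batched

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // With a key, the partition is derived from the key's hash;
                // with a null key, the client spreads messages across partitions itself.
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "{\"url\": \"/docs/kafka\"}");
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("wrote to partition %d at offset %d%n",
                                          metadata.partition(), metadata.offset());
                    }
                });
            } // close() flushes any batched messages
        }
    }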

Consumers

  • responsible for flowing messages out from (i.e. reading Topic data out of) Broker
  • can track one or more Kafka Topics
  • must push offsets of data they've already processed back to Kafka cluster
    • in that sense, a Consumer is also a Producer
  • more than one is usually needed
  • Consumer groups consist of Consumers working together in coordinated system
    • (diagram: Consumer group with Partition assignments)
    • each Consumer shares Consumer group ID
    • the group coordinates which Consumers get which Partitions of data
    • since a Partition cannot be split, the max Consumers in a group should not exceed the total available Partitions
  • when Consumer has group ID, sends join group message to coordinator
    • first consumer to join becomes group's leader
  • with the leader set, the coordinator sends the Consumers it already knows about (or becomes aware of) down to the leader
    • leader then decides each Consumer's Partition assignments
    • once divvied up, sends assignments back to coordinator so that it can send each Consumer its assigned Partitions
    • this division of labor is rebalanced upon Consumers being added or removed from group
      • NOTE: message flows halted during this time
  • group coordinator determines Consumer availability via Consumers sending periodic pings (known as heartbeats)
    • if coordinator notices too many missed successive heartbeats, marks that Consumer as unavailable
  • NOTE: most of the above is handled for you by the Consumer client (a minimal consumer-group sketch follows)
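
A minimal consumer-group sketch in Java; the topic name, group ID, and broker address are assumptions, and partition assignment, rebalancing, and heartbeats are handled by the client underneath subscribe/poll:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-processors");     // shared by all group members
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // commit offsets explicitly

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("page-views"));  // join the group; partitions get assigned
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          record.partition(), record.offset(), record.value());
                    }
                    consumer.commitSync();  // push processed offsets back to the cluster
                }
            }
        }
    }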

Configuration

TODO 🙃


Customization & Production Readiness

TODO 🙃


DevOps: Monitoring

TODO 🙃


Resources
