Kafka


Basics

  • Apache Kafka is a distributed streaming platform
  • centralized system for publishing and subscribing to data
  • allows for reading and re-reading data
  • "native" implementation written in Scala (and by extension, Java)

Common Uses

  • log aggregation
  • can support ML tasks (with help from third-party tools)
  • durable storage allows it to serve as a reliable data lake
  • storing events for event-sourced systems
  • external distributed commit log
  • stream-processing framework (Kafka Streams)
    • its newest capability; a minimal sketch follows this list
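
The "stream-processing framework" refers to Kafka Streams. A minimal sketch, assuming String-serialized input and output topics named raw-events and upper-events (both topic names and the localhost broker address are illustrative):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // app id, also used as the consumer group
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("raw-events", Consumed.with(Serdes.String(), Serdes.String()))
                   .mapValues(value -> value.toUpperCase())   // simple per-message transformation
                   .to("upper-events", Produced.with(Serdes.String(), Serdes.String()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();                                  // runs until the JVM is stopped
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }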

Architecture Overview

Broker

  • "the heart of Kafka"
  • where data resides
  • several servers running Brokers that coordinate together as a single unit form a cluster

Producer

  • produced data sent as messages
    • byte arrays with no specific format
    • can optionally be paired together with keys
    • typeless by default
  • can attach schema to messages
    • can be verified against and typecast
    • schema can be backed by JSON, XML, Avro, etc.
    • Avro tends to be the standard
  • Avro
    • serialization framework originally built for Hadoop
    • provides decent level of compression
    • schema is decoupled from the data itself
    • meaning code is (usually) not required to be regenerated upon schema change
    • provides support for backwards and forwards compatibility (a minimal schema sketch follows this list)
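
A minimal sketch of defining an Avro schema and building a record against it; the PageView record and its fields are purely illustrative, and actually publishing Avro-encoded messages to Kafka would additionally require an Avro-aware serializer:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSketch {
        // Hypothetical schema; the nullable "referrer" field (with a default)
        // is what makes adding or removing it a backwards/forwards-compatible change.
        private static final String SCHEMA_JSON =
            "{ \"type\": \"record\", \"name\": \"PageView\", \"fields\": ["
          + "  { \"name\": \"userId\", \"type\": \"string\" },"
          + "  { \"name\": \"url\", \"type\": \"string\" },"
          + "  { \"name\": \"referrer\", \"type\": [\"null\", \"string\"], \"default\": null }"
          + "] }";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord record = new GenericData.Record(schema);
            record.put("userId", "user-42");
            record.put("url", "/docs/kafka");
            System.out.println(record);   // prints the record as JSON
        }
    }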

Topic

  • specific name to which a message is assigned
  • scalability achieved through splitting topic into partitions
    • each partition is basically a separate and single log
  • consumption of a topic can scale out to a maximum equal to the topic's partition count
    • the cap is necessary to guarantee that messages within a partition are replayed in the same order in which they were received
      • to guarantee that, only one reader within a single consuming group (signified by ID) can be attached to any given partition
      • since there can only ever be one reader per partition at a time, the total partition count becomes the maximum parallelism capability
  • fault tolerance:
    • partitions handle parallelism factor
    • replication factor handles redundancy (see the topic-creation sketch after this list)
    • the cluster's metadata manager, ZooKeeper, handles leader election if a leader in the cluster goes down
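
A minimal sketch of creating a topic with an explicit partition count and replication factor via the Java AdminClient; the topic name, counts, and broker address are assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions = up to 6 consumers in one group can read in parallel;
                // replication factor 3 = each partition is stored on 3 brokers.
                NewTopic topic = new NewTopic("page-views", 6, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }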

Data

  • data backing is basically commit log (an append-only data structure)
    • guarantees ordering
  • each message appended to the log is given an offset
    • the offset allows consumers to replay data from any given point in the log (see the sketch after this list)
    • it is also what allows different reader applications to consume the same data at their own pace
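
A minimal sketch of replaying a partition from a chosen offset with the Java consumer's seek; the topic name, partition number, offset, and broker address are illustrative:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // no offsets committed for a one-off replay

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("page-views", 0);
                consumer.assign(Collections.singleton(partition)); // manual assignment, no consumer group needed
                consumer.seek(partition, 42L);                     // jump to offset 42 and replay from there

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }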

Consumers

  • consume messages from a topic
  • typically scaled out via grouping of consumer apps, known as consumer groups
  • each consumer
    • is assigned subset of topic(s) via its partitions
    • moves at own pace, keeping track of last processed message's offset separately

Stream

  • generally, movement of data from end to end: producer > broker > consumer

Brokers

  • hub/heart for all data, managing:
    • how it's stored
    • in which partition it's stored
    • other metadata-based tasks (including dealing with offsets)
  • usually exists as group of Brokers as a cluster (for scalability)
  • number of Brokers should be based on:
    • storage needs and growth
    • replication requirements (since that effectively serves as a multiplier)
      • for example:
        • expectation = store 100TB of data across 2 servers of 50TB each
        • actual = a replication factor of 2 doubles that to 200TB, meaning you need 4 servers (a sizing sketch follows this list)
    • amount of network traffic expected to be hitting Brokers
      • namely, how many expected Consumers and how often are you expecting them to be hitting the servers?
      • since each replica acts as a Consumer, that should also be considered in this estimate
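
A back-of-the-envelope sizing sketch using the numbers from the example above (disk only, ignoring network headroom):

    public class BrokerSizing {
        public static void main(String[] args) {
            double dataTb = 100;          // raw data to retain
            int replicationFactor = 2;    // each message is stored this many times
            double perBrokerTb = 50;      // usable disk per Broker

            double totalTb = dataTb * replicationFactor;           // 200 TB
            int brokers = (int) Math.ceil(totalTb / perBrokerTb);  // 4 Brokers
            System.out.printf("need %.0f TB total across %d Brokers%n", totalTb, brokers);
        }
    }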

Zookeeper

  • high-level coordinator/controller of the cluster, responsible for assigning out topic partitions and monitoring for failures
  • typically run as a cluster, known as an ensemble
    • should be an odd number of servers
      • 3 is ok
      • 5 is optimal
      • more than 7 likely results in degradation
    • can opt to run multiple Kafka clusters on same ensemble, using chroot to delineate each cluster:
      # KafkaCluster1 is `chroot`
      zookeeper.connect = host:port/KafkaCluster1
    • not suggested to share ensemble with other non-Kafka apps
      • sensitive to latency and timeouts
      • connection hiccups can make brokers act unpredictably
  • role in Kafka
    1. when Broker starts up, registers itself in ZooKeeper by creating ephemeral node using its unique ID
    2. the controller is simply the first Broker to create an ephemeral node named /controller
    3. subsequent Brokers will attempt to do the same, but will instead create a watch on the current controller node (a standalone sketch of this ephemeral-node pattern follows)
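
This is not Kafka's internal code, just a standalone illustration of the ephemeral-node-plus-watch pattern using the plain ZooKeeper client; the connect string, session timeout, and broker ID are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ControllerElectionSketch {
        public static void main(String[] args) throws Exception {
            // Plain ZooKeeper client; 6000ms session timeout, no default watcher.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 6000, event -> { });

            try {
                // The first client to create the ephemeral node "wins"; the node
                // disappears automatically if this client's session dies.
                zk.create("/controller", "broker-1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("this broker is now the controller");
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is already controller: watch the node so we get
                // notified (and can retry) when it goes away.
                zk.exists("/controller",
                          event -> System.out.println("controller changed: " + event.getType()));
                System.out.println("watching the current controller");
            }

            Thread.sleep(60_000);  // keep the session (and the ephemeral node / watch) alive briefly
        }
    }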

Producers

  • responsible for creating messages and sending them to a Broker
  • when possible, will wait to send messages in batches by Topic and Partition to reduce network overhead
    • NOTE: when no partition is assigned or calculated, it will be assigned using round-robin algorithm
  • managing speed of data production
    • can increase throughput by increasing number of threads used by Producer
    • if more is needed, can spin up more Producers (a minimal producer sketch follows this list)
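
A minimal Java producer sketch sending a keyed message; the topic name, key, value, and broker address are assumptions, and batching plus partition assignment happen inside the client:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);  // wait up to 10ms so sends can be batched

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // With a key, the partition is derived from the key's hash;
                // with a null key, the client spreads messages across partitions itself.
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "{\"url\": \"/docs/kafka\"}");
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("wrote to partition %d at offset %d%n",
                                          metadata.partition(), metadata.offset());
                    }
                });
            } // close() flushes any batched messages
        }
    }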

Consumers

  • responsible for flowing messages out from (i.e. reading Topic data out of) Broker
  • can track one or more Kafka Topics
  • must push offsets of data they've already processed back to Kafka cluster
    • in that sense, a Consumer is also a Producer
  • more than one is usually needed
  • Consumer groups consist of Consumers working together in coordinated system
    • (diagram: Consumer group with Partition assignments)
    • each Consumer shares Consumer group ID
    • the group coordinates which Consumers get which Partitions of data
    • since a Partition cannot be split, the max Consumers in a group should not exceed the total available Partitions
  • when Consumer has group ID, sends join group message to coordinator
    • first consumer to join becomes group's leader
  • with the leader set, the coordinator sends the Consumers it already knows about (or becomes aware of) down to the leader
    • leader then decides each Consumer's Partition assignments
    • once divvied up, sends assignments back to coordinator so that it can send each Consumer its assigned Partitions
    • this division of labor is rebalanced upon Consumers being added or removed from group
      • NOTE: message flows halted during this time
  • group coordinator determines Consumer availability via Consumers sending periodic pings (known as heartbeats)
    • if coordinator notices too many missed successive heartbeats, marks that Consumer as unavailable
  • NOTE: most of the above is handled for you by the Consumer client (a minimal consumer-group sketch follows)
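
A minimal consumer-group sketch in Java; the topic name, group ID, and broker address are assumptions, and partition assignment, rebalancing, and heartbeats are handled by the client underneath subscribe/poll:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-processors");     // shared by all group members
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // commit offsets explicitly

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("page-views"));  // join the group; partitions get assigned
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          record.partition(), record.offset(), record.value());
                    }
                    consumer.commitSync();  // push processed offsets back to the cluster
                }
            }
        }
    }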

Configuration

TODO 🙃


Customization & Production Readiness

TODO 🙃


DevOps: Monitoring

TODO 🙃


Resources
