I can finally start writing about Kafka. My Mina articles are already indexed and pinned at the top of my blog, so they're easy to find. This post begins an introductory series on distributed messaging systems.
When we make heavy use of distributed databases and distributed compute clusters, do we run into problems like these:

- I want to analyze user behavior (pageviews) so I can design better ad placements.
- I want to aggregate users' search keywords and spot current trends. (This one is fun: economics has a "long skirt theory" which says that when long-skirt sales go up, the economy is doing badly, because women can no longer afford all those stockings.)
- Some data feels like a waste to keep in a database, but writing it straight to disk makes me worry about inefficient access later.
This is where a distributed messaging system comes in. The description above leans toward a logging system, and indeed Kafka is heavily used as one in practice.
First we need to understand what a messaging system is. The Kafka website defines Kafka as "A distributed publish-subscribe messaging system". Publish-subscribe means exactly that: publishing and subscribing, so more precisely, Kafka is a system for publishing and subscribing to messages. The publish-subscribe concept matters, because Kafka's design philosophy starts from there.
For now, call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the storage array in the middle the broker. That gives us a rough picture:

The producers (blue in the diagram — blue-collar, always the ones doing the hard work) generate data and hand it to the broker for storage; when a consumer needs to consume data, it fetches the data from the broker and then runs whatever processing it needs on it.
At first glance this looks far too simple. Didn't we say it was distributed? Surely putting the producer, broker, and consumer on three different machines doesn't make it distributed. Let's look at the official Kafka diagram:

Multiple brokers cooperate; producers and consumers are embedded in the various pieces of business logic and called frequently; and ZooKeeper coordinates the requests and routing among the three. Together they form a high-performance distributed publish-subscribe messaging system. One detail in the diagram deserves attention: from producer to broker the transfer is a push — data is pushed to the broker as it is produced — while from broker to consumer it is a pull: the consumer actively fetches data, rather than the broker pushing data out to the consumer.
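The push/pull flow just described can be modeled with a toy in-memory broker. This is a sketch for intuition only — a real Kafka broker persists to disk and speaks a network protocol, and `Broker`, `push`, and `pull` here are made-up names, not Kafka's API:

```python
from collections import defaultdict, deque

class Broker:
    """A toy broker: holds messages per topic until a consumer pulls them."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def push(self, topic, message):
        # producer -> broker is a push: data arrives as it is produced
        self.topics[topic].append(message)

    def pull(self, topic, max_messages=10):
        # broker -> consumer is a pull: the consumer fetches at its own pace
        out = []
        while self.topics[topic] and len(out) < max_messages:
            out.append(self.topics[topic].popleft())
        return out

broker = Broker()
broker.push("pageviews", "user1 viewed /home")
broker.push("pageviews", "user2 viewed /ads")
print(broker.pull("pageviews"))
```

The asymmetry is the point: the broker never calls the consumer, so a slow consumer simply falls behind instead of being overwhelmed.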
Where exactly does this system get its high performance? The website describes it as follows:
- Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
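The O(1) claim comes from Kafka's storage being structured as an append-only log: writing is always an append at the tail, and reading is a jump to a known offset, neither of which gets slower as the log grows. A minimal in-memory sketch of that idea (the `Log` class is hypothetical, not Kafka code — the real log lives in disk segment files):

```python
class Log:
    """Append-only log: appending is O(1); consumers read by offset."""
    def __init__(self):
        self.entries = []

    def append(self, message):
        self.entries.append(message)   # O(1) amortized, regardless of log size
        return len(self.entries) - 1   # offset assigned to this message

    def read(self, offset, count=1):
        # jump straight to the offset; no scan over earlier messages
        return self.entries[offset:offset + count]

log = Log()
first = log.append("msg-a")
log.append("msg-b")
print(log.read(first, count=2))
```

Because consumers pull by offset, the broker also doesn't need to track per-message delivery state — each consumer just remembers how far it has read.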
Why the O(1) efficiency, and why the high throughput? Later posts will cover both; today the focus is still Kafka's design philosophy. With performance out of the way, let's see what Kafka can actually be used for. Beyond the scenarios I mentioned at the start, here is where Kafka is already running in production:
- LinkedIn - Apache Kafka is used at LinkedIn for activity stream data and operational metrics. This powers various products like LinkedIn Newsfeed, LinkedIn Today in addition to our offline analytics systems like Hadoop.
- Tumblr - http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
- Mate1.com Inc. - Apache kafka is used at Mate1 as our main event bus that powers our news and activity feeds, automated review systems, and will soon power real time notifications and log distribution.
- Tagged - Apache Kafka drives our new pub sub system which delivers real-time events for users in our latest game - Deckadence. It will soon be used in a host of new use cases including group chat and back end stats and log collection.
- Boundary - Apache Kafka aggregates high-flow message streams into a unified distributed pubsub service, brokering the data for other internal systems as part of Boundary's real-time network analytics infrastructure.
- DataSift - Apache Kafka is used at DataSift as a collector of monitoring events and to track user's consumption of data streams in real time. http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
- Wooga - We use Kafka to aggregate and process tracking data from all our facebook games (which are hosted at various providers) in a central location.
- AddThis - Apache Kafka is used at AddThis to collect events generated by our data network and broker that data to our analytics clusters and real-time web analytics platform.
- Urban Airship - At Urban Airship we use Kafka to buffer incoming data points from mobile devices for processing by our analytics infrastructure.
- Metamarkets - We use Kafka to collect realtime event data from clients, as well as our own internal service metrics, that feed our interactive analytics dashboards.
- SocialTwist - We use Kafka internally as part of our reliable email queueing system.
- Countandra - Countandra, a hierarchical distributed counting engine, uses Kafka as a primary speedy interface as well as for routing events for cascading counting.
- FlyHajj.com - We use Kafka to collect all metrics and events generated by the users of the website.
By now you should have a feel for what kind of system Kafka is, its basic structure, and what it can be used for. So let's return to the relationships among the producer, consumer, broker, and ZooKeeper.

Look at the diagram above, but with the broker count reduced to one. Suppose we deploy according to that diagram:
- Server-1 is the broker, which is really just the Kafka server itself, since both producers and consumers connect to it. The broker's main job is storage.
- Server-2 runs the ZooKeeper server. You can read up on ZooKeeper's exact role on its website; for now, picture it as maintaining a table that records each node's IP, port, and so on (later we'll see that it also stores Kafka-related metadata).
- Server-3, 4, and 5 all have a zkClient configured; more concretely, each must be configured with ZooKeeper's address before it runs. The reason is simple: the connections among them are all dispatched through ZooKeeper.
- As for Server-1 and Server-2: they can share one machine or be split across two, and ZooKeeper itself can also be run as a cluster. The point is to survive any single machine going down.
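The "table" that ZooKeeper maintains can be pictured as nothing more than a mapping from node names to addresses. This is a deliberately crude model — real ZooKeeper stores a tree of znodes with watch notifications, and the names and addresses below are made up:

```python
# Toy stand-in for ZooKeeper's registry: node name -> (IP, port).
# In the deployment above, producers and consumers consult this table
# instead of hard-coding the broker's address.
registry = {
    "broker-1":  ("192.168.0.1", 9092),  # hypothetical addresses
    "zookeeper": ("192.168.0.2", 2181),
}

def find_node(name):
    """Look up a node's address the way a zkClient resolves a broker."""
    host, port = registry[name]
    return f"{host}:{port}"

print(find_node("broker-1"))
```

This is why Servers 3-5 only need ZooKeeper's address in their configuration: everything else can be discovered from the table at runtime.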
In short, the whole system runs in this order:

1. Start the ZooKeeper server.
2. Start the Kafka server.
3. When a producer generates data, it first locates the broker through ZooKeeper, then stores the data in that broker.
4. When a consumer wants to consume data, it first finds the matching broker through ZooKeeper, then consumes from it.
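Assuming a local install of an early, ZooKeeper-based Kafka release, the four steps map roughly onto the scripts Kafka ships with (paths and the topic name are illustrative):

```shell
# 1. start the ZooKeeper server (Kafka bundles a convenience script and config)
bin/zookeeper-server-start.sh config/zookeeper.properties

# 2. start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# 3. a producer pushes data to the broker
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

# 4. a consumer finds the broker through ZooKeeper, then pulls data
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
```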
That's it for this first look at Kafka; next I'll write about how to set up a Kafka environment.