An Overview of Open-Source, Cross-Platform Data Formatting Frameworks - gaochundong - IT Engineer's Digital Notebook


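No discussion of data formatting frameworks is complete without Google's Protocol Buffers, Facebook's Thrift, and Avro from the Apache Hadoop project. Bond, recently open-sourced by Microsoft, is another extensible framework for formatting data; its intended scenarios include service-to-service communication and big data storage and processing.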

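Why are there so many frameworks for handling data formats, and what problems are they solving? Let us first look at the structure of typical service-to-service communication.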

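When designing communication between services, we usually face these basic questions:

How is the data transported?
Which protocol is used for communication?
In what format is the data expressed?
How does the server handle data requests?
How is the data stored on the server?
How are request messages routed or forwarded?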

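As the service architecture evolves, more concerns appear:

The ability to adapt to architectural evolution
The ability to adapt to cluster scale-out
Flexibility
Latency
Simplicity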

So which technologies did we use to solve these problems in the past?

Binary transmission of dynamic C structs
DCOM, COM+
CORBA
SOAP
XML, JSON

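These are all familiar names. In practice, C structs are still widely used in low-level network communication, while DCOM, CORBA, and SOAP are gradually leaving the stage. Today the most popular approach is serialization based on XML or JSON.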

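But XML and JSON bring problems of their own:

The communication protocol has to be described separately
Contract code has to be maintained on both the server side and the client side
Wrapper classes have to be written for the protocol
An implementation has to be written for each programming language
Parsing XML and JSON is relatively expensive
Messages take up relatively more storage space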
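
To make the last point concrete, here is a minimal Python sketch that encodes the same record once as JSON and once in a fixed binary layout. The record, its field names, and the binary layout (a 1-byte string length, the UTF-8 string, then two little-endian 32-bit integers) are assumptions made up for illustration; they are not the wire format of any of the frameworks discussed here.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"query": "hello", "page_number": 2, "result_per_page": 10}

# Text encoding: field names and punctuation travel with every message.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Assumed binary layout, agreed on out of band by both sides:
# 1-byte string length, the UTF-8 string, then two little-endian int32 values.
query = record["query"].encode("utf-8")
layout = "<B%dsii" % len(query)
binary_bytes = struct.pack(
    layout, len(query), query, record["page_number"], record["result_per_page"]
)

# Reading it back requires knowing the layout; nothing in the bytes describes it.
length, raw_query, page_number, result_per_page = struct.unpack(layout, binary_bytes)

print(len(json_bytes), len(binary_bytes))
```

The binary form is much smaller, but the layout lives only in the code on both sides. That is the gap an IDL plus generated code is meant to fill.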

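So, from a software designer's point of view, what do we need most from a data processing and serialization framework?

Transparency across programming languages
Time and space efficiency
Support for rapid development
The ability to reuse existing libraries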

Developers at well-known companies have therefore released different frameworks to address these problems, including Google's Protocol Buffers, Facebook's Thrift, Apache Hadoop's Avro, and Microsoft's Bond.

	Bond	Protocol Buffers	Thrift	Avro
Origin	Microsoft	Google	Facebook	Apache
Year open-sourced	2014	2008	2007	2009
License	MIT License	BSD License	Apache License 2.0	Apache License 2.0
Code location	GitHub	GitHub	Apache	Apache
Official documentation	Documents	Documents	Documents	Documents

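These frameworks have several things in common:

Messages are defined in an IDL (Interface Description Language)
Relatively high performance
Support for version evolution
A binary format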

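A typical workflow with these frameworks:

Write struct-like message definitions in an IDL-like language.
Run a code generation tool to produce code in the target language.
A lot of code is generated, but it is fairly readable.
Use the generated code directly in your program.
The generated code must not be edited by hand.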

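In other words, the user first defines the data structures, then generates code that can read and write them efficiently, and finally embeds that code in the server and the client.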

For example, the file search.proto below uses Protocol Buffers to define a search request message.

package serializers.protobuf.test;

message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3 [default = 10];
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
  }
  optional Corpus corpus = 4 [default = UNIVERSAL];
}

Running the code generation tool produces the following C# code.

namespace serializers.protobuf.test
{
  [global::System.Serializable, global::ProtoBuf.ProtoContract(Name=@"SearchRequest")]
  public partial class SearchRequest : global::ProtoBuf.IExtensible
  {
    public SearchRequest() {}

    private string _query;
    [global::ProtoBuf.ProtoMember(1, IsRequired = true, Name=@"query", DataFormat = global::ProtoBuf.DataFormat.Default)]
    public string query
    {
      get { return _query; }
      set { _query = value; }
    }
    private int _page_number = default(int);
    [global::ProtoBuf.ProtoMember(2, IsRequired = false, Name=@"page_number", DataFormat = global::ProtoBuf.DataFormat.TwosComplement)]
    [global::System.ComponentModel.DefaultValue(default(int))]
    public int page_number
    {
      get { return _page_number; }
      set { _page_number = value; }
    }
    private int _result_per_page = (int)10;
    [global::ProtoBuf.ProtoMember(3, IsRequired = false, Name=@"result_per_page", DataFormat = global::ProtoBuf.DataFormat.TwosComplement)]
    [global::System.ComponentModel.DefaultValue((int)10)]
    public int result_per_page
    {
      get { return _result_per_page; }
      set { _result_per_page = value; }
    }
    private serializers.protobuf.test.SearchRequest.Corpus _corpus = serializers.protobuf.test.SearchRequest.Corpus.UNIVERSAL;
    [global::ProtoBuf.ProtoMember(4, IsRequired = false, Name=@"corpus", DataFormat = global::ProtoBuf.DataFormat.TwosComplement)]
    [global::System.ComponentModel.DefaultValue(serializers.protobuf.test.SearchRequest.Corpus.UNIVERSAL)]
    public serializers.protobuf.test.SearchRequest.Corpus corpus
    {
      get { return _corpus; }
      set { _corpus = value; }
    }
    [global::ProtoBuf.ProtoContract(Name=@"Corpus")]
    public enum Corpus
    {
      [global::ProtoBuf.ProtoEnum(Name=@"UNIVERSAL", Value=0)]
      UNIVERSAL = 0,

      [global::ProtoBuf.ProtoEnum(Name=@"WEB", Value=1)]
      WEB = 1,
    }

    private global::ProtoBuf.IExtension extensionObject;
    global::ProtoBuf.IExtension global::ProtoBuf.IExtensible.GetExtensionObject(bool createIfMissing)
      { return global::ProtoBuf.Extensible.GetExtensionObject(ref extensionObject, createIfMissing); }
  }
}

IDL Syntax

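IDL definitions typically follow these rules:

Each field must carry a unique positive integer identifier, written for example as "= 1", "= 2" or "1 : ", "2 : ".
A field can be marked as required or optional.
Multiple structs can be defined in the same file.
Structs can contain other structs.
A field can be given a default value.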

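The identifiers assigned to fields here ("= 1", "= 2" or "1 : ", "2 : ") are called tags, and assigning them is called tagging. Tags are what identify fields inside the binary message, so once a tag has been defined and put into use it must not be changed.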
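
The role of tags can be sketched with a small hand-written encoder and decoder. The Python code below follows the Protocol Buffers key layout (field number shifted left by 3 bits, combined with a wire type, written as a base-128 varint), but it is only a simplified illustration that handles varint and length-delimited string fields; the helper names `encode_field` and `decode_message` are made up here and are not part of any library.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types handled by this sketch


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    """Write key (tag and wire type) followed by the value."""
    if isinstance(value, int):
        return encode_varint(tag << 3 | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint(tag << 3 | LENGTH_DELIMITED) + encode_varint(len(data)) + data


def decode_message(buf, schema):
    """schema maps tag -> field name; fields with unknown tags are skipped."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 7
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in schema:
            fields[schema[tag]] = value.decode("utf-8") if isinstance(value, bytes) else value
    return fields


# A newer writer adds a field with tag 9 that the older reader does not know.
message = encode_field(1, "hello") + encode_field(2, 3) + encode_field(9, 42)
old_schema = {1: "query", 2: "page_number"}
print(decode_message(message, old_schema))
```

Because the reader looks fields up by tag alone, a new tag can be added without breaking old readers, while reusing or renumbering an existing tag would cause old data to be misread.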

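In Protocol Buffers, a tag in the range 1-15 takes 1 byte to encode, and a tag in the range 16-2047 takes 2 bytes. To save space, reserve 1-15 for the most frequently used message elements, and leave some of those values free for frequently used elements that may be added in the future.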
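
These byte counts follow from how Protocol Buffers writes a field key: the field number is shifted left by 3 bits, combined with a 3-bit wire type, and the result is stored as a varint that carries 7 payload bits per byte. A minimal sketch of that arithmetic (the function name `key_size` is made up for illustration):

```python
def key_size(field_number, wire_type=0):
    """Bytes needed for the varint-encoded key (field_number << 3 | wire_type)."""
    key = (field_number << 3) | wire_type
    size = 1
    while key >= 0x80:  # each varint byte carries 7 bits of the key
        key >>= 7
        size += 1
    return size


for n in (1, 15, 16, 2047, 2048):
    print(n, key_size(n))
```

With 3 bits taken by the wire type, one byte leaves 4 bits for the field number (up to 15) and two bytes leave 11 bits (up to 2047).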

The following table compares the frameworks at the IDL level:

	Bond	Protocol Buffers	Thrift	Avro
File Extension	.bond	.proto	.thrift	.avpr
Namespace	namespace	package	namespace	@namespace
Import	import "t.bond"	import "t.proto"	include "t.thrift"	import protocol
Composite Type	struct t {}	message t {}	struct t {}	protocol t {}
Tagging	1: int32 t;	int32 t = 1;	1: i32 t	×
Field Rules	required optional	required optional repeated	required optional	-
Base Types	bool uint8 uint16 uint32 uint64 float double string int8 int16 int32 int64 wstring	double float int32 int64 uint32 uint64 sint32 sint64 fixed32 fixed64 sfixed32 sfixed64 bool string bytes	bool byte i16 i32 i64 double string	null boolean int long float double bytes string
Containers	list set map vector nullable blob	×	list set map	record array map union fixed
Polymorphism	√	×	×	×
Generics	√	×	×	×
Enumerations	enum t {}	enum t {}	enum t {}	enum t {}
Type Aliases	using t = int64	×	typedef i32 t	@aliases
Constants	×	×	const i32 t = 1	×
Exceptions	√	×	exception t {}	×
Services	√	service t {}	service t {}	protocol t {}
Attributes	√	-	-	-
Comments	C++ Style	C/C++ Style	C/Shell Style	Java Style

Note: "√" means supported, "×" means not supported, and "-" means not applicable.

Programming Language Support

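Each open-source data formatting framework supports a number of programming languages out of the box, and languages without default support can usually find it in the community. The table below lists the languages each framework supports by default: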

	Bond	Protocol Buffers	Thrift	Avro
Officially supported languages	C#, C++, Python	C++, Java, Python	C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi	C, C++, C#, Java, JavaScript, Python, Perl, PHP, Ruby
Open-source community implementations		C#: protobuf-net, protobuf-csharp-port; Node.js: node-protobuf		

Performance Comparison

The performance figures below come from the eishay/jvm-serializers project on GitHub.

Serializes only specific classes using code generation or other special knowledge about the class.

                                   create     ser   deser   total   size  +dfl
kryo-opt                               64     658     864    1522    209   129
wobly                                  43     886     536    1422    251   151
wobly-compact                          43     903     569    1471    225   139
protobuf                              130    1225     701    1926    239   149
protostuff                             82     488     678    1166    239   150
protobuf/protostuff                    83     598     692    1290    239   149
thrift                                126    1796     795    2591    349   197
thrift-compact                        126    1555     963    2518    240   148
avro                                   89    1616    1415    3031    221   133
json/json-lib-databind                 63   26330  103150  129479    485   263
json/jsonij-jpath                      63   38015   12325   50339    478   259

Total Time : Including creating an object, serializing and deserializing.

Serialization Time : Serializing with a new object each time (object creation time included).

Deserialization Time : Often the most expensive operation. To make a fair comparison, all fields of the deserialized instances are accessed - this forces lazy deserializers to really do their work.

Serialization Size : May vary a lot depending on number of repetitions in lists, usage of number compacting in protobuf, strings vs numerics, assumptions that can be made about the object graph, and more.
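
As a rough way to read the table, the Python sketch below recomputes the total (ser + deser) for a few rows copied from it and expresses each as a multiple of the protobuf total. The choice of rows and of protobuf as the baseline is arbitrary and made only for illustration; a recomputed total may differ from the table's total column by 1 because of rounding in the source figures.

```python
# (serializer, ser, deser, size) copied from the table above.
rows = [
    ("protostuff", 488, 678, 239),
    ("protobuf", 1225, 701, 239),
    ("thrift-compact", 1555, 963, 240),
    ("avro", 1616, 1415, 221),
    ("json/jsonij-jpath", 38015, 12325, 478),
]

# Use protobuf as an (arbitrary) baseline for the comparison.
baseline = next(r for r in rows if r[0] == "protobuf")
base_total = baseline[1] + baseline[2]

# Rank by total time, fastest first.
ranked = sorted(rows, key=lambda r: r[1] + r[2])
for name, ser, deser, size in ranked:
    total = ser + deser
    print(f"{name:20s} total={total:6d}  x{total / base_total:5.2f}  size={size}")
```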