Study notes on Google Megastore #appengine - スティルハウスの書庫の書庫

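Brett's article on STM mentions Google Megastore, and its link points to Hamilton's July 2008 post (more precisely, to the notes by Phil Bernstein presented in that post). So this appears to be the only public information on Megastore that carries a Googler's endorsement. I therefore copied out the key points once more and collected the questions that arise when looking at it from the App Engine level. The parts in blue are my translation notes and comments. kuenishi has also written an article explaining the same post, so please refer to it as well.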

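Incidentally, the post below writes "BigTable". On whether the T in Bigtable is upper or lower case, guido said, "Hmm, the paper used a lowercase t, so I guess it's lowercase." In other words, the spelling is not consistent even inside Google.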

Google Megastore

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:
Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk.

What follows is a guest post by Phil Bernstein on the Google Megastore presentation given by Jonas Karlsson and Philip Zeyliger at SIGMOD 2008.
Megastore is a transactional, indexed record manager that Google built on top of BigTable.
Megastore is rumored to be the store behind the Google App Engine datastore (this has since been confirmed: Datastore is implemented on top of Megastore).

·A transaction is allowed to read and write data in an entity group.
·The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.
·A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

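Within an entity group, a transaction can read and write data.
The term "entity group" refers to a set of records. The records may live in different BigTable instances (meaning tablet servers?), so the entities in one entity group may be placed on different machines. The entities in an entity group share a common prefix of their primary key, so an entity group is in effect a hierarchically linked set of entities.
- So the concepts "entity" and "entity group" come from Megastore, not from Datastore.
- The entities of an entity group may be spread across multiple machines. That would make an entity group transaction a distributed transaction (= global transaction)... right?
A per-entity-group transaction log is used. One of the rows that store the entity group is the group's root (the root entity). The log is stored with the root and is replicated like every other row in Big Table.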
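To make the "common primary key prefix" idea concrete, here is a minimal sketch. It assumes a hypothetical path-style key encoding (`User:alice/Photo:1`); the talk did not describe Megastore's real key format, so the key layout and the `entity_group` helper are illustrative only. Because rows are sorted by primary key, an entity group can be read with one prefix scan.

```python
import bisect

# Hypothetical key layout (an assumption, not Megastore's real encoding):
# every entity key in a group starts with the root entity's key.
rows = {
    "User:alice": {"name": "Alice"},
    "User:alice/Photo:1": {"title": "beach"},
    "User:alice/Photo:2": {"title": "city"},
    "User:bob": {"name": "Bob"},
    "User:bob/Photo:1": {"title": "forest"},
}

def entity_group(rows, root_key):
    """Return the keys of the entity group rooted at root_key via a prefix scan."""
    keys = sorted(rows)  # stands in for a table sorted by primary key
    start = bisect.bisect_left(keys, root_key)
    group = []
    for key in keys[start:]:
        if key == root_key or key.startswith(root_key + "/"):
            group.append(key)
        elif not key.startswith(root_key):
            break  # left the key range that could contain this group
    return group

alice_group = entity_group(rows, "User:alice")
bob_group = entity_group(rows, "User:bob")
```

Note that the scan says nothing about which machine holds each row; it only shows why a shared prefix makes the group a contiguous key range.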

·To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.
·They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.
·Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

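To commit a transaction, its updates are recorded in the log, and the log is replicated. The updates are then written to the database copy of the entity group (the rows that hold the data itself?).
- Are the commit and apply phases described in the Datastore documentation phases internal to Megastore?
The updates are committed to the replicas before the caller is acknowledged. Paxos handles replica failures. This makes it an ACID transaction.
- With Paxos, replication need not wait for a reply from every replica (as 2PC would), yet consistency is still preserved on failure (ACID)... I think? It is not clear to me whether Paxos is used on every write or only when a failure occurs.
Optimistic concurrency control is used. Few details were given, but it is presumably the same as what is described for Google Apps (a mistake for App Engine).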
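The commit steps and the optimistic check can be sketched as follows. This is a toy model under stated assumptions: the talk gave few details, so checking the log position at commit time is just one common way to implement optimistic concurrency control, and the `EntityGroup` and `ConflictError` names are hypothetical. Replication and Paxos are left out entirely.

```python
class ConflictError(Exception):
    """Raised when another transaction committed to the group first."""

class EntityGroup:
    def __init__(self):
        self.log = []    # per-entity-group log, kept with the root entity
        self.data = {}   # the entities themselves (the "database copy")

    def begin(self):
        # Remember the log position this transaction started from.
        return {"read_position": len(self.log), "writes": {}}

    def commit(self, txn):
        # Optimistic check (assumed): fail if the log advanced since begin().
        if txn["read_position"] != len(self.log):
            raise ConflictError("entity group changed; retry the transaction")
        # Step 1: record the updates in the log (replication would happen here).
        self.log.append(dict(txn["writes"]))
        # Step 2: apply the logged updates to the entities.
        self.data.update(txn["writes"])

group = EntityGroup()
t1 = group.begin()
t2 = group.begin()
t1["writes"]["User:alice"] = {"name": "Alice"}
t2["writes"]["User:alice"] = {"name": "Alicia"}
group.commit(t1)
try:
    group.commit(t2)
    conflict = False
except ConflictError:
    conflict = True  # t2 lost the race and would have to retry
```

In this model the only shared state that decides a commit is the log of one entity group, which is why two transactions on different groups never conflict.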

·Schemas are supported.
·They offer vertical partitioning to cluster columns that are frequently accessed together.
·They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.
·Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

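Schemas are supported (?)
Vertical partitioning is offered to cluster columns that are frequently accessed together (the Datastore API does not expose this).
Joins are not supported, except along hierarchical paths within an entity group (does this mean ancestor queries?). Arbitrary joins can be implemented in the application, but there is no consistency guarantee between data in different entity groups.
Big Table does not support indexes; it only sorts rows by primary key. Megastore provides indexes on top of Big Table. The details were vague. The index appears to be a binary table with one column holding the compound key as a slash-separated string and another column holding the primary key of the entity group.
- This presumably refers to App Engine's index tables. But if the index tables are implemented at the Megastore level, does that mean improvements to App Engine's query features are made inside Megastore...?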
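As a rough illustration of an index kept as an ordinary sorted table, the sketch below stores two-column rows: a slash-separated compound key and the primary key it points to. The compound key layout (`kind/property/value`) and the `build_index` and `lookup` helpers are assumptions for illustration; the talk did not give the real format.

```python
import bisect

entities = {
    "Photo:1": {"tag": "beach"},
    "Photo:2": {"tag": "city"},
    "Photo:3": {"tag": "beach"},
}

def build_index(entities, kind, prop):
    """Build sorted (compound_key, primary_key) rows for one property."""
    rows = [
        (f"{kind}/{prop}/{props[prop]}", key)
        for key, props in entities.items()
        if prop in props
    ]
    return sorted(rows)  # the underlying table only sorts by key

def lookup(index, kind, prop, value):
    """Find primary keys with an equal indexed value by scanning a key range."""
    target = f"{kind}/{prop}/{value}"
    # (target,) sorts before every (target, primary_key) row.
    start = bisect.bisect_left(index, (target,))
    result = []
    for compound_key, primary_key in index[start:]:
        if compound_key != target:
            break
        result.append(primary_key)
    return result

tag_index = build_index(entities, "Photo", "tag")
beach_photos = lookup(tag_index, "Photo", "tag", "beach")
```

The point of the design is that a query becomes a range scan over sorted keys, which is the one access pattern the underlying table already supports.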

·Referential integrity between the components of an entity group is not supported.
·Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.
·It has been in use by apps for about a year.

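Referential integrity between the entities in an entity group is not guaranteed.
Many-to-many relationships are not supported, although "they can store the inverse of a functional relationship" (I don't understand this part). It sounds like a materialized view that is updated incrementally and asynchronously.
It has been in use by multiple applications for about a year.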
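One possible reading of "the inverse of a functional relationship", offered as an interpretation rather than something the talk confirmed: if each photo has exactly one owner (a function from photo to owner), the inverse maps each owner to the set of that owner's photos. The sketch below rebuilds the inverse in one pass for simplicity; the names `owner_of` and `invert` are hypothetical, and the asynchronous, incremental maintenance mentioned in the notes is not modeled.

```python
from collections import defaultdict

# A functional relationship: each photo has exactly one owner.
owner_of = {
    "Photo:1": "User:alice",
    "Photo:2": "User:alice",
    "Photo:3": "User:bob",
}

def invert(relation):
    """Build the inverse mapping: owner -> list of photos, in key order."""
    inverse = defaultdict(list)
    for photo, owner in sorted(relation.items()):
        inverse[owner].append(photo)
    return dict(inverse)

photos_of = invert(owner_of)
```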

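Now I'm confused again (lol)

...Entity groups also guarantee ACID across multiple machines, so that makes them distributed transactions / global transactions... right?
But how? 2PC is not used. Paxos, then? A Googler did say that Paxos is used for replication between data centers, but I can't believe Paxos runs on every commit within an entity group... I really don't get it. Could someone please write up the pros/cons and implementation differences between the so-called "application-level distributed transactions" of Slim3 and others, and the distributed transactions of entity groups?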

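Addendum

A tweet from Higa:

.@kazunori_279 I think the ACID guarantee of an entity group is implemented with (only) the log on the root entity, so it is confined to a single machine and is not a distributed transaction #appengine

Hmm. So even if the data is physically stored in distributed locations, as long as the timestamps and locks that protect it live on a single machine, it can be regarded as a local transaction... is that the right way to understand it?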