Notes on Ceph PG states and related basic concepts

A Ceph cluster at my company recently ran into problems, and I took part in the repair. The most confusing part of the process was the pile of inscrutable states,
so I went through the documentation and a few other references and put together the following notes on Ceph PG states and the related concepts:

Placement Group States

When you check a cluster's status (with ceph -w or ceph -s), Ceph reports the state of every placement group. A PG has one or more states; the optimal PG state is active + clean.
All of the PG states are explained below.
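
First, though, a quick way to see which states your PGs are in on a live cluster (a minimal sketch; `ceph pg ls` accepts state filters on reasonably recent releases):

```sh
# One-line PG summary, the same counts shown by `ceph -s`
ceph pg stat

# List the PGs that are currently in a particular state
ceph pg ls degraded
```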

creating

Ceph is still creating the placement group.

activating

The placement group is peered but not yet active.

active

Ceph will process requests to the placement group.

clean

Ceph replicated all objects in the placement group the correct
number of times.

down

A replica with necessary data is down, so the placement group is
offline.

scrubbing

Ceph is checking the placement group metadata for inconsistencies.

deep

Ceph is checking the placement group data against stored checksums.
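
Scrubs normally run on a schedule, but both kinds can also be requested by hand; a minimal sketch (the pgid 1.0 is a placeholder):

```sh
# Light scrub: compare object metadata between replicas
ceph pg scrub 1.0

# Deep scrub: read the data and verify stored checksums
ceph pg deep-scrub 1.0
```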

degraded

Ceph has not yet replicated some objects in the placement group the
correct number of times.

inconsistent

Ceph detects inconsistencies in one or more replicas of an
object in the placement group (e.g. objects are the wrong size,
objects are missing from one replica *after* recovery finished,
etc.).

peering

The placement group is undergoing the peering process.

repair

Ceph is checking the placement group and repairing any
inconsistencies it finds (if possible).
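
A repair is normally kicked off explicitly once a scrub has flagged a PG; a minimal sketch:

```sh
# Find PGs flagged inconsistent
ceph pg ls inconsistent

# Ask the primary to repair one of them (use with care: it trusts
# whichever copies it judges to be authoritative)
ceph pg repair 1.0
```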

recovering

Ceph is migrating/synchronizing objects and their replicas.

forced_recovery

A high recovery priority for this PG has been enforced by the user.
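
On recent releases the flag is set and cleared with the force-recovery pair of commands; a minimal sketch:

```sh
# Push this PG to the front of the recovery queue
ceph pg force-recovery 1.0

# Undo it if you change your mind
ceph pg cancel-force-recovery 1.0
```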

recovery_wait

The placement group is waiting in line to start recovery.

recovery_toofull

A recovery operation is waiting because the destination OSD is over
its full ratio.
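
To see which OSD is blocking recovery, check per-OSD utilization against the configured ratios; a minimal sketch:

```sh
# Per-OSD usage, including %USE
ceph osd df

# The nearfull/backfillfull/full ratios currently in effect
ceph osd dump | grep ratio
```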

recovery_unfound

Recovery stopped due to unfound objects.

backfilling

Ceph is scanning and synchronizing the entire contents of a
placement group instead of inferring what contents need to be
synchronized from the logs of recent operations. Backfill is a
special case of recovery.

forced_backfill

A high backfill priority for this PG has been enforced by the user.
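
The commands are analogous to the force-recovery pair shown earlier: `ceph pg force-backfill <pgid>` to set the flag and `ceph pg cancel-force-backfill <pgid>` to clear it.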

backfill_wait

The placement group is waiting in line to start backfill.

backfill_toofull

A backfill operation is waiting because the destination OSD is over
its full ratio.

backfill_unfound

Backfill stopped due to unfound objects.

incomplete

Ceph detects that a placement group is missing information about
writes that may have occurred, or does not have any healthy copies.
If you see this state, try to start any failed OSDs that may contain
the needed information. In the case of an erasure coded pool,
temporarily reducing `min_size` may allow recovery.
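
The temporary `min_size` change mentioned above is an ordinary pool setting; a minimal sketch, assuming a k=4, m=2 erasure coded pool named ecpool whose `min_size` is currently k+1=5:

```sh
# Allow recovery/IO with only k shards available -- temporary!
ceph osd pool set ecpool min_size 4

# Restore the safer value once the PGs have recovered
ceph osd pool set ecpool min_size 5
```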

stale

The placement group is in an unknown state - the monitors have not
received an update for it since the placement group mapping changed.

remapped

The placement group is temporarily mapped to a different set of OSDs
from what CRUSH specified.

undersized

The placement group has fewer copies than the configured pool
replication level.

peered

The placement group has peered, but cannot serve client IO due to
not having enough copies to reach the pool's configured `min_size`
parameter. Recovery may occur in this state, so the PG may heal up
to `min_size` eventually.

snaptrim

Trimming snaps.

snaptrim_wait

Queued to trim snaps.

snaptrim_error

Error stopped trimming snaps.

Placement Group Concepts

When you execute commands like ceph -w, ceph osd dump, and other
commands related to placement groups, Ceph may return values using some
of the following terms:

Peering

The process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the objects (and their
metadata) in that PG. Note that agreeing on the state does not mean
that they all have the latest contents.

Acting Set

The ordered list of OSDs who are (or were as of some epoch)
responsible for a particular placement group.

Up Set

The ordered list of OSDs responsible for a particular placement
group for a particular epoch, according to CRUSH. Normally this is
the same as the *Acting Set*, except when the *Acting Set* has been
explicitly overridden via `pg_temp` in the OSD map.
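
Both sets for a given PG can be read straight off the cluster; a minimal sketch (the pgid is a placeholder):

```sh
# Show the up set and acting set for one PG
ceph pg map 1.0
# e.g. "osdmap e123 pg 1.0 (1.0) -> up [2,1,0] acting [2,1,0]"

# The same information with much more detail
ceph pg 1.0 query
```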

Current Interval or Past Interval

A sequence of OSD map epochs during which the *Acting Set* and *Up
Set* for a particular placement group do not change.

Primary

The member (and by convention the first) of the *Acting Set*, which is
responsible for coordinating peering, and is the only OSD that will
accept client-initiated writes to objects in a placement group.

Replica

A non-primary OSD in the *Acting Set* for a placement group (and which
has been recognized as such and *activated* by the primary).

Stray

An OSD that is not a member of the current *Acting Set*, but has not
yet been told that it can delete its copies of a particular
placement group.

Recovery

Ensuring that copies of all of the objects in a placement group are
on all of the OSDs in the *Acting Set*. Once *Peering* has been
performed, the *Primary* can start accepting write operations, and
*Recovery* can proceed in the background.

PG Info

Basic metadata about the placement group's creation epoch, the
version for the most recent write to the placement group, *last
epoch started*, *last epoch clean*, and the beginning of the
*current interval*. Any inter-OSD communication about placement
groups includes the *PG Info*, such that any OSD that knows a
placement group exists (or once existed) also has a lower bound on
*last epoch clean* or *last epoch started*.

PG Log

A list of recent updates made to objects in a placement group. Note
that these logs can be truncated after all OSDs in the *Acting Set*
have acknowledged up to a certain point.

Missing Set

Each OSD notes update log entries, and if they imply updates to the
contents of an object, adds that object to a list of needed updates.
This list is called the *Missing Set* for that `<OSD,PG>`.

Authoritative History

A complete and fully ordered set of operations that, if performed,
would bring an OSD's copy of a placement group up to date.

Epoch

A (monotonically increasing) OSD map version number.

Last Epoch Started

The last epoch at which all nodes in the *Acting Set* for a
particular placement group agreed on an *Authoritative History*. At
this point, *Peering* is deemed to have been successful.

up_thru

Before a *Primary* can successfully complete the *Peering* process,
it must inform a monitor that it is alive through the current OSD map
*Epoch* by having the monitor set its *up_thru* in the OSD map.
This helps *Peering* ignore previous *Acting Sets* for which
*Peering* never completed after certain sequences of failures, such
as the second interval below:

-   *acting set* = [A,B]
-   *acting set* = [A]
-   *acting set* = [] very shortly after (e.g., simultaneous
    failure, but staggered detection)
-   *acting set* = [B] (B restarts, A does not)

Last Epoch Clean

The last *Epoch* at which all nodes in the *Acting Set* for a
particular placement group were completely up to date (both
placement group logs and object contents). At this point, *recovery*
is deemed to have been completed.

References:

  1. https://github.com/ceph/ceph/blob/v14.0.0/doc/rados/operations/pg-states.rst
  2. http://docs.ceph.org.cn/rados/operations/pg-states/
  3. https://github.com/ceph/ceph/blob/v14.0.0/doc/rados/operations/pg-concepts.rst
  4. http://docs.ceph.org.cn/rados/operations/pg-concepts/