CEPH

Ceph is an open source, software-defined, distributed storage system. Software-defined storage (SDS) is a form of storage virtualization that separates the storage hardware from the software that manages the storage infrastructure. Ceph is a true SDS solution: it runs on commodity hardware without vendor lock-in, so customers are free to choose hardware from any manufacturer. Ceph is massively scalable (up to exabytes and beyond) and has no single point of failure.

Today, private and public cloud models are widely used to provide IT infrastructure to customers, and Ceph is very popular in cloud storage solutions such as OpenStack. Clouds depend on commodity hardware, and Ceph makes full use of that hardware to provide a fault-tolerant, cost-effective storage system. Ceph is also a unified storage solution: it provides file, block and object access from a single platform.

RAID technology has been the fundamental building block of storage systems for many years. However, RAID consumes a lot of disk space, takes a significant amount of time to rebuild a failed disk whose capacity is in the order of terabytes, and increases the cost of storage. Ceph addresses these problems and eliminates the need for RAID. Ceph client support has been part of the mainline Linux kernel since version 2.6.34.

WHY OBJECT STORAGE?

An object is a combination of data and metadata. Each object is identified by a unique ID, so no two objects can share the same identifier. Traditional storage solutions offer only file- and block-based storage and cannot provide object storage. Object-based storage has several advantages over them: it is independent of platform and hardware, which leaves us free to choose both. The object is the basic building block of Ceph. Any form of data, whether file or block, is stored as objects in a Ceph cluster, and these objects are replicated across the cluster to improve reliability. In Ceph, objects are not tied to a physical path, which makes them flexible and location-independent. This enables Ceph to scale linearly from the petabyte level to the exabyte level.

CEPH RELEASES

At the time of writing, Hammer (v0.94.3) is the latest release of Ceph. It was preceded by the Giant release.


CEPH ARCHITECTURE

A Ceph storage cluster is made up of several different software daemons, each of which takes care of a distinct Ceph function. The daemons are separated from one another, which helps keep the cost of a Ceph cluster low compared to other storage systems. In the figure below, RADOS is the lower part, which is entirely internal to the Ceph cluster and has no direct client interface; the upper part contains all the client interfaces.

Figure: Ceph Architecture

CEPH DEPLOYMENT

Suppose we have three nodes with the host names ceph-node1, ceph-node2 and ceph-node3.

1. Install ceph-deploy on ceph-node1 by executing:

# yum install ceph-deploy

2. Create a Ceph cluster using the ceph-deploy tool:

# ceph-deploy new ceph-node1

The new subcommand of ceph-deploy deploys a new cluster with the default cluster name, ceph. It generates a cluster configuration file and a keyring file, ceph.conf and ceph.mon.keyring, in the current working directory. When Ceph runs with authentication and authorization enabled, a user name and a keyring containing that user’s secret key must be supplied. The default user name is client.admin.

3. To install the Ceph software binaries on all the nodes using ceph-deploy, execute the following command from ceph-node1:

# ceph-deploy install --release emperor ceph-node1 ceph-node2 ceph-node3

emperor is the name of the Ceph release to install.

4. Create the first monitor on ceph-node1:

# ceph-deploy mon create-initial

5. Check the cluster status:

# ceph status

Initially the cluster won’t be healthy.

Creating Object Storage Device

Create an Object Storage Device (OSD) on ceph-node1 and add it to the Ceph cluster as follows.

1. List the disks on the node:

# ceph-deploy disk list ceph-node1

From the output, identify the disks (other than the OS-partition disks) on which the Ceph OSDs should be created.

2. The disk zap subcommand destroys the existing partition table and content of the disk:

# ceph-deploy disk zap ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd

3. The osd create subcommand first prepares the disk, that is, formats it with a filesystem (xfs by default). It then activates the disk’s first partition as the data partition and the second partition as the journal:

# ceph-deploy osd create ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd

4. Check the cluster status for new OSD entries:

# ceph status

At this stage, the cluster will still not be healthy. We need to add a few more nodes to the Ceph cluster so that it can set up a distributed, replicated object store and become healthy.

RADOS

The Reliable Autonomic Distributed Object Store (RADOS), also called the storage cluster, is the heart of the Ceph storage system. RADOS gives Ceph its distributed object store, high availability, reliability, absence of a single point of failure, self-healing and self-management. Ceph’s data access methods, such as the RADOS block device (RBD), CephFS, the RADOS gateway and librados, all operate on top of the RADOS layer. RADOS stores data as objects inside a pool. When a write request reaches a Ceph cluster, the location to which the data should be written is calculated by an algorithm called CRUSH. Based on that calculation, RADOS distributes the data across the cluster nodes in the form of objects. RADOS also performs data replication: it makes copies of objects and distributes these copies to different zones. No two copies reside in the same zone, and every object is replicated at least once. RADOS also checks object states to ensure that every object is consistent. If an inconsistency is found, recovery is performed with the help of the remaining object copies. These recovery operations are hidden from the end user. RADOS consists of two major components, the Object Storage Device (OSD) and the monitor.

1. RADOS Object Storage Device (OSD): An OSD stores client data in the form of objects on the physical disk drives of each node in the cluster. A Ceph cluster consists of many OSDs. For any read or write operation, the client requests the cluster maps from the monitors and, after examining the maps, interacts directly with the OSDs for I/O. Each object has one primary copy and several secondary copies, which are scattered across other OSDs. Each OSD acts as the primary OSD for some objects and, at the same time, as a secondary OSD for other objects. When a disk fails, the remaining OSDs perform recovery operations: a secondary OSD holding replicated copies of the affected objects is promoted to primary, and new secondary copies are created.

2. Ceph Monitors: Ceph monitors do not store client data. They serve updated cluster maps to clients and other cluster nodes, which periodically check with the monitors for the most recent copies of the cluster maps. Ceph monitors are responsible for the health of the Ceph cluster: they store cluster information, the state of the nodes and cluster configuration information, and they keep a master copy of the cluster map. A typical Ceph cluster has more than one monitor, and a multi-monitor Ceph architecture forms a quorum in which decision making is distributed among all the monitors. An odd number of monitors is recommended to avoid split-brain scenarios. Out of all the Ceph monitors, one operates as the leader; if the current leader goes down, one of the other monitors becomes the leader. A production cluster should have at least three monitors. The cluster map includes the monitor, OSD, PG and CRUSH maps.

•Monitor map: This holds end-to-end information about a monitor node, which includes the Ceph cluster ID, monitor hostname, and IP address with port number. It also stores the current epoch, the map creation time and the last-changed information.

•OSD map: This stores fields such as the cluster ID, the OSD map creation and last-changed information, and information related to pools such as pool names, pool ID, type, replication level, and placement groups. It also stores OSD information such as count, state, weight and OSD host information. We can check the cluster’s OSD map by executing:

# ceph osd dump

•PG map: This holds the time stamp, last OSD map, full ratio, and near full ratio information. It also keeps track of each placement group ID, object count, state, state stamp, and the up and acting OSD sets. To check the cluster PG map, execute:

# ceph pg dump

•CRUSH map: This holds information about the cluster’s storage devices and the rules defined for failure domains when storing data. To check the cluster CRUSH map, execute the following command:

# ceph osd crush dump
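The recommendation of an odd number of monitors can be illustrated with a little arithmetic. The sketch below assumes that the monitor quorum is a strict majority of the configured monitors; the function names quorum_size and tolerated_failures are made up for this illustration and are not part of any Ceph tool.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """Monitors that may fail while a majority can still be formed."""
    return monitor_count - quorum_size(monitor_count)


# Compare a few cluster sizes: (monitors, quorum, failures tolerated)
table = [(n, quorum_size(n), tolerated_failures(n)) for n in range(1, 6)]
```

Under this assumption, three and four monitors both survive only a single monitor failure, so the fourth monitor adds cost without adding resilience. That is one way to see why odd counts such as three or five are usually chosen.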

librados

librados is a C library that allows applications to work directly with RADOS, bypassing the other interface layers of the Ceph cluster. It offers an API through which applications can interact with RADOS directly and in parallel, with no HTTP overhead. Applications link with librados and extend their native protocols, thereby gaining access to RADOS. This direct interaction improves the performance of applications. librados also serves as the base for the other service interfaces built on top of it, which include the Ceph File System, the Ceph RADOS gateway and the Ceph Block Device.
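To give a feel for what direct access looks like, here is a minimal sketch that uses the rados Python module, a wrapper around the librados C library. It assumes the Python bindings are installed on the client, that /etc/ceph/ceph.conf and the admin keyring are readable, and that the pool already exists. The helper name store_and_fetch and its arguments are invented for this example.

```python
def store_and_fetch(pool_name, object_name, payload,
                    conffile="/etc/ceph/ceph.conf"):
    """Write payload (bytes) to an object through librados and read it back."""
    # Imported here so that the module loads even where the bindings are absent.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()                       # contacts the monitors
    try:
        ioctx = cluster.open_ioctx(pool_name)   # I/O context bound to one pool
        try:
            ioctx.write_full(object_name, payload)  # replace the whole object
            return ioctx.read(object_name)          # read it back from RADOS
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Against a running cluster, a call such as store_and_fetch('newpool', 'greeting', b'hello') should return the bytes that were written. No gateway or HTTP layer sits between the application and RADOS in this path.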

RADOS GATEWAY

The Ceph object gateway is also known as the RADOS gateway. It exposes APIs compatible with the Amazon S3 API and the Swift API (OpenStack Object Storage). It can be considered a proxy that converts HTTP requests to RADOS requests and vice versa. The S3 and Swift APIs share a common namespace inside a Ceph cluster, so we can write data with one API and retrieve it with the other. Apart from the S3 and Swift APIs, an application can bypass the RADOS gateway altogether and get direct, parallel access to the Ceph cluster through librados. Removing these additional layers is effective for applications that require extreme performance from the storage. Running more than one gateway distributes the client request load across the gateways.

• S3 compatible: This provides an Amazon S3 RESTful API-compatible interface to Ceph storage clusters. The RESTful (Representational State Transfer) style is a popular way of building cloud-based APIs.

• Swift compatible API: It provides an OpenStack Swift API-compatible interface to Ceph storage clusters. The Ceph Object Gateway can be used as a replacement for Swift in an OpenStack cluster.

• Admin API: This is helpful for the administration of our Ceph cluster over HTTP RESTful API.


Figure: Different access methods using RADOS Gateway

RADOS BLOCK DEVICE(RBD)

In block storage, data is stored as volumes, made up of blocks, that are attached to nodes. This provides the large storage capacity required by applications. The blocks are mapped to the operating system and controlled by its file system. Ceph introduced a new protocol called RBD, which provides reliable, distributed, high-performance block storage disks to clients. The RBD driver has been integrated into the Linux kernel. RBD supports images of up to 16 exabytes. The Ceph block device fully supports cloud platforms such as OpenStack and CloudStack. In OpenStack, the Ceph block device is used with the Cinder and Glance components.

1. Create an RBD image named ‘testrbd’ with a size of 20480 MB (20 GB):

# rbd create testrbd --size 20480

2. List the RBD images:

# rbd ls

3. Retrieve information about the block device:

# rbd --image testrbd info

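4. Map the remote RBD image to an RBD device:

# echo '{ceph-monitor ip} name=admin,secret=Qwer12%$&*wqMN ceph-pool ceph-image' > /sys/bus/rbd/add

‘ceph-image’ is the name of the RBD image and ‘ceph-pool’ is the name of the pool. The string is single-quoted so that the shell does not interpret the special characters in the secret.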

5. Format the device:

# mkfs.xfs -L rbddevice /dev/rbd0

rbddevice is the label used to identify the RBD device in an environment with multiple RBD devices.

6. Remove the RBD device by executing:

# echo "0" > /sys/bus/rbd/remove

CEPH File System

Ceph provides a file system on top of RADOS. It uses a metadata daemon that manages metadata and keeps it separate from the data, which reduces complexity and improves reliability. CephFS offers a POSIX-compliant, distributed file system of any size. It uses the same Ceph storage cluster as Ceph block devices and Ceph object storage. To use the Ceph file system, we require at least one metadata server. Linux kernel version 2.6.34 and above supports CephFS. There are two ways to use CephFS: with the native kernel driver, or with Ceph FUSE.

Mounting CephFS with kernel driver

1. Check the kernel version of the client with ‘uname -r’ and create a mount point directory:

# mkdir /mnt/cephkernel

2. Mount CephFS:

# mount -t ceph <monitor ip>:<port no of monitor>:/ /mnt/cephkernel -o name=admin,secret=<key>

eg: mount -t ceph 192.168.1.65:6789:/ /mnt/cephkernel -o 'name=admin,secret=Mwkwwk&%$75757HJF'

Here key is the admin secret key located in /etc/ceph/ceph.client.admin.keyring. The option string is quoted in the example because the key contains characters that the shell would otherwise interpret.

Mounting CephFS as FUSE 

FUSE stands for Filesystem in Userspace. It is a mechanism that allows non-privileged users to create their own file systems without editing kernel code.

1. Install the ceph-fuse package on the client machine:

# yum install ceph-fuse

2. Create a directory, ‘/mnt/cephfs’, to use as the mount point:

# mkdir /mnt/cephfs

3. Mount the file system:

# ceph-fuse -m <monitor ip>:<port number of monitor> <mount point name>

eg: ceph-fuse -m 192.168.1.34:6789 /mnt/cephfs

4. To mount permanently, open /etc/fstab and add:

<ceph-id> <mount point> <Type> <options>

id=admin /mnt/cephfs fuse.ceph defaults 0 0

PLACEMENT GROUP(PG)

A placement group (PG) is a logical collection of objects that are replicated on OSDs to provide reliability in the storage system. We can think of a PG as a logical container that holds multiple objects and is mapped onto multiple OSDs. Placement groups are essential for the scalability and performance of a Ceph storage system. Without PGs, it would be difficult to track and manage the many replicated copies of objects spread over many OSDs. Every placement group consumes resources such as CPU and memory in order to manage its objects. Increasing the number of PGs in a cluster reduces the per-OSD load, but the increase should be made in a regulated way. 50 to 100 PGs per OSD is recommended.

CEPH POOLS

A Ceph pool is a logical partition for storing objects, and pools make storage management in Ceph easy. Each pool holds a number of placement groups, and these placement groups hold objects that are mapped to OSDs. A pool ensures data availability by creating the configured number of object copies. The replica size can be defined when the pool is created; the default replica size is 2 (the object plus one additional copy). When we first deploy a Ceph cluster without creating a pool, Ceph uses default pools to store data. A pool supports snapshots and allows ownership and access permissions to be set for objects.

In a Ceph storage system, data management starts as soon as a client writes data to a pool. The data is first written to a primary OSD. The primary OSD then replicates the same data to the secondary and tertiary OSDs, according to the pool replication size. After finishing their writes, the secondary and tertiary OSDs send an acknowledgement to the primary OSD. Only then does the primary OSD send an acknowledgement to the client, confirming that the write operation has completed.
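The order of writes and acknowledgements described above can be mimicked with a toy model. This is only an in-memory sketch of the ordering, not Ceph code: the ToyOSD class and the replicated_write function are invented for illustration, and real OSDs communicate over the network and persist data to disk.

```python
class ToyOSD:
    """In-memory stand-in for an OSD; not a real Ceph daemon."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def store(self, object_id, data):
        self.objects[object_id] = data
        return True  # acknowledgement sent back to the caller


def replicated_write(primary, replicas, object_id, data):
    """Return the client acknowledgement only after every replica has acked."""
    primary.store(object_id, data)                       # client -> primary
    replica_acks = [osd.store(object_id, data)           # primary -> replicas
                    for osd in replicas]
    return all(replica_acks)                             # primary -> client


osds = [ToyOSD("osd.0"), ToyOSD("osd.1"), ToyOSD("osd.2")]
client_ack = replicated_write(osds[0], osds[1:], "obj-1", b"hello")
```

The point of the sketch is the ordering: the client sees a successful write only once the primary and all its replicas hold the object.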

Creating a Pool

Creating a Ceph pool requires a pool name, the PG and PGP numbers, and a pool type, which is replicated by default. PGP is the total number of placement groups used for placement purposes, that is, for placing objects inside the pool.

1. Create a pool named ‘newpool’ with PG and PGP numbers of 128:

# ceph osd pool create newpool 128 128

2. List the pools in either of two ways:

# ceph osd lspools

# rados lspools

3. The default replication size for a Ceph pool created with Ceph Emperor or earlier releases is two. We can set the replication size with:

# ceph osd pool set newpool size 4

4. Take a snapshot of a pool:

# rados mksnap snapshot01 -p newpool

CRUSH

Traditional storage systems store data together with its metadata. The metadata, which is data about data, records information such as where the data is actually stored. Each time new data is added to the storage system, its metadata is first updated with the physical location where the data will be stored, after which the actual data is stored. This approach does not scale to exabyte-level data, and it creates a single point of failure for the storage system: if we lose the storage metadata, we lose all our data. It is therefore important to keep the central metadata safe from disasters, either by keeping multiple copies on a single node or by replicating the entire data and metadata. Such complex management of metadata is a bottleneck for a storage system’s scalability, high availability and performance.

Ceph uses the Controlled Replication Under Scalable Hashing (CRUSH) algorithm. Unlike traditional systems that rely on storing and managing a central metadata/index table, Ceph uses CRUSH to compute where data should be written to or read from. Instead of storing metadata, CRUSH computes it on demand, removing the limitations of storing metadata in the traditional way. This metadata computation is known as a CRUSH lookup, and it is not system dependent: Ceph lets clients perform the computation themselves, on demand, whenever they read or write data.

For a read or write operation on a Ceph cluster, the client first contacts a Ceph monitor and retrieves a copy of the cluster map, which tells the client the state and configuration of the cluster. The data is converted to objects identified by an object ID and a pool name/ID. The object ID is then hashed against the number of placement groups to determine the placement group within the required Ceph pool. The calculated placement group then goes through a CRUSH lookup (on-demand metadata computation) to determine the primary OSD for storing or retrieving the data. After computing the OSD ID, the client contacts this OSD directly and stores the data. All these computations are performed by the clients, so they do not affect cluster performance. Once the data is written to the primary OSD, the same node performs a CRUSH lookup and computes the location of the secondary placement groups and OSDs, so that the data is replicated across the cluster for high availability.
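The idea of computing a location instead of looking it up can be sketched in a few lines. This is not the real CRUSH algorithm: the toy below uses SHA-256 and a simple highest-hash ranking (rendezvous hashing) as a stand-in, ignores weights and failure domains, and all function names are made up. It only shows that any client holding the same inputs arrives at the same placement group and OSD set without consulting a central table.

```python
import hashlib


def _hash(text: str) -> int:
    """Stable 64-bit hash of a string (stand-in for Ceph's own hash)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Hash the object name against the number of PGs in the pool."""
    return _hash(object_name) % pg_num


def pg_to_osds(pool_id: int, pg: int, osd_ids: list, replicas: int) -> list:
    """Rank the OSDs by a per-PG hash; the first entry plays the primary."""
    ranked = sorted(osd_ids,
                    key=lambda osd: _hash(f"{pool_id}.{pg}.{osd}"),
                    reverse=True)
    return ranked[:replicas]


# Hypothetical pool 1 with 128 PGs on a cluster of six OSDs, three replicas.
pg = object_to_pg("myobject", 128)
acting = pg_to_osds(1, pg, list(range(6)), 3)
```

Running the two functions again with the same object name, PG count and OSD list yields the same result, which is the property that lets clients do the lookup on their own.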

Recovery and Rebalancing

When a component fails, Ceph waits for 300 seconds (by default) before it marks a down OSD as out and initiates the recovery operation. This interval is controlled by the ‘mon osd down out interval’ parameter in the Ceph cluster configuration file. During recovery, Ceph regenerates the affected data that was hosted on the failed node. Because CRUSH replicates data to several nodes, the replicated copies are used for the recovery. When a new disk or host is added to a Ceph cluster, CRUSH starts a rebalancing operation, during which it moves data from the existing hosts or disks to the new one. Rebalancing keeps all disks equally utilized, which improves cluster performance. All the existing OSDs work in parallel to move the data, which helps the rebalancing operation complete faster.

CEPH and OpenStack

OpenStack is a set of software tools for building and managing cloud computing platforms for public and private clouds. Ceph provides robust, reliable storage for OpenStack. Ceph can be integrated with OpenStack components such as Cinder, Glance, Nova and Keystone. The main benefits of integrating Ceph with OpenStack include:

1. Ceph is a unified storage solution offering block, file and, mainly, object storage for OpenStack, allowing different applications to use storage as they need.

2. Ceph supports rich APIs for both the Swift and S3 object storage interfaces.

3. It provides a snapshot feature for OpenStack volumes, which can be used for backup.

4. Ceph provides a feature-rich storage backend at a very low cost, which in turn limits the OpenStack deployment cost.

5. It provides advanced block storage capabilities for OpenStack clouds, such as cloning of VMs.

CEPH Best Practices:

1. The OSD journal

Ceph first writes the data from Ceph clients to a journal, and only after the journal write completes is the data written to the backing storage. The journal is a small partition, either on the same disk as the OSD or on a separate SSD (solid state drive), or it may be a file on a file system. 10 GB is a common journal size. Ceph uses journaling for speed and consistency, and it supports Btrfs and XFS as journaling file systems for OSDs. A sync operation runs every five seconds, and it determines how long data stays in the journal. Placing the journal on an SSD partition results in faster journal writes, so SSD partitions are recommended for journals. The backing storage can be composed of slower disks such as SATA disks. If a journal fails on a Btrfs-based file system, there will be minimal or no data loss. The failure of a journal disk that serves OSDs running on XFS or ext4 file systems will result in data loss, so Btrfs is preferred. Btrfs is a copy-on-write file system, which means that if the content of a block is changed, the changed block is written separately. This preserves the old block, so the old data remains available even after a journal failure. When external SSDs are used for journals, we should not exceed a ratio of four to five OSDs per journal disk.

Figure: Ceph OSD journaling

In the above figure, (1) indicates the initial write of data from the client to the journal, and (2) indicates the write of data from the journal to the backing storage, that is, physical disks such as SATA disks.

2. Number of Placement Groups

Setting the correct number of placement groups is an essential step in building a Ceph storage cluster. The formula to calculate the total number of placement groups for a Ceph cluster is:

Total PGs = (Total number of OSDs * 100) / maximum replication count

The maximum replication count is the maximum number of replicas set for an object. The result must be rounded up to the nearest power of 2. For example, a result of 1888.82 is rounded up to 2048.

The total number of PGs per pool in the Ceph cluster is calculated by:

Total PGs per pool = ((Total number of OSDs * 100) / maximum replication count) / pool count

This value also needs to be rounded up to the nearest power of two.
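The two formulas above are easy to turn into a small calculator. The sketch below is only a convenience for the arithmetic; the function names are invented, and the cluster sizes in the example (160 OSDs, 3 replicas, 4 pools) are hypothetical values chosen for illustration.

```python
def next_power_of_two(value: float) -> int:
    """Round value up to the nearest power of two."""
    power = 1
    while power < value:
        power *= 2
    return power


def total_pgs(osd_count: int, max_replicas: int) -> int:
    """Total PGs = (OSDs * 100) / max replication count, rounded up."""
    return next_power_of_two(osd_count * 100 / max_replicas)


def pgs_per_pool(osd_count: int, max_replicas: int, pool_count: int) -> int:
    """Per-pool PGs: the same formula divided by the pool count, rounded up."""
    return next_power_of_two(osd_count * 100 / max_replicas / pool_count)


# Hypothetical cluster: 160 OSDs, replication count 3, 4 pools.
cluster_total = total_pgs(160, 3)        # 5333.33 rounds up to 8192
per_pool = pgs_per_pool(160, 3, 4)       # 1333.33 rounds up to 2048
```

The per-pool result can then be passed as the PG and PGP numbers when a pool is created.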

CONCLUSION

Compared with the other storage solutions available today, Ceph offers more features. Ceph is an open source, software-defined storage solution that runs on commodity hardware, which makes it an economical choice. Ceph provides a variety of interfaces for clients to connect to a Ceph cluster, which increases flexibility for clients. For data protection, Ceph does not rely on RAID technology; it uses replication instead, which has proven to be a better solution than RAID. Every component of Ceph is reliable and supports high availability. Ceph has no single point of failure, which is a major challenge for other storage solutions available today. One of the biggest advantages of Ceph is its unified nature: it provides block, file and object storage, which other storage systems are still unable to offer together.

Ceph is a distributed storage system, and clients can perform quick transactions using it. It does not follow the traditional method of storing data by maintaining metadata; rather, it introduces a new mechanism that allows clients to dynamically calculate the data locations they need. This improves performance for clients, as they no longer need to wait for a metadata server to return data locations. Where other storage systems cannot provide reliability against multiple failures, Ceph detects and corrects failures at the disk, node, network and data center levels; other storage solutions can only provide reliability up to disk or node failure. Ceph provides a unified, distributed, highly scalable and reliable object storage solution, which is much needed for today’s and the future’s unstructured data. The world’s storage needs are increasing, so we need a storage system that can scale to the exabyte level without affecting data reliability and performance. Ceph provides a solution to all these problems.
