Quickly deploying a Ceph cluster on CentOS 7 with ceph-deploy

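## 1. Install the Ceph deployment tool

Add the Ceph repository to the ceph-deploy admin node, then install ceph-deploy.

#### 1. Configure the yum repositories

Note that both echo commands below write to /etc/yum.repos.d/ceph.repo, so the second one (the Aliyun mirror, rpm-infernalis) overwrites the first (download.ceph.com, rpm-nautilus). Run only the one you intend to use.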

sudo yum install -y yum-utils && sudo yum-config-manager --add-repo https://dl.fedoraproject.org/pub/epel/7/x86_64/ && sudo yum install --nogpgcheck -y epel-release && sudo rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 && sudo rm /etc/yum.repos.d/dl.fedoraproject.org*

echo '[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-nautilus/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc' > /etc/yum.repos.d/ceph.repo

echo '[Ceph]
name=Ceph packages for $basearch
baseurl=http://mirrors.aliyun.com/ceph/rpm-infernalis/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=http://mirrors.aliyun.com/ceph/keys/release.asc
priority=1

[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-infernalis/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=http://mirrors.aliyun.com/ceph/keys/release.asc
priority=1

[ceph-source]
name=Ceph source packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-infernalis/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=http://mirrors.aliyun.com/ceph/keys/release.asc
priority=1' > /etc/yum.repos.d/ceph.repo

yum clean all
yum makecache

#### 2. Install ceph-deploy

``` shell
yum install -y ceph-deploy
```
 
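## 2. Preflight checks and configuration of the Ceph nodes

The admin node must be able to reach every Ceph node over SSH without a password. If ceph-deploy logs in as a regular user, that user must be able to run sudo without a password.

#### 1. Install NTP

Install NTP on all Ceph nodes (especially the Ceph Monitor nodes) to avoid failures caused by clock drift.

#### 2. Install an SSH server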

``` shell
yum install -y openssh-server
```

Make sure the SSH server is running on all Ceph nodes.

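#### 3. Create a user for deploying Ceph

ceph-deploy must log in to the Ceph nodes as a regular user with passwordless sudo, because it installs software and writes configuration files without prompting for a password.

Create the new user on every Ceph node, and make sure it has sudo privileges on each of them.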

username=ceph-deploy
sudo useradd -d /home/${username} -m ${username}
sudo passwd ${username}
echo "${username} ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/${username}
sudo chmod 0440 /etc/sudoers.d/${username}

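Enable passwordless SSH login
Because ceph-deploy does not prompt for passwords, you must generate an SSH key pair on the admin node and distribute the public key to every Ceph node. ceph-deploy will attempt to generate SSH key pairs for the initial monitors.

Generate the SSH key pair, but do not use sudo or the root user. When prompted with "Enter passphrase", just press Enter to leave the passphrase empty: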

ssh-keygen

Generating public/private key pair.
Enter file in which to save the key (/ceph-admin/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /ceph-admin/.ssh/id_rsa.
Your public key has been saved in /ceph-admin/.ssh/id_rsa.pub.

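Copy the public key to each Ceph node, logging in as the deployment user created in the "Create a user for deploying Ceph" step.

Configure the hosts file first. Before doing so, set the hostname on each machine, for example: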

hostnamectl set-hostname node1

echo '172.66.1.12 node1
172.66.1.13 node2
172.66.1.14 node3
' >> /etc/hosts

username=ceph-deploy
ssh-copy-id ${username}@node1
ssh-copy-id ${username}@node2
ssh-copy-id ${username}@node3

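(Recommended) Edit ~/.ssh/config on the ceph-deploy admin node so that ceph-deploy logs in to the Ceph nodes as the user you created, without having to pass --username {username} on every invocation. This also simplifies ssh and scp usage. Replace the user name below with the one you created.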

Host node1
   Hostname node1
   User ceph-deploy
Host node2
   Hostname node2
   User ceph-deploy
Host node3
   Hostname node3
   User ceph-deploy

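Also, make sure firewalld and SELinux are disabled before deploying.

#### 4. Priorities/preferences

Make sure your package manager has the priorities/preferences package installed and enabled. On CentOS you may need to install EPEL; on RHEL you may need to enable the optional repositories.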

sudo yum install -y yum-plugin-priorities

## 3. Install the cluster

http://docs.ceph.org.cn/start/quick-ceph-deploy/

#### 1. Create the cluster

ceph-deploy new node1

#### 2. Install Ceph

ceph-deploy install node1

ceph-deploy mon create-initial

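Run:

ceph-deploy osd prepare node2:/dev/vdc
ceph-deploy osd prepare node2:/dev/vdc node3:/dev/vdc

This fails with:

usage: ceph-deploy osd [-h] {list,create} ...
ceph-deploy osd: error: argument subcommand: invalid choice: 'prepare' (choose from 'list', 'create')

ceph-deploy 2.0.1 turns out to have no prepare subcommand; yet another pitfall.

Next, run:

ceph-deploy osd create node2 --data /dev/vdc

This fails as well:

[ceph_deploy][ERROR ] ExecutableNotFound: Could not locate executable 'ceph-volume' make sure it is installed and available on node2

The cause is that the ceph-deploy version is too new for the installed Ceph release: ceph-deploy 2.x relies on ceph-volume, which older Ceph releases such as Infernalis do not ship. Uninstall the newer version on the admin node and install an older one:

The following also solves the problem:

pip install ceph-deploy==1.5.39

With the older version, the following command succeeds: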

ceph-deploy osd prepare node2:/dev/vdc node3:/dev/vdc

This step prepares the disks for use as OSDs.

Finally, activate the OSDs:

ceph-deploy osd activate node2:/dev/vdc1 node3:/dev/vdc1

Lastly, add node1 in the same way, and the cluster is complete.

[root@node1 my-cluster]# ceph -s
    cluster 838f41b7-778e-43b1-b16d-99694af1df52
     health HEALTH_OK
     monmap e1: 1 mons at {node1=172.66.1.12:6789/0}
            election epoch 2, quorum 0 node1
     osdmap e13: 3 osds: 3 up, 3 in
            flags sortbitwise
      pgmap v23: 64 pgs, 1 pools, 0 bytes data, 0 objects
            100 MB used, 134 GB / 134 GB avail
                  64 active+clean

Scaling out

This covers adding OSDs and Monitors.

Adding a Monitor

Run the following command to add a monitor on node2:

ceph-deploy mon add node2

The following error appears:

[node2][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
[node2][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Solution:

In this case the cause was that ceph.conf lacked a public network setting. On the admin node, add the appropriate subnet under [global] in the ceph.conf file in the my-cluster directory:

[global]
fsid = b8b4aa68-d825-43e9-a60a-781c92fec20e
mon_initial_members = node1
mon_host = 192.168.197.154
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network =192.168.197.0/24
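
Before pushing the configuration, it can help to confirm that the monitor address really lies inside the subnet given as public network. The following is a minimal sketch using only the Go standard library; the function name inPublicNetwork is an illustrative assumption, not part of Ceph or ceph-deploy.

```go
package main

import (
	"fmt"
	"net"
)

// inPublicNetwork reports whether ip lies inside the CIDR range cidr.
// (Hypothetical helper for illustration; Ceph does not provide it.)
func inPublicNetwork(ip, cidr string) (bool, error) {
	addr := net.ParseIP(ip)
	if addr == nil {
		return false, fmt.Errorf("invalid IP address %q", ip)
	}
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	return network.Contains(addr), nil
}

func main() {
	// Values taken from the ceph.conf example above.
	ok, err := inPublicNetwork("192.168.197.154", "192.168.197.0/24")
	fmt.Println(ok, err)
}
```

If the check returns false, the mon_host and public network values in ceph.conf disagree and should be reconciled before the configuration is pushed.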

Then push the modified configuration to every node with the command below; otherwise an error is reported:

ceph-deploy --overwrite-conf config push node1 node2 node3

After that, the monitor can be added normally.

Adding OSDs

Adding an OSD takes two steps:

Prepare the OSD by running the prepare command:

ceph-deploy osd prepare {ceph-node}:{device}

For example:

ceph-deploy osd prepare node1:/dev/vdc

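Activate the OSD by running the activate command, passing the data partition that prepare created (as in the earlier activate example):

ceph-deploy osd activate {ceph-node}:{data-partition}

For example:

ceph-deploy osd activate node1:/dev/vdc1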

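Once you add a new OSD, the Ceph cluster starts rebalancing and migrates placement groups to the new OSD. During this time the placement group states change from active+clean to active, with some degraded objects; when migration completes they return to active+clean.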

Summary

With a Ceph cluster deployed by hand, the first step in learning Ceph is complete.

2020/09/09 posted in ceph storage

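Building multi-architecture Docker images for mixed-architecture clusters

For a long time I wondered why an image could only be used on one architecture, and how to handle images in a multi-architecture Kubernetes cluster. After some research, I found a solution.

What is a multi-architecture cluster?

Many companies are now choosing domestically produced servers, whose CPUs mostly use less common architectures such as ARM or MIPS, while x86 remains the most widely used server architecture. For various practical reasons, introducing these servers means running machines of several architectures in a single Kubernetes cluster.

Ideally, I think, a Kubernetes cluster should run on machines of any architecture and hide the underlying architecture from applications at the orchestration layer. One set of orchestration files should work for clusters of any architecture, with the image for the right architecture pulled according to the machine a pod is scheduled on. In other words, I should not need to care which machine my pod ends up on or what architecture it has; I just hand the orchestration files to the cluster and I am done.

Building multi-architecture images is a key part of achieving this.

How multi-architecture images work

Docker's stated goal is "Build and Ship any Application Anywhere", which is out of reach unless the multi-architecture image problem is solved.

Implementing multi-architecture support: manifests

What is a manifest? Let's start from the Docker source code: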

// A ManifestDescriptor references a platform-specific manifest.
type ManifestDescriptor struct {
    distribution.Descriptor

    // Platform specifies which platform the manifest pointed to by the
    // descriptor runs on.
    Platform PlatformSpec `json:"platform"`
}

// Descriptor describes targeted content. Used in conjunction with a blob
// store, a descriptor can be used to fetch, store and target any kind of
// blob. The struct also describes the wire protocol format. Fields should
// only be added but never changed.
type Descriptor struct {
    // MediaType describe the type of the content. All text based formats are
    // encoded as utf-8.
    MediaType string `json:"mediaType,omitempty"`

    // Size in bytes of content.
    Size int64 `json:"size,omitempty"`

    // Digest uniquely identifies the content. A byte stream can be verified
    // against against this digest.
    Digest digest.Digest `json:"digest,omitempty"`

    // URLs contains the source URLs of this content.
    URLs []string `json:"urls,omitempty"`

    // NOTE: Before adding a field here, please ensure that all
    // other options have been exhausted. Much of the type relationships
    // depend on the simplicity of this type.
}

// PlatformSpec specifies a platform where a particular image manifest is
// applicable.
type PlatformSpec struct {
    // Architecture field specifies the CPU architecture, for example
    // `amd64` or `ppc64`.
    Architecture string `json:"architecture"`

    // OS specifies the operating system, for example `linux` or `windows`.
    OS string `json:"os"`

    // OSVersion is an optional field specifying the operating system
    // version, for example `10.0.10586`.
    OSVersion string `json:"os.version,omitempty"`

    // OSFeatures is an optional field specifying an array of strings,
    // each listing a required OS feature (for example on Windows `win32k`).
    OSFeatures []string `json:"os.features,omitempty"`

    // Variant is an optional field specifying a variant of the CPU, for
    // example `ppc64le` to specify a little-endian version of a PowerPC CPU.
    Variant string `json:"variant,omitempty"`

    // Features is an optional field specifying an array of strings, each
    // listing a required CPU feature (for example `sse4` or `aes`).
    Features []string `json:"features,omitempty"`
}

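As you can see, a manifest entry records which CPU architecture, operating system type, operating system version and so on an image can run on. A multi-arch image can be regarded as several images; the registry simply stores the manifest information above alongside the image data.

The manifest list of a busybox image: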

cxwen@cxw:~$ docker manifest inspect busybox
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:55dec6dbd4b329ef2bfc2e104ab6ee57ef1a91f15c8bd324650b34756f43ad61",
         "platform": {
            "architecture": "arm",
            "os": "linux",
            "variant": "v5"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:8bec2de1c91b986218004f65a7ef40989ac9827e80ed02c7ac5cd18058213ba7",
         "platform": {
            "architecture": "arm",
            "os": "linux",
            "variant": "v6"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:2e6dd9846f4022bd771bd7371114d60cf032e15bd160ab6172b776f9fc49812c",
         "platform": {
            "architecture": "arm",
            "os": "linux",
            "variant": "v7"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:614f2e7b8fbab8a23bef168e7058739180a7a15d17c583bdfcbdb647d9798079",
         "platform": {
            "architecture": "arm64",
            "os": "linux",
            "variant": "v8"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:13786684a2d2c684562fd93652fe803fad2ca9fc596ea793ca67b6bbeb2c4730",
         "platform": {
            "architecture": "386",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 527,
         "digest": "sha256:1d67a71f84422fa78f03573a44a419d023682b1c6aa1c380c9d84f69cc79e7f6",
         "platform": {
            "architecture": "mips64le",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 528,
         "digest": "sha256:49c781843a30b2af8fff421e4f2dde9365fb778d5ce11a88991d1ab6056d8f40",
         "platform": {
            "architecture": "ppc64le",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 528,
         "digest": "sha256:2ed8bd58e1966fc775f4ba7e97a56ceb1fdd96706779ae8bcb59bc8dbae21be8",
         "platform": {
            "architecture": "s390x",
            "os": "linux"
         }
      }
   ]
}
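
How a client might choose an entry from a manifest list like the one above can be sketched as follows. This is a simplified illustration, not Docker's actual matching code: the type names, the pickDigest function and the shortened digests in the sample are assumptions made for the example, and the real client applies more elaborate rules (for instance when falling back between ARM variants).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// platform and manifestEntry keep only the fields needed for matching.
type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
	Variant      string `json:"variant,omitempty"`
}

type manifestEntry struct {
	Digest   string   `json:"digest"`
	Platform platform `json:"platform"`
}

type manifestList struct {
	Manifests []manifestEntry `json:"manifests"`
}

// pickDigest returns the digest of the first entry whose OS and
// architecture match. The variant is compared only when one is requested.
func pickDigest(raw []byte, os, arch, variant string) (string, error) {
	var list manifestList
	if err := json.Unmarshal(raw, &list); err != nil {
		return "", err
	}
	for _, m := range list.Manifests {
		if m.Platform.OS != os || m.Platform.Architecture != arch {
			continue
		}
		if variant != "" && m.Platform.Variant != variant {
			continue
		}
		return m.Digest, nil
	}
	return "", fmt.Errorf("no manifest for %s/%s", os, arch)
}

// sample is a shortened, hypothetical manifest list (digests are placeholders).
const sample = `{"manifests":[
  {"digest":"sha256:aaa","platform":{"architecture":"amd64","os":"linux"}},
  {"digest":"sha256:bbb","platform":{"architecture":"arm","os":"linux","variant":"v7"}}
]}`

func main() {
	digest, err := pickDigest([]byte(sample), "linux", "arm", "v7")
	fmt.Println(digest, err)
}
```

Once a digest is chosen, the client fetches the platform-specific manifest it points to and pulls that image's layers as usual.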

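The original post illustrated the client's image-pull flow here with a flowchart borrowed from another author.

How to build multi-architecture images

Docker already provides a command-line tool for building multi-architecture images: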

cxwen@cxw:~$ docker manifest --help

Usage:  docker manifest COMMAND

The **docker manifest** command has subcommands for managing image manifests and
manifest lists. A manifest list allows you to use one name to refer to the same image
built for multiple architectures.

To see help for a subcommand, use:

    docker manifest CMD --help

For full details on using docker manifest lists, see the registry v2 specification.

Commands:
  annotate    Add additional information to a local image manifest
  create      Create a local manifest list for annotating and pushing to a registry
  inspect     Display an image manifest, or manifest list
  push        Push a manifest list to a repository

Run 'docker manifest COMMAND --help' for more information on a command.

Experimental features must be enabled to use it:

$vim ~/.docker/config.json
{
    "experimental": "enabled"
}

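Steps

1. Build an image for each architecture and push each one to the registry

2. Create the manifest list

docker manifest create {multi-arch image name} {amd64 image name} {arm64 image name} {mips64le image name} ......

For example: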

docker manifest create xwcheng/kube-proxy:v1.15.9 xwcheng/kube-proxy-amd64:v1.15.9 xwcheng/kube-proxy-arm64:v1.15.9 xwcheng/kube-proxy-mips64le:v1.15.9

3. Annotate the manifest for each architecture

For example:

docker manifest annotate xwcheng/kube-proxy:v1.15.9 xwcheng/kube-proxy-amd64:v1.15.9 --os linux --arch amd64

docker manifest annotate xwcheng/kube-proxy:v1.15.9 xwcheng/kube-proxy-arm64:v1.15.9 --os linux --arch arm64 --variant unknown

docker manifest annotate xwcheng/kube-proxy:v1.15.9 xwcheng/kube-proxy-mips64le:v1.15.9 --os linux --arch mips64le

Note: in my experience arm64 needs --variant unknown here, otherwise the image cannot be pulled.

4. Push the manifest list to the registry

docker manifest push xwcheng/kube-proxy:v1.15.9

After these steps, the single image name xwcheng/kube-proxy:v1.15.9 can be used on machines of all these architectures.

2020/07/11 posted in docker

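A beginner's guide to Kubernetes CRD development

What is a CRD?

CRD stands for Custom Resource Definition, an open mechanism Kubernetes provides for extending its API.

Here is my understanding:

What we usually call an Operator is a CRD plus a Controller. A CRD is itself a resource type provided by Kubernetes, through which we register custom resource types with Kubernetes. A CRD on its own does nothing; for a custom resource type to behave like a built-in Kubernetes resource, a Controller must be developed to manage and schedule it and to realize the state declared in the resource. What we actually use is the CR (Custom Resource).

An example

Many projects now make heavy use of CRDs. For a clearer picture, take the matrix project as an example: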

[root@centos01 ~]# kubectl get crd | grep crd.cxwen.com
dns.crd.cxwen.com                                     2020-06-30T06:21:09Z
etcdclusters.crd.cxwen.com                            2020-06-30T06:21:09Z
masters.crd.cxwen.com                                 2020-06-30T06:21:09Z
matrices.crd.cxwen.com                                2020-06-30T06:21:09Z
networkplugins.crd.cxwen.com                          2020-06-30T06:21:09Z

As you can see, matrix defines a number of CRDs. Take masters.crd.cxwen.com as an example and look at what a CRD actually defines.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: masters.crd.cxwen.com
spec:
  conversion:
    strategy: None
  group: crd.cxwen.com
  names:
    kind: Master
    listKind: MasterList
    plural: masters
    singular: master
  preserveUnknownFields: true
  scope: Namespaced
  versions:
  - additionalPrinterColumns:       # columns printed by kubectl get
    - description: version
      jsonPath: .spec.version
      name: VERSION
      type: string
    - description: pod replicas
      jsonPath: .spec.replicas
      name: REPLICAS
      type: string
    - description: etcdcluster name
      jsonPath: .spec.etcdCluster
      name: ETCD
      type: string
    - description: expose type
      jsonPath: .spec.expose.method
      name: EXPOSETYPE
      type: string
    - description: expose node
      jsonPath: .spec.expose.node
      name: EXPOSENODE
      type: string
    - description: expose port
      jsonPath: .spec.expose.port
      name: EXPOSEPORT
      type: string
    - description: phase
      jsonPath: .status.phase
      name: PHASE
      type: string
    - jsonPath: .metadata.creationTimestamp
      name: AGE
      type: date
    name: v1
    schema:                       # schema of the custom resource
      openAPIV3Schema:
        description: Master is the Schema for the masters API
        properties:
          apiVersion:
            type: string
          kind:
            type: string
          metadata:
            type: object
          spec:
            description: MasterSpec defines the desired state of Master
            properties:
              etcdCluster:
                type: string
              expose:
                properties:
                  method:
                    type: string
                  node:
                    items:
                      type: string
                    type: array
                  port:
                    type: string
                type: object
              imageRegistry:
                type: string
              imageRepo:
                properties:
                  apiserver:
                    type: string
                  controllerManager:
                    type: string
                  proxy:
                    type: string
                  scheduler:
                    type: string
                type: object
              replicas:
                type: integer
              version:
                description: Foo is an example field of Master. Edit Master_types.go
                  to remove/update
                type: string
            type: object
          status:
            description: MasterStatus defines the observed state of Master
            properties:
              adminKubeconfig:
                type: string
              exposeUrl:
                items:
                  type: string
                type: array
              phase:
                description: 'INSERT ADDITIONAL STATUS FIELD - define observed state
                  of cluster Important: Run "make" to regenerate code after modifying
                  this file'
                type: string
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}

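A CRD mainly defines two things:

1. additionalPrinterColumns

As the name suggests, additionalPrinterColumns are extra columns to print: they set which fields are shown when a custom resource is viewed with kubectl get. For example: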

[root@centos01 ~]# kubectl get master
NAME         VERSION    REPLICAS   ETCD         EXPOSETYPE   EXPOSENODE         EXPOSEPORT   PHASE   AGE
example-km   v1.15.12   1          example-ec   NodePort     [192.168.83.128]   31299        Ready   12m

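The fields defined in the CRD above are printed here.

2. schema

The schema defines the pattern, or specification, of the Custom Resource: the data type of each property of a CR. A CR must conform to the definition in the CRD to be created successfully.

There are two cases:

The property is a primitive type

type gives the property's type directly; for example, apiserver is a string: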

apiserver:
  type: string

The property is a struct type

type is object, and properties describes the types of the nested fields, as for imageRepo:

  imageRepo:
    type: object
    properties:
      apiserver:
        type: string
      controllerManager:
        type: string
      proxy:
        type: string
      scheduler:
        type: string
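
In a kubebuilder project, schema entries like these are typically generated from Go struct types rather than written by hand. The sketch below shows Go types that would correspond to the imageRepo fragment above and serializes one instance to JSON. MasterSpec here is a reduced, hypothetical version of the real type, and encodeSpec is a helper introduced only for the illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImageRepo corresponds to the imageRepo object in the schema:
// an object whose properties are all strings.
type ImageRepo struct {
	Apiserver         string `json:"apiserver,omitempty"`
	ControllerManager string `json:"controllerManager,omitempty"`
	Proxy             string `json:"proxy,omitempty"`
	Scheduler         string `json:"scheduler,omitempty"`
}

// MasterSpec is a reduced, hypothetical spec with two primitive
// fields and one struct field.
type MasterSpec struct {
	Version   string    `json:"version,omitempty"`
	Replicas  int       `json:"replicas,omitempty"`
	ImageRepo ImageRepo `json:"imageRepo,omitempty"`
}

// encodeSpec renders a spec as JSON, the form in which it is sent to the API server.
func encodeSpec(spec MasterSpec) (string, error) {
	out, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	spec := MasterSpec{
		Version:   "v1.15.12",
		Replicas:  1,
		ImageRepo: ImageRepo{Apiserver: "kube-apiserver"},
	}
	out, err := encodeSpec(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The json tags decide the property names that appear in the schema, which is why the Go field ControllerManager shows up as controllerManager in the YAML.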

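To summarize:

| Term | Description |
| --- | --- |
| Operator | CRD + Controller |
| CRD (Custom Resource Definition) | Defines the properties of a custom resource and registers it with Kubernetes |
| CR (Custom Resource) | An actual resource, defined through a CRD, that can be used in Kubernetes much like built-in resources such as Pod or Deployment |
| Controller | Watches CRUD events on the custom resources and adds custom business logic, making sure the current state of the resources it tracks moves toward the desired state |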

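How do you develop one?

Now that we know what a CRD is, how do we develop one?

The CRD definition above is long and looks complicated, but there is no need to worry: it can all be generated by tools rather than written by hand. The tool for the job is kubebuilder. If you have time, the kubebuilder documentation is worth reading; it covers everything in detail.

Scaffolding a CRD project with kubebuilder

Setting up the environment

You can install the tools directly on a machine, or run a container from a prebuilt kubebuilder image and work inside it. The container approach is recommended because it is much more convenient.

Installing directly on the machine

Install Go

A Go environment is required first. If you are unsure how to install it, see the tutorial "Go语言环境安装" (Go environment installation). If the official Go site is unreachable from your network, the installer can be downloaded from Go语言中文网 (the Go Chinese community site).

Install kubebuilder

Next, install kubebuilder. On Linux, use the following commands: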

wget https://github.com/kubernetes-sigs/kubebuilder/releases/download/v2.3.1/kubebuilder_2.3.1_linux_amd64.tar.gz
tar -zxvf kubebuilder_2.3.1_linux_amd64.tar.gz
cp kubebuilder_2.3.1_linux_amd64/bin/kubebuilder /usr/local/bin/

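Other versions and architectures are installed the same way; download the package you need.

Install kustomize

kustomize is a YAML rendering tool that kubebuilder depends on to render YAML files. The -L option makes curl follow the redirect that GitHub release downloads return.

curl -L https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv3.6.1/kustomize_v3.6.1_linux_amd64.tar.gz | tar -zxv -C /usr/local/bin/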

Running in a container

Pull the image xwcheng/kubebuilder:2.3.1

docker pull xwcheng/kubebuilder:2.3.1

Start the container

docker run -d --name kubebuilder -v /go/src/github.com/cxwen/:/go/src/github.com/cxwen -e GOPATH=/go -e GOROOT=/usr/local/go xwcheng/kubebuilder:2.3.1 sh -c "while true; do sleep 1000000000; done"

A host directory can be mounted into the container, as the -v option above does.

Enter the container

docker exec -ti kubebuilder bash

Scaffolding the code

1. Initialize the project

In mainland China, because of network restrictions, it is best to run export GOPROXY=https://goproxy.io first.

mkdir -p crd/
cd crd/
kubebuilder init --domain cxwen.com --license apache2 --owner "cxwen"

Writing scaffold for you to edit...
Get controller runtime:
$ go get sigs.k8s.io/controller-runtime@v0.5.0
go: finding sigs.k8s.io/controller-runtime v0.5.0
......
Update go.mod:
$ go mod tidy
go: downloading github.com/go-logr/zapr v0.1.0
......
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io v0.2.5
......
/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go
Next: define a resource with:
$ kubebuilder create api

After generation, the following files and directories appear:

[root@centos01 ~]# tree
.
├── bin
│   └── manager
├── config
│   ├── certmanager
│   │   ├── certificate.yaml
│   │   ├── kustomization.yaml
│   │   └── kustomizeconfig.yaml
│   ├── default
│   │   ├── kustomization.yaml
│   │   ├── manager_auth_proxy_patch.yaml
│   │   ├── manager_webhook_patch.yaml
│   │   └── webhookcainjection_patch.yaml
│   ├── manager
│   │   ├── kustomization.yaml
│   │   └── manager.yaml
│   ├── prometheus
│   │   ├── kustomization.yaml
│   │   └── monitor.yaml
│   ├── rbac
│   │   ├── auth_proxy_client_clusterrole.yaml
│   │   ├── auth_proxy_role_binding.yaml
│   │   ├── auth_proxy_role.yaml
│   │   ├── auth_proxy_service.yaml
│   │   ├── kustomization.yaml
│   │   ├── leader_election_role_binding.yaml
│   │   ├── leader_election_role.yaml
│   │   └── role_binding.yaml
│   └── webhook
│       ├── kustomization.yaml
│       ├── kustomizeconfig.yaml
│       └── service.yaml
├── Dockerfile
├── go.mod
├── go.sum
├── hack
│   └── boilerplate.go.txt
├── main.go
├── Makefile
└── PROJECT

9 directories, 30 files

2. Create the CRD

[root@centos01 ~]# kubebuilder create api --group crd --version v1 --kind TestCrd
Create Resource [y/n]
y
Create Controller [y/n]
y
Writing scaffold for you to edit...
api/v1/testcrd_types.go
controllers/testcrd_controller.go
Running make:
$ make
/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go

Afterwards, two new directories, api and controllers, appear in the project.

[root@centos01 ~]# tree api/
api/
└── v1
    ├── groupversion_info.go
    ├── testcrd_types.go
    └── zz_generated.deepcopy.go

1 directory, 3 files

[root@centos01 ~]# tree controllers/
controllers/
├── suite_test.go
└── testcrd_controller.go

0 directories, 2 files

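Code walkthrough

Structs: the flesh and blood of a CRD

The api directory holds the structs of the Custom Resource. As shown below, TestCrdSpec holds the custom fields that go under spec in the YAML file, and TestCrdStatus holds the custom fields that go under status.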

// TestCrdSpec defines the desired state of TestCrd
type TestCrdSpec struct {
    // INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
    // Important: Run "make" to regenerate code after modifying this file

    // Foo is an example field of TestCrd. Edit TestCrd_types.go to remove/update
    Foo string `json:"foo,omitempty"`
}

// TestCrdStatus defines the observed state of TestCrd
type TestCrdStatus struct {
    // INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
    // Important: Run "make" to regenerate code after modifying this file
}

// +kubebuilder:object:root=true

// TestCrd is the Schema for the testcrds API
type TestCrd struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   TestCrdSpec   `json:"spec,omitempty"`
    Status TestCrdStatus `json:"status,omitempty"`
}

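What is the difference between spec and status?

See the section on object spec and status in the official Kubernetes documentation:

Every Kubernetes object includes two nested object fields that govern the object's configuration: the object spec and the object status. The spec, which is required, describes the desired state of the object, that is, the characteristics you want it to have. The status describes the actual state of the object and is supplied and updated by the Kubernetes system. At any given time, the Kubernetes control plane actively manages an object's actual state to match its desired state.

For the CRDs we develop, it can be understood like this.

spec is the predefined desired state: the preset information that the controller reads and acts on in order to reach that state. The controller's code should generally not modify the information in spec.
status holds fields describing the current actual state. For example, suppose a Custom Resource has three lifecycle states: initializing, ready and terminating. A phase field in status can represent them. When the CR has just been created, the controller sets phase to initializing; once initialization finishes and health checks pass, so that it can serve normally, the controller sets phase to ready; when the CR has served its purpose and enters its final stage, the controller sets phase to terminating and then cleans up resources. All of these transitions must, of course, be implemented in the controller code.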
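
The phase transitions described above can be sketched as a small pure function that a controller might call before writing status. This is a minimal sketch under stated assumptions: the Phase type, its constants and nextPhase are hypothetical names, and the two boolean inputs stand in for whatever checks a real controller performs.

```go
package main

import "fmt"

// Phase is the value stored in the status phase field.
type Phase string

const (
	PhaseInitializing Phase = "initializing"
	PhaseReady        Phase = "ready"
	PhaseTerminating  Phase = "terminating"
)

// nextPhase derives the phase to record in status from two observations:
// whether the object is being deleted, and whether its health checks pass.
func nextPhase(deleting, healthy bool) Phase {
	switch {
	case deleting:
		return PhaseTerminating
	case healthy:
		return PhaseReady
	default:
		return PhaseInitializing
	}
}

func main() {
	fmt.Println(nextPhase(false, false)) // just created
	fmt.Println(nextPhase(false, true))  // health checks pass
	fmt.Println(nextPhase(true, true))   // deletion requested
}
```

Keeping the decision in a pure function like this makes the state logic easy to unit test separately from the API calls that persist status.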

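Reconcile: the brain of a CRD

If structs are the flesh and blood of a CRD, the Reconcile method in the controller is its brain: all of the CRD's behavior is driven by this method, which is also where the key part of our implementation lives. For each CRD, a source file named {CRD name}_controller.go is generated in the controllers directory, and it contains the Reconcile method.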

func (r *TestCrdReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    _ = context.Background()
    _ = r.Log.WithValues("testcrd", req.NamespacedName)

    // your logic here

    return ctrl.Result{}, nil
}

Below is a template for implementing Reconcile:

func (r *TestCrdReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    var err error
    ctx := context.Background()                            // context for API calls
    log := r.Log.WithValues("testcrd", req.NamespacedName) // logger scoped to this object

    log.V(1).Info("TestCrd reconcile triggering")

    // Fetch the TestCrd object from Kubernetes.
    // IgnoreNotFound, ContainsString and RemoveString are helper functions
    // that are not shown here and must be defined in the project.
    testCrd := crdv1.TestCrd{}
    if err = r.Get(ctx, req.NamespacedName, &testCrd); err != nil {
        if IgnoreNotFound(err) != nil {
            log.Error(err, "unable to fetch testCrd")
            return ctrl.Result{}, err
        }

        // The object no longer exists; nothing to do.
        return ctrl.Result{}, nil
    }

    testCrdFinalizer := "crd.cxw.com"
    // DeletionTimestamp tells a deletion apart from a create or update.
    if testCrd.ObjectMeta.DeletionTimestamp.IsZero() {
        // A missing finalizer means the object has just been created.
        if !ContainsString(testCrd.ObjectMeta.Finalizers, testCrdFinalizer) {
            testCrd.ObjectMeta.Finalizers = append(testCrd.ObjectMeta.Finalizers, testCrdFinalizer)
            // Persist the finalizer so that deletion waits for cleanup.
            if err = r.Update(ctx, &testCrd); err != nil {
                return ctrl.Result{}, err
            }

            // your creation logic

        } else {
            // your update logic
        }
    } else {
        if ContainsString(testCrd.ObjectMeta.Finalizers, testCrdFinalizer) {

            // your deletion logic

            testCrd.ObjectMeta.Finalizers = RemoveString(testCrd.ObjectMeta.Finalizers, testCrdFinalizer)
            if err = r.Update(ctx, &testCrd); err != nil {
                return ctrl.Result{}, err
            }
        }
    }

    return ctrl.Result{}, nil
}
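
The template above calls ContainsString and RemoveString without defining them. A minimal sketch of the two helpers follows; it assumes plain string-slice semantics and is not necessarily how the author's project implements them. IgnoreNotFound is omitted because it depends on the Kubernetes API error types.

```go
package main

import "fmt"

// ContainsString reports whether s is present in slice.
func ContainsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

// RemoveString returns a new slice with every occurrence of s removed.
func RemoveString(slice []string, s string) []string {
	result := make([]string, 0, len(slice))
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}

func main() {
	// "other.example.com" is a made-up second finalizer for the demo.
	finalizers := []string{"other.example.com", "crd.cxw.com"}
	fmt.Println(ContainsString(finalizers, "crd.cxw.com"))
	fmt.Println(RemoveString(finalizers, "crd.cxw.com"))
}
```

RemoveString builds a new slice instead of editing in place, so the caller's original finalizer list is left untouched until the result is assigned back.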

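Comments are code too

Specially formatted comments (markers) in the code can grant permissions and add extra printer columns. See the matrix project for concrete usage.

1. Granting permissions to the controller

Add the markers above the Reconcile method, for example: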

// +kubebuilder:rbac:groups=crd.cxwen.com,resources=testcrds,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=crd.cxwen.com,resources=testcrds/status,verbs=get;update;patch

func (r *TestCrdReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    _ = context.Background()
    _ = r.Log.WithValues("testcrd", req.NamespacedName)

    // your logic here

    return ctrl.Result{}, nil
}

2. Adding extra printer columns

Add the markers above the CRD struct, for example:

// +kubebuilder:printcolumn:name="VERSION",type="string",JSONPath=".spec.version",description="version"
// +kubebuilder:printcolumn:name="REPLICAS",type="string",JSONPath=".spec.replicas",description="pod replicas"
// +kubebuilder:printcolumn:name="ETCD",type="string",JSONPath=".spec.etcdCluster",description="etcdcluster name"
// +kubebuilder:printcolumn:name="EXPOSETYPE",type="string",JSONPath=".spec.expose.method",description="expose type"
// +kubebuilder:printcolumn:name="EXPOSENODE",type="string",JSONPath=".spec.expose.node",description="expose node"
// +kubebuilder:printcolumn:name="EXPOSEPORT",type="string",JSONPath=".spec.expose.port",description="expose port"
// +kubebuilder:printcolumn:name="PHASE",type="string",JSONPath=".status.phase",description="phase"
// +kubebuilder:printcolumn:name="AGE",type="date",JSONPath=".metadata.creationTimestamp"

// Master is the Schema for the masters API
type Master struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   MasterSpec   `json:"spec,omitempty"`
    Status MasterStatus `json:"status,omitempty"`
}

Finalizer

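Finalizers build on the way Kubernetes handles deletion.

When the API server receives a delete request for a resource object whose finalizer list is empty, Kubernetes deletes the object right away; it no longer exists in etcd and can no longer be queried through the API.

When the finalizer list is not empty, Kubernetes does not delete the object directly. Instead it sets the DeletionTimestamp field in the object's ObjectMeta and deletes the object only once the finalizer list becomes empty.

So when a CRD needs to clean up resources on deletion, add a finalizer at creation time. When DeletionTimestamp is found to be set, the object is being deleted: run the cleanup, then remove the finalizer.

Adding a finalizer is simple, and its value can be any string. To make it serve as a marker, though, use a meaningful string.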
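
The deletion behavior described above can be illustrated with a small in-memory simulation. This is only a sketch of the rule, not Kubernetes code: the object and store types and both methods are hypothetical names invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// object keeps only the metadata relevant to finalizers.
type object struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time
}

// store stands in for the API server's storage, keyed by object name.
type store map[string]*object

// requestDelete mimics a delete call: an object without finalizers is
// removed at once, otherwise it is only marked with a deletion timestamp.
func (s store) requestDelete(name string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	if len(obj.Finalizers) == 0 {
		delete(s, name)
		return
	}
	if obj.DeletionTimestamp == nil {
		now := time.Now()
		obj.DeletionTimestamp = &now
	}
}

// removeFinalizer drops one finalizer and completes the deletion when the
// object is marked for deletion and no finalizers remain.
func (s store) removeFinalizer(name, finalizer string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	kept := obj.Finalizers[:0]
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	if obj.DeletionTimestamp != nil && len(obj.Finalizers) == 0 {
		delete(s, name)
	}
}

func main() {
	s := store{"demo": {Name: "demo", Finalizers: []string{"crd.cxw.com"}}}
	s.requestDelete("demo")
	_, exists := s["demo"]
	fmt.Println("exists after delete request:", exists)
	s.removeFinalizer("demo", "crd.cxw.com")
	_, exists = s["demo"]
	fmt.Println("exists after finalizer removal:", exists)
}
```

The window between the two calls is where a controller runs its cleanup: the object is still readable, but its deletion timestamp signals that it is on its way out.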

OwnerReference

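When the Kubernetes garbage collector deletes an object, every object whose ownerReference points to that object is removed as well.

The official Kubernetes documentation describes it as follows.

Some Kubernetes objects are owners of other objects. For example, a ReplicaSet is the owner of a set of Pods. The owned objects are called dependents of the owner object. Every dependent object has a metadata.ownerReferences field that points to the owning object.
Sometimes Kubernetes sets the value of ownerReference automatically. For example, when you create a ReplicaSet, Kubernetes automatically sets the ownerReference field of each Pod in the ReplicaSet. In Kubernetes 1.8, Kubernetes automatically sets the value of ownerReference for objects created or managed by ReplicationController, ReplicaSet, StatefulSet, DaemonSet, Deployment, Job and CronJob. You can also specify relationships between owners and dependents by setting the ownerReference field manually.

To add an OwnerReference, taking a Deployment as an example, set OwnerReferences in its ObjectMeta: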

Deployment{
    TypeMeta: metav1.TypeMeta{
        APIVersion: "apps/v1",
        Kind:       "Deployment",
    },
    ObjectMeta: metav1.ObjectMeta{
        Name:      test,
        Namespace: test,
        OwnerReferences: []metav1.OwnerReference{
            *metav1.NewControllerRef(app, schema.GroupVersionKind{
                Group: v1.SchemeGroupVersion.Group,
                Version: v1.SchemeGroupVersion.Version,
                Kind: "TestCrd",
            }),
        },
    },
    ......
}
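
The effect of owner references on garbage collection can be sketched with an in-memory model. This is an illustration of the cascading rule only, not the real garbage collector: the resource type, the collect function and the UIDs are all hypothetical, and the sketch treats an object as collectable once none of its listed owners exists.

```go
package main

import "fmt"

// resource keeps only a UID and the UIDs of its owners.
type resource struct {
	UID       string
	OwnerUIDs []string
}

// collect deletes the object with the given UID, then repeatedly removes
// every object that has owners but none of them still exists.
func collect(objs map[string]resource, deleted string) {
	delete(objs, deleted)
	for changed := true; changed; {
		changed = false
		for uid, r := range objs {
			if len(r.OwnerUIDs) == 0 {
				continue // objects without owners are never collected
			}
			ownerAlive := false
			for _, owner := range r.OwnerUIDs {
				if _, ok := objs[owner]; ok {
					ownerAlive = true
					break
				}
			}
			if !ownerAlive {
				delete(objs, uid)
				changed = true
			}
		}
	}
}

func main() {
	objs := map[string]resource{
		"testcrd":    {UID: "testcrd"},
		"deployment": {UID: "deployment", OwnerUIDs: []string{"testcrd"}},
		"replicaset": {UID: "replicaset", OwnerUIDs: []string{"deployment"}},
		"pod":        {UID: "pod", OwnerUIDs: []string{"replicaset"}},
		"unrelated":  {UID: "unrelated"},
	}
	collect(objs, "testcrd")
	fmt.Println(len(objs))
}
```

Deleting the custom resource removes the whole chain beneath it, which is why setting an OwnerReference on the Deployment is enough to have its ReplicaSets and Pods cleaned up too.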

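Summary

When developing a CRD:

Use marker comments to configure the controller's permissions and to add extra printer columns
Use finalizers to clean up resources
Use OwnerReferences to manage dependencies between objects

For other development techniques, see the kubebuilder documentation.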

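A quick look at the source

The code structure generated by kubebuilder is built mainly on the controller-runtime Go library, which wraps controller operations nicely and makes CRD development very convenient.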

main.go

var (
    scheme   = runtime.NewScheme()
    setupLog = ctrl.Log.WithName("setup")
)

// Register types with the scheme
func init() {
    _ = clientgoscheme.AddToScheme(scheme)

    _ = crdv1.AddToScheme(scheme)
    // +kubebuilder:scaffold:scheme
}

func main() {
    ......
    
    // Initialize the manager
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:             scheme,
        MetricsBindAddress: metricsAddr,
        Port:               9443,
        LeaderElection:     enableLeaderElection,
        LeaderElectionID:   "6ed5364d.cxwen.com",
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }

    // Initialize the controller
    if err = (&controllers.TestCrdReconciler{
        Client: mgr.GetClient(),
        Log:    ctrl.Log.WithName("controllers").WithName("TestCrd"),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "TestCrd")
        os.Exit(1)
    }
    // +kubebuilder:scaffold:builder

    // Start the manager
    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}

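main.go does two main things:

1. Register the scheme

For a controller to use the manager's client on a given resource type, that type must be registered in the scheme. As the init function shows, the types in crdv1 are registered in the scheme.

2. Create and start the manager

This manager manages controllers; a single manager can manage several controllers.

While the manager is created, two important objects are created as well: the cache and the client. Both are shared by all controllers in the manager.

Below is the function that creates the Manager: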

// New returns a new Manager for creating Controllers.
func New(config *rest.Config, options Options) (Manager, error) {
    // Initialize a rest.config if none was specified
    if config == nil {
        return nil, fmt.Errorf("must specify Config")
    }

    // Set default values for options fields
    options = setOptionsDefaults(options)

    // Create the mapper provider
    mapper, err := options.MapperProvider(config)
    if err != nil {
        log.Error(err, "Failed to get API Group-Resources")
        return nil, err
    }

    // Create the cache for the cached read client and registering informers
    cache, err := options.NewCache(config, cache.Options{Scheme: options.Scheme, Mapper: mapper, Resync: options.SyncPeriod, Namespace: options.Namespace})
    if err != nil {
        return nil, err
    }

    apiReader, err := client.New(config, client.Options{Scheme: options.Scheme, Mapper: mapper})
    if err != nil {
        return nil, err
    }

    writeObj, err := options.NewClient(cache, config, client.Options{Scheme: options.Scheme, Mapper: mapper})
    if err != nil {
        return nil, err
    }
    // Create the recorder provider to inject event recorders for the components.
    // TODO(directxman12): the log for the event provider should have a context (name, tags, etc) specific
    // to the particular controller that it's being injected into, rather than a generic one like is here.
    recorderProvider, err := options.newRecorderProvider(config, options.Scheme, log.WithName("events"), options.EventBroadcaster)
    if err != nil {
        return nil, err
    }

    // Create the resource lock to enable leader election)
    resourceLock, err := options.newResourceLock(config, recorderProvider, leaderelection.Options{
        LeaderElection:          options.LeaderElection,
        LeaderElectionID:        options.LeaderElectionID,
        LeaderElectionNamespace: options.LeaderElectionNamespace,
    })
    if err != nil {
        return nil, err
    }

    // Create the metrics listener. This will throw an error if the metrics bind
    // address is invalid or already in use.
    metricsListener, err := options.newMetricsListener(options.MetricsBindAddress)
    if err != nil {
        return nil, err
    }

    // Create health probes listener. This will throw an error if the bind
    // address is invalid or already in use.
    healthProbeListener, err := options.newHealthProbeListener(options.HealthProbeBindAddress)
    if err != nil {
        return nil, err
    }

    stop := make(chan struct{})

    return &controllerManager{
        config:                config,
        scheme:                options.Scheme,
        cache:                 cache,
        fieldIndexes:          cache,
        client:                writeObj,
        apiReader:             apiReader,
        recorderProvider:      recorderProvider,
        resourceLock:          resourceLock,
        mapper:                mapper,
        metricsListener:       metricsListener,
        internalStop:          stop,
        internalStopper:       stop,
        port:                  options.Port,
        host:                  options.Host,
        certDir:               options.CertDir,
        leaseDuration:         *options.LeaseDuration,
        renewDeadline:         *options.RenewDeadline,
        retryPeriod:           *options.RetryPeriod,
        healthProbeListener:   healthProbeListener,
        readinessEndpointName: options.ReadinessEndpointName,
        livenessEndpointName:  options.LivenessEndpointName,
    }, nil
}

cache

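The cache relies on an important Kubernetes toolkit: informers.

The cache's main job is to create the InformersMap. Each GVK in the scheme (the GroupVersionKind struct has three fields, Group, Version and Kind, which together uniquely identify a resource type) gets a corresponding Informer, and the informersByGVK map stores the GVK-to-Informer mapping. Each Informer lists and watches its GVK through its ListWatch function. The Reconcile method we write for a controller is ultimately hooked up through the informer's event handlers, so the informer can monitor resource events and trigger Reconcile.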

// newSpecificInformersMap returns a new specificInformersMap (like
// the generical InformersMap, except that it doesn't implement WaitForCacheSync).
func newSpecificInformersMap(config *rest.Config,
    scheme *runtime.Scheme,
    mapper meta.RESTMapper,
    resync time.Duration,
    namespace string,
    createListWatcher createListWatcherFunc) *specificInformersMap {
    ip := &specificInformersMap{
        config:            config,
        Scheme:            scheme,
        mapper:            mapper,
        informersByGVK:    make(map[schema.GroupVersionKind]*MapEntry), // maps each GVK to its informer
        codecs:            serializer.NewCodecFactory(scheme),
        paramCodec:        runtime.NewParameterCodec(scheme),
        resync:            resync,
        startWait:         make(chan struct{}),
        createListWatcher: createListWatcher,
        namespace:         namespace,
    }
    return ip
}

MapEntry holds the informer object that is eventually created.

// MapEntry contains the cached data for an Informer
type MapEntry struct {
    // Informer is the cached informer
    Informer cache.SharedIndexInformer

    // CacheReader wraps Informer and implements the CacheReader interface for a single type
    Reader CacheReader
}
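
The GVK-to-informer lookup and the handler registration described above can be modeled with a small standalone sketch. It uses only the standard library; `toyInformer`, `informersMap`, and the group name `crd.example.com` are hypothetical names made up for illustration, not controller-runtime APIs. The sketch also assumes informers are created on first lookup.

```go
package main

import (
	"fmt"
	"sync"
)

// GVK mirrors the three fields of schema.GroupVersionKind.
type GVK struct {
	Group, Version, Kind string
}

// toyInformer is a hypothetical stand-in for a shared informer: it only keeps
// a list of event handlers and calls them when an event is delivered.
type toyInformer struct {
	handlers []func(key string)
}

// AddEventHandler registers a callback, as a controller does for its work queue.
func (i *toyInformer) AddEventHandler(h func(key string)) {
	i.handlers = append(i.handlers, h)
}

// deliver simulates a watch event for the object identified by key.
func (i *toyInformer) deliver(key string) {
	for _, h := range i.handlers {
		h(key)
	}
}

// informersMap keeps one informer per GVK and reuses it on later lookups.
type informersMap struct {
	mu             sync.Mutex
	informersByGVK map[GVK]*toyInformer
}

func newInformersMap() *informersMap {
	return &informersMap{informersByGVK: make(map[GVK]*toyInformer)}
}

// Get returns the informer for gvk, creating it on first use.
func (m *informersMap) Get(gvk GVK) *toyInformer {
	m.mu.Lock()
	defer m.mu.Unlock()
	inf, ok := m.informersByGVK[gvk]
	if !ok {
		inf = &toyInformer{}
		m.informersByGVK[gvk] = inf
	}
	return inf
}

func main() {
	informers := newInformersMap()
	gvk := GVK{Group: "crd.example.com", Version: "v1", Kind: "TestCrd"} // hypothetical group

	// The controller registers a handler that would enqueue a reconcile request.
	informers.Get(gvk).AddEventHandler(func(key string) {
		fmt.Println("reconcile requested for", key)
	})

	// A second lookup returns the same informer, so the handler fires.
	informers.Get(gvk).deliver("default/sample")
}
```

Because every controller that watches the same GVK gets the same informer back from the map, one list/watch connection can serve all of them.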

client

As the code below shows, reads go through the cache created above, while writes use a client that talks to Kubernetes directly.

// defaultNewClient creates the default caching client
func defaultNewClient(cache cache.Cache, config *rest.Config, options client.Options) (client.Client, error) {
    // Create the Client for Write operations.
    c, err := client.New(config, options)
    if err != nil {
        return nil, err
    }

    return &client.DelegatingClient{
        Reader: &client.DelegatingReader{
            CacheReader:  cache,
            ClientReader: c,
        },
        Writer:       c,
        StatusClient: c,
    }, nil
}
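
One practical consequence of this split is that a read issued right after a write may not see the new value until the cache has caught up. The following is a minimal standard-library sketch of that behavior, not the controller-runtime implementation: `store`, `reader`, `writer`, and `delegatingClient` are hypothetical names, and the cache is synced by hand where the real system would rely on watch events.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// store is a toy key/value stand-in for either the API server or the local cache.
type store map[string]string

// Get reads a value, reporting errNotFound for a missing key.
func (s store) Get(key string) (string, error) {
	v, ok := s[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

// Update writes a value.
func (s store) Update(key, value string) { s[key] = value }

// reader and writer split the client the same way the delegating client does.
type reader interface {
	Get(key string) (string, error)
}

type writer interface {
	Update(key, value string)
}

// delegatingClient sends reads to one backend and writes to another.
type delegatingClient struct {
	reader
	writer
}

func main() {
	apiServer := store{} // hypothetical stand-in for the Kubernetes API server
	cache := store{}     // hypothetical stand-in for the informer-backed cache

	c := delegatingClient{reader: cache, writer: apiServer}

	c.Update("default/sample", "replicas=3")

	// The write reached the API server, but the cache has not seen it yet.
	_, err := c.Get("default/sample")
	fmt.Println("read before cache sync:", err)

	// In the real system a watch event updates the cache; here we copy by hand.
	cache["default/sample"] = apiServer["default/sample"]
	v, _ := c.Get("default/sample")
	fmt.Println("read after cache sync:", v)
}
```

The first read reports `not found` and the second returns the written value, which is why a Reconcile method should tolerate slightly stale reads.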

Running the CRD Reconcile

In main.go, the following code passes the Manager's client, scheme, and a logger to the reconciler for the CRD.

if err = (&controllers.TestCrdReconciler{
    Client: mgr.GetClient(),
    Log:    ctrl.Log.WithName("controllers").WithName("TestCrd"),
    Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
    setupLog.Error(err, "unable to create controller", "controller", "TestCrd")
    os.Exit(1)
}

// TestCrdReconciler reconciles a TestCrd object
type TestCrdReconciler struct {
    client.Client
    Log    logr.Logger
    Scheme *runtime.Scheme
}

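The Reconcile method can then use the client directly to perform CRUD operations.

Summary

One of Kubernetes's strengths is that its API can be extended with CRDs. Many projects today make heavy use of CRDs, including calico, istio, and kubevirt. If you need to build on top of Kubernetes, consider CRDs first: they are an elegant way to extend the platform and reflect where the Kubernetes ecosystem is heading. On top of that, kubebuilder, a powerful tool for CRD development, makes the work far more efficient.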

Resources

matrix GitHub: https://github.com/cxwen/matrix

kubebuilder GitHub: https://github.com/kubernetes-sigs/kubebuilder
kubebuilder documentation: https://book.kubebuilder.io/

Go environment setup: https://www.runoob.com/go/go-environment.html

Go installer downloads from studygolang.com (Go语言中文网): https://studygolang.com/dl

kustomize GitHub: https://github.com/kubernetes-sigs/kustomize

Kubernetes documentation, Object Spec and Status: https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/kubernetes-objects/#%E5%AF%B9%E8%B1%A1%E8%A7%84%E7%BA%A6-spec-%E4%B8%8E%E7%8A%B6%E6%80%81-status

controller-runtime GitHub: https://github.com/kubernetes-sigs/controller-runtime

2020/07/05 posted in kubernetes

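Kubernetes v1.10.0 high-availability cluster deployment

The Kubernetes project does not give a definitive deployment scheme for a highly available production cluster. After some investigation, we found that keepalived and haproxy can be used to deploy one. This approach has been validated in practice and has run fairly stably.

Machines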

IP	Role	Notes
172.28.79.11	master, etcd	Master node
172.28.79.12	master, etcd, keepalived, haproxy	Master node; also runs keepalived and haproxy to keep the masters highly available
172.28.79.13	master, etcd, keepalived, haproxy	Master node; also runs keepalived and haproxy to keep the masters highly available
172.28.79.14	node, etcd	Worker node
172.28.79.15	node, etcd	Worker node
172.28.79.16	node	Worker node
172.28.79.17	node	Worker node
172.28.79.18	node	Worker node
172.28.79.19	node	Worker node
172.28.79.20	node	Worker node
172.28.79.21	node	Worker node
172.28.79.22	node	Worker node
172.28.79.23	node	Worker node
172.28.79.24	node, harbor	Worker node
172.28.79.25	node	Worker node

Basic machine configuration

Version information

Item	Version
OS version	CentOS Linux release 7.4.1708 (Core)
Kernel version	4.14.49

ntpd time synchronization configuration

/etc/ntp.conf

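# prefer marks the server to use preferentially
server ntp.aliyun.com prefer
# The server below has no prefer option and acts as a backup upstream NTP server. A public server is configured here; the voice cloud (语音云) environment can instead use the NTP IPs of its other two sites.
server cn.ntp.org.cn
# Every system clock's frequency is slightly off, which is why a machine drifts after running for a while. NTP automatically measures and corrects this error, but that is a slow process, so it records the measured error in the driftfile. That way the earlier results are not lost after a reboot.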
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

Kubernetes component configuration

Component versions

Component	Version
docker	Docker version 1.12.6, build 78d1802
kubernetes	v1.10.0
etcd	3.1.12
calico	v3.0.4
harbor	v1.2.0
keepalived	v1.3.5
haproxy	1.7

Configuration

Component configuration

docker

Configuration file: /usr/lib/systemd/system/docker.service

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H 0.0.0.0:2375 -H unix:///var/run/docker.sock --registry-mirror https://registry.docker-cn.com --insecure-registry 172.16.59.153 --insecure-registry hub.cxwen.cn --insecure-registry k8s.gcr.io --insecure-registry quay.io --default-ulimit core=0:0 --live-restore
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

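--registry-mirror: the registry mirror used by docker pull; setting it to https://registry.docker-cn.com speeds up pulls of Docker Hub images
--insecure-registry: marks a Docker registry as insecure, so it can be reached over plain HTTP or without TLS verification
--default-ulimit: the default ulimit settings for containers
--live-restore: with this option enabled, containers keep running when the dockerd service has a problem; once the daemon recovers, it picks the containers up again and can manage them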

kubernetes

etcd

The manifest below is for node 172.28.79.11; the other nodes are similar:

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: etcd
    tier: control-plane
  name: etcd-172.28.79.11
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --name=infra0
    - --initial-advertise-peer-urls=http://172.28.79.11:2380
    - --listen-peer-urls=http://172.28.79.11:2380
    - --listen-client-urls=http://172.28.79.11:2379,http://127.0.0.1:2379
    - --advertise-client-urls=http://172.28.79.11:2379
    - --data-dir=/var/lib/etcd
    - --initial-cluster-token=etcd-cluster-1
    - --initial-cluster=infra0=http://172.28.79.11:2380,infra1=http://172.28.79.12:2380,infra2=http://172.28.79.13:2380,infra3=http://172.28.79.14:2380,infra4=http://172.28.79.15:2380
    - --initial-cluster-state=new
    image: k8s.gcr.io/etcd-amd64:3.1.12
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2379
        scheme: HTTP
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
  hostNetwork: true
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data

Kubernetes system components

The config.yaml used by kubeadm init to bootstrap the k8s cluster

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
networking:
  podSubnet: 192.168.0.0/16
api:
  advertiseAddress: 172.28.79.11
etcd:
  endpoints:
  - http://172.28.79.11:2379 
  - http://172.28.79.12:2379
  - http://172.28.79.13:2379
  - http://172.28.79.14:2379
  - http://172.28.79.15:2379

apiServerCertSANs:
  - 172.28.79.11
  - master01.bja.paas
  - 172.28.79.12
  - master02.bja.paas
  - 172.28.79.13
  - master03.bja.paas
  - 172.28.79.10
  - 127.0.0.1
token:
kubernetesVersion: v1.10.0
apiServerExtraArgs:
  endpoint-reconciler-type: lease
  bind-address: 172.28.79.11
  runtime-config: storage.k8s.io/v1alpha1=true
  admission-control: NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota
featureGates:
  CoreDNS: true

kubelet configuration

/etc/systemd/system/kubelet.service.d/10-kubeadm.conf

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki --eviction-hard=memory.available<5%,nodefs.available<5%,imagefs.available<5%"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS

keepalived

keepalived is deployed directly on the physical machines: install it with yum and enable it at boot.

yum install -y keepalived

systemctl enable keepalived

Configure keepalived

Configuration file: "/etc/keepalived/keepalived.conf". The MASTER and BACKUP configurations differ slightly.

MASTER

! Configuration File for keepalived

global_defs {
    notification_email {
      root@localhost
    }
    router_id master02
}

vrrp_script chk_haproxy {
    script "/etc/keepalived/haproxy_check.sh"
    interval 3
    weight -20
}

vrrp_instance VI_1 {
    state MASTER    # use BACKUP on the BACKUP nodes
    interface bond1
    virtual_router_id 151
    priority 110    # use 100 on the BACKUP nodes
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
       172.28.79.10 # VIP used by k8s
       172.28.79.9  # VIP used by the database components
    }
    track_script {
       chk_haproxy
    }
}
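
The relationship between priority 110, priority 100, and weight -20 can be checked with a little arithmetic. The sketch below assumes keepalived's documented behavior for a negative vrrp_script weight: the instance's priority is lowered by that amount while the tracked script reports failure (a non-zero exit status). `effectivePriority` is a hypothetical helper written for this illustration, not part of keepalived.

```go
package main

import "fmt"

// effectivePriority applies a keepalived-style negative script weight:
// the base priority is lowered by |weight| while the tracked script fails.
// Positive weights (added while the script succeeds) are not modeled here.
func effectivePriority(base, weight int, scriptOK bool) int {
	if weight < 0 && !scriptOK {
		return base + weight
	}
	return base
}

func main() {
	// Values taken from the configuration above.
	const masterBase, backupBase, weight = 110, 100, -20

	healthy := effectivePriority(masterBase, weight, true)
	failed := effectivePriority(masterBase, weight, false)

	fmt.Println("MASTER priority, check passing:", healthy)
	fmt.Println("MASTER priority, check failing:", failed)
	fmt.Println("BACKUP takes over on failure:", failed < backupBase)
}
```

With these numbers a failing check would drop the MASTER from 110 to 90, below the BACKUP's 100, so the VIP would move. This adjustment only applies when the check script actually exits non-zero.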

haproxy check script: /etc/keepalived/haproxy_check.sh

#!/bin/bash

if [ `ps -C haproxy --no-header |wc -l` -eq 0 ] ; then
    docker restart k8s-haproxy
    sleep 2
    if [ `ps -C haproxy --no-header |wc -l` -eq 0 ] ; then
        service keepalived stop
    fi
fi

haproxy

haproxy runs as a container, started with the following command:

docker run -d --net host --restart always --name k8s-haproxy -v /etc/haproxy:/usr/local/etc/haproxy:ro hub.xfyun.cn/k8s/haproxy:1.7

haproxy configuration file: /etc/haproxy/haproxy.cfg

global
  daemon
  log 127.0.0.1 local0
  log 127.0.0.1 local1 notice
  maxconn 4096

defaults
  log               global
  retries           3
  maxconn           2000
  timeout connect   5s
  timeout client    50s
  timeout server    50s

frontend k8s
  bind *:6444
  mode tcp
  default_backend k8s-backend

backend k8s-backend
  balance roundrobin
  mode tcp
  server k8s-1 172.28.79.11:6443 check
  server k8s-2 172.28.79.12:6443 check
  server k8s-3 172.28.79.13:6443 check

Post-deployment steps

Modify the kube-proxy ConfigMap

kubectl edit configmap kube-proxy -n kube-system

.....
kubeconfig.conf: |-
  apiVersion: v1
  kind: Config
  clusters:
  - cluster:
      certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      server: https://172.28.79.10:6444  # change the IP on this line to the VIP, 172.28.79.10
    name: default
  contexts:
  - context:
      cluster: default
      namespace: default
      user: default
    name: default
  current-context: default
  users:
  - name: default
    user:
      tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
......

Run the following command to restart the kube-proxy pods:

kubectl get pod -n kube-system | grep kube-proxy | awk '{print $1}' | xargs kubectl delete pod -n kube-system

Modify kubelet.conf on all worker nodes

/etc/kubernetes/kubelet.conf

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRFNE1EVXhPREF4TXpNME1Gb1hEVEk0TURVeE5UQXhNek0wTUZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTGJoCmw1TDRaNHFiWTJ3MmY5TFlEb0ZqVlhhcHRhYklkQmZmTS9zMTJaWFd1NU5LYWlPR09ub3RxK1gwM0VJb3Z4VEkKUGh5NzBqY294VGlLUTk5ZkFsUS82a2Vhc0x5MDNGZXJvYkhmaldUenBkZE5mWVNEZStMazlMV0hIZ0phOXVUQQpDU3kyay9sZGo3VWQ0Sk9pMi9lcGhVTUNNMUNlbmdPeWZDNUl0SUpFZzJmMk95cTE5U0JBeW1zYzFTalg5Q0F6CnNyMlhiTm9hK1lVS2Flek1QSldvYlNxdEg0czQ1TkluYytMREJFTkk4VGVITktybENsamdIeUorUjU1V2pCTW8KeSs3Y1BxL2cwTkxmSU4xRjJVbkFFa3RTSmVYUFBSaGlQUUhJcGRBU0xySXhVcE9HNlN3Yk51bmRGdGsxaUJiUgpUSW9md2UyT0VhZkhySmV5OHJrQ0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFLME1mOFM5VjUyaG9VZ3JYcGQyK09rOHF2Ny8KR3hpWnRFelFISW9RdWRLLzJ2ZGJHbXdnem10V3hFNHRYRWQyUnlXTTZZR1VQSmNpMmszY1Z6QkpSaGcvWFB2UQppRVBpUDk5ZkdiM0kxd0QyanlURWVaZVd1ekdSRDk5ait3bStmcE9wQzB2ZU1LN3hzM1VURjRFOFlhWGcwNmdDCjBXTkFNdTRxQmZaSUlKSEVDVDhLUlB5TEN5Zlgvbm84Q25WTndaM3pCbGZaQmFONGZaOWw0UUdGMVd4dlc0OHkKYmpvRDhqUVJnL1kwYUVUMWMrSEhpWTNmNDF0dG9kMWJoSWR3c1NDNUhhRjJQSVAvZ2dCSnZ2Uzh2V1cwcVRDegpDV2EzcVJ0bVB0MHdtcEZic2RPWmdsWkl6aWduYTdaaDFWMDJVM0VFZ2kwYjNGZWR5OW5MRUZaMGJZbz0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    server: https://172.28.79.10:6444   # change this to the VIP plus haproxy's listening port, 6444
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client.crt
    client-key: /var/lib/kubelet/pki/kubelet-client.key

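Pre-deployment notes

1. Make sure the clocks on all nodes are synchronized

Use the ntpdate command to synchronize time. If there is no time server on the private network, the Alibaba Cloud time servers can be used.

ntpdate ntp1.aliyun.com

2. Make sure IP forwarding is enabled on all nodes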

net.ipv4.ip_forward = 1