Error check to pdsh function and other works to verify the deploying process #1

Open
wants to merge 5 commits into
base: master
from


xuechendi
Contributor

Four commits here:

  1. Add an error check to the pdsh function so that pdsh exits when stderr is not empty.
  2. Add an OSD device field to the YAML config so users can specify OSD devices there; if none are given, fall back to the original /dev/disk/by-partlabel/osd-device-%s-data devices.
  3. Skip the error check when calling the ceph-osd and ceph-mon commands.
  4. Use scp instead of pdcp to copy ceph.conf to all nodes.

1. Change the pdsh function to call ".communicate()" itself and add an error check there; on error, call sys.exit().
2. Remove the ".communicate()" suffix from all pdsh callers in benchmark/*.py, cluster/ceph.py, and monitoring.py.
3. Add a parameter to the pdsh function to skip the error check for commands whose errors are acceptable, such as "pkill collectl".
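A minimal sketch of the pdsh change described in this commit, written as a standalone helper rather than the actual cbt code. The `force` parameter name comes from the diff under review, and the `pdsh -R ssh -w` argument layout is copied from the deploy log; everything else is an assumption for illustration.

```python
import subprocess
import sys


def pdsh(nodes, command, force=False):
    """Run `command` on `nodes` via pdsh and wait for it to finish.

    Sketch of this PR's approach: .communicate() is called here instead of
    in every caller, and any stderr output is treated as an error unless
    force=True.
    """
    args = ['pdsh', '-R', 'ssh', '-w', nodes, command]
    print('pdsh: %s' % args)
    stdout, stderr = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        close_fds=True).communicate()
    if force:
        # Caller opted out of error checking (e.g. "pkill collectl").
        return [stdout, stderr]
    if stderr:
        # Any stderr output aborts the whole run.
        sys.stderr.write('[ERROR]: %s\n' % stderr.decode(errors='replace'))
        sys.exit(1)
    return [stdout, stderr]
```

A caller that tolerates failure would use `pdsh('root@cceph01', 'pkill collectl', force=True)`. Treating every stderr byte as fatal is the simplest rule, but it also trips on informational messages written to stderr.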

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Add a check in setup_fs: if the user specifies OSD devices in the YAML file,
deploy the OSDs on those devices instead of /dev/disk/by-partlabel/osd-device-%s-data.

Example:
=== runtest.xfs.yaml ===
cluster:
  osds: [cceph01, cceph02]
  cceph01: [/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1, /dev/sdf1, /dev/sdg1,/dev/sdh1]
  cceph02: [/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1, /dev/sdf1, /dev/sdg1,/dev/sdh1]

=== deploy log ===
['/dev/sda1', '/dev/sdb1', '/dev/sdc1', '/dev/sdd1', '/dev/sde1', '/dev/sdf1', '/dev/sdg1', '/dev/sdh1']
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo umount /dev/sda1']
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo rm -rf /tmp/cbt/mnt/osd-device-0-data']
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo mkdir -p -m0755 -- /tmp/cbt/mnt/osd-device-0-data']
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo mkfs.xfs -f -i size=2048 -n size=64k /dev/sda1']
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo mount -o inode64,noatime,logbsize=256k -t xfs /dev/sda1 /tmp/cbt/mnt/osd-device-0-data']
...

If no OSD devices are specified, the original devices, /dev/disk/by-partlabel/osd-device-%s-data, are used as before.
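A rough sketch of the device lookup in setup_fs, assuming the parsed YAML `cluster:` section is a dict that maps each host name to its device list, as in the example above. The helper name `osd_devices_for_host` and the `osds_per_node` parameter are hypothetical, not taken from cbt.

```python
def osd_devices_for_host(cluster_config, host, osds_per_node):
    """Return the block devices to deploy OSDs on for `host`.

    Uses the per-host list from the YAML (e.g. cluster_config['cceph01'])
    when present; otherwise falls back to the partition-label convention.
    """
    devices = cluster_config.get(host)
    if devices:
        return list(devices)
    return ['/dev/disk/by-partlabel/osd-device-%s-data' % i
            for i in range(osds_per_node)]


cluster = {
    'osds': ['cceph01', 'cceph02'],
    'cceph01': ['/dev/sda1', '/dev/sdb1'],
}
print(osd_devices_for_host(cluster, 'cceph01', 2))
# ['/dev/sda1', '/dev/sdb1']
print(osd_devices_for_host(cluster, 'cceph02', 2))
# ['/dev/disk/by-partlabel/osd-device-0-data', '/dev/disk/by-partlabel/osd-device-1-data']
```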

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
After adding the error check, the ceph-osd and ceph-mon commands abort the deployment.

Example:
pdsh: ['pdsh', '-R', 'ssh', '-w', 'root@cceph01', 'sudo ceph -c /tmp/cbt/None/ceph.conf -i \
/tmp/cbt/mnt/osd-device-0-data/keyring auth add osd.0 osd "allow *" mon "allow profile osd"']
[ERROR]:cceph01: added key for osd.0
pdsh receives "added key for osd.0" on stderr, even though it is an informational message that should go to stdout.

Until this is fixed in ceph-osd, skip the error check for these commands.

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Currently the pdcp function is used to send ceph.conf to tmp_dir (/tmp/cbt/None).
It seems better to use scp here, copying ceph.conf from the head node
to all OSD/MON/MDS nodes.
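A minimal sketch of the scp approach, assuming the head node can reach every node as root (as in the deploy log) and that the destination is the tmp_dir shown above. The function names `scp_conf_commands` and `push_conf` are hypothetical. Nodes are de-duplicated because a host can appear in the OSD, MON, and MDS lists at once.

```python
import subprocess


def scp_conf_commands(conf_path, nodes, tmp_dir, user='root'):
    """Build one scp command per unique node that copies conf_path into tmp_dir."""
    # dict.fromkeys drops duplicate hosts while keeping their first-seen order.
    return [['scp', conf_path, '%s@%s:%s/' % (user, node, tmp_dir)]
            for node in dict.fromkeys(nodes)]


def push_conf(conf_path, nodes, tmp_dir):
    """Copy ceph.conf to every node, raising CalledProcessError on failure."""
    for cmd in scp_conf_commands(conf_path, nodes, tmp_dir):
        subprocess.check_call(cmd)
```

Building the command lists separately from running them makes it easy to log or inspect exactly what will be copied where.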

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Currently, when deploying OSDs to the second node, the mount directory numbering starts from 0 again.
This change uses the osd_id as the mount directory number instead.
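The numbering fix can be illustrated with a small sketch: a single osd_id counter runs across all nodes, so each node's mount directories continue where the previous node stopped. The mount-point template follows the deploy log; the function name `mount_plan` is hypothetical.

```python
def mount_plan(osd_hosts, devices_by_host, base='/tmp/cbt/mnt'):
    """Map every (host, device) pair to a mount dir numbered by a global osd_id."""
    plan = []
    osd_id = 0
    for host in osd_hosts:
        for device in devices_by_host[host]:
            plan.append((host, device, '%s/osd-device-%d-data' % (base, osd_id)))
            osd_id += 1  # keeps counting across hosts instead of resetting
    return plan


devices = {'cceph01': ['/dev/sda1', '/dev/sdb1'],
           'cceph02': ['/dev/sda1', '/dev/sdb1']}
for entry in mount_plan(['cceph01', 'cceph02'], devices):
    print(entry)
# ('cceph01', '/dev/sda1', '/tmp/cbt/mnt/osd-device-0-data')
# ('cceph01', '/dev/sdb1', '/tmp/cbt/mnt/osd-device-1-data')
# ('cceph02', '/dev/sda1', '/tmp/cbt/mnt/osd-device-2-data')
# ('cceph02', '/dev/sdb1', '/tmp/cbt/mnt/osd-device-3-data')
```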

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
stdout, stderr = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True).communicate()
if force:
    return [stdout, stderr]
if stderr:
Contributor

Having something in stderr doesn't mean that the command failed.

@bengland2
Contributor

That's why I checked Popen.returncode in my pull request.
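For comparison, a minimal sketch of a returncode-based check, using Python 3's `subprocess.run`; it is not the code from the reviewer's pull request, and the name `run_checked` is hypothetical. Informational stderr output such as "added key for osd.0" no longer aborts the run, because only a nonzero exit status counts as failure. When the wrapped command is pdsh, its `-S` option (return the largest remote return value) is likely needed for the exit status to reflect remote failures.

```python
import subprocess
import sys


def run_checked(args, force=False):
    """Run args and fail only on a nonzero exit status, not on stderr output."""
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          close_fds=True)
    if proc.returncode != 0 and not force:
        sys.stderr.write('[ERROR] %s exited with %d: %s\n' % (
            args, proc.returncode, proc.stderr.decode(errors='replace')))
        sys.exit(proc.returncode)
    # stderr is still returned so callers can log it if they want.
    return [proc.stdout, proc.stderr]
```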

@bengland2
Contributor

I do like the idea of specifying block devices in YAML instead of relying on the partition-name convention, though! Can we separate out that part and get it merged?

@bengland2
Contributor

You have a point about using scp in at least one case: rpdcp did not pull results back to the test driver for me. But I think pdcp can push files out from the test driver to other systems just fine. We could run scp in parallel to copy per-host results into separate subdirectories on the test driver host.
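A minimal sketch of that idea, assuming a hypothetical remote results directory (for example /tmp/cbt/results) and a local layout with one subdirectory per host under the test driver's archive directory. It runs one `scp -r` per host through `concurrent.futures.ThreadPoolExecutor`; the `run` parameter is a hypothetical injection point so the sketch can be exercised without real hosts.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pull_results(hosts, remote_dir, local_dir, user='root', run=subprocess.call):
    """Copy remote_dir from each host into local_dir/<host>/ in parallel.

    Returns a dict mapping each host to the exit status reported by `run`.
    """
    def pull(host):
        dest = os.path.join(local_dir, host)
        os.makedirs(dest, exist_ok=True)  # per-host subdirectory on the driver
        cmd = ['scp', '-r', '%s@%s:%s' % (user, host, remote_dir), dest]
        return host, run(cmd)

    # One worker per host; max(1, ...) keeps an empty host list valid.
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(pool.map(pull, hosts))
```

Threads are sufficient here because each worker spends its time waiting on an external scp process rather than running Python code.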

@bengland2
Contributor

I think the error-checking part of this pull request is now fixed, or at least mostly fixed; more error checking will be added over time. See merged PRs #110 and #107.


4 participants