Migrating QCG services to a new machine

Over the years, the services may have to be migrated to another machine (for example, because of an operating system upgrade). This guide briefly summarises what has to be done to make the migration as smooth as possible:

  • Use the "mount" command to see which network partitions are mounted on the old machine, and mount the same partitions on the new machine.
  • Make sure that you are able to submit batch jobs from the new machine.
  • Install QCG-Computing, QCG-Notification, and QCG-Accounting on the new machine. You do not need to fully configure QCG-Computing at this point.
  • Copy/merge the following configuration files:
    • /etc/qcg/*
    • /opt/qcg/dependencies/etc/pbs_drmaa.conf
    • /etc/grid-security/grid-mapfile.local
    • /etc/xinetd.d/gsiftp
    • /etc/sysconfig/qcg-compd
    • /etc/lcmaps/lcmaps-qcg.db (only if configured with LCMAPS)
  • Copy QCG-Accounting state files:
    # ls -l /var/run/qcg/qcg-acc/
    total 12
    -rw-r--r--. 1 root root 5 Oct 27 17:01 apel.last.id
    -rw-r--r--. 1 root root 5 Oct 27 17:01 bat.last.id
    -rw-r--r--. 1 root root 5 Oct 27 17:01 gridsafe.last.id
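
After copying, it may be worth confirming that all three state files arrived on the new machine and are not empty. The following is a minimal sketch; the function name check_state_files is hypothetical and not part of QCG-Accounting, and the file names are taken from the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of QCG): verify that the QCG-Accounting
# state files shown in the listing above exist and are non-empty.
# Usage: check_state_files [DIRECTORY]
check_state_files() {
    dir="${1:-/var/run/qcg/qcg-acc}"
    missing=0
    for name in apel bat gridsafe; do
        if [ -s "$dir/$name.last.id" ]; then
            echo "OK      $dir/$name.last.id"
        else
            echo "MISSING $dir/$name.last.id"
            missing=1
        fi
    done
    # Exit status 0 only if every file was found and non-empty.
    return "$missing"
}
```

Called without arguments on the new machine, the function checks /var/run/qcg/qcg-acc and returns a non-zero status if any file is absent or empty.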
    
  • Announce downtime in GOCDB
  • Stop services on the old machine
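  • Migrate the qcg-comp database (you do not have to migrate the qcg-ntf database, as it stores more ephemeral information). The -a option makes pg_dump write data only, so the database schema must already exist on the new machine.
    old-host# pg_dump -a -h 127.0.0.1 -U qcg-comp qcg-comp > qcg-comp.dump
    old-host# scp qcg-comp.dump new-host:~/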
    
    new-host# cat qcg-comp.dump | psql -h 127.0.0.1 -U qcg-comp qcg-comp
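
Before loading the dump, one may want to confirm that the file was not truncated or corrupted in transit. This is a minimal sketch that assumes GNU sha256sum is available on both hosts; the function names dump_checksum and verify_dump are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers (not part of QCG or PostgreSQL).

# Print the SHA-256 checksum of a file (hash only, without the file name).
dump_checksum() {
    sha256sum "$1" | cut -d' ' -f1
}

# Succeed only if FILE has the EXPECTED checksum.
# Usage: verify_dump FILE EXPECTED
verify_dump() {
    actual=$(dump_checksum "$1")
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}
```

One possible use is to run dump_checksum qcg-comp.dump on the old host, then run verify_dump ~/qcg-comp.dump with the printed value on the new host before piping the file to psql.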
    
    
  • Start and test services on the new machine
  • Change IP Address/Hostname of the new machine
  • Run over the checklist:
    • Batch system commands:
      # e.g. in Torque/Maui you can run the following commands:
      pbsnodes
      qstat
      echo date | qsub
      showres
      
    • Computing:
      qcg-comp -G | xmllint --format -
      qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
      
    • Notification:
      ps -Af | grep qcg-ntf
      
    • GridFTP:
      telnet localhost 2811
      
    • Security:
      # CA certificates
      ls -la /etc/grid-security/certificates/ | wc -l
      1297
      # fetch CRL
      /etc/init.d/fetch-crl-cron status
      Periodic fetch-crl is enabled.
      
    • Partitions:
      # home partition
      ls -l /home/users/ | wc -l
      439
      # lustre scratch (if applicable)
      ls -l /mnt/lustre/qcg/ | wc -l
      
    • Modules:
      module avail
      
    • Applications:
      # cat /etc/qcg/qcg-comp/application_mapfile | grep bash
      bash    * /opt/exp_soft/plgrid/qcg-app-scripts/app-scripts/bash.qcg
      # ls -l /opt/exp_soft/plgrid/qcg-app-scripts/app-scripts/bash.qcg
      
    • Firewall
    • Accounting:
      tail -n 1000 /var/log/qcg/qcg-acc/qcg-accounting.log
      
    • Nagios - Look into test status
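
The non-interactive parts of the checklist above could be collected into a small script that prints one OK/FAIL line per check. This is only a sketch: run_check and run_checklist are hypothetical names, pgrep -f is used in place of the ps | grep pipeline, and it is assumed (not verified here) that each command exits with a non-zero status when something is wrong. Interactive checks such as telnet, module avail, the firewall, and Nagios are left out.

```shell
#!/bin/sh
# Hypothetical checklist runner (not part of QCG).
failed=0

# run_check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly and reports OK or FAIL based on its exit status.
run_check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK    $label"
    else
        echo "FAIL  $label"
        failed=$((failed + 1))
    fi
}

# Runs a subset of the checks from the checklist above.
run_checklist() {
    failed=0
    run_check "batch: pbsnodes" pbsnodes
    run_check "batch: qstat" qstat
    run_check "computing: qcg-comp -G" qcg-comp -G
    run_check "notification: qcg-ntf process" pgrep -f qcg-ntf
    run_check "security: fetch-crl-cron" /etc/init.d/fetch-crl-cron status
    run_check "partitions: /home/users" test -d /home/users
    run_check "applications: application_mapfile" test -r /etc/qcg/qcg-comp/application_mapfile
    echo "$failed check(s) failed"
    [ "$failed" -eq 0 ]
}
```

Calling run_checklist on the new machine gives a quick overview; any FAIL line still needs to be investigated by running the corresponding command by hand.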