The QCG-Accounting Agent
Architecture
Installation
You can install the package using the QosCosGrid yum repository:
yum install qcg-accounting qcg-accounting-logrotate
Configuration
The whole configuration of the QCG-Accounting agent is stored in a single properties file (/etc/qcg/qcg-acc/config.properties). The configuration properties are listed below.
Common
- qcg.site.name - your GOCDB site name,
- qcg.batch.server - hostname where the batch server is running,
- qcg.parser.plugin - the name of the log parser plugin (e.g. pbs). Delete this property if the agent has no access to LRMS logs,
- qcg.publishers.plugins - the comma-separated list of publisher plugins (e.g. bat,apel),
- qcg.debug - if set to true, produces more verbose messages,
- qcg.state.dir - local state directory (default: /var/run/qcg/qcg-acc/),
- qcg.max.delay - maximum random delay before reporting starts; the random delay was introduced to avoid all sites sending their reports at the same time,
- qcg.subprocess.timeout - the general timeout after which child processes started by QCG-Accounting (e.g. log parsers) are killed; a value less than 0 disables the timeout (default: -1),
- qcg.default.vo - the default VO name sent when no FQAN is available, e.g. when the job was submitted using a non-VOMS proxy (default: "vo.plgrid.pl"),
- qcg.db.pass - password of the QCG-Computing database (see the <Database> section of the qcg-compd.xml file).
If your database setup is not standard, you may also need to configure the following properties:
- qcg.db.host - QCG-Computing database host,
- qcg.db.port - QCG-Computing database port,
- qcg.db.name - QCG-Computing database name,
- qcg.db.user - QCG-Computing database user,
- qcg.db.max.days - limit processed job records to those that started within the last N days (default: 90),
- qcg.db.max.records - the maximum number of jobs processed in a single run (default: 5000).
By default the QCG-Accounting agent tries to guess the local hostname automatically. If you want jobs to be reported with a different submit host, set the following property:
- qcg.submit.host=host.second.alias
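As an illustration of the properties listed above, the following sketch writes a minimal configuration to a scratch file and reads one value back. All values (site name, hostname, password) are placeholders, and get_prop is a hypothetical helper for inspecting the file, not part of QCG-Accounting.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a Java-style properties FILE.
# Handles only simple "key=value" lines (no line continuations or escapes).
get_prop() {
    # $1 = properties file, $2 = key
    awk -v key="$2" '
        /^[ \t]*[#!]/ { next }                 # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next                 # not a key=value line
            k = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim the key
            if (k == key) {
                v = substr($0, pos + 1)
                sub(/^[ \t]+/, "", v)          # trim leading blanks of the value
                print v
                exit
            }
        }
    ' "$1"
}

# Write an example configuration to a scratch file (placeholder values only).
example_conf=$(mktemp)
cat > "$example_conf" <<'EOF'
qcg.site.name=EXAMPLE-SITE
qcg.batch.server=batch.example.org
qcg.parser.plugin=pbs
qcg.publishers.plugins=bat,apel
qcg.debug=false
qcg.state.dir=/var/run/qcg/qcg-acc/
qcg.default.vo=vo.plgrid.pl
qcg.db.pass=CHANGE_ME
qcg.db.max.days=90
qcg.db.max.records=5000
EOF

# Show which publisher plugins the example enables.
get_prop "$example_conf" qcg.publishers.plugins
```

On a real installation the same helper could be pointed at /etc/qcg/qcg-acc/config.properties to check a value before running the agent.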
Parser plugins
SLURM plugin - slurm
No configuration needed. The plugin assumes that the sacct command is usable on the qcg machine.
PBS Pro and Torque log parser - pbs
- qcg.pbs.home - the root of the Torque/PBS Pro spool directory (e.g. /var/torque).
- qcg.pbs.max.days - max number of days to look back into the past (default: 7 days).
Publishers plugins
BAT publisher (PL-Grid only) - bat
First, ask the BAT administrator to provide all the credentials (username/password and X.509 certificate) needed to connect to BAT. Copy the received keystore to /etc/qcg/qcg-acc/truststore.ts (make sure that this file is readable only by root).
- qcg.bat.user and qcg.bat.pass - the values provided by the BAT administrator,
- qcg.bat.keystore.pass - the keystore password (provided with the key by the BAT administrator),
- qcg.bat.test - enables test mode (i.e. do not send records to BAT broker) - default: false.
- qcg.bat.grid.only - set this to true if you do not want to report LRMS specific job information.
APEL SSM publisher - apel
- Install APEL SSM2 from the UMD-3 repository:
yum install apel-ssm
- Make sure that /var/spool/apel/outgoing/ exists:
mkdir /var/spool/apel/outgoing/
chmod 0700 /var/spool/apel/outgoing/
- In the GOCDB, add a new endpoint of the gLite-APEL service type for the QCG host, providing its Host DN (needed to be able to publish records to APEL).
- In the GOCDB, add a new endpoint of the APEL service type for the QCG host, providing its Host DN (needed to be monitored by Nagios).
- Update the configuration of APEL SSM2 in the /etc/apel/sender.cfg file:
- Follow the "Configure APEL SSM2 to publish CPU Accounting data" instruction,
- Provide appropriate authentication / authorization data,
- Adjust logging. IMPORTANT: If you want to pipe output of the ssmsend command to the QCG-Accounting log file, ensure that the console output is enabled (console: true).
Then configure the plugin itself:
- qcg.ssm.msg.dir - directory for outgoing usage record messages (default: /var/spool/apel/outgoing/),
- qcg.ssm.benchmark.type - benchmark name: either Si2k or HEPSPEC,
- qcg.ssm.benchmark.value - benchmark value (if the cluster is composed of machines of various types, provide the weighted mean here),
- qcg.ssm.site.name - site name as reported to APEL (optional; default: the value of qcg.site.name),
- qcg.ssm.max.records - the maximum number of records sent in a single message (default: max 1000 records per file),
- qcg.ssm.safe.mode - do not run the APEL publisher if there are some old unsent records (default: true),
- qcg.ssm.send.timeout - the timeout after which the ssmsend program will be killed; if the value is less than 0, the timeout will be disabled; otherwise the value will overwrite the value specified in qcg.subprocess.timeout (default -1).
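The weighted mean mentioned for qcg.ssm.benchmark.value can be computed as in the sketch below. It assumes the mean is weighted by the number of cores of each machine type, which the text above does not state explicitly. The node counts and benchmark figures are made up, and weighted_benchmark is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read "cores benchmark" pairs (one machine type per line)
# from stdin and print the benchmark mean weighted by core count.
weighted_benchmark() {
    # LC_ALL=C keeps the decimal separator a dot regardless of the locale.
    LC_ALL=C awk '
        NF >= 2 { cores += $1; sum += $1 * $2 }
        END     { if (cores > 0) printf "%.2f\n", sum / cores }
    '
}

# Made-up cluster: 100 cores rated 10.0 and 300 cores rated 12.0.
value=$(printf '100 10.0\n300 12.0\n' | weighted_benchmark)

# Print the two properties as they could appear in config.properties.
printf 'qcg.ssm.benchmark.type=HEPSPEC\nqcg.ssm.benchmark.value=%s\n' "$value"
```

For this made-up input the weighted mean is (100 * 10.0 + 300 * 12.0) / 400 = 11.50.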
Known Issues - QCG-Accounting must be installed on a different machine than the glite-APEL and UNICORE SSM publishers; otherwise reports may get overridden.
Grid-SAFE publisher - gridsafe
The Grid-SAFE publisher plugin was developed within the MAPPER project to simplify gathering accounting data from many infrastructures (EGI, PRACE and campus resources). Steps needed to configure the Grid-SAFE plugin:
- you can use the host cert-key pair to authenticate in the Grid-SAFE RUPI service, but first you need to convert it into the PKCS12 format. You must also report your host DN to the Grid-SAFE administrator:
openssl pkcs12 -export -descert -inkey /etc/grid-security/hostkey.pem -in /etc/grid-security/hostcert.pem -out /etc/qcg/qcg-acc/hostcred.p12 -name "HOST Certificate"
chmod 0400 /etc/qcg/qcg-acc/hostcred.p12
- you can use the example configuration:
qcg.gridsafe.url=https://gridsafe-mapper.drg.lrz.de:8443/axis2/services/RUPIService
qcg.gridsafe.keystore=/etc/qcg/qcg-acc/hostcred.p12
qcg.gridsafe.keystore.pass=gridsafepass
qcg.gridsafe.truststore=/etc/qcg/qcg-acc/gridsafe-truststore.jks
qcg.gridsafe.truststore.pass=storepass
qcg.gridsafe.truststore.type=jks
#send usage report only about the following users
qcg.gridsafe.filter.userdn.file=http://gridsafe-mapper.drg.lrz.de/gridsafe/mapper.users
- or configure it manually:
- qcg.gridsafe.url - URL of the Grid-SAFE RUPI WebService (e.g. https://gridsafe-mapper.drg.lrz.de:8443/axis2/services/RUPIService),
- qcg.gridsafe.keystore - path to the keystore file for the RUPI plugin,
- qcg.gridsafe.keystore.pass - password to access the keystore,
- qcg.gridsafe.keystore.type - type of the keystore: pkcs12 or jks (default is pkcs12),
- qcg.gridsafe.truststore - path to the truststore file for the RUPI plugin,
- qcg.gridsafe.truststore.pass - password to access the truststore,
- qcg.gridsafe.truststore.type - type of the truststore: pkcs12 or jks (default is pkcs12).
Filters
- qcg.PUBLISHER.filter.userdn - send usage records only for jobs with the given X.509 DN (PUBLISHER stands for the publisher plugin name, e.g. gridsafe),
- qcg.PUBLISHER.filter.userdn.file - send usage records only for jobs with the X.509 DNs listed in the given file (the file location can be a URL, e.g. http://gridsafe-mapper.drg.lrz.de/gridsafe/mapper.users),
- qcg.PUBLISHER.filter.project - send usage records only for jobs with the given project id (grant),
- qcg.PUBLISHER.filter.project.file - send usage records only for jobs with a project id (grant) listed in the given file.
Troubleshooting
The QCG-Accounting Agent stores all diagnostic information in the following log file: /var/log/qcg/qcg-acc/qcg-accounting.log. You may also set the qcg.debug configuration property to true to get more verbose log messages.
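When scanning the log, it can help to hide routine messages. The sketch below prints every line whose level tag is not INFO. The "[ INFO]" tag format is taken from the sample agent output shown in the migration section; it is only an assumption that other levels use the same bracketed form. show_non_info is a hypothetical helper, and the ERROR line in the demo is made up.

```shell
#!/bin/sh
# Hypothetical helper: print every log line whose level tag is not INFO.
show_non_info() {
    # $1 = log file
    grep -v '^\[ *INFO]' "$1"
}

# Demo on a scratch file; the second line is a made-up example, not real
# agent output.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
[ INFO] - Tue May 21 00:37:29 CEST 2013: new lastReportedID: 26957. Processing took: 0 seconds.
[ERROR] - made-up example line, not real agent output
EOF

show_non_info "$demo_log"
```

On a real installation the helper would be run against /var/log/qcg/qcg-acc/qcg-accounting.log.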
Migration from version 2.X to 3.0
- stop cron temporarily and make sure that no QCG-Accounting process is running:
/etc/init.d/crond stop
ps -AF | grep QCGAcc
- backup /opt/plgrid/var/run/qcg-acc/:
cp -r /opt/plgrid/var/run/qcg-acc/ /opt/plgrid/var/run/qcg-acc.copy
- update qcg-accounting:
yum update qcg-accounting qcg-accounting-logrotate
- copy configuration files and keystores to the new conf dir:
cp /opt/plgrid/qcg/etc/qcg-acc/config.properties.rpmsave /etc/qcg/qcg-acc/config.properties
cp /opt/plgrid/qcg/etc/qcg-acc/keystore.ks.rpmsave /etc/qcg/qcg-acc/keystore.ks
- update any paths in config.properties (i.e. /opt/plgrid/qcg/etc/qcg-acc/ to /etc/qcg/qcg-acc/)
- IMPORTANT copy last reported job ids:
cp /opt/plgrid/var/run/qcg-acc.copy/* /var/run/qcg/qcg-acc/
- run the agent once and check for any errors (you may want to temporarily set qcg.max.delay to 0):
/usr/bin/qcg-accounting
...
[ INFO] - Tue May 21 00:37:29 CEST 2013: new lastReportedID: 26957. Processing took: 0 seconds.
- start cron again:
/etc/init.d/crond start
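The path update step above can be done by hand or scripted. The following is a minimal sketch that rewrites the old configuration prefix to the new one. migrate_paths is a hypothetical helper, and the example.keystore key in the demo is made up for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the old configuration prefix to the new one
# in FILE. The file is rewritten via a temporary copy so that its ownership
# and permissions are preserved.
migrate_paths() {
    # $1 = properties file
    mp_tmp=$(mktemp) || return 1
    sed 's#/opt/plgrid/qcg/etc/qcg-acc/#/etc/qcg/qcg-acc/#g' "$1" > "$mp_tmp" \
        && cat "$mp_tmp" > "$1"
    mp_rc=$?
    rm -f "$mp_tmp"
    return $mp_rc
}

# Demo on a scratch file (example.keystore is a made-up key).
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
qcg.site.name=EXAMPLE-SITE
example.keystore=/opt/plgrid/qcg/etc/qcg-acc/keystore.ks
EOF

migrate_paths "$demo_conf"
cat "$demo_conf"
```

After running it on the real config.properties, a quick grep for /opt/plgrid shows any remaining old paths that still need manual attention.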
License
QCG-Accounting is released under the GPL license. For QosCosGrid licensing details see: QosCosGrid license
FAQ
- Q: I want to republish records for all jobs that started N days ago. What should I do?
- A: You can do this by deleting /var/run/qcg/qcg-acc/PUBLISHER-NAME.last.id and setting qcg.db.max.days to the number of days back that you want to publish records. Also remember to adjust the qcg.pbs.max.days so it is not smaller than qcg.db.max.days.
- Q: I am getting "Plugin apel throwed an exception: /var/spool/apel/outgoing51dd5ec6 message dir not empty. Please rerun ssmsend manually java.lang.IllegalStateException: /var/spool/apel/outgoing51dd5ec6 message dir not empty. Please rerun ssmsend manually" but I had run ssmsend already. What is wrong?
- A: Some messages may be locked. Delete the lock file and run ssmsend again:
rm /var/spool/apel/outgoing/*/*.lck
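The republishing answer above can be sketched as a small script. It assumes that PUBLISHER-NAME in the state file name is the publisher plugin name (e.g. apel) and that properties are written as plain key=value lines without spaces around the equals sign. set_prop and republish are hypothetical helpers, not part of QCG-Accounting.

```shell
#!/bin/sh
# Hypothetical helper: set KEY to VALUE in a properties FILE, replacing an
# existing "KEY=..." line or appending one if the key is absent.
set_prop() {
    # $1 = file, $2 = key, $3 = value
    sp_tmp=$(mktemp) || return 1
    awk -v key="$2" -v val="$3" '
        index($0, key "=") == 1 { print key "=" val; done = 1; next }
        { print }
        END { if (!done) print key "=" val }
    ' "$1" > "$sp_tmp" && cat "$sp_tmp" > "$1"
    sp_rc=$?
    rm -f "$sp_tmp"
    return $sp_rc
}

# Hypothetical helper: forget the last reported job id of PUBLISHER and widen
# the look-back window to DAYS, as described in the FAQ answer.
republish() {
    # $1 = state dir, $2 = properties file, $3 = publisher name, $4 = days
    rm -f "$1/$3.last.id" || return 1
    set_prop "$2" qcg.db.max.days "$4" || return 1
    # With the pbs parser, qcg.pbs.max.days must not be smaller than
    # qcg.db.max.days; an unset value means the documented default of 7.
    rp_parser=$(sed -n 's/^qcg\.parser\.plugin=//p' "$2" | tail -n 1)
    if [ "$rp_parser" = "pbs" ]; then
        rp_pbs=$(sed -n 's/^qcg\.pbs\.max\.days=//p' "$2" | tail -n 1)
        if [ "${rp_pbs:-7}" -lt "$4" ]; then
            set_prop "$2" qcg.pbs.max.days "$4"
        fi
    fi
}

# Example call (not executed here):
#   republish /var/run/qcg/qcg-acc /etc/qcg/qcg-acc/config.properties apel 120
```

Stopping cron first, as in the migration steps, avoids a scheduled run starting while the files are being changed.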
Release Notes
Attachments
- QCG-Accounting.png (362.2 KB), added by mmamonski 11 years ago: QCG-Accounting internal architecture