In most parallel toolkits used within single-cluster environments, the master process spawns the worker processes using either SSH or LRMS-native interfaces. This makes exchanging contact information (e.g. the listening host and port) between master and workers relatively easy, as the master process is always initialized before the worker processes. With a co-allocated parallel application this becomes an issue, because the master and workers are started independently. In the QosCosGrid stack we solved this problem with the help of an external entity: the QCG-Coordinator service. The service implements two general operations: PutProcessEntry and GetProcessEntry. The master process publishes its contact information using PutProcessEntry, while the worker processes acquire it using GetProcessEntry, which blocks until the information is available. This relaxes the requirement that the co-allocated parts of the application must be started in a particular order.

  • PutProcessEntry(in: key, in: data) - puts contact information data for a given session key,
  • GetProcessEntry(in: key, out: data) - gets contact information data for a given session key.

The unique session key is generated by QCG-Broker and distributed to all application processes. The whole process of exchanging contact information is shown in the figure below.

Example usage of the QCG-Coordinator service
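
The exchange can also be illustrated with a small in-memory model. The sketch below is not the QCG-Coordinator interface itself: the `Coordinator` class, its method names, the `timeout` parameter, and the host and port values are all hypothetical, and threads stand in for the independently started master and worker processes. It only mimics the semantics described above: a put that stores data under a session key and a get that blocks until that data is available.

```python
import threading
import uuid


class Coordinator:
    """Hypothetical in-memory stand-in for a coordinator with blocking lookups."""

    def __init__(self):
        self._entries = {}
        self._cond = threading.Condition()

    def put_process_entry(self, key, data):
        # Store the contact data and wake up every caller waiting on this key.
        with self._cond:
            self._entries[key] = data
            self._cond.notify_all()

    def get_process_entry(self, key, timeout=None):
        # Block until data for `key` is available. The timeout is an addition
        # of this sketch, used so that the demo cannot hang forever.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._entries, timeout):
                raise TimeoutError(f"no entry for session key {key!r}")
            return self._entries[key]


def run_demo(num_workers=3):
    coordinator = Coordinator()
    # Stands in for the unique session key generated by QCG-Broker.
    session_key = str(uuid.uuid4())
    received = []
    lock = threading.Lock()

    def worker():
        contact = coordinator.get_process_entry(session_key, timeout=5.0)
        with lock:
            received.append(contact)

    # Workers are started first on purpose: they block until the master
    # has published its contact information.
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in workers:
        thread.start()

    # The "master" registers its (made-up) listening host and port.
    coordinator.put_process_entry(session_key, ("master.example.org", 5000))

    for thread in workers:
        thread.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

Starting the workers before the master publishes its entry shows why the blocking get removes the ordering requirement: each worker simply waits on the shared session key until the data arrives.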
