Fixed Bugs for IBM Platform LSF Version 9.1.3

Release Date: July 31 2014
The following bugs have been fixed in LSF Version 9.1.3 between 8 October 2013 and 21 July
2014:
223287 Date 2013-12-06
Description The preemption calculation was refined for shared resources to improve the
preemption performance and throughput of the whole cluster.
Component mbschd, schmod_preemption.so
Platform All
Impact Throughput of the cluster is diminished when it takes a long time to get small jobs
preempted.
223587 Date 2013-10-16
Description The parameter MAX_EVENT_STREAM_SIZE cannot limit the size of the lsb.status
file.
After this fix, either the oldest lsb.stream.timestamp file or the oldest
lsb.status.timestamp file will be deleted and then a new file will be written when
the number for MAX_EVENT_STREAM_FILE_NUMBER is reached.
Component mbatchd, liblsbstream.so
Platform All
Impact The lsb.status file grows to a very large size and causes a storage space issue.
When the storage system does not work well, the LSF cluster may become
unavailable.
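The rotation behavior described for 223587 can be pictured with a small sketch. This is an illustration only, not LSF's implementation: the helper name `rotate_stream_files`, the flat directory layout, and the reading of the numeric file suffix as a timestamp are all assumptions.

```python
import os

def rotate_stream_files(directory, prefix, max_files):
    """Delete the oldest '<prefix>.<timestamp>' files until fewer than max_files remain.

    Hypothetical helper: the numeric suffix is assumed to be a timestamp.
    Returns the names of the deleted files, oldest first.
    """
    candidates = []
    for name in os.listdir(directory):
        head, _, suffix = name.rpartition(".")
        if head == prefix and suffix.isdigit():
            candidates.append((int(suffix), name))
    candidates.sort()  # oldest timestamp first

    deleted = []
    # Make room for the new file that is about to be written.
    while candidates and len(candidates) >= max_files:
        _, oldest = candidates.pop(0)
        os.remove(os.path.join(directory, oldest))
        deleted.append(oldest)
    return deleted
```

Calling the helper once for `lsb.stream` and once for `lsb.status` before writing a new file keeps each family of files under the configured count.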
223589 Date 2013-10-09
Description Duplicate data is logged in the lsb.status file, which causes a PK violation in PA.
Component mbatchd
Platform All
Impact The cluster administrator sees many PK violations on the Platform Analytics side.
223671 Date 2013-11-01
Description When running a large number of short jobs (for example, sleep 3) on a Windows host,
some jobs show as exited even though they ran successfully.
Component res.exe
Platform Windows
Impact All jobs run successfully, but LSF reports that some small jobs have exited.
223799 Date 2014-05-04
Description To allow for enabling and disabling the sourcing of LSB_SUB_PARAM_FILE without
causing an error, the internal parameter
LSB_SUB_ADDITIONAL_REMOVAL_FROM_PARAMFILE (in lsf.conf) is used. When
LSB_SUB_ADDITIONAL_REMOVAL_FROM_PARAMFILE is set as Y or y,
LSB_SUB_ADDITIONAL should be removed from LSB_SUB_PARAM_FILE and
exported as an environment variable. By default, LSB_SUB_ADDITIONAL is not
removed and the setting of LSB_SUB_ADDITIONAL_REMOVAL_FROM_PARAMFILE is
incorrectly regarded as N or n.
Component bsub, mesub
Platform All
Impact A script using LSB_SUB_ADDITIONAL no longer works in LSF 9.1.1.
224016 Date 2013-10-20
Description POE will core dump if a user specifies LSB_PJL_TASK_GEOMETRY in the submission
script for a job and the number of task groups in the task geometry is not equal to the
number of execution hosts allocated by LSF.
Component permapi.so
Platform Linux
Impact POE jobs run inside LSF core dump.
224323 Date 2013-10-31
Description When using bsub -M to submit an exclusive job with a memory limit so large that it
exceeds the execution host’s maximum physical memory, an out of memory (OOM)
condition occurs.
Component sbatchd
Platform Linux/UNIX
Impact A host becomes unavailable when the job exceeds the host’s memory.
224677 Date 2013-10-14
Description The XDR buffer used by the Master LIM to send resource information to the remote
cluster is set to a fixed size. If the number of resources on a cluster is increased so
that the resource information size is larger than the buffer, connections between the
member clusters will break.
In this fix, the buffer size is calculated dynamically based on resource configuration in
a cluster.
Component lim
Platform All
Impact Platform MultiCluster fails if too many resources are defined in lsf.shared.
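The idea behind the 224677 fix, sizing the buffer from the configured resources instead of using a constant, can be sketched as below. The message layout (a fixed header, a 4-byte count, then one XDR string per resource name) and the function names are hypothetical; only the XDR string rule (a 4-byte length followed by data padded to a multiple of 4 bytes) comes from the XDR standard.

```python
def xdr_string_size(text):
    """Bytes needed to XDR-encode a string: 4-byte length plus data padded to a multiple of 4."""
    data_len = len(text.encode("utf-8"))
    return 4 + (data_len + 3) // 4 * 4

def resource_buffer_size(resource_names, header_bytes=64):
    """Size a send buffer from the resources actually configured.

    Hypothetical layout: header, 4-byte resource count, then one XDR string per name.
    """
    return header_bytes + 4 + sum(xdr_string_size(name) for name in resource_names)
```

Because the size is derived from the configuration, adding resources grows the buffer rather than overflowing a fixed limit.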
224701 Date 2013-10-29
Description Jobs occasionally get stuck with the pending reason New job is waiting for
scheduling. This may occur when trying to brun a job when the job is already
scheduled.
Component mbschd
Platform All
Impact Some jobs are stuck in a pending state until administrators notice or users
complain to administrators.
224928 Date 2013-10-31
Description When cgroup accounting features are enabled (set by LSB_MEMLIMIT_ENFORCE=y,
LSF_PROCESS_TRACKING=Y, or LSF_LINUX_CGROUP_ACCT=Y in lsf.conf), the
job may be terminated by a memory limit (set by bsub -M or MEMLIMIT in
lsb.queues).
The memory usage accounting incorrectly shows that the job exceeds the memory
limit, so the job is terminated.
Component sbatchd
Platform Linux
Impact The job is terminated by MEMLIMIT unexpectedly when cgroup memory accounting is
enabled.
225236 Date 2013-10-28
Description When no decay value is defined for the queue, a job submitted with "decay=0" will get
rejected, even though this is a valid value. For example:
bsub -R "rusage[mem=300:decay=0]"
Component mbatchd
Platform All
Impact A job with a valid rusage string is rejected.
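A common way for a bug like 225236 to arise is treating an explicit 0 the same as "not specified". The sketch below is a hypothetical parser, not LSF code: the function names, the simple `key=value` handling, and the assumption that decay accepts only 0 or 1 are for illustration.

```python
import re

def parse_rusage(res_req):
    """Parse a string such as 'rusage[mem=300:decay=0]' into a dict of strings.

    Hypothetical sketch: handles only simple key=value pairs separated by ':'.
    """
    match = re.search(r"rusage\[([^\]]*)\]", res_req)
    if match is None:
        raise ValueError("no rusage section found")
    fields = {}
    for pair in match.group(1).split(":"):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed rusage field: {pair!r}")
        fields[key.strip()] = value.strip()
    return fields

def decay_value(fields, queue_default=None):
    """Return the decay setting, treating an explicit 0 as a valid value."""
    if "decay" not in fields:  # test for absence, not for falsiness
        return queue_default
    decay = int(fields["decay"])
    if decay not in (0, 1):  # assumed valid range for this sketch
        raise ValueError("decay must be 0 or 1")
    return decay
```

With this structure, `rusage[mem=300:decay=0]` yields a decay of 0 even when the queue defines no default.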
225401 Date 2013-12-20
Description mbatchd restarts, replays the events file, and core dumps if you use bsub to submit a
new job and the job's resource requirement string (such as “order[ ]”) is longer
than 511 characters.
Component mbatchd
Platform All
Impact mbatchd core dumps and the LSF batch system is unavailable.
226154 Date 2013-11-20
Description If an interactive job is submitted that tries to pass the job's input data to bsub using a
pipe ( | ), bsub will not forward the input data to the job correctly.
For example, if the command used is echo "exit" | bsub -Ip tcsh -s, the
"exit" string will not be passed to tcsh -s and the job will not exit.
Component bsub
Platform Linux/Solaris
Impact Some interactive jobs do not run.
226391 Date 2013-11-15
Description The LSF/ClearCase integration utility daemons.wrap does not show the actual
ClearCase view name (ccview) in the error log.
Component daemons.wrap
Platform All
Impact Error messages in the log are not sufficient to diagnose problems.
226849 Date 2013-12-24
Description mbatchd hangs when there are duplicate job events with the same jobID in the
lsb.events file. The duplicate events are caused by failover, for example, when
the network is operating erratically.
Component mbatchd
Platform All
Impact mbatchd does not respond and the LSF batch system is unavailable.
227047 Date 2013-12-20
Description When the nofile limit is set to unlimited, LSF daemons take a long time to start and the
cluster behaves abnormally.
With this fix applied, when the nofile limit is set higher than 65535, an INFO level
message is logged for lim, res, mbatchd, and sbatchd stating: The nofile limit is
in excess of 65535. This may cause performance issues with your
cluster.
Component lim, res, mbatchd, sbatchd
Platform All
Impact Cluster becomes unusable.
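An administrator who wants to check a host for the condition in 227047 before starting the daemons could use something like the following sketch. It is not part of LSF: the function name `nofile_warning` and the choice to inspect the soft limit are assumptions, and the `resource` module is available on Linux and UNIX only.

```python
import resource

NOFILE_WARN_THRESHOLD = 65535

def nofile_warning(soft_limit):
    """Return a warning string if the soft nofile limit exceeds the threshold, else None."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit > NOFILE_WARN_THRESHOLD:
        return ("The nofile limit is in excess of 65535. "
                "This may cause performance issues with your cluster.")
    return None

if __name__ == "__main__":
    # Read the limit of the current process (inherited from the calling shell).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    message = nofile_warning(soft)
    print(message or f"nofile soft limit {soft} is within the threshold")
```

Run it from the same shell or init environment that starts the LSF daemons, since the limit is inherited from the parent process.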
227222 Date 2013-12-22
Description After a socket error occurs on Windows hosts, subsequent jobs cannot run and the
following error message appears in the child sbatchd log:
sbdChild: starting mode=-s handles=664:600
rcvJobFile: chanRead_() failed. A socket operation has failed:
Socket operation on non-socket.
execJob: Job <4949999> <494999>failed in rcvjobfile_(), Unknown
error.
Component sbatchd.exe
Platform Windows
Impact All subsequent jobs remain pending on the host and sbatchd must be restarted to
process the jobs.
227274 Date 2013-12-13
Description For jobs submitted using bsub -n, the bstop command does not work as designed.
After using bstop, the bjobs output shows the job in SUSPEND status. However, the
job is still running if checked by a system command.
Component sbatchd
Platform All
Impact A job is not stopped with bstop, even though bjobs shows the job is in SUSPEND
status.
227413 Date 2013-12-03
Description An application level checkpoint array element in a job array remains pending after a
brestart command.
Component mbatchd
Platform All
Impact Restarting a checkpoint array job fails and the job remains pending.
227571 Date 2013-12-05
Description If JOB_DEP_LAST_SUB (in lsb.params) is set to 0, there are warning messages
indicating that the parameter will be ignored. JOB_DEP_LAST_SUB is set to 1 by
default.
Component mbatchd
Platform All
Impact JOB_DEP_LAST_SUB in lsb.params cannot be disabled.
228315 Date 2013-12-27
Description When QJOB_LIMIT is defined in lsb.queues, jobs submitted to the queue
occasionally cannot be dispatched even if there are available slots.
Component mbschd, schmod_limit.so
Platform All
Impact Some jobs in a queue cannot be scheduled even if there are slots available.
228346 Date 2013-12-30
Description When LSB_MIXED_PATH_ENABLE (in lsf.conf) is set to Y and bsub is used with a
long command name, the job may exit with the incorrect status.
Component sbatchd
Platform All
Impact Some jobs using long command names may exit with the incorrect status.
228349 Date 2013-12-20
Description During job cleanup, the effective user ID for sbatchd is changed to root. When the
user’s home directory is mounted from the NFS server and configured without root
privilege, the job post script will fail.
Component sbatchd
Platform Linux
Impact .lsbatch/* files cannot be cleaned up.
228399 Date 2014-01-28
Description PREEMPT_JOBTYPE=BACKFILL does not work.
When configuring a low priority queue with a job to preempt a backfill job, LSF does
not allow the job (which may not have a run limit) to preempt the backfill job because
it may delay the start of the job in the high priority queue.
Component mbatchd
Platform All
Impact An administrator cannot set up a queue that would preempt a backfill queue under
certain conditions.
228501 Date 2014-01-03
Description Windows hosts cannot execute esub.bat.
Component bsub.exe
Platform Windows
Impact Cannot use esub.bat on Windows hosts.
228596 Date 2014-01-27
Description Messages for lsb.acct file rotation and deletion are not logged.
Component mbatchd
Platform All
Impact Users receive no informational message about lsb.acct file rotation.
228863 Date 2014-01-27
Description In MultiCluster environments, bjobs -pac -l jobid does not show the
PREDICTEDREMAINTIME column.
Component mbatchd, bjobs
Platform All
Impact The PAC MultiCluster feature is missing the PREDICTEDREMAINTIME value in the
GUI.
228865 Date 2014-01-22
Description Frequently using bhosts to check the host status causes the child mbatchd
to core dump.
Component mbatchd
Platform All
Impact Child mbatchd core dumps and does not respond to b* query commands.
228919 Date 2014-03-27
Description Remote jobID information is unavailable in bjobs output. Therefore, Platform
Application Center is unable to show the remote jobID of a job in a remote cluster.
Component bjobs
Platform All
Impact Platform Application Center cannot show the remote jobID from bjobs output.
229142 Date 2014-01-22
Description Due to a script error, elim.gpfshost cannot be started. Therefore, local host GPFS
information (gtotalin and gtotalout) cannot be collected and reported.
Component elim.gpfshost
Platform Linux
Impact elim.gpfshost exits when loading host GPFS information.
229370 Date 2014-03-03
Description When starting dynamic hosts on a particular subnet, they should join the cluster
automatically and start taking jobs. LSF adds them to the correct host group but they
never leave closed_LIM status in bhosts, even if all LSF daemons on them are
restarted.
Component lim
Platform All
Impact Dynamic batch execution hosts cannot be added to a cluster if the size of the
slave lim configuration package that needs to be sent to the master lim is bigger than
the MTU.
229403 Date 2014-03-18
Description No error is recorded in the lim log when lim does not have permission to open
lsf.conf or lsf.cluster.
Component lim
Platform All
Impact lim error message is confusing or absent, making it difficult to debug a problem.
229567 Date 2014-02-24
Description A job that is not configured with application level rerun, but is in a rerun queue (that is,
a "rerunnable" job) will no longer be rerunnable after an Administrator runs badmin
reconfig.
Component mbatchd
Platform All
Impact Rerunnable jobs are no longer rerunnable after an Administrator runs badmin
reconfig.
229638 Date 2014-03-05
Description When issuing the commands lsadmin resrestart/shutdown or lsadmin
resdebug -o, esub is called unnecessarily.
Component lsadmin
Platform All
Impact Running lsadmin resrestart/shutdown or lsadmin resdebug -o takes a
longer time than necessary.
229721 Date 2014-02-14
Description LSF does not remove the file hostAffinityFile after the job is finished.
Component sbatchd
Platform Linux
Impact The hostAffinityFile temp file must be deleted manually.
229819 Date 2014-03-10
Description When a Platform MPI job is running across nodes, the run limit is not handled
correctly with signal SIGUSR2.
Component blaunch
Platform Linux
Impact Platform MPI jobs are killed prematurely instead of shutting down gracefully.
229891 Date 2014-02-20
Description In a MultiCluster lease mode environment, submitting a job from an LSF 8.0.1 host to
an LSF 9.1.2 host causes an mbatchd core dump on LSF 9.1.2.
Component mbatchd
Platform All
Impact Jobs cannot be submitted from LSF 8.0.1 to LSF 9.1.2 in MultiCluster lease mode.
229909 Date 2014-03-04
Description Occasionally, the mbschd log contains the error Cannot connect to the
mbatchd:, even when all processes are functioning correctly.
Component mbatchd
Platform All
Impact Error message gives the impression that the cluster is not functioning well.
229920 Date 2014-02-27
Description Value of the PER_PROJECT parameter (in lsb.resources) is limited to 59
characters.
Component mbatchd
Platform All
Impact Value of PER_PROJECT is limited to 59 characters.
229971 Date 2014-03-13
Description mbatchd generates some duplicate records in the lsb.status file.
Component mbatchd
Platform All
Impact The lsb.status file grows very quickly and impacts the PA reports due to a PK
violation.
230012 Date 2014-02-20
Description When a job is run with both bkill and brequeue at the same time, a memory error
occurs causing an mbatchd core dump.
Component mbatchd
Platform All
Impact mbatchd core dumps and LSF batch system is unavailable.
230060 Date 2014-03-10
Description LSF randomly calculates an incorrect memory usage for MPI jobs.
Component sbatchd
Platform Linux
Impact The correct memory usage for MPI jobs is not available.
230096 Date 2014-02-26
Description The value of PREDICTEDSTARTTIME in the output of bjobs -l -pac jobid
contains a redundant <>.
Component bjobs
Platform Linux
Impact Pending jobs cannot be viewed in PAC.
230113 Date 2014-03-11
Description Scheduler performance suffers after using bmod to modify a job group with a limit
when there are many jobs in the group.
Component mbschd, schmod_limit.so
Platform All
Impact Jobs cannot be dispatched.
230305 Date 2014-03-10
Description Performance of scheduler with resource reservation suffers when there are many
pending jobs.
Component mbschd, schmod_reserve.so, schmod_parallel.so
Platform All
Impact Scheduler performance degradation.
230389 Date 2014-03-14
Description If there are multiple *.swtag files under $LSF_TOP/properties/version, the
multiple FUSERGRP and FXUSR entries cause patchinstall to fail.
Component patchinstall
Platform Linux/UNIX
Impact Patch cannot be installed by patchinstall.
230428 Date 2014-03-03
Description Large memory leak with pim on Mac OS X 10.7.
Component pim
Platform Mac OS X
Impact Jobs will pend since memory is not available.
230570 Date 2014-03-17
Description When using the bswitch command to switch a running job to another queue, the
JOBS limit defined in lsb.resources is bypassed.
Component mbatchd
Platform Linux2.6-glibc2.3-x86_64
Impact A running job in a low priority queue can be switched to a high priority queue and
exceed the JOBS limit defined on the high priority queue.
230578 Date 2014-03-13
Description lsadmin ckconfig or reconfig does not show an error message when Begin
ClusterAdmins is missing in the lsf.cluster file.
Component lim, lsadmin
Platform All
Impact The cluster is unavailable after a restart and an error message indicating the problem
is not available.
230580 Date 2014-04-11
Description Problems when using lsmake to build customized Android 4.3 code:
(1) Building native Android 4.3 using lsmake is very slow.
(2) Cannot build HTC customized Android code on two hosts using lsmake.
(3) Slots are occupied by lsmake, leading to low slot usage.
The following options have been added to the command lsmake to accommodate
this feature:
--max-cross-host-level <number>
If lsmake enters <number> level, do not distribute the task to other hosts. The
default value is a large integer.
--no-block-shell-mode
Perform "shell" tasks without blocking mode. Without this parameter, blocking
mode is used.
Component lsmake, lsmakerm
Platform All
Impact lsmake cannot be used and performs worse than gmake when building Android 4.3
code.
230780 Date 2014-03-19
Description When using the badmin reconfig command, the fatal error: Master host
<host_name> is not defined in the Host section of the lsb.hosts
file is reported if a short host name is defined in LSF_MASTER_LIST (in
lsf.conf) and a long host name is defined in /etc/hosts and
/etc/sysconfig/network.
Component mbatchd
Platform All
Impact badmin reconfig quits with a fatal message.
230963 Date 2014-05-17
Description When the DNS is responding slowly, mbatchd responds to b* commands slowly as
well, because it attempts to resolve host group names.
Component mbatchd
Platform All
Impact mbatchd responds to b* commands slowly.
230979 Date 2014-05-24
Description The startup script hostsetup does not work on hosts with Mac OS X 10.8 and higher
and LSF daemons do not start up after reboot.
Component hostsetup
Platform Mac OS X
Impact LSF has to be restarted manually after every reboot, even though hostsetup has
been run to configure the LSF startup script.
230985 Date 2014-03-11
Description Job information should be exported as environment variables to eexec, so that
eexec does not have to issue bjobs or bhist to get job information.
Component sbatchd
Platform All
Impact Poor job submission and execution performance with GOLD integration.
231078 Date 2014-03-13
Description mbatchd generates some duplicate records in the lsb.status file.
Component mbatchd
Platform All
Impact The lsb.status file grows very quickly and impacts the PA report feature due to a PK
violation.
231135 Date 2014-03-19
Description Using the bsub -cwd option or setting DEFAULT_JOB_CWD in lsb.params can
unintentionally overwrite the value of the LS_SUBCWD environment variable.
Component sbatchd
Platform All
Impact The original submission directory is lost in the execution environment.
231143 Date 2014-03-21
Description A child mbatchd core dump occurs when ENABLE_EVENT_STREAM is enabled,
CONDENSE_PENDING_REASONS is set to ALL, and the system fails to write to the
pendingreasons.<cluster_name> file for any reason (for example, there is no
space left for the directory set by PENDING_REASONS_TMP_DIR or the file has been
deleted).
Component mbatchd
Platform All
Impact Child mbatchd core dump occurs but there is no observable impact to the LSF cluster.
231286 Date 2014-03-25
Description The mbatchd log includes the message RB_rusageUpdate() Scheduler
doesn't scheduled job xx@xxxx on host xxx, but SBD reports job
usage info from that host when resized jobs are run and hosts are released
on completion.
Component mbatchd
Platform All
Impact Confusing warning message.
231357 Date 2014-03-31
Description A job is put into a pending state with the pending reason Job has a specified
start time if both the -b and -R requirements have not been met. However, the
job's pending reason will be changed if there is another job pending due to the same
-R option.
Component mbschd
Platform All
Impact The job pending reason is incorrect.
231589 Date 2014-04-20
Description An mbatchd core dump occurs when upgrading LSF from version 9.1.1 to 9.1.2, if
sbatchd is restarted first, then mbatchd is restarted.
Component mbatchd, sbatchd
Platform All
Impact mbatchd core dump occurs and LSF batch system is unavailable.
231611 Date 2014-04-02
Description The operator || does not work for shared resources in rusage with parallel jobs even
if there are free resources available for a sibling resource requirement.
Component mbschd, schmod_parallel.so
Platform All
Impact Parallel jobs with a sibling resource requirement cannot be dispatched even if there
are enough resources in the cluster.
232162 Date 2014-04-04
Description A job submitted with a file limit is killed if the information is sent to the res log but it
has already reached the file limit.
Component res
Platform All
Impact Jobs submitted to LSF cluster are killed incorrectly.
232240 Date 2014-04-04
Description When LSB_KRB_TGT_FWD=Y and LSB_AFS_JOB_SUPPORT=Y are configured in
lsf.conf, res leaks file descriptors (FDs) until eventually no FDs are
available.
Component res
Platform Linux/UNIX
Impact LSF jobs fail to renew AFS tokens.
232315 Date 2014-04-14
Description The error System error 109 has occurred is returned when using the net
stop command on Windows to stop res or sbatchd.
Component res, lim
Platform Windows
Impact Scripts cannot be used to manage daemons.
232334 Date 2014-04-13
Description An sbatchd core dump occurs when AIX is upgraded to the TL and service pack
710003-01-1341 and jobs are submitted.
Component sbatchd
Platform AIX
Impact sbatchd core dumps and LSF reports the job is exited.
232736 Date 2014-04-30
Description Using the LSF API to read a streaming file causes a memory leak.
Component liblsbstream.so
Platform All
Impact A memory leak causes the application’s memory usage to grow exponentially.
232824 Date 2014-04-22
Description A resource requirement is not changed by bmod after a checkpoint job is restarted.
Component sbatchd
Platform All
Impact Checkpoint jobs restart with a wrong RES_REQ.
233074 Date 2014-04-20
Description In a mixed MultiCluster forward environment with MultiCluster lease mode, mbatchd
core dumps when it restarts.
Component mbatchd
Platform All
Impact mbatchd core dumps and LSF batch system is not available.
233234 Date 2014-04-30
Description When DJOB_ENV_SCRIPT (in lsb.queues) is configured and the
openmpi_rankfile.sh script is used, the file created by the script cannot be
accessed. The openmpi_rankfile.sh script is missing an environment.
Component res, openmpi_rankfile.sh
Platform Linux/UNIX
Impact The blaunch/openMPI integration is not complete for the CPU binding.
233424 Date 2014-04-22
Description Before a job is finished, bpeek fails to display the job output and generates an
unclear error message: ls_rstat: File operation failed: No such file
or directory.
Component bpeek
Platform All
Impact An unclear error message is generated by bpeek.
233455 Date 2014-04-29
Description elim.mic contains a hard coded library path to libmicmgmt.so. If
libmicmgmt.so is not installed in the default directory, elim.mic exits.
Component elim.mic
Platform Linux
Impact elim.mic exits and no GUP information is collected.
233534 Date 2014-07-03
Description If sbatchd is not responding or is unavailable when mbatchd attempts to send a
modification to it, sbatchd will never receive the modification information.
Component mbschd, sbatchd
Platform All
Impact Changing a running job’s run time limit with bmod does not take effect when sbatchd
is unavailable.
233565 Date 2014-04-24
Description Loading the Kerberos library fails and causes sbatchd to core dump when running
jobs if LSB_KRB_TGT_FWD=Y is set but LSB_KRB_TGT_DIR is not configured in
lsf.conf.
Component sbatchd
Platform Linux/UNIX
Impact LSF jobs fail to renew AFS tokens.
233610 Date 2014-04-25
Description After migration to a new host, a job array element remains in a pending state.
Component mbatchd
Platform All
Impact A long-running checkpoint job fails to start in LSF after migration to a new host.
234252 Date 2014-05-08
Description An mbatchd core dump occurs, caused by a memory overflow, when using bsub -f
with %J and the file path is longer than 256 characters.
Component mbatchd
Platform All
Impact mbatchd core dumps and LSF batch system is not available.
234588 Date 2014-05-11
Description When running patchinstall outside the installation directory, patchinstall fails
with the error Error: Unable to access jar file lap/LAPApp.jar.
Component patchinstall
Platform Linux/UNIX
Impact patchinstall fails to install patches when it is issued outside the installation
directory.
234858 Date 2014-05-23
Description LSF does not recognize a user group and gives a warning message when running
badmin reconfig if a user group cannot be searched by LDAP due to an LDAP
size limitation.
Component mbatchd
Platform Linux/UNIX
Impact User group is not recognized by LSF.
235115 Date 2014-06-18
Description Problem with guarantee and preemption features. High priority, guaranteed consumer
jobs remain pending even when resource requirements are met.
Component mbschd, schmod_default.so
Platform All
Impact Pending jobs do not run even if resource requirements are met.
235149 Date 2014-06-10
Description In a MultiCluster environment, the error message MCB_encodeMcbMsg:
xdr_func() failed is frequently and incorrectly logged in the mbatchd log.
Component mbatchd
Platform All
Impact The execution cluster cannot send job usage information to the submission cluster.
235662 Date 2014-05-29
Description If Intel MPI is installed in a directory other than the default /opt directory, Intel MPI
jobs using the PAM/TASK starter framework cannot run unless the MPI_TOPDIR is
manually changed in the Intel MPI wrapper.
Component intelmpi_wrapper
Platform Linux
Impact The intelmpi_wrapper must be modified manually if the Intel MPI location is not
the default directory.
235681 Date 2014-05-27
Description An XDR error message MCB_channelJobDecsnToRemote:
MCB_sendDispDecsnToCluster() failed occurs in mbatchd, and the job
cannot be forwarded to a remote cluster.
Component mbatchd
Platform All
Impact Jobs are not forwarded to the remote cluster in MultiCluster lease mode.
235696 Date 2014-06-16
Description When checkpoint jobs are submitted by a script, the command brestart -W does
not work and restarted jobs cannot be terminated by RUNLIMIT.
Component sbatchd, erestart
Platform Linux
Impact After running brestart, the job cannot exit and keeps cycling between "Checkpoint
initiated" and "Checkpoint succeeded".
235889 Date 2014-06-04
Description When executing a job containing multiple tasks, task RES calculates an incorrect XDR
size and causes an XDR encoding error.
Component res
Platform All
Impact Jobs that are launched by blaunch fail to execute.
235891 Date 2014-05-30
Description In MultiCluster lease mode, an uninitialized variable sometimes causes an mbschd
core dump.
Component mbschd, schmod_default.so
Platform All
Impact Job dispatch fails.
235920 Date 2014-05-30
Description The RES log contains an incorrect spelling for "acquire".
Component res, mbschd
Platform All
Impact Typo in an error message of the RES log file.
235925 Date 2014-06-17
Description The command combination bkill -r -J <job_name> does not work as expected.
Component bkill
Platform All
Impact bkill -r -J <job_name> does not kill jobs as expected.
236105 Date 2014-06-10
Description In MultiCluster lease mode, a buffer overflow occurs when the lease.state.file
is too large, causing an mbatchd core dump.
Component mbatchd
Platform All
Impact mbatchd core dumps and the LSF batch system is not available.
236187 Date 2014-07-02
Description When the first line of a bsub job script is longer than 16361 characters, a core dump
occurs.
Component bsub
Platform Linux2.6-glibc2.3-x86_64
Impact bsub core dumps and the job is not submitted.
236477 Date 2014-06-17
Description When mbschd encounters an error on a job, other jobs do not get scheduled and
remain stuck.
Component mbschd
Platform All
Impact Jobs are pending until badmin reconfig is issued.
236606 Date 2014-06-11
Description In MultiCluster forward mode, a parallel job submitted with the same RES_REQ is
blocked.
Component mbschd, schmod_mc.so
Platform All
Impact Jobs are stuck in pending status on submission cluster.
236614 Date 2014-06-11
Description When the lsb_submit() API is used to submit jobs for both a parent and child
process, jobs submitted for the child process do not process
LSB_SUB_MODIFY_FILE and LSB_SUB_MODIFY_ENVFILE properly.
Component liblsf.a, liblsb.so, libbat.a, libbat.so, lsbatch.h, lsf.h
Platform All
Impact esub does not work with the lsb_submit() API.
236705 Date 2014-07-02
Description When a shared resource is configured for the cluster, vemkd reports a warning
message:
lsfinit: resource <resource_name> is being used by multiple
hosts. It cannot be used in a resource requirement expression.
Component vemkd
Platform All
Impact Error messages appear in the vemkd log file even with a correct configuration.
236752 Date 2014-07-01
Description When preemption and guaranteed SLA are both enabled, mbschd will take a long
time to finish one scheduling cycle, especially when there are tens of thousands of
pending jobs.
Component mbschd
Platform All
Impact mbschd performance issue causes low job throughput of the LSF cluster.
237058 Date 2014-06-24
Description A core dump occurs when using bsub to submit a job, and the specified command
and its arguments contain multiple quotations.
Component bsub
Platform All
Impact Some jobs cannot be submitted.
237460 Date 2014-07-14
Description After enabling LSB_QUERY_ENH (in lsf.conf), if there are many bhosts requests
and there is an affinity host in the cluster or affinity is enabled in the cluster, the query
child mbatchd will core dump repeatedly. The core dump is caused by a "thread
unsafe function".
Component mbatchd
Platform All
Impact Child mbatchd core dumps and b* query commands do not work.
237644 Date 2014-07-03
Description When using the brestart command with GOLD Integration jobs, the command fails
due to some missing job information such as a project name or jobID.
Component brestart
Platform Linux
Impact GOLD integration does not work well with brestart.
237853 Date 2014-07-21
Description When a long pre-execution job is run, but the job is killed before the pre-execution
portion is finished, the environment variable LSF_JOB_EXECUSER cannot be retrieved
with eexec.
Component sbatchd
Platform Linux
Impact Many GOLD reservations will not be released if gcharge fails and a job is killed.
237867 Date 2014-07-02
Description Occasionally esub does not work when LSF_TMPDIR is set to a shared file system
because the temp file for one job is overwritten by another job.
Component bsub
Platform All
Impact Incorrect or missing job submission options are set in esub.
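The kind of collision described in 237867 is generally avoided by letting the operating system choose a unique file name at creation time. The sketch below shows that general technique, not how LSF fixed the bug: the function name `write_job_options` and the `esub_opts_` prefix are hypothetical.

```python
import os
import tempfile

def write_job_options(shared_dir, options):
    """Write options to a uniquely named file in shared_dir and return its path.

    mkstemp creates the file exclusively under a fresh name, so two submissions
    that share the directory do not end up writing to the same path.
    """
    fd, path = tempfile.mkstemp(prefix="esub_opts_", dir=shared_dir)
    with os.fdopen(fd, "w") as handle:
        for key, value in options.items():
            handle.write(f"{key}={value}\n")
    return path
```

Each caller gets its own file, so one job's submission options cannot replace another's even when the directory is on a shared file system.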
237955 Date 2014-06-27
Description mbatchd exits when the license error Unable to contact LIM occurs.
Component mbatchd
Platform All
Impact Error message is confusing.
238419 Date 2014-07-03
Description A job that is running on another host occasionally fails when the bpeek command is
used in a mixed cluster environment.
Component bpeek
Platform All
Impact bpeek occasionally does not work.
238881 Date 2014-07-21
Description For exclusive jobs, the reserved job cannot be backfilled.
Component mbschd, schmod_reserve.so
Platform All
Impact Short jobs cannot use backfill slots.
Copyright and trademark information
© Copyright IBM Corporation 2014
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks
of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright
and trademark information" at www.ibm.com/legal/copytrade.shtml.