slurmd unable to communicate with slurmctld
I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets = 1
Actual cores = 1
Actual threads per core = 1
Actual real memory = 984 MB
Actual temp disk space = 492 MB
Boot time = 2019-03-27T17:53:56
Hostname = fedora2
Last slurmctld msg time = NONE
Slurmd PID = 1549
Slurmd Debug = 4
Slurmd Logfile = /var/log/slurmd.log
Version = 17.11.13-2
I don't know why slurmd on fedora2 can't communicate with the controller on fedora1. The slurmctld daemon is running fine on fedora1.
The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=fedora1
#
ControlMachine=fedora1
ControlAddr=192.168.1.4
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=fedora
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=verbose
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP
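The controller endpoint that slurmd on fedora2 will try to reach can be read from this file. The following is a minimal Python sketch, not a Slurm tool: the helper names `parse_slurm_conf` and `controller_endpoint` are hypothetical, and the parser ignores include files, line continuations and quoting. It assumes the default SlurmctldPort of 6817 when the setting is commented out, as it is here.

```python
# Excerpt of the slurm.conf shown above (only the lines relevant to connectivity).
SAMPLE = """\
#SlurmctldHost=fedora1
ControlMachine=fedora1
ControlAddr=192.168.1.4
#SlurmctldPort=6817
#SlurmdPort=6818
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP
"""


def parse_slurm_conf(text):
    """Return (settings, nodes) parsed from slurm.conf text; keys are lower-cased."""
    settings, nodes = {}, {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        pairs = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        pairs = {key.lower(): value for key, value in pairs.items()}
        if "nodename" in pairs:
            nodes[pairs["nodename"]] = pairs
        elif "partitionname" in pairs:
            continue  # partitions are not needed for this check
        else:
            settings.update(pairs)
    return settings, nodes


def controller_endpoint(settings):
    """Pick the controller address and port a node would contact."""
    host = settings.get("controladdr") or settings.get("controlmachine")
    port = int(settings.get("slurmctldport", 6817))  # assumed default port
    return host, port


settings, nodes = parse_slurm_conf(SAMPLE)
print(controller_endpoint(settings))
print(nodes["fedora2"]["nodeaddr"])
```

Running the same helper against the copy of slurm.conf on each node is a quick way to confirm that both machines agree on the controller address.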
The output of tail /var/log/slurmd.log on fedora2 is the following line, repeated several times:
error: Unable to register: Unable to contact slurm controller (connect failure)
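A "connect failure" suggests the TCP connection to the controller is not being established at all. One way to test that independently of Slurm is a plain TCP probe run from fedora2. This is a sketch under assumptions: `can_connect` is a hypothetical helper, and the target 192.168.1.4:6817 is taken from the ControlAddr and the commented-out SlurmctldPort in the configuration above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"
```

For example, `can_connect("192.168.1.4", 6817)` run on fedora2 reports whether slurmctld's port is reachable. As a rough guide, a refused connection often means nothing is listening on that port or a firewall is rejecting it, while a timeout often means packets are being silently dropped; neither is conclusive on its own.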
Tags: slurm
asked Mar 27 at 23:02, edited Mar 28 at 2:19, by user3273814
1 Answer
Make sure that:
- no firewall prevents the slurmd daemon from talking to the controller
- munge is running on each server
- the dates are in sync
- the Slurm versions are identical
- the name fedora1 can be resolved to the correct IP
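The last item in the list can be checked with a few lines of Python on fedora2. This is only a sketch: `resolves_to` is a hypothetical helper, and the expected address 192.168.1.4 is the ControlAddr from the question's slurm.conf. It covers name resolution only; the munge, clock and version checks have to be done separately on each machine.

```python
import socket


def resolved_ipv4(name):
    """Return the set of IPv4 addresses the local resolver gives for name."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET)
    except socket.gaierror:
        return set()  # the name could not be resolved at all
    return {info[4][0] for info in infos}


def resolves_to(name, expected_ip):
    """True if name resolves to expected_ip on this machine."""
    return expected_ip in resolved_ipv4(name)
```

Calling `resolves_to("fedora1", "192.168.1.4")` on fedora2 should return True if /etc/hosts or DNS is set up as the configuration expects. A False result, for instance because fedora1 maps to a loopback address, would be consistent with the connect failure in the log.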
answered Mar 29 at 14:33 by damienfrancois