slurmd unable to communicate with slurmctld


I followed the troubleshooting steps at https://slurm.schedmd.com/troubleshoot.html.



When running scontrol show slurmd, I get:



Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets = 1
Actual cores = 1
Actual threads per core = 1
Actual real memory = 984 MB
Actual temp disk space = 492 MB
Boot time = 2019-03-27T17:53:56
Hostname = fedora2
Last slurmctld msg time = NONE
Slurmd PID = 1549
Slurmd Debug = 4
Slurmd Logfile = /var/log/slurmd.log
Version = 17.11.13-2


I don't know why slurmd on fedora2 can't communicate with the controller on fedora1. The slurmctld daemon is running fine on fedora1.



The slurm.conf is as follows:



# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=fedora1
#
ControlMachine=fedora1
ControlAddr=192.168.1.4
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=fedora
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=verbose
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP


The output of tail /var/log/slurmd.log on fedora2 shows the same error repeated on multiple lines:



error: Unable to register: Unable to contact slurm controller (connect failure)









Tags: slurm

asked Mar 27 at 23:02 by user3273814 (edited Mar 28 at 2:19)

























1 Answer

Make sure that:

1. no firewall prevents the slurmd daemon from talking to the controller
2. munge is running on each server
3. the dates and times are in sync on all servers
4. the Slurm versions are identical
5. the name fedora1 can be resolved to the correct IP
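
Checks 1 and 5 can be roughly scripted from fedora2. The following is a minimal sketch, not a complete diagnosis: it assumes the controller name and address from the slurm.conf in the question, and port 6817 as shown in the commented-out SlurmctldPort line. The helper names `resolves_to`, `can_connect`, and `report` are hypothetical, and the script does not test munge, clock sync, or version mismatches.

```python
import socket

# Values copied from the slurm.conf in the question. 6817 is the port shown
# in the commented-out SlurmctldPort line, so it is assumed to be in effect.
CONTROL_MACHINE = "fedora1"
CONTROL_ADDR = "192.168.1.4"
SLURMCTLD_PORT = 6817


def resolves_to(name, expected_ip):
    """Return True if `name` resolves to `expected_ip` on this host."""
    try:
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError:
        # Covers resolver failures such as an unknown host name.
        return False
    return expected_ip in addresses


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report():
    """Print the result of both checks for the controller in the question."""
    print("name resolves:", resolves_to(CONTROL_MACHINE, CONTROL_ADDR))
    print("port reachable:", can_connect(CONTROL_ADDR, SLURMCTLD_PORT))
```

Calling `report()` on fedora2 prints one line per check. If the name does not resolve to 192.168.1.4, look at /etc/hosts or DNS first. If the port is not reachable, a firewall on fedora1 or a slurmctld that is not listening on that address are the likely candidates.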






answered Mar 29 at 14:33 by damienfrancois




















