Note: This discussion is about an older version of the COMSOL Multiphysics® software. The information provided may be out of date.

Discussion Closed: This discussion was created more than 6 months ago and has been closed.

Cluster computing got stuck: external progress scheduling

Hi, I have recently set up a cluster with COMSOL 4.3 installed on the head node. The installation folder and the model working directory are both shared.
However, when I try to submit a batch job using the cluster model from the model library described in the user guide, it gets stuck at "external progress: scheduling" (I found this in the .mph.log file).
If I select only one node, the head node, comsolclusterbatch.exe runs fine and returns the correct result. If I select only the second node, the compute node, comsolclusterbatch.exe again works and returns the result correctly.
But if I set the number of nodes to 2, so that a job requiring two nodes is submitted to the HPC Job Manager, comsolclusterbatch.exe appears on both nodes and uses some memory, about 60 MB, but neither process uses any CPU and the progress stays at 0.
The log is stuck at "external progress 1: scheduling".
It is very strange that COMSOL fails to work in parallel on two nodes.
I am running Windows HPC 2008 R2 on both the head node and the compute node.
The head node has an i7 CPU and the compute node has an AMD B28 CPU. Both have 16 GB of memory.

Looking forward to your reply.

13 Replies Last Post 5 nov. 2012, 08:11 UTC−5
Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Posted: 1 decade ago 17 oct. 2012, 18:08 UTC−4
I do all my work with COMSOL on Linux workstations, servers, and clusters. I have never used a Windows-based cluster. I would also be interested in other experiences with this type of system. I noticed that Microsoft stopped coming to the COMSOL Conference. Are there many Microsoft clusters out there using COMSOL?

Posted: 1 decade ago 17 oct. 2012, 22:28 UTC−4
HPC clusters may be easier for me, as I am not familiar with Linux commands.
That said, the guy in the office next to mine runs a Linux cluster: six Dell computers connected as a cluster, running Ubuntu. Maybe Linux is the more popular platform.
If I cannot get this to run properly, I may have to switch to a Linux platform.

Posted: 1 decade ago 18 oct. 2012, 01:28 UTC−4
Hi James,

we do have a Windows HPC Server 2008 R2 based cluster system. It works quite well with COMSOL since maybe 4.2 or so (it got better with each release).

If you have any detailed questions, please feel free to send a PM.

Regards
Matthias

Posted: 1 decade ago 18 oct. 2012, 01:30 UTC−4
Hello He Shuhong,

it could be anything from a problem in the model to a setup problem of your cluster. If you post a simple model, I can try to run it on our Windows HPC cluster to see if it works here.

Regards
Matthias

Update: We of course also run COMSOL on that cluster system...

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Posted: 1 decade ago 18 oct. 2012, 09:36 UTC−4
I am sure that COMSOL will run fine on your Microsoft HPC cluster, since it is listed as supported by COMSOL.
If you continue to be stuck, before trying your neighbor's Linux cluster, you might ask COMSOL tech support for help in case it is an installation problem. Do you have other test cases that come with the Microsoft HPC cluster, to make sure your cluster is working correctly?

Posted: 1 decade ago 21 oct. 2012, 09:44 UTC−4
Hi, it took me two days to verify whether anything was wrong with my cluster. The result is a little disappointing: we can run Microsoft Visual Studio in parallel, but there is still something wrong with COMSOL.
The situation is this: I installed COMSOL on the head node, with a license containing a clusternode serial.
I shared the installation directory on the head node to make it accessible to all compute nodes, and I set up a NEW directory, with full permissions for everyone, to store the *.mph file and the *.mph.log file.
The external COMSOL installation folder points to the shared COMSOL directory on my head node:
External working directory path: \\headnode\COMSOL43\
External file storing path: \\headnode\test1\
I am using the model from the model library, following the user guide "cluster_install_win_43.pdf".
I don't really understand what a Floating Network License is. I just want my model to be solved simultaneously by all my nodes, but it gets stuck at "external progress: scheduling".
I tried submitting a job manually with COMSOL commands in the HPC Cluster Manager, assigning two nodes including the head node and using 8 cores. Eight comsolclusterbatch.exe processes appeared in the Task Manager, occupying a lot of memory and CPU without returning anything.
When I submit a job through COMSOL using the configuration and paths above, the comsolclusterbatch.exe process appears on both the head node and the compute node, runs for several seconds, and then stops working. The process does not terminate automatically and returns nothing, and in the COMSOL GUI the progress is stuck at "external progress 1: scheduling".
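For readers with a similar setup, one thing that can be ruled out quickly is whether every node can actually reach and write to the shared folders. The following is only a minimal sketch in Python, not part of COMSOL or the HPC tools; the helper name `check_share` is made up for illustration, and the two UNC paths are the ones quoted in this post. It would need to be run on each node under the account that runs the job.

```python
import os
import tempfile


def check_share(path):
    """Return (exists, writable) for a directory path (hypothetical helper)."""
    exists = os.path.isdir(path)
    writable = False
    if exists:
        try:
            # Creating and removing a temporary file proves write access.
            with tempfile.NamedTemporaryFile(dir=path):
                writable = True
        except OSError:
            writable = False
    return exists, writable


if __name__ == "__main__":
    # Paths taken from the post; adjust to your own shares.
    for share in (r"\\headnode\COMSOL43", r"\\headnode\test1"):
        exists, writable = check_share(share)
        print(f"{share}: exists={exists}, writable={writable}")
```

If a share is missing or read-only from one node, that would at least be consistent with a job that starts and then returns nothing, although it does not prove that this is the cause here.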

Posted: 1 decade ago 21 oct. 2012, 09:47 UTC−4
I am just running COMSOL's test case, and it fails as described in my last post.
I think it may be an installation problem, since my cluster passed all the MPI diagnostics.

Posted: 1 decade ago 22 oct. 2012, 07:14 UTC−4
Hi,

there should be a .status and a .log file in the directory where you put your .mph file for cluster calculation. Could you have a look at those and post them?

Regards
Matthias
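A small sketch of how those files could be gathered for posting, assuming only that they sit in the shared folder next to the .mph file and end in .status and .log. The function name `collect_job_files` is hypothetical; nothing here calls COMSOL itself.

```python
from pathlib import Path


def collect_job_files(directory, tail=5):
    """Map each *.status / *.log file name to its last `tail` lines."""
    report = {}
    folder = Path(directory)
    if not folder.is_dir():
        return report
    for pattern in ("*.status", "*.log"):
        for path in sorted(folder.glob(pattern)):
            # errors="replace" keeps going if the log contains characters
            # that do not match the default encoding.
            lines = path.read_text(errors="replace").splitlines()
            report[path.name] = lines[-tail:]
    return report


if __name__ == "__main__":
    # Share name taken from the thread; replace with your own folder.
    for name, lines in collect_job_files(r"\\headnode\test1").items():
        print(f"--- {name}")
        print("\n".join(lines))
```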

Posted: 1 decade ago 22 oct. 2012, 07:50 UTC−4
I am very grateful to you; my problem has made a little progress.
Now I can run COMSOL in parallel and the batch process terminates normally. The PROBLEM now is that the batch job returns nothing. I assigned two nodes to run this batch job. The result is saved, but the progress is not updated in the COMSOL Desktop.
This is the log file, which I checked just now.

*******************************************
***COMSOL 4.3.0.151 progress output file***
*******************************************
Mon Oct 22 19:33:49 CST 2012
---------- Current Progress: 100 %
Memory: 428/450 563/582
Stationary Solver 1 in Solver 1 started at 22-十月-2012 19:34:42.
Current Progress: 0 %
Memory: 445/469 565/591
Nonlinear solver
Number of degrees of freedom solved for: 15040.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 0.0037
mod1.p: 0.01
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 31 0.0100000 32 2 1 2
- Current Progress: 10 %
Memory: 476/503 607/633
2 5.4 0.1000000 6 3 2 4
-- Current Progress: 20 %
Memory: 462/504 592/646
3 0.031 1.0000000 0.61 4 3 6
-------- Current Progress: 88 %
Memory: 483/504 616/646
4 0.0011 1.0000000 0.029 5 4 8
---------- Current Progress: 100 %
Memory: 469/504 599/646
5 4e-005 1.0000000 0.0024 6 5 10
Node 1:
Nonlinear solver
Number of degrees of freedom solved for: 15040.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 0.0037
mod1.p: 0.01
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 31 0.0100000 32 2 1 2
2 5.4 0.1000000 6 3 2 4
3 0.031 1.0000000 0.61 4 3 6
4 0.0011 1.0000000 0.029 5 4 8
5 4e-005 1.0000000 0.0024 6 5 10
Stationary Solver 1 in Solver 1: Solution time: 21 s.
Current Progress: 0 %
Memory: 505/506 660/661
---------- Current Progress: 100 %
Memory: 498/561 658/721
Stationary Solver 2 in Solver 1 started at 22-十月-2012 19:35:15.
Current Progress: 0 %
Memory: 549/572 664/721
Nonlinear solver
Number of degrees of freedom solved for: 67521.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.c: 27
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 0.64 0.0100000 0.64 2 1 2
- Current Progress: 10 %
Memory: 604/733 723/872
2 0.6 0.0722125 0.65 3 2 4
-- Current Progress: 20 %
Memory: 733/738 867/880
3 0.48 0.7221251 1.6 4 3 6
---- Current Progress: 44 %
Memory: 733/740 867/885
4 0.12 1.0000000 7.7 5 4 8
------- Current Progress: 73 %
Memory: 607/740 725/885
5 0.066 1.0000000 0.17 6 5 10
------- Current Progress: 70 %
Memory: 606/740 724/885
6 0.027 1.0000000 0.055 7 6 12
-------- Current Progress: 80 %
Memory: 607/740 724/885
7 0.0084 1.0000000 0.024 8 7 14
-------- Current Progress: 88 %
Memory: 737/740 871/885
8 0.0025 1.0000000 0.0079 9 8 16
--------- Current Progress: 94 %
Memory: 608/740 726/885
9 0.00074 1.0000000 0.0018 10 9 18
---------- Current Progress: 100 %
Memory: 584/740 700/885
Node 1:
Nonlinear solver
Number of degrees of freedom solved for: 67521.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.c: 27
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 0.64 0.0100000 0.64 2 1 2
2 0.6 0.0722125 0.65 3 2 4
3 0.48 0.7221251 1.6 4 3 6
4 0.12 1.0000000 7.7 5 4 8
5 0.066 1.0000000 0.17 6 5 10
6 0.027 1.0000000 0.055 7 6 12
7 0.0084 1.0000000 0.024 8 7 14
8 0.0025 1.0000000 0.0079 9 8 16
9 0.00074 1.0000000 0.0018 10 9 18
Stationary Solver 2 in Solver 1: Solution time: 104 s. (1 minute, 44 seconds)
Run time: 159 s.
Saving: \\headnode\samples\123.mph
Save time: 11 s.
Total time: 201 s.

And the status file, thankfully, says Done:
1350905830192
Done
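When comparing runs on one node against two, it can help to pull just the timing lines out of a log like the one above. The following is a rough sketch that assumes the log keeps exactly the "Solution time", "Run time", "Save time" and "Total time" wording shown here; the function name `summarize_log` and the regular expressions are my own and not part of COMSOL.

```python
import re

# Patterns match the timing lines as they appear in the log above.
SOLVE_RE = re.compile(r"^(?P<name>.+?): Solution time: (?P<secs>\d+) s\.")
TOTAL_RE = re.compile(r"^(?P<label>Run|Save|Total) time: (?P<secs>\d+) s\.")


def summarize_log(text):
    """Return (per-solver solution times, run/save/total times) in seconds."""
    solvers, totals = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SOLVE_RE.match(line)
        if match:
            solvers[match.group("name")] = int(match.group("secs"))
            continue
        match = TOTAL_RE.match(line)
        if match:
            totals[match.group("label").lower()] = int(match.group("secs"))
    return solvers, totals


# Timing lines copied from the log in this post.
SAMPLE = "\n".join([
    "Stationary Solver 1 in Solver 1: Solution time: 21 s.",
    "Stationary Solver 2 in Solver 1: Solution time: 104 s. (1 minute, 44 seconds)",
    "Run time: 159 s.",
    r"Saving: \\headnode\samples\123.mph",
    "Save time: 11 s.",
    "Total time: 201 s.",
])

if __name__ == "__main__":
    solvers, totals = summarize_log(SAMPLE)
    print(solvers)
    print(totals)
```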

Posted: 1 decade ago 22 oct. 2012, 08:28 UTC−4
Hi,

the log file looks absolutely ok. I believe that the status file should say "0", but cannot verify this at the moment.

Have you opened the saved .mph file, to see if there are results inside? From the log file, they should be there.

I see the same behavior from time to time as well: jobs run fine, but the GUI does not come back. I have no clue about the reason yet. Nevertheless, I always follow jobs from the HPC Cluster Manager (or Job Manager) as well, so I can see what is going on.

So maybe you could run some more tests!

Regards
Matthias

Posted: 1 decade ago 22 oct. 2012, 22:33 UTC−4
Hi, thank you for looking into my problem. It now seems to be solved to some extent.
The progress bar now sometimes behaves normally and returns the expected result for the samples in the model library. At other times the result is saved, but there is no notification to open it.
Now I am working on my own model, a 2D grating of about 93750 + 56250 = 150000 sq um, with a maximum element size of 0.2 for the free triangular mesh, giving about 50 million triangles and consuming about 9 GB of memory.
It is a little upsetting that COMSOL consumes so much memory. One of my compute nodes returned an MPI error, which terminated my calculation.
I have 16 GB on my head node and 18 GB on two compute nodes. Should I add more compute nodes or upgrade my head node?

Posted: 1 decade ago 23 oct. 2012, 01:35 UTC−4
Hi,

do you use the head node as a compute node as well?

Upgrade: What is the type of network in your cluster? Gigabit Ethernet, 10G Ethernet, Infiniband? What type of models would you like to run? Large models with lots of nodes, or parametric sweeps of smaller models?

We chose to have big compute nodes (96 GB RAM, 12 cores each) and a rather slow network (1G Ethernet) because we usually use the cluster for parametric studies, and the models easily fit onto one node. However, if you plan to run really huge models, you should go for large memory and the fastest network at the same time.

Regards
Matthias


Posted: 1 decade ago 5 nov. 2012, 08:11 UTC−5
Hi, we finally got our cluster running. Thank you again for your help!
There is still a small problem with our model. We are studying a certain photonic crystal structure, focusing on the energy gaps of the crystal in k (wave vector) space.
We found a model in version 3.5a, the band gap of a photonic crystal in the RF Module. I do not have COMSOL 3.5a, but I found the PDF describing the model and have made some progress in studying the eigenfrequency.
What is strange is that the PDF states an eigenfrequency of around 4.22e14, while we obtained 4.3e14 using the direct solver in the eigenfrequency study. I set up the same variables and constants as the PDF describes. For the integration over the whole domain I defined A as intop1(1), with unit m^2, and nEz as intop1(Ez*conj(Ez)/A), with unit (V/m)^2. Is there anything wrong with this? I found that this model no longer exists in version 4.3.
Another problem is that version 3.5a has a Harmonic Propagation selection in the solver parameters, which I could not find in version 4.3.
I am trying to rebuild this model in 4.3, and I really need some help.
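Written out, and assuming that intop1 integrates over the whole 2D domain, here called $\Omega$ (a symbol introduced only for this note), the two variables and the size of the discrepancy are:

```latex
A = \int_{\Omega} 1 \,\mathrm{d}A ,
\qquad
\mathrm{nEz} = \int_{\Omega} \frac{E_z \, E_z^{*}}{A} \,\mathrm{d}A
             = \frac{1}{A} \int_{\Omega} |E_z|^{2} \,\mathrm{d}A ,
\qquad
\frac{4.3\times10^{14} - 4.22\times10^{14}}{4.22\times10^{14}} \approx 0.019 .
```

So nEz is the area average of $|E_z|^2$, and the computed eigenfrequency differs from the documented value by about 1.9 %.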

Note that while COMSOL employees may participate in the discussion forum, COMSOL® software users who are on-subscription should submit their questions via the Support Center for a more comprehensive response from the Technical Support team.