/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch_geometric/typing.py:18: UserWarning: An issue occurred while importing 'pyg-lib'. Disabling its usage. Stacktrace: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/libpyg.so)
  warnings.warn(f"An issue occurred while importing 'pyg-lib'. "
/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch_geometric/typing.py:42: UserWarning: An issue occurred while importing 'torch-sparse'. Disabling its usage. Stacktrace: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/libpyg.so)
  warnings.warn(f"An issue occurred while importing 'torch-sparse'. "
Seed set to 42
/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3 train_model.py --model hi_lam --graph hierarchical ...
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 81mdiqse.
wandb: Tracking run with wandb version 0.17.5
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3 train_model.py --model hi_lam --graph hierarchical ...
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

   | Name                 | Type           | Params | Mode
-----------------------------------------------------------------
0  | m2m_edge_index       | BufferList     | 0      | train
1  | mesh_up_edge_index   | BufferList     | 0      | train
2  | mesh_down_edge_index | BufferList     | 0      | train
3  | m2m_features         | BufferList     | 0      | train
4  | mesh_up_features     | BufferList     | 0      | train
5  | mesh_down_features   | BufferList     | 0      | train
6  | mesh_static_features | BufferList     | 0      | train
7  | grid_embedder        | Sequential     | 6.4 K  | train
8  | g2m_embedder         | Sequential     | 4.5 K  | train
9  | m2g_embedder         | Sequential     | 4.5 K  | train
10 | g2m_gnn              | InteractionNet | 29.2 K | train
11 | encoding_grid_mlp    | Sequential     | 8.4 K  | train
12 | m2g_gnn              | InteractionNet | 29.2 K | train
13 | output_map           | Sequential     | 4.5 K  | train
14 | mesh_embedders       | ModuleList     | 22.4 K | train
15 | mesh_same_embedders  | ModuleList     | 22.7 K | train
16 | mesh_up_embedders    | ModuleList     | 18.2 K | train
17 | mesh_down_embedders  | ModuleList     | 18.2 K | train
18 | mesh_init_gnns       | ModuleList     | 116 K  | train
19 | mesh_read_gnns       | ModuleList     | 116 K  | train
20 | mesh_down_gnns       | ModuleList     | 466 K  | train
21 | mesh_down_same_gnns  | ModuleList     | 583 K  | train
22 | mesh_up_gnns         | ModuleList     | 466 K  | train
23 | mesh_up_same_gnns    | ModuleList     | 583 K  | train
-----------------------------------------------------------------
2.5 M     Trainable params
0         Non-trainable params
2.5 M     Total params
10.012    Total estimated model params size (MB)
Loaded graph with 444629 nodes (378200 grid, 66429 mesh)
Loaded hierarchical graph with structure:
level 0 - 59049 nodes, 469480 same-level edges
  0<->1 - 59049 up edges, 59049 down edges
level 1 - 6561 nodes, 51520 same-level edges
  1<->2 - 6561 up edges, 6561 down edges
level 2 - 729 nodes, 5512 same-level edges
  2<->3 - 729 up edges, 729 down edges
level 3 - 81 nodes, 544 same-level edges
  3<->4 - 81 up edges, 81 down edges
level 4 - 9 nodes, 40 same-level edges
Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Traceback (most recent call last):
    main()
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/train_model.py", line 330, in main
    trainer.fit(
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1028, in _run_stage
    self._run_sanity_check()
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1057, in _run_sanity_check
    val_loop.run()
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 410, in validation_step
    return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 640, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1113, in _run_ddp_forward
    return module_to_run(*inputs, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/neural_lam/models/ar_model.py", line 233, in validation_step
    prediction, target, pred_std = self.common_step(batch)
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/neural_lam/models/ar_model.py", line 189, in common_step
    prediction, pred_std = self.unroll_prediction(
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/neural_lam/models/ar_model.py", line 142, in unroll_prediction
    pred_state, pred_std = self.predict_step(
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/neural_lam/models/base_graph_model.py", line 134, in predict_step
    mesh_rep = self.g2m_gnn(
  File "/scratch2/NCEPDEV/naqfc/Jianping.Huang/miniconda3/envs/neural-lam/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch2/NCEPDEV/stmp3/Jianping.Huang/Aidan/neural-lam/neural_lam/interaction_net.py", line 112, in forward
    rec_diff = self.aggr_mlp(torch.cat((rec_rep, edge_rep_aggr), dim=-1))
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 59049 but got size 59040 for tensor number 1 in the list.
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./wandb/offline-run-20240828_172207-81mdiqse
wandb: Find logs at: ./wandb/offline-run-20240828_172207-81mdiqse/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.