INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 2 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
INFO:tensorflow:Error recorded from training_loop: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
Largest program allocations in vmem:
XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped
TPU compilation failed
	 [[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
Largest program allocations in vmem:
XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped
TPU compilation failed
	 [[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "./resnet_main.py", line 585, in <module>
    tf.app.run()
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./resnet_main.py", line 572, in main
    hooks=hooks)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
    rendezvous.raise_errors()
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
    raise value
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
    saving_listeners=saving_listeners
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
    raise value
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
    run_metadata=run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
Largest program allocations in vmem:
XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped
TPU compilation failed
	 [[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
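For reference, the "Hint:" lines above refer to the report_tensor_allocations_upon_oom field of tf.RunOptions in the TF 1.x API. Below is a minimal sketch of passing it on a Session.run call; the tiny variable graph is a hypothetical stand-in for the real training step, and note that TPUEstimator does not expose per-call RunOptions, so this only applies if the failure can be reproduced with a plain session:

import tensorflow as tf

# Hypothetical stand-in graph for the real training step.
x = tf.Variable(tf.zeros([2, 2]))
step = x.assign_add(tf.ones([2, 2]))

# Ask the runtime to report live tensor allocations if an OOM occurs,
# as suggested by the "Hint:" lines in the error above.
opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(step, options=opts)

If such a report confirms that the bf16[128,56,56,256] activations in %fusion.177 dominate, shrinking the per-core batch (the leading 128 dimension) would be the usual first thing to try, though whether that avoids this particular vmem compilation failure is not established by the log.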