Casual Talk on TPU #3: Writing Models

Ninety-nine percent of human decisions, including the most important life choices about spouses, careers, and homes, are handled by highly evolved algorithms that we call sensations, emotions, and desires.

-- Homo Deus: A Brief History of Tomorrow, Yuval Noah Harari (Israel)

Table of Contents

Casual Talk on TPU #1: Background / Pricing / the TFRC Program and Its Freebies
Casual Talk on TPU #2: Setting Up the GCP Environment / Creating a TPU Instance
Casual Talk on TPU #3: Writing Models
Casual Talk on TPU #4: Coral Edge TPU Powering Mobile Devices
Casual Talk on TPU #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Three / Modeling and Coding: Writing Models for the TPU

Since falling into the DL rabbit hole, I have been through my share of frameworks: from running Caffe on my own machine in the early days, to Keras, to jumping from the bloated TensorFlow of the time to the then-red-hot PyTorch with its dynamic graphs and multi-GPU data parallelism. Later, while waiting forever for the 0.4.0-to-1.0 leap after the PyTorch/Caffe2 merger, I kept getting crushed in the paper race by people armed with TPU/FPGA-class accelerators. If you can't beat them, join them: I resolved to ride the compute dividend myself and iterate fast enough to create some value.

[As an aside: a DGX-2 is about to arrive (heh), so these days I am gradually drifting back to PyTorch 1.0.1.] Along the way I came to understand one thing: ideas follow people, and frameworks follow hardware. I won't claim the road keeps getting wider, but there has been steady progress. ^ ^

Back to the topic. As mentioned above, each Cloud TPU is strict about the matching TensorFlow version. The currently supported versions are listed below:

TensorFlow version    Cloud TPU support start    Cloud TPU support end
1.13                  March 11, 2019             (End date not yet set)
1.12                  November 8, 2018           (End date not yet set)
1.11                  September 27, 2018         (End date not yet set)
1.9                   July 2, 2018               March 11, 2019
1.8                   April 20, 2018             March 11, 2019
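
Before launching anything, it is worth failing fast if the VM's TensorFlow release does not match the TPU's software version. A minimal sketch (the expected version string here is just an example):

import tensorflow as tf

# The TF release on the VM must match the Cloud TPU software version
# chosen at TPU-creation time (see the table above).
assert tf.__version__.startswith('1.13'), \
    'unexpected TensorFlow version: %s' % tf.__version__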

The model-writing pattern I recommend (and use most fluently):

tf.data for data ingestion and transformation via a parallel input pipeline.

tf.estimator to build your models.

Eager execution for developing and debugging the model (the TPU training itself runs as a compiled graph via TPUEstimator).

A minimal input_fn sketch in this style follows.
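
To make the pattern concrete, here is a sketch of a TPUEstimator-style input_fn (the bucket path, feature names, and image size are placeholders of mine, not from the official repo):

import tensorflow as tf

def input_fn(params):
    """TPUEstimator-style input_fn: the per-call batch size is delivered
    via params['batch_size']."""
    batch_size = params['batch_size']
    dataset = tf.data.TFRecordDataset(tf.gfile.Glob('gs://my_bucket/train-*'))

    def parse(serialized):
        parsed = tf.parse_single_example(serialized, {
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
        image = tf.image.decode_jpeg(parsed['image'], channels=3)
        image = tf.image.resize_images(image, [224, 224])  # static shape
        return image, tf.cast(parsed['label'], tf.int32)

    dataset = dataset.map(parse, num_parallel_calls=8)
    dataset = dataset.shuffle(1024).repeat()
    # drop_remainder=True keeps every batch the same shape -- see the
    # note on static shapes below.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)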

To get the most out of the TPU's speed-up, tensor shapes should be fixed by the time the model runs, so keep dynamic shapes under tight control. As the official docs put it:

The XLA compiler compiles a TensorFlow graph just in time for the first batch. If any subsequent batches have different shapes, the model doesn’t work. (Re-compiling the graph every time the shape changes is too slow.) Therefore, any model that has tensors with dynamic shapes that change at runtime isn’t well suited to TPUs.
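
In practice this mostly means pinning shapes in the input pipeline: fixed image sizes, drop_remainder=True when batching, and, for variable-length data, truncating and padding to a single bound chosen up front. A minimal sketch (MAX_LEN and the incoming dataset of 1-D sequences are illustrative assumptions):

import tensorflow as tf

MAX_LEN = 128  # fixed upper bound, chosen before training starts

def make_static_batches(dataset, batch_size):
    # Truncate, then pad every sequence to MAX_LEN, so each batch the
    # XLA compiler sees has the same [batch_size, MAX_LEN] shape.
    dataset = dataset.map(lambda seq: seq[:MAX_LEN])
    return dataset.padded_batch(
        batch_size,
        padded_shapes=[MAX_LEN],  # pad to the fixed bound, not per-batch max
        drop_remainder=True)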

TensorFlow Serving then gives you flexible, high-performance deployment of the trained model.

RESTful API: https://www.tensorflow.org/tfx/serving/api_rest
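
A hedged example of hitting the predict endpoint of that REST API (host, port, model name, and input layout are placeholders, not from this post):

import json
import requests

url = 'http://localhost:8501/v1/models/my_model:predict'
payload = {'instances': [{'input_1': [1.0, 2.0, 3.0]}]}
resp = requests.post(url, data=json.dumps(payload))
print(resp.json()['predictions'])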

For a concrete project skeleton, refer to the tensorflow/tpu repo on GitHub; the layout looks roughly like this:

preprocessing.py

def preprocess_ops(xxx):
    # ops shared by train and eval (decode, normalize, ...)
    pass

def preprocess_for_train(xxx):
    # random augmentations, then the shared ops
    pass

def preprocess_for_eval(xxx):
    # deterministic resize/crop, then the shared ops
    pass

model.py

def model(xxx):
    if model is tf.keras based:  # pseudocode
        # start with an Input layer and end with a layer-like function
        return model
    else:
        return model.output

params.py

defaults_params = dict(
    ...
)

inputpipeline.py

import preprocessing

class InputPipeline(xxx):  # pseudocode
    # apply the preprocessing from preprocessing.py
    return tf.dataset(ooo)

main.py

import inputpipeline
import params
import model

# some FLAGS used when launching from the shell

def model_fn(features, labels, mode, params):
    # build the network from model.py
    return TPUEstimatorSpec(...)  # one spec per mode: train | eval | predict

def main():
    # tf.contrib.cluster_resolver.TPUClusterResolver
    tpu_cluster_resolver(tpu_gRPC_name, zone, project)
    # tf.contrib.tpu.RunConfig
    config(cluster, model_dir, session_config, tpu_config)
    # tf.contrib.tpu.TPUEstimator
    classifier = TPUEstimator(use_tpu, model_fn, config, params, batch_size, ...)
    train_input, eval_input = inputpipeline(is_training, data_dir,
                                            num_parallel_calls, use_bfloat16)

    if FLAGS.mode == EVAL:
        pass
    elif FLAGS.mode in (TRAIN, TRAIN_AND_EVAL):
        pass

if __name__ == '__main__':
    app.run(main)

On the coding side not much actually changes: wrap your Estimator in a TPUEstimator, keep TensorBoard inside as usual, and with the official repo as a reference it is not very different from everyday Estimator code. For instance:
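
A minimal sketch of that wiring under TF 1.x (the TPU name, zone, project, GCS paths, and batch/step numbers are placeholders; model_fn and input_fn are assumed to be defined as in the skeleton above):

import tensorflow as tf

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu='my-tpu', zone='us-central1-f', project='my-project')

run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my_bucket/model',
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=1000,  # steps per infeed/outfeed round trip
        num_shards=8))             # one shard per core on a v2-8/v3-8

classifier = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,             # must return a TPUEstimatorSpec
    config=run_config,
    use_tpu=True,
    train_batch_size=1024,         # global batch, split across the 8 cores
    eval_batch_size=1024)

classifier.train(input_fn=input_fn, max_steps=112590)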

Then, by all means, run it again and again and keep experimenting! Try/Try/Try!

Four / Troubleshooting

Error message:

WARNING:tensorflow:Estimator's model_fn (<function resnet_model_fn at 0x7f44d31fd730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': 1251, '_evaluation_master': 'grpc://10.2.3.2:8470', '_session_config': graph_options {
rewrite_options {
disable_meta_optimizer: true
}
}
cluster_def {
job {
name: "worker"
tasks {
value: "10.2.3.2:8470"
}
}
}
, '_log_step_count_steps': None, '_keep_checkpoint_max': 8, '_task_id': 0, '_global_id_in_cluster': 0, '_eval_distribute': None, '_protocol': None, '_master': 'grpc://10.2.3.2:8470', '_experimental_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1251, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': None, '_model_dir': 'gs://my_results/resnet-tpu-framework/weighted-resnet-4', '_save_summary_steps': 100, '_train_distribute': None, '_task_type': 'worker', '_num_worker_replicas': 1, '_service': None, '_device_fn': None, '_num_ps_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f44d2fd3d68>, '_tf_random_seed': 16, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7f44d3202748>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Precision: bfloat16
INFO:tensorflow:Using dataset: gs://my_datasets/data
INFO:tensorflow:Waiting for new checkpoint at gs://my_results/resnet-tpu-framework/weighted-resnet-4
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

The VM was running several jobs in the background at once (via nohup) and ran out of memory; giving the VM more memory fixes it.
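
One way to do that from the gcloud CLI (instance name, zone, and machine type here are placeholders):

gcloud compute instances stop my-vm --zone us-central1-f
gcloud compute instances set-machine-type my-vm \
    --zone us-central1-f --machine-type n1-highmem-8
gcloud compute instances start my-vm --zone us-central1-f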


INFO:tensorflow:Error recorded from outfeed: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

Caused by op 'current_epoch', defined at:
File "./resnet_main.py", line 577, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 564, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
features, labels, mode, config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2503, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1736, in create_tpu_hostcall
ret[name] = self._host_fns[name](*dequeue_ops)
File "./resnet_main.py", line 393, in host_call_fn
summary.scalar('current_epoch', ce[0], step=gs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 440, in scalar
return summary_writer_function(name, tensor, function, family=family)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 384, in summary_writer_function
should_record_summaries(), record, _nothing, name="")
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 377, in record
with ops.control_dependencies([function(tag, scope)]):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 438, in function
name=scope)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 633, in write_scalar_summary
name=name)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

AbortedError (see above for traceback): All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

This looks like a network problem between the VM and GCS. Changing the connection and retrying solved it.


FailedPreconditionError (see above for traceback): Unable to enqueue when not opened, queue: [0000:00:05.0 PE0 C1 MC2 TN0 Queue TENSOR_CORE_INFEED]. State is: FAILED
[[node input_pipeline_task0/while/InfeedQueue/enqueue/2 (defined at /home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/ops/gen_tpu_ops.py:1055) = InfeedEnqueueTuple[_class=["loc:@input_pipeline_task0/while/IteratorGetNext_2"], device_ordinal=2, dtypes=[DT_BFLOAT16, DT_INT32], shapes=[[19267584], [128]], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext_2, input_pipeline_task0/while/IteratorGetNext_2:1)]]

INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Installing graceful shutdown hook.
2019-01-10 15:02:51.275101: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 6 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.

The TPU apparently came up in a bad state (not sure why). After waiting another minute or so, a 'WAIT_FOR_COORDINATOR' message appeared and the TPU initialized successfully.
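
One way to check the TPU's state before (re)launching a job (name and zone are placeholders):

gcloud compute tpus describe my-tpu --zone us-central1-f
# a healthy TPU reports: state: READY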


INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:15:46.904944: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (10/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:16:46.908810: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (11/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:17:46.913479: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (12/120).

A gRPC connection problem. Reset the TPU, or simply create a new one; problems of this kind are very likely to be solved by restarting or recreating, for example:
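
# Restart the existing TPU (name and zone are placeholders) ...
gcloud compute tpus stop my-tpu --zone us-central1-f
gcloud compute tpus start my-tpu --zone us-central1-f

# ... or delete it and create a fresh one (version must match your TF).
gcloud compute tpus delete my-tpu --zone us-central1-f
gcloud compute tpus create my-tpu --zone us-central1-f \
    --accelerator-type v2-8 --version 1.12 --range 10.3.2.0/29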


INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 2 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
INFO:tensorflow:Error recorded from training_loop: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./resnet_main.py", line 585, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 572, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
rendezvous.raise_errors()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

An OOM issue?? It seems this version of the XLA compiler misbehaves in this particular situation and eats more on-chip memory than usual. Reducing the global batch size to 512, or even 256, works around the problem; according to R(Russell), TF v1.13 should fix or at least relieve it.
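
With the TPUEstimator pattern above, that is a one-argument change (other arguments elided):

classifier = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=512,  # was 1024; drop to 256 if vmem still overflows
    eval_batch_size=512)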