Since childhood I have been alone, tending the stars of every age.

—— "Solitude", Bai Helin (China)

But since meeting you, every thought has taken root, and future joys lie waiting.

—— appended by an anonymous hand

Contents

TPU Talk #1: Background / Pricing / the TFRC Program and Free Credits
TPU Talk #2: Configuring GCP / Creating a TPU Instance
TPU Talk #3: Writing Models
TPU Talk #4: Coral Edge TPU for Mobile and Edge Devices
TPU Talk #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Six / Giant Models Trained on TPUs: Training Time and Compute

BERT

The original paper states that the large BERT model needs 4 days of training on 16 Cloud TPUs:

“Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.”

Now let's work out the cost. At the on-demand Cloud TPU v3 price, 16 devices for 4 days come to 16 × 8 × 24 × 4 = $12,288. A researcher replied to the authors on Reddit that the model could be trained on cheaper preemptible TPUs instead, which would cost about 16 × 2.4 × 24 × 4 = $3,686.4. The catch is that on-demand TPUs take priority over preemptible ones: whenever Google needs the capacity, a preemptible TPU's resources can be reclaimed.

BERT's authors also pointed out on Reddit just how heavy the pretraining compute is. Jacob wrote: "OpenAI's Transformer has 12 layers and 768 hidden units; they trained it on a dataset of 0.8 billion words for 40 epochs using 8 P100s, which took a month. BERT-Large has 24 layers and 1024 hidden units and is trained for 40 epochs over a 3.3-billion-word corpus, so on 8 P100s that might take a year? 16 Cloud TPUs is already a very large amount of compute."

To keep the comparison consistent, all costs below are computed at the on-demand TPU price, so one BERT pretraining run costs roughly $12,300.

GPT-2

The other language model that drew enormous attention this year is GPT-2, which shows what a truly large model looks like. You can think of GPT-2 as GPT scaled up by more than ten times, so it should need even more compute than BERT. With that much compute and data stacked up, GPT-2's results really are striking: give it a premise and it calmly keeps the story going.

The GPT-2 paper itself says nothing about the compute used; the only figure we found is from someone who appears to be one of the authors, who said GPT-2 was trained on 256 TPU v3 cores, i.e. 32 Cloud TPU v3 devices, for a little over a week.

Taking that at face value, the training cost is 32 × 8 × 24 × 7 = $43,008, already 3 to 4 times the cost of training BERT.

XLNet

In 2018, Google released BERT, a large-scale pretrained language model that came as a very welcome surprise to NLP. Recently, though, XLNet, proposed by Quoc V. Le and colleagues, beat BERT on 20 tasks and set new state-of-the-art results on 18 of them. With results that strong, does its cost also exceed BERT's?

In the paper, the authors say that the large XLNet model takes about two and a half days to train on 128 Cloud TPU v3 devices (512 TPU v3 chips):

“We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days. ”

That works out to 128 × 8 × 24 × 2.5 = $61,440. Unexpectedly, one XLNet training run costs even more than GPT-2, five times the cost of BERT. With a bill like that, most people will simply switch to the released pretrained XLNet rather than pretrain their own BERT.

After seeing XLNet's compute bill, one developer sighed: "Thank goodness I don't work in NLP. If I had to convince my boss to spend over $60,000 training a model, with no guarantee the model would even be good, I think I'd cry..."

BigGAN

In vision, the classic expensive job is training a high-resolution GAN. Last year the BigGAN authors reported that training on 512×512 images takes 64 Cloud TPU v3 devices for 24 to 48 hours:

“We train on a Google TPU v3 Pod, with the number of cores proportional to the resolution: 128 for 128×128, 256 for 256×256, and 512 for 512×512. Training takes between 24 and 48 hours for most models.”

Taking the upper bound of 48 hours, the cost is 64 × 8 × 48 = $24,576. So yes, BigGAN also costs more to train than BERT, roughly twice as much.

StyleGAN

Finally, let's tally up StyleGAN. Since the paper comes from NVIDIA, it was trained on Tesla V100s. The FFHQ dataset it uses consists of 1024×1024 face images, and the model takes about a week to train on 8 Tesla V100s:

“Our training time is approximately one week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs.”

To keep the comparison fair we again price this at Google Cloud's hourly rate, which gives 8 × 2.48 × 24 × 7 = $3,333.12. Perhaps because the dataset is limited to faces, StyleGAN costs far less than BigGAN.
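All of the per-run figures above come from the same simple formula, number of accelerators × hourly price (USD) × hours. The small Python snippet below just replays that arithmetic for the runs listed in this post; the prices and durations are the ones quoted above and are approximate.

# Rough per-run training cost: devices * hourly price (USD) * hours.
runs = {
    "BERT-Large": (16, 8.00, 24 * 4),     # 16 Cloud TPU v3, 4 days
    "GPT-2":      (32, 8.00, 24 * 7),     # 32 Cloud TPU v3 (256 cores), ~1 week
    "XLNet":      (128, 8.00, 24 * 2.5),  # 128 Cloud TPU v3 (512 chips), 2.5 days
    "BigGAN-512": (64, 8.00, 48),         # 64 Cloud TPU v3, up to 48 hours
    "StyleGAN":   (8, 2.48, 24 * 7),      # 8 V100s at the GCP hourly rate, 1 week
}

for name, (devices, price, hours) in runs.items():
    print(f"{name:>10}: ${devices * price * hours:,.2f}")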

Reference

  1. https://www.reddit.com/r/MachineLearning/comments/c2pfgb/d_how_can_you_do_great_ai_research_when_you_dont/
  2. https://www.reddit.com/r/MachineLearning/comments/c59ikz/r_it_costs_245000_to_train_the_xlnet_model512_tpu/
  3. https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/

White hair hangs idly, three thousand zhang long; one laugh, and all the world's affairs are done. What, I ask, could still delight me?
I find the green hills so charming, and suspect the green hills see me just the same.

—— "He Xin Lang", Xin Qiji (Southern Song)

Contents

TPU Talk #1: Background / Pricing / the TFRC Program and Free Credits
TPU Talk #2: Configuring GCP / Creating a TPU Instance
TPU Talk #3: Writing Models
TPU Talk #4: Coral Edge TPU for Mobile and Edge Devices
TPU Talk #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Five / TPU Hardware for Edge Computing: Edge TPU and Other AI Accelerators

1. Why the Edge TPU is worth the excitement

The AIY Edge TPU Dev Board is, in effect, a Raspberry Pi that has evolved for AI, and it also outperforms the Pi, the student surpassing the master. The AIY Edge TPU Accelerator, meanwhile, competes head-on with the Movidius NCS.

Edge TPU / Edge TPU Dev Board / Edge TPU Accelerator will together push edge AI forward.

2. Edge TPU

When the Raspberry Pi first appeared I was still working on my bachelor's degree, and in my last semester I was exploring IoT. Back then the usual domestic setup for IoT experiments was still one of those bulky circuit-simulation boxes, so when my friend @宋恺睿 songkairui@hackret.com first pulled out a Raspberry Pi, my reaction was: wow!!! I had no idea something that small could be so capable.

Raspberry Pi: our mission is to put the power of computing and digital making into the hands of people all over the world, so that more people can harness computing and digital technologies for work, solve the problems that matter to them, and express themselves creatively.

Likewise, I did not expect a device in the same size class that gives edge hardware AI-inference capability to arrive so quickly; the only difference is that this time I was less surprised. Even though hardly anyone around me was using one yet, ever since seeing Intel's Movidius Neural Compute Stick I had been waiting for an IoT AI accelerator that could truly go mainstream.

Movidius Neural Compute Stick: the Intel® Movidius™ Neural Compute Stick (NCS) is a tiny, fanless deep-learning USB device for learning AI programming. The NCS is powered by the same low-power, high-performance Movidius™ Vision Processing Unit (VPU) found in millions of smart security cameras, gesture-controlled drones, industrial machine-vision equipment, and more.

Edge TPU: Google’s purpose-built ASIC designed to run inference at the edge.

The Edge TPU delivers high-quality ML inference at the edge. It complements Cloud TPU and Cloud IoT to provide an end-to-end (cloud-to-edge, hardware-plus-software) infrastructure that eases the deployment of AI-based solutions. Beyond the open-source TensorFlow Lite programming environment, the Edge TPU initially ships with several Google AI models, combining Google's expertise in both AI and hardware.

The Edge TPU complements CPU, GPU, FPGA, and other ASIC solutions for running AI at the edge, and will be supported by Cloud IoT Edge.

Edge (devices / nodes / gateways / servers)
  Tasks: ML inference
  Software, services: Cloud IoT Edge, Linux OS
  ML frameworks: TensorFlow Lite, NN API
  Hardware accelerators: Edge TPU, GPU, CPU

Google Cloud
  Tasks: ML training and inference
  Software, services: Cloud ML Engine, Kubernetes Engine, Compute Engine, Cloud IoT Core
  ML frameworks: TensorFlow, scikit-learn, XGBoost, Keras
  Hardware accelerators: Cloud TPU, GPU, and CPU

The Edge TPU is about one-eighth the size of a US one-cent coin. Within that footprint and power envelope it delivers respectable performance (exact benchmark numbers are not yet public; Google says it can run multiple state-of-the-art AI models simultaneously on every frame of high-resolution video at 30 frames per second), and it supports PCIe and USB interfaces.

Its advantage is that it accelerates ML inference on the device itself, and it can also be paired with Google Cloud to build a complete cloud-to-edge ML stack. In either configuration, the Edge TPU processes data locally, which protects privacy, removes the need for a persistent network connection, cuts latency, and allows lower power draw for a given level of performance.

3. AIY Edge TPU Dev Board

The AIY Edge TPU Dev Board is a single-board computer built around the Edge TPU, with a rich feature set. It is split into a baseboard and a core module: the baseboard carries the common peripheral interfaces, while the core module is a system-on-module (SOM) based on the Google Edge TPU (it can be detached from the baseboard); it is the shielded module shown in the original post's photo.

Edge TPU Module (SOM) specifications

CPU             NXP i.MX 8M SoC (quad Cortex-A53, Cortex-M4F)
GPU             Integrated GC7000 Lite graphics
ML accelerator  Google Edge TPU coprocessor
RAM             1 GB LPDDR4
Flash           8 GB eMMC
Wireless        Wi-Fi 2x2 MIMO (802.11 b/g/n/ac, 2.4/5 GHz); Bluetooth 4.1
Dimensions      40 mm x 48 mm

Baseboard specifications

Flash       MicroSD slot
USB         Type-C OTG; Type-C power; Type-A 3.0 host; Micro-B serial console
LAN         Gigabit Ethernet port
Audio       3.5 mm audio jack (CTIA-compliant); 2x digital PDM microphones; 2.54 mm 4-pin terminal for stereo speakers
Video       HDMI 2.0a (full size); 39-pin FFC connector for a MIPI-DSI display (4 lanes); 24-pin FFC connector for a MIPI-CSI2 camera (4 lanes)
GPIO        40-pin expansion header
Power       5 V DC (USB Type-C)
Dimensions  85 mm x 56 mm

Supported operating system

Debian Linux

Supported framework

TensorFlow Lite

4. Edge TPU Accelerator

The AIY Edge TPU Accelerator is a USB neural-network accelerator built around the Google Edge TPU. Over USB Type-C it can be attached to any Linux-based machine, whether a PC or a single-board computer such as a Raspberry Pi, to run ML inference.
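As a rough illustration of how such a device is typically driven from Python, here is a minimal sketch using the TensorFlow Lite runtime with the Edge TPU delegate shipped by the Coral runtime. The model file name is a placeholder and assumes a model already compiled for the Edge TPU.

import numpy as np
import tflite_runtime.interpreter as tflite

# Load a TFLite model compiled for the Edge TPU (file name is hypothetical)
# and attach the Edge TPU delegate from the Coral runtime.
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

# Feed one dummy input tensor and read back the output scores.
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))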

Product specifications

ML accelerator  Google Edge TPU coprocessor
Connector       USB Type-C* (data/power)
Dimensions      65 mm x 30 mm

* Limited to USB 2.0 speeds on Raspberry Pi boards.

Supported operating system

Debian Linux

Supported framework

TensorFlow Lite

5. NVIDIA Jetson Nano ($99!)

A new dimension for AI

At only 70 x 45 mm, the Jetson Nano module is the smallest Jetson device. This production-ready system-on-module (SOM) provides serious support for deploying AI on edge devices across many industries, from smart cities to robotics.

Higher compute performance

Jetson Nano delivers 472 GFLOPS for running modern AI algorithms fast. It can run multiple neural networks in parallel while processing several high-resolution sensors, making it ideal for entry-level network video recorders (NVRs), home robots, and intelligent gateways with full analytics capability.

Low power

Jetson Nano saves you time and effort and helps you innovate at the edge: powerful, efficient AI, computer vision, and high-performance computing at just 5 to 10 watts.

Technical specifications

GPU           NVIDIA Maxwell™ architecture with 128 NVIDIA CUDA® cores
CPU           Quad-core ARM® Cortex®-A57 MPCore processor
Memory        4 GB 64-bit LPDDR4
Storage       16 GB eMMC 5.1 flash
Video encode  4K @ 30 (H.264/H.265)
Video decode  4K @ 60 (H.264/H.265)
Camera        12 lanes (3x4 or 4x2) MIPI CSI-2 DPHY 1.1 (1.5 Gbps)
Connectivity  Gigabit Ethernet
Display       HDMI 2.0 or DP 1.2 | eDP 1.4 | DSI (1 x2), 2 simultaneous
UPHY          1 x1/2/4 PCIe, 1x USB 3.0, 3x USB 2.0
I/O           1x SDIO / 2x SPI / 6x I2C / 2x I2S / GPIO
Dimensions    69.6 mm x 45 mm
Form factor   260-pin edge connector

6. References

  1. https://www.raspberrypi.org/
  2. https://software.intel.com/en-us/movidius-ncs
  3. https://cloud.google.com/edge-tpu/
  4. https://aiyprojects.withgoogle.com/edge-tpu
  5. https://www.nvidia.cn/autonomous-machines/embedded-systems/jetson-nano/

More

  • Updates
  • Buying a Jetson Nano
  • Buying an Edge TPU
  • Edge TPU / Jetson Nano tests && report

Ninety-nine per cent of human decisions, including the most important life choices concerning spouses, careers and dwelling places, are handled by the evolved algorithms we call sensations, emotions and desires.

—— Homo Deus: A Brief History of Tomorrow, Yuval Noah Harari (Israel)

Contents

TPU Talk #1: Background / Pricing / the TFRC Program and Free Credits
TPU Talk #2: Configuring GCP / Creating a TPU Instance
TPU Talk #3: Writing Models
TPU Talk #4: Coral Edge TPU for Mobile and Edge Devices
TPU Talk #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Three / Writing Models for the TPU (Modeling and Coding)

Since falling into the deep-learning pit I have used my share of frameworks: from running Caffe on my own machine to Keras, from a bloated TensorFlow to the then red-hot dynamic-graph, multi-GPU data-parallel PyTorch. Later, while waiting forever for the PyTorch/Caffe2 merge to carry 0.4.0 across to 1.0, I kept getting beaten to papers by people armed with TPUs, FPGAs, and other accelerator hardware. If you can't beat them, join them: I decided to ride the compute dividend too and iterate on ideas faster.

[As a later aside: a DGX-2 is arriving soon, so I am gradually moving back to PyTorch 1.0.1.] Over time I came to understand one thing: ideas follow people, frameworks follow hardware. I won't claim the road keeps getting wider, but there has been steady progress. ^ ^

Back to the topic. As mentioned above, each TPU is strictly tied to a TensorFlow version. The currently available versions are listed below:

TensorFlow version Cloud TPU support start Cloud TPU support end
1.13 March 11, 2019 (End date not yet set)
1.12 November 8, 2018 (End date not yet set)
1.11 September 27, 2018 (End date not yet set)
1.9 July 2, 2018 March 11, 2019
1.8 April 20, 2018 March 11, 2019

The recommended pattern for structuring a model (the one I am most comfortable with):

tf.data for data ingestion and transformation via a parallel input pipeline.

tf.estimator to build your models.

Eager execution while developing and debugging the model.

To get the maximum speed-up from the TPU, tensor shapes should be fixed and known when the model runs, so be careful with anything dynamic:

The XLA compiler compiles a TensorFlow graph just in time for the first batch. If any subsequent batches have different shapes, the model doesn’t work. (Re-compiling the graph every time the shape changes is too slow.) Therefore, any model that has tensors with dynamic shapes that change at runtime isn’t well suited to TPUs.
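In practice this mostly comes down to the input pipeline and the batch dimension. Below is a minimal sketch of a TPU-friendly input function in the TF 1.x style used throughout this post; the TFRecord schema and GCS path are made up for illustration, and drop_remainder=True is what keeps every batch the same static shape so XLA compiles the graph only once.

import tensorflow as tf  # TF 1.x API, matching the versions in the table above

def parse_fn(serialized):
    # Hypothetical TFRecord schema: a flattened fixed-size image plus a label.
    features = tf.parse_single_example(serialized, {
        "image": tf.FixedLenFeature([224 * 224 * 3], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(features["image"], [224, 224, 3])
    return image, features["label"]

def train_input_fn(params):
    batch_size = params["batch_size"]  # TPUEstimator injects this at run time
    files = tf.gfile.Glob("gs://my_datasets/data/train-*")  # placeholder path
    ds = tf.data.TFRecordDataset(files)
    ds = ds.repeat().map(parse_fn, num_parallel_calls=64)
    # drop_remainder=True keeps the batch dimension static, so XLA compiles
    # the graph once instead of recompiling for a smaller final batch.
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(2)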

TensorFlow Serving for flexible, high-performance serving of the trained model.

RESTful API: https://www.tensorflow.org/tfx/serving/api_rest
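Once a SavedModel is being served, a prediction request against the REST API is a single POST. A minimal sketch, assuming a TensorFlow Serving instance on localhost:8501 and a model exported under the placeholder name "my_model":

import json
import requests  # assumes TensorFlow Serving is listening on port 8501

payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # must match the serving signature
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",  # "my_model" is a placeholder
    data=json.dumps(payload))
print(resp.json()["predictions"])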

For a concrete model-writing skeleton, see the tensorflow/tpu repo on GitHub; it can be sketched roughly as follows:

preprocessing.py

def preprocess_ops(xxx):
    pass

def preprocess_for_train(xxx):
    pass

def preprocess_for_eval(xxx):
    pass

model.py

def model(xxx):
    if model_is_tf_keras_based:   # pseudocode: branch on how the network is defined
        # start with an Input layer and end with a layer-like callable
        return model
    else:
        return model.output

params.py

defaults_params = dict(
    # ... model / training hyper-parameters ...
)

inputpipeline.py

import preprocessing

class InputPipeline(xxx):
    def __call__(self, params):
        # do some preprocessing, then
        return tf.dataset(ooo)   # pseudocode: build and return a tf.data.Dataset

main.py

import inputpipeline
import params
import model

# some FLAGS used when launching from the shell

def model_fn(features, labels, mode, params):
    # build the network defined in model.py
    return TPUEstimatorSpec(...)   # one spec each for TRAIN / EVAL / PREDICT

def main():
    # tf.contrib.cluster_resolver.TPUClusterResolver
    tpu_cluster_resolver = TPUClusterResolver(tpu_grpc_name, zone, project)
    # tf.contrib.tpu.RunConfig
    config = RunConfig(cluster, model_dir, session_config, tpu_config)
    # tf.contrib.tpu.TPUEstimator
    classifier = TPUEstimator(use_tpu, model_fn, config, params, batch_size, ...)
    train_input, eval_input = inputpipeline.InputPipeline(
        is_training, data_dir, num_parallel_calls, use_bfloat16)

    if FLAGS.mode == 'eval':
        pass
    elif FLAGS.mode in ('train', 'train_and_eval'):
        pass

if __name__ == '__main__':
    app.run(main)

On the coding side little actually changes: wrap the Estimator in a TPUEstimator, keep TensorBoard in the loop, follow the official repo, and the workflow feels much like what we are used to.
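A typical launch from the VM then looks roughly like this; the flag names mirror the sketch above and the official tpu repo scripts, but treat them as placeholders for whatever your own main.py defines:

python main.py \
  --tpu=$TPU_NAME \
  --tpu_zone=us-central1-f \
  --gcp_project=$PROJECT_ID \
  --data_dir=gs://my_datasets/data \
  --model_dir=gs://my_results/resnet-tpu-framework/run-1 \
  --mode=train_and_eval \
  --train_batch_size=1024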

Then run it, and run it again. Try / Try / Try!

Four / Troubleshooting

Error message:

WARNING:tensorflow:Estimator's model_fn (<function resnet_model_fn at 0x7f44d31fd730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': 1251, '_evaluation_master': 'grpc://10.2.3.2:8470', '_session_config': graph_options {
rewrite_options {
disable_meta_optimizer: true
}
}
cluster_def {
job {
name: "worker"
tasks {
value: "10.2.3.2:8470"
}
}
}
, '_log_step_count_steps': None, '_keep_checkpoint_max': 8, '_task_id': 0, '_global_id_in_cluster': 0, '_eval_distribute': None, '_protocol': None, '_master': 'grpc://10.2.3.2:8470', '_experimental_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1251, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': None, '_model_dir': 'gs://my_results/resnet-tpu-framework/weighted-resnet-4', '_save_summary_steps': 100, '_train_distribute': None, '_task_type': 'worker', '_num_worker_replicas': 1, '_service': None, '_device_fn': None, '_num_ps_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f44d2fd3d68>, '_tf_random_seed': 16, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7f44d3202748>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Precision: bfloat16
INFO:tensorflow:Using dataset: gs://my_datasets/data
INFO:tensorflow:Waiting for new checkpoint at gs://my_results/resnet-tpu-framework/weighted-resnet-4
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

The VM ran out of memory because several jobs were running in the background at the same time (via nohup); increasing the VM's memory fixes it.
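If you hit this, one way to resize the VM from the command line is sketched below; the instance name, zone, and machine type are placeholders, and the instance must be stopped before its machine type can be changed:

gcloud compute instances stop my-tpu-vm --zone=us-central1-f
gcloud compute instances set-machine-type my-tpu-vm \
    --machine-type=n1-highmem-4 --zone=us-central1-f
gcloud compute instances start my-tpu-vm --zone=us-central1-f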


INFO:tensorflow:Error recorded from outfeed: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

Caused by op 'current_epoch', defined at:
File "./resnet_main.py", line 577, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 564, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
features, labels, mode, config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2503, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1736, in create_tpu_hostcall
ret[name] = self._host_fns[name](*dequeue_ops)
File "./resnet_main.py", line 393, in host_call_fn
summary.scalar('current_epoch', ce[0], step=gs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 440, in scalar
return summary_writer_function(name, tensor, function, family=family)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 384, in summary_writer_function
should_record_summaries(), record, _nothing, name="")
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 377, in record
with ops.control_dependencies([function(tag, scope)]):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 438, in function
name=scope)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 633, in write_scalar_summary
name=name)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

AbortedError (see above for traceback): All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

This looks like a network problem between the VM and GCS; retrying (or switching networks) usually resolves it.


FailedPreconditionError (see above for traceback): Unable to enqueue when not opened, queue: [0000:00:05.0 PE0 C1 MC2 TN0 Queue TENSOR_CORE_INFEED]. State is: FAILED
[[node input_pipeline_task0/while/InfeedQueue/enqueue/2 (defined at /home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/ops/gen_tpu_ops.py:1055) = InfeedEnqueueTuple[_class=["loc:@input_pipeline_task0/while/IteratorGetNext_2"], device_ordinal=2, dtypes=[DT_BFLOAT16, DT_INT32], shapes=[[19267584], [128]], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext_2, input_pipeline_task0/while/IteratorGetNext_2:1)]]

INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Installing graceful shutdown hook.
2019-01-10 15:02:51.275101: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 6 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.

The TPU seems to have been in a bad state when it was created (not sure why :( ). After waiting another minute or so, a 'WAIT_FOR_COORDINATOR' message appeared and the TPU initialized successfully.


INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:15:46.904944: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (10/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:16:46.908810: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (11/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:17:46.913479: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (12/120).

A gRPC connection problem. Reset the TPU, or simply create another one; problems of this kind are, with high probability, solved by restarting or recreating the TPU.
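A sketch of the restart from the command line, assuming the TPU is named my-tpu in us-central1-f:

gcloud compute tpus stop my-tpu --zone=us-central1-f
gcloud compute tpus start my-tpu --zone=us-central1-f
# or delete it and create a fresh node if restarting is not enough
gcloud compute tpus delete my-tpu --zone=us-central1-f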


INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 2 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
INFO:tensorflow:Error recorded from training_loop: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./resnet_main.py", line 585, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 572, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
rendezvous.raise_errors()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

An OOM issue?

This particular situation seems to trip up this version of the XLA compiler: it consumes more HBM than it should. Reducing the batch size to 512 or even 256 works around it, and according to R (Russell) the problem should be fixed, or at least mitigated, in TF 1.13.

When the brave are angry, they draw their blades on those stronger than themselves; when the cowardly are angry, they draw their blades on those weaker.

—— Huagai Ji, Lu Xun (China)

Contents

TPU Talk #1: Background / Pricing / the TFRC Program and Free Credits
TPU Talk #2: Configuring GCP / Creating a TPU Instance
TPU Talk #3: Writing Models
TPU Talk #4: Coral Edge TPU for Mobile and Edge Devices
TPU Talk #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Two / Configuring Google Cloud Platform and Creating a TPU Instance

First, a few official links; based on these you can find (almost) bug-free tutorials (we'll get to the remaining pitfalls later):

Cloud TPU 主页: https://cloud.google.com/tpu/?hl=zh-cn

Quickstart: https://cloud.google.com/tpu/docs/quickstart?hl=zh-cn

TensorFlow: https://www.tensorflow.org/

PyTorch: Google staff and outside contributors have been working together on an XLA backend that compiles PyTorch (1.0 and later) into something the TPU hardware can run. Honestly, it works only so-so and still has plenty of pitfalls. For the supported, standard models and usage patterns the compute efficiency is about the same (think of a video site that throttles non-members so that members merely get normal speed), and TPU-XLA support has not yet been merged into the official PyTorch repo, so I am taking a wait-and-see attitude.

TPU Available TensorFlow Ops: https://cloud.google.com/tpu/docs/tensorflow-ops

This page lists every op the TPU supports. Different ops that implement the same operation can still differ in real speed; for the details, open the "op time" tab in the TPU profiler (TensorBoard inside).

TPU Tools: https://cloud.google.com/tpu/docs/cloud-tpu-tools

The main tool is the cloud_tpu_profiler pip package, which integrates the TPU profiler's visualizations into TensorBoard and is extremely helpful for bottleneck analysis.

1. Create a GCP account

Create a GCP account (easiest with a Gmail account), then create a GCP project.

2. Enable billing on the GCP account

Modify a project's billing settings

If you are a billing administrator on only one billing account, new projects you create are automatically linked to your existing billing account. If you have created multiple billing accounts and have access to them, you can change which billing account a project uses. This section explains how to change a project's billing account and how to enable and disable billing on a project.

If you want to receive bills or statements by email, or change who receives them, see Change billing contacts and notifications.

Change a project's billing account

To change the billing account of an existing project, you must be an owner of the project and a billing administrator on the target billing account. For information about billing administrators and billing permissions, see the access-control overview.

To change the billing account:

  1. Go to the Google Cloud Platform Console.
  2. Open the Console's left-hand menu and select Billing.
  3. If you have more than one billing account, you will be prompted to select Go to linked billing account to manage the current project's billing.
  4. Under Projects linked to this billing account, find the project whose billing account you want to change, then click the menu next to its name.
  5. Select Change billing account, then choose the target billing account.
  6. Click Set account.

Charges that have accrued but are not yet recorded in the transaction history are billed to the previous billing account; this can include charges from up to 2 days before the project was moved.

Enable billing for a project

How you enable billing depends on whether you are creating a new project or re-enabling billing on an existing one.

Enable billing for a new project

When you create a new project, you are prompted to choose which billing account to link to it. If you have only one billing account, it is linked to your project automatically.

If you have no billing account, you must create one and enable billing for your project before you can use most Google Cloud Platform features. To create a new billing account and enable billing for your project, follow the instructions in Create a new billing account.

Enable billing for an existing project

If billing has been temporarily disabled on a project, you can re-enable it as follows:

  1. Go to the Google Cloud Platform Console.
  2. From the project list, select the project you want to re-enable billing for.
  3. Open the Console's left-hand menu and select Billing.
  4. Click Link a billing account.
  5. Select a billing account, then click Set account.

Disable billing for a project

To stop automatic payments for a project, disable billing on it. Note that even with billing disabled you remain responsible for all outstanding charges on the billing account, which will be charged to the payment method on file.

To disable billing for a project:

  1. Go to the Google Cloud Platform Console.
  2. Open the left-hand menu and select Billing.
  3. If you have more than one billing account, select Go to linked billing account to manage the current project's billing. To find a different billing account, select Manage billing accounts.
  4. Under Projects linked to this billing account, find the project you want to disable billing for, then select Disable billing from the menu next to its name. You will be asked to confirm.
  5. Click Disable billing.

3. Enable the TPU API

If this is a brand-new GCP account, or your first time using a TPU, you need to enable the TPU service's API. In the Google Cloud Platform console, open the Compute Engine menu in the left sidebar and then its TPU sub-menu; in the middle of the page you will find an "Enable TPU API" button. Click it and wait for a few minutes.

(Many people trying TPUs for the first time report that enabling this API is unbelievably slow. Yes, it really is that slow; just wait it out.)
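If you prefer the command line, the same thing can be done from Cloud Shell or any machine with the Cloud SDK installed; a sketch, with the project ID as a placeholder:

gcloud config set project my-tpu-project
gcloud services enable tpu.googleapis.com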

4. Set up storage (Google Cloud Storage)

GCS (Google Cloud Storage) is the default, and in fact the only, I/O container a TPU can use, so we must create a Cloud Storage bucket to hold our data and our results (checkpoints).

  1. Go to the Cloud Storage / Storage page in the GCP console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options (the equivalent gsutil command is sketched after this list):

  • A unique name.

  • Default storage class: Regional.

  • The first storage class (Multi-Regional) can be used across several regions, but do not choose it unless you must: research and most workloads run in a single region anyway, so Multi-Regional is paying more for nothing. Classes three and four (Nearline / Coldline) suit static storage; Coldline in particular is the right, and very cheap, place for data you only want to archive and bury.

  • Location: if you are using a single Cloud TPU device, accept the default. If you want a Cloud TPU Pod slice, you must pick a region where Cloud TPU Pods are available.
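A minimal command-line equivalent, assuming a bucket name of your own choosing and the us-central1 region used throughout this post:

gsutil mb -l us-central1 -c regional gs://my-tpu-bucket/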

A few pitfalls here:

  1. The GCE VM, the Cloud TPU, and the GCS bucket must be co-located. GCE and Cloud TPU locations are specified down to the zone, while for GCS you only need to pick the matching region, e.g. TPU@us-central1-f, GCE@us-central1-f, GCS@us-central1.

  2. New users often find that, even after creating a VM in the right region, they still cannot read from or write to the bucket. That happens when the VM was created without GCS I/O permissions. The correct setup is:

When creating the VM instance, scroll to the bottom and, under "Identity and API access", select "Allow full access to all Cloud APIs".

5. Create the VM

A TPU unit is actually a full instance of its own, containing an XLA CPU, an XLA GPU, and 8 Tensor Cores, so in a sense the VM we create is just a switch: it tells the TPU unit when to work, what to work on, and how to do it.

[Money-saving tip] A 1-2 vCPU VM (n1-standard-1 or n1-standard-2) is usually enough, and [in extreme penny-pinching mode] even a preemptible shared-core VM will do. The one hard requirement is that the model is initialized on the VM instance, so the VM must have enough memory to construct the model.

For the more unusual models you write during research, especially ones that are not yet (or not successfully) optimized, do use a high-memory machine type such as the n1-highmem series. That said, a model that initializes successfully on the VM is not guaranteed to run on the TPU, since the TPU has its own HBM (on-chip memory) limit.
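A command-line sketch of such a "switch" VM; the instance name, zone, and machine type are placeholders, and the cloud-platform scope is the CLI equivalent of "Allow full access to all Cloud APIs" above:

gcloud compute instances create my-tpu-vm \
    --zone=us-central1-f \
    --machine-type=n1-standard-2 \
    --scopes=cloud-platform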

8 GB of on-chip memory (HBM) is associated with each Tensor Core for Cloud TPU v2; 16 GB for Cloud TPU v3.

In other words, a v2 device has 64 GB of usable "device memory" (to borrow the GPU term) and a v3 device has 128 GB.

If you are interested in a deeper look at the TPU's software and hardware architecture, see https://cloud.google.com/tpu/docs/system-architecture , which covers XLA compilation and explains why the TPU only accepts GCS as its I/O target. [Which also conveniently ties users to the GCP platform...]

6. Create the TPU

In Compute Engine, choose TPU and then Create TPU node, which opens the Create Cloud TPU page.

  • Name: whatever you like; it's entirely up to you.

  • Zone: pick the zone you need and keep it the same as your GCE/GCS location. If you are on the TFRC program, remember that only us-central1-f is the free pool. The zone-f quota is only active during your TFRC window and disappears outside it; in other words, you don't have to worry about extra charges once TFRC expires, so use it with peace of mind.

  • TPU type: v2 or v3, a single device or a Pod slice (depending on how fat your wallet is; right this way, sir).

  • TensorFlow version: this one matters. Each TPU is created pinned to a specific version, which is really a choice of XLA compilation. A mismatched local version may only raise warnings, or the TPU may simply refuse to work, so use exactly the version your local code compiles bug-free against.

  • Network: default.

  • IP address range: write a CIDR range that provides at least 8 addresses in the block (8 cores per TPU unit); see the gcloud sketch after this list.

  • The collapsed section lets you add a Tag (???); I have never had enough instances to need one. The preemptible TPU option also lives here.
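The whole node can also be created with a single gcloud command instead of the console; a sketch, with the name, version, and range as placeholders (add --preemptible for a preemptible node):

gcloud compute tpus create my-tpu \
    --zone=us-central1-f \
    --accelerator-type=v2-8 \
    --version=1.13 \
    --network=default \
    --range=10.240.1.0/29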

Now we have the resource (GCS), the switch (GCE), and the hardware (TPU); the next step is how to control the TPU at the software level, that is, how to tell it what to run.

More

  • How to use the ctpu tool
  • The gcloud command-line tool
  • Using TPUs from Colab

At that time the empire had long been at peace, and from the princes and nobles on down, there was no one who did not indulge in excess.

—— "Biography of Zhang Heng", Fan Ye (Liu Song dynasty)

Contents

TPU Talk #1: Background / Pricing / the TFRC Program and Free Credits
TPU Talk #2: Configuring GCP / Creating a TPU Instance
TPU Talk #3: Writing Models
TPU Talk #4: Coral Edge TPU for Mobile and Edge Devices
TPU Talk #5: Giant Models Trained on TPUs (Time and Compute Requirements)

Zero / Why does this blog exist?

[Note 1: this is not a pure tech blog; it is more lessons learned and rambling: how to do big things cheaply, how to use the tools more conveniently, and which pits you do not have to fall into. If you want pure tech, please head straight to the official English docs.]

[Note 2: this blog will keep being updated with new information, freshly discovered pits, and more rambling. Comments, suggestions, and corrections are welcome anytime. Email: cy.z.feng@gmail.com, CyFeng16@GitHub]

I am a research dog at a university. Back then the lab's GPUs were never enough (I am sure everyone has been through the "dividing the family estate" ritual in some form), including but not limited to killing each other's tasks, or a certain Suspect X occupying an entire machine.

The other big reason is that the speed at which ideas can be iterated genuinely limits how quickly our idle flights of fancy can land. Seeing a freshly published paper whose motivation or idea looks uncannily like a spark you once had yourself brings a sharper sting of defeat than simply having accomplished nothing.

In the spirit of "if you can't beat them, join them", that is, of validating my ideas faster with better tools, and with the help of Zhuohuan Li (李卓桓), the best senior labmate in China, I have by the time of this update been using TPUs for a full 6 months. ^ ^

Right, that's the blog, all done, confetti :tada: :tada: :tada: [at which point the readers who came expecting real content mail me razor blades]

One / A Brief Introduction to the TPU

TPU stands for Tensor Processing Unit; as the name suggests, it exists to accelerate tensor computation, which is to say matrix computation. I recently had the honor of visiting a Tsinghua-spawned startup in the Plug and Play incubator to talk about using TPUs, and learned that NVIDIA's latest 2080 Ti delivers roughly a 10X performance boost in HPC and deep learning, presumably thanks to the Tensor Core units specific to the 20-series cards. And yet the Tensor Cores in NVIDIA's cards each multiply only small 16-bit matrix tiles [not holding myself responsible for this after future upgrades], while the matrix unit in a TPU core is a 128x128 array. That is why, with sensible optimization, a model on a TPU can really fly (memory-access optimization matters just as much; fusion under XLA compilation greatly reduces memory traffic).

# Python PEP20
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Saving time is cherishing life, isn't it!!! [And it helps against hair loss, allegedly.]

I hope more researchers from China join the big family of TPU users, and I hope Google becomes more open on the hardware side. Hurry up with the v4!!! (More down-to-earth pricing would be even better; we all have to eat! :-D)

Cloud TPU version Support started Support ends
v3-8 October 10, 2018 (End date not yet set)
v2-8 February 12, 2018 (End date not yet set)
v2-32 (alpha) November 7, 2018 (End date not yet set)
v2-128 (alpha) November 7, 2018 (End date not yet set)
v2-256 (alpha) November 7, 2018 (End date not yet set)
v2-512 (alpha) November 7, 2018 (End date not yet set)

As of now (2019-03-31 11:41:06), two generations of TPU are open to the public, Cloud TPU v2 and Cloud TPU v3. In my experience v3 is roughly 1.8X the speed of v2 (measured per training step; upload/download and checkpointing are CPU-side work, so the TPU does not take the blame for those). Currently only the v2 generation offers Pod slices (think single-machine multi-GPU data parallelism, scaled up), up to 256 TPU chips (a 16x16 slice) with 512 cores in total. Performance is gorgeous, on a par with the $399,000 DGX-2, and the hourly price is just as gorgeous - -. Current TPU pricing on Google Cloud Platform (US regions) is listed below:

Cloud TPU v2: $4.50 per TPU per hour.
Preemptible TPU v2: $1.35 per TPU per hour.
Cloud TPU v3: $8.00 per TPU per hour.
Preemptible TPU v3: $2.40 per TPU per hour.
v2-32 Cloud TPU v2 Pod (alpha): $24.00 per Pod slice per hour.
v2-128 Cloud TPU v2 Pod (alpha): $96.00 per Pod slice per hour.
v2-256 Cloud TPU v2 Pod (alpha): $192.00 per Pod slice per hour.
v2-512 Cloud TPU v2 Pod (alpha): $384.00 per Pod slice per hour.
  • A preemptible TPU is one that Cloud TPU can terminate (preempt) at any time if it needs the resources for another task. Preemptible TPUs cost far less than ordinary TPUs.

Across regions the prices run roughly North America < Europe < Asia-Pacific (that's about it; nothing much to complain about).

[Key point / key point / key point] The big freebie is Google's TFRC program, which supports research and education. If you join TFRC you can use Cloud TPU v2 and v3 free of charge for a limited time: as long as the TPUs run in the us-central1-f zone, the Cloud TPU itself costs nothing. (The GCE VM that drives the TPU is still on your own dime... heh.)

Judging from the replies of friends I have talked into it, emailing the TFRC team gets an approval in about 2 weeks on average, and the typical grant is 5x TPU v2, 100x preemptible TPU v2, and 2x TPU v3, free for 1 month. [I trust you clever folks will put the free resources to good use, or find methods to get even more of them.]

I once did the math: for non-intensive work, simply validating a few models on TPUs rather than running production jobs, the daily spend is something my wallet can bear (not without heartache).

[Key point / key point] The other freebie is Google Colab. Colaboratory is a research project that provides free Jupyter notebook environments, needs no setup, and runs entirely in the cloud; it currently offers three kinds of free hardware: CPU, GPU (K80), and TPU (v2). The downsides are that long-running background computations may be stopped, and your session can disconnect for reasons as mundane as the browser's power saving.

Alright. In the next post we move on to the next topic: how to actually start using the TPU.