May I win one faithful heart, never to part till our hair turns white.

——《白头吟》(Song of White Hair), Zhuo Wenjun, Han dynasty

Explanation

A submodule is Git's tool for managing multi-repository projects. It lets you add a subproject to a main project as an independent Git repository: the subproject keeps its own commits, pushes, and pulls without interfering with the main project, while the main project records only the subproject's address and the commit id it needs. From that address and commit id, the corresponding state of the subproject can always be retrieved.

[Typical scenario] Your project depends on a third-party library and you want the two decoupled, each under its own version control.

Adding

# url:  remote (or local) address of the subproject
# path: path to place the subproject at; may be omitted
git submodule add [url] [path]

A new file, .gitmodules, now appears; it records each submodule's path and remote URL.
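For instance, after adding a hypothetical submodule named libfoo, .gitmodules would look roughly like this:

[submodule "libfoo"]
    path = libfoo
    url = https://github.com/example/libfoo.git

Then commit the change: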

git add .gitmodules [path]
git commit -m "xxx"

Cloning

A plain clone of a project that contains submodules will not fetch them. Use the following command to clone the main project together with its dependency modules:

git clone --recurse-submodules [url]
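If the project was already cloned without its submodules, they can be fetched afterwards:

git submodule update --init --recursive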

Updating

Since the goal is to decouple the main project from its dependencies, I configure Git in the main project not to track changes inside the submodules, so that edits made there never end up committed through the main project (one way to do this is shown below).
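A minimal sketch of such a setting, assuming a hypothetical submodule named libfoo; "all" tells git status/diff in the main project to ignore both working-tree changes and new commits in the submodule:

git config submodule.libfoo.ignore all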

When you hit a bug in a dependency, cd into the submodule's own repository to fix it and open a PR there; developers who never touch the submodule do not need to care whether it has been updated.

Removing

# De-initialize the submodule; afterwards its directory is emptied
git submodule deinit [submodule_name]

# The subproject still shows up here
git submodule

# Remove the module entry recorded in .gitmodules
# (--cached also clears the copy cached under .git/modules)
git rm --cached [submodule_name]
# Now the removed subproject no longer shows up
git submodule

# Commit the change
git commit -m "xxx"

P.S.

  • When deploying automatically with GitHub Actions, submodules need extra configuration (see the sketch below).
  • ↑ Which is why I usually fold slow-changing (effectively frozen) submodules directly into my own project: a cheap way to keep weird things from happening.
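With actions/checkout, for example, the extra configuration is a single option (a minimal sketch; v2 of the action supports it):

- name: Checkout with submodules
  uses: actions/checkout@v2
  with:
    submodules: recursive  # 'true' fetches one level; 'recursive' fetches nested submodules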

Reference

  1. https://gist.github.com/myusuf3/7f645819ded92bda6677

Think it over: how much can life hold? Wind and rain of worry claim half of it. Why then fight to the death over rights and wrongs?
Luckily there are the clear breeze and the bright moon, moss spread as a carpet, clouds stretched high as a canopy. The South is fine: a thousand cups of good wine, and one song of Man Ting Fang.

——《满庭芳》(Man Ting Fang), Su Shi, Song dynasty

GitHub Pages gives every user DNS resolution and hosting for username.github.io. For ordinary (free) users, the Pages service works only for public repositories: turn the repository private and Pages stops resolving and updating. For Student Pro and paid Pro (and above) accounts, Pages serves both public and private repositories; Travis CI, however, remains free only for public repositories.

In short, this post is for everyone who wants to use GitHub Pages with a private repository.

P.S. Making the repository that hosts the Pages site private reduces privacy leakage. Still, every file on the deployed branch can be crawled and scanned.

Preface

I had been writing this blog with the simplest, most straightforward setup, GitHub Pages + Jekyll (nobody reads it anyway, right?).

Recently the epidemic has kept me shivering at home, but studying and working must go on, so I am finally doing the things I never had time for or never touched:

  • Make the gh-pages repository private
  • Replace Jekyll with Hexo
  • Use GitHub Actions for automatic deployment

Deploying with a key pair generated by ssh-keygen

# set up private key for deploy
mkdir -p ~/.ssh/
echo "$ACTION_DEPLOY_KEY" | tr -d '\r' > ~/.ssh/id_rsa # install the deploy key
chmod 600 ~/.ssh/id_rsa
ssh-keyscan github.com >> ~/.ssh/known_hosts
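The pair itself can be generated locally (a sketch; the deploy_key filename and comment are arbitrary). The public key is added to the repository as a deploy key with write access; the private key goes into the ACTION_DEPLOY_KEY secret:

ssh-keygen -t rsa -b 4096 -C "action-deploy" -f deploy_key -N ""
# deploy_key     -> private key, saved as the ACTION_DEPLOY_KEY secret
# deploy_key.pub -> public key, added as a deploy key with write access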

Deploying with a GitHub personal access token

It is a bit like the backup codes in Google two-step verification, except that a Google backup code is generated once, stays viewable, and is destroyed after a single use, whereas a GitHub personal access token is shown only once at creation (invisible afterwards) and can be used many times.

Go to Settings >> Developer settings >> Personal access tokens and click Generate new token.

Deploying the blog only needs read/write access to repo. Once the page is closed the token can never be viewed again (write it down!!).

Repository setup

Because the Pages service deploys from the master branch by default, two repository layouts are common:

  • Two repositories: repository A stores the Hexo source files, repository B (xxx.github.io) hosts the files Hexo generates; a push to A triggers a GitHub Action that redeploys B.
  • One repository: a source branch stores the Hexo sources and master hosts the generated files; a push to source triggers a GitHub Action that refreshes master and deploys Pages.

I use the single-repository, two-branch layout. Just resist the itch to merge the branches, or you will waste time cleaning up (the action rewrites master, so its history disappears on every refresh).

The token from the previous step must not be stored in plain text, so save it as a repository Secret and reference it indirectly. Go to the repository's Settings >> Secrets >> Add a new secret, name it GITHUB_ACCESS_TOKEN, and paste the token as its value.

Configuring the GitHub Action

Edit Hexo's _config.yml, replacing the id and repository name below with your own:

deploy:
  type: git
  repo: https://GITHUB_ACCESS_TOKEN@github.com/your-github-id/your-github-repo-name.git
  branch: master

Create .github/workflows/blogci.yml in the Hexo root directory with the following content, substituting your own Git identity:

name: BlogCI

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - name: Download Source file
      uses: actions/checkout@v1.0.0
      with:
        ref: source # change this to the branch holding your Hexo sources

    - name: Prepare Node env
      uses: actions/setup-node@v1.0.0
      with:
        node-version: "10.x" # match your Node.js version (10 and 12 both tested)

    - name: Set Env
      env:
        GITHUB_ACCESS_TOKEN: ${{ secrets.GITHUB_ACCESS_TOKEN }}
      run: |
        git config --global user.name 'my_name' # your GitHub username
        git config --global user.email 'my_email' # your GitHub email
        sed -i "s/GITHUB_ACCESS_TOKEN/$GITHUB_ACCESS_TOKEN/g" ./_config.yml

    - name: Hexo
      run: |
        npm i -g hexo-cli
        npm i
        hexo clean && hexo g && hexo d

The last step can also be written as follows (a local install avoids some npm dependency errors):

- name: Hexo
  run: |
    npm install hexo
    npm install
    npx hexo clean
    npx hexo g -d

Save, push to GitHub, then open the Actions tab: BlogCI should already be running.

Pricing

GitHub Actions is currently marketed with a free-minutes model:

  • Ordinary users get 2,000 free minutes per month
  • Pro users get 3,000 free minutes per month

In my experience a successful push takes 40-60 s of Action time; runs that take much longer (3-5 min) have usually gone wrong :(

At that rate, you should never need to spend anything extra on Actions :)

Reference

  1. GitHub Pages: https://pages.github.com/
  2. Jekyll: https://jekyllrb.com/
  3. Google backup codes: https://support.google.com/accounts/answer/1187538
  4. The blog this post builds on (with improvements and corrections): https://rook1e.com/p/6.html
  5. Hexo documentation: https://hexo.io/zh-cn/docs/

Hands-on: empirically searching for a learning rate

3e-4 is the best learning rate for Adam, hands down.

If 3e-4 does not work for my model on my dataset, I do two things:

  • If the loss shows no clear direction of movement, I lower the learning rate.
  • If the loss decreases only in the 5th or 6th decimal place, I raise the learning rate.
  • If necessary, I repeat the process.

In 2015, Leslie N. Smith formalized this trial-and-error procedure into a technique called the LR Range Test. The method is simple: run the model over the data for a few iterations, starting from a small learning rate and increasing it after every iteration, recording the loss at each learning rate; then plot loss against learning rate.

An LR Range Test plot should show three regions: in the first, the learning rate is too small for the loss to decrease noticeably; in the second, the loss converges quickly; in the last, the learning rate is so large that the loss starts to diverge.

Besides helping you pick a good learning rate, the technique doubles as a sanity check: if the plot does not show the three regions, or has a break in it (NaNs in the loss), the model has a defect or the data has errors. It is worth getting a clean LR range plot before launching a real run.
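A minimal LR Range Test sketch in PyTorch, assuming a model, a train_loader, and a loss_fn already exist (all three names are placeholders):

import math
import torch

def lr_range_test(model, train_loader, loss_fn,
                  lr_start=1e-7, lr_end=10.0, num_iters=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)
    # multiply the LR by a constant factor each step so it sweeps lr_start..lr_end
    gamma = (lr_end / lr_start) ** (1.0 / num_iters)
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            x, y = next(data_iter)
        except StopIteration:          # restart the loader if it runs dry
            data_iter = iter(train_loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        if math.isnan(losses[-1]):     # divergence: stop the sweep early
            break
        for group in optimizer.param_groups:
            group["lr"] *= gamma
    # plot losses vs. lrs on a log-x axis and look for the three regions
    return lrs, losses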

Since childhood I have been alone, tending the stars of the generations.

——《孤独》(Solitude), Bai Helin, China

But since meeting you, every thought has taken root, and future joys lie waiting.

——appended by an anonymous hand

Contents

Casual Chat on TPUs #1: Background / pricing / the TFRC program and its freebies
Casual Chat on TPUs #2: Setting up GCP / creating a TPU instance
Casual Chat on TPUs #3: Writing models
Casual Chat on TPUs #4: Coral Edge TPU for mobile/edge devices
Casual Chat on TPUs #5: Huge models trained on TPUs (time and compute budgets)

Those huge language models trained on TPUs

Six/Those Huge Neural Network Models, Training Time and Computing Resources

BERT

The original paper says the large BERT model took 4 days to train on 16 Cloud TPUs:

“Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.”

Now the cost: 16 Cloud TPU v3 devices for 4 days come to 16 × 8 × 24 × 4 = 12,288 USD. A researcher replied to the authors on Reddit that the model could be trained on cheaper preemptible TPUs for roughly 16 × 2.4 × 24 × 4 = 3,686.4 USD. The catch is that on-demand TPUs take priority over preemptible ones: when they need the compute, preemptible jobs get paused.

BERT's authors also stressed on Reddit how expensive the pretraining is. Jacob wrote: "OpenAI's Transformer has 12 layers and 768 hidden units; they trained it for 40 epochs on a 0.8-billion-word corpus with 8 P100s, which took a month. BERT-Large has 24 layers and 1024 hidden units and trains for 40 epochs on a 3.3-billion-word corpus, so on 8 P100s it might take a year? 16 Cloud TPUs is already a great deal of compute."

For comparability, every cost in this post is computed at the standard on-demand TPU price, so one BERT training run costs roughly 12.3k USD.

GPT-2

The other language model in the spotlight this year is GPT-2, which showed what "large" really means. Roughly speaking, GPT-2 is GPT scaled up more than tenfold, so it should need even more compute than BERT. With that much compute and data piled on, GPT-2's results are genuinely striking: given a premise, it calmly carries the story on.

The GPT-2 paper itself says nothing about compute; the only figure we found is from someone who appears to be one of the authors, stating that GPT-2 was trained on 32 Cloud TPU v3 devices (256 cores) for a little over a week.

At that rate the training cost is 32 × 8 × 24 × 7 = 43,008 USD, already 3 to 4 times the cost of training BERT.

XLNet

In 2018, Google's large-scale pretrained language model BERT brought the NLP world a great surprise. Recently, however, XLNet from Quoc V. Le and colleagues beat BERT on 20 tasks and set new state-of-the-art results on 18 of them. With results that good, does its cost also exceed BERT's?

The paper says the large XLNet model took about two and a half days on 128 Cloud TPU v3 devices (512 chips):

“We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days. ”

That works out to 128 × 8 × 24 × 2.5 = 61,440 USD. Unexpectedly, one XLNet run costs even more than GPT-2, five times the price of BERT. At that price, using the released pretrained XLNet in place of BERT seems the sensible choice from now on.

After seeing XLNet's compute bill, one developer sighed: "Thank goodness I don't work in NLP. If I had to persuade my boss to spend 60k+ USD training a model, with no guarantee it would even be good, I think I'd cry..."

BigGAN

Among vision models, the classic expensive job is training a high-resolution GAN. Last year the authors reported that training at 512 × 512 pixels took 64 Cloud TPU v3 devices for 24 to 48 hours:

“We train on a Google TPU v3 Pod, with the number of cores proportional to the resolution: 128 for 128×128, 256 for 256×256, and 512 for 512×512. Training takes between 24 and 48 hours for most models.”

Taking the maximum of 48 hours as the baseline, the cost is 64 × 8 × 48 = 24,576 USD. Yes, BigGAN also costs more to train than BERT, roughly twice as much.

StyleGAN

Finally, StyleGAN. Since the paper comes from NVIDIA, it was trained on Tesla V100s. The FFHQ dataset it uses consists of 1024 × 1024 face images, and the model trains for about a week on 8 Tesla V100s:

“Our training time is approximately one week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs.”

To keep the comparison fair we price it at Google Cloud rates as well: 8 × 2.48 × 24 × 7 = 3,333.12 USD in total. Perhaps because the dataset is limited to faces, StyleGAN costs far less than BigGAN.
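All of the figures above come from the same back-of-the-envelope formula, cost = devices × hourly price × hours, using the on-demand rates quoted in this post (8 USD/h per Cloud TPU v3 device, 2.48 USD/h per V100). A tiny Python helper reproduces them:

def training_cost(devices: int, price_per_hour: float, hours: float) -> float:
    # on-demand cost of one training run
    return devices * price_per_hour * hours

print(training_cost(16, 8, 24 * 4))     # BERT-Large: 12288.0 USD
print(training_cost(32, 8, 24 * 7))     # GPT-2:      43008.0 USD
print(training_cost(128, 8, 24 * 2.5))  # XLNet:      61440.0 USD
print(training_cost(64, 8, 48))         # BigGAN:     24576.0 USD
print(training_cost(8, 2.48, 24 * 7))   # StyleGAN:    3333.12 USD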

Reference

  1. https://www.reddit.com/r/MachineLearning/comments/c2pfgb/d_how_can_you_do_great_ai_research_when_you_dont/
  2. https://www.reddit.com/r/MachineLearning/comments/c59ikz/r_it_costs_245000_to_train_the_xlnet_model512_tpu/
  3. https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/

White hair hangs three thousand zhang in vain; one laugh at all the affairs of men. Ask: what, then, can still delight me?
I find the green hills utterly charming, and trust the green hills, seeing me, feel just the same.

——《贺新郎》(He Xin Lang), Xin Qiji, Southern Song

Contents

Casual Chat on TPUs #1: Background / pricing / the TFRC program and its freebies
Casual Chat on TPUs #2: Setting up GCP / creating a TPU instance
Casual Chat on TPUs #3: Writing models
Casual Chat on TPUs #4: Coral Edge TPU for mobile/edge devices
Casual Chat on TPUs #5: Huge models trained on TPUs (time and compute budgets)

Five: TPU hardware for edge computing

Five/Edge TPU and Other AI Accelerators

1. Why the Edge TPU is worth looking forward to

The AIY Edge TPU Dev Board takes the Raspberry Pi formula through an AI evolution, and its performance also exceeds the Pi's: the student outdoing the master. The AIY Edge TPU Accelerator, meanwhile, goes head to head with the Movidius NCS.

Together, the Edge TPU, the Edge TPU Dev Board, and the Edge TPU Accelerator should help edge AI flourish.

2. Edge TPU

When the Raspberry Pi appeared I was still reading for my bachelor's degree, trying out IoT in my final semester. Back then, IoT lab work in China was still done on one of those bloated circuit-simulation boxes, so when my friend @宋恺睿 (songkairui@hackret.com) first pulled out a Raspberry Pi, my reaction was: wow!!! I had no idea something that small could be so capable.

Raspberry Pi: Our mission is to put the power of computing and digital making into the hands of people all over the world. We do this so that more people are able to harness the power of computing and digital technologies for work, to solve problems that matter to them, and to express themselves creatively.

Likewise, I did not expect a device of the same (order of) size that gives edge hardware AI-inference capability to arrive this fast; the difference is simply that this time I was less surprised. Although no one around me uses one yet, ever since seeing Intel's Movidius Neural Compute Stick I had been waiting for an IoT AI accessory that could truly go mainstream.

Movidius Neural Compute Stick: the Intel® Movidius™ Neural Compute Stick (NCS) is a tiny, fanless deep-learning USB device for learning AI programming. The NCS is powered by the same low-power, high-performance Movidius™ Vision Processing Unit (VPU) found in millions of smart security cameras, gesture-controlled drones, industrial machine-vision equipment, and more.

Edge TPU: Google’s purpose-built ASIC designed to run inference at the edge.

The Edge TPU brings high-quality ML inference to the edge. It extends Google's Cloud TPU and Cloud IoT into an end-to-end (cloud-to-edge, hardware + software) infrastructure for deploying AI-based solutions. Besides the open-source TensorFlow Lite programming environment, the Edge TPU initially ships with several Google AI models, combining Google's expertise in AI and in hardware.

The Edge TPU complements CPUs, GPUs, FPGAs, and other ASIC solutions for running AI at the edge, all supported by Cloud IoT Edge.

                       Edge (devices / nodes / gateways / servers)    Google Cloud
Tasks                  ML inference                                   ML training and inference
Software, services     Cloud IoT Edge, Linux OS                       Cloud ML Engine, Kubernetes Engine, Compute Engine, Cloud IoT Core
ML frameworks          TensorFlow Lite, NN API                        TensorFlow, scikit-learn, XGBoost, Keras
Hardware accelerators  Edge TPU, GPU, CPU                             Cloud TPU, GPU, CPU

The Edge TPU is about one-eighth the size of a US one-cent coin, yet delivers respectable performance within a small physical and power envelope (exact numbers are not public; Google says it can run several state-of-the-art AI models simultaneously on every frame of high-definition video at 30 fps). It supports both PCIe and USB interfaces.

Its strength is accelerating on-device ML inference, either standalone or paired with Google Cloud to form a complete cloud-to-edge ML stack. In either configuration the Edge TPU processes data directly on the device: this protects privacy, removes the need for a persistent network connection, reduces latency, and delivers the performance on far less power.

3. The AIY Edge TPU Dev Board

The AIY Edge TPU Dev Board is a richly featured single-board computer carrying an Edge TPU. It splits into a baseboard and a core board: the baseboard provides the usual peripheral connectors, while the detachable core board is a system-on-module (SOM) built around the Google Edge TPU (the part under the shielding can).

Edge TPU module (SOM) specifications

CPU             NXP i.MX 8M SoC (quad Cortex-A53, Cortex-M4F)
GPU             Integrated GC7000 Lite graphics
ML accelerator  Google Edge TPU coprocessor
RAM             1 GB LPDDR4
Flash           8 GB eMMC
Wireless        Wi-Fi 2x2 MIMO (802.11b/g/n/ac, 2.4/5 GHz)
                Bluetooth 4.1
Dimensions      40 mm x 48 mm

Baseboard specifications

Storage     MicroSD slot
USB         Type-C OTG
            Type-C power
            Type-A 3.0 host
            Micro-B serial console
LAN         Gigabit Ethernet port
Audio       3.5 mm audio jack (CTIA compliant)
            Digital PDM microphones (x2)
            2.54 mm 4-pin terminal for stereo speakers
Video       HDMI 2.0a (full size)
            39-pin FFC connector for MIPI-DSI display (4 lanes)
            24-pin FFC connector for MIPI-CSI2 camera (4 lanes)
GPIO        40-pin expansion header
Power       5 V DC (USB Type-C)
Dimensions  85 mm x 56 mm

Supported operating system

Debian Linux

Supported framework

TensorFlow Lite

4. The Edge TPU coprocessor

The AIY Edge TPU Accelerator is a USB neural-network accelerator built around the Google Edge TPU. Over USB Type-C it attaches to any Linux-based PC or single-board computer (a Raspberry Pi, say) to run ML inference there.

Product specifications

ML accelerator  Google Edge TPU coprocessor
Connector       USB Type-C* (data/power)
Dimensions      65 mm x 30 mm

* Limited to USB 2.0 speeds on Raspberry Pi boards.

Supported operating system

Debian Linux

Supported framework

TensorFlow Lite

5. NVIDIA Jetson Nano ($99!)

A new dimension for AI

At just 70 x 45 mm, the Jetson Nano module is the smallest Jetson device. This production-ready system-on-module (SOM) powers AI on edge devices across industries, from smart cities to robotics.

High compute performance

The Jetson Nano delivers 472 GFLOPS for running modern AI algorithms fast. It can run multiple neural networks in parallel while processing several high-resolution sensors simultaneously, which suits entry-level network video recorders (NVRs), home robots, and intelligent gateways with full analytics.

Low power requirements

The Jetson Nano saves you time and effort and helps you innovate at the edge: powerful, efficient AI, computer vision, and high-performance computing at only 5 to 10 watts.

Technical specifications

GPU           NVIDIA Maxwell™ architecture with 128 NVIDIA CUDA® cores
CPU           Quad-core ARM® Cortex®-A57 MPCore processor
Memory        4 GB 64-bit LPDDR4
Storage       16 GB eMMC 5.1 flash
Video encode  4K @ 30 (H.264/H.265)
Video decode  4K @ 60 (H.264/H.265)
Camera        12 lanes (3x4 or 4x2) MIPI CSI-2 DPHY 1.1 (1.5 Gbps)
Connectivity  Gigabit Ethernet
Display       HDMI 2.0 or DP1.2 | eDP 1.4 | DSI (1 x2), 2 simultaneous
UPHY          1 x1/2/4 PCIe, 1x USB 3.0, 3x USB 2.0
I/O           1x SDIO / 2x SPI / 6x I2C / 2x I2S / GPIO
Dimensions    69.6 mm x 45 mm
Form factor   260-pin edge connector

6. References

  1. https://www.raspberrypi.org/
  2. https://software.intel.com/en-us/movidius-ncs
  3. https://cloud.google.com/edge-tpu/
  4. https://aiyprojects.withgoogle.com/edge-tpu
  5. https://www.nvidia.cn/autonomous-machines/embedded-systems/jetson-nano/

More

  • Update log
  • Buy a Jetson Nano
  • Buy an Edge TPU
  • Edge TPU / Jetson Nano tests && reports

Ninety-nine per cent of human decisions, including the important life choices concerning spouses, careers and dwellings, are made by the highly refined, evolved algorithms we call sensations, emotions and desires.

——Homo Deus: A Brief History of Tomorrow, Yuval Noah Harari (Israel)

Contents

Casual Chat on TPUs #1: Background / pricing / the TFRC program and its freebies
Casual Chat on TPUs #2: Setting up GCP / creating a TPU instance
Casual Chat on TPUs #3: Writing models
Casual Chat on TPUs #4: Coral Edge TPU for mobile/edge devices
Casual Chat on TPUs #5: Huge models trained on TPUs (time and compute budgets)

Three: Writing models for the TPU

Three/Modeling and Coding

Since falling into the DL pit, I have used my share of frameworks: from running Caffe on my own machine, to Keras; from a bloated TensorFlow to the then red-hot dynamic-graph, multi-GPU data-parallel PyTorch; then, through the PyTorch/Caffe2 merger, waiting desperately from 0.4.0 for the leap to 1.0, all while being beaten to the punch on papers by people wielding TPU/FPGA-class accelerators. If you can't beat them, join them: I resolved to ride the compute dividend too and iterate fast enough to create some value.

[An aside: a DGX-2 is about to arrive, heh, so I am gradually migrating back to PyTorch 1.0.1.] Gradually I came to understand one thing: ideas follow people, frameworks follow hardware. I won't claim the road has widened as I walked it, but there has been steady progress. ^ ^

Back to the topic. As noted above, TPUs are strict about the matching TF version; the currently usable versions are listed below:

TensorFlow version   Cloud TPU support start   Cloud TPU support end
1.13                 March 11, 2019            (End date not yet set)
1.12                 November 8, 2018          (End date not yet set)
1.11                 September 27, 2018        (End date not yet set)
1.9                  July 2, 2018              March 11, 2019
1.8                  April 20, 2018            March 11, 2019

The modeling pattern I recommend (the one I use most fluently):

tf.data for data ingestion and transformation via a parallel input pipeline.

tf.estimator to build your models.

Eager execution mode to train the model during development.

To get the maximum speed-up out of the TPU, shapes should be fixed by the time the model runs, so keep any dynamic dimensions under control:

The XLA compiler compiles a TensorFlow graph just in time for the first batch. If any subsequent batches have different shapes, the model doesn’t work. (Re-compiling the graph every time the shape changes is too slow.) Therefore, any model that has tensors with dynamic shapes that change at runtime isn’t well suited to TPUs.
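With tf.data, for instance, the usual way to keep shapes static is to drop the last partial batch, so that every batch the XLA compiler sees has exactly the same shape (a minimal sketch; dataset and batch_size are assumed to already exist):

# every batch then has exactly batch_size examples, so XLA compiles the graph once
dataset = dataset.batch(batch_size, drop_remainder=True)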

TensorFlow Serving for flexible, high-performance serving.

RESTful API: https://www.tensorflow.org/tfx/serving/api_rest

For concrete scaffolding, see the tensorflow/tpu repo on GitHub; a project can be sketched roughly as follows:

preprocessing.py

def preprocess_ops(xxx):        # xxx: placeholder arguments
    pass                        # shared ops: decode, resize, normalize, ...

def preprocess_for_train(xxx):
    pass                        # adds random augmentation on top of the shared ops

def preprocess_for_eval(xxx):
    pass                        # deterministic preprocessing only

model.py

def model(xxx):
    if use_keras:  # i.e. the model is tf.keras based
        # start with an Input layer and end with a layer-like function
        return model
    else:
        return model.output  # otherwise return the final output tensor

params.py

default_params = dict(
    # default hyper-parameters go here
)

inputpipeline.py

import preprocessing

class InputPipeline(object):
    def __call__(self, params):
        # do some preprocessing, then
        # return a tf.data.Dataset
        pass

main.py

import inputpipeline
import params
import model

# some FLAGS used when the script is launched from the shell

def model_fn(features, labels, mode, params):
    # build the network defined in model.py
    return TPUEstimatorSpec(...)  # one spec for each mode: train | eval | predict

def main():
    # tf.contrib.cluster_resolver.TPUClusterResolver
    tpu_cluster_resolver(tpu_grpc_name, zone, project)
    # tf.contrib.tpu.RunConfig
    config(cluster, model_dir, session_config, tpu_config)
    # tf.contrib.tpu.TPUEstimator
    classifier = TPUEstimator(use_tpu, model_fn, config, params, batch_size, ...)
    train_input, eval_input = inputpipeline(is_training, data_dir, num_parallel_calls, use_bfloat16)

    if FLAGS.mode == EVAL:
        pass
    elif FLAGS.mode in (TRAIN, TRAIN_AND_EVAL):
        pass

if __name__ == '__main__':
    app.run(main)

On the coding side little actually changes: wrap the Estimator as a TPUEstimator, keep TensorBoard inside, and with the official repo as a reference it is not very different from everyday use.

Then, by all means, run it again and again: try, try, try!

Four: Troubleshooting

Four/Troubleshooting

Error message:

WARNING:tensorflow:Estimator's model_fn (<function resnet_model_fn at 0x7f44d31fd730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': 1251, '_evaluation_master': 'grpc://10.2.3.2:8470', '_session_config': graph_options {
rewrite_options {
disable_meta_optimizer: true
}
}
cluster_def {
job {
name: "worker"
tasks {
value: "10.2.3.2:8470"
}
}
}
, '_log_step_count_steps': None, '_keep_checkpoint_max': 8, '_task_id': 0, '_global_id_in_cluster': 0, '_eval_distribute': None, '_protocol': None, '_master': 'grpc://10.2.3.2:8470', '_experimental_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1251, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': None, '_model_dir': 'gs://my_results/resnet-tpu-framework/weighted-resnet-4', '_save_summary_steps': 100, '_train_distribute': None, '_task_type': 'worker', '_num_worker_replicas': 1, '_service': None, '_device_fn': None, '_num_ps_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f44d2fd3d68>, '_tf_random_seed': 16, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7f44d3202748>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Precision: bfloat16
INFO:tensorflow:Using dataset: gs://my_datasets/data
INFO:tensorflow:Waiting for new checkpoint at gs://my_results/resnet-tpu-framework/weighted-resnet-4
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

The VM was running several jobs in the background at once (via nohup) and ran out of memory; increasing the VM's memory fixes it.
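A sketch of giving the VM more memory with gcloud (instance name, zone, and machine type here are placeholders; the instance must be stopped first):

gcloud compute instances stop my-vm --zone us-central1-b
gcloud compute instances set-machine-type my-vm --zone us-central1-b --machine-type n1-highmem-8
gcloud compute instances start my-vm --zone us-central1-b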


INFO:tensorflow:Error recorded from outfeed: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

Caused by op 'current_epoch', defined at:
File "./resnet_main.py", line 577, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 564, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
features, labels, mode, config)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2503, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1736, in create_tpu_hostcall
ret[name] = self._host_fns[name](*dequeue_ops)
File "./resnet_main.py", line 393, in host_call_fn
summary.scalar('current_epoch', ce[0], step=gs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 440, in scalar
return summary_writer_function(name, tensor, function, family=family)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 384, in summary_writer_function
should_record_summaries(), record, _nothing, name="")
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 377, in record
with ops.control_dependencies([function(tag, scope)]):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 438, in function
name=scope)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 633, in write_scalar_summary
name=name)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

AbortedError (see above for traceback): All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 28 meaning 'Timeout was reached', error details: SSL connection timeout
when initiating an upload to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Failed to sync 715 events to gs://my_results/resnet-tpu-framework/weighted-resnet-3/events.out.tfevents.1546877726.n-ec458f18-w-0.v2
Could not flush events file.
[[node current_epoch (defined at ./resnet_main.py:393) = WriteScalarSummary[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"](SummaryWriter, strided_slice, current_epoch/tag, current_epoch/Identity)]]

This looks like a network problem between the VM and GCS; switching connections and trying again solved it.


FailedPreconditionError (see above for traceback): Unable to enqueue when not opened, queue: [0000:00:05.0 PE0 C1 MC2 TN0 Queue TENSOR_CORE_INFEED]. State is: FAILED
[[node input_pipeline_task0/while/InfeedQueue/enqueue/2 (defined at /home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/ops/gen_tpu_ops.py:1055) = InfeedEnqueueTuple[_class=["loc:@input_pipeline_task0/while/IteratorGetNext_2"], device_ordinal=2, dtypes=[DT_BFLOAT16, DT_INT32], shapes=[[19267584], [128]], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext_2, input_pipeline_task0/while/IteratorGetNext_2:1)]]

INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://my_results/resnet-tpu-framework/weighted-resnet-9/model.ckpt.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Installing graceful shutdown hook.
2019-01-10 15:02:51.275101: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 6 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.

The TPU seems to have been created in a bad state (not sure why :( ). After waiting another minute, a 'WAIT_FOR_COORDINATOR' message appeared and the TPU initialized successfully.


INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:15:46.904944: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (10/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:16:46.908810: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (11/120).
INFO:tensorflow:Querying Tensorflow master (grpc://10.3.2.2:8470) for TPU system metadata.
2019-01-13 03:17:46.913479: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.3.2.2:8470).
WARNING:tensorflow:Retrying (12/120).

A gRPC connection problem. Reset the TPU, or simply create another one; problems of this kind can very likely be cured by restarting and recreating (one way to do it from the shell is sketched below).
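A sketch of resetting a TPU node with gcloud (the name and zone here are placeholders):

gcloud compute tpus stop my-tpu --zone us-central1-b
gcloud compute tpus start my-tpu --zone us-central1-b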


INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 2 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1251) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1251) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
INFO:tensorflow:Error recorded from training_loop: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./resnet_main.py", line 585, in <module>
tf.app.run()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./resnet_main.py", line 572, in main
hooks=hooks)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
rendezvous.raise_errors()
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
saving_listeners=saving_listeners
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/cyfeng16/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Compilation failure: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.

Largest program allocations in vmem:

XLA label: register allocator spill slots
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

XLA label: %fusion.177 = (bf16[], f32[256]{0}, f32[256]{0}, bf16[128,56,56,256]{3,0,2,1}) fusion(f32[256]{0}, f32[256]{0}, f32[256]{0}, f32[256]{0}, ...(+13)), kind=kOutput, calls=%fused_computation.177, sharding={ {maximal device=0}, {maximal device=0}, {maximal devi...} }
Allocation type: scoped

TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_2211098028679383727/_435}} = TPUCompileSucceededAssert[_device="/job:worker/replica:0/task:0/device:CPU:0"](TPUReplicate/_compile/_11277858145685444465/_434)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node TPUReplicate/_compile/_11277858145685444465/_434/after_compilation/_436_G6101}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:6", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=452505474266104386, tensor_name="edge_6943_...ation/_436", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:6"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

An OOM issue?? This version of the XLA compiler mishandles this particular situation and consumes more HBM than usual. Reducing the batch size to 512 or even 256 works around it; according to R (Russell), the problem should be fixed or at least relieved in TF v1.13.
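With the resnet_main.py reference model from tensorflow/tpu used in the logs above, that means lowering its batch-size flag (shown as an assumption about the script's flags; the remaining flags are elided):

python resnet_main.py --train_batch_size=512 ...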