Thread starter: wpf999

Ultimate detection approach: dealing with a hung FAHClient

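Thread starter | Posted 2017-3-31 20:30:42
ONLY posted on 2017-3-31 20:27
After waiting a few days, a GPU client finally hung. I tested it: telnet behaved as usual and responded normally, so that approach is a dead end ...

Did it hang on download or on upload? Or was there some other cause?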

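Thread starter | Posted 2017-3-31 20:31:50
ONLY posted on 2017-3-31 20:27
After waiting a few days, a GPU client finally hung. I tested it: telnet behaved as usual and responded normally, so that approach is a dead end ...

Could I observe this situation remotely?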

Posted 2017-3-31 21:54:16 from mobile
It hung on download. I have already restarted it manually.

Posted 2017-3-31 21:59:17 from mobile
wpf999 posted on 2017-3-31 20:31
Could I observe this situation remotely?

PM sent.

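Thread starter | Posted 2017-3-31 22:24:35

Thanks for all the help. We still need more observation and testing; the problem may depend on the system environment (Windows vs. Linux). However, judging from the log files in the logs directory, the most recent download before the manual restart completed, so the hang may have some other cause.
[Screenshot: 360截图20170331222032929.jpg]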





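Posted 2017-3-31 23:30:03 from mobile
wpf999 posted on 2017-3-31 22:24
Thanks for all the help. We still need more observation and testing; the problem may depend on the system environment (Windows vs. Linux). However, judging from the log files in the logs ...

I haven't looked at the logs closely.

My network has never been very good. Several times in the past I saw a download stall partway through; each time, after I manually restarted the client, it downloaded a new work unit and started computing.

I have also run the client on Windows before, and it hung there too. In that case I usually just rebooted the machine.

The network seems fairly normal over the last few days, whereas half a month ago it hung almost every day. Feel free to pull the logs and take another look.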

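Posted 2017-4-1 08:43:33
Moderator ONLY's screenshot shows that the stuck slot is FS01, while the slot that most recently completed a download is FS02. Moderator wpf999, could you double-check?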

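Thread starter | Posted 2017-4-1 10:07:32
Lynt posted on 2017-4-1 08:43
Moderator ONLY's screenshot shows that the stuck slot is FS01, while the slot that most recently completed a download is FS02. Moderator wpf999, could you double-check? ...

log.txt is the log of the currently running session (that is, the log written after the restart); the earlier logs are in the logs directory. In logs I opened the log file with the most recent date in vi, jumped to the end of the file, and then searched upward with :?Download. The match shown is the most recent download activity, and I confirmed that it completed.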
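The upward search described above can also be done outside vi. The following is a minimal Python sketch, not anything used in this thread: the function name last_line_containing is made up for illustration, and the sample lines are copied from the log excerpt posted further down. It simply keeps the last line that contains the keyword, which is what jumping to the end of the file and searching backward finds.

```python
from typing import Iterable, Optional


def last_line_containing(lines: Iterable[str], keyword: str) -> Optional[str]:
    """Return the last line that contains `keyword`, or None if no line does.

    Hypothetical helper: a plain substring match, comparable to going to the
    end of the file in vi and searching backward for a literal word.
    """
    found = None
    for line in lines:
        if keyword in line:
            found = line.rstrip("\n")
    return found


# Sample lines taken from the log excerpt quoted later in this thread.
sample = [
    "11:52:59:WU02:FS01:Connecting to 140.163.4.244:8080",
    "11:53:08:WU02:FS01:Upload complete",
    "11:53:48:WU01:FS02:0x21:Completed 200000 out of 2500000 steps (8%)",
]
print(last_line_containing(sample, "Upload"))
```

To run it against a real file, pass an open file object instead of the list, for example `with open(path) as f: last_line_containing(f, "Download")`, where path is whichever log file you want to inspect. Note that this search ignores which slot a line belongs to.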

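Posted 2017-4-1 12:03:29
Judging from the screenshot, that "Download complete" is the download result for FS02, and it is followed by records of FS02 starting to compute. The stuck slot has ID 01, so the relevant entries in the log should be the ones tagged FS01.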
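The point about mixing up slots can be illustrated with a small Python sketch. It is only an illustration under assumptions: the helper name last_match_per_slot is hypothetical, and the regular expression assumes slot tags of the form FS followed by digits between colons, as in the log lines quoted in this thread. Instead of one global "last match", it records the last matching line separately for each slot.

```python
import re
from typing import Dict, Iterable

# Assumption: slot tags look like ":FS01:" as in the log lines in this thread.
SLOT_RE = re.compile(r":(FS\d+):")


def last_match_per_slot(lines: Iterable[str], keyword: str) -> Dict[str, str]:
    """Map each slot tag (e.g. "FS01") to its last line containing `keyword`."""
    latest: Dict[str, str] = {}
    for line in lines:
        slot = SLOT_RE.search(line)
        if slot and keyword in line:
            latest[slot.group(1)] = line.rstrip("\n")
    return latest


# Sample lines copied from the log excerpt posted later in this thread.
sample = [
    "11:48:54:WU02:FS01:0x21:Completed 4900000 out of 5000000 steps (98%)",
    "11:49:05:WU01:FS02:0x21:Completed 100000 out of 2500000 steps (4%)",
    "11:50:54:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)",
    "11:51:27:WU01:FS02:0x21:Completed 150000 out of 2500000 steps (6%)",
]
for slot, line in last_match_per_slot(sample, "Completed").items():
    print(slot, "->", line)
```

With a keyword such as "Download", the same call would show the last download record of FS01 and of FS02 side by side, so a result for one slot is not mistaken for the other.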

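Posted 2017-4-1 12:45:18
ONLY posted on 2017-3-31 23:30
I haven't looked at the logs closely.

My network has never been very good. Several times in the past I saw a download stall partway through; each time, after I manually restarted the client ...

Judging from the log path, moderator ONLY seems to be running the installed version of FAHClient. Could you help test the script in post #6 of this thread? Even with the portable (no-install) client, you can help verify the stall-detection logic without modifying the script; it just won't be able to handle the stall automatically.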

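Thread starter | Posted 2017-4-1 12:48:24
Lynt posted on 2017-4-1 12:03
Judging from the screenshot, that "Download complete" is the download result for FS02, and it is followed by records of FS02 starting to compute. The stuck slot has ID 01 ...

Searching further up shows that the most recent download for slot 1 also completed.

[Screenshot: 360截图20170401124352112.jpg]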


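Posted 2017-4-1 13:41:25
wpf999 posted on 2017-4-1 12:48
Searching further up shows that the most recent download for slot 1 also completed.

Could you grab the log file from before the restart and let me have a look? I would really like to know why the slot stopped computing this time. If cases like this make up a large share of the hangs, the telnet method will not be much use.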

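Thread starter | Posted 2017-4-1 15:36:18
Lynt posted on 2017-4-1 13:41
Could you grab the log file from before the restart and let me have a look? I would really like to know why the slot stopped computing this time. If cases like this make up a large share of the hangs, the telnet method will not be much use ...

Thanks to ONLY for providing the log file.

log-20170331-123459.txt (279.12 KB, downloads: 980)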




Posted 2017-4-1 17:43:15
wpf999 posted on 2017-4-1 15:36
Thanks to ONLY for providing the log file.

Thanks to both moderators for providing the data. Here is the complete tail of the log:
  1. ******************************* Date: 2017-03-31 *******************************
  2. 11:47:55:WU01:FS02:0x21:Completed 75000 out of 2500000 steps (3%)
  3. 11:47:59:WARNING:FS01:Size of positions 2696 does not match topology 2634
  4. 11:48:06:WU03:FS00:0xa4:Completed 675000 out of 1250000 steps  (54%)
  5. 11:48:54:WU02:FS01:0x21:Completed 4900000 out of 5000000 steps (98%)
  6. 11:49:05:WU01:FS02:0x21:Completed 100000 out of 2500000 steps (4%)
  7. 11:50:17:WU01:FS02:0x21:Completed 125000 out of 2500000 steps (5%)
  8. 11:50:54:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)
  9. 11:50:55:WU00:FS01:Connecting to 171.67.108.45:80
  10. 11:51:27:WU01:FS02:0x21:Completed 150000 out of 2500000 steps (6%)
  11. 11:52:37:WU01:FS02:0x21:Completed 175000 out of 2500000 steps (7%)
  12. 11:52:54:WU02:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
  13. 11:52:56:WU02:FS01:0x21:Saving result file logfile_01.txt
  14. 11:52:56:WU02:FS01:0x21:Saving result file checkpointState.xml
  15. 11:52:57:WU02:FS01:0x21:Saving result file checkpt.crc
  16. 11:52:57:WU02:FS01:0x21:Saving result file log.txt
  17. 11:52:57:WU02:FS01:0x21:Saving result file positions.xtc
  18. 11:52:59:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
  19. 11:52:59:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
  20. 11:52:59:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13500 run:2 clone:439 gen:118 core:0x21 unit:0x000000a48ca304f457a359ce082bb4d7
  21. 11:52:59:WU02:FS01:Uploading 7.31MiB to 140.163.4.244
  22. 11:52:59:WU02:FS01:Connecting to 140.163.4.244:8080
  23. 11:53:08:WU02:FS01:Upload complete
  24. 11:53:08:WU02:FS01:Server responded WORK_ACK (400)
  25. 11:53:08:WU02:FS01:Final credit estimate, 68451.00 points
  26. 11:53:08:WU02:FS01:Cleaning up
  27. 11:53:48:WU01:FS02:0x21:Completed 200000 out of 2500000 steps (8%)
  28. 11:54:59:WU01:FS02:0x21:Completed 225000 out of 2500000 steps (9%)
  29. 11:56:09:WU01:FS02:0x21:Completed 250000 out of 2500000 steps (10%)
  30. 11:56:54:WU03:FS00:0xa4:Completed 687500 out of 1250000 steps  (55%)
  31. 11:57:19:WU01:FS02:0x21:Completed 275000 out of 2500000 steps (11%)
  32. 11:58:30:WU01:FS02:0x21:Completed 300000 out of 2500000 steps (12%)
  33. 11:59:41:WU01:FS02:0x21:Completed 325000 out of 2500000 steps (13%)
  34. 12:00:52:WU01:FS02:0x21:Completed 350000 out of 2500000 steps (14%)
  35. 12:02:02:WU01:FS02:0x21:Completed 375000 out of 2500000 steps (15%)
  36. 12:03:12:WU01:FS02:0x21:Completed 400000 out of 2500000 steps (16%)
  37. 12:04:23:WU01:FS02:0x21:Completed 425000 out of 2500000 steps (17%)
  38. 12:05:33:WU01:FS02:0x21:Completed 450000 out of 2500000 steps (18%)
  39. 12:05:37:WU03:FS00:0xa4:Completed 700000 out of 1250000 steps  (56%)
  40. 12:06:44:WU01:FS02:0x21:Completed 475000 out of 2500000 steps (19%)
  41. 12:07:54:WU01:FS02:0x21:Completed 500000 out of 2500000 steps (20%)
  42. 12:09:06:WU01:FS02:0x21:Completed 525000 out of 2500000 steps (21%)
  43. 12:10:16:WU01:FS02:0x21:Completed 550000 out of 2500000 steps (22%)
  44. 12:11:26:WU01:FS02:0x21:Completed 575000 out of 2500000 steps (23%)
  45. 12:12:36:WU01:FS02:0x21:Completed 600000 out of 2500000 steps (24%)
  46. 12:13:47:WU01:FS02:0x21:Completed 625000 out of 2500000 steps (25%)
  47. 12:14:24:WU03:FS00:0xa4:Completed 712500 out of 1250000 steps  (57%)
  48. 12:14:57:WU01:FS02:0x21:Completed 650000 out of 2500000 steps (26%)
  49. 12:16:07:WU01:FS02:0x21:Completed 675000 out of 2500000 steps (27%)
  50. 12:17:17:WU01:FS02:0x21:Completed 700000 out of 2500000 steps (28%)
  51. 12:18:29:WU01:FS02:0x21:Completed 725000 out of 2500000 steps (29%)
  52. 12:19:10:Started thread 42 on PID 3377
  53. 12:19:39:WU01:FS02:0x21:Completed 750000 out of 2500000 steps (30%)
  54. 12:20:50:WU01:FS02:0x21:Completed 775000 out of 2500000 steps (31%)
  55. 12:22:00:WU01:FS02:0x21:Completed 800000 out of 2500000 steps (32%)
  56. 12:23:12:WU01:FS02:0x21:Completed 825000 out of 2500000 steps (33%)
  57. 12:23:21:WU03:FS00:0xa4:Completed 725000 out of 1250000 steps  (58%)
  58. 12:23:34:Started thread 43 on PID 3377
  59. 12:24:07:Started thread 44 on PID 3377
  60. 12:24:22:WU01:FS02:0x21:Completed 850000 out of 2500000 steps (34%)
  61. 12:25:32:WU01:FS02:0x21:Completed 875000 out of 2500000 steps (35%)
  62. 12:26:43:WU01:FS02:0x21:Completed 900000 out of 2500000 steps (36%)
  63. 12:27:55:WU01:FS02:0x21:Completed 925000 out of 2500000 steps (37%)
  64. 12:29:05:WU01:FS02:0x21:Completed 950000 out of 2500000 steps (38%)
  65. 12:30:15:WU01:FS02:0x21:Completed 975000 out of 2500000 steps (39%)
  66. 12:31:26:WU01:FS02:0x21:Completed 1000000 out of 2500000 steps (40%)
  67. 12:32:31:WU03:FS00:0xa4:Completed 737500 out of 1250000 steps  (59%)
  68. 12:32:38:WU01:FS02:0x21:Completed 1025000 out of 2500000 steps (41%)
  69. 12:33:09:Lost lifeline PID 3375, exiting
  70. 12:33:48:WU01:FS02:0x21:Completed 1050000 out of 2500000 steps (42%)
  71. 12:34:06:Caught signal SIGTERM(15) on PID 3377
  72. 12:34:06:Exiting, please wait. . .
  73. 12:34:14:Caught signal SIGTERM(15) on PID 3377
  74. 12:34:14:WARNING:Next signal will force exit
Analysis locates the stall point at line 9:
  1. 11:50:55:WU00:FS01:Connecting to 171.67.108.45:80
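In other words, when FS01's current work unit reached 99%, the client prepared to download the next one. After it tried to connect to the work server there was no further download activity, which is the download stall. The later FS01 entries are the current work unit finishing and being uploaded.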
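The pattern found in this log, a "Connecting to" entry for a work unit with nothing after it, can be checked for mechanically. The sketch below is only an illustration of that idea, not the script from post #6. The names find_stalled_connects and min_idle_s are hypothetical, the 600-second threshold is an arbitrary choice, and the code assumes raw log lines (without the list numbering shown above) that all fall on the same day, since the entries carry only a time of day.

```python
import re
from typing import Dict, Iterable, List, Tuple

TIME_RE = re.compile(r"^(\d\d):(\d\d):(\d\d):")
UNIT_RE = re.compile(r"^\d\d:\d\d:\d\d:(WU\d+):(FS\d+):(.*)$")


def to_seconds(line: str) -> int:
    """Convert the leading HH:MM:SS of a log line to seconds since midnight."""
    h, m, s = map(int, TIME_RE.match(line).groups())
    return h * 3600 + m * 60 + s


def find_stalled_connects(
    lines: Iterable[str], min_idle_s: int = 600
) -> List[Tuple[str, str, int]]:
    """Return (work unit, slot, idle seconds) for every work unit whose last
    entry is "Connecting to ..." and is at least `min_idle_s` older than the
    last timestamped line of the log."""
    last_seen: Dict[Tuple[str, str], Tuple[int, str]] = {}
    end = 0
    for raw in lines:
        line = raw.strip()
        if not TIME_RE.match(line):
            continue  # skip date banners and other lines without a timestamp
        end = to_seconds(line)
        unit = UNIT_RE.match(line)
        if unit:
            wu, slot, message = unit.groups()
            last_seen[(wu, slot)] = (end, message)
    return [
        (wu, slot, end - seen)
        for (wu, slot), (seen, message) in last_seen.items()
        if message.startswith("Connecting to") and end - seen >= min_idle_s
    ]


# Lines copied from the log excerpt above, with the list numbering removed.
sample = [
    "11:50:54:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)",
    "11:50:55:WU00:FS01:Connecting to 171.67.108.45:80",
    "11:52:59:WU02:FS01:Connecting to 140.163.4.244:8080",
    "11:53:08:WU02:FS01:Upload complete",
    "11:53:08:WU02:FS01:Cleaning up",
    "12:33:48:WU01:FS02:0x21:Completed 1050000 out of 2500000 steps (42%)",
]
print(find_stalled_connects(sample))
```

On these lines only WU00 on FS01 is reported: WU02 also has a "Connecting to" entry, but it is followed by upload records, so its last entry is "Cleaning up". A slow but healthy connection could also trip this check, so the threshold would need tuning against real logs.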

Rating (1 participant, base points +100): wpf999 +100, "Sharp eyes!"


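Posted 2017-4-1 21:31:33 from mobile
Lynt posted on 2017-4-1 12:45
Judging from the log path, moderator ONLY seems to be running the installed version of FAHClient. Could you help test the script in post #6 of this thread? Even with the portable client, without modifying ...

It is indeed the installed version. I don't have much time before the Qingming holiday, so I will try your script after the holiday.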