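muclemanxb posted on 2014-4-28 23:06:52

Why does PrimeGrid keep stopping and starting? (Solved)

I'm running two GPU projects at the same time, CC and PrimeGrid. When running CC, the GPU load (utilization) is unstable and fluctuates up and down, but it has never dropped to a low load. When running PrimeGrid, GPU utilization is normally 99% (both projects monitored with AIDA64 and GPU-Z), but intermittently the GPU load drops below 3% and stays there for about 5 minutes. The system is a freshly installed Win7 64-bit. The CPU projects Rosetta, Docking, CAS, and WCG run flat out and use all six cores; after two days at full load there has been no blue screen, and the CPU temperature peaks at 57℃. I use MSI Afterburner to control the GPU fan curve so that the temperature never exceeds 75℃ while running PrimeGrid, so thermal protection shouldn't have been triggered, right?
I checked the BOINC event log and found nothing abnormal; there aren't even any messages from the PrimeGrid project. One more thing: this happens with PrimeGrid's Genefer 2.12 (cudaGFN). I don't know whether other subprojects or other versions of the application have the same problem.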

~~~2014.05.01 update~~~
This is the BOINC log file:
2014/5/1 8:25:05 | | Starting BOINC client version 6.12.43 for windows_x86_64
2014/5/1 8:25:05 | | Config: report completed tasks immediately
2014/5/1 8:25:05 | | log flags: file_xfer, sched_ops, task
2014/5/1 8:25:05 | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
2014/5/1 8:25:05 | | Data directory: D:\Program Files\BOINC\DATA
2014/5/1 8:25:05 | | Running under account X
2014/5/1 8:25:05 | | Processor: 6 AuthenticAMD AMD Processor model unknown
2014/5/1 8:25:05 | | Processor: 512.00 KB cache
2014/5/1 8:25:05 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 syscall nx lm svm sse4a osvw ibs skinit wdt page1gb rdtscp 3dnowext 3dnow
2014/5/1 8:25:05 | | OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
2014/5/1 8:25:05 | | Memory: 8.00 GB physical, 16.00 GB virtual
2014/5/1 8:25:05 | | Disk: 50.01 GB total, 45.73 GB free
2014/5/1 8:25:05 | | Local time is UTC +8 hours
2014/5/1 8:25:05 | | NVIDIA GPU 0: GeForce GTX 560 Ti (driver version 33523, CUDA version 6000, compute capability 2.1, 1024MB, 922 GFLOPS peak)
2014/5/1 8:25:05 | PrimeGrid | URL http://www.primegrid.com/; Computer ID 431496; resource share 100
2014/5/1 8:25:05 | | Reading preferences override file
2014/5/1 8:25:05 | | Preferences:
2014/5/1 8:25:05 | | max memory usage when active: 7372.06MB
2014/5/1 8:25:05 | | max memory usage when idle: 7372.06MB
2014/5/1 8:25:05 | | max disk usage: 10.00GB
2014/5/1 8:25:05 | | don't use GPU while active
2014/5/1 8:25:05 | | (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
2014/5/1 8:25:05 | | Not using a proxy
2014/5/1 8:25:22 | | Suspending computation - initial delay
2014/5/1 8:26:36 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 8:29:52 | PrimeGrid | update requested by user
2014/5/1 8:30:32 | | Project communication failed: attempting access to reference site
2014/5/1 8:30:35 | | Internet access OK - project servers may be temporarily down.
2014/5/1 8:30:59 | PrimeGrid | Sending scheduler request: Requested by user.
2014/5/1 8:30:59 | PrimeGrid | Not reporting or requesting tasks
2014/5/1 8:31:02 | PrimeGrid | Scheduler request completed
2014/5/1 11:46:17 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 11:46:17 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 11:46:17 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:13:18 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:13:18 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 12:13:18 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:28:58 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:28:58 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 12:28:58 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
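The unexpected exits in the log above can be tallied with a short script. This is only a sketch: the `timestamp | project | message` layout is inferred from the entries shown here, and the helper names (`parse_line`, `unexpected_exits`) are made up for illustration, not part of BOINC.

```python
from datetime import datetime

# Task-related entries copied from the log above.
LOG = """\
2014/5/1 8:26:36 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 11:46:17 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 11:46:17 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:13:18 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:13:18 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:28:58 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:28:58 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
"""

MARKER = "exited with zero status but no 'finished' file"


def parse_line(line):
    """Split 'timestamp | project | message' into (datetime, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"), project, message


def unexpected_exits(log_text):
    """Return the timestamp of every 'no finished file' exit in the log."""
    times = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_line(line)
        if MARKER in message:
            times.append(when)
    return times


exits = unexpected_exits(LOG)
# Minutes between consecutive unexpected exits.
gaps_minutes = [(b - a).total_seconds() / 60 for a, b in zip(exits, exits[1:])]
print(len(exits), [round(g, 1) for g in gaps_minutes])
```

For this excerpt the script finds three unexpected exits within about 45 minutes, which is the "stop and go" pattern from the first post showing up in the log.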
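While running the first WU, the GTX 560 Ti's C/S/M (core/shader/memory) clocks were 950/1900/2050, at the default voltage of 1.05.

A PrimeGrid developer explained it as follows: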
If you look at the task on the website http://www.primegrid.com/result.php?resultid=542444048 you will see MANY occurrences of this error:

maxErr exceeded for 405052^1048576+1, 0.5000 > 0.4500
MaxErr exceeded may be caused by overclocking, overheated GPUs and other transient errors.
Waiting 10 minutes before attempting to continue from last checkpoint...
Each of the 10 minute pauses causes the "program terminated with zero status" errors in the BOINC log file.
That's a hardware error on your GPU. It's almost always caused by overclocking. Even if the card came "factory overclocked", it still is overclocked and that's what's causing the error.
You need to lower the clock speed on the GPU to get it to run Genefer reliably. We've found that lowering the memory clock speed is more important than lowering the shader clocks.
Genefer stresses the GPU much more than other programs or normal usage, and is much more sensitive to overclocking than any other program I'm aware of.
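To make the `0.5000 > 0.4500` message concrete, here is a rough sketch of what a round-off check of this kind can look like. It rests on an assumption the thread does not confirm: that maxErr measures how far floating-point results, which should be whole numbers, drift from the nearest integer. The function names and sample values are hypothetical, and this is not Genefer's actual code.

```python
MAX_ERR_LIMIT = 0.45  # threshold quoted in the error message above


def max_roundoff_error(values):
    """Largest distance from any value to its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check(values, limit=MAX_ERR_LIMIT):
    """Return (error, ok); ok is False when the error exceeds the limit."""
    err = max_roundoff_error(values)
    return err, err <= limit


# Hypothetical results from a healthy run: all very close to integers.
healthy = [12.0003, 7.9998, 3.0001]
# Hypothetical results after a transient hardware fault: 7.5 sits exactly
# halfway between 7 and 8, so the intended integer can no longer be recovered.
corrupted = [12.0003, 7.5, 3.0001]

print(check(healthy))
print(check(corrupted))
```

Under that assumption, an error of 0.5 is the worst possible case, which fits the developer's point that an unstable memory clock can silently corrupt values and force a restart from the last checkpoint.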

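The first WU finished and the result showed no error; the third validation WU has now been sent out. Following the developer's reply, I changed the GPU's C/S/M to 950/1900/1950. So far there has been no more stopping and starting; I'll keep watching how this WU turns out.

~~~05.03 update~~~
Completed successfully. The WU earned 31,446.86 credits, with a run time of 54,697.72.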

w2xcn posted on 2014-4-29 06:46:07

It may be related to your hardware configuration. My machine always errors out when running GFN.

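muclemanxb posted on 2014-4-29 16:06:40

w2xcn posted on 2014-4-29 06:46
It may be related to your hardware configuration. My machine always errors out when running GFN.

The graphics card is a GTX 560 Ti and the driver is 335.23 WHQL for Win7 64-bit. The problem described in the first post occurs both overclocked and at default clocks (with no undervolting in either case). The CPU runs at default clocks, so its stability can probably be ruled out (if it fails even at default clocks, I'll just have to accept it).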

zflowers posted on 2014-4-29 20:50:45

My old 560 easily went past 80℃ running FAH, so I never dared to overclock it. How about trying lower clocks?

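muclemanxb posted on 2014-4-30 00:29:10

zflowers posted on 2014-4-29 20:50
My old 560 easily went past 80℃ running FAH, so I never dared to overclock it. How about trying lower clocks?

I've tried lowering the clocks; default clocks and default voltage don't help either. I wonder whether the computation is handed over to the CPU while the GPU is stalled. I've also stopped all CPU tasks and run only PrimeGrid, and the problem still occurs.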

zflowers posted on 2014-4-30 09:07:57

How about switching to another project, GPUGRID?

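muclemanxb posted on 2014-4-30 11:06:30

zflowers posted on 2014-4-30 09:07
How about switching to another project, GPUGRID?

I'll stick with it and finish these WUs, and I'll also dig through the official forum. I doubt I'm the only one who has run into this.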

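w2xcn posted on 2014-4-30 23:22:33

muclemanxb posted on 2014-4-30 11:06
I'll stick with it and finish these WUs, and I'll also dig through the official forum. I doubt I'm the only one who has run into this.
...

Give up! These world-record tasks give no credit, or very little. Your card is better suited to PPS, which earns 200,000 to 300,000 credits a day.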

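muclemanxb posted on 2014-5-1 22:49:25

zflowers posted on 2014-4-29 20:50
My old 560 easily went past 80℃ running FAH, so I never dared to overclock it. How about trying lower clocks?

It turned out the memory clock was too high, even though I was using the default setting! It was 2050 and is now running at 1950, and that seems to be OK. Full load, steady at around 99% utilization, 72℃, fan speed at 69%.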
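For a sense of how small the change was, the drop from 2050 to 1950 works out as follows. This is a minimal arithmetic sketch; the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(old, new):
    """Relative reduction from old to new, in percent."""
    return (old - new) / old * 100


old_memory_clock = 2050  # memory clock before the change, from the thread
new_memory_clock = 1950  # memory clock after the change, from the thread

reduction = percent_reduction(old_memory_clock, new_memory_clock)
print(f"{reduction:.2f}%")
```

A reduction of just under 5% in the memory clock was enough to stop the errors in this case, with the core and shader clocks left unchanged.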