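I'm attached to two GPU projects, CC and PrimeGrid. When running CC, the GPU load (utilization) is unstable and fluctuates up and down, but it has never dropped to a low load. When running PrimeGrid, GPU utilization is normally 99% (both projects were monitored with AIDA64 and GPU-Z), but intermittently the GPU load falls below 3% and stays there for about 5 minutes. The system is a freshly installed Win7 64-bit. On the CPU side I'm running Rosetta, Docking, CAS, and WCG flat out, using all six cores; it has run at full load for two days without a blue screen, and the CPU temperature peaks at 57℃. I use MSI Afterburner to control the GPU fan curve so that the temperature under PrimeGrid never exceeds 75℃, so overheat protection shouldn't have been triggered, right?
I checked the BOINC event log and found nothing abnormal; there weren't even any messages from the PrimeGrid project. One more note: this happens with PrimeGrid's Genefer 2.12 (cudaGFN), which is what I'm running now. I don't know whether other subprojects or other versions of the application have the same problem.
~~~Update 2014.05.01~~~
This is the BOINC log file: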
2014/5/1 8:25:05 | | Starting BOINC client version 6.12.43 for windows_x86_64
2014/5/1 8:25:05 | | Config: report completed tasks immediately
2014/5/1 8:25:05 | | log flags: file_xfer, sched_ops, task
2014/5/1 8:25:05 | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
2014/5/1 8:25:05 | | Data directory: D:\Program Files\BOINC\DATA
2014/5/1 8:25:05 | | Running under account X
2014/5/1 8:25:05 | | Processor: 6 AuthenticAMD AMD Processor model unknown [Family 16 Model 10 Stepping 0]
2014/5/1 8:25:05 | | Processor: 512.00 KB cache
2014/5/1 8:25:05 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 syscall nx lm svm sse4a osvw ibs skinit wdt page1gb rdtscp 3dnowext 3dnow
2014/5/1 8:25:05 | | OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
2014/5/1 8:25:05 | | Memory: 8.00 GB physical, 16.00 GB virtual
2014/5/1 8:25:05 | | Disk: 50.01 GB total, 45.73 GB free
2014/5/1 8:25:05 | | Local time is UTC +8 hours
2014/5/1 8:25:05 | | NVIDIA GPU 0: GeForce GTX 560 Ti (driver version 33523, CUDA version 6000, compute capability 2.1, 1024MB, 922 GFLOPS peak)
2014/5/1 8:25:05 | | Reading preferences override file
2014/5/1 8:25:05 | | Preferences:
2014/5/1 8:25:05 | | max memory usage when active: 7372.06MB
2014/5/1 8:25:05 | | max memory usage when idle: 7372.06MB
2014/5/1 8:25:05 | | max disk usage: 10.00GB
2014/5/1 8:25:05 | | don't use GPU while active
2014/5/1 8:25:05 | | (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
2014/5/1 8:25:05 | | Not using a proxy
2014/5/1 8:25:22 | | Suspending computation - initial delay
2014/5/1 8:26:36 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 8:29:52 | PrimeGrid | update requested by user
2014/5/1 8:30:32 | | Project communication failed: attempting access to reference site
2014/5/1 8:30:35 | | Internet access OK - project servers may be temporarily down.
2014/5/1 8:30:59 | PrimeGrid | Sending scheduler request: Requested by user.
2014/5/1 8:30:59 | PrimeGrid | Not reporting or requesting tasks
2014/5/1 8:31:02 | PrimeGrid | Scheduler request completed
2014/5/1 11:46:17 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 11:46:17 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 11:46:17 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:13:18 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:13:18 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 12:13:18 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
2014/5/1 12:28:58 | PrimeGrid | Task genefer_1048576_390573_5 exited with zero status but no 'finished' file
2014/5/1 12:28:58 | PrimeGrid | If this happens repeatedly you may need to reset the project.
2014/5/1 12:28:58 | PrimeGrid | Restarting task genefer_1048576_390573_5 using genefer version 212
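While running the first WU, the GTX 560 Ti's core/shader/memory clocks (C/S/M) were 950/1900/2050, at the stock voltage of 1.05 V.
The PrimeGrid developer explained as follows: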
If you look at the task on the website http://www.primegrid.com/result.php?resultid=542444048 you will see MANY occurrences of this error:
maxErr exceeded for 405052^1048576+1, 0.5000 > 0.4500
MaxErr exceeded may be caused by overclocking, overheated GPUs and other transient errors.
Waiting 10 minutes before attempting to continue from last checkpoint...
Each of the 10 minute pauses causes the "program terminated with zero status" errors in the BOINC log file.
That's a hardware error on your GPU. It's almost always caused by overclocking. Even if the card came "factory overclocked", it still is overclocked and that's what's causing the error.
You need to lower the clock speed on the GPU to get it to run Genefer reliably. We've found that lowering the memory clock speed is more important than lowering the shader clocks.
Genefer stresses the GPU much more than other programs or normal usage, and is much more sensitive to overclocking than any other program I'm aware of.
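The developer's explanation can be checked against the log roughly as follows. This is only a minimal sketch, not anything from PrimeGrid or BOINC: the function names (zero_status_exits, gaps_in_seconds, max_err_events) are made up for illustration, and the regular expressions assume the exact line layouts quoted in this post (the "timestamp | project | message" log lines and the "maxErr exceeded for ..." stderr line). It pulls out every "exited with zero status" event for one project, measures the time between consecutive events, and extracts the maxErr values from the task's stderr text.

```python
import re
from datetime import datetime

# Assumed layout of a BOINC event-log line, as pasted in this post:
#   "2014/5/1 11:46:17 | PrimeGrid | Task ... exited with zero status ..."
LINE_RE = re.compile(
    r"^(\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}:\d{2})\s*\|\s*([^|]*?)\s*\|\s*(.*)$"
)

# Assumed layout of the Genefer stderr line quoted by the developer:
#   "maxErr exceeded for 405052^1048576+1, 0.5000 > 0.4500"
MAXERR_RE = re.compile(r"maxErr exceeded for (\S+), ([\d.]+) > ([\d.]+)")


def zero_status_exits(log_lines, project="PrimeGrid"):
    """Return timestamps of 'exited with zero status' events for one project."""
    times = []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a log line in the assumed layout
        stamp, name, message = match.groups()
        if name == project and "exited with zero status" in message:
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Seconds elapsed between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def max_err_events(stderr_text):
    """Return (candidate, observed error, allowed limit) for each maxErr line."""
    return [
        (candidate, float(observed), float(limit))
        for candidate, observed, limit in MAXERR_RE.findall(stderr_text)
    ]


if __name__ == "__main__":
    # Hypothetical file name: save the event log above as plain text first.
    with open("boinc_log.txt", encoding="utf-8") as handle:
        exits = zero_status_exits(handle)
    print(len(exits), "zero-status exits")
    print("gaps between them (s):", gaps_in_seconds(exits))
```

If the count of zero-status exits in the log lines up with the number of "Waiting 10 minutes" messages on the task page, that fits the developer's account that each pause shows up as one of these exits.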
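The first WU finished and the result did not report an error; a third validation WU has now been sent out. Following the developer's reply, I changed the GPU's C/S/M clocks to 950/1900/1950. So far there has been no more stop-and-go behavior; I'll keep watching to see how this WU turns out.
~~~Update 05.03~~~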
Completed successfully. The WU earned 31,446.86 credits, with a run time of 54,697.72 seconds.