
[Repost] The two main GPU work servers will get hardware and software upgrades next week

Posted on 2020-4-16 11:24:45
Last edited by Baiqing_Lyu on 2020-4-16 11:32

As the title says. The original post is here: https://foldingforum.org/viewtopic.php?f=18&t=34224&start=15#p325911
Hi folks! I just wanted to step in to apologize for the persistent issues with plfah1-*.mskcc.org and plfah2-*.mskcc.org over the past couple of weeks.

Our poor little servers at MSKCC were purchased in 2007 and have been chugging away serving the majority of GPU work units probably up to where FAH hit nearly 1 exaflop before they started to fall over.

Also, I apologize for the perceived lack of transparency here---it's really been just a matter of putting out tons of other fires and focusing our energies on providing critical scientific support for active COVID-19 drug discovery efforts. Huge thanks to @bruce for pointing me here to at least dash off a quick update.

We've had a few issues that have appeared:
* Throughput limits with the server code: This is now addressed, or at least much improved!
* Drive failures: Now fixed, with more shelf spares that have just showed up!
* Disk space that was burned through something like 5-20x faster due to all of you wonderful people completing WUs so quickly: We're rapidly evacuating completed projects to a new external storage unit that just came online days ago, and have 14T free on plfah1 now. We're also going to release a new core22 version this week that will allow us to send back only the solute structure data and thus solve these storage issues for good.
* Server software stability: Occasionally, it looks like the server can't operate on specific WUs because the files are reported by the OS as "busy". We don't have a solution for this yet, but so far have been restarting the server when we encounter this. We could use a rapid warning system, maybe, instead of having to have @bruce ping us when this happens for a couple of hours.

TL;DR: I think it's OK to go ahead and keep the blacklist for a few more days. We've spun up a bunch of new external servers which are going to be much more performant.
We're bringing new (actual, brand new!) physical servers online at MSK in the next few days. We got them just before the COVID-19 emergency, and it's been slower to roll them out as a result, but we hope we can get them into service to replace plfah1/2 in the next week!

Huge thanks to all of you folks, and stay safe and healthy. We love you all.

~ John Chodera @ MSKCC ~

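The gist: the servers bought in 2007, plfah1-1.mskcc.org and plfah2-1.mskcc.org, are finally about to retire, and new software and hardware with higher performance will soon take over their work! At the moment these two servers hand out work to the vast majority of GPUs. Their current stats are as follows: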

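(Columns: IP, address, server type, server software version, person responsible, assignments per hour, errors, warnings, has work, work type, public WUs, beta WUs, algorithm, space, uptime, check time)
[Image: new servers.PNG]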


Looking forward to the new assignment rate after the upgrade!





Posted on 2020-4-16 13:26:45
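I saw an earlier post of theirs saying that the hardware requirements for a work server are mostly about storage, something like "8GiB RAM, 100TiB SSD storage, and 1GiB/s network." I did the math carefully, realized I have no way of getting that kind of hardware for free, and gave up.
Only now do I learn that these servers are 13 years old. How slow must they be... They are probably still running SCSI drives.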

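Then again, GPUs currently go all day without getting work (a hardware problem), and the assignment rate is holding at only about 120,000 to 130,000 per hour. If these two main servers get upgraded, won't it break 200,000 per hour? By then there really will be no work left.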
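To put those hourly figures in perspective, here is a quick back-of-the-envelope conversion to assignments per second. The three rates are the ones quoted in this reply; the helper name `per_second` is just made up for illustration.

```python
def per_second(assignments_per_hour: float) -> float:
    """Convert an hourly assignment rate to a per-second rate."""
    return assignments_per_hour / 3600


# Rates mentioned in the reply above: the current range and the guessed post-upgrade rate.
for rate in (120_000, 130_000, 200_000):
    print(f"{rate:,}/hour is about {per_second(rate):.1f}/second")
```

So even the guessed 200,000 per hour is only a few dozen requests per second across all servers.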

Posted on 2020-4-16 17:49:01
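A work server needs a high clock speed (4.0 GHz or above; it doesn't need many cores, 6 should be enough), lots of RAM, SSDs, and high bandwidth. Cache the WUs in memory, and as soon as a request comes in, send one out right away.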
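The "cache WUs in memory" idea could look roughly like the sketch below. This is not how the actual FAH server software works (I have no idea about its internals); `WorkUnitCache`, `refill`, `assign`, and the fake `disk` iterator are all hypothetical names for illustration.

```python
from collections import deque


class WorkUnitCache:
    """Hypothetical in-memory buffer of ready-to-send work units (WUs)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._ready = deque()  # FIFO queue of serialized WUs held in RAM

    def refill(self, load_from_disk):
        """Top the buffer up from slow storage; return how many WUs were loaded."""
        loaded = 0
        while len(self._ready) < self.capacity:
            wu = load_from_disk()
            if wu is None:  # nothing left on disk
                break
            self._ready.append(wu)
            loaded += 1
        return loaded

    def assign(self):
        """Answer a client request straight from RAM, or return None if empty."""
        return self._ready.popleft() if self._ready else None


# Usage: a fake "disk" holding three WUs, and a cache that keeps two in memory.
disk = iter([b"wu-1", b"wu-2", b"wu-3"])
cache = WorkUnitCache(capacity=2)
cache.refill(lambda: next(disk, None))
first = cache.assign()  # served from memory, no disk access on the request path
```

The point is that the slow disk read happens in `refill`, away from the request, so `assign` only touches memory.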
中国分布式计算总站 ( 沪ICP备05042587号 )
