China Distributed Computing Forum

Thread starter: vmzy

[Project News] United Devices [Concluded]

OP | Posted 2006-2-14 16:27:57
Feb 13, 2006
We are currently experiencing an issue with agents not being able to connect to the UD servers. This is due to the huge volume of results we have received for Ligandfit (as well as Rosetta). Since we have been looking into an issue with aborted WUs on the Ligandfit job, I have not been running the result rollup script. This is the process that consolidates the results and then allows us to delete all of the temporary files.

Since we have valid results for all of the current Cancer WUs, I will go ahead and run the rollup script. As soon as that finishes processing, we will be able to delete the temporary result files and dispatching will resume. I will upload a new chunk of cancer data later today.

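The rollup process described above — consolidate the temporary result files into one place, then delete them to reclaim disk space — can be sketched roughly as follows. This is a hypothetical illustration only; the file layout, names, and the `rollup_results` helper are assumptions, not the actual Grid.org script.

```python
import os
import tempfile

def rollup_results(results_dir, rollup_path):
    """Append each temporary result file into the rollup file, then delete
    the temporary file, freeing disk space. Returns the number rolled up."""
    count = 0
    with open(rollup_path, "ab") as rollup:
        for name in sorted(os.listdir(results_dir)):
            path = os.path.join(results_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                rollup.write(f.read())
            os.remove(path)  # the temporary result is now safe to delete
            count += 1
    return count

# Throwaway demonstration with three fake result files
results_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(results_dir, "wu_%d.result" % i), "wb") as f:
        f.write(b"result-data\n")
rollup_path = os.path.join(tempfile.mkdtemp(), "cancer_job.rollup")
rolled = rollup_results(results_dir, rollup_path)
print(rolled)  # three files consolidated; results_dir is now empty
```

Once the temporary files are gone, dispatching can resume, which matches the sequence the post describes.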

[ Last edited by vmzy on 2006-2-28 at 16:04 ]


OP | Posted 2006-2-21 14:51:51
Feb 19, 2006
I have uploaded another batch of cancer data. The last job will remain active for a few days to give members time to receive credit for any outstanding workunits. Since these new workunits are processing so quickly, I do not think we need to wait a whole week.


[ Last edited by vmzy on 2006-2-28 at 16:09 ]


OP | Posted 2006-2-21 14:52:40
Feb 20, 2006
Cancer Data - We continue to have an issue with occasional aborts that is still under investigation. As I mentioned in the Member to Member Support thread, this issue has been identified as a problem with the Ligandfit application itself crashing. This has nothing to do with the UD servers dropping results or not giving due credit. The Ligandfit application has not changed in a very long time, since before the last batch of cancer data, which received no complaints. This would point to something in the new data that is causing Ligandfit to be unstable.

Note that we are receiving successful results from each and every workunit, so there are no "bad workunits" that will always fail. Some members have stated that retrying a workunit that just aborted will result in a successful completion.

Although this is a bit frustrating because of lost points, know that we are getting successful results that can be returned to Oxford to assist in the search for the cure for cancer, which is the ultimate objective of this project.


[ Last edited by vmzy on 2006-2-28 at 17:00 ]


OP | Posted 2006-2-28 15:09:26
Feb 27, 2006
Outage: There was a temporary outage over the weekend that caused connectivity problems with Grid.org servers. This was caused by a router failure in our datacenter and was unanticipated. The problem has since been rectified and everything should be functioning normally.

Cancer data: I have uploaded a new chunk of data and have sent the last result set to Oxford for analysis. Some members have noticed that this batch produces an awful lot of hits. I am asking if this is to be expected and if this could in any way be related to the occasional workunit aborts that some members are seeing.


[ Last edited by vmzy on 2006-3-1 at 13:41 ]


OP | Posted 2006-3-7 22:08:57
Mar 06, 2006
Things are relatively quiet although we still have a few outstanding issues.

Cancer data - I have not heard back from our Oxford contact regarding the last batch of results I sent. This was an attempt to have them validate the many hits we are seeing on some of these workunits. It seems suspicious to me, but I am not well versed in computational chemistry and cannot speak to this. It will be up to Oxford, who provided the data, to determine if the results are valid for their research. Until then we will just keep crunching away at the new data.

There is still the issue of Ligandfit crashing. Although I do not know why, it appears that there is something about the new data that causes Ligandfit to get in a bad state. I have also mentioned this to our Oxford contact and hope that he may be able to shed some light on this issue. Perhaps the many hits or complexity of the new data is the contributing factor.


[ Last edited by vmzy on 2006-3-8 at 14:54 ]


OP | Posted 2006-3-9 16:09:09
Mar 08, 2006
I have received a response from our Oxford contact regarding some of the issues we have seen with the new cancer data.

High number of hits - It was stated that a determination could not be made at this stage as to whether the data is generating too many hits. After we process all of the data and send it back, they will apply cutoff criteria in order to isolate the best hits. It may be that once the data is analyzed, they will suggest a change to the Ligandfit input parameters and a rerun of the data. Remember, research implies trial and error.

Aborts - While nothing specific has been identified yet, they stated that it was possible for the data to cause Ligandfit to error out. It was mentioned that some of the data are not discrete organic molecules, which may be different from the data that was processed in the past. It may be that under certain circumstances this data causes Ligandfit to abort, although we have seen that a replay of the data usually produces a good result, so we are not talking about a bad WU per se. They also stated that in the future they will filter out anything that they think may cause an issue. The bottom line is that this issue remains elusive and we do not have the answer yet.

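The "cutoff criteria" step described above can be illustrated with a toy example: after all results are returned, keep only the hits whose score meets a threshold. The ligand names, scores, and cutoff value below are invented for illustration; Oxford's actual scoring and thresholds are not given in the post.

```python
# Toy illustration of applying a score cutoff to isolate the best hits.
# All ligand names, scores, and the cutoff value are hypothetical.
hits = {"ligand_a": 41.2, "ligand_b": 87.5, "ligand_c": 63.0}

def apply_cutoff(scores, cutoff):
    """Return the ligands meeting the cutoff, best score first."""
    kept = [name for name, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda name: scores[name], reverse=True)

print(apply_cutoff(hits, 60.0))  # the two strongest hits survive the cutoff
```

Raising the cutoff shrinks the hit list, which is how a batch that "produces an awful lot of hits" can still be narrowed down to the best candidates after the fact.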

[ Last edited by vmzy on 2006-3-9 at 16:51 ]


OP | Posted 2006-3-10 12:23:36
Mar 10, 2006

We have received too many results again for the servers to hold. I am in the process of clearing some space now. Please be patient while I get the situation remedied. The root problem is that these workunits are processing much more quickly than the previous batch. This is causing many more results in a much shorter period of time. This equates to disk space filling up much more quickly.

It is obvious that our process needs to be changed to limit the time a job runs. I thought two weeks would be sufficient, but now I think I need to turn these around every week.


I am currently uploading a new job. It has already started dispatching and I have verified a new workunit. Since we got a little backed up, it may take a while for everyone to connect and get a new workunit. I will post some info tomorrow on our new (shorter) process for turning these jobs around.


[ Last edited by vmzy on 2006-3-10 at 13:06 ]


OP | Posted 2006-3-24 21:23:05
Mar 13, 2006
Cancer job - As you probably noticed, we had a small outage last week due to the amount of results being returned for the new cancer jobs. These WUs are each completing within a few hours, probably due to the (non)complexity of the new protein. This translates to lots of disk space being used up very quickly. Because of this, I will need to roll up the results of these jobs much faster than previous ones in order to stay below the maximum disk space. Here is what I propose:

On Wednesdays I will submit a new cancer job. Once it has been activated, I will run the rollup script, which should only affect the previous job. The rollup script will mark the previous job inactive as part of its processing. Once the rollup script has finished, I will reset the job to active. This will allow outstanding results to be credited.

On Fridays, I will delete the older job that has already had its results rolled up, so any outstanding workunits will not be credited. You will have had 2 days to return your results, which should be plenty considering the WUs complete within hours.

There may be a small window while the results are being rolled up during which newly returned results will be rejected. This is because the job is temporarily marked as inactive. Unfortunately, I see no way around this. The impact should be very minimal. If you are concerned about a lost workunit, shut down your agent on Wednesdays until I post that the job has been reactivated.

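The weekly rotation proposed in this post can be summarized in code. This is only a sketch of the schedule as stated here (submit and roll up on Wednesday, delete on Friday); the `scheduled_action` function and its action strings are invented for illustration.

```python
import datetime

WEDNESDAY, FRIDAY = 2, 4  # datetime.date.weekday() numbering, Monday = 0

def scheduled_action(day):
    """Map a calendar day to the proposed weekly job-rotation step."""
    if day.weekday() == WEDNESDAY:
        return "submit new job, roll up previous job"
    if day.weekday() == FRIDAY:
        return "delete rolled-up job; outstanding credit window closes"
    return "dispatch workunits and collect results"

print(scheduled_action(datetime.date(2006, 3, 15)))  # a Wednesday in 2006
```

The two-day gap between the Wednesday rollup and the Friday deletion is the return window the post describes for outstanding workunits.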

[ Last edited by vmzy on 2006-4-17 at 11:20 ]


OP | Posted 2006-3-24 21:23:29
Mar 15, 2006
As per our new process described on Monday, I have uploaded a new cancer job and rolled up results from the previous one. On Friday I will deactivate the previous job, after which no more credit will be given.

Hopefully this process will keep results at a minimum to conserve disk space, but still allow everyone to participate and receive credit for their work.


[ Last edited by vmzy on 2006-4-17 at 11:25 ]


OP | Posted 2006-3-24 21:23:54
Mar 20, 2006
Cancer data - Things are running relatively smoothly right now. I hope the new cancer job process is working for everyone. As soon as we crunch through all of the new data against the current new protein, Oxford would like us to crunch the new data against the last older protein as well. It will be interesting to see if the number of aborts we are seeing goes down or remains the same.

Please note that for UDMon users, the ud_mon.ini file must be updated with the new protein in the Proteins section. If this is not done, you will see many aborts for WU 8581771 in the log, which is a protein, not a workunit. Please remind newer members of this if you see them post about the aborts. We do have a random abort problem, but those aborts have nothing to do with WU 8581771. The line should look like this:

8581771=LF:AUR-B

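The `ud_mon.ini` edit above can also be scripted, for example with Python's `configparser`. The `[Proteins]` section name and the `8581771=LF:AUR-B` line come from the post itself; the file location and the idea of scripting the edit are assumptions for illustration.

```python
import configparser
import os
import tempfile

ini_path = os.path.join(tempfile.mkdtemp(), "ud_mon.ini")  # assumed location

config = configparser.ConfigParser()
config.optionxform = str           # preserve option names exactly as written
config.read(ini_path)              # a missing file is simply treated as empty
if not config.has_section("Proteins"):
    config.add_section("Proteins")
config["Proteins"]["8581771"] = "LF:AUR-B"  # 8581771 is a protein ID, not a WU
with open(ini_path, "w") as f:
    config.write(f)
```

Editing the file by hand has the same effect; the point is only that the mapping lands under `[Proteins]` so UDMon stops reporting the protein ID as an aborted workunit.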

[ Last edited by vmzy on 2006-4-17 at 10:51 ]


OP | Posted 2006-3-24 21:24:12
Mar 22, 2006
I have rolled up the results from the current Cancer job and submitted a new one. Credit for the previous job will be given until Friday morning. If you are running UD Mon and are concerned about getting credit for all workunits, this would be a good time to dump/refresh your cache with the new ones.

[ Last edited by vmzy on 2006-4-17 at 10:40 ]


OP | Posted 2006-3-25 22:57:45
Mar 24, 2006
There have been a few suggestions to extend the time before the previous cancer job is deleted in order to give slow machines a chance to finish crunching. We want to accommodate all members regardless of the speed of their machines, so I am going to change the day to Monday. New jobs will be submitted on Wednesdays and old jobs will be deleted on Mondays.

[ Last edited by vmzy on 2006-4-17 at 10:34 ]


OP | Posted 2006-3-28 20:56:01
Mar 27, 2006
Grid.org status

Cancer job - There is no new information to post at this time other than to restate that each week the previous cancer job will be deleted on Mondays instead of Wednesdays. This will give members running on slower machines a chance to return their outstanding workunits for credit.

Cancer job deleted

The previous cancer job has been deleted. If you have any outstanding workunits that have not yet been processed, they will not be given credit and should be dumped.

[ Last edited by vmzy on 2006-4-12 at 17:36 ]


OP | Posted 2006-4-4 23:13:57
Apr 03, 2006
Cancer data - I have deleted the previous Cancer job, so no more credit can be given for outstanding workunits. We continue to process a new job each week, which means things are going pretty well. The Ligandfit aborts still continue and probably will through the entire new batch of data that Oxford provided. We will see what the behavior is when we run this data against the previous protein, which will be the next task.

[ Last edited by vmzy on 2006-4-12 at 17:29 ]


OP | Posted 2006-4-7 20:50:35
Apr 06, 2006
Backing off

We experienced a brief outage again due to the large amount of results. I thought that rolling up once a week was sufficient, but apparently we still ran out of space. The system has been recovered and everyone should be able to connect now.

There may be some lost workunits due to the nature of this problem. I apologize for that. A new job has been submitted so if you are running UDMon, you should dump all of your cached workunits and reload. Credit will not be given for the previous outstanding workunits because the job had to be deleted as part of the recovery procedure.

I will be adding some additional disk monitoring tools that will hopefully notify me before this issue can recur. Again, I thought that our weekly rollup process was sufficient. Please bear with me as we continue to tweak the process.

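A disk monitor of the kind mentioned above could be as simple as a periodic threshold check on the results partition. This is a sketch under assumptions; the actual Grid.org tooling, paths, and threshold are not described in the post.

```python
import shutil

def disk_usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def rollup_needed(path, threshold=80.0):
    """True when usage crosses the threshold and a result rollup should run."""
    return disk_usage_percent(path) > threshold

# With an impossible threshold the check never fires; real monitoring would
# run this periodically (e.g. from cron) and alert the operator instead.
print(rollup_needed("/", threshold=100.0))
```

Triggering the rollup (or an alert) from such a check, rather than on a fixed weekly schedule, is one way to catch a fast-filling disk before dispatching stops.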

[ Last edited by vmzy on 2006-4-12 at 17:21 ]

