Genome Comparison experts explain why the project progress on the official site appears stuck at 99%

MatthewBB · Posted on 2007-4-12 13:45:38

When preparing our project we thought that computing the similarity scores between all proteins included in the initial dataset would take longer than it actually has. When we realized that the progress was faster than expected (this happened at the beginning of the project), and after talking with the guys from IBM, we were offered the opportunity to include more material to be processed. This is what we are calling the "second phase", although, as already mentioned, it doesn't actually involve any kind of modification to the program. We included a manually curated dataset, which will be of great help in validating the automatic annotation process, and another dataset composed of open reading frames that may or may not be real genes, mainly due to their size. In other words, small genes may be "neglected" by gene prediction programs, but verifying their presence in different genomes may be a strong argument for confirming their existence as real genes. As for the progress pie, it could (and should) be reset to reflect this "new phase".

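(Translation: MatthewBB)

When preparing this project, we estimated the time needed to compute the similarity of all the proteins in the initial dataset, and that estimate was longer than the time it has actually taken. When we realized the project was progressing faster than expected (this happened right at the start of the project), we talked with the IBM staff, who agreed to let us include more material to be processed. This is what we call the "second phase", although, as we have already mentioned, it involves no change to the program. We added a manually curated dataset, which is very helpful for validating the automatic annotation method. We also added a dataset made up of open reading frames. Because of their size, these may or may not be real genes. In other words, small genes may be "missed" by gene prediction programs, but verifying their presence in different genomes may be a strong argument that they exist as real genes. As for the project progress chart, it can (and should) be reset to reflect this "new phase".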

MatthewBB · Posted on 2007-4-12 14:00:47

One more thing to add:
In a second phase, the initial dataset is being updated with newly published genomic data, adding 393,999 new protein sequences. Additionally, a fully curated reference dataset was added (SwissProt - 254,609 sequences), contributing to controlled annotation and data cross-referencing. Finally, an experimental dataset of about 3 million potential protein sequences derived from Open Reading Frames (ORFs) lacking a classical computational coding prediction was added, in an attempt to discover additional protein sequences or coding patterns. This second phase of the project is expected to take an additional 4 months of WorldGrid processing.

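(Translation: MatthewBB)
In the second phase, the initial dataset was updated with newly published genomic data, adding 393,999 new protein sequences. A fully manually curated reference dataset was also added (SwissProt, 254,609 sequences) for controlled annotation and data cross-referencing. Finally, an experimental dataset of about 3 million potential protein sequences was added, derived from open reading frames (ORFs) that lack a classical computational coding prediction; it is meant to help discover additional protein sequences or coding patterns. The second phase is expected to take a further 4 months of WorldGrid processing time.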

I'm not sure whether I translated this correctly, but that's roughly the idea.
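
To illustrate why small open reading frames can be "neglected", here is a minimal sketch of a forward-strand ORF scan with a minimum-length cutoff. It is only an illustration under simplifying assumptions (ATG start, standard stop codons, one strand); the function name `find_orfs` and the `min_codons` parameter are hypothetical and are not taken from the Genome Comparison project or from any gene prediction program.

```python
# Hypothetical illustration: not the project's actual method.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. ORFs with fewer than
    `min_codons` codons before the stop (counting the ATG) are dropped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# A very short ORF is reported with a permissive cutoff...
short_orf = "ATGAAATAG"
print(find_orfs(short_orf))
# ...but disappears once a length threshold is applied.
print(find_orfs(short_orf, min_codons=100))
```

With a strict cutoff the short candidate is simply never reported, which is the situation the quoted text describes; checking whether such candidates recur across genomes is the kind of extra evidence the second phase looks for.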

碧城仙 · Posted on 2007-4-12 20:20:52

Thanks to the original poster for the translation! I hope you can post the source link in the thread.

MatthewBB · Posted on 2007-4-13 10:27:41

The source link is:
http://www.worldcommunitygrid.or ... thread?thread=12717
