Originally posted by vmzy on 2008-7-1 14:49
July 08, 2008
FAH/SMP Q & A
There was a good question in the forum that I thought others would be curious to hear:
From Vijay's blog entries it would seem that the SMP client has some fundamental advantages over running multiple single-core clients, but I can't really think of how that might be. Do you know of some architectural overview of how the MPI stuff is being used in this context?
We could just run multiple independent clients, but this would be throwing away a lot of power. What makes an SMP machine special is that it is more than just the sum of the individual parts (CPU cores), since those cores can talk to each other very fast. In FAH, machines talk to each other when they return WUs to a central server, say once a day. On SMP this happens once a millisecond or so (or faster). That 86,000,000x speed up in communication can be very useful, even if there isn't 100% utilization in the cores themselves.
The easy route would have been to run multiple single-CPU FAH-cores (this is what other projects do), but that would be a big loss for the science, as this throws away a very, very powerful resource (fast interconnects between CPUs). Indeed, it is this sort of fast interconnect which makes a supercomputer "super", since the CPUs in supercomputers (eg BlueGene) are pretty slow, but the communication between cores is very, very fast.
We've done a lot to develop algorithms for FAH-style internet connections between CPUs, but there are some calculations which require fast interconnects, and that's where the FAH/SMP client is particularly important. By allowing us to do calculations that we couldn't do otherwise, the science is pushed forward significantly (and we thus reward SMP donors with a points bonus due to this extra science done and the extra hassle involved in running the SMP client).
I guess it remains to be seen if we can pull off MPI on FAH to the point where it works effortlessly, but so far Lin and OSX look pretty good, so we're close. The A2 core should hopefully seal the deal. Now, the main task is getting Windows/SMP behaving well ...
The gist:
It explains why the SMP client is fast: it takes full advantage of the high-speed inter-core communication of a multi-core CPU. SMP is not two clients independently crunching two WUs; the cores cooperate on a single WU, which makes the SMP client faster than two separate clients.
That passage explains why the SMP client's PPD is higher than running several single-threaded clients. Here is my own understanding:
1. Computing a WU is actually broken into many steps, and those steps have fairly complex dependencies. For example, suppose a WU takes five steps, A, B, C, D and E, and those steps are looped through 10,000 times to get the final result. Depending on the intermediate results, the order in a given pass might be A→B→E, or it might be A→(C, D)→E (I can't draw the diagram here; read it as A's result being fed to both C and D, and the results of C and D then being handed together to E — see the sketch after this list).
2. For the A→B→E order, a single core and a multi-core CPU make no difference. But for A→(C, D)→E (read as above), they do differ: C and D are parallel steps, so a single-core CPU has to run them one after the other, while a multi-core CPU can run them at the same time. If the program is well optimized and C and D involve the same amount of work, the two cores finish at exactly the same time and hand their results to step E.
3. In reality the five steps A-E may involve very different amounts of work. But the programmer can perfectly well write a scheduler that subdivides every step into many smaller pieces of roughly equal cost; keep feeding that stream of small pieces to the cores and you can get a program whose utilization approaches 100%. So for the discussion here, let's assume A-E all take the same amount of work.
4. That still doesn't explain why multi-core is more efficient than single-core. After such optimization, a dual-core CPU might be computing A and E at one moment and C and D at the next; that is, it finishes the same work in half the time, but half the time on twice the cores is no net gain. So where does the extra speed actually come from? Reading the original post carefully, I think the direct, high-speed communication between cores is the key.

Suppose further that each of the steps A-E takes 100t of compute time, and that moving data between a core and main memory (one way, core to memory or memory to core) takes 20t. Then for a single-core CPU the total time for A, C, D, E is 20t + 100t + 20t + 100t + 20t*2 + 100t + 20t + 100t + 20t = 520t. The first and last 20t are loading the input data from memory into the core and writing the final result back; the first 20t in the middle is moving A's result between core and memory; the 20t*2 is writing C's result back and fetching A's result from memory again for D; each 100t is the compute time of one of A, C, D, E. One thing worth noting: because today's operating systems are multitasking, each process only gets a slice of CPU time, and the on-chip caches are limited, so writing each intermediate result back to memory is a necessary step.

For a multi-core CPU (say dual-core), assume passing data between cores takes 5t (core-to-core transfers are far faster than core-to-memory ones, mainly because memory addressing latency is long; even a "glued-together" quad core like the Q6600, which passes data over the FSB, is still much faster than going through memory). The total then drops to 20t + 100t*2 + 20t + 100t*2 + 20t = 460t. The first 20t covers loading the input from memory and writing the final result back (the two overlap); the two 100t*2 terms are the compute time of steps A/E and C/D; the middle 20t is A's result being written back to memory and passed to the core that runs D (the core that computed A goes on to compute C); and the last 20t is the C and D results going back to memory (the two pieces are packed together, so only 20t). You can see that, helped by this fast on-chip communication, a multi-core CPU computes more efficiently than a single core. This is only a simple two-branch case; in real computations, with more branches, the gain should be even more obvious. (The small script after this list walks through these numbers.)
5. The numbers above are a rough analysis of why multi-core computes more efficiently than a single core. On top of this efficiency gain, FAH additionally awards the SMP client extra PPD, to encourage more people to run SMP.
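Here is a minimal sketch of the fork-join dependency from points 1-3, written in Python purely for illustration. FAH's real SMP core is native code driven by MPI; the step functions, their workloads, and the worker count below are invented placeholders. The point is only that C and D, which both depend on A alone, can be handed to separate workers, while a single-core client would have to run them back to back before E joins the results.

```python
from concurrent.futures import ProcessPoolExecutor

def step_A(x):     return x + 1        # placeholder work
def step_C(a):     return a * 2        # depends only on A's result
def step_D(a):     return a * 3        # depends only on A's result
def step_E(c, d):  return c + d        # joins the C and D branches

def one_iteration(pool, x):
    a = step_A(x)
    # C and D are independent, so they can go to different cores;
    # a single-core client would have to run them one after the other.
    fc = pool.submit(step_C, a)
    fd = pool.submit(step_D, a)
    return step_E(fc.result(), fd.result())

if __name__ == "__main__":
    state = 0
    with ProcessPoolExecutor(max_workers=2) as pool:
        for _ in range(3):   # the real WU loops on the order of 10,000 times
            state = one_iteration(pool, state)
    print(state)
```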
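And a tiny script that reproduces the bookkeeping from point 4, using the same illustrative assumptions as above (100t of compute per step, 20t per one-way core/memory transfer; the 5t core-to-core transfer is small enough that the totals ignore it):

```python
# Rough cost bookkeeping from point 4 -- illustrative numbers only.
COMPUTE, MEM = 100, 20   # per-step compute time, one-way core<->memory transfer

# Single core, order A C D E: intermediate results round-trip through memory.
single = (MEM            # input loaded from memory
          + COMPUTE      # step A
          + MEM          # A's result out to memory
          + COMPUTE      # step C
          + 2 * MEM      # C's result out, A's result fetched back for D
          + COMPUTE      # step D
          + MEM          # D's result out to memory
          + COMPUTE      # step E
          + MEM)         # final result back to memory

# Dual core: intermediate results are handed between cores, so fewer
# memory round trips are paid.
dual = (MEM              # input load / final store (overlapped, per the post)
        + 2 * COMPUTE    # A on one core, E of the previous pass on the other
        + MEM            # A's result: back to memory and over to the other core
        + 2 * COMPUTE    # C and D, one per core
        + MEM)           # C and D results packed and written back together

print(single, dual)      # -> 520 460
```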