Original poster | Posted 2009-2-24 09:50:34
23 Feb 2009 21:06:51 UTC
Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it also affects uploads, as the basic SYN/ACK handshake packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.
After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client: they go to our download server, and then they are redirected automatically to the coral cache server. The redirect is formed such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client downloads are served from sources outside our lab, thus saving us lots of bandwidth.
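As a rough illustration of the redirect described above, here is a minimal Python sketch of rewriting a download URL so that it points at coral cache. It assumes CoralCDN's convention of appending `.nyud.net` to the host name, which the post itself does not spell out. The host `download.example.org` and the function name `coralize` are made up for the example, and the real setup did this with apache redirects rather than Python.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: CoralCDN serves a cached copy of http://HOST/path
# at http://HOST.nyud.net/path.
CORAL_SUFFIX = ".nyud.net"


def coralize(url: str) -> str:
    """Return the coral-cache form of an http URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("URL has no host name")
    if parts.port is not None:
        # Explicit ports need extra handling that this sketch leaves out.
        raise ValueError("explicit ports are not handled in this sketch")
    if parts.hostname.endswith(CORAL_SUFFIX):
        return url  # already points at the cache
    return urlunsplit(
        (parts.scheme, parts.hostname + CORAL_SUFFIX,
         parts.path, parts.query, parts.fragment)
    )


# Hypothetical origin URL, for illustration only.
redirect_target = coralize("http://download.example.org/astropulse/client.exe")
```

The first request for a given path makes the cache fetch the file from the origin once; later requests for the same path should be served from the cache.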
That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."
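To see why the HTML blob trips the checksum, here is a small sketch in Python. It assumes the client compares an MD5 digest of the downloaded bytes against an expected value; the post doesn't name the hash, and the byte strings and `verify_download` are placeholders, not real BOINC data or API.

```python
import hashlib


def verify_download(payload: bytes, expected_md5: str) -> bool:
    """Return True when the payload's MD5 hex digest matches the expected one."""
    return hashlib.md5(payload).hexdigest() == expected_md5.lower()


# Placeholder bytes standing in for the real executable.
executable = b"\x7fELF placeholder application bytes"
expected = hashlib.md5(executable).hexdigest()

# What a client behind a redirect-blocking ISP might receive instead.
html_blob = b"<html><body>this ISP doesn't like third party redirects</body></html>"

ok_direct = verify_download(executable, expected)   # genuine file
ok_blocked = verify_download(html_blob, expected)   # ISP error page
```

Any payload other than the exact executable bytes gives a different digest, so the ISP's error page is rejected just as a corrupted download would be.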
But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, BOINC clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed.
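For anyone unfamiliar with exponential backoff, here is a minimal sketch of the behavior that is missing after a checksum error: the wait doubles after each consecutive failure, up to a cap. The base delay, the cap, and the function name are assumptions for illustration, not values taken from BOINC.

```python
def backoff_delay(failures: int, base: float = 60.0, cap: float = 4 * 3600.0) -> float:
    """Seconds to wait before the next retry after `failures` consecutive errors.

    base and cap are hypothetical values chosen for this example.
    """
    if failures <= 0:
        return 0.0
    # Double the wait for each additional failure, but never exceed the cap.
    return min(cap, base * 2 ** (failures - 1))


# Waits after the 1st through 5th consecutive failure.
delays = [backoff_delay(n) for n in range(1, 6)]
```

Without something like this, a client that keeps hitting the same checksum error retries at full speed, which may be how a small set of clients can keep the link pegged.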
In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.
Eric actually had most of this figured out before we arrived today, and already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the TCP settings on the upload server to help get uploads partially working again (instead of only 2% of uploads getting through, now it's about 50%).
The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.
- Matt
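The recent problem looks like a download-only problem, but it affects uploads too: the upload server's SYN/ACK packets get stuck along with the download traffic.
From what has been gathered, the problem probably involves coral cache. It was used to reduce bandwidth needs: it is a free third-party system that redirects requests to its own servers. For example, a user who wants the Astropulse application connects to the SETI@home download server and is redirected to the coral server. If the coral server does not yet have the requested file, it downloads it from the SETI@home server, caches it, and sends it to the user. Once cached, it no longer needs to contact the SETI@home server. In essence, this saves SETI@home a lot of bandwidth.
That leads to problem 1: many ISPs don't like redirects to third-party IPs, so a client downloading a new application can end up with a checksum error. This is a known problem, so coral is only used for the first couple of weeks after a new application is released, to ease the download rush. A small fraction of users are inconvenienced by the redirects, but that resolves once coral is turned off.
Then comes problem 2: an application download checksum error does not trigger exponential backoff, and it causes all workunits requested by that client to error out and be resent. Dave says most of this has been addressed.
Coral has now been turned off, which at least stops the runaway redirect problem, and the upload server's TCP settings have been adjusted (about 50% of uploads now get through instead of 2%).
The plan is to let the current congestion clear on its own and, if needed, change BOINC settings (and possibly BOINC code) so that future coral use goes through direct links rather than redirects.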