China Distributed Computing Forum » SETI@home » SETI@home 2008 Technical News

2008-1-3 14:37 BiscuiT
SETI@home 2008 Technical News

[img]http://photo14.yupoo.com/20080103/140019_46838587_qajtxmta.jpg[/img]

Happy New Year! SETI@home welcomes 2008, and in the new year we hope to receive greetings from an extraterrestrial civilization!

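Content reposted from the [url=http://setiathome.berkeley.edu/]SETI@home official website[/url] - [url=http://setiathome.berkeley.edu/tech_news.php]Technical News[/url] page
Most of it covers server software and hardware problems. Forum members who are interested are welcome to reply with translations. Thanks!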

The project also hosts a dedicated message board for Technical News: [url]http://setiathome.berkeley.edu/forum_forum.php?id=21[/url]
You can talk to Matt directly there.

[[i] This post was last edited by BiscuiT on 2008-2-8 08:46 [/i]]

2008-1-3 14:37 BiscuiT
2 Jan 2008 22:54:11 UTC

Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do!

The weekend was busy with the more-and-more-common fast workunits. Discussions today at the lab brought up the fact that about a third of our data will translate into these fast runners, so we'd better turn our attention back towards improving the data pipeline. We picked two low-hanging fruits today. First, we converted server bane from a redundant web server to a secondary download server. This will help determine if that bottleneck is the server or the storage. Second, I added a flag to the splitter scripts to select files in beam/polarization pair order, not filename order. This will help pseudo-randomize the creation of work, and hopefully spread the pain of fast workunit periods so we aren't so overwhelmed at times.
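
The difference between filename order and beam/polarization pair order can be sketched roughly as follows. This is only an illustration under assumptions: the file names are hypothetical, the 7 beams x 2 polarizations layout is assumed, and the real splitter scripts are not shown in the post.

```python
from itertools import product

# Hypothetical data file names; the real naming scheme is not given in the post.
files = ["file_a", "file_b", "file_c"]
# Assumed layout: 7 beams x 2 polarizations = 14 pairs per file.
pairs = list(product(range(7), range(2)))


def filename_order(files, pairs):
    """Finish every pair of one file before touching the next file."""
    return [(f, b, p) for f in files for (b, p) in pairs]


def pair_order(files, pairs):
    """Take the same pair from every file before moving to the next pair."""
    return [(f, b, p) for (b, p) in pairs for f in files]


if __name__ == "__main__":
    print(filename_order(files, pairs)[:4])
    print(pair_order(files, pairs)[:4])
```

In pair order, consecutive work items come from different files, so a single file that happens to yield fast workunits no longer dominates a stretch of the queue.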

Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we're starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen.

Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool.

- Matt

2008-1-3 14:59 symbol
Could someone translate this? I recognize all 26 letters, but put together they make no sense to me. I'm completely lost.

2008-1-3 16:22 xuqing1870534
Looking for an English expert!

2008-1-3 18:54 dboy27
Haha, I ran it through Google's automatic translation and the result is kind of funny. Here it is:

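Happy New Year! Actually, as every moment is begin some arbitrarily determined era, I should be clearer: happy new calendar year headcount 2008, who if uses this special calendar system, which I usually will do!

The weekend busy with more and more common fast workunits. Today's discussion, at the laboratory brought a fact, namely about one third, our data will transform into these fast runs, so we better put our attention back into improving the data pipe. We chose two low hanging fruit today: convert the server's scourge, from a redundant network server, to a secondary school download server. This will help determine, if the bottleneck is server or storage. I also added: confer a flag on the shunt screenplay to select archives, in beam/polarization a pair of order, rather than filename command. This will help pseudo-random create work, and hopefully spread the pain fast workunit period, therefore, we are not so overwhelmed sometimes.

However, we must astropulse future come down Pike, have a lot of never-seen-before@home data to go through (we have started collecting new data again!). Therefore, we must take a big step in upgrading network/servers. , and get more people to participate. Don't know how, this will all happen, but, but it may also happen.

Meanwhile, we can also try another science database index Jianming (or soon after). Bob found a method, while doing so, the database has been established and inserting data, so we may not have close distributors/assimilators during long-term construction. Cool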

2008-1-4 23:12 BiscuiT
3 Jan 2008 20:54:14 UTC

Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits. However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we're being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty of I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we're not sure what else to try (given our current hardware).

We are in the middle of building another helpful index on the science database. Looks like Bob's magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more.

There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here's the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project's web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there making such requests but are being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are making requests for more bytes beyond the file boundaries as if there was more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there's not much we can do about it. Even though this activity is creating bursts of heavy load on our web servers, this is a fire we're going to let burn for now.
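
For readers unfamiliar with the 416 status, here is a minimal sketch of the decision a web server makes for an open-ended byte-range request. The function name is made up for illustration, and this is not how apache implements it internally; it only shows the rule that a range starting at or beyond the end of the file cannot be satisfied.

```python
def range_status(file_size, first_byte):
    """Return the HTTP status for a request carrying 'Range: bytes=<first_byte>-'."""
    if first_byte >= file_size:
        # Nothing exists at or past this offset: Range Not Satisfiable.
        return 416
    # Bytes first_byte .. file_size - 1 can be served: Partial Content.
    return 206
```

A client that already holds all 1000 bytes of an image and then asks for `bytes=1000-` lands in the first branch, which matches the behavior described above.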

The official press release about multi-beam is finally out. This should help on many levels (though I'll be busier making sure the servers can handle any significant load increase). I guess I'll also be shaving every morning in case there is interest from the national television news media.

I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn't particularly keen. It's been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came.

- Matt

2008-1-9 08:48 BiscuiT
7 Jan 2008 23:28:38 UTC

Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected.

The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router.

We had an assimilator get jammed on a broken result. I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I'm just sayin'. This could easily be a coincidence.

Spent a large chunk of the day trying to coax another Intel-donated server to life. We've gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn't respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn't seated properly. I adjusted it, and now the server sees all input devices. It's still a little squirrelly, but may be a worthwhile web server after all. We're calling it "maul" (sticking to the current "darth" theme). I'll announce it again if it actually proves to be ready for prime time.

- Matt

2008-1-9 08:49 BiscuiT
8 Jan 2008 22:16:52 UTC

So we've been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It took up to an hour to run, during which it hogged a bunch of database memory and slowed everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, and the main table field in question is of low cardinality, and we're only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a newfangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today that triggers don't work in our current version of mysql. So we built an index after all. We'll see how much it helps.
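
The trigger idea can be sketched as follows. To keep the example self-contained it uses SQLite through Python's `sqlite3` module rather than MySQL, and the table names, column names, and the "state 2 means failed validation" rule are all hypothetical; the real BOINC schema is not shown in the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result (id INTEGER PRIMARY KEY, validate_state INTEGER NOT NULL);
CREATE TABLE needs_fix (result_id INTEGER PRIMARY KEY);

-- Hypothetical rule: state 2 stands for "failed validation".
CREATE TRIGGER flag_bad_state AFTER UPDATE OF validate_state ON result
WHEN NEW.validate_state = 2
BEGIN
    INSERT OR IGNORE INTO needs_fix (result_id) VALUES (NEW.id);
END;
""")

# Many ordinary rows, of which only a couple enter the bad state.
conn.executemany("INSERT INTO result (id, validate_state) VALUES (?, 0)",
                 [(i,) for i in range(1, 10001)])
conn.execute("UPDATE result SET validate_state = 2 WHERE id IN (17, 4242)")

# The cleanup job reads the tiny side table instead of scanning `result`.
flagged = [row[0] for row in
           conn.execute("SELECT result_id FROM needs_fix ORDER BY result_id")]
print(flagged)
```

The trade-off is a small cost on every state change in exchange for a cleanup query that touches only the flagged rows.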

Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday's note) up to speed, more or less. Still won't be live for at least a day or two, but it's working. It's a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it's never enough.

Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn't what we were told, and you'd think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway, live and learn. And no further progress on that since yesterday.

Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool. I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY.

- Matt

2008-1-10 09:25 BiscuiT
9 Jan 2008 22:51:15 UTC

More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help.

There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general.

Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues.
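
One way to make a consumer less sensitive to this kind of corruption is to set the bad result aside and keep going. The sketch below is an assumption-laden illustration, not the assimilator's actual code: the regular expression only covers the header format quoted above, and the function names are invented.

```python
import re

# Pattern for the header quoted above; BOINC's real parser is not reproduced here.
POT_HEADER = re.compile(r'<pot length=(\d+) encoding="([\w-]+)">')


def parse_pot_header(line):
    """Return (length, encoding), or None if the header is corrupt."""
    match = POT_HEADER.fullmatch(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def assimilate(results):
    """Sort results into accepted and quarantined instead of stopping at the first bad one."""
    accepted, quarantined = [], []
    for name, header in results:
        parsed = parse_pot_header(header)
        (accepted if parsed else quarantined).append(name)
    return accepted, quarantined
```

With this shape, one garbled upload ends up in the quarantine list for later inspection while the rest of the queue continues to drain.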

Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" processes (i.e. splitting multiple files at once) have a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.

- Matt

2008-1-10 17:28 yiwen
tes

I hope I can find what I'm looking for.

2008-1-11 12:27 BiscuiT
10 Jan 2008 22:47:31 UTC

The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares.
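
A misspelled directive in robots.txt is silently ignored by parsers, which is why such a typo is easy to miss. The sketch below uses Python's standard `urllib.robotparser` to compare a correct rule with a misspelled one. The path, the bot name, and the particular misspelling are purely illustrative; the post does not say which word was missing its "s".

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


GOOD = "User-agent: *\nDisallow: /stats/\n"
TYPO = "User-agent: *\nDiallow: /stats/\n"  # hypothetical typo, for illustration only
URL = "http://example.org/stats/top.php"

if __name__ == "__main__":
    print(allowed(GOOD, URL))  # rule is understood, so the fetch is refused
    print(allowed(TYPO, URL))  # misspelled rule is ignored, so the fetch is allowed
```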

Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database too, so we turned that off for now. Actually that last thing helped immensely.

In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits.

Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. [edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.]
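
Sending a fixed share of traffic down a second path amounts to weighted selection. The following is a generic sketch of that idea, not pound's configuration or algorithm; the route names and the 90/10 split are hypothetical.

```python
import random


def pick_route(weights, rng):
    """Pick one route name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


weights = {"isp_router": 90, "campus_net": 10}  # hypothetical split
rng = random.Random(0)                           # seeded for repeatability
counts = {"isp_router": 0, "campus_net": 0}
for _ in range(10000):
    counts[pick_route(weights, rng)] += 1
print(counts)
```

Setting a weight to 0 takes that path out of rotation without removing it from the configuration, which is roughly the "back pocket" situation described in the edit note.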

By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client.

- Matt

2008-1-13 17:08 lt2818
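2 January 2008 22:54:11 UTC
Happy New Year! In fact, at every moment is some's beginning arbitrarily define epoch, I should be clearer: happy new calendar year 2008, anyone uses this one special calendar system that I usually do!

The weekend busy with more-and-more-usual fast work units. Discussion today at the laboratory raised the fact about one third of our data will transform into these fast runners, therefore, we more toward improving the data pipeline turn our attention back. We picked two lowest value today hanging fruits: convert from a superfluous web server to a second download server poison server. If that bottleneck is server or storage, this will help decide. I also added a flag to the chopping person script to select beam/polarization files as double command, not file name. This will help fake-randomize the production of work, and hopefully spread the pain of fast work unit periods so we are not so sometimes drowned.

However, we must astropulse future come down Pike, have a lot of never-seen-before@home data to go through (we have started collecting new data again!). Therefore, we must take a big step in upgrading network/servers. , and get more people to participate. Don't know how, this will all happen, but, but it may also happen.

Meanwhile, we can also try another science database index Jianming (or soon after). Bob found a method, while doing so, the database has been established and inserting data, so we may not have close distributors/absorbers. Cool.

- Matt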

2008-1-15 06:53 victor_fr
Was this translated by Google?

2008-1-15 09:31 BiscuiT
14 Jan 2008 22:23:56 UTC

Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there as every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread) this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. There were assumptions this would lead to greater problems down the line (fearing many results will be sitting on disk longer waiting for their redundant pairing to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we could optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create fewer "fast runners."

Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we're not ignoring discussion about this on other threads but we are limited to what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we'd like to get more stuff in there ASAP. We also need three new desktop systems. Dan's using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh's old super-cheap windows/intel box is basically a glorified console server.

Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless.
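
The gap between a directory scan and "df" comes from files that have been deleted while a process still holds them open. A minimal sketch of that effect, assuming a POSIX system (the behavior differs on Windows) and using a throwaway temporary file:

```python
import os
import tempfile


def deleted_but_open_demo(nbytes=1024 * 1024):
    """Unlink a file while it is open and show it still holds data (POSIX only)."""
    fd, path = tempfile.mkstemp()
    try:
        os.unlink(path)              # the name is gone; a directory scan no longer sees it
        os.write(fd, b"x" * nbytes)  # the open descriptor still accepts writes
        info = os.fstat(fd)
        return os.path.exists(path), info.st_nlink, info.st_size
    finally:
        os.close(fd)                 # only now can the filesystem reclaim the blocks


if __name__ == "__main__":
    print(deleted_but_open_demo())
```

The space is released only when the last descriptor is closed, which is why restarting or reopening the logging process frees the disk.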

Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it's becoming clearer as we increase active user load and I'm populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning.

- Matt

2008-1-16 12:57 BiscuiT
16 Jan 2008 0:37:05 UTC

Yeah... we're really pushing the boundaries of our mysql database these days. I'm finally catching up on several years of backlogged archives and inserting zillions of rows to credited_job and this, on top of general increased usage, is gumming up the works. In fact, optimizing this table alone during today's outage took three hours (normally only a few minutes) - which explains the extreme length of today's downtime. I guess we'll have to turn off credited_job optimization until we actually use the table.

This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it's pretty much plug and play. Remember the project specific data, i.e. the science database, can be whatever you want. In our case, it's Informix. Why Informix? We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL server/etc.? If somebody wants to buy such a license and donate a man/year to change all our back end software to do so, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It's the BOINC/mysql part that needs help, and we're sticking with it for reasons stated above, and with SETI@home being the flagship project of BOINC we don't want to diverge from the standard.

In other news, it seems that every day there's a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. I blocked these hits as well. We don't allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else.

On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There's no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn't see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All's well and it's getting all yummed up now.

So we're looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper. Any suggestions, experience with such devices, etc. out there?

- Matt

2008-1-17 09:31 BiscuiT
16 Jan 2008 23:25:12 UTC

The recovery went rather well yesterday, considering its extended length. Bob made some mysql tweaks to perhaps better use the memory on jocelyn (allow more protected space for query sorting, for example).

Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up so that it couldn't mount a single particular machine on a different subnet in our lab (why it needed to mount this machine was due to an "ls" command in a script - which by default displays color, so ls will traverse sym links to see if they are broken or not in order to select the proper color scheme, and in this case one sym link was on this remote machine). Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely different (and useless) non standard sizes, and this time is no different. We needed to shrink the rail depth which should be easy. I did this to one and it fit! I did this to the other and, due to different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN.
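
The symlink behavior behind the autofs surprise can be illustrated in a few lines. This is a sketch of the general mechanism rather than what "ls" does internally: checking whether a link is broken requires following it, and following it is what reaches into the other filesystem. The helper names and temporary paths are invented for the example.

```python
import os
import tempfile


def classify(path):
    """Classify a directory entry the way a colorizing listing has to."""
    if not os.path.islink(path):  # lstat only: never touches the link target
        return "not a link"
    # os.path.exists() follows the link, so this is the call that would
    # reach across to another (possibly automounted) filesystem.
    return "ok link" if os.path.exists(path) else "broken link"


def demo():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        open(target, "w").close()
        good = os.path.join(d, "good")
        bad = os.path.join(d, "bad")
        os.symlink(target, good)
        os.symlink(os.path.join(d, "missing"), bad)
        return classify(target), classify(good), classify(bad)


if __name__ == "__main__":
    print(demo())
```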

But that's just a minor detail really - no need to rant and I don't want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We'll update the server-wish-list with routers, servers, kvms, etc. soon.

Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and AstroPulse hopes to find many of) but not ET. Sigh.

- Matt

2008-1-17 09:40 xrevo
That's so much~~

2008-1-18 09:22 BiscuiT
17 Jan 2008 22:23:19 UTC

No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL. Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address: Claiming the winners in the game of capitalism hold the best solutions to whatever problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack.

Other than that... mostly spent the day cleaning up spills in various aisles. I also yum'ed up my desktop to Fedora Core 8 as an exercise to do so on heftier servers in the coming weeks.

- Matt

2008-1-23 16:13 BiscuiT
23 Jan 2008 1:16:26 UTC

To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that's why.

I'll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here's a non-chronological digest: One of the servers (bruno) lost its automount again (hasn't happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we're going to be recovering from that for a bit. The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses.

Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it's already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is [i]way[/i] overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later.

In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It's a 7600 series, a bit overkill, but will give us tons of headroom to spare. We'll no longer be constrained by the 60Mb/sec cap! I guess we'll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it's in the rack!

By the way, if you haven't heard of email bankruptcy, please [url=http://www.wired.com/wired/archive/14.08/howtodesk.html]read this article[/url]. I'm declaring "thread" bankruptcy, i.e. I am letting go all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again.

- Matt

2008-1-24 08:50 BiscuiT
23 Jan 2008 23:27:33 UTC

No news on the recently donated router (see yesterday's post). Basically we're in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It's funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work.

Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we're currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week.

- Matt

2008-1-25 07:36 BiscuiT
24 Jan 2008 21:03:59 UTC

I think I have the apache/tcp config in some kind of working order so that we won't suffer such wild dips like we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You'd think the last item was the main culprit as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one still not config'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started the cpu load went up to 400. Basically, MaxClients was set way too high but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We're still experiencing catch up "malaise" but it's a much smoother ride in general than yesterday.

I've actually been working on some scientific programming. With the new science indexes being built we're able to analyze some data to get an idea of the current RFI structure. Basically we're seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I'm hoping this kind of work will inspire more scientific updates from the others (remember: I'm a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab).

- Matt

2008-1-29 08:27 BiscuiT
28 Jan 2008 21:28:05 UTC

Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since various index builds (and removals of seemingly unused indexes) the MySQL database is masterfully handling everything we give it. The router upgrade is still in limbo.

One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy) each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn't always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno's storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We'll see if that helps.
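
The post does not say which open-file limit was raised on bruno (per-process or system-wide). Assuming the per-process descriptor limit on a Unix system, a sketch of inspecting and raising the soft limit from Python's standard `resource` module might look like this; the function name is invented.

```python
import resource


def raise_nofile_soft_limit(wanted):
    """Raise the soft open-file limit toward `wanted`, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    # Only ever raise the soft limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(resource.getrlimit(resource.RLIMIT_NOFILE))
```

An unprivileged process can raise its soft limit only up to the hard limit; going beyond that requires administrator changes outside the process.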

In case you haven't noticed, I un-DNS-aliased one of the three setiathome.berkeley.edu webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it's doing great. However, DNS rollout takes forever (even with time-to-live set for 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We'll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else.

Other plans for the week: We're upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble. However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly.

- Matt

2008-1-29 11:04 2287732
Could some expert translate this? [em03]

2008-1-29 14:54 Youth
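There's a lot here, so just a brief translation:

The servers are in good shape, although the load is fairly heavy. The active user count is still growing, but the growth rate keeps falling.

For work distribution, the load balancing problem was solved earlier, but recently there still seems to be some problem with disk access, which has yet to be confirmed.

The web servers that had been in use for 8 years have just been retired.

Other tasks this week: operating system upgrades.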

2008-1-30 12:10 BiscuiT
30 Jan 2008 0:06:05 UTC

Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already, now it really is. Second, we upgraded the secondary science database server to latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that.

The recovery from the outage was slowed by a couple of things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. This has the unfortunate side effect of flushing the 20+ GB of memory, which means it all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I'd continue tweaking the apache config on bane as it was seemingly unhappy and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come.

Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well.

- Matt

2008-1-31 18:46 BiscuiT
31 Jan 2008 0:45:41 UTC

Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don't have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action.

Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then it was clear that a drive failed and it was rebuilding the respective RAID volume. Ultimately no big deal, but it is quite disconcerting it didn't recover so easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let's hope another drive doesn't go in the meantime.

- Matt

2008-2-1 16:38 BiscuiT
31 Jan 2008 22:54:06 UTC

No big shakes today. Here's the lowdown:

The RAID recovered just fine last night. Continuing install of OS'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I'm currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users.
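
[Editor's sketch] The post does not describe how the preference was actually applied. As a rough illustration only, assume each ready-to-send result has an integer id, that the "even" scheduler serves ids with `id % 2 == 0` and the "odd" scheduler serves the rest, and that incoming requests are routed in proportion to each side's backlog. All names below are hypothetical.

```python
from collections import Counter


def backlog_by_parity(result_ids):
    """Count ready-to-send results as (even_count, odd_count) by id parity."""
    counts = Counter(rid % 2 for rid in result_ids)
    return counts[0], counts[1]


def scheduler_weights(even_backlog, odd_backlog):
    """Return (even_weight, odd_weight): the share of requests to route to each scheduler.

    The scheduler with the larger backlog gets proportionally more requests,
    so the two queues drain toward balance. With nothing queued, split evenly.
    """
    total = even_backlog + odd_backlog
    if total == 0:
        return 0.5, 0.5
    return even_backlog / total, odd_backlog / total
```

For example, with three even-numbered results and one odd-numbered result waiting, `scheduler_weights(3, 1)` gives `(0.75, 0.25)`, so three quarters of the requests would go to the even scheduler until the backlogs even out.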

As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I'm always pleased with increased media exposure, but personally I'm kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn't know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. She asked, "when did you dye your hair?"

- Matt

2008-2-5 11:56 BiscuiT
4 Feb 2008 22:53:30 UTC

Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner rather than later - still waiting on OS upgrade support from others.

Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them. Here's a reminder that our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted 5 or 6 more years beyond that with nary a single problem.

Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We're collecting data now, but having to kick the system along from time to time.

- Matt

2008-2-6 09:21 BiscuiT
5 Feb 2008 23:55:44 UTC

The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off). Otherwise it was fine though it took a little longer to compact the tables as it was a generally busy week meaning a lot more database inserts/deletes and therefore a lot more fragmentation.

Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot of this is we're now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough to not notice bane (the workunit download server) spiraled out of control trying to recover from the outage. I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I'm still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while.

Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff's fixing this now.

- Matt

2008-2-7 09:04 BiscuiT
6 Feb 2008 23:04:24 UTC

Recovery from yesterday's outage wasn't so bad after all, but we're hitting another wall. Well, not a wall as much as a mound. That mound is our science database server, thumper. Those watching the status page may have been noticing it's having a harder and harder time keeping up with making work (ready-to-send queue is hardly ever full) and keeping up with assimilation (ready-to-assimilate queue is hardly ever empty - in fact, it's been growing slowly over the past 24 hours).

Of course, it's not the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we're hitting disk I/O limits on thumper. I fear making the "vertical" splitter (which acts on many raw data files simultaneously to reduce impact of hitting too much noise on a single file) has reduced any benefit of disk caching to zero. Since we're basically keeping up now, I whittled our number of splitters from 10 to 6 - hopefully this will help. I don't want to revert to non-vertical splitting just yet - we'll have greater problems if we do. Bob may also employ some different informix checkpointing parameters to reduce the impact of long checkpoints blocking science database traffic about 25% of the time. We're pretty much in wait-and-see mode on that.
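
[Editor's sketch] The splitter's real logic is not shown anywhere in this thread. The toy model below only illustrates why working "vertically" across many files can hurt cache locality: it compares a one-file-at-a-time order with a round-robin order across files and counts how often consecutive reads land in a different file. The file names, chunk lists, and function names are all invented for the example.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for files that run out of chunks early


def sequential_order(files):
    """Read every chunk of one file before moving on to the next file."""
    return [(name, chunk) for name, chunks in files.items() for chunk in chunks]


def vertical_order(files):
    """Round-robin across files: chunk 0 of each file, then chunk 1 of each, and so on."""
    names = list(files)
    order = []
    for row in zip_longest(*(files[n] for n in names), fillvalue=_MISSING):
        for name, chunk in zip(names, row):
            if chunk is not _MISSING:
                order.append((name, chunk))
    return order


def file_switches(order):
    """Count how many consecutive reads come from different files."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])
```

With two files of three and two chunks, the sequential order switches files once while the vertical order switches on every read. That is the kind of access pattern that gives a disk cache little to reuse, though the real effect on thumper would depend on details the post does not give.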

Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data into the splitter queue from last week (28ja08aa). It's already being split, actually. This contains radar blanking data. We're going to process it once without the blanker logic, and again with. It's a data-beta-test. We want to really make sure it works before processing dozens of whole files. I'll try to remember to throw up some before/after plots comparing the two runs once they are complete.

- Matt
