Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted on 2008-1-17 09:31:16

16 Jan 2008 23:25:12 UTC

The recovery went rather well yesterday, considering its extended length. Bob made some MySQL tweaks to perhaps better use the memory on jocelyn (allowing more protected space for query sorting, for example).

Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up, so that it couldn't mount one particular machine on a different subnet in our lab. Why it needed to mount this machine at all came down to an "ls" command in a script - ls displays color by default, and to select the proper color scheme it will traverse symlinks to see whether they are broken or not, and in this case one symlink pointed at this remote machine. (A sketch of this ls behavior is below.)

Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely varied (and useless) non-standard sizes, and this time is no different. We needed to shrink the rail depth, which should be easy. I did this to one and it fit! I did this to the other and, due to a different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN.
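For the curious, the ls quirk above is easy to demonstrate. Here is a minimal Python sketch (path names are illustrative) of the check a colorized ls performs on every symlink - exactly the stat() that forced the automount of the remote machine:

```python
import os

def classify_symlinks(directory):
    """Roughly what `ls --color` does per entry: lstat() is cheap and
    local, but picking a color for a symlink requires a stat() of the
    target -- and if the target sits on an automounted filesystem,
    that stat() triggers the mount."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            try:
                os.stat(path)   # follows the link; may touch the network
                print(name, "-> symlink, target reachable")
            except OSError:
                print(name, "-> symlink, broken (gets the 'orphan' color)")

classify_symlinks(".")
```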

But that's just a minor detail really - no need to rant and I don't want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We'll update the server-wish-list with routers, servers, kvms, etc. soon.

Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and which AstroPulse hopes to find many of), but not ET. Sigh.

- Matt

Posted on 2008-1-17 09:40:51
That's a lot~~

OP | Posted on 2008-1-18 09:22:44

17 Jan 2008 22:23:19 UTC

No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL. Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial-grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address that: Claiming the winners in the game of capitalism hold the best solutions to any given problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack.

Other than that... mostly spent the day cleaning up spills in various aisles. I also yum'ed up my desktop to Fedora Core 8 as an exercise before doing the same on heftier servers in the coming weeks.

- Matt

OP | Posted on 2008-1-23 16:13:25

23 Jan 2008 1:16:26 UTC

To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that's why.

I'll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here's a non-chronological digest: One of the servers (bruno) lost its automount again (hasn't happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we're going to be recovering from that for a bit. The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses.

Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it's already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is way overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later.

In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It's a 7600 series, a bit overkill, but it will give us tons of headroom to spare. We'll no longer be constrained by the 60Mb/sec cap! I guess we'll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it's in the rack!

By the way, if you haven't heard of email bankruptcy, please read this article. I'm declaring "thread" bankruptcy, i.e. I am letting go of all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again.

- Matt

OP | Posted on 2008-1-24 08:50:06

23 Jan 2008 23:27:33 UTC

No news on the recently donated router (see yesterday's post). Basically we're in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It's funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work.

Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we're currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week.

- Matt

OP | Posted on 2008-1-25 07:36:22

24 Jan 2008 21:03:59 UTC

I think I have the apache/tcp config in some kind of working order, so that we won't suffer such wild dips as we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You'd think the last item was the main culprit, as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one is still not config'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started, the cpu load went up to 400. Basically, MaxClients was set way too high, but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We're still experiencing catch-up "malaise" but it's a much smoother ride in general than yesterday.
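The arithmetic behind that tuning is worth a sketch. None of these numbers are bane's actual specs - purely illustrative - but this is the usual back-of-the-envelope way to bound MaxClients for a prefork Apache so the box neither swaps (too high) nor drops connections (too low):

```python
def estimate_max_clients(ram_mb, reserved_mb, per_child_mb):
    """Cap Apache prefork workers so every child fits in physical RAM,
    leaving headroom for the OS and file cache. Too high -> swapping
    and runaway load; too low -> dropped connections with idle CPU."""
    return (ram_mb - reserved_mb) // per_child_mb

# Hypothetical 4 GB download server, 1 GB reserved, ~8 MB per child:
print(estimate_max_clients(4096, 1024, 8))  # -> 384
```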

I've actually been working on some scientific programming. With the new science indexes being built we're able to analyze some data to get an idea of the current RFI structure. Basically we're seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I'm hoping this kind of work will inspire more scientific updates from the others (remember: I'm a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab).
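Matt's actual analysis code isn't public, but one standard way to make radar structure visible is to fold detection times at the radar's pulse-repetition interval: genuine sky signals spread uniformly in phase, while radar leakage piles up in a few bins. A hedged sketch (the period value here is made up):

```python
import numpy as np

def fold_at_radar_period(event_times_s, radar_period_s=0.012, nbins=64):
    """Histogram event timestamps by phase within the radar's
    pulse-repetition interval. A flat histogram suggests clean data;
    a sharp spike at one phase is the radar showing through."""
    phases = np.mod(event_times_s, radar_period_s) / radar_period_s
    counts, _ = np.histogram(phases, bins=nbins, range=(0.0, 1.0))
    return counts
```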

- Matt

OP | Posted on 2008-1-29 08:27:57

28 Jan 2008 21:28:05 UTC

Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since the various index builds (and removals of seemingly unused indexes), the MySQL database is masterfully handling everything we give it. The router upgrade is still in limbo.

One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy), each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn't always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno's storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We'll see if that helps.
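To make the "polarity" idea concrete, here is a toy sketch of the scheme as described - results split by id parity between two queues, so uneven client demand can drain one side first. The mapping of a particular parity to a particular host is my invention for illustration, not documented:

```python
from collections import deque

even_q, odd_q = deque(), deque()

def enqueue(result_id):
    """Feeder side: results land in one of two queues by id parity."""
    (even_q if result_id % 2 == 0 else odd_q).append(result_id)

def dispatch(server):
    """Scheduler side: each host serves only its own queue. If DNS
    sends more clients to one host, that queue empties first and its
    clients see 'no work from project' while the other still has
    plenty."""
    queue = even_q if server == "bruno" else odd_q
    return queue.popleft() if queue else None
```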

In case you haven't noticed, I un-DNS-aliased one of the three setiathome.berkeley.edu webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it's doing great. However, DNS rollout takes forever (even with time-to-live set to 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We'll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else.
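If you want to watch a rollout like this from your own vantage point, a quick check needs nothing but the Python standard library (the output depends entirely on what your resolver has cached):

```python
import socket, time

# Which address does the name resolve to right now, from here?
# Cached records linger until their TTL expires, which is why
# different clients converge on the new server at different times.
host, aliases, addrs = socket.gethostbyname_ex("setiathome.berkeley.edu")
print(time.strftime("%Y-%m-%d %H:%M:%S"), host, aliases, addrs)
```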

Other plans for the week: We're upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble. However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly.

- Matt

Posted on 2008-1-29 11:04:25
Could one of the experts translate this?

Posted on 2008-1-29 14:54:09
There's a lot of content, so here's a quick summary:

The servers are in good shape, though the load is fairly heavy. The active user count is still growing, but the growth rate keeps falling.

For work distribution, the load-balancing problem was solved earlier, but lately there still seem to be some disk-access issues, which remain to be confirmed.

The web servers that served for 8 years have just been retired.

Other tasks this week: OS upgrades.

OP | Posted on 2008-1-30 12:10:45

30 Jan 2008 0:06:05 UTC

Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already; now it really is. Second, we upgraded the secondary science database server to the latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that.
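For readers wondering what such a constraint buys you, a toy demonstration - sqlite3 standing in for Informix here, and the column name is purely hypothetical since the post doesn't name the field:

```python
import sqlite3

# With a UNIQUE constraint in place, a duplicate insert fails loudly
# instead of silently violating an assumption the code relies on.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE analysis_config (name TEXT UNIQUE, value TEXT)")
db.execute("INSERT INTO analysis_config VALUES ('fft_len', '131072')")
try:
    db.execute("INSERT INTO analysis_config VALUES ('fft_len', '262144')")
except sqlite3.IntegrityError as err:
    print("duplicate rejected:", err)
```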

The recovery from the outage was slowed by a couple things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. This has the unfortunate side effect of flushing the 20+ GB of memory, which means it all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I'd continue tweaking the apache config on bane, as it was seemingly unhappy, and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come.

Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well.

- Matt

OP | Posted on 2008-1-31 18:46:04

31 Jan 2008 0:45:41 UTC

Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don't have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action.

Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then was it clear that a drive had failed and it was rebuilding the respective RAID volume. Ultimately no big deal, but it is quite disconcerting that it didn't recover easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let's hope another drive doesn't go in the meantime.

- Matt

OP | Posted on 2008-2-1 16:38:19

31 Jan 2008 22:54:06 UTC

No big shakes today. Here's the lowdown:

The RAID recovered just fine last night. Continuing install of OS'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I'm currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users.

As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I'm always pleased with increased media exposure, but personally I'm kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn't know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. She asked, "when did you dye your hair?"

- Matt

OP | Posted on 2008-2-5 11:56:04

4 Feb 2008 22:53:30 UTC

Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner than later - still waiting on OS upgrade support from others.

Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them. As a reminder, our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted 5 or 6 more years beyond that with nary a single problem.

Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We're collecting data now, but having to kick the system along from time to time.

- Matt

OP | Posted on 2008-2-6 09:21:23

5 Feb 2008 23:55:44 UTC

The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off). Otherwise it was fine though it took a little longer to compact the tables as it was a generally busy week meaning a lot more database inserts/deletes and therefore a lot more fragmentation.

Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project, which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot is we're now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough not to notice that bane (the workunit download server) spiraled out of control trying to recover from the outage. I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I'm still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while.

Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff's fixing this now.

- Matt

OP | Posted on 2008-2-7 09:04:54

6 Feb 2008 23:04:24 UTC

Recovery from yesterday's outage wasn't so bad after all, but we're hitting another wall. Well, not a wall so much as a mound. That mound is our science database server, thumper. Those watching the status page may have noticed it's having a harder and harder time keeping up with making work (the ready-to-send queue is hardly ever full) and keeping up with assimilation (the ready-to-assimilate queue is hardly ever empty - in fact, it's been growing slowly over the past 24 hours).

Of course, it's not just the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we're hitting disk I/O limits on thumper. I fear making the splitter "vertical" (acting on many raw data files simultaneously to reduce the impact of hitting too much noise in a single file) has reduced any benefit of disk caching to zero. Since we're basically keeping up now, I whittled our number of splitters from 10 down to 6 - hopefully this will help. I don't want to revert to non-vertical splitting just yet - we'll have greater problems if we do. Bob may also employ some different Informix checkpointing parameters to reduce the impact of long checkpoints, which block science database traffic about 25% of the time. We're pretty much in wait-and-see mode on that.
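As a rough illustration of why vertical splitting fights the disk cache, here is a sketch (file handling simplified, chunk size invented): reads round-robin across many large files, so no single file is read sequentially and the cache keeps evicting data it will want back:

```python
def vertical_split(paths, chunk_bytes=8 * 1024 * 1024):
    """Yield chunks round-robin across many raw data files, so a noisy
    stretch in any one file is diluted across the workunit stream --
    at the cost of turning N sequential reads into scattered I/O."""
    handles = [open(p, "rb") for p in paths]
    while handles:
        for h in list(handles):
            chunk = h.read(chunk_bytes)
            if chunk:
                yield chunk          # one splitter input's worth of data
            else:
                h.close()
                handles.remove(h)
```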

Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data from last week (28ja08aa) into the splitter queue. It's already being split, actually. This file contains radar blanking data. We're going to process it once without the blanker logic, and again with it. It's a data beta test. We want to make really sure it works before processing dozens of whole files. I'll try to remember to throw up some before/after plots comparing the two runs once they are complete.

- Matt
