Thread starter: BiscuiT

SETI@home 2008 Technical News

28 Jul 2008 21:27:00 UTC

Wow. What a weird weekend. A lot of minor things went wrong, causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out, in no particular order and in varying amounts of detail.

Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so do their respective workunits. Plus, with Astropulse ramping up, those giant workunits were filling up the storage faster than usual. Eric had already put code in the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, that fix had so far only been deployed in beta. The result: there are about 20,000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.
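
For the curious, the check itself is conceptually trivial. Here's a minimal sketch of that kind of guard (this is not the actual splitter code - the path and threshold below are made-up placeholders, and it's just the idea expressed in Python):

import os
import sys

WORKUNIT_DIR = "/path/to/workunit_storage"   # hypothetical mount point
MAX_USAGE = 0.90                             # refuse to write above 90% full

def disk_usage_fraction(path):
    """Return the fraction of the filesystem at `path` that is in use."""
    st = os.statvfs(path)
    return 1.0 - (st.f_bavail / st.f_blocks)

def write_workunit(name, payload):
    """Write a workunit only if there's room, so a full disk can't leave
    zero-length files behind for clients to choke on."""
    if disk_usage_fraction(WORKUNIT_DIR) >= MAX_USAGE:
        print("storage too full, not splitting", file=sys.stderr)
        return False
    with open(os.path.join(WORKUNIT_DIR, name), "wb") as f:
        f.write(payload)
    return True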

We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We hadn't built a new scheduler process in a while, which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but it did cause random hosts to be unable to connect. For no good reason a lot of the hosts reporting problems were Macs, which added to the difficulty of diagnosis - we thought it was an architecture-dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication, though, and the new "broken" scheduler was turned on again last Friday for about a day.

On Sunday our bandwidth dropped to zero. At this point we threw up our hands and decided we'd figure it out when we were all in the lab together on Monday (today). Remember, we do have a policy that it is perfectly okay for our project to be down for a day or two, as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that, as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out apache was the problem on this server (once again obscured by alerts not running due to mounting issues) and we had to kick it a couple of times (including a full system reboot due to messed-up shared memory segments) to get it going again. Once it was going, both download servers choked, so I had to kick both of them as well.

Then we ran out of work. Remember how I said we put a fix in the splitter to keep it from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server was over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta, so that'll clear up space soon enough and we'll generate fresh work.

And oh yeah.. we were slashdotted again on Sunday.

That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!

- Matt

29 Jul 2008 23:13:57 UTC

Today we had our usual Tuesday outage, which ran a bit longer than usual as we had extra things to take care of (outside of the usual BOINC database table compression and backup to disk).

I failed to mention yesterday (though many have noticed) that db_dump hasn't been working for days, which means our stats have flatlined all weekend. This was because our mysql replica failed (we run these expensive stats lookups on the replica so they don't affect the more important updates running on the master). So part of the outage today was to rebuild this replica from scratch via the dump from the master. It was easy - we do this regularly anyway - just takes a long time.
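
The rebuild itself is conceptually just "dump the master, load the replica, point it back at the master and let it roll forward". A rough sketch of that procedure driven from Python - host names, credentials and paths are placeholders, not our real setup:

import subprocess

DUMP = "/tmp/boinc_master.sql"

# 1. Dump the master.  --master-data records the binlog position in the dump
#    so the replica knows exactly where to resume replication.
with open(DUMP, "wb") as out:
    subprocess.run(
        ["mysqldump", "--host=master-db", "--user=repl", "--password=secret",
         "--master-data=1", "--all-databases"],
        stdout=out, check=True)

# 2. Load the dump on the replica, then restart replication from the
#    recorded position.
with open(DUMP, "rb") as dump:
    subprocess.run(["mysql", "--host=replica-db", "--user=repl",
                    "--password=secret"], stdin=dump, check=True)
subprocess.run(["mysql", "--host=replica-db", "--user=repl",
                "--password=secret", "-e", "START SLAVE;"], check=True)

It's all mechanical, which is why it's "easy" - the time goes into moving the dump around and reloading it.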

Also, Jeff and I replaced a failed drive on thumper (the science database server). There are 48 drives on the thing, so disk failures are common, and we get Sun support on this important system. We ask for a drive, they send one, we put it in and ship the old one back. Easy as pie. Unfortunately the software RAID on this system made some bogus complaints upon restart (unrelated to the device that required the new drive). I'm not sure why mdadm gets confused - for example, I converted a couple of spare drives to a new RAID device, which works fine, but upon reboot (many months later) mdadm freaks out because those spares are "missing". Anyway, this was mostly harmless, and another warning that we really need a fresh OS install on this system sooner rather than later (that'll be scary).
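
For what it's worth, the quickest sanity check after a reboot like that is /proc/mdstat: a healthy array shows something like [UU], a degraded one [U_]. A tiny sketch of the kind of check our monitoring could do (device names in the comments are made up):

import re

def degraded_arrays():
    """Scan /proc/mdstat and return the md devices with a missing member,
    i.e. a status string like [UU_] instead of [UUU]."""
    bad, current = [], None
    with open("/proc/mdstat") as f:
        for line in f:
            m = re.match(r"(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            elif current and re.search(r"\[[U_]*_[U_]*\]", line):
                bad.append(current)
    return bad

for md in degraded_arrays():
    # A drive that's actually fine can often just be re-added, e.g.
    #   mdadm /dev/md2 --re-add /dev/sdq1   (names made up)
    print("%s is degraded; see 'mdadm --detail /dev/%s'" % (md, md))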

We're running full bore now. It'll take a while to catch up, and we may temporarily run out of work again (still not a comfortable amount of free disk space on the workunit storage). But it'll all clear up eventually.

- Matt

30 Jul 2008 20:10:28 UTC

Looks like we're pretty much out of the woods regarding recent issues. Plus the stats dumps are working again (for the first time in days) so there was an artificially inflated bump in BOINC world-wide productivity for a moment there.

Following up on the science database server stuff: I continue to play the RAID "shell game" to get the root filesystems back on the actual root drives (just for our own sanity, mostly). I also still have to drop/rebuild that one index which gave us trouble a couple of weeks ago (apparently "checking" the index didn't fix it) - all very minor issues.

Regarding our experience with drive failures... We see the obvious stuff - drives fail either (a) immediately, (b) after 2-4 years, or (c) never ever. I remind people that our original SETI@home data recorder contained drives that had already been heavily used for about 5-6 years when we installed them down at Arecibo in 1998, and they kept reading/writing successfully until a couple of years ago. They would probably still be working, but we have since switched to the newer multibeam data recorder system. Anyway, we don't have enough data to prove that high temperatures or heavy loads kill drives faster. My gut feeling is that they don't, at least not as much as you'd think. My gut feeling is also that more than half our "failures" are bogus - for example, we've had a lot of fibre channel errors, RAID card bugs, or smartd being oversensitive, making it seem like perfectly good drives were unhappy. Many times we just remove and re-add the "broken" drive and it works just fine. In the current case we believe the drive replacement was necessary.

Regarding linux OS re-installs... We've been using Fedora for a while now. Each OS rev has about 18 months of support, and we like to keep up to date for various compatibility/security/bug-fix reasons. It's easy to "yum upgrade" to the next OS rev, but after doing this a couple of times you find configuration files get out of whack and your system is littered with ".rpmnew" files. Package conflicts arise. Plus every few years you learn enough that you want to rethink your filesystems, adjust partition sizes, etc. So a fresh install is more of a "spring cleaning" than anything else.
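
To illustrate the kind of cruft that piles up, here's a trivial sketch that lists the .rpmnew/.rpmsave config leftovers from successive upgrades so they can be reconciled by hand (nothing site-specific here, just a directory walk):

import os

leftovers = []
for root, dirs, files in os.walk("/etc"):
    for name in files:
        if name.endswith((".rpmnew", ".rpmsave")):
            # rpm drops these beside config files it didn't dare overwrite/remove
            leftovers.append(os.path.join(root, name))

for path in sorted(leftovers):
    print(path)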

- Matt

4 Aug 2008 21:37:18 UTC

Another wacky weekend for us. Astropulse is still ramping up - we're creating work, sending it out, receiving results back and assimilating them. However, the validator stopped granting credit for these workunits - something we'll fix, and we can retroactively grant people their credit. The workunit storage server ran low on room again; that's the bottleneck that's been giving everybody headaches over the weekend, as the splitters could only create work as fast as old workunits got deleted off disk. Right now things are generally running slow as I'm moving stuff off the workunit server to make room, causing lots of excess internal I/O. As an added bonus, the mysql database replica server crashed this morning - it ran out of memory. No harm done, but it looks like it'll take a while to catch up again (it's been lagging behind all weekend). I would like to try to split the numbers on the status page between the two different applications (SETI@home/Astropulse), but those extra "where" clauses make the queries run forever.
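
For the database-minded: the kind of breakdown I'd like is roughly the query below (column names follow the stock BOINC schema as I understand it - result.appid, result.server_state - but treat the details as illustrative). Run against a result table with millions of rows and no matching index, it simply takes too long:

import MySQLdb   # classic mysql-python bindings

QUERY = """
    SELECT app.name, result.server_state, COUNT(*)
    FROM result JOIN app ON result.appid = app.id
    GROUP BY app.name, result.server_state
"""

# Placeholder connection details, not our real setup; we'd aim this at the
# replica so the master doesn't feel it.
db = MySQLdb.connect(host="replica-db", user="status", passwd="secret", db="boinc")
cur = db.cursor()
cur.execute(QUERY)
for app_name, state, count in cur.fetchall():
    print(app_name, state, count)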

In better news, it looks like we got our new home-grown NAS/RAID box working as we'd like, so we may start employing it sooner rather than later (thus freeing up lots of room/power in our server closet). Also, all the drive issues on our science database server over the past couple of weeks have been completely dealt with at this point. Well... there's one lingering corrupted index which we'll try to rebuild tomorrow during the outage.

I was actually out of the loop since Thursday as I went up to Seattle to play a gig on the main stage at the Microsoft Techready conference at Bell Harbor. Anybody around here attend that thing? Fun show/event, but the stage tent was completely inadequate and the entire band got soaked by rain and sea mist. I'm amazed none of us were electrocuted.

- Matt

5 Aug 2008 23:15:08 UTC

Today was another one of them "outage days" where we shut everything down to do basic weekly maintenance (database backup and whatnot). We had a particularly large task list this time around. A lot of it was fairly mundane - like moving/compressing files to make more room on various storage systems.

The sidious crash the other day did in fact break the mysql replica again. No big deal, but that meant recreating the database from the master - a seemingly weekly occurrence. It's easy to do, just adds extra time to the whole operation.

Also, we tried to fix that broken index on the science database. We found the corruption was actually not on the RAID system we thought (the one that required a drive replacement). Huh. Anyway.. the index repair on the whole table was taking too long. We might just go ahead and drop/rebuild the specific index later now that we are more sure what's what.

We brought all our backend services (feeder, transitioner, validator, etc.) up to spec on current BOINC code for the first time in a long time, so we carefully turned these on one at a time to observe the logs/results and make sure nothing got all screwy with the updated code.

So we're back up, more or less. The current mystery is why we are using so much bandwidth. Too many factors at play to make a clear determination - lots of known network bottlenecks, lots of database bottlenecks, unknown Astropulse behavior, etc. We'll give this a closer look tomorrow after (hopefully) some of the traffic jams disappear.

- Matt

6 Aug 2008 21:11:48 UTC

Generally speaking, the wealth of issues we've been experiencing was simply due to Astropulse adding about 10-20 more Mbits/sec to our general average. This was a little higher than we expected, hence the initial air of mystery, but still quite within our abilities given current infrastructure. This traffic might go down a bit once everybody requesting their first Astropulse workunit has downloaded their single copy of the Astropulse client.

So this explains the big rush once we released the first workunits, and the longer "catching up" period, especially given that we were constrained all weekend by the lack of workunit storage space.

Today I've been mostly working on build scripts and testing recent database code fixes. Getting back on the "development" train for a bit... We are also close to getting that new home-grown NAS into production.

- Matt

7 Aug 2008 22:11:38 UTC

Towards the end of the afternoon yesterday we put in a new scheduler to fix a bug with "anonymous platform" clients and the way they handle Astropulse workunits. This is working fine as far as I know, but at first there were some brief issues with uploads in general (human error while installing the new scheduler).

Today we got our new NAS machine into the closet. We're close to removing the old NetApp filer, which still works great after so many years, but the drives are too small, we can't afford support on the system, and buying new replacement drives is prohibitively expensive. Plus the thing is just physically huge - a whole rack taking up a third of our closet for only 3 TB of raw space. We're replacing it with a 3U system that will ultimately have about 7 TB of raw space. Getting the new box into the closet also meant I was able to fire up another server-to-be in our prep lab today and get that configured.

Traffic-wise we're still trying to get a feel for our demand and our bottlenecks. Eric wrote a script that is busy deleting antique workunits/results that exist on disk but not in the database (not sure why the antique deleter built into BOINC isn't working...). This will clear up additional much-needed room, but it's pretty much all we can do short of getting a whole new workunit storage server.
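
Conceptually the script is just "walk the workunit/result trees and unlink anything the database no longer knows about". A hedged sketch of the idea (this is NOT Eric's actual script; the paths, age cutoff, and credentials are made up):

import os
import time
import MySQLdb

DOWNLOAD_DIR = "/path/to/download"      # where workunit files live
MIN_AGE = 30 * 86400                    # never touch anything under 30 days old

db = MySQLdb.connect(host="boinc-db", user="cleaner", passwd="secret", db="boinc")
cur = db.cursor()

now = time.time()
for root, dirs, files in os.walk(DOWNLOAD_DIR):
    for name in files:
        path = os.path.join(root, name)
        if now - os.path.getmtime(path) < MIN_AGE:
            continue                    # too young to be an "antique"
        cur.execute("SELECT id FROM workunit WHERE name = %s", (name,))
        if cur.fetchone() is None:      # on disk, but unknown to the database
            os.unlink(path)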

Looks like web code was updated just now, breaking a thing or two. I think Dave's addressing that stuff. I've been mostly catching up on several behind-the-scenes programming projects today.

- Matt

Reply posted 2008-8-8 08:24:43:
Thanks for your hard work, moderator! Too bad my English isn't very good.

25 Aug 2008 22:56:00 UTC

I've been out for a couple of weeks. I really need to get the others around here to chime in while I'm away, but it's hard to convince people who aren't as hypergraphic as I am. Anyway, it seems like, whatever happened, most everybody survived. Another problem: what I end up blathering on about in these posts is hardly comprehensive, and it's given arbitrary priority based on whatever is on my mind at the time. This can be confusing, I imagine.

I might also just go ahead and start posting here only when I really need to (during *real* server issues) and post less important day-to-day things in the blog. We'll see how that goes. It might help keep specific issues contained to one meaningful thread.

In any case, a brief rundown of the past two weeks: a drive failed on the workunit storage server. Usual drill there, except the server hung after the failure; once rebooted, though, it recovered just fine using a spare drive. Outside of that there were more minor issues (another server hung and required a reboot, the mysql replica stopped for no apparent reason and took a few days to catch up, etc.) causing various queues to drain or fill too fast, bottlenecks got exercised, and we had a couple of temporary complete/partial public server outages... all told, nothing out of the ordinary. We are still running a bit "hot" due to the Astropulse release - by "hot" I mean we're using far more storage/network resources than we'd like, but we're otherwise okay.

Going back to catching up from the absence...

- Matt

26 Aug 2008 22:53:45 UTC

Ah, yes - here we go again - the regular Tuesday outage for mysql database backup/compression and other tasks better suited to happen during "quiescent" time.

For example, this week we replaced the failed drive in the workunit storage server with a new one. That was painless. We also spent a bunch of time experimenting with the new-ish RAID server. I say "new-ish" as it's new to us, but it is an old system. For example, it can't handle logical volumes greater than 2TB. We did, however, confirm today that (a) it can handle single physical drives at least 750GB in size, and (b) it can handle physical volumes greater than 2TB (i.e. putting three 750GB drives together to make a 1.5TB RAID5).

We also confirmed that this system keeps up pretty well doing a continual backup of our upload directory. That is, we're doing a constant rsync of the upload directory to keep a "hot backup" around on a separate system. We didn't have the bandwidth/storage capacity to do this before (and daily backups to tape were too expensive).
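
There's nothing fancy about the "hot backup" - it's basically rsync in a loop. A sketch, with placeholder paths/host and a made-up interval:

import subprocess
import time

SRC = "/path/to/upload/"                  # trailing slash: copy the contents
DST = "backup-host:/path/to/upload_mirror/"

while True:
    # -a preserves permissions/times; --delete keeps the mirror from growing
    # without bound as results get deleted from the live directory.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=False)
    time.sleep(600)   # every ten minutes is plenty for a safety copy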

Anyway... the extended length of the outage today was mostly due to revamping the way we do the backups. We're working to include better query blocking (to ensure the database is totally update-free while the backup runs) and to figure out the best way to maximize our time, thus ultimately shortening these outages.
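
The "query blocking" part is basically just making sure nothing can write to mysql while the dump runs. In practice the project daemons are shut down first anyway; the dump itself can also enforce a lock, something like this sketch (host/credentials/paths are placeholders):

import subprocess

# --lock-all-tables takes a global read lock for the duration of the dump,
# so no updates can sneak in mid-backup.
with open("/backup/boinc_weekly.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "--host=boinc-db", "--user=backup", "--password=secret",
         "--lock-all-tables", "--all-databases"],
        stdout=out, check=True)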

- Matt

28 Aug 2008 22:51:58 UTC

We have a lot of servers in play around here, and once in a while the operating system on one particular server falls far enough behind in spec that the best move is to do a clean reinstall of the latest OS version from DVD (as opposed to trying to do 3 or 4 separate upgrades over the net, one revision at a time). Such was the case with vader, and I bit the bullet yesterday and tackled that project. It mostly acts as a compute server and a redundant download server, so it wasn't really missed for the 24 hours it was offline. Only one annoying snag: we have a lot of systems already running this OS, but this was our first 64-bit clean install from DVD, and it turns out there's a package dependency bug that caused the install to crash until I figured out the offending package and left it off the list. This morning I wrapped up the work and the machine is back online. That's good, but I still have a few more servers needing similar upgrades.

This summer we have a volunteer undergrad, Luke, working on radar blanking code. Background: our multibeam data is inundated with military radar noise of semi-predictable rate and frequency. Data collected since early 2008 has a "blanking signal" embedded by Arecibo within the raw data, so we can easily tell when the radar is on or off and ignore the loud noise. What Luke's working on is a program that analyzes pre-2008 data to retroactively find the radar noise and recreate a similar "blanking signal" so we can clean that data up too. We (me, Jeff, Eric, and Luke) had a code walkthrough yesterday. So far, so good. In the process of writing this program Luke also found phase issues, even with the Arecibo blanking signal, which is probably why we still get overflow workunits from time to time. So there's still a little work to be done. When we have an observatory on the dark side of the moon, this won't be a problem. Don't see that happening anytime soon, though...
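
Luke's program is considerably more involved than this, but the core idea - find the loud, periodic radar pulses and mark a window around them as un-analyzable - can be sketched in a few lines. This is a toy illustration only (the threshold, padding, and crude noise estimate are all made up, and the real code has to recover the radar timing far more precisely):

import numpy as np

def blanking_mask(power, threshold_sigma=5.0, pad=64):
    """power: 1-D array of per-sample (or per-block) total power.
    Returns a boolean mask, True where the data should be ignored."""
    baseline = np.median(power)
    noise = np.median(np.abs(power - baseline))   # crude robust scale estimate
    hot = power > baseline + threshold_sigma * noise
    mask = np.zeros(len(power), dtype=bool)
    for i in np.flatnonzero(hot):
        mask[max(0, i - pad):i + pad] = True      # pad around each radar hit
    return mask

# usage sketch: mask = blanking_mask(np.abs(raw_samples) ** 2)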

Still messing around with this new/old NAS system. It's becoming a real time sink. Lots of waiting through long reboots, then trying to figure out why X or Y isn't working as expected.

I don't come into the lab on Fridays, and Monday is a national holiday. So signing off for a few days...

- Matt

2 Sep 2008 22:16:36 UTC

As I write this we're recovering from the weekly outage (during which we take care of database backups and other sundry server details). It may take a while...

This past Friday we overloaded our science database trying to create a new index. A database engine restart solved the problem, but not before choking the whole local network. As mentioned in many posts past, we're strangely sensitive to heavy network load (I think due to Linux's imperfect handling of NFS dropouts), and such periods cause random unexpected events. This time, for example, the bottleneck from the primary science database server ultimately caused the BOINC/mysql replica server to disconnect from the master, so the replica fell behind all weekend. Sigh. Instead of actually letting it catch up we're just re-mirroring it from the master, since we backed the master up this morning anyway.

Meanwhile, we're out of space again on the workunit server, with no fast/easy way to add more. Eric's playing with the splitter mix to reduce the number of Astropulse workunits being generated (they are much larger than SETI@home workunits). Maybe that will help, but not immediately. This is what's mostly causing our headaches today, as we can't create enough work to keep up with demand.

- Matt

4 Sep 2008 19:48:30 UTC

The good news is that recent woes due to lack of workunit disk space have seemingly passed for now. We're still on the very edge of our capacity, but now that we're prioritizing the smaller regular workunits (as opposed to the big Astropulse workunits) we were able to build up a ready-to-send queue and network traffic stabilized overnight.

The less-good news is that we still need to build some indexes on the science database. We're building one now, and it usually takes 12-24 hours. This adds a lot of CPU and disk I/O load to the science database server, meaning the splitters can't add rows as fast as usual, nor can the assimilators. So the ready-to-send queue drops and the assimilator queue rises. As an added bonus, when the assimilator queue rises the deleters slow down, which means the available workunit disk space shrinks, and we're back to square one again. No big deal as long as people are patient. All the backend services are doing the best they can until the index build finishes, and then we should catch up again.

- Matt

8 Sep 2008 23:02:11 UTC

The triplet table in the science database has been a headache for over a week now. We've been trying to add some indexes to it, but this has been mysteriously filling up some kind of logical space (not physical space) such that new triplets couldn't be inserted. This has also been adversely affecting the science database replica. For now we're giving up on the indexes and letting triplet insertions continue, and allowing the replica to recover.

Internal discussions continued today regarding what to do next as far as general storage. As mentioned often recently, we're low on workunit storage - the crux of most of our recent public server problems. We just got some disks in the mail today which were slated for our new home-made NAS box, but we might instead aim these at workunit storage somehow. Testing will commence tomorrow during the outage, as will several other server-related tests/upgrades.

To clear up some confusion: a lot of the raw data files shown on the server status page are reporting errors. This is somewhat misleading, as these errors all happen at the very end of the particular file/channel. So it's not like we're losing half our data - only about one tenth of a percent. What are the errors? At the very end of the raw data files, some channels are missing the radar blanking signal, so it's impossible to remove the RFI. These channels exit in error, though there's nothing we can do about it. We have taken steps to try to reduce the number of files that end this way.

- Matt

9 Sep 2008 22:36:13 UTC

Tuesday means downtime. Same drill that happens every week: the projects go down for a few hours, the mysql databases are washed, dried, and neatly folded, and then we're back online sometime in the afternoon (Pacific Time). Some people don't like the scheduling of these outages, but as it happens NERSC (where we archive all our raw data off site) has its weekly maintenance outage at the exact same time. There's something about Tuesday morning that makes it particularly good for maintenance downtime: it's not Monday, when we're catching up on weekend issues, but it's still early enough in the week to recover from potential problems should any arise.

We tackled several other projects during the outage, as we always try to do. We upgraded the OS on sidious (the mysql replica db server), which was long overdue. There's lots of configuration involved, but with extra care the software RAID partitions containing the database survived the ordeal. We also tested some 750GB drives in one storage server - we're still trying to figure out what we have and what we can use given our current storage needs (for workunits, results, or the less interesting but equally important things kept on the NAS box that will soon disappear). I also finished getting a new desktop installed, replacing the old clunker which had been our "mass mail" server (for reminder e-mails and such). I'll wait until the current smoke has cleared before telling people to "please come back."

There are always other work items too confusing to mention here. In fact I leave a lot of happenings/details out of these glib tech news posts, as they would only raise more questions which I don't have time to answer. Sometimes I'm cagey with my responses for political reasons - occasionally we have commercial vendors/anonymous donors/grant administrators involved in our decision-making processes, and occasionally I don't want to perpetuate the false impression that I call the shots around here (I just work here - and post a lot because I happen to suffer from hypergraphia). I understand this vagueness is to the detriment of those who have a generally good understanding of the big picture and are keen to guess what our motivations and needs are, but without key bits of information people sometimes end up a tad off base. Nevertheless, it is amazing to me how much people glean from the scant amount of public relations material we barely manage to squeak out.

- Matt