Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted on 2008-3-14 08:58:58

13 Mar 2008 21:25:40 UTC

A few small items today.

Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we'll see. Having the indexes on a different volume can only help.

We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you're reading this, thank you!). Eric got this hooked up to a test server pretty quickly - it's pretty sweet. We'll get this in the closet sometime next week, and then we'll have the ability to reboot systems from home, which should minimize downtime over the long haul.

With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" features again early next week and see if we can handle it.

I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting.

- Matt

OP | Posted on 2008-3-15 07:53:01

14 Mar 2008 17:52:11 UTC

We turned off the resend of old workunits on client reset because of the huge I/O load it put on the MySQL database - it was slowing down result validation, the database's main function.

We have done a number of things to improve database performance and reduce I/O rates, and we hope to turn the resend feature back on in the near future for a test period.

If the I/O load proves manageable, the feature will remain enabled.

OP | Posted on 2008-3-19 12:34:04

18 Mar 2008 21:15:54 UTC

Today during the outage I installed the new network KVM in the closet and hooked up one of the servers. We're waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24-port D-Link gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to handle the stuff that doesn't talk gigabit anyway - UPSes, service processors, older servers...

Bill, who donated our previous and current routers, came by to pick up the 2811 we're no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading.

Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery).

- Matt

OP | Posted on 2008-3-25 14:26:07

24 Mar 2008 22:28:55 UTC

Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection).

There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything! We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery.

Another minor gripe: I spent a man-day last week testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same spare group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper is down two spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being utilized. Long story short - I actually recompiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked - so apparently this bad behavior has been fixed. But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting for the packaged update rather than deploy our own hand-built, less trusted copy, even if it seems to work better.
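For anyone curious what exercising the spare-group feature looks like, here is a rough sketch - hypothetical device names and paths, Python simply shelling out to mdadm, and not the exact test cases used here. The two ingredients are tagging the arrays with the same spare-group= entry in mdadm.conf and keeping mdadm --monitor running, since the monitor process is what actually migrates spares between arrays.

```python
#!/usr/bin/env python3
# Rough sketch of a spare-group test, assuming two already-built test arrays:
# /dev/md10 (has a hot spare) and /dev/md11 (does not), both on loop devices.
# Device names and the config path are hypothetical - adjust for your setup.
import subprocess, time

MDADM_CONF = "/etc/mdadm.conf"   # assumed location

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.check_output(cmd).decode()

# Tag both arrays with the same spare group so the monitor may move the
# spare from md10 to md11 when md11 degrades. (Real entries would also
# identify each array by UUID, e.g. from `mdadm --detail --brief`.)
with open(MDADM_CONF, "w") as f:
    f.write("ARRAY /dev/md10 spare-group=pool\n"
            "ARRAY /dev/md11 spare-group=pool\n"
            "MAILADDR root\n")

# Spare migration only happens while mdadm is running in monitor mode.
monitor = subprocess.Popen(
    ["mdadm", "--monitor", "--scan", "--config", MDADM_CONF, "--delay", "10"])

try:
    # Fail a member of the array that has no spare of its own...
    run(["mdadm", "--fail", "/dev/md11", "/dev/loop3"])
    # ...then watch whether md10's spare shows up in md11 and starts rebuilding.
    for _ in range(12):
        time.sleep(10)
        detail = run(["mdadm", "--detail", "/dev/md11"])
        if "spare rebuilding" in detail or "Rebuild Status" in detail:
            print("spare migrated - rebuild in progress")
            break
    else:
        print("no spare migration observed")
finally:
    monitor.terminate()
```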

- Matt

OP | Posted on 2008-3-28 09:04:12

27 Mar 2008 22:40:40 UTC

There's not much news to report on the technical front - but that doesn't mean I haven't been busy. I've mostly been engrossed in tasks that have little effect on the public servers, so anything I've been working on is either (a) too complicated to describe to everybody's satisfaction (including my own), or (b) relatively uninteresting.

I've been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it's not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week.

Tomorrow is a University Holiday, thus making this a three-day weekend. Perhaps we should start an office pool on which server will croak at midnight tonight.

- Matt

OP | Posted on 2008-3-29 15:19:29

29 Mar 2008 5:16:39 UTC

I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing.

In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check-in from home that we weren't sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection, but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much, I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there's no option except to drive up to the lab and kick the thing in person.

Except it's 10pm on a Friday night, and it's raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there's no guarantee any fix would work. And even if I did get it running, given current history there's no guarantee it would stay running through the night or the weekend, so I'm staying home.

Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don't count on it. I sent warnings to the others, but I'm not sure any of them will be free to go up to the lab. I have a gig tomorrow, so my next 36 hours are occupied.

- Matt

OP | Posted on 2008-4-1 14:00:49

31 Mar 2008 21:46:51 UTC

The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not clearly understood. The reigning theory is there's some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn't too hard to imagine, as there are some servers sitting right on top of it, pressing ever-so-slightly down on the server's faceplate. A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: our servers come in all sorts of non-standard sizes and shapes, and therefore we aren't so much properly racking them as stacking them.

One of the upshots of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which holds the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself). This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting, no less (when nobody was actually sitting at a computer doing any tweaking).

I'm leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement.

- Matt

OP | Posted on 2008-4-2 08:45:44

1 Apr 2008 22:15:39 UTC

Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it's gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard. So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we'll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate.

No other real big shakes about today's outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn't broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I'd still like an explanation, however).

Happy April Fools, by the way!

- Matt

OP | Posted on 2008-4-3 08:46:50

2 Apr 2008 22:54:30 UTC

So far so good, running with the faceplate off the workunit download server. If this remains the case we'll get a free replacement faceplate from Adaptec. This little exercise has driven home what a bad single point of failure this server is - if we actually lost all the data, it wouldn't be a scientific disaster, but it would be a BOINC disaster: there would be hundreds of thousands of workunits "in the field" that no longer exist and are no longer verifiable. We could regenerate the workunits, but it would be a big waste of CPU time, not to mention a public relations disaster (not like we haven't weathered those before).

Remember radar blanking? Here's a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. The radar bursts are short but randomly timed, which makes them very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data at exactly that time. Among other things, we've been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently, and this morning he had some final data and plots comparing workunits sent to our clients with the radar blanking and without. Looks like we solved the problem. Expect slightly less RFI in workunits, on average, in the near future.
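For the curious, here is roughly what "blanking" amounts to in practice - a minimal NumPy sketch of the idea, not the splitter's actual code (which works on the raw multibeam format). The function name and the choice of overwriting the flagged samples with noise are illustrative assumptions; the key point is that the hardware blanking signal tells us exactly which stretches of data to throw away.

```python
import numpy as np

def blank_radar(samples, radar_on, rng=None):
    """Return a copy of `samples` with the stretches flagged by the
    radar-blanking signal overwritten by random noise.

    samples  : 1-D array of raw time-domain samples
    radar_on : boolean array of the same length, True while the radar fires
    """
    rng = rng or np.random.default_rng()
    out = samples.copy()
    # Overwrite only the flagged samples; the noise is scaled to the overall
    # sample level so the blanked stretches don't stand out statistically.
    out[radar_on] = rng.normal(0.0, samples.std() or 1.0,
                               size=int(radar_on.sum()))
    return out

# Toy example: a burst of strong interference in the middle of the data.
rng = np.random.default_rng(0)
data = rng.normal(size=1_000_000)
mask = np.zeros(data.size, dtype=bool)
mask[400_000:410_000] = True      # where the blanking signal says "radar on"
data[mask] += 50.0                # the interference itself
cleaned = blank_radar(data, mask)
```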

With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we're collecting a lot more data than we originally intended, which means we can't seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We're going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process.

Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren't in J2000 so much as J1993 (the observatory's multibeam receiver code had coordinate precession built in, unlike the classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database.
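To make the "easy to revert" part concrete: if the same precession rotation was applied twice, applying the inverse of one of those rotations recovers the intended J2000 coordinates. Here's a minimal sketch of that idea using astropy - my own assumption for illustration, not the actual precess-fix program (which works directly on the science database), and the observation epoch shown is just a made-up example value.

```python
from astropy.coordinates import SkyCoord, FK5
import astropy.units as u

def undo_extra_precession(ra_deg, dec_deg, obs_epoch="J2007.5"):
    """Undo one of the two precessions applied to a coordinate pair.

    The stored (doubly precessed) values are interpreted as J2000 numbers,
    then rotated back from J2000 to the observation-time equinox; the
    resulting numbers are what a single, correct precession to J2000 would
    have produced (assuming the same precession model both times).
    """
    stored = SkyCoord(ra=ra_deg * u.deg, dec=dec_deg * u.deg,
                      frame=FK5(equinox="J2000"))
    fixed = stored.transform_to(FK5(equinox=obs_epoch))
    return fixed.ra.deg, fixed.dec.deg

# Example: correct one doubly precessed (RA, Dec) pair.
ra_fixed, dec_fixed = undo_extra_precession(201.365, -43.019)
```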

Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I'll send more each day.

- Matt

OP | Posted on 2008-4-4 08:43:39

3 Apr 2008 21:31:19 UTC

Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I do) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work despite update conflicts) will still be an adequate safety net until we re-copy its database from the master early next week.

Meanwhile, I spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) that occupy most of our time - eating up a half hour here, a half hour there... Of course, before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, O'Reilly manuals (or good ol' K&R) lying open to specific pages on every possible flat surface, empty soft drink containers...

In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing.

- Matt

OP | Posted on 2008-4-9 10:49:50

8 Apr 2008 23:43:16 UTC

Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage.

Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with an error, thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair, so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow.

We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. Long story short, data buffers are collected and stored in pairs, one of which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time. So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This was treated as a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are no longer critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file.
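Here is a toy sketch of that reworked pairing logic, with hypothetical field names (the real buffers obviously aren't Python dicts): the point is simply that an incomplete pair at the start of a file gets skipped with a warning instead of being treated as a fatal splitter error.

```python
from collections import defaultdict

def pair_buffers(buffers):
    """Group raw-data buffers into (carrier, sibling) pairs.

    Each buffer is assumed to be a dict with hypothetical fields:
      'pair_id'      - which pair the buffer belongs to
      'has_blanking' - True for the buffer that carries the radar blanking
                       signal for both members of the pair
      'data'         - the payload
    Yields complete pairs; incomplete or malformed pairs (e.g. a half pair
    whose sibling fell into the previous file) are skipped, not fatal.
    """
    groups = defaultdict(list)
    for buf in buffers:
        groups[buf["pair_id"]].append(buf)

    for pair_id, members in sorted(groups.items()):
        carriers = [b for b in members if b["has_blanking"]]
        siblings = [b for b in members if not b["has_blanking"]]
        if len(carriers) == 1 and len(siblings) == 1:
            yield carriers[0], siblings[0]
        else:
            print(f"skipping incomplete pair {pair_id} "
                  f"({len(members)} buffer(s))")
```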

- Matt

OP | Posted on 2008-4-10 10:55:02

9 Apr 2008 21:24:22 UTC

Continuing on from yesterday's tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We'll start the replica once it's ready and it should catch up as well.
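As a general illustration of the dump-and-rebuild step (not necessarily the exact commands used here - the host names, credentials, and paths below are made up), recreating a replica from a master dump usually boils down to three steps: dump the master with its binlog coordinates recorded, load that dump on the replica, then start replication and let it catch up.

```python
# Sketch of rebuilding a MySQL replica from a master dump, assuming classic
# MySQL 5.0-era replication. All host names, users, passwords, and paths are
# hypothetical placeholders.
import subprocess

DUMP = "/tmp/boinc_master.sql"

# 1. Dump the master; --master-data embeds a CHANGE MASTER TO statement
#    carrying the binlog file/position at the moment of the dump.
with open(DUMP, "w") as out:
    subprocess.check_call(
        ["mysqldump", "--master-data=1", "--all-databases",
         "-h", "master-host", "-u", "repl_admin", "-pSECRET"],
        stdout=out)

# 2. Load the dump on the replica (assumes replication is currently stopped);
#    the embedded CHANGE MASTER TO points it at the right binlog position.
with open(DUMP) as dump:
    subprocess.check_call(
        ["mysql", "-h", "replica-host", "-u", "repl_admin", "-pSECRET"],
        stdin=dump)

# 3. Start replication; the replica then replays everything the master has
#    done since the dump and catches up.
subprocess.check_call(
    ["mysql", "-h", "replica-host", "-u", "repl_admin", "-pSECRET",
     "-e", "START SLAVE;"])
```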

Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there's no space available on the splitter system? Things like that. So I'll be coding up more robust scripts in the near term.

- Matt

OP | Posted on 2008-4-11 09:11:39

10 Apr 2008 17:53:43 UTC

We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive. The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there's a chance it'll be operational, meaning short glitches instead of multi-hour outages. That's the hope anyway. We might actually test that later today (if it doesn't reset itself on its own). There was discussion about how to implement a second workunit storage server so we don't have this single point of failure anymore. Not as easy as it sounds.

- Matt

OP | Posted on 2008-4-15 08:24:58

14 Apr 2008 19:03:42 UTC

Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives had failed to declare the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing exactly that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday).

So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt

OP | Posted on 2008-4-17 08:33:07

16 Apr 2008 21:34:36 UTC

So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody's looking at the traffic graphs and noticing a "bump" over the last hour or so - that's us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended).

Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we've been saying that for a while, but it's getting truer every day. Lots of big things coming down the pike.

- Matt
