BiscuiT posted on 2009-1-7 19:54:24

SETI@home Technical News 2009

Happy New Year! SETI@home enters 2009 - here's hoping the new year brings rewards to our search!

This thread reposts content from the Technical News page of the official SETI@home website.
Dedicated message board for the official Technical News: http://setiathome.berkeley.edu/forum_forum.php?id=21

BiscuiT posted on 2009-1-7 19:58:56

7 Jan 2009 0:06:20 UTC
It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares, which are always welcome. I also rolled up my sleeves and drew up a brand new power map of the closet, which was until now sorely out of date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet.

Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running.

Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits.
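To make the confusion concrete, here's a minimal sketch of the kind of signal-by-signal comparison a BOINC-style validator performs; the field names, tolerances, and the 30-signal overflow cap are illustrative assumptions, not the actual SETI@home code:

```python
OVERFLOW_SIGNAL_CAP = 30  # assumed cap at which a client stops and reports "overflow"

def signals_match(a, b, freq_tol=0.01, power_tol=0.05):
    """Two detected signals agree if frequency and power fall within tolerances."""
    return (abs(a["freq"] - b["freq"]) <= freq_tol and
            abs(a["power"] - b["power"]) / max(a["power"], b["power"]) <= power_tol)

def results_are_equivalent(result_a, result_b):
    """Roughly how a validator decides two results for one workunit agree."""
    a_overflow = len(result_a) >= OVERFLOW_SIGNAL_CAP
    b_overflow = len(result_b) >= OVERFLOW_SIGNAL_CAP
    if a_overflow != b_overflow:
        # One client hit the signal cap and the other didn't - exactly the
        # cuda vs. non-cuda disagreement being investigated above.
        return False
    matched = sum(1 for a in result_a
                  if any(signals_match(a, b) for b in result_b))
    return matched >= 0.5 * max(len(result_a), 1)  # illustrative threshold
```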

By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between new year's day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general.

- Matt


The team has found the validator bug that was causing confusion when comparing CUDA and non-CUDA workunits, and is still investigating why some CUDA clients overflow on certain workunits.

BiscuiT posted on 2009-1-8 11:25:37

7 Jan 2009 23:56:34 UTC

Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery.

But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while.

I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left, thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting on my shuttle enclosure, and even though it's perfectly level it seems it's slanting to the right. Talk about accommodation - my brain really got used to the old lean.

- Matt

Programming work on radar blanking was held up by the holidays and some unexpected incidents, but it has now firmly entered the implementation schedule.

BiscuiT posted on 2009-1-9 08:59:48

8 Jan 2009 22:26:13 UTC

I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work.

The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science, we're hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now.
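As an illustration of the failure mode, a sketch like the following could watch for the guilty query shape before the traffic graph hits zero; the session tuples are sample data standing in for whatever actually enumerates informix sessions:

```python
import time

STUCK_THRESHOLD_SECS = 600  # assumption: a ten-minute sort is suspicious

def find_suspect_sessions(active_queries, now=None):
    """active_queries: iterable of (session_id, sql_text, started_at) tuples."""
    now = now or time.time()
    return [(sid, sql) for sid, sql, started in active_queries
            if "order by" in sql.lower() and now - started > STUCK_THRESHOLD_SECS]

# Sample data standing in for a real informix session listing.
sessions = [(101, "SELECT * FROM signal ORDER BY power DESC", time.time() - 3600)]
for sid, sql in find_suspect_sessions(sessions):
    print("session %d may be stuck sorting: %s" % (sid, sql))
```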

- Matt

BiscuiT posted on 2009-1-13 10:04:31

12 Jan 2009 23:58:40 UTC

A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric are looking into that. This morning was a little weird. An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various and sundry items in our secondary lab - which wouldn't have been a big deal, but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects.

Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up.

Back to work - which means plotting lots of radar data for me.

- Matt

The bug behind the CUDA triplet overflows has been found, a workaround has been identified for blue-screen crashes on slower GPUs, and new client and application releases will follow soon.


BiscuiT posted on 2009-1-14 19:12:34

13 Jan 2009 22:58:50 UTC

Typical weekly outage (for database cleanup/backup). During it, Jeff and I did some more server closet reconfiguration - we consolidated all the Overland Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things.

Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should be working on the latter right now. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput.

- Matt

BiscuiT posted on 2009-1-15 11:26:07

15 Jan 2009 0:09:47 UTC

Today I started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data, each shuffle takes a long, long time. In fact, we're kinda stuck on this project until tomorrow. Anyway.. the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage.
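The capacity hit is easy to quantify. A back-of-the-envelope calculation, assuming 18 equal drives (the per-drive size below is an illustrative figure, not from the post):

```python
drives, gb_per_drive = 18, 500  # drive size is an assumption for illustration

# Three 6-drive RAID5s each lose one drive's worth of capacity to parity.
raid5_usable = 3 * (6 - 1) * gb_per_drive
# RAID10 mirrors every drive, halving raw capacity but doubling read spindles.
raid10_usable = (drives // 2) * gb_per_drive

print("3x 6-drive RAID5: %d GB usable" % raid5_usable)   # 7500 GB
print("18-drive RAID10:  %d GB usable" % raid10_usable)  # 4500 GB
```

The 40% drop in usable space buys striping across all 18 spindles and random writes that avoid RAID5's read-modify-write parity penalty.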

Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all.

- Matt

BiscuiT posted on 2009-1-16 09:09:36

15 Jan 2009 21:45:17 UTC

This morning we moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then.

Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should have been a non-issue, except that due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distributions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it.

The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that.
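The restart-every-10-minutes behaviour sounds like a simple watchdog loop. A minimal sketch of that pattern (the binary path is a placeholder; this is not the project's actual script):

```python
import subprocess
import time

RESTART_INTERVAL = 600  # seconds between restart attempts

while True:
    proc = subprocess.Popen(["/path/to/ap_validator"])  # placeholder path
    proc.wait()
    if proc.returncode < 0:
        # Negative return codes mean death by signal; 11 is SIGSEGV.
        print("ap_validator died on signal %d, restarting" % -proc.returncode)
    time.sleep(RESTART_INTERVAL)
```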

- Matt

BiscuiT posted on 2009-1-21 12:07:52

20 Jan 2009 22:58:04 UTC

Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events.

Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days).

A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support.

- Matt

BiscuiT posted on 2009-1-22 09:53:25

21 Jan 2009 22:18:50 UTC

The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested.

As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another.
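Given the bad-DIMM theory, a copy routine that verifies checksums end-to-end would have caught the corruption before release. A minimal sketch of that kind of defensive copy:

```python
import hashlib
import shutil

def checked_copy(src, dst, chunk=1 << 20):
    """Copy src to dst, then re-read both files and compare MD5 digests."""
    shutil.copyfile(src, dst)

    def md5(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    if md5(src) != md5(dst):
        raise IOError("checksum mismatch copying %s -> %s" % (src, dst))
```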

Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already.

- Matt

The last CUDA client release got a bit messed up... users saw checksum errors because some of the distributed files were corrupted, and results computed from them weren't of much use.

BiscuiT posted on 2009-1-23 11:46:20

22 Jan 2009 23:34:01 UTC

We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts as we can from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves?

I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode.
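The post doesn't describe the blanker's internals, but the general idea of radar blanking can be sketched in a few lines: find samples whose power spikes far above the background (radar pulses are strong) and zero them plus a small guard band, so they can't masquerade as candidate signals downstream. Everything below (threshold, guard width) is an illustrative toy, not Matt's code:

```python
def blank_radar(samples, threshold=10.0, guard=2):
    """samples: list of power values; returns a copy with pulses zeroed out."""
    mean = sum(samples) / len(samples)
    out = list(samples)
    for i, s in enumerate(samples):
        if s > threshold * mean:
            # Blank the pulse plus a small guard band on either side.
            for j in range(max(0, i - guard), min(len(out), i + guard + 1)):
                out[j] = 0.0
    return out

print(blank_radar([1, 1, 1, 120, 1, 1, 1, 1]))
```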

- Matt

The raw data drives continue to act up, and the source of the problem is still unclear...
The software radar blanker is now running on a trial basis; it's not yet known whether its output is worth anything, but at first glance it looks pretty good.

BiscuiT posted on 2009-1-27 10:34:20

26 Jan 2009 23:17:39 UTC

Due to various bugs on the scheduler/client side of things, some users have been getting far too much work to do. This results in excess workunit downloads, which eats up our bandwidth and makes it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix was already deployed late last week; a client bug-fix is in the works.
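The essence of such a fix is bounding how much work any one host can have in flight. A toy scheduler-side guard might look like this (the cap and field name are made up, not actual BOINC scheduler structures):

```python
MAX_WU_IN_PROGRESS = 40  # assumed per-host cap, purely illustrative

def allowed_new_workunits(host):
    """host: dict counting results already sent out and not yet returned."""
    in_progress = host["results_in_progress"]
    return max(0, MAX_WU_IN_PROGRESS - in_progress)

print(allowed_new_workunits({"results_in_progress": 37}))  # -> 3
```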

I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while.

- Matt

BiscuiT posted on 2009-1-28 17:32:22

27 Jan 2009 22:40:49 UTC

Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while.

During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next.

Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else.

- Matt


BiscuiT posted on 2009-1-29 13:22:32

28 Jan 2009 23:24:18 UTC
Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - however successful this recovery is remains to be seen. It is just the replica, so no big shakes, really.

And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run the wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week.
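The manual fix amounts to finding the sessions running the offending query shape and killing them; informix can terminate a session with `onmode -z <session-id>`. A hedged sketch, with session discovery stubbed out as sample data:

```python
import subprocess

# Sample data standing in for however you'd list active informix sessions.
suspect_sessions = [(1234, "select ... order by power desc")]

for sid, sql in suspect_sessions:
    print("killing session %d: %s" % (sid, sql))
    # subprocess.check_call(["onmode", "-z", str(sid)])  # enable on a real server
```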

Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list.

- Matt

BiscuiT posted on 2009-1-30 11:36:47

29 Jan 2009 23:25:26 UTC

The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure.
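Lag is the easy half of replica health to measure (the subtle out-of-sync the post worries about is silent data drift, which only a rebuild or table checksums really settles). A quick monitoring sketch using the MySQLdb module; host and credentials are placeholders:

```python
import MySQLdb
import MySQLdb.cursors

# Placeholders for the replica host and monitoring credentials.
conn = MySQLdb.connect(host="sidious", user="monitor", passwd="...")
cur = conn.cursor(MySQLdb.cursors.DictCursor)
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchone()
print("Slave_SQL_Running:", status["Slave_SQL_Running"])
print("Seconds_Behind_Master:", status["Seconds_Behind_Master"])
```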

The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting, and the resulting core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests.
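The MaxClients reasoning reduces to simple arithmetic: keep the worst-case apache footprint inside physical RAM, or the box swaps and grinds to a halt. The figures below are illustrative assumptions, not the project's real numbers:

```python
ram_mb = 4096       # total RAM on the scheduler box (assumed)
reserved_mb = 1024  # headroom for the OS and everything else (assumed)
per_child_mb = 25   # resident size of one apache/fastcgi child (assumed)

max_clients = (ram_mb - reserved_mb) // per_child_mb
print("MaxClients <= %d" % max_clients)  # -> 122
```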

We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and has been offline for days. We'll see how this all pans out...

- Matt


The mysql replica database server recovered nicely, though it may be slightly out of sync with the master, so it will probably be rebuilt during next week's maintenance outage just to be sure.
The raw data is nearly all sent out, and the weekend is fast approaching...
We're still waiting on the latest batch of data drives from Arecibo; unanalysed data could be pulled from the archives, but the offsite archive (HPSS) has been offline for days undergoing upgrades. We'll see how it pans out...
