Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted on 2008-9-12 07:42:56

11 Sep 2008 22:08:04 UTC

So we hit that brick wall again with the science database - that is, when we try to create a new index it works fine on the primary server but then clogs up sending the new index pages to the secondary. This clog locks up the database, the splitters grind to a halt, the assimilators grind to a halt, i.e. fun for everybody!

We thought we were out of the woods yesterday afternoon but checking in at 1am last night (this morning?) I saw this all happening again, so I gave things a swift kick and went to bed. This morning, once we were all here at the lab, we decided to just bite the bullet this time and shut down all the splitters/assimilators and let the clog work through naturally on its own, which it did. We also took the down time to do an "update statistics" on one signal table (this helps re-sort current indexes for speedier lookups) and add disk space for said indexes. I just turned things back on, we'll be catching up for a while, etc.

I did do some qlogic card testing today, which got us over my "information gathering and training" hurdle, so we can upgrade the remaining two servers with old OS's in the coming weeks. We also got our homemade NAS configured, so we may get the old NetApp rack out of the closet maybe next week. It's still working quite reliably, but it's taking up a third of our closet space and a seventh of our power while delivering only 2 TB of raw disk space. Not really efficient, and we have a *lot* of servers waiting to get into the closet already.

- Matt

OP | Posted on 2008-9-16 09:02:22

15 Sep 2008 23:14:16 UTC

Happy Monday, everybody. We've been in a holding pattern all weekend, more or less, dealing with the usual constraints (not enough space for workunits, mostly). This morning was weird - something tripped the "stop all daemons" trigger on our back end, so we weren't sending out work for a couple hours until I noticed. Even then, restarting everything was blocked by the lack of space again.

On the bright side, we've been getting this homemade NAS box up (for use as general backup of stuff we don't want to waste time/money backing up to tape, as well as administrative stuff, home accounts, etc.). So far so good, and there's a lot of extra space on it to move the less-active beta downloads there thus freeing up space to make SETI@home/Astropulse workunits to keep up with demand. Woo-hoo! That'll break the dam, at least temporarily. We're still looking for a cleaner long term solution - several things are in the works on that front.

Other than that, spent a lot of today in meetings, installing high-end graphics cards (for CUDA development/testing), and writing scripts to kick the replica mysql database when it lags behind for no good reason.

- Matt

OP | Posted on 2008-9-17 20:49:50
16 Sep 2008 22:25:07 UTC
Another week, another database maintenance outage. This one was short but busy. We actually had major upgrade plans for one server, but feared this would take all day and lock out the servers, so we postponed it until next week, which may be less stressful.

Eric cleared a bunch of space on the workunit storage, so that bottleneck has been alleviated for now, i.e. we have elbow room to create enough workunits to keep up with demand. However, this leads us to the first of two mysteries today. You see, he's moving all the beta workunits to our new homemade NAS box (ptolemy). While this move has already been helpful, it's taking forever to complete. Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We'll either sort that out or eventually give up on this machine as anything more than archival storage.

The other mystery has been a known issue for some time, but with the down time we revisited the problem: our secondary science database server, bambi, works great except for the fact that upon reboot there's a random chance one or two (or three) drives simply don't show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It's never clear why this happens, or when it will happen, and when it does it's not always the same drives that disappear.

However, a full power cycle always works. The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we've been assuming there's some spin-up settings that need to be tweaked. There's been talk of making bambi the primary database server, so today we looked for those settings. Couldn't find them - nothing in the regular motherboard BIOS, and nothing useful in the 3ware BIOS - and the latter was moot because the drives would have already disappeared according to the 3ware BIOS, so all the spin-up problems are happening before the 3ware is aware. I find nothing about this in any documentation or on the web. It's not a showstopper, we can still use bambi as the backup that it is, but this pretty much means we'll never be able to fully trust bambi as a "main" server.

Oh yeah.. other stuff. The mysql replica croaked this morning just before we arrived - a partition on the server filled up. Apparently when upgrading the OS we missed a sym link somewhere. So the replica is resync'ing yet again. Also messing around getting the CUDA development/testing server up and running.

- Matt

OP | Posted on 2008-9-19 07:54:43

18 Sep 2008 23:30:22 UTC

Just checking in before the weekend. Not much super urgent to report. The mysql replica fell behind again as our alert scripts didn't exactly work as expected. When the replica lost connection to the master the "seconds behind master" diagnostic variable gets set to NULL, which my scripts interpreted as "zero" as in "zero seconds behind master" - which is usually optimal. Ha ha ha. Anyway, it didn't fall that far behind and is catching up now.
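The NULL-versus-zero trap is worth spelling out. A minimal sketch of a corrected check (the function name and 300-second threshold are made up for illustration, not from the actual scripts) - MySQL's `Seconds_Behind_Master` comes back as NULL when the replica has lost its connection to the master, which is the worst case, not "caught up":

```python
# Hypothetical sketch of a fixed lag check. Seconds_Behind_Master is NULL
# (None in Python) when the replica is disconnected from the master, which
# must never be treated as "0 seconds behind".
def replica_lag_ok(seconds_behind_master, threshold=300):
    """True only if the replica is connected and within the lag threshold."""
    if seconds_behind_master is None:
        return False  # NULL: replication broken or disconnected -> alert
    return seconds_behind_master <= threshold

assert replica_lag_ok(None) is False   # disconnected, not "zero seconds behind"
assert replica_lag_ok(0) is True       # genuinely caught up
assert replica_lag_ok(7200) is False   # lagging badly
```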

Otherwise I've been doing some data pipeline scripting updates - for example you may have noticed that the server status page no longer gets cluttered with files that finished "in error" - as mentioned in a previous post these files are finishing fine except for some "raggedness" at the very end. Also some fighting with sendmail, and moving servers around. I moved a rather heavy desktop server downstairs into a new office - while carrying it the weight was enough to keep me distracted from the fact the sharp corner was digging two bleeding holes into my wrist. No big deal - but I showed my wife the wound later and she said it looked like a snake bite, which was amusing as the offending server's name is "snake."

We also walked through Luke's radar blanking code today - he's back to school, so he was wrapping it up as best he could this week, and all our free resources were aimed at making this possible. His program is pretty much doing its job - in fact, it's detecting the radar in our data better than the embedded hardware radar blanking signal we currently use! Well, we'll confirm this with more analysis.

Thanks for the concerns/tips/suggestions regarding my previous post about the mysterious RAID controller card behaviour. Maybe I'll check jumpers/etc. next week.

- Matt

OP | Posted on 2008-9-23 08:03:32

22 Sep 2008 21:19:03 UTC

No big disasters over the weekend. However, it turns out one of the download servers had its root partition fill up yesterday due to faulty log rotation behaviour. I'm figuring that's why outbound traffic was spotty for a while. I had to clean that mess up this morning - I think we're out of the woods on that front, but the traffic graphs still seem kinda weird to me.
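A full root partition is the kind of thing a tiny cron-driven check catches early. A sketch of such a check (the paths and 90% threshold are illustrative, not the actual monitoring setup):

```python
import shutil

def partitions_over(paths, max_used_fraction=0.90):
    """Return the mount points whose filesystems exceed the usage threshold."""
    full = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.used / usage.total > max_used_fraction:
            full.append(path)
    return full

# e.g. run from cron and mail the result if non-empty:
# partitions_over(["/", "/var", "/disks/downloads"])
```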

I plan to upgrade the OS on server bruno tomorrow, and with that being the "hub" computer for BOINC in a lot of ways, the outage may be longer than usual. Hopefully not too long.

It is becoming clear that our hopes for the new NAS box we assembled aren't being realized - it's pretty slow. It is also clear that using thumper as both a raw data storage buffer and science database server isn't going to work out for much longer. The I/O on the machine is usually maxed out, and we need a better solution. Not sure exactly what that solution is yet.

I'm going to be prioritizing helping to implement the new radar blanking code, as Astropulse is kinda blocked until it's ready. Jeff's been working pretty hard on that, as the program required some changes to core data management routines without breaking currently working software. Once we're over the hump on that he (or we) can turn our attention back to the NTPCkr.

- Matt

OP | Posted on 2008-9-24 08:50:16

23 Sep 2008 23:17:01 UTC

We had the regular database maintenance outage today - no news there and we're recovering from that now. We have several backlogged data pipeline jobs adding much noise to our backend network, so progress is slower than normal.

We also planned to do some OS upgrading today but were blocked waiting for some backup jobs to finish. The influx of free time led me to do some extensive testing regarding our general bottlenecks as of late. I'll cut to the chase. We can blame RAID5 for pretty much everything. No real shocker there, but I was surprised by the extent of RAID5's lousy performance. In one example, a large file copied from temp space to a directly attached RAID5 partition took two minutes, and the same file copied over NFS to a remote RAID10 device took 6 seconds (file caching had nothing to do with it, in case you're wondering). While some systems handle RAID5 (or RAID4) much better than others, we simply can't afford the performance hit on the writes no matter how fast the parity bits are computed.

So why choose RAID5? Well, you get far more raw storage that way. But that's pretty much it as far as I care. Unfortunately in some cases (like our raw data storage buffer on thumper) we need every terabyte we can get. Seems kinda silly what with single terabyte drives readily available to the world, but spindle count is also quite important to us. In any case we have some convertin' to RAID10 ahead of us on several systems and the usual round of careful/paranoid testing. I don't think we have much of a choice in making some of thumper's partitions RAID10 as well, and that'll mean sometime in the future a planned outage of indeterminate length.

- Matt

OP | Posted on 2008-9-25 08:02:48

24 Sep 2008 21:12:42 UTC

Something we've been lagging on is separating the database count totals on the server status page. Currently we're showing "totals" - for example, the "results ready to send" is a sum of both SETI@home and Astropulse results ready to send. For diagnostic purposes, it would be much better to split these into two separate columns.

However, this isn't so easy, as such queries suddenly become very expensive if we add an additional "where appid = N" conditional (Astropulse and SETI@home are considered two different "applications" in the BOINC realm). I'm talking about the difference between a 3-second query and a 3600-second query. Yup. We've made joint indexes in the past for servers that needed them, but this hasn't been a priority for diagnostic stuff. We also don't really have the memory/resources to keep such extra indexes around. In any case, Bob pointed out that newer versions of mysql are smarter - doing the index joins automatically - so we may push on upgrading mysql sooner than later.
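The shape of the problem can be sketched with sqlite (the table and column names only loosely mimic BOINC's result table and are not the actual schema; the composite index is the point - with it, both the state filter and the per-app split are index lookups rather than a table scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Toy stand-in for a result table; pretend server_state 2 = "ready to send".
cur.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, "
            "server_state INTEGER, appid INTEGER)")
cur.executemany("INSERT INTO result (server_state, appid) VALUES (?, ?)",
                [(i % 5, i % 2) for i in range(1000)])

# The joint (composite) index that makes the filtered count cheap:
cur.execute("CREATE INDEX result_state_app ON result (server_state, appid)")

# "Results ready to send", split per application - one row per appid:
cur.execute("SELECT appid, COUNT(*) FROM result "
            "WHERE server_state = 2 GROUP BY appid")
counts = dict(cur.fetchall())
```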

Today I'm actually lost in mundane bureaucracy land. I also should be working on the new software radar blanking embedder code. Sigh.

- Matt

OP | Posted on 2008-10-1 10:09:09

30 Sep 2008 23:28:47 UTC

We had an extended outage today (more than the regular 3-4 hour database maintenance outage) to finally upgrade one of our core servers, bruno. Usually the OS upgrades are trivial, however this particular machine required a little extra TLC, due to its functional importance, as well as its unique (but admittedly not that unusual) hardware configuration. In regards to the latter, we basically put off upgrading this system until a modern day OS would automatically support its fibre channel card (as opposed to us having to compile drivers into the kernel, etc... blech...).

Anywho... there were no major failures during the long procedure (which included backing everything up, reconfiguring root RAID devices (while trying not to destroy others), then resetting all the network/RAID/apache/etc. services). It still took longer than it should due to a steady stream of minor annoyances (installer crash on first attempt, missing sym links that had to be discovered/recreated, missing packages to be installed, having to recompile every BOINC service due to standard library changes). Doesn't matter - it's done. Or at least done enough - there are still some screws to tighten which I'll tackle later.

So, we'll be catching up for a while. If at first you don't connect, let your client try again later.

- Matt

Posted on 2008-10-1 11:09:11

What is this stuff...?

Will SETI@home support CUDA at some point...?

Posted on 2008-10-1 12:15:01

Reply to post #129 by v724

There are no plans for that, at least in the short term.

Posted on 2008-10-1 20:14:53
CUDA is impressive, but SETI doesn't seem to have anything that actually needs CUDA computation, does it?

OP | Posted on 2008-10-2 11:18:57

1 Oct 2008 21:01:21 UTC

Random day. Fixed more stuff on bruno (which got upgraded yesterday), most notably the update_stats process which needed to be recompiled to find newer libraries. Also dealt with lots of internal data pipeline management. And some subversion repository cleanup (in preparation to possibly improve web page translations).

The big thing is that I finally got some time to reconfigure that one RAID5 system into RAID10 (effectively), and the write rates increased by over 16x. Now we're talking. As we get more disk space to work with, we'll pretty much convert all our RAID5's to something else to help get beyond several backend IO bottlenecks. I know this sounds like we only now just discovered the joys of non-parity-based RAID systems, but - like most things around here - we are always firmly aware of better solutions but lack the resources to enact them. Pretty much all our RAID5 systems were built grudgingly because we needed the extra storage at the time.

- Matt

Posted on 2008-10-2 14:36:47
What's the news this time?

OP | Posted on 2008-10-4 00:46:24

2 Oct 2008 21:22:11 UTC

Not much to report, really. We had a couple blips or brownouts which were minor and easily corrected. Mostly spending my day working on R&D type stuff (mysql replication, radar blanking, etc.) and data pipeline management - this included boxing up freshly reformatted drives to ship to Arecibo.

One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already a mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will process (no redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast.
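The trust mechanism described above can be sketched roughly like this (the streak length and spot-check rate are invented for illustration, not BOINC's actual parameters):

```python
import random

# Hypothetical sketch of the "trusted host" idea: hosts with a long streak of
# validated results get single-copy work, with occasional random spot-checks
# that still go out redundantly. Both thresholds below are made up.
TRUST_STREAK = 20       # consecutive validated results needed to be trusted
SPOT_CHECK_RATE = 0.05  # fraction of trusted-host work still double-checked

def copies_needed(consecutive_valid, rng=random.random):
    """How many replicas of a workunit to send, given this host's streak."""
    if consecutive_valid < TRUST_STREAK:
        return 2                      # untrusted: keep a redundant wingman
    if rng() < SPOT_CHECK_RATE:
        return 2                      # occasional test that trust still holds
    return 1                          # trusted: single copy, no validation wait

assert copies_needed(0) == 2
assert copies_needed(50, rng=lambda: 0.5) == 1   # no spot check this time
assert copies_needed(50, rng=lambda: 0.0) == 2   # spot check triggered
```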

The cons are mostly concerns. If we try to keep up with current demand for work we'd have to run twice as many splitters, which is impossible given our current resources (we'd at least need more cpus, more disks, and better disk i/o). Or we could split at today's rate and regularly run out of work, which might upset some people. If we do increase our splitter production rate and burn through our data, we will even more likely run out of work on a regular basis (since we can't pad fresh data with old data if we used up the old data).

Just some thoughts for now. We haven't really decided on anything yet.

- Matt

OP | Posted on 2008-10-4 00:49:36

Reply to post #133 by v724

It's about the project's servers. Matt is a developer and administrator of the SETI@home project.
