Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted on 2008-2-8 08:45:05

7 Feb 2008 22:58:44 UTC

We're having little luck getting the science database server thumper to perform up to expectations. We determined that the fact it is both a database and raw data storage server isn't really the problem - the database alone is somehow constrained. Is it all the additional indexes we added recently? Extra load due to having to make logical logs for the replica? Something else entirely? Of course, while testing/tweaking, the OS root mirror drive on thumper failed. We got the notice from smartd but mdadm didn't notice, which was scary. We manually failed the mirror and brought in the hot spare, which is sync'ing up now. Anyway.. the assimilator queue is growing and there doesn't seem to be much we can do about it now, at least nothing drastic given it's the end of the week. We are sending out a lot of short work - maybe this will change soon and give us some relief.
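Since the details aren't in the post: the usual mdadm sequence for manually failing a mirror member and letting a hot spare take over looks roughly like the sketch below. The array and partition names are placeholders, not thumper's actual devices.

```python
# Minimal sketch of manually failing a mirror member that smartd flagged but
# mdadm still thinks is fine, then letting the hot spare take over.
# The array and partition names below are placeholders, not thumper's devices.
import subprocess

ARRAY = "/dev/md0"      # assumed md device for the OS root mirror
BAD_DISK = "/dev/sdc1"  # assumed member reported as failing by smartd

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["mdadm", ARRAY, "--fail", BAD_DISK])    # mark the suspect member failed
run(["mdadm", ARRAY, "--remove", BAD_DISK])  # remove it from the array

# With a hot spare already attached, md starts rebuilding onto it on its own;
# the resync progress shows up in /proc/mdstat.
print(open("/proc/mdstat").read())
```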

Other small news: recent splitter updates include (a) more realistic deadlines, i.e. they have been reduced 25%, and (b) radar blanking code - we're testing that now. There has also been a little bit of scheduler/upload server choking due to the aforementioned headaches - including one of the schedulers running out of work (as it runs faster than the other and therefore its queue depletes faster). Once again, we have little choice but to wait out the storm.

- Matt

OP | Posted on 2008-2-12 13:56:26

11 Feb 2008 22:48:02 UTC

Came into the lab this morning and it was well over 70 degrees. This may seem nice on a winter day, but (a) we have fairly warm winters here in the Bay Area, and (b) the usual temperature in the lab is closer to 60 degrees - even in the summer. This isn't great from a human perspective - we wear jackets while sitting at our computers all year round. From a hardware perspective, the extra cold lab air assists in keeping our systems nice and cool. This is why I was immediately concerned about the suddenly warmer air. Turns out a fuse blew over the weekend, and it was already repaired before anything came close to melting. Still.. a little bit of panic this morning.

Despite the load on our backend servers being on the low side (averaged over the past 5 days or so) the assimilator queue was barely able to shrink. In fact, it's growing again due to the Monday bump. My guess (and others'), which I already mentioned, is that the new science database indexes, which add more random reads/writes during inserts, are to blame. We're doing more aggressive analysis and will try some "low hanging fruit" type solutions before too long. Not a major tragedy just yet, especially as workunits may be generally less noisy in the near future. The scheduling/upload servers are also on the brink of disaster - they have short but nevertheless frequent periods of dropping connections. They too would benefit from less noisy workunits. Or more/better hardware.

On that note, if you check out the slightly updated hardware donation page you'll see I added an item for a KVM-over-IP which would help us upgrade our server closet faster. We're maxed out in the console department. In fact, our one public web server has no keyboard/mouse/monitor attached to it. If it freaks out, we hope we can log in remotely and fix it. Any incredibly generous takers? Anybody have strong opinions about which make/model to obtain?

- Matt

OP | Posted on 2008-2-13 12:53:51

13 Feb 2008 0:34:39 UTC

E-mail administration is utter torture. Time was, every project in the lab had its own separate mail server. Over the years people wisely moved towards a more unified lab-wide e-mail system. Of course, SETI was the last project to convert, pretty much due to not having the man-week to spare fixing something that ain't broke. Well, it suddenly broke last night enough that I had to pretty much drop everything today and make everyone bite the bullet to start switching over - something that should have happened years ago but nobody had the time to deal with it. Not like I have the time to deal with it now. Ugh. At least it'll all be out of my hands in the coming weeks. Until then, I'll be up to my eyeballs in sendmail drudgery.

Meanwhile, we had our usual outage today, during which we replaced the seemingly bad drive on thumper - the master science database. That was easy, but upon restart another of its 48 drives started complaining. So far the complaints can be seen as spurious enough to ignore. We'll do more robust RAID checking soon. Bob also moved some log files around to hopefully reduce random access disk I/O, and is running some "update stats" on the tables to see if that improves performance.

In better news, I did some DNS twiddling to split the upload and scheduling services to two separate machines (as opposed to running both services on both machines). This vastly improved performance, as splitting the functionality reduced the NFS traffic between the two to zero. We had it set up the previous way for historic reasons which were no longer apt. This is all very good but as it stands we have single points of failure for all our public facing servers. We have some systems in line to fix that but they are in use for Astropulse testing. And we still need to work that router into the fold.

Note regarding the previous thread: I should take updated photos of the server closet - not that much different but a lot neater.

- Matt

OP | Posted on 2008-2-14 09:28:01

13 Feb 2008 23:54:49 UTC

I'm realizing the server status page is giving a slightly bogus picture of our current server setup, and it's actually too much work right now to fix the status script, so I'll just tell you now what the current situation is: our public web server is thinman, our scheduling server is ptolemy, our upload server is bruno, and our download server is bane. None of these currently has a redundant twin or a "hot" backup (but we have vader and maul all set up to be a replacement for any of the above if need be). More on that below. Our primary/secondary BOINC (mysql) database servers are jocelyn/sidious, and our primary/secondary SETI science (informix) database servers are thumper/bambi. Specs for all these are correctly noted on the status page. We have other systems employed for less interesting but important things, but that's basically the meat of it. If we could double the CPU/memory/disk space on everything we have, we'd be set (for the time being).

Anyway.. things are looking better. Weekly outage recovery is still a little weird - I don't think our single download server (bane) can handle such crunch periods alone so we'll probably bring vader back into the fold for that. The other servers are super happy given the recent changes to reduce NFS traffic. I enacted some more such changes this morning. This tweaking, coupled with server ewen (where Eric does his Hydrogen work) crashing and hanging the network a bit, made for a slightly bumpy ride this morning. However, smoother seas, and perhaps the "update stats" run on a couple of signal tables, made the assimilators much faster. We'll finally catch up on that queue in a couple hours I think. Due to the reduced dropped connections on the scheduling/upload servers it seems the router got more cycles to spend on downloads, and we reached almost 70Mbps last night. Still need to get that new router going...

Other than that - more mail drudgery. As much as I like computers, I hate when perfectly good but nevertheless wonky solutions to small problems become the foundations for advanced development, thus amplifying the original wonky-ness.

Oh yeah - Eric sent some graphs around. Looks like the radar blanking code is working. Neat. Jeff's working that code into the splitter now so we can retest that small data file and compare results.

- Matt

OP | Posted on 2008-2-15 09:28:43

14 Feb 2008 22:11:21 UTC

Right after writing yesterday's tech news I spotted that the validators hadn't been running since the morning. Oops! Turns out I discovered something that's been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn't find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven't seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself.
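The underlying gotcha is that cron launches jobs with a nearly empty environment, so a library path set up in our interactive shells simply isn't there. One way to make the cron launch behave like the command-line launch is a small wrapper along these lines - every path and the validator binary name below are placeholders, not the real SETI@home layout:

```python
#!/usr/bin/env python
# Sketch of a cron-safe launcher: give the validator the same library/search
# paths it gets from an interactive shell. All paths and the binary name
# below are placeholders.
import os

PROJECT_DIR = "/home/boincadm/projects/sah"   # assumed project directory
EXTRA_LIBS = "/usr/local/informix/lib"        # assumed library directory

env = dict(os.environ)
env["LD_LIBRARY_PATH"] = EXTRA_LIBS + ":" + env.get("LD_LIBRARY_PATH", "")
env["PATH"] = "/usr/local/bin:/usr/bin:/bin"

os.chdir(PROJECT_DIR)
# Replace this process with the validator so cron still sees its exit status.
os.execvpe("./bin/validator", ["validator"], env)
```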

Things got a little weird over night. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That's how it is around here. Anyway.. this download redundancy, however nice to have, didn't help very much nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why?

Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via apache). But despite incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS'ing ourselves. We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes.
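To see why a small fraction of stale caches can still peg the router packet-wise, here's a toy simulation of the loop. The hostnames and the 50/50 split are illustrative only, not our real DNS data:

```python
# Toy model of the redirect loop: a stale DNS cache still maps both service
# names to both hosts (old round-robin), while each host now redirects
# requests for the "other" service by hostname. Names are illustrative only.
import random

STALE_DNS = {
    "sched.example":  ["hostA", "hostB"],   # outdated round-robin answers
    "upload.example": ["hostA", "hostB"],
}
ROLE = {"hostA": "sched.example", "hostB": "upload.example"}  # the new split

def request(service, hops=1):
    host = random.choice(STALE_DNS[service])  # stale cache picks either host
    if ROLE[host] == service:
        return hops                           # right host: request served
    # Wrong host: apache redirects to the service *name*, which the stale
    # cache may resolve straight back to the same machine -> another request.
    return request(service, hops + 1)

random.seed(1)
hops = [request("upload.example") for _ in range(10000)]
print("average requests per successful upload:", sum(hops) / len(hops))
```

Pinning the redirect to a hard IP address, as we did, removes the re-resolution step, so even a client with a stale cache gets bounced at most once.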

I brought the digital camera today and took pictures of the closet in its current state. I'll put them on line over the weekend or early next week.

- Matt

OP | Posted on 2008-2-20 11:35:07

20 Feb 2008 0:10:42 UTC

Another long weekend, literally thanks to the President's Day holiday, figuratively thanks to the various network bottlenecks. For the most part there was nothing out of the current usual - we were sending out a lot of fast workunits which meant our backend servers were swamped dealing with the increased number of results coming in. What was unusual was ptolemy having some kind of inexplicable freeze for several hours. It was sending away every scheduler request with 503 errors. Jeff examined everything but found nothing unusual going on to cause this - and service restarts and even a whole system reboot didn't fix the problem. Then all of a sudden it all just started working again. So we're calling this a fluke, and perhaps something fishy further up the pike, for now. One of the download servers was having fits all weekend, losing mounts, etc. but that didn't seem to cause any additional headaches from the perspective of the public. Jeff and Eric were on top of all this, which was good as I was spending most of the weekend out of town - it was a battle to get wireless to work at my in-laws' house.

Had the usual Tuesday outage today. No news there except recovery was slowed by a broken query which erroneously tries to slurp up the entire user table into memory. This happened before, but we couldn't find the culprit. Can you? I posted a thread about this in our help wanted forum.
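If you want to go hunting, the symptom on the MySQL side is simply a SELECT against the user table that has been running far too long; SHOW FULL PROCESSLIST exposes the query text. A minimal sketch, assuming the MySQLdb module and placeholder credentials:

```python
# Sketch: list anything that has been chewing on the BOINC database for more
# than five minutes. User/password are placeholders; requires MySQLdb.
import MySQLdb

conn = MySQLdb.connect(host="jocelyn", user="readonly", passwd="secret")
cur = conn.cursor()
cur.execute("SHOW FULL PROCESSLIST")
for row in cur.fetchall():
    thread_id, user, host, db, command, seconds, state, info = row[:8]
    if info and seconds > 300:
        print(thread_id, seconds, info[:120])
```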

I also just uploaded a new set of photos and descriptions for your viewing pleasure.

- Matt

OP | Posted on 2008-2-22 08:41:14

21 Feb 2008 21:17:55 UTC

Yesterday I didn't have much news to report. I mostly spent my day elbow deep in pointing code, so we could determine when/where we observed known pulsars, and see if we actually found them in our data.

However, we've since been experiencing some general aches and pains. In order to get the aforementioned code working we needed to add an index to the science database, and while it's able to create an index "live", the splitters/assimilators have been getting blocked for hours at a time. This should wrap up sometime later today. The lab in general has also been having mail server problems, which isn't helpful.

- Matt

OP | Posted on 2008-2-27 09:07:38

27 Feb 2008 0:09:25 UTC

Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple of minor general bugs that had to be fixed. The pulses were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis.

Not much going on over the weekend, but the science database server (thumper) is still not performing well. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on it at some point. Painful, but we may be able to do it piece by piece without too much disruption.

Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful).
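For context, the forced check is ext3's own doing: the filesystem records a maximum mount count and a check interval, and fsck is forced at mount time when either is exceeded. A sketch for inspecting (or, at your own risk, relaxing) those thresholds with tune2fs; the device name is a placeholder:

```python
# Sketch: inspect ext3's periodic-check thresholds with tune2fs.
# The device name is a placeholder, not thumper's actual data device.
import subprocess

DEV = "/dev/sdX1"

# "Maximum mount count" and "Check interval" in this output are what trigger
# the forced fsck at mount time.
subprocess.run(["tune2fs", "-l", DEV], check=True)

# Some admins disable the automatic triggers and schedule checks themselves
# instead (at their own risk):
# subprocess.run(["tune2fs", "-c", "0", "-i", "0", DEV], check=True)
```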

Meanwhile, the assimilator queue will keep growing and growing until we either let it drain or reconfigure thumper.

Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila!

In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow!

Buckle your seatbelts.

- Matt

OP | Posted on 2008-2-28 08:40:08

27 Feb 2008 22:15:24 UTC

So as the hours wore on last night the work queue was low enough that I had to stop scheduling lest we run out of work. This morning Jeff and I determined the science database server was in a stable-enough state to start everything up again, so we did. That's basically where we are now with that. The OS upgrade was a double leap frog (i.e. up 3 revision levels) so we're getting a few errors that are noisy but most likely bogus, caused by out-of-spec config files left behind and whatnot. We'll have to do a clean OS install at some point to clean out the chaff.

At any rate we removed the old-OS variable from the mix, and the database is still slow as molasses. We really need to update the filesystems (both RAID and fs type, perhaps) and reorganize which data go where. Plans are being spelled out for that. The assimilator queue is getting to be more of a crisis, though. We'll panic more once the outage recovery mellows out a bit.

More on the proposed RAID changes, as there seems to be some interest. The current database (data *and* indexes) is on a single software RAID5 device. When we were just adding signals to the database, there were 0 reads and nothing but sequential writes, so this worked well. Now with all the indexes built, and some scientific analysis taking place, the read/write mix is far more random. Plus the stripe size is way too big for the random I/O (we're reading in a 64K stripe to read a 2K page - or something like that). It's very hard to predict what we'll ultimately need RAID-wise for any given server (as they change roles quite often), so we've had to bite the bullet and change RAID levels mid-stream before. This time, the general idea is to create a new RAID10, and drop the random-access indexes off the RAID5 and rebuild them on the RAID10. We shall see.
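Since there seems to be interest in the numbers: the quoted stripe-vs-page mismatch translates into a large read amplification for random index I/O. A back-of-the-envelope sketch using only the figures mentioned above:

```python
# Back-of-the-envelope read amplification for random index reads on the
# current RAID5 layout, using only the figures quoted above. Real behavior
# also includes RAID5's read-modify-write penalty on small writes.
stripe_unit_bytes = 64 * 1024   # data pulled in per random access
db_page_bytes = 2 * 1024        # database page actually needed

print("bytes read per useful page: %dx" % (stripe_unit_bytes // db_page_bytes))
# -> 32x, which is why moving the random-access indexes to a RAID10 with a
#    smaller chunk size looks attractive.
```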

Jeff, with my help, got the new router configured today. There were some blips as we swapped wires around to test this and that, and we eventually reached that magic 95% point where everything looks like it should work but just doesn't for some small number of unidentifiable reasons. E-mails to experts have been sent, and we'll sleep on it.

Minor news: web server thinman choked on a bunch of stale cron job processes (presumably stuck on lost mounts over the past week) so I had to reboot it - the web site disappeared for a few minutes there. Also, those root drive errors on thumper turned out to be bogus (again!). I added the wrongly failed drive back as a spare. Weird.

- Matt

OP | Posted on 2008-2-29 09:00:46

28 Feb 2008 21:25:13 UTC

Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. We also changed the mount flags on the database volume to include "noatime" - we'll see if this actually makes a difference in performance.
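For anyone trying this at home, "noatime" stops reads from triggering inode-timestamp writes, and whether the remount actually took effect is visible in /proc/mounts. A tiny sketch - the mount point is a placeholder, not the real database volume path:

```python
# Sketch: confirm the database volume really is mounted with noatime after
# the change. The mount point below is a placeholder path.
MOUNT_POINT = "/mydisks/db"

with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint == MOUNT_POINT:
            print(device, fstype, options)
            print("noatime active:", "noatime" in options.split(","))
```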

Jeff and I are still working to get past the router config. One of our roadblocks was using cables that were gigabit capable mixed with ones that were not (once again it's cheap parts causing the headache). We might actually be ready to go except we have to upgrade the super-long cable going from our closet to the main lab server closet, which is inaccessible to us. Waiting on the appropriate parties to handle that.

Regarding hardware/software RAID: We tend to shy away from hardware RAID as we've had many nightmares in the past regarding configuration and implementation. Namely, it takes forever to figure it out, and then drives fail spuriously and/or silently. The software RAID hit isn't enough to make us consider going hardware on our current systems any time soon.

- Matt

OP | Posted on 2008-3-4 09:05:16

3 Mar 2008 23:13:14 UTC

So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn't been a problem so much lately, as things have moved to a more powerful server and we got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything.

This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router?

[traffic graph image]
Yay! We've now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don't have a good sense of what's normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now.

In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we generally have close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don't have a cell phone. Never had one and, if possible, never will.

- Matt

OP | Posted on 2008-3-5 08:39:01

4 Mar 2008 23:27:02 UTC

Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put in a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server), which was up until today totally headless, now has a spot on the KVM, which gives us some warm fuzzies.

Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry!

However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner than later).

I'm reverting to four assimilators. Running 12 doesn't seem to help and only caused memory problems on bruno. We're really going to have to do some major reconfiguration on thumper before we can catch up again.

- Matt

OP | Posted on 2008-3-11 08:06:30

10 Mar 2008 18:58:22 UTC

Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it's a bummer when I do. Anyway, I'm back, though still only about 80-90%.

In the meantime, nothing much happened, except that the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no other unexpected stresses allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter, which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we're still in dire need of database server improvements, mostly in the RAID re-configuration realm. We're also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it?

- Matt (sniff cough)

OP | Posted on 2008-3-12 08:09:41

11 Mar 2008 22:09:13 UTC

Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice.

Despite the happy current performance of our servers, we're still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I'm building the first RAID1 pair - it's syncing up now - where we'll start recreating indexes as soon as tomorrow.
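Assembling such a pair with mdadm is a one-liner; a sketch with placeholder device names (the actual pair of thumper disks isn't specified above):

```python
# Sketch: create a RAID1 pair for the new index dbspaces. The md device and
# member disks are placeholders for whichever two drives actually get used.
import subprocess

subprocess.run(
    ["mdadm", "--create", "/dev/md10",
     "--level=1", "--raid-devices=2",
     "/dev/sdx", "/dev/sdy"],
    check=True,
)
# The "syncing up" mentioned above is the initial resync; progress is visible
# in /proc/mdstat.
```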

- Matt

OP | Posted on 2008-3-13 08:52:31

12 Mar 2008 22:32:31 UTC

As for science database improvements... While getting the new science database RAID1 volume set up we discovered that the LVM GUI doesn't allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (both the logical volume and then the filesystem itself), so we'll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We'll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks.
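For reference, the command-line route we fell back to is just two steps - grow the logical volume, then grow the mounted filesystem. A sketch with placeholder names (the actual volume group and mount point on thumper differ):

```python
# Sketch: the command-line equivalent of the resize the GUI refused to do.
# Volume, size, and mount point are placeholders for thumper's real layout.
import subprocess

LV = "/dev/science_vg/index_lv"   # hypothetical logical volume
MOUNT_POINT = "/science/indexes"  # hypothetical mount point

subprocess.run(["lvextend", "-L", "+100G", LV], check=True)  # grow the LV first
subprocess.run(["xfs_growfs", MOUNT_POINT], check=True)      # then grow XFS (mounted)

# Note: XFS can be grown but never shrunk, which is presumably part of why
# the GUI won't touch these volumes.
```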

And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the format of the confirmations from campus changes every so often. And by "changes" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up. Anyway, it's fixed, and a couple of "clogged" donations pushed through just now.
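The generic fix for that kind of breakage is to flatten whitespace before matching fields, so a stray line feed inside a value can't derail the parse. A simplified sketch - the confirmation format shown is made up, not the real campus e-mail layout:

```python
# Simplified sketch of line-feed-tolerant parsing; the "confirmation" format
# here is invented, not the real campus e-mail layout.
import re

raw = """Donor: Jane
 Example
Amount: $25.
00
Date: 2008-03-12"""

# Collapse every run of whitespace (including stray newlines) to one space,
# then match fields against the flattened text.
flat = re.sub(r"\s+", " ", raw)

donor = re.search(r"Donor: (.*?) Amount:", flat).group(1)
amount = re.search(r"Amount: \$([\d. ]+?) Date:", flat).group(1).replace(" ", "")
print(donor, amount)   # -> Jane Example 25.00
```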

- Matt
