Thread starter: BiscuiT

SETI@home Technology News 2009

OP | Posted on 2009-8-14 20:15:48

13 Aug 2009 20:22:28 UTC

I was actually out the past couple of days. Family stuff, including an adventure where we had to tow our Prius almost 100 miles back to Oakland (it freaked out and lost power on I-5). It's in the shop now - luckily these newfangled cars store debugging information so they were able to locate the problem (flakey potentiometer causing erratic accelerator information, and as a failsafe the Prius cut its own power).

Anyway.. during the past couple of days Jeff and Bob handled the Tuesday outage, and Eric tackled a couple general network issues as well (the upload server got misconfigured somehow and was dropping excess connections, and then the assimilators were dead in the water for a while there, causing the queue to back up, the workunit disks to fill up, and finally the splitters to shut down - which is why we ran out of work to send out last night). All seems much better now, albeit jammed with traffic.

In better news we did finally get the first two data drives from Arecibo as recorded by the upgraded data recorder and new external drive docks under normal operations. So we're not going to run out of raw data after all, or at least not just yet. I'm copying those raw data files onto our local drives as I type.

- Matt

OP | Posted on 2009-8-18 17:54:10

17 Aug 2009 21:22:19 UTC

Okay, things haven't been running so well the past couple of days. First, there were some mount problems in the middle of last week which caused our assimilator queue to clog up. This inflates our result table, causing all kinds of table fragmentation, which never helps the general pipeline. Later in the week I noticed the spike table in the science database was running out of space, so Bob added a few more database chunks. That process eats up a bunch of disk i/o, causing splitters/assimilators to slow down temporarily. But then we hit some major chokepoint causing work production to grind to a halt.

Actually it was worse than that - things were working, just really slowly. This makes it hard to find an obvious smoking gun. Usually this is a symptom of heavy disk/database i/o on thumper. We were testing all that this morning by turning processes off, but to no avail.

So.. remember how I mentioned in my last note how we just got new raw data from Arecibo? Well, the script copying it over to the raw data storage server failed to register that the file system was full, and packed it up tight. Turns out this caused the storage server some distress, and when I finally checked into it this morning the load was high and all the nfsd's were in disk wait. I deleted one excess file, the nfsd's sprang to life and the whole dam broke, the splitters charged full steam ahead, and the network bandwidth is now tapped out trying to catch up on demand. Fair enough.
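One obvious guard would be to have the copy script check free space before each file; a minimal sketch of that idea, where the mount point and safety margin are assumptions rather than our actual setup:

```python
import os
import shutil

RAW_DATA_MOUNT = "/path/to/raw_data_storage"   # hypothetical mount point
MARGIN_BYTES = 50 * 2**30                      # assumed safety margin (~50 GB)

def copy_if_room(src, dst_dir=RAW_DATA_MOUNT):
    """Copy one raw data file only if the destination file system still has
    room for it plus a safety margin; otherwise stop before packing it tight."""
    free = shutil.disk_usage(dst_dir).free
    need = os.path.getsize(src) + MARGIN_BYTES
    if free < need:
        raise RuntimeError(f"{dst_dir}: only {free} bytes free, need {need}")
    shutil.copy2(src, dst_dir)
```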

- Matt

OP | Posted on 2009-8-19 09:29:28

18 Aug 2009 22:41:06 UTC

Outage day, usual drill: shut everything down, back up the mysql databases, fire off a science database backup as well while we're at it, compress the mysql tables (which get fragmented over the course of a week), and start everything back up. As far as that was concerned, everything went smoothly.
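For the curious, the mysql half of that drill is roughly scriptable; here's a minimal sketch using the standard mysqldump/mysqlcheck tools, with the host name and backup path as assumptions (not our actual configuration):

```python
import subprocess
from datetime import date

# Hypothetical backup destination; the real targets aren't named in the post.
DUMP_PATH = f"/backups/mysql-{date.today()}.sql.gz"

def weekly_mysql_maintenance(host="mork"):
    """Back up the mysql databases, then defragment ("compress") the tables.
    Assumes the BOINC back-end daemons are already shut down."""
    # Full logical backup, piped through gzip.
    with open(DUMP_PATH, "wb") as out:
        dump = subprocess.Popen(["mysqldump", "-h", host, "--all-databases"],
                                stdout=subprocess.PIPE)
        subprocess.run(["gzip"], stdin=dump.stdout, stdout=out, check=True)
        dump.wait()
    # OPTIMIZE TABLE across the board, to undo a week's worth of fragmentation.
    subprocess.run(["mysqlcheck", "-h", host, "--optimize", "--all-databases"],
                   check=True)
```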

However, we were hoping to hook up a couple extra solid state drives to the new replica server mork. The plan was to put some mysql logs on these drives to help unload extra i/o from the rest of the database drives. We got all the hardware in place and hooked it up today, only to find the server BIOS wasn't seeing these drives. In the time allotted for this task I determined this was either due to (1) bad cables, or (2) motherboard weirdness. Since this is an Intel donated server with an "experimental" motherboard, all bets are off. I did prove we could see the SSDs when I swapped cables around, but given the current setup we couldn't run normally like that (long story). In any case, I think we're fine without these drives for now, and may still go along with the plan to make mork the master next week.

Other than that, radar blanking woes continue. I'm going to have Eric and Jeff look at my code tomorrow and point out what I'm doing wrong, if anything. I also hope to get some version of the NTPCkr page online tomorrow (he says with little fanfare).

- Matt

OP | Posted on 2009-8-20 22:53:40

19 Aug 2009 22:07:01 UTC

Okay. Spent a large chunk of the day hacking the final bits of the NTPCkr web page together and made it available for public viewing. Yippee! There's a link on the front page in the news section if you're looking for it.

There's still a ton more work to be done on this page, as well as the NTPCkr itself, and this is still just the first step in many as far as final data analysis is concerned. We haven't even touched radio frequency interference removal yet (outside of the tools we already have from other SETI projects that we could retrofit for SETI@home). Still, it's a (seemingly rare) major step in the right direction around here.

I also had a code walkthrough with Jeff/Eric about my radar blanking difficulties. Eric had several good things to try, which I'll get started on once I post this message. Actually I might look into the stuck science status page first...

- Matt

OP | Posted on 2009-9-4 11:36:02

3 Sep 2009 17:32:56 UTC

Sorry about the delay in posting. I've been around, just busy. Those interested in more info should note that we are posting general weekly meeting updates at seti.berkeley.edu.

Outside of lots of little network/system hiccups which have been addressed in our usual whac-a-mole manner, there have been continuing data pipeline issues. The data recorder at Arecibo has been crashing, seemingly randomly. This wouldn't be a big deal but it requires human intervention to reboot, so when it locks up at night, we can miss hours of data. Meanwhile, our reserves are pretty much running dry. We do expect a shipment of at least 4 full data drives by early next week. We may run out of data over the weekend, but that's okay. And yes, we are aware of splitters stuck on certain files.

On a more positive note, server mork (a new 24 processor/64 GB RAM Intel system) is working beautifully as our master mysql database server (handling a sustained 2500 queries/second without breaking a sweat). Meanwhile we reconfigured jocelyn to be the replica server now. There are some gotchas we've been working around, so not all pieces have fallen into place on that front, but we're close. The former replica server, sidious, has been retired (it's actually powered off and sitting on a lab bench).

I haven't updated the NTPCkr candidate list in a while as the candidate scorer program seems to lock up the primary science database. I'll mess around with that today (mainly trying to force it to connect to the secondary science database server).

Little progress on the radar blanking front, though still non-zero progress. Finding the time is difficult.

- Matt

OP | Posted on 2009-9-17 07:40:45

16 Sep 2009 20:53:41 UTC

Hello again. Sorry about the lack of information lately. I was out sick a large chunk of last week.

Anyway... it's been business as usual more or less. The raw data pipeline really shrank down, but fresh data finally arrived from Arecibo, so we were able to flood the queues again. But I see that we're in a period of about two weeks of zero observations, so we might tighten the belt again before too long. The new mysql setup (mork as master, jocelyn as replica) has been working quite well the past couple of weeks. We have another mork-like server (tentatively called mindy) but, like most of our equipment around here, it was a donated system of unknown quality. Several hours of fighting with it yesterday make me believe mindy may be a dud (processor errors during boot, etc.).

There have been complaints lately about uploads. I don't see any immediate problems on my end. I see files appearing on the server at the normal rate. The traffic graphs don't show anything vastly awry. Eric's been messing with the apache/balance settings on that system, so I defer all questions to him.

Eric and Jeff are working on the first gross-level RFI removal infrastructure. Once that's in place the NTPCkr data will start making slightly more sense (the top candidates are all pretty much junk right now). Until then, I will only upload the top ten list by hand every so often.

- Matt

OP | Posted on 2009-9-18 07:56:00

17 Sep 2009 19:55:10 UTC

As Josef pointed out in yesterday's thread we are indeed unable to get any new data from the telescope until early November. This is a problem because we have only a few drives full of data on our shelf, and maybe a few drives down at Arecibo (which we asked to have shipped up to Berkeley).

The silver lining is that Jeff has been putting effort into getting the data recorder crashing issues fixed - now that project can be back-burnered and he can focus on RFI issues. Meanwhile I'm cracking on the software radar blanking stuff. I actually made a significant advance this morning, discovering that at any given time the radar patterns we are locking onto can drift as little as 0.1 samples, with drastic results in our ability to find the radar. I've solved that little bit, and it's all pretty much plumbing/testing/deploying at this point. Hopefully I can get this rolling before we completely run out of data. Of course, I always feel that running out of data shouldn't be that big a deal.
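To make that drift issue concrete: a fractional-sample shift is enough to smear the pattern match you lock onto. Below is a minimal numpy sketch (purely illustrative, not the actual blanker code) of estimating a pulse-pattern offset to sub-sample precision by interpolating around the cross-correlation peak:

```python
import numpy as np

def radar_offset(block, template):
    """Estimate where a known radar pulse pattern sits in a block of data,
    to sub-sample precision: cross-correlate, then refine the integer peak
    with parabolic interpolation.  Illustrative sketch only."""
    corr = np.correlate(block, template, mode="valid")
    k = int(np.argmax(corr))
    frac = 0.0
    if 0 < k < len(corr) - 1:
        y0, y1, y2 = corr[k - 1], corr[k], corr[k + 1]
        denom = y0 - 2 * y1 + y2
        if denom != 0:
            frac = 0.5 * (y0 - y2) / denom
    return k + frac   # offset in samples, fractional part included
```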

By the way, one of the reasons I've been lax with these threads lately is that I'm getting tired of being the sole focus for tech support/donation queries/etc. Please don't be insulted if I address roughly 0% of your requests that are personally addressed to me. I simply don't have the time. I keep asking for additional web presence and user interaction from the others or perhaps the hiring of actual web support staff, to no avail.

- Matt


The project can't get any new data from the telescope for now, and that will last until early November. It's a fairly troublesome problem, since there are only a few full data drives on the shelf at the moment, plus maybe a few more drives coming from Arecibo, which still need to be shipped up to Berkeley.. (can't afford air freight, I guess..)

OP | Posted on 2009-9-23 19:08:38

22 Sep 2009 20:43:14 UTC

Today was an outage day, with nothing special to report on that front. One interesting note is that our master mysql database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn its database engine generally performs better than mork's. How could this be? Because despite having far less memory and fewer processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed, with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?

So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get some kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.
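For flavor, one of those alert checks might look roughly like the sketch below - a replica-lag check against the mysql slave - where the host name, the threshold, and the use of the stock command-line client are all assumptions on my part:

```python
import subprocess

def replica_lag_alert(host="jocelyn", threshold_s=3600):
    """Warn when the mysql replica falls too far behind the master.
    Sketch only: assumes the monitoring box can reach the replica with
    the standard mysql client and stored credentials."""
    out = subprocess.run(
        ["mysql", "-h", host, "-e", "SHOW SLAVE STATUS\\G"],
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "Seconds_Behind_Master" in line:
            value = line.split(":", 1)[1].strip()
            if value == "NULL" or int(value) > threshold_s:
                return f"ALERT: replica on {host} is behind ({value} s)"
    return None
```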

However yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple hours there. No biggie - we killed the queries and informix sprung back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.

I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.
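Since each file processes at roughly half real-time speed, catching up is mostly a matter of fanning files out across processes; a minimal sketch of that, with the blanker binary name and file list as placeholders:

```python
from multiprocessing import Pool
import subprocess

RAW_FILES = ["raw_file_1.dat", "raw_file_2.dat"]   # placeholder names

def blank_one(path):
    """Run the radar blanker (hypothetical binary name) over one raw file."""
    subprocess.run(["./software_radar_blanker", path], check=True)
    return path

if __name__ == "__main__":
    # One worker runs at ~0.5x real time, so four workers together chew
    # through data at roughly 2x real time.
    with Pool(processes=4) as pool:
        for done in pool.imap_unordered(blank_one, RAW_FILES):
            print("blanked:", done)
```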

Oh yeah, one more thing - we do know the "queries/second" field is blank on the server status page. For some reason the same exact informational query on one server returns in a different format than the other, so our general "db stats" script is sorta broken. Bob is fixing it.
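For reference, that sort of breakage is just a parsing problem; here's a hedged sketch (not Bob's actual script) of reading `SHOW GLOBAL STATUS` output in either the tab-separated or the boxed-table format, from which queries/second is just Questions divided by Uptime:

```python
def parse_mysql_status(output):
    """Turn `mysql -e "SHOW GLOBAL STATUS"` output into a dict, tolerating
    both tab-separated batch output and the boxed table format.
    Illustrative only."""
    stats = {}
    for line in output.splitlines():
        line = line.strip().strip("|").strip()
        if not line or line.startswith("+") or line.startswith("Variable_name"):
            continue
        parts = line.split("\t") if "\t" in line else line.split()
        if len(parts) >= 2:
            stats[parts[0]] = parts[-1]
    return stats

# e.g. queries/second over the server's lifetime:
#   qps = int(stats["Questions"]) / int(stats["Uptime"])
```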

- Matt

OP | Posted on 2009-9-24 07:41:52

23 Sep 2009 20:46:09 UTC

Had more science database woes at the end of the day yesterday - processes (including splitters) getting logjammed. I'm hoping a couple "update stats" commands will fix all that.

Speaking of splitters, I'm actually running (drumroll please) the first software radar blanked data through a splitter right now, and workunits will be distributed to the public fairly soon. This is still in test phase - we shall see if the software blanking performs better than (or worse than, or the same as) the hardware blanking. I'm guessing with a couple tweaks here and there my code will be far better.

- Matt

OP | Posted on 2009-9-25 08:03:44

24 Sep 2009 19:29:14 UTC

Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.

To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA, and, as we've found, it misses some echoes, goes out of phase occasionally, or just isn't there in the data. Once we trust the software blanker, we'll probably just stick with that.

On the upload front: Sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good looking result files, and the database queues are normal/stable. Also Eric has been tweaking this himself so I didn't want to step on his work. Nevertheless, I just took his load balancing fixes out of the way on the upload server and put my own fix in - one that sends every 4th result upload request to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP specific or something like that...
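The split itself is a deterministic 1-in-4 choice; the real change lives in the web server's load balancing config, so the Python below is only an illustration of the idea, with the backend names made up:

```python
import itertools

# Hypothetical backend names: three "slots" for the upload server and one
# for the scheduling server, so every 4th request goes to the scheduler.
BACKENDS = ["upload-server", "upload-server", "upload-server", "scheduler-server"]
_counter = itertools.count()

def pick_backend():
    """Round-robin the result upload requests, sending every 4th one to the
    scheduling server (which should have the headroom to absorb it)."""
    return BACKENDS[next(_counter) % len(BACKENDS)]
```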

I'll slowly start up the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.

- Matt

OP | Posted on 2009-9-30 08:20:48

29 Sep 2009 21:36:20 UTC

Hello all - usual outage day again today. It's an interesting battle between our two mysql database servers. Okay maybe not that interesting. But mork has far more RAM, and jocelyn has a much faster disk array. And we see what we expect - mork is a much better master server as it can hold the database in memory and do all kinds of random access, but during the outages jocelyn does its database table compression much faster, as it involves a lot of sequential writes to disk. Anyway, we're back up - not much shakin' on that front.

We did have an outage last night for an hour. This was a known event involving some network infrastructure maneuvering down on campus. It was unclear how long this would take, so we didn't bother with any kind of panicky warning on the home page that we were going to be down for an unspecified amount of time. I think you're all used to that by now anyway. Plus, the good news is that this was one more task out of the way such that campus can get back to determining our bandwidth upgrade needs.

I found yet another radar blanking bug. At least I'm *finding* the bugs, I guess, and it's much easier to fix them once they are spotted. Anyway, iteration 3 will commence sometime in the next day or so.

And thank you to Tiaan Geldenhuys, who donated a bit of javascript to our NTPCkr page such that if you zoom in on the Google skymap you'll see the border of the "pixel" which makes up this candidate.

- Matt

OP | Posted on 2009-10-2 15:25:23

1 Oct 2009 19:41:59 UTC

Some random news items as the work week winds down. First, we did finally get some data drives from Arecibo - the last of them until we start observing again in early November (at the earliest). So that'll tide us over for a short while. Second, it seems like the third time's the charm: preliminary results from the third software radar blanked data test are looking good! We might roll this into production as early as next week. This means we can start analyzing a wealth of pre-2008 multibeam data that was otherwise useless.

We're still having some science database throughput issues that are keeping us from running the NTPCkr as much as we'd like. More and more this is becoming my number one priority.

- Matt

OP | Posted on 2009-10-6 08:55:16

5 Oct 2009 20:43:16 UTC

Okay that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter problems, if not all of them. We have a pseudo user account that is the "user" running a lot of stuff: apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.
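A cheap safeguard against a repeat is to keep an eye out for runaway shell history files; a minimal sketch, where the home directory root and the size threshold are both assumptions:

```python
from pathlib import Path

SIZE_LIMIT = 100 * 2**20   # flag anything over ~100 MB (arbitrary threshold)

def oversized_histories(home_root="/home"):
    """Yield (path, size) for shell history files that have grown
    suspiciously large under the given home directory root."""
    for hist in Path(home_root).glob("*/.*history"):
        try:
            size = hist.stat().st_size
        except OSError:
            continue
        if size > SIZE_LIMIT:
            yield hist, size

for path, size in oversized_histories():
    print(f"{path}: {size / 2**30:.2f} GB")
```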

- Matt

OP | Posted on 2009-10-7 14:03:04

6 Oct 2009 22:43:01 UTC

Quick post-weekly-outage wrapup: everything went fine, albeit a little slow given recent events. The replica recovery is going on now. Hopefully it'll continue along safely overnight and we can turn the replica back on sometime tomorrow.

One hilarious note. All our server reboots over the weekend dislodged several instances of sendmail, which then went on to send forth unexpectedly large queues of cronjob/server related e-mails to me, Jeff, and Eric. We're talking about 35,000 e-mails, all of which went through the lab spam firewall first, thus clobbering everybody's e-mail in the entire Space Lab for about 24 hours. Fun.

- Matt