Thread starter: BiscuiT

SETI@home Technical News 2009

OP | Posted 2009-7-3 06:50:19

2 Jul 2009 18:24:11 UTC

Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.

- Matt

OP | Posted 2009-7-7 19:00:20

6 Jul 2009 23:00:18 UTC

It's still pretty ugly out there - we've maxed out our bandwidth and mysql resources. We were able to squeeze a few more cycles out of the upload/scheduling servers this morning, but generally it's been quite impossible the past week or so. Clearly this is a result of our increasing user base, and the growing percentage of results being processed by cuda clients.

To solve this problem we have several options. There is non-zero but nevertheless slow progress on both the bandwidth and mysql fronts, so we're effectively stuck with what we got for now. We could go to single redundancy and keep the split rate the same - this would immediately cut our outgoing bandwidth in half, but people would, on average, get less work to chew on. We could also increase the resolution of the chirp rates we process, thus lengthening the time it takes to process a workunit. We may do both. From what Eric tells me, compressing workunits only helps multibeam, and only by about 20% - almost not worth considering, since that would get us 5-10 Mbits back, and we need something like 50.
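
(For a rough sense of the arithmetic behind those options, here's a back-of-envelope sketch in Python. The 100 Mbit ceiling, the halving from single redundancy, the ~20% multibeam-only compression figure, and the 5-10 vs. ~50 Mbit numbers are from the paragraph above; the multibeam share of traffic is a made-up value chosen only to match those figures.)

```python
# Back-of-envelope arithmetic for the bandwidth options above. The 100
# Mbit/s ceiling, the ~20% multibeam-only compression gain, and the
# "5-10 Mbits back vs. ~50 needed" numbers come from the post; the
# multibeam share of traffic is an illustrative guess.

outgoing = 100.0        # Mbit/s - downloads currently saturating the link

# Option 1: single redundancy - each workunit goes out once, not twice.
print("single redundancy: ~%.0f Mbit/s outgoing" % (outgoing / 2))

# Option 2: compress workunits - only helps multibeam, and only ~20%.
multibeam_share = 0.4   # illustrative guess, consistent with 5-10 Mbit saved
saved = outgoing * multibeam_share * 0.20
print("compression saves: ~%.0f Mbit/s (we need ~50)" % saved)
```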

The other annoying thing is that on Friday/Saturday our raw data storage server got hung up while we were copying a file up from our archives. This caused splitting to slow down until we ran out of work to send. Not sure why, as I killed that transfer and everything worked fine after that. Even more mysterious: bringing the same file up again this morning choked the server once more. Why this one particular file has such a random and extreme negative effect is beyond me at this point, but we're running more tests.

You know, I should point out that while I write these daily missives, I tend to disagree with a lot of the policies that end up getting enacted around here, which makes it difficult for me to defend one practice or another that might be discussed on these threads. Anyway, don't blame the messenger.

- Matt

OP | Posted 2009-7-8 08:04:01

7 Jul 2009 22:35:01 UTC

Had the usual weekly database maintenance outage again today. It looks like our mysql database has shrunk two weeks in a row now (due to fewer results out in the field). This is a good thing, as it means more internal I/O resources. We're recovering from the outage as I type this. I still expect it to take a while (maybe a full 24 hours?) before we stop dropping connections left and right.

As for that raw data storage server issue mentioned yesterday... turns out it was, uh, user error. A partition filled up. Oops. Still, I'm not sure why the data transfer tool (which pulls data up from our off-site archive) didn't notice that the disk was full, and instead kept trying to write to it over and over and over again.
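
(Not naming or guessing at the actual transfer tool, but the missing guard is easy to picture. A minimal sketch of the kind of free-space check that would have caught this; the path and the safety margin are hypothetical:)

```python
import os

def free_bytes(path):
    """Free space, in bytes, on the filesystem containing path (POSIX)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def safe_to_write(dest_dir, incoming_bytes, margin=1 << 30):
    """Refuse to start a transfer unless the file fits with ~1 GB to spare."""
    return free_bytes(dest_dir) >= incoming_bytes + margin

# Hypothetical usage before pulling a raw data file up from the archive:
# if not safe_to_write("/disks/raw_data", expected_size):
#     raise IOError("destination partition (nearly) full - not transferring")
```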

Question: does anybody out there actually *use* NetworkManager? Or does it exist simply to confound and annoy? I'm willing to believe it's a useful tool, but unfortunately my experience pretty much shows the latter - it randomly and unexpectedly breaks network connections without remorse. I've made it a habit to remove that package from all my machines whenever I find it. Of course, I just installed a clean OS on my desktop, and suddenly firefox was starting up in "work offline" mode, even though I unchecked the box every time. I did some research and found, ha ha, it was my old nemesis NetworkManager getting in the way - it got reinstated with the new OS install. One quick "yum erase" and firefox was once again starting up actually connected to the internet, which I think is preferable, no?

- Matt

OP | Posted 2009-7-9 07:06:56

8 Jul 2009 19:03:43 UTC

Once again it took the replica all night to recover. I started it up this morning, and it's catching up now. Well, almost. I'll turn the "show tasks/results" feature back on once it really starts catching up.
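
(For anyone wondering what "catching up" looks like from the admin side: MySQL replicas report how far behind the master they are via SHOW SLAVE STATUS. A sketch of the kind of check that could gate that "show tasks/results" flag - the host name, credentials, and one-hour threshold are all hypothetical, not the project's real setup:)

```python
import MySQLdb
import MySQLdb.cursors

# Hypothetical connection details - not the real replica's settings.
conn = MySQLdb.connect(host="replica-db", user="ro_user", passwd="...",
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchone()

lag = status["Seconds_Behind_Master"]   # None if replication is stopped
if lag is not None and lag < 3600:      # arbitrary one-hour threshold
    print("replica nearly caught up - safe to re-enable result pages")
else:
    print("still catching up (lag: %s seconds)" % lag)
```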

There's been a lot of discussion lately about our bandwidth woes. I actually talked to Blurf this morning on the phone regarding the (rather generous) push to donate money/hardware towards solving this problem. Let me try to paint a big picture here.

We pay for a gigabit of bandwidth from our private ISP (Hurricane Electric), but can only use 100 Mbits given current campus infrastructure. Most of campus is on gigabit already, but our lab is all the way up the hill - so it's much harder and more expensive to improve the old wiring/routing. The entire rest of the Space Lab uses about 10 Mbits/sec, so there is absolutely zero push by anybody else to spend money/effort on this project. Luckily, there was a spare 100 Mbit cable which is what we are using for the Hurricane Electric link.

While we pay for our bits, they still have to route through campus to ultimately hook up with the right backbones. That means we have to adhere to campus's network specs, which in turn means we can only use very specific brands/models of hardware, and can only act once campus has fully researched our needs. We opened a ticket months ago asking for this research to start, and got word a couple days ago that it has more or less finally begun. Not much progress, but still non-zero. This may seem impossibly slow, but campus pretty much always has bigger fish to fry, and our requests usually present them with something new they haven't dealt with before, so they're that much more careful.

Ultimately, we should be presented with a couple options from campus which include exact pieces of hardware to be obtained. It's still not clear how much cable has to be upgraded and where, but we know we'll need two new routers, if not also other hardware. When campus gives us this final report, only then can we start figuring out how to obtain the necessary hardware.

As for other options, like going wireless... There actually used to be a building down in the flats that got wireless bandwidth from us. The experience was that it was quite slow and prone to suffering during bad weather, etc. This was a while ago, but still there is enough concern about reliability that nobody seems to want to go down this path.

Of course, another option is relocating our whole project down the hill (where gigabit links are readily available), or at least the server closet. Since the backend is quite complicated, with many essential and nested dependencies, it's all or nothing - we can't just move one server or function elsewhere; we'd have to move everything (as I and others have explained in countless threads over the years). If we do end up moving (always a possibility) then all the above issues are moot.

Another important thing to consider is that we can always reduce our bandwidth demands via other means, which I also explained in other recent threads - things like removing redundancy (and putting a cap on workunit downloads per day per host), or adding more scientific analysis per workunit. Or, to be a little extreme, calling SETI@home done, turning off the downloads for good, and moving on to the next thing (something I'm actually in favor of doing sooner rather than later, but others around here seem to disagree).

I definitely appreciate past and current efforts to help us get beyond the current bandwidth crisis. However, as noted above, there are enough variables involved that I'd hate for you all to start collecting money directed towards a solution to a problem which might just go away. In the meantime, thanks as always for your patience (and crunching time when you actually do get workunits) - we'll keep working with what we got and see if we can't get beyond the storm sooner.

- Matt

OP | Posted 2009-7-10 08:57:35

9 Jul 2009 22:09:13 UTC

Not much news. Eric, Jeff, and I are still poking and prodding the servers trying to figure out ways to improve the current bandwidth situation. It's all really confusing, to tell you the truth. The process is something like: scratch head, try tuning the obvious parameter, observe the completely opposite effect, scratch head again, try tuning it the other direction just for kicks, it works so we celebrate and get back to work, we check back five minutes later and realize it wasn't actually working after all, scratch head, etc.

Thanks for all the suggestions the past couple of days (actually the past ten years). Bear in mind I'm actually more of a software guy, so I'm firmly aware that there's far more expertise out there regarding the nitty gritty network stuff. That said, like all large ventures of this sort, the set of resources and demands here is quite random, complicated, and unique - so a solution that seems easy/obvious may be impossible to implement for unexpected reasons, or may rest on some key detail that's been misunderstood. This doesn't make your suggestions any less helpful/brilliant.

Okay.. back to multitasking..

- Matt

OP | Posted 2009-7-14 07:39:27

13 Jul 2009 22:11:48 UTC

The data pipeline over the weekend seemed to be more or less okay, thanks to our running out of Astropulse workunits and not having any raw data to split to create new ones. Of course, I shovelled some more raw data onto the pile this morning, and our bandwidth shot right back up again. This pretty much proves that our recent headaches have been largely due to the disparity in workunit sizes/compute times between multibeam and Astropulse, but that's all academic at this point, as Eric is close to implementing a configuration change which will increase the resolution of chirp rates (thus increasing analysis/sensitivity) and also slow clients down so they don't contact our servers as often. We should be back to lower levels of traffic soon enough.
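
(The traffic math behind that change, illustrated: a host crunching around the clock contacts the servers roughly in proportion to 1/T, where T is the time to finish one workunit, so doubling per-workunit compute time roughly halves download/upload traffic. The workunit durations below are made up:)

```python
# Illustrative only: server contacts scale inversely with workunit length.
SECONDS_PER_DAY = 86400.0

def contacts_per_day(workunit_hours):
    """Work requests per day for a host crunching around the clock."""
    return SECONDS_PER_DAY / (workunit_hours * 3600.0)

for hours in (2.0, 4.0):   # e.g., before/after a finer chirp-rate grid
    print("%.0f h/workunit -> ~%.0f contacts/day"
          % (hours, contacts_per_day(hours)))
```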

We are running fairly low on data from our archives, which is a bit scary. We're burning through it rather quickly. Luckily, Andrew is down at Arecibo now, with one of our new drive bays - he'll plug it in perhaps today and we'll hopefully be collecting data later tonight...?

To be clear, we actually have hundreds of raw data files in our archives, but most of them suffer from (a) lack of an embedded hardware radar blanking signal (making them currently impossible to analyse without being blitzed by RFI), or (b) accidental extra coordinate precession, or (c) both of the above. Software is in the works (mostly waiting on me) to solve all of the above.

- Matt

OP | Posted 2009-7-24 07:48:16

23 Jul 2009 20:21:14 UTC

Oh, hello. I was out of town most of last week on vacation, then Jeff and I were at OSCON 2009 down in San Jose until today. Though it's billed as an open source developer conference, we picked up all kinds of linux sysadmin and mysql tips and tricks from various experts, which we may apply toward better diagnosis of system/network/database issues in the future.

That all said, I haven't had time to catch up on the lengthy discussions here in this forum during my absence. I imagine they've been mostly about our continuing network struggles. This may all become quite moot quite fast, as Eric has started rolling out the updated scientific analysis configuration - an easy knob to turn, since we can increase sensitivity, thus improving our science, with the happy side benefit of reducing demand on our servers. I think, though, that we have now just about reached the limits of that particular knob before hitting diminishing returns.

Apparently there were a few servers that needed to be kicked while I was away. Jeff and Eric took care of all that. Mount issues and the like. We also seem to have our new disk arrays set up both at Arecibo and here, so the raw data pipeline should be kicking into full swing again soon. This is good as we're down to our last 10 files that we've been bringing up from the archives (there are a lot more files, but they require the radar blanking software to work in order to be processed, and I haven't gotten around to that yet).

- Matt

OP | Posted 2009-7-28 23:13:47

27 Jul 2009 21:54:56 UTC

Not much time to report much today, but the good news is that we finally got one of those new Intel machines working. Eric was in over the weekend installing a new disk controller card, and Jeff and I wrapped up the OS install/configuration today. We now have a new system with 24 processors (four 6-core 2.13 GHz CPUs) and 64 GB of RAM. We'll try to make this a replica mysql server (in addition to sidious) and see how it does, maybe tomorrow...?

Data-wise, we're finding the ALFA receiver isn't on as much as we thought, and we're running low on data from our archives, as well as data currently on-line. Actually, that's not true at all - we have plenty of data taken between January and April 2008 which has the hardware radar blanking signal (so we can reject RFI), but it was accidentally pre-precessed (so we have to unprecess it after the fact). Not that big a deal.

About to disappear into the basement and throw out a bunch of old computer junk we haven't used in many years (various people are complaining about how much space it's taking up, which is fair).

- Matt

OP | Posted 2009-7-29 08:46:59

28 Jul 2009 22:32:02 UTC

Had the usual Tuesday outage today for database maintenance. Nothing too exciting to report, except that we continue to make progress getting the new server mork on line as a secondary replica (and hopefully, someday, the primary master). MySQL is running on it, and all the tables are being populated as I type this.
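
(For the curious, pointing a freshly-loaded replica at a master in MySQL 5.x looks roughly like this. The replication user, password, and binlog coordinates below are placeholders - with mysqldump --master-data the real file/position get recorded in the dump itself - and nothing here is the project's actual configuration:)

```python
import MySQLdb

# Placeholder credentials/coordinates; jocelyn is the master and mork
# the new replica (per the post), but everything else is made up.
conn = MySQLdb.connect(host="mork", user="root", passwd="...")
cur = conn.cursor()
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST     = 'jocelyn',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = '...',
        MASTER_LOG_FILE = 'bin.000001',
        MASTER_LOG_POS  = 4
""")
cur.execute("START SLAVE")   # replication starts; the backlog then drains
```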

A note about the "old junk" I mentioned yesterday: I was talking about real junk (servers gutted for parts, shipping boxes, etc.). We still have the E450s that were our various servers during the "classic" phase of SETI@home. We keep talking about auctioning those off, but I doubt any of us will ever have the time to coordinate that. Maybe we'll donate them to the Smithsonian.

- Matt

OP | Posted 2009-7-30 18:01:03

29 Jul 2009 22:32:28 UTC

So.. getting mork on line as a test replica server still continues to be one headache after another. We finally got the hardware working, finally got the drive configuration set up, finally got the OS installed, finally got MySQL fired up, and we were populating the databases using Tuesday's dump files.

Then we hit a completely mysterious error, consistently at the same point in the dump file. Long story short, I spent pretty much all day today trying to find the cause. At this point we're about 90% convinced it's an actual bug in the MySQL version that ships with Fedora Core 11 (5.1.35), where it fails reading mysqldumps containing large text fields. This seems like a major problem, no? Anyway, the same mysqldump worked on a test 5.0.x database engine, so I'm looking to upgrade beyond what's in the current Fedora repositories. What a pain!!!
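
(A debugging sketch for this kind of failure: since the restore died at the same spot every time and large text fields were the suspects, one quick way to localize the problem is to stream the dump and report the largest single statements. Purely illustrative - the dump filename is made up, the boundary detection is crude (a ";" inside a text field fools it), and this is not the actual workaround script:)

```python
# Stream a mysqldump and report the biggest single statements - handy
# when a restore dies at a consistent spot and large text fields are
# suspected. Illustrative only; statement detection here is naive.

def biggest_statements(dump_path, top=5):
    sizes = []                  # (bytes, starting line) per statement
    start_line, size = 1, 0
    with open(dump_path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            size += len(line)
            if line.rstrip().endswith(b";"):   # naive statement boundary
                sizes.append((size, start_line))
                start_line, size = lineno + 1, 0
    return sorted(sizes, reverse=True)[:top]

for nbytes, line in biggest_statements("tuesday_dump.sql"):
    print("statement starting at line %d: %d bytes" % (line, nbytes))
```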

I just turned on the "show results" flag, even though our current replica is still far behind reality.

- Matt

OP | Posted 2009-7-31 07:35:59

30 Jul 2009 19:46:03 UTC

So we seem to have gotten over the hump with this new replica server. I should point out working on this server has had zero effect on the rest of the normal project operations, except for perhaps eating up all my time. Anyway, my script got around the dump/restore bug, and after some configuration headaches this morning we are successfully replicating on mork! Of course, sidious continues to be the replica we are using for production, while mork is considered "beta test."

It is catching up on the backlog far more slowly than we hoped, especially given the power of the machine. Of course, "power" is measured in network, disk, memory and cpu. This system certainly has cpu (24 processors!), but word on the street is that mysql actually *drops* in performance beyond some number n of processors - what n is, and what the penalty is, remain unclear. Also, this system has fewer disk spindles than sidious (8 compared to 10), and they are slower disks, I think. So we may be taking a disk i/o hit, but iostat doesn't really show anything amiss. The system is also in our lab and not in the closet, so there may be an extra network hop or two slowing things down. Anyway, as it progresses we'll gauge its performance and act accordingly.

As for changing linux flavors: the current issue here is mysql versions, not so much linux distributions. As mentioned elsewhere, we're trying to adhere to a homogeneous setup, and we have less than zero time to mess around with anything experimental like trying new OSes on for size. In any case, Fedora works well enough, and while I generally swear by open source software for both philosophical and practical reasons, I do understand that you get what you pay for.

- Matt

OP | Posted 2009-8-4 16:49:31

3 Aug 2009 23:19:10 UTC

A relatively spotless weekend (though I did arrive this morning to find 1000 e-mails in my inbox - all warnings about mount issues from a behind-the-scenes compute server). The new replica server mork caught up pretty much instantly last week once the whole database was read into memory (about 32GB), and it's now serving as the main replica, if only to stress test it. We still may crack it open and reconfigure it if we find the drive configuration is a bottleneck. In any case, if you're looking at results on this web site, you're pulling them off mork.

We are also getting close to running out of data. Just as we got the data recorder working again, there were two weeks without any ALFA observations. We're currently splitting raw data files that were only partially split for one reason or another, but after that... it looks like my software radar blanker project has been bumped up in priority. No need to panic, at any rate - we probably have a couple weeks, I think, and we might get a burst of new data from Arecibo during that time.

- Matt

OP | Posted 2009-8-5 08:49:51

4 Aug 2009 22:52:27 UTC

Tuesday is our usual outage day, as many of you are firmly aware. Today was the usual drill, except now we have two replica databases to deal with. We set the "alter table" scripts running on both systems simultaneously, prepared to laugh at how much faster mork would perform than sidious.

And it was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and upon reboot there wasn't a trace of evidence anywhere about what happened. So now we have the complete opposite of a warm fuzzy feeling about mork. Nevertheless, even with this setback and the ensuing innodb database recovery, it still wrapped up all its tasks around the same time as the master database, so both master and replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.

Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.
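
(To illustrate the baby/bathwater problem - this is a generic toy, not the lab's actual blanking logic: a fixed power cutoff tuned on one file misfires on the next, so one common dodge is deriving the threshold from each file's own statistics, e.g. median plus k robust-sigmas via the MAD. The constant k is exactly the knob that's hard to pick:)

```python
import numpy as np

def radar_mask(power, k=6.0):
    """Flag samples more than k robust-sigmas above this file's median."""
    med = np.median(power)
    mad = np.median(np.abs(power - med))
    sigma = 1.4826 * mad            # MAD -> sigma, assuming Gaussian noise
    return power > med + k * sigma

# Toy data: noise-like background with a few loud "radar" pulses mixed in.
rng = np.random.RandomState(0)
power = rng.rayleigh(1.0, 100000)
power[::5000] += 50.0
mask = radar_mask(power)
print("flagged %d of %d samples" % (mask.sum(), power.size))
# Raising k keeps more data but passes more radar; lowering it blanks
# harder but starts throwing out the baby with the bathwater.
```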

People asked about the NTPCkr pages. Oh yeah.. that.. Jeff and I were pushing on those last month, then I disappeared on vacation, then we were both at OSCON in San Jose, and then the new replica server finally started working, so that's been occupying our time, along with scrounging together data to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site, so we want to make sure it kinda works before embarrassing ourselves with broken/misleading information.

- Matt

OP | Posted 2009-8-6 07:35:57

5 Aug 2009 21:57:21 UTC

Raw data pipeline: Jeff and I are mining old files that were only partially done for one reason or another. Hopefully these can keep us crunching until we get more data from Arecibo. To add insult to injury, it seems the observatory has been suffering from several power outages the past few days, probably due to thunderstorms.

MySQL databases: So far so good with mork as the replica. We recovered pretty quickly from the outage yesterday. I'm hoping the freeze yesterday was a fluke, or caused by some temporary variable which has since changed, or at the very least next time it happens we'll have some kind of smoking gun somewhere on the system. We're looking into getting its twin "mindy" on line sooner than later.

NTPCkr: Jeff and I met this morning to discuss what we need to do to get this thing on line. To be clear, Jeff has been doing pretty much all the work on the NTPCkr engine itself, and I've been helping with the cosmetic/web stuff. Anyway, Jeff has a couple of bugs to clear up - nothing major, things like the reporting mechanism sometimes spitting out the same candidate twice. I've been working on the web site side, like putting in all the hooks to allow people to discuss candidates amongst themselves on a separate forum. Once we clear up our current set of bugs/updates I'll fire up a daily cronjob which will (a) generate the current "top ten" list, (b) pull all the data for these candidates from the science database (if not already on disk) for plotting purposes, and (c) create discussion threads for each candidate (if they don't already exist). Then we're live - though we'll have many "version 2.0" tasks to address right away. A rough sketch of that cronjob is below.
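
(A rough, self-contained sketch of what such a cronjob might look like, with stubs standing in for the NTPCkr scoring engine, the science database, and the forum software. Every name and path here is a hypothetical stand-in, not the actual SETI@home code:)

```python
import os

CACHE_DIR = "/tmp/ntpckr_cache"    # hypothetical plot-data cache

def generate_top_candidates(limit=10):
    # stand-in for the NTPCkr's real candidate scoring/ranking
    return ["cand_%03d" % i for i in range(limit)]

def fetch_science_data(cand):
    # stand-in for pulling the candidate's signals from the science db
    open(os.path.join(CACHE_DIR, cand + ".dat"), "w").close()

def ensure_discussion_thread(cand):
    # stand-in for creating the candidate's forum thread if missing
    print("discussion thread ready for %s" % cand)

def daily_ntpckr_update():
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    top_ten = generate_top_candidates()              # step (a)
    for cand in top_ten:
        cache_file = os.path.join(CACHE_DIR, cand + ".dat")
        if not os.path.exists(cache_file):
            fetch_science_data(cand)                 # step (b)
        ensure_discussion_thread(cand)               # step (c)

if __name__ == "__main__":
    daily_ntpckr_update()
```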

- Matt

OP | Posted 2009-8-11 17:26:30

10 Aug 2009 21:30:52 UTC

Happy new work week (for those on a "standard" work week)! The weekend was rather quiet - no major outages or glitches. We burned through all the data we have on line for Astropulse, but still have plenty to process for multibeam. We do have at least a couple of drives containing hot, fresh data coming up from Arecibo any day now. We're also hoping the amount of time ALFA gets to observe actually increases, or else we'll continually be dangerously close to running out of data to process, if not completely out. As far as problems/concerns go, this is a good one.

I've got the first rev of the daily cronjob going - it creates an updated "top ten" candidate list (via the NTPCkr) to be parsed by some PHP for public consumption. It's running now, and taking a long time. I'll see how long it takes before making anything live, seeing as we'd like to run this every day but may be forced to pace it slower than that.

As for radar blanking, I'm finding the correlations still aren't clearly separating what is radar from what isn't. I'm going to talk to Dan about that shortly.

- Matt
