17 Jun 2009 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.
Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table is under constant heavy attack. Over the course of a week this table gets severely fragmented, resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every Tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.
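For the curious, that weekly "compression" amounts to something like a table rebuild. A minimal sketch of that kind of maintenance pass, assuming MySQL's OPTIMIZE TABLE and made-up connection details (not our real setup):

    # Rough idea of the weekly "compress" pass -- an OPTIMIZE TABLE rebuild.
    # Host, credentials and database name are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="db-host", user="boincadm",
                                   password="secret", database="boinc_db")
    cur = conn.cursor()
    # OPTIMIZE TABLE rewrites the table and its indexes, squeezing out the
    # fragmentation that builds up over a week of inserts/updates/deletes.
    # It locks the table while it runs, hence doing it during a scheduled outage.
    cur.execute("OPTIMIZE TABLE result")
    for row in cur.fetchall():
        print(row)   # status rows returned by OPTIMIZE
    conn.close()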
This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work?). This is a simple count on the result table, but if we're in a "bad state" this count, which normally takes a second, could take literally hours and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.
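For a sense of what that check boils down to, here's a rough sketch. The table/column names, state value, and threshold are illustrative stand-ins, not the real splitter code:

    # Roughly what each splitter's "should I generate more work?" check does.
    # Names, the state value, and the threshold are illustrative only.
    import mysql.connector

    MIN_READY = 200000  # hypothetical low-water mark for results ready to send

    conn = mysql.connector.connect(host="db-host", user="boincadm",
                                   password="secret", database="boinc_db")
    cur = conn.cursor()
    # On a badly fragmented, bigger-than-RAM table this COUNT can take hours
    # instead of a second, and while it grinds away it stalls everything else
    # (feeder included).
    cur.execute("SELECT COUNT(*) FROM result WHERE server_state = %s", (2,))
    (ready,) = cur.fetchone()
    if ready < MIN_READY:
        print("not enough work queued -- go split more")
    conn.close()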
We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try: make a tiny database table that contains these counts, and have a separate program, run every so often, do the counts and populate that table. This way, instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.
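In sketch form the new scheme looks something like the following: one cron-style program counts on the replica and stuffs the number into a tiny table, and the splitters just read that. All the names here (the counts table, hosts, columns) are made up for illustration:

    # Sketch of the hourly count-cacher: one program pays the counting cost
    # (on the replica), everything else reads a one-row table.
    # Table name "result_counts", hosts and credentials are all hypothetical.
    import mysql.connector

    replica = mysql.connector.connect(host="replica-host", user="boincadm",
                                      password="secret", database="boinc_db")
    master = mysql.connector.connect(host="master-host", user="boincadm",
                                     password="secret", database="boinc_db")

    rcur = replica.cursor()
    rcur.execute("SELECT COUNT(*) FROM result WHERE server_state = %s", (2,))
    (ready,) = rcur.fetchone()

    mcur = master.cursor()
    # REPLACE keeps it to a single row; splitters now do a trivial
    # primary-key read here instead of their own expensive COUNTs.
    mcur.execute("REPLACE INTO result_counts (id, n_ready, updated) "
                 "VALUES (1, %s, NOW())", (ready,))
    master.commit()
    replica.close()
    master.close()

A splitter would then just do something like "SELECT n_ready FROM result_counts WHERE id = 1" -- a single-row lookup that stays fast no matter how fragmented the result table gets.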
Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work... and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable," so this wasn't a surprise. I let it go as it was late.
As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.
Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.
So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well, or else we'll start hitting those bad mysql periods really soon.
- Matt