Re: Building Sentry Systems
From: Benedikt Stockebrand (me_at_benedikt-stockebrand.de)
To: firstname.lastname@example.org
Date: Wed, 10 Sep 2003 14:02:02 +0200
Hal Flynn <email@example.com> writes:
> [...] what I started thinking
> about is creating a self-contained system that functions independent of
> the cluster, and has access to all drives in the cabinet. The system
> stores integrity information locally, and acts as an audit host to monitor
> the integrity of files within the storage cabinet.
> Is anybody familiar with any research in this area? Has anybody
> experimented with anything like this? I don't have access to any large
> cabinets to tinker with these days, so I don't have the ability to play
> with this on my own. I'd be interested in hearing about other similar
> research and experimentation.
How would you want to check integrity while the cluster nodes are
actually working on the data? This sounds like you either have to
notify your monitor about every intended write operation, which itself
sounds like a major performance hit and makes it necessary to take
into consideration that your monitor box may be down, or you wind up
with a load of race conditions and temporary inconsistencies.
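To make the race concrete, here is a minimal Python sketch, assuming a hypothetical audit host that keeps SHA-256 digests locally and re-reads the files later. The function names (`build_baseline`, `audit`, `false_alarm_demo`) and the file name are made up for illustration and do not describe any real product. A perfectly legitimate write by a cluster node, made without notifying the monitor, looks exactly like corruption:

```python
import hashlib
import os
import tempfile


def digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_baseline(paths):
    """Integrity information the (hypothetical) audit host stores locally."""
    return {p: digest(p) for p in paths}


def audit(baseline):
    """Return the sorted paths whose current digest differs from the baseline."""
    return sorted(p for p, d in baseline.items() if digest(p) != d)


def false_alarm_demo():
    """Simulate a cluster node writing between baseline and audit."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "table.dat")  # made-up file name
        with open(path, "wb") as f:
            f.write(b"row 1\n")
        baseline = build_baseline([path])

        # A legitimate write by a cluster node; the monitor is not told.
        with open(path, "ab") as f:
            f.write(b"row 2\n")

        # The audit cannot tell this apart from tampering.
        return audit(baseline) == [path]
```

In this sketch the audit flags the file although nothing is wrong. Avoiding that means either routing every write through the monitor or accepting that the audit sees stale or half-written data, which is the trade-off described above.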
As an additional note from some ("management") experience (I was
somewhere between the customer/project and the data center/operations
sides) with a range of cluster systems: These beasts are complex
enough as is; adding more complexity will only make it even more
likely that some sysadmin on call duty, called out in the middle of
the night and unfamiliar with the cluster in question, will end up
with a major fsckup.
In my opinion a more reasonable approach, if you can't avoid using a
cluster, is to make really sure a split brain won't happen. First
thing you do is make sure you actually follow the cluster specs and
don't take "shortcuts" using only a single heartbeat link or such.
Then you make sure that the redundancy is properly monitored. It
isn't enough to check that, yes, the database is up and running.
Check every heartbeat link, every redundant host-storage connector,
every storage cabinet, every disk in the RAIDs and all cluster nodes. Then
you make sure only qualified staff touch the system (and, I am sorry
to say, this is sometimes difficult to enforce if the vendor's support
staff is plain incompetent). Finally, make sure you test things
periodically. If this still isn't good enough for you, disable any
automatic failover and only do a manual switchover after someone
clueful has taken a look at the whole situation.
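As a rough illustration of what "monitoring the redundancy" rather than just the service could look like, here is a minimal Python sketch. The `Component` type, the `assess` function, the status strings and the component names are all hypothetical; how each component's health is actually probed is left out and depends on the cluster product in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str      # e.g. "hb0" (illustrative name)
    group: str     # redundancy group, e.g. "heartbeat", "disk", "node"
    healthy: bool  # result of whatever probe the site uses (not shown)


def assess(components, service_up):
    """Return (status, degraded_groups).

    A group is degraded as soon as ANY member is down, even though the
    service itself may still be running on the surviving members.
    """
    groups = {}
    for c in components:
        groups.setdefault(c.group, []).append(c)

    degraded = sorted(
        g for g, members in groups.items()
        if any(not m.healthy for m in members)
    )
    failed = [
        g for g, members in groups.items()
        if not any(m.healthy for m in members)
    ]

    if not service_up or failed:
        status = "CRITICAL"
    elif degraded:
        status = "DEGRADED"  # service is up, but redundancy is gone
    else:
        status = "OK"
    return status, degraded
```

With one of two heartbeat links down and the database still answering, a service-only check would report everything as fine, while `assess` reports the cluster as degraded and names the affected group. That is the state in which the next single failure can produce a split brain.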
Of course I assume you don't intend to run the cluster without proper
backup (like the old "why do I need backup---I've got a RAID system?"
line) and don't expect the cluster to be the solution to all your IT
problems in the world.
--
Dipl. Inform.                Tel.:  +49 (0) 6151 - 971 823
Benedikt Stockebrand         Mobil: +49 (0) 177 - 41 73 985
Am Karlshof 1a               Mail:  firstname.lastname@example.org
D-64287 Darmstadt            WWW:   http://www.benedikt-stockebrand.de