Re: HTTP LOG files Labeling

Ideally, the HTTP logs would be labeled using a precise signature-based
IDS (e.g., Snort), but we didn't use one during data collection.

That's senseless, since:
a) Snort may have false negatives, or exhibit noncontextual alerts because
of misconfiguration

Don't forget that it may have false positives as well.

b) An anomaly detector should flag things that a misuse detector by
definition doesn't care about

You need a hand-labelled dataset, sorry.

I think your conclusion is predicated much more on your second point,
which may depend on the situation. Anomaly detection systems offer
many potential benefits, which is why they continue to be explored so
much after so many years with so few successes. One of the potential
benefits is catching zero-day exploits for which there is not yet a
signature for the traditional NIDS of your choice. If your primary
interest is detecting these, then capture full packets on your network
for some period and run the capture through the NIDS a month or two
later (by which time it has hopefully received the appropriate
signatures). That should give you a reasonable labeling of the
dataset, barring the problems noted in the first point. A manual check
of the resulting labels is always advisable.
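
As a rough sketch of the delayed-labeling idea above: assume the NIDS
replay has already been reduced to a list of (timestamp, source IP,
signature) alerts. The log entries can then be labeled by joining on
source IP within a small time window, and a random sample pulled for
the manual check. The record layout, the names Alert, LogEntry,
label_entries and sample_for_review, and the 2-second window are all
hypothetical choices of mine, not anything Snort emits natively.

```python
import random
from dataclasses import dataclass
from typing import List, NamedTuple


class Alert(NamedTuple):
    """One alert from the delayed NIDS replay (hypothetical layout)."""
    ts: float      # seconds since epoch
    src_ip: str
    sig: str       # signature name or id


@dataclass
class LogEntry:
    """One HTTP log record (hypothetical layout)."""
    ts: float
    src_ip: str
    request: str
    label: str = "unlabeled"


def label_entries(entries: List[LogEntry], alerts: List[Alert],
                  window: float = 2.0) -> List[LogEntry]:
    """Mark an entry as an attack if any alert from the same source
    falls within `window` seconds of it; otherwise presume it normal."""
    by_src = {}
    for a in alerts:
        by_src.setdefault(a.src_ip, []).append(a)
    for e in entries:
        hit = any(abs(a.ts - e.ts) <= window
                  for a in by_src.get(e.src_ip, []))
        e.label = "attack" if hit else "presumed-normal"
    return entries


def sample_for_review(entries: List[LogEntry], k: int,
                      seed: int = 0) -> List[LogEntry]:
    """Pick up to k entries at random for a manual check of the labels."""
    rng = random.Random(seed)
    return rng.sample(entries, min(k, len(entries)))
```

Note that "presumed-normal" is deliberately not "normal": false
negatives in the NIDS end up in that bucket, which is exactly the
problem raised in the first point.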

I think this approach may actually work here given the stateless
nature of web attacks: each one typically fits within a single TCP
connection, which lets signature-based detectors catch it. I wouldn't
recommend this approach for general network ID, as there's a lot of
malicious activity that signature-based detectors can't detect,
because detecting it requires more state than is reasonable for a
detector to keep. This is one of the other major areas where
anomaly-based detectors would be useful, and where we've had the most
commercial success (particularly in detecting DoS attacks).
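
To illustrate what "stateless" buys you for web attacks: each request
can be checked on its own, with nothing remembered between requests.
The sketch below is a toy, not Snort's rule language; the three
patterns and the name match_signatures are made up for illustration
and are nowhere near a usable rule set.

```python
import re
from typing import List
from urllib.parse import unquote_plus

# Toy signatures (illustrative assumptions only).
SIGNATURES = {
    "path-traversal": re.compile(r"\.\./"),
    "script-tag": re.compile(r"<script", re.IGNORECASE),
    "sql-union": re.compile(r"union\s+select", re.IGNORECASE),
}


def match_signatures(request_line: str) -> List[str]:
    """Return the names of all signatures matching one request line.
    No state is carried from one call to the next."""
    decoded = unquote_plus(request_line)  # undo %xx and '+' encoding
    return [name for name, rx in SIGNATURES.items() if rx.search(decoded)]
```

A DoS, by contrast, has no single request you could point such a
function at, which is why it falls outside this approach.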

Where I think most people want to see anomaly detection go is being
able to detect abuse of privileges (the insider problem), and usurped
credentials (it's a lot harder to detect an attacker when they have a
legitimate login). <soapbox>The only way we're going to make progress
in improving anomaly detection to reach this state is through the
creation of good testing data, and I'm convinced that the only viable
way of doing that is simulated traffic: any method of labeling real
data is too error-prone. The trick is getting simulated traffic that
looks like real traffic (this is what's driving my research). This is
a hard problem whose surface has only been scratched, and as such,
it's ripe with potential for research projects, dissertations, and the
like. Going on this rant in this thread seems appropriate because it's
worth pointing out that doing the simulation of web traffic (with the
injection of a variety of attacks) should be much easier. Just keep in
mind that one must be able to show that the simulated web traffic is
sufficiently similar to real web traffic that results from the data
can be extrapolated to real web servers.</soapbox>
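
For what it's worth, the mechanical part of simulating labeled web
traffic is small; a minimal sketch follows. The path lists, the
function name simulate, and the uniform random choice are all
assumptions for illustration. In particular, uniform sampling from a
handful of paths looks nothing like real traffic, and making it look
real is the hard part described above.

```python
import random
from typing import List, Tuple

# Hypothetical request pools; a real generator would need far more
# variety and a realistic distribution.
BENIGN_PATHS = ["/", "/index.html", "/about", "/search?q=books"]
ATTACK_PATHS = [
    "/search?q=%27%20OR%201%3D1--",
    "/../../etc/passwd",
    "/search?q=%3Cscript%3Ealert(1)%3C/script%3E",
]


def simulate(n: int, attack_rate: float = 0.05,
             seed: int = 0) -> List[Tuple[str, bool]]:
    """Generate n request lines, each paired with a ground-truth flag
    that is True when the request was injected as an attack."""
    rng = random.Random(seed)
    traffic = []
    for _ in range(n):
        is_attack = rng.random() < attack_rate
        pool = ATTACK_PATHS if is_attack else BENIGN_PATHS
        traffic.append((f"GET {rng.choice(pool)} HTTP/1.1", is_attack))
    return traffic
```

Because the generator decides which requests are attacks, the labels
are correct by construction; no NIDS or hand labeling is involved.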

My two bucks (yes, I value it about two orders of magnitude over most
of my opinions),
