\chapter{Anomaly Detection of Web-based Attacks}
\label{web}
Anomaly detection techniques have been shown to be very effective at mitigating the spread of malicious activity against web applications. In fact, nowadays, the underground criminals' preferred way of spreading malware\index{malware} consists of compromising vulnerable web applications, deploying a phishing, spamming, or malware kit\index{malware}, and infecting an enormous number of users who simply visit the website using a vulnerable browser (or plug-in). Further exacerbating the situation is the use of botnets\index{botnets} to exhaustively compromise vast numbers of web applications. Thus, due to their popularity, web applications play a significant role in the malware\index{malware} spreading work-flow. As a consequence, blocking attacks against web applications avoids the majority of potential infections. Indirectly, all the visitors of the website benefit from the adoption of web-based \acp{IDS}\index{IDS}.
Unfortunately, even the most advanced anomaly detectors of web-based attacks are not without drawbacks. In particular, in this chapter we discuss our contributions to overcome two relevant training issues, overfitting\index{overfitting} and concept drift, both of which manifest themselves as increased \acp{FPR}\index{FPR}. First, we address overfitting\index{overfitting} due to training data scarcity, which results in under-trained activity models with poor generalization capabilities; consequently, a considerable number of normal events are classified as malicious. Second, we propose a simple but extremely effective mechanism to detect changes in the monitored system and adapt the models to what we named web application concept drift.
\section{Preliminaries}
\label{web:intro}
The two contributions described in Sections~\ref{web:longtail} and \ref{web:conceptdrift} are both based on the generic anomaly detection architecture described in the following. In this section, a generic model of an anomaly-based detector of attacks against web applications is given. In addition, the details of the datasets used to evaluate the effectiveness of both approaches are described.
\subsection{Anomaly Detectors of Web-based Attacks}
\label{web:intro:ad}
Unless otherwise stated, we use the shorthand term \emph{anomaly detector} to refer to anomaly-based detectors that leverage unsupervised machine learning techniques. Let us first overview the basic concepts.
Without loss of generality, a set of web applications $A$ can be organized into a set of \emph{resource paths} or \emph{components} $R$, and named \emph{parameters} $P$. For example, $A = \{a_{1}, a_{2}\}$ may contain a blog application $a_1=\mathtt{blog.example.com}$ and an e-commerce application $a_{2}=\mathtt{store.example.com}$. They can be decomposed into their resource paths:
\begin{displaymath}\footnotesize
\begin{array}{cc}
R_{1} = & R_{2} = \\
\left\{
\begin{array}{l}
r_{1,1} = \mathtt{/article/},\\
r_{1,2} = \mathtt{/comments/},\\
r_{1,3} = \mathtt{/comments/edit/},\\
r_{1,4} = \mathtt{/account/},\\
r_{1,5} = \mathtt{/account/password/}
\end{array}
\right\}
&
\left\{
\begin{array}{l}
r_{2,1} = \mathtt{/list/},\\
r_{2,2} = \mathtt{/cart/},\\
r_{2,3} = \mathtt{/cart/add/},\\
r_{2,4} = \mathtt{/account/},\\
r_{2,5} = \mathtt{/account/password/}
\end{array}
\right\}
\end{array}
\end{displaymath}
\noindent In this example, resource path $r_{1,5}$ might take a set of parameters, as part of the \ac{HTTP}\index{HTTP} request. For instance, the following request to $r_{1,5} = \mathtt{/account/password/}$
\begin{logs}
GET /account/password/id/12/oldpwd/14m3/newpwd/1337/ HTTP/1.1
\end{logs}
\noindent can be parsed into the following abstract data structure:
\begin{displaymath}
P_{1,5}=\left\{
\begin{array}{rlrl}
\langle p_{1,5,1} =& \mathtt{id} & v_{1,5,1} =& \mathtt{12}\rangle,\\
\langle p_{1,5,2} =& \mathtt{oldpwd} & v_{1,5,2} =& \mathtt{14m3}\rangle,\\
\langle p_{1,5,3} =& \mathtt{newpwd} & v_{1,5,3} =& \mathtt{1337}\rangle
\end{array}
\right\}.
\end{displaymath}
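As an illustration, the parsing step above can be sketched as follows. The path layout (two leading segments form the resource path, followed by alternating parameter names and values) is an assumption made only for this example, not a general \ac{HTTP} convention.

```python
from urllib.parse import urlparse

def parse_request(application, request_line):
    """Parse a GET request line into the tuple <a, r, P_q>.

    Assumes the RESTful layout of the running example: the first two
    path segments form the resource path, and the remaining segments
    alternate between parameter names and values.
    """
    _method, path, _version = request_line.split()
    segments = [s for s in urlparse(path).path.split("/") if s]
    resource = "/" + "/".join(segments[:2]) + "/"
    # Pair the remaining segments as <name, value> tuples.
    params = dict(zip(segments[2::2], segments[3::2]))
    return application, resource, params

a, r, P = parse_request(
    "blog.example.com",
    "GET /account/password/id/12/oldpwd/14m3/newpwd/1337/ HTTP/1.1",
)
```

Here `P` reproduces the abstract data structure $P_{1,5}$ shown above.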
A generic web application \ac{IDS} based on unsupervised learning
techniques captures the system activity $\mathbb{I}$ as a sequence of
\emph{requests} $Q = \left\{q_1, q_2, \ldots \right\}$, similar to
those exemplified in Section~\ref{detection:id}, issued by web clients
to the set of monitored web applications. Each request $q \in Q$ is
represented by a tuple $\left\langle a_i,r_{i,j},P_q \right\rangle$,
where $P_q$ is a set of parameter name-value pairs such that $P_q
\subseteq P_{i,j}$.
During the initial \emph{training phase}, the anomaly detection system
learns the characteristics of the monitored web applications in terms
of \emph{models}. As new web application, resource path, and
parameter instances are observed, the sets $A$, $R$, and $P$ are
updated. For each unique parameter $p_{i,j,k}$ observed in association
with a particular application $a_i$ and path $r_{i,j}$, a set of
models that characterize the various features of the parameter is
constructed. An activity profile
(Definition~\ref{def:activity-model}) associated with each unique
parameter instance is generated $c_{\left(.\right)} = \left\langle
m^{(1)},m^{(2)},\ldots,m^{(u)},\ldots,m^{(U)} \right\rangle$, which we
name \emph{(parameter) profile} or \emph{model composition}.
Therefore, for each application $a_i$ and resource path $r_{i,j}$, a
set $\mathcal{C}_{i,j}$ of model compositions is constructed, one for
each parameter $p_{i,j,k}\in P_{i,j}$. The \emph{knowledge base} of
an anomaly detection system trained on web application $a_i$ is
denoted by $\mathcal{C}_{a_i}=\bigcup_{j}\mathcal{C}_{i,j}$. A
graphical representation of how a knowledge base is modeled for
multiple web applications is depicted in Figure~\ref{fig:profiles}.
\begin{example}\label{ex:web-models}
In \webanomaly, a profile for a given parameter $p_{i,j,k}$ is the
following tuple:
\begin{displaymath}
c_{i,j,k}=\langle \mtok, \mint, \mlen, \mchar, \mstruct \rangle.
\end{displaymath}
\noindent Similarly to \LibAnomaly:
\begin{itemize}
\item [$\mtok$] models parameter values as a set of legal tokens
  (e.g., the set of possible values for the parameter
  \texttt{gender}, observed during training).
\item [$\mint$] and $\mlen$ describe normal intervals for literal
integers and string lengths, respectively, using the Chebyshev\index{Chebyshev}
inequality.
\item [$\mchar$] models character strings as a ranked frequency
  histogram, named \ac{ICD}, which is compared to observed values
  using the $\chi^2$ or G tests.
\item [$\mstruct$] models sets of character strings by inducing a
  \ac{HMM}. The \ac{HMM}\index{HMM} encodes a probabilistic grammar
  that represents a superset of the strings observed in a training
  set.
\end{itemize}
\noindent Aside from the addition of $\mint$, which is a
straightforward generalization of $\mlen$ to numbers, these models
are those of \LibAnomaly; the interested reader may refer
to~\citep{kruegel:jcn2005:webanomaly} for further details.
\end{example}
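To make the scoring concrete, the following is a minimal sketch of a $\mlen$-style model: it learns the mean and variance of string lengths during training and, at detection time, uses the Chebyshev inequality to bound the probability of a deviation at least as large as the one observed. This is a simplified reading for illustration only, not the exact implementation of \LibAnomaly.

```python
class LengthModel:
    """Sketch of a length model: normal string-length intervals
    scored with the Chebyshev inequality (illustrative, simplified)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def train(self, value: str) -> None:
        length = len(value)
        self.n += 1
        delta = length - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (length - self.mean)

    def score(self, value: str) -> float:
        """Normality score in [0, 1]: the Chebyshev bound
        P(|X - mean| >= dev) <= var / dev^2, capped at 1."""
        var = self.m2 / self.n if self.n > 1 else 0.0
        dev = abs(len(value) - self.mean)
        if dev == 0:
            return 1.0
        return min(1.0, var / (dev * dev))

m = LengthModel()
for v in ["14m3", "s3cr3t", "1337", "passw0rd"]:
    m.train(v)
# A wildly long value (e.g., an injected script tag) scores far lower
# than a value whose length is close to the training distribution.
```

The distribution-free Chebyshev bound is what allows the model to avoid any assumption on how lengths are distributed.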
\begin{figure}[p]
\centering
\includegraphics[angle=-90,width=.6\textwidth]{figures/web/profiles}
\caption{Overview of web application model construction.}
\label{fig:profiles}
\end{figure}
In addition to requests, the structure of user sessions can be taken
into account to model the normal states of a server-side
application. In this case, the anomaly detector does not consider
individual requests independently, but models their sequence. This
model captures the legitimate order of invocation of the resources,
according to the application logic. An example is when a user is
required to invoke an authentication resource (e.g.,
\texttt{/user/auth}) before requesting a private page (e.g.,
\texttt{/user/profile}). In~\citep{kruegel:jcn2005:webanomaly}, a
session $S$ is defined as a sequence of resources in $R$. For
instance, given $R = \{r_{1}, r_{2}, \dots, r_{10}\}$, a sample
session is $S = \langle r_{3}, r_{1}, r_{2}, r_{10}, r_{2}\rangle$.
Some systems also model \ac{HTTP}\index{HTTP} responses that are
returned by the server. For example,
in~\citep{kruegel:jcn2005:webanomaly}, a model $\mdoc$ is presented
that takes into account the structure of documents (e.g.,
\ac{HTML}\index{HTML}, \ac{XML}\index{XML}, and \ac{JSON}\index{JSON})
in terms of partial trees that include security-relevant nodes (e.g.,
\texttt{<script />} nodes, nodes containing \ac{DOM}\index{DOM} event
handlers, and nodes that contain sensitive data such as credit card
numbers). These trees are iteratively merged as new documents are
observed, creating a superset of the allowed document structure and
the positions within the tree where client-side code or sensitive data
may appear. A similar approach is adopted
in~\citep{masibty}.
\begin{note}[Web Application Behavior]
The concept of \emph{web application behavior}, used in the
following, is a particular case of the system behavior as defined by
Definition~\ref{def:system-behavior}. Depending on the accuracy of
each model, the behavior of a web application reflects the
characteristics and functionalities that the application offers and,
as a consequence, the content of the inputs (i.e., the requests)
that it processes and the outputs (i.e., the responses) that it
produces. Thus, unless otherwise stated, we use the same term to
indicate both ideas.
\end{note}
After training, the system is switched to \emph{detection mode}, which
is performed online. The models trained in the previous phase are
queried to determine whether or not the new parameters observed are
anomalous. Without going into the details of a particular
implementation, each parameter is compared to all the applicable
models and an aggregated anomaly score between 0 and 1 is calculated
by composing the values returned by the various models. If the anomaly
score is above a certain threshold, an alert is generated. For
example, in~\citep{kruegel:acsac2003:bayesian}, the anomaly score
represents the probability that a parameter value is anomalous and it
is calculated using a Bayesian network that encodes the probability
that a parameter value is actually normal or anomalous given the
scores returned by the individual models.
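As an illustration of this detection step, the following sketch combines per-model normality scores into one anomaly score and compares it against a threshold. The weighted mean shown here is a deliberately simple stand-in for the Bayesian network of~\citep{kruegel:acsac2003:bayesian}; the weights and threshold are assumptions for this example.

```python
def aggregate_score(model_scores, weights=None, threshold=0.5):
    """Combine per-model normality scores into one anomaly score.

    model_scores maps model name -> normality score in [0, 1], where
    1.0 means perfectly normal. A weighted mean is an illustrative
    choice only; the cited work uses a Bayesian network instead.
    Returns (anomaly_score, is_alert).
    """
    if weights is None:
        weights = {name: 1.0 for name in model_scores}
    total = sum(weights[name] for name in model_scores)
    normality = sum(
        weights[name] * s for name, s in model_scores.items()
    ) / total
    anomaly = 1.0 - normality
    return anomaly, anomaly > threshold

# All models flag the observed parameter value as unusual.
score, alert = aggregate_score(
    {"token": 0.0, "length": 0.1, "char_dist": 0.2, "structure": 0.0}
)
```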
\subsection{A Comprehensive Detection System to Mitigate Web-based
Attacks}
\label{web:intro:masibty}
The approach described in~\citep{kruegel:jcn2005:webanomaly} is
further exploited in a recent work, \citep{masibty}, to which we
partially contributed. In particular, we defined a web application
anomaly detector that is able to detect real-world threats against
the clients (e.g., malicious \textsf{JavaScript}\index{JavaScript} code trying to
exploit browser vulnerabilities), the application (e.g., cross-site
scripting, permanent content injection), and the database layer (e.g.,
SQL injection). A prototype of the system, called \masibty, has been
evaluated on a set of real-world attacks against publicly available
applications, using both plain and mutated versions of exploits, in
order to assess its resilience to evasion. \masibty has the following
key characteristics:
\begin{itemize}
\item Its models are designed with the explicit goal of not requiring
  an attack-free dataset for training, which is an unrealistic
  requirement in real-world applications. Even though
  \citep{10.1109/SP.2008.11} suggests techniques to filter
  outliers (i.e., attacks) from the training data, in the absence of a
  ground truth there can be no guarantee that the dataset will be
  effectively free of attacks. Nevertheless, using such techniques before
  training \masibty would surely improve its detection capabilities.
\item As depicted in Figure~\ref{fig:masibty-architecture}, \masibty
  intercepts and processes both HTTP requests (i.e., \emph{PAnomaly})
  and responses (i.e., \emph{XSS\-Anomaly}), and protects against both
  server-side and client-side attacks, an extremely important feature
  in the upcoming ``Web 2.0'' era of highly interactive websites based
  mainly on user-contributed content. In particular, we devised two
  novel anomaly detection models --- referred to as ``engines'' ---
  based on the representation of the responses as trees.
\item \masibty incorporates an optional data protection component,
  \emph{QueryAnomaly}, that extracts and parses the SQL queries
  sent to the database server. This component is part of the analysis
  of HTTP requests, and thus is not merely a reverse proxy to the
  database. In fact, it binds the requests to the SQL queries
  that they generate, directly or indirectly. Hence, queries can be
  modeled even though they are not explicitly passed as parameters of
  the HTTP requests.
\end{itemize}
\begin{figure}[p]
\begin{center}
\hspace*{2.5cm}
\includegraphics[angle=90,scale=.8]{figures/web/masibty/masibty-architecture}
\end{center}
\vspace*{-1cm}
\caption{The logical structure of \masibty.}
\label{fig:masibty-architecture}
\end{figure}
\subsection{Evaluation Data}
\label{web:intro:eval}
The experiments in Sections~\ref{web:longtail} and \ref{web:conceptdrift} were conducted using a dataset drawn from real-world web applications deployed on both academic and industry web servers. Examples of representative applications include payroll processors, client management, and online commerce sites. For each application, the full content of each \ac{HTTP}\index{HTTP} connection observed over a period of several months was recorded. The resulting flows were then filtered with the signature-based detection system \textsf{Snort}\footnote{Source and attack signatures available for download at \url{http://snort.org}.} to remove known attacks. In total, the dataset contains 823 distinct web applications, 36,392 unique resource paths, 16,671 unique parameters, and 58,734,624 \ac{HTTP}\index{HTTP} requests.
\begin{note}[Dataset privacy]
To preserve privacy, the dataset has been given to the Computer Security Laboratory of UC Santa Barbara under strict contractual agreements that deny the disclosure of specific information identifying the web applications themselves.
\end{note}
Two sets of 100,000 and 1,000 attacks were introduced into the dataset used for the approaches described in Sections~\ref{web:longtail} and \ref{web:conceptdrift}, respectively. These attacks were real-world examples of, and variations upon, \ac{XSS} (e.g., CVE-2009-0781), \ac{SQL} injection (e.g., CVE-2009-1224), and command execution exploits (e.g., CVE-2009-0258) that manifest themselves in request parameter values, which remain the most common attacks against web applications. Representative examples of these attacks include:
\begin{itemize}
\item malicious JavaScript\index{JavaScript} inclusion
\begin{lstlisting}[numbers=none]
<script src="http://example.com/malware.js"></script>
\end{lstlisting}
\item bypassing login authentication
\begin{lstlisting}[numbers=none]
' OR 'x'='x'--
\end{lstlisting}
\item command injection
\begin{lstlisting}[numbers=none]
cat /etc/passwd | mail attacker@gmail.com \#
\end{lstlisting}
\end{itemize}
More precisely, the \ac{XSS}\index{XSS} attacks are variations on those listed in \citep{xss_cheat_sheet}, the \ac{SQL}\index{SQL} injections were created similarly from \citep{sqli_cheat_sheet}, and the command execution exploits were variations of common command injections against the \textsf{Linux} and \textsf{Windows} platforms.
\section{Training With Scarce Data}
\label{web:longtail}
In this section, we describe our contribution \citep{2009_robertson_maggi_kruegel_vigna} to cope with the difficulty of obtaining sufficient \emph{amounts} of training data. As we explained in Sections~\ref{detection:ad:host} and \ref{detection:evaluation:issues}, this limitation has heretofore not been well studied. We developed this technique to work with \webanomaly and, in general, with any learning-based anomaly detection system against web attacks. In fact, the problem described in the following is particularly evident in the case of web applications. However, the technique can be easily applied to other anomaly-based systems, as long as they rely on behavioral models and learning techniques such as those described in Section~\ref{host:syscall}.
The issue that motivates this work is that the number of web application component invocations is non-uniformly distributed. In fact, we noticed that relatively few components dominate the traffic, while the remaining components are accessed relatively infrequently. Therefore, for those components, it is difficult to gather enough training data to accurately model their normal behavior. In statistics, the infrequently accessed population is known as the ``long tail''. Note, however, that this does not necessarily imply a power-law distribution. Clearly, components that are infrequently accessed lead to undertrained models (i.e., models that do not capture the normal behavior accurately). Consequently, models are subject to overfitting\index{overfitting} due to lack of data, resulting in an increased \ac{FPR}.
In particular, in this section, we describe our joint work with UC Santa Barbara in which we propose to mitigate this problem by exploiting natural similarities among all web applications. In particular, we show that the values of the parameters extracted from \ac{HTTP}\index{HTTP} requests can generally be categorized according to their type, such as an integer, date, or string. Indeed, our experiments demonstrate that parameters of similar type induce similar models of normal behavior. This result is then exploited to supplement a local scarcity of training data for a given web application component with similar data from other web applications.
This section is structured as follows. In Section~\ref{web:longtail:motivation}, the problem of the non-uniform distribution of requests across the components of a web application is discussed, and we provide evidence that it occurs in the real world. In Section~\ref{web:longtail:design}, we describe an approach to address the problem of model profile undertraining\index{undertraining} by using the notion of \emph{global profiles}, which exploit similarities between web application parameters of similar type. In Section~\ref{web:longtail:eval}, the application of global profiles is evaluated on a large dataset of real-world traffic from many web applications, demonstrating that global profiles allow anomaly detectors to accurately model web application components that would otherwise be associated with undertrained models.
\subsection{Non-uniformly distributed training data}
\label{web:longtail:motivation}
To describe the undertraining\index{undertraining} problem, in this section we refer to the aforementioned, generic architecture of a web-based anomaly detector. To address the problem of undertraining\index{undertraining}, we leverage the set of models described in Example~\ref{ex:web-models}.
Because anomaly detection systems dynamically learn specifications of
normal behavior from training data, it is clear that the quality of
the detection results critically relies upon the quality of the
training data. For example, as mentioned in
Section~\ref{detection:evaluation:issues}, a training dataset should
be attack-free and should accurately represent the normal behavior of
the modeled features. To our knowledge, the difficulty of obtaining
sufficient amounts of training data to accurately model web
applications is less well-known. In a sense, this issue is similar to
those addressed by statistical analysis methods with missing
data~\citep{little1987statistical}. Although a training procedure would
benefit from such mechanisms, they require a complete redesign of the
training algorithm specific to each model. Instead, a non-obtrusive
approach that can improve an existing system without modifying the
undertrained models is more desirable. Typically, anomaly-based
detectors cannot assume the presence of a testing environment that can
be leveraged to generate realistic training data that exercises the
web application in a safe, attack-free environment. Instead, an
anomaly detection system is typically deployed in front of live web
applications with no \emph{a priori} knowledge of the applications'
components and their behavior.
In particular, in the case of low-traffic web applications, problems
arise if the rate of client requests is inadequate to allow models to
train in a timely manner. Even in the case of high-traffic web
applications, however, a large subset of resource paths might fail to
receive enough requests to adequately train the associated models.
This phenomenon is a direct consequence of the fact that resource path
invocations issued by web clients often follow a non-uniform
distribution. To illustrate this point, Figure~\ref{fig:long_tail}
plots the normalized cumulative distribution function of web client
resource path invocations for a variety of real-world, high-traffic
web applications (details on the source of this data are provided in
Section~\ref{web:longtail:eval}). Although several applications have
an approximately uniform client access distribution, a clear majority
exhibits highly skewed distributions. Indeed, in many cases, a large
percentage of resource paths receive a comparatively minuscule number
of requests. Thus, returning to the example mentioned in
Section~\ref{web:intro:ad}, assuming an overall request volume of
500,000 requests per day, the resource path set would result in the
client access distribution detailed in
Table~\ref{tab:client-access-distribution}.
\begin{table}[t]
\centering
\begin{tabular}{rl}
\toprule
\textsc{Resource Path} & \textsc{Requests}\\
\midrule
\texttt{/article} & 95.0\% \\
\texttt{/comments} & 3.0\% \\
\texttt{/comments/edit} & 1.8\% \\
\texttt{/account} & 0.2\% \\
\texttt{/account/password} & 0.02\%\\
\bottomrule
\end{tabular}
\caption{Client access distribution on a real-world web application (based on 500,000 requests per day).}
\label{tab:client-access-distribution}
\end{table}
Profiles for parameters to resource paths such as \texttt{/article} will receive ample training data, while profiles associated with account components will be undertrained. The impact of the problem is also magnified by the fact that components of a web application that are infrequently exercised are also likely to contain a disproportionately large portion of security vulnerabilities. This could be a consequence of the reduced amount of testing that developers invariably perform on less prominent components of a web application, resulting in a higher rate of software flaws. In addition, the relatively low request rate from users of the web application results in a reduced exposure rate for these flaws. Finally, when flaws are exposed and reported, correcting the flaws may be given a lower priority than those in higher traffic components of a web application.
\begin{figure}[t]
\centering
\includegraphics[width=.85\textwidth]{figures/web/longtail/fig_long_tail}
\caption{Web client resource path invocation distributions from a
selection of real-world web applications.}
\label{fig:long_tail}
\end{figure}
\subsection{Exploiting global knowledge}
\label{web:longtail:design}
The approach described in this section is based on a key observation: parameters associated with the invocation of components belonging to different web applications often exhibit a marked similarity to each other. For instance, many web applications take an integer value as a unique identifier for a class of objects such as a blog article or comment, as in the case of the \texttt{id} parameter. Similarly, applications also accept date ranges similar to the \texttt{date} parameter as identifiers or as constraints upon a search request. Similarly, as in the case of the \texttt{title} parameter, web applications often expect a short phrase of text as an input, or perhaps a longer block of text in the form of a comment body. In some sense, each of these groupings of similar parameters can be considered as distinct \emph{parameter types}, though this need not necessarily correspond to the concept of types as understood in the programming languages context.
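To make the notion of parameter type concrete, the following heuristic sketch classifies a parameter from the shape of its observed values. The three categories and the regular expressions are illustrative assumptions only; as explained next, the actual approach matches profiles rather than declared types.

```python
import re

def infer_type(values):
    """Heuristically classify a parameter by its observed values.
    The categories (integer, date, string) and the patterns are
    illustrative assumptions, not part of the actual system."""
    if values and all(re.fullmatch(r"\d+", v) for v in values):
        return "integer"
    if values and all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "date"
    return "string"

t_id = infer_type(["12", "7", "381"])              # unique identifiers
t_date = infer_type(["2009-03-01", "2009-04-15"])  # date constraints
t_title = infer_type(["A short phrase of text"])   # free-text input
```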
Our approach is based on the following assumption: parameters of the
same type tend to induce model compositions that are similar to each
other in many respects. Consequently, if the lack of training data for
a subset of the components of a web application prevents an anomaly
detection system from constructing accurate profiles for the
parameters of those components, we claim that it is possible to
substitute, with an acceptably low decrease in the \ac{DR}, profiles
for similar parameters of the same type that were learned when enough
training data was available to construct accurate profiles. It must be
underlined that the substitution operates at the granularity of
\emph{parameters} rather than \emph{requests} (which may contain more
than one parameter). This increases the likelihood of finding
applicable profile similarities, and allows for the substitution of
models taken from radically different components. However, although
the experiments on real-world data, described in
Section~\ref{web:longtail:eval}, confirm that the aforementioned
insight is realistic, our hypothesis might not hold in some very
specific settings. Thus, to minimize the risks brought by migrating
global knowledge across different deployments, we interpreted this
result only as an insight and developed a robust criterion able to
find similar profiles independently from the actual types of the
modeled parameters.
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{figures/web/longtail/overview}
\caption{Overall procedure. Profiles, both undertrained and well-trained, are collected from a set of web applications. These profiles are processed offline to generate the global knowledge base $\kb$ and index $\kb^I$. $\kb$ can then be queried to find similar global profiles.}
\label{fig:overview}
\vspace*{-.4cm}
\end{figure*}
As schematized in Figure~\ref{fig:overview} our approach is composed of three phases.
\begin{description}
\item [First phase] (offline) Enhancement of the training procedure originally implemented in~\citep{kruegel:jcn2005:webanomaly}.
\item[Second phase] (offline) It is divided into three sub-steps.
\begin{enumerate}
\item A \emph{global knowledge base} of profiles $\kb=\bigcup_{a_i}\kb_{a_i}$ is constructed, where $\kb_{a_i}$ are knowledge bases containing only well-trained, stable profiles from anomaly detection systems previously deployed on a set of web applications $\bigcup_{i}a_i$.
\item A knowledge base $\kb^I=\bigcup_{a_i}\kb_{a_i}^I$ of undertrained profiles is then constructed as an \emph{index} into $\kb$, where $\kb_{a_i}^I$ is a knowledge base of undertrained profiles from the web application $a_i$.
\item Finally, a \emph{mapping} $f:\left\{\kb^I\right\}\times\kb_{a_i}\mapsto\kb$ is defined.
\end{enumerate}
\item[Third phase] (online) For any new web application where insufficient training data is available for a component's parameter, the anomaly detector first extracts the undertrained profile $c'$ from the local knowledge base $\kb_{a_i}$. Then, the global knowledge base $\kb$ is queried to find a similar, previously constructed profile $f\left(\kb_I,c'\right)=c\in\kb$. The well-trained profile $c$ is then substituted for the undertrained profile $c'$ in the detection process.
\end{description}
The following sections detail how $\kb_I$ and $\kb$ are constructed, and how $f$ maps elements in $\kb_I$ and $\kb_{a_i}$ to elements in $\kb$.
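The third-phase lookup can be sketched as a nearest-neighbor query against the global knowledge base. Representing each profile as a fixed-length feature vector and using the Euclidean distance are assumptions made only for this illustration; the actual mapping $f$ and the profile-similarity criterion are defined in the remainder of this section.

```python
import math

def profile_distance(c1, c2):
    """Euclidean distance between profile feature vectors. Summarizing
    a profile as a flat vector of per-model features is an illustrative
    representation, not the one used by the actual system."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def lookup(global_kb, undertrained_profile):
    """Map an undertrained profile c' to the most similar well-trained
    profile c in the global knowledge base."""
    return min(global_kb, key=lambda c: profile_distance(c, undertrained_profile))

# Toy feature vectors: (mean length, numeric fraction, token-set size).
global_kb = [
    (2.1, 1.0, 12.0),    # integer-like parameter
    (60.4, 0.1, 410.0),  # free-text parameter
    (10.2, 0.4, 35.0),   # date-like parameter
]
c_prime = (3.0, 0.9, 9.0)  # undertrained profile, integer-like
substitute = lookup(global_kb, c_prime)
```

The well-trained `substitute` would then replace `c_prime` during detection, as described in the third phase above.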
\subsubsection{First Phase: Enhanced Training}
\label{web:longtail:design:enhanced-training}
A significant refinement of the individual models described in~\citep{kruegel:jcn2005:webanomaly} is the criterion used to determine the length of the training phase. An obvious choice is to fix a constant training length, e.g., a thousand requests. Unfortunately, an appropriate training length is dependent upon the complexity of modeling a given set of features. Therefore, we have developed an automated method that leverages two stop criteria, \emph{model stability} and \emph{model confidence}, to determine when a model has observed enough training samples to accurately approximate the normal behavior of a parameter.
\paragraph{Model Stability}
\label{web:longtail:design:enhanced-training:stability}
As new training samples are observed early in the training phase, the
state of a model typically exhibits frequent and significant change as
its approximation of the normal behavior of a parameter is updated.
Informally, in an information-theoretic sense, the average information
gain of new training samples is high. As a model's state converges to
a more precise approximation of normal behavior, however, its state
exhibits infrequent and incremental changes. In other words, the
information gain of new training samples approaches zero, and the
model stabilizes. Note that we refer to information gain as an
informal, intuitive concept to explain the rationale behind the
development of a sound model stability criterion; we do not claim
that our procedure relies on the actual information gain to determine
when a model converges to stability.
Each model self-assesses its stability during the training phase by maintaining a history of snapshots of its internal state. At regular intervals, a model checks if the sequence of deltas between each successive historical state is monotonically decreasing and whether the degree of change drops below a certain threshold. If both conditions are satisfied, then the model considers itself stable. Let $\kappa_{\text{stable}}^{\left(u\right)}$ denote the number of training samples required for a model to achieve stability. A profile is considered stable when all of its constituent models are stable.
\begin{definition}[Profile stability]
Let $\kappa_{\text{stable}}^{\left(u\right)}$ be the number of training samples required for the model of type $u$ to achieve stability. The \emph{aggregated stability} measure is defined as:
\begin{equation}
\kappa_{\text{stable}}=\max_{u\in
U}\kappa_{\text{stable}}^{\left(u\right)}.
\end{equation}
\end{definition}
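The snapshot-based stability check can be sketched as follows. The delta-history representation, the threshold value, and the function names are illustrative assumptions, not the implementation used in this work.

```python
def is_stable(deltas, threshold=0.05):
    """Check whether a model's recent state deltas indicate stability.

    `deltas` holds the magnitude of change between successive state
    snapshots, taken at regular intervals during training (illustrative).
    """
    # Condition 1: the sequence of deltas must be monotonically decreasing.
    monotone = all(d1 >= d2 for d1, d2 in zip(deltas, deltas[1:]))
    # Condition 2: the latest degree of change must drop below the threshold.
    return monotone and deltas[-1] < threshold


def profile_stable(per_model_deltas):
    """A profile is stable when all of its constituent models are stable."""
    return all(is_stable(d) for d in per_model_deltas.values())
```

The aggregated stability $\kappa_{\text{stable}}$ then corresponds to the training step at which the last constituent model passes this check.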
The notion of model stability is also leveraged in the third phase, as detailed in Section~\ref{web:longtail:design:global}.
\begin{note}
Instead of describing the internal stop criterion specific to each
model, if any, we developed a model-agnostic minimization algorithm
detailed in Section~\ref{web:longtail:design:mapping} (and evaluated
in Section~\ref{web:longtail:eval:robustness}) that allows one to
trade off detection accuracy against the number of training samples
available.
\end{note}
\paragraph{Model Confidence}
\label{web:longtail:design:enhanced-training:confidence}
A related notion to model stability is that of \emph{model confidence}. Recall that the goal of a model is to learn an abstraction of the normal behavior of a feature from the training data. There exists a well-known tension in the learning process between accurately fitting the model to the data and generalizing to data that has not been observed in the training set. Therefore, in addition to detecting whether a model has observed enough training samples to accurately model a feature, each model incorporates a self-confidence measure $z^{\left(u\right)}$ that represents whether the model has generalized so much as to lose all predictive power.
For example, the confidence of a token model should be directly related to the probability that a token set is present.
\begin{definition}[Token model confidence]
The \emph{token model confidence} is defined as:
\begin{equation}
z^{\left(\text{tok}\right)}=\tau=\frac{1}{2}+\frac{n_c-n_d}{4n\left(n-1\right)},
\end{equation}
where $n_c$ and $n_d$ are the number of concordant and discordant pairs, respectively, between the unique number of observed samples and the total number of samples observed at each training step, and $n$ is the total number of training samples.
\end{definition}
Note that $z^{\left(\text{tok}\right)}$ is an adaptation of the $\tau$ coefficient~\citep{kendalltau}.
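As a sketch, the pair counting underlying $z^{\left(\text{tok}\right)}$ might be computed as follows; the series names are illustrative, and treating tied pairs as neither concordant nor discordant is our reading of the definition.

```python
def token_confidence(unique_counts, total_counts):
    """Kendall-tau-style confidence of the token model (sketch).

    unique_counts[i] and total_counts[i] are the number of unique and
    total samples observed after training step i, respectively.
    """
    n = len(total_counts)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            du = unique_counts[j] - unique_counts[i]
            dt = total_counts[j] - total_counts[i]
            if du * dt > 0:
                n_c += 1  # concordant pair: both series move together
            elif du * dt < 0:
                n_d += 1  # discordant pair: series move in opposite directions
    return 0.5 + (n_c - n_d) / (4.0 * n * (n - 1))
```

For a parameter drawn from a small token set, the unique-count series plateaus while the total count keeps growing, so few pairs are concordant and the confidence stays close to $\frac{1}{2}$.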
For the integer and length models, the generality of a particular model can be related to the statistical dispersion of the training set.
\begin{definition}[Integer model confidence]
Given the observed standard deviation $\sigma$, minimum observed value $u$, and maximum observed value $v$, the \emph{confidence of an integer or length model} is defined as:
\begin{equation}
z^{\left(\left\{\text{int,len}\right\}\right)}=
\begin{cases}
1-\frac{\sigma}{v-u} &\text{if } v-u>0 \\
1 &\text{otherwise}
\end{cases}.
\end{equation}
\end{definition}
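A minimal sketch of this dispersion-based confidence, assuming $\sigma$ is the population standard deviation of the observed values:

```python
import statistics


def dispersion_confidence(values):
    """Integer/length model confidence: 1 - sigma / (v - u), or 1 if v = u."""
    u, v = min(values), max(values)
    if v - u == 0:
        # All observations identical: the model is maximally confident.
        return 1.0
    return 1.0 - statistics.pstdev(values) / (v - u)
```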
The confidence of a character distribution model is determined by the variance of observed \acp{ICD}\index{ICD}. Therefore, the confidence of this model is given by a similar construction as the previous one, except that instead of operating over observed values, the measure operates over observed variances.
\begin{definition}[Char. distr. model confidence]
Given the standard deviation of observed variances $\sigma$, the minimum observed variance $u$, and the maximum observed variance $v$, the \emph{confidence of a character distribution model} is:
\begin{equation}
z^{\left(\text{char}\right)}=
\begin{cases}
1-\frac{\sigma}{v-u} &\text{if } v-u>0 \\
1 &\text{otherwise}
\end{cases}.
\end{equation}
\end{definition}
Finally, the \ac{HMM}\index{HMM} induced by the structure model can be directly analyzed for generality. We roughly estimate this generality by computing the mean probability of any symbol from the learned alphabet being emitted at any state.
\begin{definition}[Struct. model confidence]
Given a structural model, i.e. an \ac{HMM}, specified by the tuple $\langle\mathbb{S},\mathbb{O},M_{\mathbb{S}\times\mathbb{S}}, P\left(\mathbb{S},\mathbb{O}\right),P\left(\mathbb{S}\right)\rangle$, where $\mathbb{S}$ is the set of states, $\mathbb{O}$ is the set of emissions, $M_{\mathbb{S}\times\mathbb{S}}$ is the state transition probability matrix, $P\left(\mathbb{S},\mathbb{O}\right)$ is the emission probability distribution over the set of states, and $P\left(\mathbb{S}\right)$ is the initial probability assignment, the \emph{structural model confidence} is defined as:
\begin{equation}
z^{\left(\text{struct}\right)}=1-
\frac{1}{\left|\mathbb{S}\right|\left|\mathbb{O}\right|}
\sum_{i=1}^{\left|\mathbb{S}\right|}
\sum_{j=1}^{\left|\mathbb{O}\right|}P\left(s_i,o_j\right).
\end{equation}
\end{definition}
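Assuming the emission distribution is available as an $\left|\mathbb{S}\right|\times\left|\mathbb{O}\right|$ matrix, the confidence reduces to one minus the mean entry; a sketch:

```python
def structural_confidence(emission_probs):
    """Structural model confidence: 1 minus the mean emission probability
    P(s_i, o_j) over all (state, symbol) pairs (sketch)."""
    n_states = len(emission_probs)
    n_symbols = len(emission_probs[0])
    total = sum(sum(row) for row in emission_probs)
    return 1.0 - total / (n_states * n_symbols)
```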
At the end of this phase, for each web application $a_{i}$, the profiles are stored in the corresponding knowledge base $\kb_{a_{i}}$ and are ready to be processed by the next phase.
\subsubsection{Second Phase: Building a Global Knowledge Base}
\label{web:longtail:design:global}
This phase is divided into the three sub-steps described in the following.
\paragraph{Well-trained profiles}
The construction of $\kb$ begins by merging a collection of knowledge bases $\left\{\kb_{a_1},\kb_{a_2},\ldots,\kb_{a_n}\right\}$ that have previously been built by a web application anomaly detector over a set of web applications $\bigcup_{i}a_i$. The profiles in $\kb$ are then clustered in order to group profiles that are semantically similar to each other. Profile clustering is performed both to reduce query execution time when indexing into $\kb$ and to validate the notion of parameter types. In this work, an agglomerative hierarchical clustering algorithm with group average linkage was applied, although the clustering stage is, in principle, agnostic to the specific algorithm. For an in-depth discussion of clustering algorithms and techniques, we refer the reader to~\citep{clustering:survey}.
Central to any clustering algorithm is the distance function, which defines how distances between the objects to be clustered are calculated. A suitable distance function must reflect the semantics of the objects under consideration, and should satisfy two conditions:
\begin{itemize}
\item the overall similarity (i.e., inverse of the distance) between elements within the same cluster is maximized, and
\item the similarity between elements within different clusters is minimized.
\end{itemize}
We define the distance between two profiles as the mean of the distances between the models common to both profiles. More formally.
\begin{definition}[Profile distance]
The \emph{distance between two profiles} $c_i$ and $c_j$ is defined as:
\begin{equation}
d\left(c_i,c_j\right)=\frac{1}{\left|c_i\bigcap c_j\right|}
\sum_{m_i^{\left(u\right)},m_j^{\left(u\right)}\in c_i\bigcap c_j}
\delta_u\left(m_i^{\left(u\right)},m_j^{\left(u\right)}\right),
\label{eq:dist}
\end{equation}
where $\delta_u:\mathcal{M}_u\times\mathcal{M}_u\mapsto\left[0,1\right]$ is the distance function defined between models of type $u\in U=\left\{\text{tok}, \text{int}, \text{len}, \text{char}, \text{struct}\right\}$.
\end{definition}
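Equation~\eqref{eq:dist} averages the per-model distances over the model types the two profiles share; a sketch, in which the profile and distance-table representations are illustrative assumptions:

```python
def profile_distance(profile_i, profile_j, delta):
    """Mean per-model distance over model types present in both profiles.

    profile_*: dict mapping a model type ('tok', 'int', ...) to model state;
    delta: dict mapping a model type to its distance function into [0, 1].
    """
    shared = set(profile_i) & set(profile_j)
    return sum(delta[u](profile_i[u], profile_j[u]) for u in shared) / len(shared)
```

The per-type functions $\delta_u$ defined in the remainder of this section plug directly into the `delta` table.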
The token model $\mtok$ is represented as a set of unique tokens observed during the training phase. Therefore, two token models $\mtok_i$ and $\mtok_j$ are considered similar if they contain similar sets of tokens. More formally.
\begin{definition}[Token models distance]
The \emph{distance function for token models} is defined as the Jaccard\index{Jaccard} distance~\citep{clustering:distance_metrics_survey}:
\begin{equation}
\delta_{\text{tok}}\left(\mtok_i,\mtok_j\right)=
1-\frac{\left|\mtok_i\bigcap\mtok_j\right|}{\left|\mtok_i\bigcup\mtok_j\right|}.
\end{equation}
\end{definition}
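A direct sketch of the Jaccard distance over token sets (the structural model distance defined later in this section applies the same measure to emission sets):

```python
def jaccard_distance(set_i, set_j):
    """Jaccard distance: 1 - |intersection| / |union|."""
    return 1.0 - len(set_i & set_j) / len(set_i | set_j)
```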
The integer model $\mint$ is parametrized by the sample mean $\mu$ and sample variance $\sigma^2$ of observed integers. Two integer models $\mint_i$ and $\mint_j$ are similar if these parameters are also similar.
\begin{definition}[Integer models distance]
The \emph{distance function for integer models} is defined as:
\begin{equation}
\delta_{\text{int}}\left(\mint_i,\mint_j\right)=
\frac{\left\|\frac{\sigma_i^2}{\mu_i^2}-\frac{\sigma_j^2}{\mu_j^2}\right\|}
{\frac{\sigma_i^2}{\mu_i^2}+\frac{\sigma_j^2}{\mu_j^2}}.
\end{equation}
\end{definition}
\begin{note}[Length models distance]
As the string length model is internally identical to the integer model, its distance function $\delta_{\text{len}}$ is defined similarly.
\end{note}
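The integer (and length) model distance compares the ratios $\sigma^2/\mu^2$ of the two models; a sketch, assuming $\mu\neq 0$ for both:

```python
def integer_distance(mu_i, var_i, mu_j, var_j):
    """Normalized difference of the variance-to-squared-mean ratios."""
    r_i = var_i / mu_i ** 2
    r_j = var_j / mu_j ** 2
    # Normalizing by the sum keeps the distance within [0, 1).
    return abs(r_i - r_j) / (r_i + r_j)
```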
Recall that the character distribution model $\mchar$ learns the frequencies of individual characters comprising strings observed during the training phase. These frequencies are then ranked and coalesced into $n$ bins to create an \ac{ICD}\index{ICD}. Two character distribution models $\mchar_i$ and $\mchar_j$ are considered similar if each model's \acp{ICD}\index{ICD} are similar. More formally.
\begin{definition}[Char. distr. models distance]
The \emph{character distribution models distance} is defined as:
\begin{equation}
\delta_{\text{char}}\left(\mchar_i,\mchar_j\right)=
\sum_{l=0}^{n}\frac{\left\|b_i\left(l\right)-b_j\left(l\right)\right\|}
{\max_{k=i,j}b_k\left(l\right)},
\end{equation}
where $b_i\left(k\right)$ is the value of bin $k$ for $\mchar_i$.
\end{definition}
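This bin-wise comparison can be sketched as follows; skipping bins that are empty in both \acp{ICD} is our assumption, since the definition leaves the $0/0$ case unspecified.

```python
def char_distance(bins_i, bins_j):
    """Bin-wise normalized difference between two ICDs (sketch)."""
    total = 0.0
    for b_i, b_j in zip(bins_i, bins_j):
        m = max(b_i, b_j)
        if m > 0:  # bins empty in both models contribute nothing (assumption)
            total += abs(b_i - b_j) / m
    return total
```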
The structural model $\mstruct$ builds an \ac{HMM}\index{HMM} by observing a sequence of character strings. The resulting \ac{HMM}\index{HMM} encodes a probabilistic grammar that can produce a superset of the strings observed during the training phase. The \ac{HMM}\index{HMM} is specified by the tuple $\langle\mathbb{S},\mathbb{O},M_{\mathbb{S}\times\mathbb{S}}, P\left(\mathbb{S},\mathbb{O}\right),P\left(\mathbb{S}\right)\rangle$. Several distance metrics have been proposed to evaluate the similarity between \acp{HMM}\index{HMM}~\citep{hmm:metrics:1, hmm:metrics:2, hmm:metrics:3, hmm:metrics:4}. Their time complexity, however, is non-negligible. Therefore, we adopt a less precise, but considerably more efficient, distance metric between two structural models $\mstruct_i$ and $\mstruct_j$.
\begin{definition}[Struct. models distance]
The \emph{structural models distance} is defined as the Jaccard\index{Jaccard} distance between their respective emission sets:
\begin{equation}
\delta_{\text{struct}}\left(\mstruct_i,\mstruct_j\right)=
1-\frac{\left|\mathbb{O}_i\bigcap\mathbb{O}_j\right|}
{\left|\mathbb{O}_i\bigcup\mathbb{O}_j\right|}.
\end{equation}
\end{definition}
\paragraph{Undertrained profiles}
Once the global knowledge base $\kb$ has been built, the next step is to construct a knowledge base of undertrained models $\kb^I$. This knowledge base is composed of profiles that have been deliberately undertrained on $\kappa\ll\kappa_{\text{stable}}$ training samples for each of the web applications $a_i$. Because each member of $\kb^I$ uniquely corresponds to a member in $\kb$ and undertrained profiles are built for each well-trained profile in $\kb$, a bijective mapping $f':\kb^I\mapsto\kb$ exists between $\kb^I$ and $\kb$. Therefore, when a web application parameter is identified as likely to be undertrained, the corresponding undertrained profile $c'$ can be compared to a similar undertrained profile in $\kb^I$, which is then used to select a corresponding stable profile from $\kb$. This operation requires the knowledge base $\kb^{I}$ to be indexed. The construction of the indices relies on the notion of model stability, described in Section~\ref{web:longtail:design:enhanced-training:stability}.
\begin{figure}[t]
\centering
\begin{tikzpicture}[el/.style={circle,fill=black!75,inner sep=1pt,label={above right:#1}},every label/.style = {font=\footnotesize},every node/.style={font=\footnotesize},ci/.style={shape=rectangle,thin,draw,inner sep=2pt}]
\foreach \p in {1, 2, ..., 7}
\foreach \k in {1, 2, ..., 8}
\node [el] at ($ (5.5,0) + .8*(\k,\p)$) (d_\k_\p) {};
\foreach \p/\k in {1/1, 2/2, 3/4, 4/8, 5/16, 6/32, 7/64}
\node at ($ (5.5,0) + .8*(0,\p)$) {$\kappa = \k$};
\node [ci,fit=(d_1_1)] {}
node [ci,fit=(d_2_1)] {}
node [ci,fit=(d_4_1)] {}
node [ci,fit=(d_5_1)] {}
node [ci,fit=(d_6_1)] {}
node [ci,fit=(d_7_1)] {};
\node [ci,fit=(d_1_2) (d_2_2)] {}
node [ci,inner sep=3.5pt,fit=(d_2_2) (d_3_2)] {}
node [ci,fit=(d_4_2) (d_5_2)] {}
node [ci,fit=(d_7_2) (d_8_2)] {};
\node [ci,fit=(d_2_3) (d_3_3) (d_4_3) (d_5_3)] {}
node [ci,inner sep=3.5pt,fit=(d_5_3) (d_6_3) (d_7_3) (d_8_3)] {};
\node [ci,fit=(d_1_4) (d_8_4)] {};
\node [rotate=90] at (13,3.25) {Requests in $Q^{\left(p\right)}$};
\end{tikzpicture}
\caption{Partitioning of the training set $Q$ for various $\kappa$.}
\label{fig:slicing}
\end{figure}
The undertrained profiles that comprise $\kb^I$ are constructed using
the following procedure. Let
$Q^{\left(p\right)}=\{q_1^{\left(p\right)},q_2^{\left(p\right)},\ldots\}$
denote a sequence of client requests containing parameter $p$ for a
given web application. Over this sequence of requests, profiles are
deliberately undertrained on randomly sampled $\kappa$-sequences.
Each of the resulting profiles is then added to the set $\kb^I$.
Figure~\ref{fig:slicing} depicts this procedure for various $\kappa$.
Note that the random sub-sampling is performed with the goal of
inducing undertraining\index{undertraining} to show that clustering is feasible and leads
to the desired grouping even -- and especially -- in the presence of
undertraining\index{undertraining}.
Recall that $\kappa_{\text{stable}}$ is the number of training samples required for a profile to achieve stability. Then, a profile is considered undertrained when the number of training samples it has observed, $\kappa$, is significantly less than the number required to achieve stability. Therefore, an undertrained knowledge base $\kb^I$ is a set of profiles that have observed $\kappa$ samples such that $\kappa\ll\kappa_{\text{stable}}$.
The selection of an appropriate value for $\kappa$ is central to both
the efficiency and the accuracy of this process. Clearly, it is
desirable to minimize $\kappa$ in order to be able to index into $\kb$
as quickly as possible once a parameter has been identified to be
subject to undertraining\index{undertraining} at runtime. On the other hand, setting
$\kappa$ too low is problematic, as Figure~\ref{fig:clustering}
indicates. For low values of $\kappa$, profiles are distributed with
relatively high uniformity within $\kb^I$, such that clusters in
$\kb^I$ are significantly different from clusters of well-trained
profiles in $\kb$. Therefore, slight differences in the state of the
individual models can cause profiles that are close in $\kb^I$ to map
to radically different profiles in $\kb$. As
$\kappa\rightarrow\kappa_{\text{stable}}$, however, profiles tend to
form meaningful clusters, and tend to approximate those found in
$\kb$. Therefore, as $\kappa$ increases, profiles that are close in
$\kb^I$ become close in $\kb$ under $f$ -- in other words, $f$ becomes
\emph{robust} with respect to model semantics.
\begin{figure}[p]
\centering
\begin{tikzpicture}[el/.style={circle,fill=black!75,inner sep=1pt,label={above right:#1}},every label/.style = {font=\footnotesize},every node/.style={font=\footnotesize},ci/.style={shape=rectangle,thin,draw,inner sep=2pt}]
\node [el,label={above right:$c_{p_{1,1}}$}] at (0, 0) (1_p_1) {};
\node [el,label={above right:$c_{p_{2,1}}$}] at (0.5, 0.75) (1_p_2) {};
\node [el,label={above right:$c_{p_{3,1}}$}] at (0, 1.5) (1_p_3) {} ;
\node [el,label={above right:$c_{p_{4,1}}$}] at (-0.5, 0) (1_p_4) {};
\node [el,label={above right:$c_{p_{5,1}}$}] at (1, 0.5) (1_p_5) {};
\node [el,label={above right:$c_{p_{6,1}}$}] at (0, 1) (1_p_6) {};
\node [el,label={above right:$c_{p_{7,1}}$}] at (0.5, 1.75) (1_p_7) {};
\node [el,label={above right:$c_{p_{8,1}}$}] at (-1, 1) (1_p_8) {};
\node [el,label={above right:$c_{p_{9,1}}$}] at (-1.5, 1.6) (1_p_9) {};
\node [el,label={above right:$c_{p_{10,1}}$}] at (-1.25, 1.4) (1_p_10) {};
\foreach \k/\name/\l in {3/2/2, 6/4/4, 8.5/K/\kappa}
\foreach \i in {1, 2, ..., 10}
\node [el,label={above right:$c_{p_{\i,\l}}$}] at ($ (1_p_\i) + (-1/\k*rand,\k) $) (\name_p_\i) {};
\foreach \i in {1, 2, ..., 10}
\node [el,label={above right:$c_{p_{\i}}$}] at ($ (K_p_\i) + (0,3) $) (s_p_\i) {};
\foreach \k/\name in {1/1, 3/2, 6/4, 8.5/K}
\foreach \i in {1, 2, ..., 10}
\foreach \j in {1, 2, ..., 10}
\fill [black,opacity=.45] ($ (\name_p_\i) + 0.8/\k*(rand,rand) $) circle (.7pt);
\foreach \from/\to in {1/2, 2/4, 4/K}
\foreach \i in {1, 2, ..., 10}
\draw [opacity=.35] (\from_p_\i) -- (\to_p_\i);
\node [el,inner sep=1.5pt,label={above right:$\mathbf{c}$}] at ($(K_p_8) - (.4, .4) $) (p) {};
\node [circle,draw=black,inner sep=10pt,fill=black,opacity=.18] at ($ (p) $) (pn) {};
\draw [-stealth] (p) to ($ (pn) - (10pt,10pt) $) {} node at ($ (p) - (.1,.25) $) {};
\node [circle,draw=black,inner sep=15pt,fill=black,opacity=.15] at ($ (p) $) (pnn) {};
\draw [-stealth] (p) to ($ (pn) - (22pt,0) $) {} node at ($ (p) - (.25,-.2) $) {};
\draw [-stealth,dashed] (p) to [bend left=25] ($ (s_p_8) - (0.25,0)$) {};
\node [draw = black!35, inner sep=5pt, thin, ellipse, fit = (1_p_1) (1_p_2) (1_p_3) (1_p_4) (1_p_5) (1_p_6) (1_p_7) (1_p_8) (1_p_9) (1_p_10)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (2_p_8) (2_p_9) (2_p_10)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (2_p_4)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (2_p_1) (2_p_2) (2_p_3) (2_p_5) (2_p_6) (2_p_7)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (4_p_8) (4_p_9) (4_p_10)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (4_p_1) (4_p_4) (4_p_5)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (4_p_2) (4_p_3) (4_p_6) (4_p_7)]{};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (K_p_8) (K_p_9) (K_p_10)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (K_p_6) (K_p_5) (K_p_2)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (K_p_1) (K_p_4)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (K_p_3) (K_p_7)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (s_p_8) (s_p_9) (s_p_10)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (s_p_6) (s_p_5) (s_p_2)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (s_p_1) (s_p_4)] {};
\node [draw = black!35, inner sep=0pt, thin, ellipse, fit = (s_p_3) (s_p_7)] {};
\foreach \i/\k in {-1/0, 2.5/1, 5.5/2, 8.25/4, 11/\kappa_{min}, 14/{\kappa_{stable} \gg \kappa_{min}}}
\draw [black!40] (-3, \i) to (3, \i) {}
node [color=black,right] {$\kappa = \k$};
\node at (-3, 0.75) {$\mathcal{C}_{1}^{I}$};
\node at (-3, 4) {$\mathcal{C}_{2}^{I}$};
\node at (-3, 6.75) {$\mathcal{C}_{4}^{I}$};
\node at (-3, 9.5) {$\mathcal{C}_{\kappa_{min}}^{I}$};
\node [rotate=90] at (11, 9.5) {Well-trained};
\node at (10, 9.5) {$\mathcal{C}$};
\node at ($ (K_p_7) + (1,1) $) {};
\end{tikzpicture}
\caption{Procedure for building global knowledge base indices.}
\label{fig:clustering}
\vspace*{-.4cm}
\end{figure}
\begin{note}[Robustness]
Our use of the term ``robustness'' is related, but not necessarily equivalent, to the definition of robustness in statistics (i.e., the property of a model to perform well even in the presence of small changes in the underlying assumptions).
\end{note}
\paragraph{Mapping undertrained profiles to well-trained profiles}
\label{web:longtail:design:mapping}
A principled criterion is required for balancing quick indexing against a robust profile mapping. Accordingly, we first construct a candidate knowledge base $\kb_{\kappa}^I$ for a given $\kappa$, as in the case of the global knowledge base construction. Then, we define a \emph{robustness metric} as follows. Let $H^I=\bigcup_{i}h_i^I$ be the set of clusters in $\kb^I$, and $H=\bigcup_{i}h_i$ be the set of clusters in $\kb$. Let $g:H^I\mapsto\mathbb{N}$ be a mapping from an undertrained cluster to the maximum number of elements in that cluster that map to the same cluster in $\kb$. The robustness metric $\rho:\left\{\kb^I\right\}\mapsto\left[0,1\right]$ is then defined as
\begin{equation}
\rho\left(\kb^I\right)=\frac{1}{\left|\kb^I\right|}\sum_{i}g\left(h_i^I\right).
\end{equation}
With this metric, an appropriate value for $\kappa$ can now be chosen as
\begin{equation}
\kappa_{\text{min}}=\min_{\kappa}\left(\rho\left(\kb_{\kappa}^I\right)\geq \rho_{\text{min}}\right),
\end{equation}
where $\rho_{\text{min}}$ is a minimum robustness threshold.
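Under the assumption that cluster memberships are available as plain collections, the robustness metric $\rho$ can be sketched as:

```python
from collections import Counter


def robustness(undertrained_clusters, well_trained_cluster_of):
    """rho: fraction of undertrained profiles whose cluster maps coherently.

    undertrained_clusters: list of clusters (lists of profile ids) in KB^I;
    well_trained_cluster_of: profile id -> id of its stable cluster in KB.
    """
    kb_i_size = sum(len(h) for h in undertrained_clusters)
    score = 0
    for h in undertrained_clusters:
        # g(h): size of the largest subgroup of h mapping to one KB cluster.
        counts = Counter(well_trained_cluster_of[p] for p in h)
        score += max(counts.values())
    return score / kb_i_size
```

$\kappa_{\text{min}}$ is then the smallest $\kappa$ whose candidate knowledge base $\kb_{\kappa}^I$ scores at least $\rho_{\text{min}}$.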
With the construction of a global knowledge base $\kb$ and an undertrained knowledge base $\kb^I$, online querying of $\kb$ can then be performed.
\subsubsection{Third Phase: Querying and Substitution}
Given an undertrained profile $c'$ from an anomaly detector deployed over a web application $a_i$, the mapping $f:\left\{\kb^I\right\}\times\kb_{a_i}\mapsto\kb$ is defined as follows. A nearest\hyp{}neighbor match is performed between $c'$ and the previously constructed clusters $H^I$ from $\kb^I$ to discover the most similar cluster of undertrained profiles. This is done to avoid a full scan of the entire knowledge base, which would be prohibitively expensive due to the cardinality of $\kb^I$.
Then, using the same distance metric defined in Equation~\eqref{eq:dist}, a nearest-neighbor match is performed between $c'$ and the members of the selected cluster to discover the most similar undertrained profile $c^I$. Finally, the global, well-trained profile $f'\left(c^I\right)=c$ is substituted for $c'$ for the web application $a_i$.
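The two-stage lookup can be sketched as follows; the centroid-based cluster representation and the dictionary encoding of $f'$ are illustrative assumptions.

```python
def query_global_kb(c_prime, clusters, centroids, f_prime, distance):
    """Two-stage lookup of a well-trained substitute for c_prime (sketch).

    clusters/centroids describe the undertrained clusters H^I; f_prime maps
    an undertrained profile to its well-trained counterpart in KB (f').
    """
    # Stage 1: nearest cluster of undertrained profiles (avoids a full scan).
    best = min(range(len(centroids)),
               key=lambda i: distance(c_prime, centroids[i]))
    # Stage 2: nearest undertrained profile within that cluster.
    c_index = min(clusters[best], key=lambda c: distance(c_prime, c))
    # Substitute the corresponding well-trained profile.
    return f_prime[c_index]
```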
To make explicit how global profiles can be used to address a scarcity
of training data, consider the example introduced in
Section~\ref{web:intro:ad}. Since the resource path
\texttt{/account/password} has received only 100 requests (see
Table~\ref{tab:client-access-distribution}: \texttt{/account/password}
has received $0.02\%$ of 500,000 total requests), the profiles for
each of its parameters
$\left\{\mathtt{id},\mathtt{oldpw},\mathtt{newpw}\right\}$ are
undertrained, as determined by each profile's aggregate stability
measure. In the absence of a global knowledge base, the anomaly
detector would provide no protection against attacks manifesting
themselves in the values passed to any of these parameters.
If, however, a global knowledge base and index are available, the situation is considerably improved. Given $\kb$ and $\kb^I$, the anomaly detector can simply apply $f$ for each of the undertrained parameters to find a well-trained profile from the global knowledge base that accurately models a parameter with similar characteristics \emph{from another web application}. Then, these profiles can be substituted for the undertrained profiles for each of $\left\{\mathtt{id},\mathtt{oldpw},\mathtt{newpw}\right\}$. As will be demonstrated in the following section, the substitution of global profiles provides an acceptable detection accuracy for what would otherwise be an unprotected component (i.e., without a global profile 0\% of the attacks against that component would be detected).
\begin{note}[HTTP parameter ``semantics'']
It is important to underline that the goal of this approach is not to
understand the types of the fields, nor their semantics. If the
system administrator and the security officer had enough time and
skill, manual tools such as live debuggers or symbolic execution
engines could help infer the types of the fields; once available,
this information could be exploited as a lookup criterion.
Unfortunately, this would be a human-intensive task, far from
automatic.
Instead, our goal is to find groups of similar models, regardless of
what this may imply in terms of types. According to our experiments,
similar models tend to be associated with similar types. Thus, in
principle, our method cannot be used to infer parameter types, but it
can certainly give insight into parameters with similar features.
\end{note}
\subsection{Experimental Results}
\label{web:longtail:eval}
The goal of this evaluation is three-fold. First, we investigate the
effects of profile clustering, and support the hypothesis of parameter
types by examining global knowledge base clusters. Then, we discuss
how the quality of the mapping between undertrained profiles and
well-trained profiles improves as the training slice length $\kappa$
is increased. Finally, we present results regarding the accuracy of a
web application anomaly detection system incorporating the application
of a global knowledge base to address training data scarcity.
\subsubsection{Profile clustering quality}
\label{web:longtail:eval:clustering}
\begin{figure}[p]
\centering
\subfloat[][$\kappa = 64$] {
\includegraphics[scale=.5]{figures/web/longtail/undertrained-site-1-k-64-o}
\label{subfig:undertrained-site-1-k-64}
}\\
\subfloat[][$\kappa_{stable} \simeq 10^{3}$] {
\includegraphics[scale=.5]{figures/web/longtail/undertrained-site-1-k-stable-o}
\label{subfig:undertrained-site-1-k-stable}
}\\
\subfloat[][$\kappa = 8$]{
\includegraphics[scale=.5]{figures/web/longtail/undertrained-site-1-k-8-o}
\label{subfig:undertrained-site-1-k-8}
}\\
\subfloat[][$\kappa = 32$]{
\includegraphics[scale=.5]{figures/web/longtail/undertrained-site-1-k-32-o}
\label{subfig:undertrained-site-1-k-32}
}\\
\caption{Clustering of $\kb$, (a-b), and $\kb^I$, (c-d). Each leaf (a profile) is labeled with the parameter name and sample values observed during training. As $\kappa$ increases, profiles are clustered more accurately.}
\label{fig:dendrograms}
\vspace*{-.5cm}
\end{figure}
To evaluate the accuracy of the clustering phase, we first built a global knowledge base $\kb$ from a collection of well-trained profiles. The profiles were trained on a subset of the data mentioned in Section~\ref{web:intro:eval}; this subset was composed of 603 web applications, 27,990 unique resource paths, 9,023 unique parameters, and 3,444,092 \ac{HTTP}\index{HTTP} requests. The clustering algorithm described in Section~\ref{web:longtail:design:global} was then applied to group profiles according to their parameter type. Sample results from this clustering are shown in Figure~\ref{subfig:undertrained-site-1-k-stable}. Each leaf node corresponds to a profile and displays the parameter name and a few representative sample values corresponding to the parameter.
As the partial dendrogram indicates, the resulting clusters in $\kb$ are accurately clustered by parameter type. For instance, date parameters are grouped into a single hierarchy, while unstructured text strings are grouped into a separate hierarchy.
The following experiment investigates how $\kappa$ affects the quality of the final clustering.
\subsubsection{Profile mapping robustness}
\label{web:longtail:eval:robustness}
Recall that in order to balance the robustness of the mapping $f$
between undertrained profiles and global profiles against the speed
with which undertraining\index{undertraining} can be addressed, it is necessary to select
an appropriate value for $\kappa$. To this end, we generated
undertrained knowledge bases for increasing values of $\kappa=
1,2,4,8,16,32,64$ from the same dataset used to generate $\kb$,
following the procedure outlined in
Section~\ref{web:longtail:design:global}. Partial dendrograms for
various $\kappa$ are presented in
Figure~\ref{subfig:undertrained-site-1-k-8},
\ref{subfig:undertrained-site-1-k-32},
\ref{subfig:undertrained-site-1-k-64}.
At low values of $\kappa$ (e.g.,
Figure~\ref{subfig:undertrained-site-1-k-8}), the clustering process
exhibits non-negligible systemic errors. For instance, the parameter
\texttt{stat} clearly should be clustered as a token set of states,
but instead is grouped with unstructured strings such as cities and
addresses. A more accurate clustering would have dissociated the
token and string profiles into well-separated sub-hierarchies.
As shown in Figure~\ref{subfig:undertrained-site-1-k-32}, larger
values of $\kappa$ lead to more meaningful groupings. Some
inaccuracies are still noticeable, but the clustering of the
sub-hierarchy is significantly better than the one obtained at $\kappa
= 8$. A further improvement in the clusters is shown in
Figure~\ref{subfig:undertrained-site-1-k-64}. At $\kappa = 64$, the
separation between dates and unstructured strings is sharper; except
for one outlier, the two types are recognized as similar and grouped
together in the early stages of the clustering process.
Figure~\ref{fig:robustness} plots the profile mapping robustness
$\rho(\cdot)$ against $\kappa$ for different cuts of the dendrogram,
indicated by $D_{\text{max}}$. $D_{\text{max}}$ is a threshold
representing the maximum distance between two clusters. Basically, for
low $D_{\text{max}}$, the ``cut'' will generate many clusters with a
few elements; on the other hand, for high values of $D_{\text{max}}$
the algorithm will tend to form fewer clusters, each having a larger
number of elements. Note that this parameter admits two different
interpretations: it can indicate either a threshold on the real
distance between clusters, or a ``cut level'' in the dendrogram
constructed by the clustering algorithm. Although both are valid, we
adopt the former.
Figure~\ref{fig:robustness} shows two important properties of our
technique. First, it demonstrates that the robustness is fairly
insensitive to $D_{\text{max}}$. Second, the robustness of the
mapping increases with $\kappa$ until saturation at
$32\leq\kappa\leq64$. This not only confirms the soundness of the
mapping function, but it also provides insights on the appropriate
choice of $\kappa_{min}$ to minimize the delay to global profile
lookup while maximizing the robustness of the mapping.
\begin{figure}[t]
\centering
\includegraphics[width=.9\textwidth]{figures/web/longtail/fig_robustness}
\caption{Plot of profile mapping robustness for varying $\kappa$.}
\label{fig:robustness}
\end{figure}
\subsubsection{Detection accuracy}
\label{web:longtail:eval:detection}
Having studied the effects of profile clustering and varying values
for $\kappa$ upon the robustness of the profile mapping $f$, a
separate experiment was conducted in order to evaluate the detection
accuracy of a web application anomaly detector incorporating $\kb$,
the global knowledge base constructed in the previous experiments. In
particular, the goal of this experiment is to demonstrate that an
anomaly detector equipped with a global knowledge base exhibits an
improved detection accuracy in the presence of training data scarcity.
The data used in this experiment was a subset of the full dataset
described above, and was completely disjoint from the one used to
construct the global knowledge base and its indices. It consisted of
220 unique real-world web applications, 8,402 unique resource paths,
7,648 distinct parameters, and 55,290,532 \ac{HTTP}\index{HTTP}
requests.
The intended threat model is that of an attacker attempting to
compromise the confidentiality or integrity of data exposed by a web
application by injecting malicious code into request parameters.
\begin{note}[Threat model]
Although the anomaly detector used in this study is capable of
detecting more complex session-level anomalies, we restrict the
threat model to request parameter manipulation because we do not
address session profile clustering.
\end{note}
To establish a worst-case bound on the detection accuracy of the
system, profiles for each observed request parameter were deliberately
undertrained to artificially induce a scarcity of training data for
\emph{all} parameters. That is, for each value of
$\kappa=1,2,4,8,16,32,64$, the anomaly detector prematurely terminated
profile training after $\kappa$ samples, and then used the
undertrained profiles to query $\kb$. The resulting global profiles
were then substituted for the undertrained profiles and evaluated
against the rest of the dataset. The sensitivity of the system was
varied over the interval $\left[0,1\right]$, and the resulting
\ac{ROC}\index{ROC} curves for each $\kappa$ are plotted in
Figure~\ref{fig:roc}.
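The experimental procedure just described can be sketched as a runnable toy. The one-number ``profile'' (a sample mean), the knowledge base, and its nearest-neighbor query below are deliberately simplistic stand-ins, not the actual detector or the implementation of $\kb$:

```python
# Toy version of the undertraining experiment: a "profile" is just a sample
# mean, the knowledge base holds fully-trained global profiles, and a query
# returns the closest one. All models here are illustrative stand-ins.

def train_profile(samples):
    """Undertrained profile: the mean of the first kappa observations."""
    return sum(samples) / len(samples)

class KnowledgeBase:
    def __init__(self, global_profiles):
        self.global_profiles = global_profiles

    def query(self, undertrained):
        # Return the global profile closest to the undertrained one.
        return min(self.global_profiles, key=lambda g: abs(g - undertrained))

kb = KnowledgeBase([0.0, 10.0, 50.0])  # fully-trained global profiles

# A parameter whose true behavior is centered near 10, but whose first
# observation happens to be an outlier.
observations = [2.0, 18.0, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9] * 8

for kappa in (1, 2, 4, 8, 16, 32, 64):
    undertrained = train_profile(observations[:kappa])  # stop training early
    print(kappa, kb.query(undertrained))
```

With $\kappa=1$, the outlier first sample causes the wrong global profile (0.0) to be selected; from $\kappa=2$ onward the query stabilizes on the correct one (10.0), mirroring how profile selection quality improves with $\kappa$.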
As one can clearly see, low values of $\kappa$ result in the selection
of global profiles that do not accurately model the behavior of the
undertrained parameters. As $\kappa$ increases, however, the quality
of the global profiles returned by the querying process increases as
well. In particular, this increase in quality closely follows the
mapping robustness plot presented in Figure~\ref{fig:robustness}. As
predicted, setting $\kappa=32,64$ leads to fairly accurate global
profile selection, with the resulting \ac{ROC}\index{ROC} curves
approaching that of fully-trained profiles. This means that even if a
component of a web application has received only a few requests
(e.g., 64), leveraging a global knowledge base makes effective attack
detection possible. Consequently, our approach can improve the
effectiveness of real-world web application firewalls and web
application anomaly detection systems.
Clearly, the detection accuracy will improve as more training samples
(e.g., 128, 256) become available. However, the goal of this
experiment was to evaluate the improvement attainable with a very
limited training set, rather than to determine the maximum detection
accuracy achievable. From these results, we conclude that for
appropriate values of $\kappa$, the use of a global knowledge base can
provide reasonably accurate detection performance even in the presence
of training data scarcity.
\begin{figure}[t]
\centering
\includegraphics[width=.9\textwidth]{figures/web/longtail/fig_roc}
\caption{Global profile ROC curves for varying $\kappa$. In the presence
of severe undertraining ($\kappa \ll \kappa_{\text{stable}}$), the
system is not able to recognize most attacks and also reports
several false positives. However, as $\kappa$ increases, detection
accuracy improves, and approaches that of the well-trained case
($\kappa=\kappa_{\text{stable}}$).}
\label{fig:roc}
\vspace*{-.5cm}
\end{figure}
One concern regarding the substitution of global profiles for local
request parameters is that a global profile trained on another web
application may not detect valid attacks against the undertrained
parameter. Recall, however, that without this technique a
learning-based web application anomaly detector would have no
effective model whatsoever, and the undertrained parameter would
therefore be left unprotected by the detection system (i.e., a zero
true positive rate). Furthermore, the \ac{ROC}\index{ROC} curves
demonstrate that while global profiles are in general not as precise
as locally-trained models, they do provide a significant level of
detection accuracy.
\begin{note}
If global profiles were found to be as accurate as local profiles,
this would constitute an argument against anomaly detection itself.
\end{note}