CS434a/541a: Pattern Recognition
Prof. Olga Veksler
Lecture 6
Today
Introduction to nonparametric techniques
Basic Issues in Density Estimation
Two Density Estimation Methods
1. Parzen Windows
2. Nearest Neighbors
Non-Parametric Methods
Neither probability distribution nor
discriminant function is known
Happens quite often
All we have is labeled data
[Figure: problems range from "a lot is known" ("easier") to "little is known" ("harder"); here all we have are labeled salmon/bass samples]
Estimate the probability distribution
from the labeled data
NonParametric Techniques: Introduction
In previous lectures we assumed that either
1. someone gives us the density $p(x)$
In pattern recognition applications this never happens
2. someone gives us $p(x|\theta)$
Does happen sometimes, but
we are likely to doubt whether the given $p(x|\theta)$
models the data well
Most parametric densities are unimodal (have a
single local maximum), whereas many practical
problems involve multi-modal densities
NonParametric Techniques: Introduction
Nonparametric procedures can be used with
arbitrary distributions and without any
assumption about the forms of the underlying
densities
There are two types of nonparametric methods:
Parzen windows
Estimate likelihood $p(x|c_j)$
Nearest Neighbors
Bypass likelihood and go directly to posterior estimation
$P(c_j \mid x)$
NonParametric Techniques: Introduction
Nonparametric techniques attempt to estimate the
underlying density functions from the training data
Idea: the more data in a region, the higher the density function is there
[Figure: density $p(x)$ vs. salmon length $x$]
$$\Pr[X \in \mathcal{R}] = \int_{\mathcal{R}} f(x)\,dx$$
(the average of $f(x)$ over $\mathcal{R}$, times the volume of $\mathcal{R}$)
NonParametric Techniques: Introduction
How can we approximate $\Pr[X \in \mathcal{R}_1]$ and $\Pr[X \in \mathcal{R}_2]$?
$$\Pr[X \in \mathcal{R}_1] \approx \frac{6}{20} \qquad \text{and} \qquad \Pr[X \in \mathcal{R}_2] \approx \frac{6}{20}$$
Should the density curves above $\mathcal{R}_1$ and $\mathcal{R}_2$ be equally high?
No, since $\mathcal{R}_1$ is smaller than $\mathcal{R}_2$
[Figure: density $p(x)$ vs. salmon length $x$, with a narrow region $\mathcal{R}_1$ and a wider region $\mathcal{R}_2$]
$$\Pr[X \in \mathcal{R}_1] = \int_{\mathcal{R}_1} f(x)\,dx \approx \int_{\mathcal{R}_2} f(x)\,dx = \Pr[X \in \mathcal{R}_2]$$
To get density, normalize by region size
NonParametric Techniques: Introduction
Assuming $f(x)$ is basically flat inside $\mathcal{R}$:
$$\frac{\#\text{ of samples in }\mathcal{R}}{\text{total }\#\text{ of samples}} \approx \Pr[X \in \mathcal{R}] = \int_{\mathcal{R}} f(y)\,dy \approx f(x) \cdot \text{Volume}(\mathcal{R})$$
Thus, density at a point $x$ inside $\mathcal{R}$ can be approximated:
$$f(x) \approx \frac{\#\text{ of samples in }\mathcal{R}}{\text{total }\#\text{ of samples}} \cdot \frac{1}{\text{Volume}(\mathcal{R})}$$
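To make this concrete, here is a minimal Python sketch of the counting estimate above (NumPy assumed; the Gaussian sample data and the region width are illustrative, not from the slides):

```python
import numpy as np

np.random.seed(0)
samples = np.random.normal(loc=70, scale=5, size=1000)  # e.g. salmon lengths

def density_estimate(x, samples, h):
    """Estimate f(x) by counting samples in the region [x - h/2, x + h/2]."""
    k = np.sum(np.abs(samples - x) <= h / 2)  # samples in the region
    n = len(samples)
    V = h                                     # volume (length) of a 1-d region
    return (k / n) / V

# True N(70, 5) density at x = 70 is about 0.0798
print(density_estimate(70.0, samples, h=2.0))
```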
Now let’s derive this formula more formally
Binomial Random Variable
Let us flip a coin $n$ times (each flip is called a "trial")
Probability of heads is $\rho$, probability of tails is $1-\rho$
The binomial random variable $K$ counts the number of
heads in $n$ trials:
$$P(K = k) = \binom{n}{k} \rho^k (1-\rho)^{n-k}, \qquad \text{where} \quad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
Mean is $E(K) = n\rho$
Variance is $\text{var}(K) = n\rho(1-\rho)$
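A quick simulation (NumPy assumed; the values of $n$ and $\rho$ are arbitrary) confirms the mean and variance formulas:

```python
import numpy as np

np.random.seed(0)
n, rho = 20, 0.3
K = np.random.binomial(n, rho, size=100_000)  # 100,000 draws of K

print(K.mean(), n * rho)             # both close to 6.0
print(K.var(), n * rho * (1 - rho))  # both close to 4.2
```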
Density Estimation: Basic Issues
From the definition of a density function, the probability
$\rho$ that a vector $x$ will fall in region $\mathcal{R}$ is:
$$\rho = \Pr[x \in \mathcal{R}] = \int_{\mathcal{R}} p(x')\,dx'$$
Suppose we have samples $x_1, x_2, \dots, x_n$ drawn from
the distribution $p(x)$. The probability that $k$ points fall
in $\mathcal{R}$ is then given by the binomial distribution:
$$\Pr[K = k] = \binom{n}{k} \rho^k (1-\rho)^{n-k}$$
Suppose that $k$ points fall in $\mathcal{R}$; we can use MLE to
estimate the value of $\rho$. The likelihood function is
$$p(x_1, \dots, x_n \mid \rho) = \binom{n}{k} \rho^k (1-\rho)^{n-k}$$
Density Estimation: Basic Issues
$$p(x_1, \dots, x_n \mid \rho) = \binom{n}{k} \rho^k (1-\rho)^{n-k}$$
This likelihood function is maximized at $\rho = \frac{k}{n}$
Thus the MLE is $\hat{\rho} = \frac{k}{n}$
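To see why (a step not written out on the slide), take the log-likelihood and set its derivative to zero:
$$\frac{d}{d\rho}\left[\log\binom{n}{k} + k\log\rho + (n-k)\log(1-\rho)\right] = \frac{k}{\rho} - \frac{n-k}{1-\rho} = 0 \;\Longrightarrow\; \hat{\rho} = \frac{k}{n}$$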
Assume that $p(x)$ is continuous and that the region $\mathcal{R}$
is so small that $p(x)$ is approximately constant in $\mathcal{R}$:
$$\int_{\mathcal{R}} p(x')\,dx' \cong p(x)\,V$$
where $x$ is in $\mathcal{R}$ and $V$ is the volume of $\mathcal{R}$
Recall from the previous slide:
$$\rho = \int_{\mathcal{R}} p(x')\,dx'$$
Thus $p(x)$ can be approximated:
$$p(x) \approx \frac{k/n}{V}$$
Our estimate will always be the average of the true
density over $\mathcal{R}$
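Spelling that last remark out (an extra step, not on the slide): substituting the MLE $\hat{\rho} = k/n$ for $\rho$ gives
$$\hat{p}(x) = \frac{k/n}{V} \approx \frac{\rho}{V} = \frac{1}{V}\int_{\mathcal{R}} p(x')\,dx',$$
which is exactly the average of $p$ over $\mathcal{R}$.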
Density Estimation: Basic Issues
This is exactly what we had before:
$$p(x) \approx \frac{k/n}{V} = \frac{\hat{\rho}}{V} \approx \frac{1}{V}\int_{\mathcal{R}} p(x')\,dx'$$
where $x$ is inside some region $\mathcal{R}$
$k$ = number of samples inside $\mathcal{R}$
$n$ = total number of samples
$V$ = volume of $\mathcal{R}$
Ideally, $p(x)$ should be constant inside $\mathcal{R}$
Density Estimation: Histogram
$$p(x) \approx \frac{k/n}{V}$$
[Figure: histogram $p(l)$ of $n = 19$ salmon lengths over $[0, 50]$, with bar heights $\frac{10}{190}$, $\frac{6}{190}$, $\frac{1}{190}$ over regions $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$]
$$p(x) = \frac{10/19}{10} \text{ in } \mathcal{R}_1, \qquad p(x) = \frac{6/19}{10} \text{ in } \mathcal{R}_2, \qquad p(x) = \frac{3/19}{30} \text{ in } \mathcal{R}_3$$
If the regions $\mathcal{R}_i$ do not overlap, we have a histogram
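A minimal sketch of this histogram estimate (NumPy assumed; the sample values are made up to reproduce the slide's counts of 10, 6, and 3 per region, and the bin edges follow the slide's axis ticks):

```python
import numpy as np

lengths = np.array([3, 5, 5, 6, 7, 8, 8, 9, 9, 9,  # 10 samples in [0, 10)
                    11, 12, 14, 15, 17, 19,         # 6 samples in [10, 20)
                    25, 33, 48])                    # 3 samples in [20, 50)
edges = np.array([0, 10, 20, 50])

k, _ = np.histogram(lengths, bins=edges)  # samples per region: [10, 6, 3]
n = len(lengths)
V = np.diff(edges)                        # region volumes: [10, 10, 30]
print((k / n) / V)                        # [10/190, 6/190, 1/190]
```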
Density Estimation: Accuracy
How accurate is the density approximation $p(x) \approx \frac{k/n}{V}$?
We have made two approximations:
1. $\hat{\rho} = \frac{k}{n}$
as $n$ increases, this estimate becomes more accurate
2. $\int_{\mathcal{R}} p(x')\,dx' \cong p(x)\,V$
as $\mathcal{R}$ grows smaller, the estimate becomes more accurate
As we shrink $\mathcal{R}$ we have to make sure it
contains samples, otherwise our
estimated $p(x) = 0$ for all $x$ in $\mathcal{R}$
Thus in theory, if we have an unlimited number of
samples, we get convergence as we
simultaneously increase the number of samples $n$
and shrink region $\mathcal{R}$, but not too much, so that $\mathcal{R}$ still
contains a lot of samples
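These requirements can be stated precisely (standard conditions, cf. Duda, Hart, and Stork; not spelled out on the slide). Writing $V_n$ and $k_n$ for the region volume and sample count used when $n$ samples are available, the estimate $\frac{k_n/n}{V_n}$ converges to $p(x)$ provided
$$V_n \to 0, \qquad k_n \to \infty, \qquad \frac{k_n}{n} \to 0 \quad \text{as } n \to \infty$$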
Density Estimation: Accuracy
$$p(x) \approx \frac{k/n}{V}$$
In practice, the number of samples is always fixed
Thus the only available option to increase the
accuracy is by decreasing the size of $\mathcal{R}$
($V$ gets smaller)
If $V$ is too small, $p(x) = 0$ for most $x$, because most
regions will have no samples
Thus we have to find a compromise for $V$:
not too small, so that it contains enough samples
but also not too large, so that $p(x)$ is
approximately constant inside $\mathcal{R}$
Density Estimation: Two Approaches
$$p(x) \approx \frac{k/n}{V}$$
1. Parzen Windows:
Choose a fixed value for volume $V$
and determine the corresponding $k$
from the data
2. k-Nearest Neighbors:
Choose a fixed value for $k$ and
determine the corresponding
volume $V$ from the data
Under appropriate conditions, as the number
of samples goes to infinity, both methods can
be shown to converge to the true $p(x)$
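A minimal 1-d sketch contrasting the two strategies (NumPy assumed; the data, window width $h$, and $k$ are illustrative):

```python
import numpy as np

np.random.seed(0)
samples = np.random.normal(0, 1, size=500)

def parzen_estimate(x, samples, h):
    """Fix the volume V = h; the count k comes from the data."""
    k = np.sum(np.abs(samples - x) <= h / 2)
    return (k / len(samples)) / h

def knn_estimate(x, samples, k):
    """Fix k; the volume V is the smallest interval around x holding k samples."""
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]  # interval radius = distance to the k-th nearest sample
    return (k / len(samples)) / V

# Both should be near the true N(0, 1) peak, about 0.399
print(parzen_estimate(0.0, samples, h=0.5))
print(knn_estimate(0.0, samples, k=25))
```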
Parzen Windows
In the Parzen-window approach to estimating densities we
fix the size and shape of the region $\mathcal{R}$
Let us assume that the region $\mathcal{R}$ is a $d$-dimensional
hypercube with side length $h$; thus its volume is $h^d$
[Figure: hypercubes of side $h$ in 1, 2, and 3 dimensions]
Parzen Windows
To estimate the density at point $x$, simply center the
region $\mathcal{R}$ at $x$, count the number of samples in $\mathcal{R}$,
and substitute everything into our formula:
$$p(x) \approx \frac{k/n}{V}$$
[Figure: $n = 6$ samples, $k = 3$ of which fall in a region of volume $V = 10$ centered at $x$, giving $p(x) \approx \frac{3/6}{10}$]
Parzen Windows
We wish to have an analytic expression for our
approximate density
Let us define a window function:
$$\varphi(u) = \begin{cases} 1 & |u_j| \le \frac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
[Figure: in 1 dimension, $\varphi(u) = 1$ on $[-\frac{1}{2}, \frac{1}{2}]$ and 0 elsewhere; in 2 dimensions, $\varphi$ is 1 inside the unit square centered at the origin and 0 outside]
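A direct transcription of this window function in Python (NumPy assumed), checking all $d$ coordinates of $u$ at once:

```python
import numpy as np

def phi(u):
    """Hypercube window: 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    u = np.atleast_1d(u)
    return 1.0 if np.all(np.abs(u) <= 0.5) else 0.0

print(phi(0.3))          # 1.0: inside the unit interval
print(phi([0.2, -0.4]))  # 1.0: inside the unit square
print(phi([0.2, 0.7]))   # 0.0: second coordinate falls outside
```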
Parzen Windows
Recall we have samples $x_1, x_2, \dots, x_n$. Then
$$\varphi\left(\frac{x - x_i}{h}\right) = \begin{cases} 1 & \left|\frac{(x - x_i)_j}{h}\right| \le \frac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
That is,
$$\varphi\left(\frac{x - x_i}{h}\right) = \begin{cases} 1 & \text{if } x_i \text{ is inside the hypercube with width } h \text{ centered at } x \\ 0 & \text{otherwise} \end{cases}$$
[Figure: a hypercube of width $h$ centered at $x$, with a sample $x_i$ inside it]
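Putting the pieces together (a sketch, not from the slides: it combines $k = \sum_i \varphi\left(\frac{x - x_i}{h}\right)$ with $p(x) \approx \frac{k/n}{V}$ and $V = h^d$; NumPy assumed):

```python
import numpy as np

def phi(u):
    """Hypercube window: 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return 1.0 if np.all(np.abs(u) <= 0.5) else 0.0

def parzen_density(x, samples, h):
    """Parzen estimate p(x) ~= (k/n) / h^d with a hypercube window."""
    x = np.atleast_1d(x)
    n, d = samples.shape
    k = sum(phi((x - x_i) / h) for x_i in samples)  # samples inside the cube at x
    return (k / n) / h**d

np.random.seed(0)
samples = np.random.normal(0, 1, size=(1000, 2))   # 2-d standard normal data
print(parzen_density([0.0, 0.0], samples, h=0.8))  # near true peak 1/(2*pi) ~ 0.159
```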
