1
DATA STRUCTURES AND
ALGORITHMS
ET2100E/ET2105E
Chapter VI: Searching Algorithms
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
3
Outline
1. Searching Problem
2. Algorithms for searching elements using comparison
Sequential searching algorithm
Binary searching algorithm
3. Substring searching
4. Direct searching Algorithms
Hash functions
Collision resolution
4
1. Introduction
Search Problems: 2 basic search problems::
Finding an element in a given sequence of elements:
Searching by comparison
Searching directly from key values
Finding the occurrence of a substring in a given string
Simple search algorithms (also called raw search)
Complex algorithms: Knuth-Morris-Pratt, Boyer-Moore, etc.
5
Finding an element
Basic Problem:
Given a sequence of N numbers: A = (a
0
, a
1
,…, a
N-1
) and a value K (search
key).
find the position of the element with value equal to K.
Search by Comparison
Compare K with each element in the sequence until a result is obtained
(either found or not found
)
Two types of search algorithms
:
Sequential Search
Binary Search
Direct Searching
Using a hash function: helps to directly calculate the address of the
element to be searched from its value
6
2. Search by Comparison
(1) Sequential Searching
Algorithm idea:
To find an element equal to K in a sequence of N numbers: A = (a
0
, a
1
,…, a
N-1
),
compare K with each element in the sequence in turn, until:
Either find the element a
i
= K, then return the position i to be found.
Or compare with all elements in the sequence but still not find it, then return the
result "not found”.
typedef int Sequence[N];
int SeqSearch(Sequence a, int K) {
int i=0;
while (i<N && a[i] != K) i++;
if (i<N) return i; //Found
return -1; //Not Found
}
7
2. Search by Comparison
(1) Sequential Searching
Algorithm Evaluation
Best case: O(1)
Worst case: O(n)
Average case: O(n)
8
2. Search by Comparison
(2) Binary Search
Algorithm Idea: Recursive Algorithm
To find an element equal to K in a sequence of N numbers A = (a
0
, a
1
,…, a
N-1
),
provided that the sequence A is sorted (assuming in ascending order):
Compare K with the element a
m
(m= ).
If K = a
m
found at position m
If K < a
m
Find K in the sequence (a
0
,a
1
,…,a
m-1
)
If K > a
m
Find K in the sequence (a
m+1
,a
m+2
,…,a
N-1
)
Stopping point: when found or when no element remains in the sequence,
return the result "not found".
9
2. Search by Comparison
(2) Binary Search
Algorithm Idea: Recursive Algorithm
int BSearch (Sequence a, int K, int b, int e) {
if (b>e) return -1;
int m = (b+e)/2;
if (K==a[m]) return m;
else
if (K<a[m])
return BSearch(a, K, b, m-1);
else
return BSearch(a, K, m+1,e);
}
int BinarySearch(Sequence a, int K){
return BSearch(a,K,0,a.length-1);
}
10
2. Search by Comparison
(2) Binary Search
Algorithm Evaluation
Best case: O(1)
Worst case: O(log
2
n)
Average case: O(log
2
n)
11
3. Substring searching
Problem:
Given a text V of n characters
(v
0
, v
1
,…, v
n-1
)
and a substring P (called the
pattern) of m characters
(p
0
, p
1
,…, p
m-1
).
The requirement is to find the first occurrence [of all] of P in V.
Example
V: “the rain in spain stays mainly on the plain
P: “ain”
Applications:
In information retrieval
Text editing
Biological data processing (DNA)
Substring search algorithms :
Brute-Force search algorithm
Knuth-Morris-Pratt (KMP) algorithm
12
3. Substring searching
Brute-Force search algorithm (raw search)
Algorithm Idea:
Match the pattern to the text character by character from left to right
When a mismatch is identified, shift the entire pattern one position to
the right and continue matching with the text in the same manner.
Algorithm evaluation: Worst-case scenario: number of comparison
operations: M*N O(n
2
)
int BFSearch(char v[], int N, char p[], int M)
{
if (M>N) return -1;
int i=0,j=0;
while (j<N){
while (i<M && j<N && p[i]==v[i+j])
i++;
if (i==M) return j;
i=0;
j++;
}
return -1;
}
P: p
0
p
1
… p
i
… p
m-1
V: v
0
… v
j
v
j+1
… v
j+i
… … v
n-1
=
=
13
3. Substring searching
Knuth-Morris-Pratt Algorithm
Algorithm Idea:
Improvements from the raw search algorithm
When comparing to the first pair of characters (v
i
,p
j
) where v
i
≠ p
j
(j previous
pairs have matched)
Instead of going back to compare p
0
of the pattern with the character v
i-j+1
continue comparing from a character p
k
in the pattern with the character v
i
that currently does not match in the text
P
V
p
j
p
0
v
i
v
i-j
v
i-j+1
p
k
p
j-k
p
k-1
p
0
p
k-1
p
j-1
continue comparing
No need to go back
14
3. Substring searching
Knuth-Morris-Pratt Algorithm
The position k to be found in sample P satisfies the following conditions:
Maximum possible: k < j, p
k
≠ p
j
;
p
0
p
1
..p
k-1
= v
i-k
…v
i-1
;
Properties of k:
k depends on position j and sample content, not on the text
[p
0
p
1
...p
k-1
] = [p
j-k
p
j-k+1
...p
j-1
]
P
V
p
j
p
0
v
i
v
i-j
v
i-j+1
p
k
p
j-k
p
k-1
p
0
p
k-1
p
j-1
v
i-k
v
i-1
15
3. Substring searching
Knuth-Morris-Pratt Algorithm
k can be calculated before performing the search (even before the
text content is available)
Example
Pattern: 10110111
0<k< j, [p
0
p
1
...p
k-1
] = [p
j-k
p
j-k+1
...p
j-1
]
j 0 1 2 3 4 5 6 7
p
0
..p
j-1
1 10 101 1011---- 10110--- 101101-- 1011011-
k
1
k=1: p
0
≠p
1
1, 2
k=1: p
0
=p
2
k=2: p
0
p
1
≠p
1
p
2
1, 2, 3 1, 2, 3,4 1,2, 3,4,5 1,2,3, 4, 5,6
Matching
sequence
1
1xx-----
xx1-----
1
1xxx----
xxx1----
10
10xxx---
xxx10---
1 or 101
101xxx--
xxx101--
1 or 1011
1011xxx-
xxx1011-
16
3. Substring searching
Comment:
Choose k max
Choose p
k
≠p
j
Pattern 10110111
j 0 1 2 3 4 5 6 7
p
0
..p
j-1
1 10 101 1011---- 10110--- 101101-- 1011011-
k
1 1, 2
1, 2, 3 1, 2, 3,4 1,2, 3,4,5 1,2,3, 4, 5,6k=1: p
0
≠p
1
k=1: p
0
=p
2
k=2: p
0
p
1
≠p
1
p
2
Matching
sequence
1 1 10 1 or 101 1 or 1011
1xx----- 1xxx---- 10xxx--- 101xxx-- 1011xxx-
xx1----- xxx1---- xxx10--- xxx101-- xxx1011-
pj 1 0 1 1 0 1 1 1
pk 0 0 1 0, 1 0, 0
pk≠pj Not satisfied Not satisfied k=1 =>
pk≠pj
pk≠pj
New
matching
seq
1 1 1011
New k 0 0 0 1 0 0 1 4 (max)
17
3. Substring searching
Knuth-Morris-Pratt Algorithm
Finding k algorithm:
Returns array next[]: next[j]=k
j: position in the pattern (0≤ j ≤ M-1)
k: next matching position (0 ≤k<j) (choose the largest possible value of
k)
step 1: k = j-1;
Step 2: Find the largest k such that p[k] ≠ p[j];
Step 3: Compare the two strings
(p[0] p[1] .. p[k-1]) and (p[j-k] p[1] .. p[j-1])
If they are equal then
next[j] = k;
End of algorithm;
Otherwise,
k = k – 1;
Repeat from step 2 of the algorithm;
18
3. Substring searching
Knuth-Morris-Pratt Algorithm
Finding k algorithm:
Setup:
void initNext(char p[], int next[], int M ){
int j,k;
for (j=0; j<M; j++) next[j] = 0;
for (j=2; j<M; j++){
k = j-1;
do {
while (k>0 && p[k]==p[j]) k--;
if (k>0){
int i = 1;
while (i<=k && p[k-i]==p[j-i]) i
++;
if (i>k){
next[j] = k;
break;
}
}
k--;
} while (k>0);
}
}
1: k = j-1;
2: Find the largest k such that p[k] ≠ p[j];
3: Compare the two strings
(p[0] p[1] .. p[k-1]) and (p[j-k] p[1] .. p[j-1])
If they are equal then
next[j] = k;
End of algorithm;
Otherwise,
k = k – 1;
Repeat from step 2 of the algorithm;
19
3. Substring searching
Knuth-Morris-Pratt Algorithm
Setup:
int KMPSearch(char V[], int N, char P[], int M ){
if (N<M) return -1;
int next[M];
initNext(P, next, M);
int i,j; i=j=0;
do {
while (j<M && V[i]==P[j]){i++; j++;}
if (i<=N-M && j<M ) {
if (j==0) i++;
else j = next[j];
}
} while (i<=N-M && j<M) ;
if (j==M) return (i-M);
else return -1;
}
Algorithm evaluation:
In the worst-case scenario, the algorithm only performs M+N character comparison
operations => O(N)
20
4. Direct searching
Problem: Find the element equal to K in the sequence of N
numbers: A = (a
0
, a
1
,…, a
N-1
)
Storage and search principles:
No comparison
Through memory addresses, directly calculated on key values (Search
based on/from key values)
Hash function: a formula to directly calculate the position of the
element to be found through its value
y = h(k): k → y
k D:
the Domain values of key
y P: t
he Domain values of the addresses
Search problem Calculate h(K)

Preview text:

DATA STRUCTURES AND ALGORITHMS ET2100E/ET2105E 1
Chapter VI: Searching Algorithms
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY Outline 1. Searching Problem
2. Algorithms for searching elements using comparison • Sequential searching algorithm • Binary searching algorithm 3. Substring searching 4. Direct searching Algorithms • Hash functions • Col ision resolution 3 1. Introduction
• Search Problems: 2 basic search problems::
• Finding an element in a given sequence of elements: • Searching by comparison
• Searching directly from key values
• Finding the occurrence of a substring in a given string
• Simple search algorithms (also cal ed raw search)
• Complex algorithms: Knuth-Morris-Pratt, Boyer-Moore, etc. 4 Finding an element • Basic Problem:
• Given a sequence of N numbers: A = (a0, a1,…, aN-1) and a value K (search key).
• find the position of the element with value equal to K. • Search by Comparison
• Compare K with each element in the sequence until a result is obtained (either found or not found)
• Two types of search algorithms : • Sequential Search • Binary Search • Direct Searching
• Using a hash function: helps to directly calculate the address of the
element to be searched from its value 5 2. Search by Comparison (1) Sequential Searching • Algorithm idea:
• To find an element equal to K in a sequence of N numbers: A = (a0, a1,…, aN-1),
compare K with each element in the sequence in turn, until:
• Either find the element ai = K, then return the position i to be found.
• Or compare with al elements in the sequence but stil not find it, then return the result "not found”. typedef int Sequence[N];
int SeqSearch(Sequence a, int K) { int i=0; while (iif (ireturn -1; //Not Found } 6 2. Search by Comparison (1) Sequential Searching • Algorithm Evaluation • Best case: O(1) • Worst case: O(n) • Average case: O(n) 7 2. Search by Comparison (2) Binary Search
• Algorithm Idea: Recursive Algorithm
• To find an element equal to K in a sequence of N numbers A = (a0, a1,…, aN-1),
provided that the sequence A is sorted (assuming in ascending order):
• Compare K with the element am (m= ).
• If K = am  found at position m
• If K < am  Find K in the sequence (a0,a1,…,am-1)
• If K > am  Find K in the sequence (am+1,am+2,…,aN-1)
• Stopping point: when found or when no element remains in the sequence, return the result "not found". 8 2. Search by Comparison (2) Binary Search
• Algorithm Idea: Recursive Algorithm
int BSearch (Sequence a, int K, int b, int e) { if (b>e) return -1; int m = (b+e)/2; if (K==a[m]) return m; else
if (Kreturn BSearch(a, K, b, m-1); else return BSearch(a, K, m+1,e); }
int BinarySearch(Sequence a, int K){
return BSearch(a,K,0,a.length-1); } 9 2. Search by Comparison (2) Binary Search • Algorithm Evaluation • Best case: O(1) • Worst case: O(log2n) • Average case: O(log2n) 10 3. Substring searching • Problem:
• Given a text V of n characters (v0, v1,…, vn-1) and a substring P (cal ed the
pattern) of m characters (p0, p1,…, pm-1).
• The requirement is to find the first occurrence [of al ] of P in V. • Example
• V: “the rain in spain stays mainly on the plain” • P: “ain” • Applications: • In information retrieval • Text editing
• Biological data processing (DNA)
• Substring search algorithms :
• Brute-Force search algorithm
• Knuth-Morris-Pratt (KMP) algorithm 11 3. Substring searching
• Brute-Force search algorithm (raw search) • Algorithm Idea:
• Match the pattern to the text character by character from left to right
• When a mismatch is identified, shift the entire pattern one position to
the right and continue matching with the text in the same manner.
int BFSearch(char v[], int N, char p[], int M) { P: p0 p1 … pi … pm-1 = = ≠ if (M>N) return -1; int i=0,j=0; V: v
while (j0 … vj vj+1 … vj+i … … vn-1 while (ii++; if (i==M) return j; i=0; j++; } return -1; }
• Algorithm evaluation: Worst-case scenario: number of comparison operations: M*N O(n2) 12 3. Substring searching
• Knuth-Morris-Pratt Algorithm • Algorithm Idea:
• Improvements from the raw search algorithm
• When comparing to the first pair of characters (vi,pj) where vi ≠ pj (j previous pairs have matched)
• Instead of going back to compare p0 of the pattern with the character vi-j+1
 continue comparing from a character pk in the pattern with the character vi
that currently does not match in the text vi-j vi-j+1 vi V p0 pk-1 pj-kpj-1 pj P p0 p p k-1 k continue comparing No need to go back 13 3. Substring searching
• Knuth-Morris-Pratt Algorithm
• The position k to be found in sample P satisfies the fol owing conditions:
• Maximum possible: k < j, pk≠ pj; • p0p1..pk-1 = vi-k…vi-1; • Properties of k:
• k depends on position j and sample content, not on the text
• [p0p1...pk-1] = [pj-kpj-k+1...pj-1] vi-j vi-j+1 v vi-1 v i-k i V p0 pk-1 pj-kpj-1 pj P p0 p p k-1 k 14 3. Substring searching
• Knuth-Morris-Pratt Algorithm
• k can be calculated before performing the search (even before the text content is available) • Example • Pattern: 10110111 • 0j 0 1 2 3 4 5 6 7 p 1 10 101 1011---- 10110--- 101101-- 1011011- 0..pj-1 k 1 1, 2 1, 2, 3 1, 2, 3,4 1,2, 3,4,5 1,2,3, 4, 5,6 k=1: p0 ≠p1 k=1: p0=p2 k=2: p0p1≠p1p2 Matching ∅ ∅ ∅ 1 1 10 1 or 101 1 or 1011 sequence 1xx----- 1xxx---- 10xxx--- 101xxx-- 1011xxx- xx1----- xxx1---- xxx10--- xxx101-- xxx1011- 15 3. Substring searching • Comment: • Choose k max • Choose pk≠pj Pattern 10110111 j 0 1 2 3 4 5 6 7 p0..pj-1 1 10 101 1011---- 10110--- 101101-- 1011011- 1 1, 2 k k=1: p0 ≠p1 k=1: p0=p2 1, 2, 3 1, 2, 3,4 1,2, 3,4,5 1,2,3, 4, 5,6 k=2: p0p1≠p1p2 Matching ∅ ∅ ∅ 1 1 10 1 or 101 1 or 1011 1xx----- 1xxx---- 10xxx--- 101xxx-- 1011xxx- sequence xx1----- xxx1---- xxx10--- xxx101-- xxx1011- pj 1 0 1 1 0 1 1 1 pk 0 0 1 0, 1 0, 0 pk≠pj
Not satisfied Not satisfied k=1 => pk≠pj pk≠pj New ∅ ∅ ∅ 1 ∅ ∅ 1 1011 matching seq New k 0 0 0 1 0 0 1 4 (max) 16 3. Substring searching
• Knuth-Morris-Pratt Algorithm • Finding k algorithm:
• Returns array next[]: next[j]=k
• j: position in the pattern (0≤ j ≤ M-1)
• k: next matching position (0 ≤kk) step 1: k = j-1;
Step 2: Find the largest k such that p[k] ≠ p[j];
Step 3: Compare the two strings
(p[0] p[1] .. p[k-1]) and (p[j-k] p[1] .. p[j-1]) If they are equal then next[j] = k; End of algorithm; Otherwise, k = k – 1;
Repeat from step 2 of the algorithm; 17 3. Substring searching
• Knuth-Morris-Pratt Algorithm • Finding k algorithm: • Setup:
void initNext(char p[], int next[], int M ){ int j,k; for (j=0; jfor (j=2; jk = j-1; do {
while (k>0 && p[k]==p[j]) k--; if (k>0){ int i = 1; 1: k = j-1;
2: Find the largest k such that p[k] ≠ p[j];
while (i<=k && p[k-i]==p[j-i]) i++; 3: Compare the two strings if (i>k){
(p[0] p[1] .. p[k-1]) and (p[j-k] p[1] .. p[j-1]) If they are equal then next[j] = k; next[j] = k; break; End of algorithm; Otherwise, } k = k – 1;
Repeat from step 2 of the algorithm; } k--; } while (k>0); } } 18 3. Substring searching
• Knuth-Morris-Pratt Algorithm • Setup:
int KMPSearch(char V[], int N, char P[], int M ){ if (Nint next[M]; initNext(P, next, M); int i,j; i=j=0; do {
while (jif (i<=N-M && jif (j==0) i++; else j = next[j]; }
} while (i<=N-M && jif (j==M) return (i-M); else return -1; } • Algorithm evaluation:
• In the worst-case scenario, the algorithm only performs M+N character comparison operations => O(N) 19 4. Direct searching
• Problem: Find the element equal to K in the sequence of N
numbers: A = (a0, a1,…, aN-1)
• Storage and search principles: • No comparison
• Through memory addresses, directly calculated on key values (Search based on/from key values)
• Hash function: a formula to directly calculate the position of the
element to be found through its value • y = h(k): k → y
• k D: the Domain values of key
• y P: the Domain values of the addresses
• Search problem  Calculate h(K) 20