Suffix Arrays – A simple Tutorial

Suffix Arrays

It is the data structure to look out for when you are going for string Searching and want to reduce the time complexity greatly.  So, Let us examine and start from basics.

What are Suffix Arrays ?

As simple as it can get , suffix arrays represent the rank of all possible suffixes of a given string .For example ,

String  S = “ABDCASD”

Then the suffixes for the string S (with the numbers in front represent are ,

7 $ *

6 D

5 SD

4 ASD
3 CASD

2 DCASD

1 BDCASD

0 ABDCASD

*$, A null character can be called a suffix of every string as it does no addition to the string itself . We will later on see why we chose to add NULL character to our algorithm . For NULL character that we used, we assume have highest lexicographic priority .

Now,

What happens when we sort those suffixes lexicographically ?

We get the order

Rank

Index from which starting

String

1

7

$

2

0

ABDCASD

3

4

ASD

4

1

BDCASD

5

3

CASD

6

6

D

7

2

DCASD

8

5

SD

So, the Suffix Array in our case will be ,

Suffix Array Index Sorted Suffixes (index of suffixes when sorted in lexicographic order )
1 7
2 0
3 4
4 1
5 3
6 6
7 2
8 5

Now let us take a different example and we introduce the concept of Lowest Common Prefix .

S=”abaabba”

Let us construct the Suffix Array for the above string first

Rank

Index

String

LCP

1

8

$

2

7

a

3

3

aabba

4

1

abaabba

5

4

abba

6

6

ba

7

2

baabba

8

5

bba

Now you see a column of LCP (lowest common prefix ) , what is it ?

So , What is Longest Common Prefix ?

 

It is the length of prefix that is common between the two consecutive suffixes in an sorted suffix array. To simplify , see this way , we have created an suffix array and we have ordered suffixes . We need to find out the number of consecutive characters common in the two suffixes (from starting of suffix).

e.g.

LCP of a and aabba is 1 .

LCP of abaabba and abba is 2 .

We put LCP in first column as 0 , always .

So , we get the following table :

Rank

Index

String

LCP

1

8

$

0

2

7

a

0

3

3

aabba

1

4

1

abaabba

1

5

4

abba

2

6

6

ba

0

7

2

baabba

2

8

5

bba

1

Why would I want to use suffix array ?

 

Well , They are extremely useful string searching , where text is limited and substrings to be searched are not limited . This approach is in contrast to KMP algorithm where there was one substring and you were given continuous stream of characters to search from.

Following is one out of the many important usages of the Suffix Arrays :

Finding Total Number of distinct Substrings.

Well how do we do this ?

We see that , any string , let us say

ABC will give following distinct substring ,

A

B

C

AB

BC

ABC

Which is equivalent to choosing an end point and a start point in a string, and what is left is a simple matter of finding the number of ways  you can choose a start and end point . Which is  nC2 (where n is the string length ) .

Problem happens when we have repeated characters . What will you do then ? that is when suffix arrays come into play .

Now if I am given a string , axyz , and let us say , I ask you how many strings can you make with this string such that they start from the first letter. Then the answer will be 4 ( a , ax, axy, axyz ) .

So , I have a string of length 7 , then I have 7 suffixes (excluding NULL , I am not counting it because it’s length is 0 , does not contribute towards any distinct substring ) . So , going by the above logic , I will 7+6+5+4+3+2+1 distinct strings , which is equivalent to  n+1C2  which was also the result of above discussion .

Unfortunately for us , this result doesn’t hold for the strings with repeated characters.

Take this one

S= “aabaab”

Suffix    Possible strings such that they start from first letter of suffix

b             b

ab           a,ab

aab         a,aa,aab

baab      b,ba,baa,baab

abaab    a,ab,aba,abaa,abaab

aabaab a,aa,aab,aaba,aabaa,aabaab

Now as you see strings a,b,aab,etc  are repeated but we also know that if we sort the above suffixes , those strings which are common , there length is given in Lowest Common Prefix .

So , the total number of distinct Substrings from a given string is given by

n+1C2 – Sum of All LCP

And that is the answer, as simple as that.

Problems Based on the above concept :

http://www.spoj.pl/problems/SUBST1/

http://www.spoj.pl/problems/DISUBSTR/

We will see more about the Suffix array and LCP implementation in the next part.

Please feel free to leave suggestions. Question are encouraged. Do not forget to comment  🙂

Adios ,

Ankur Khurana

Advertisements
Suffix Arrays – A simple Tutorial