Saturday, 24 January 2015

Suffix Arrays (learning material)



http://www.quora.com/Given-a-string-how-do-I-find-the-number-of-distinct-substrings-of-the-string
(used)

http://web.stanford.edu/class/cs97si/suffix-array.pdf


https://greasepalm.wordpress.com/2012/07/01/suffix-arrays-a-simple-tutorial/

http://www.quora.com/How-do-I-find-the-total-number-of-different-palindromes-of-length-k-in-a-given-string-using-suffix-array (used)

http://www.quora.com/If-two-strings-have-more-than-one-longest-common-subsequence-considering-length-what-is-the-algorithm-to-find-the-lexicographically-smallest-longest-common-subsequence-among-them

http://www.roman10.net/suffix-array-part-3-longest-common-substring-lcs/ (read imp)

question link

http://www.spoj.com/problems/DISUBSTR/
http://www.spoj.com/problems/SARRAY/ (this is for understanding the complexity of ur designed suffix array implimentation)

suffix array standard code( time complexity o(n log^2 n ) using bucket sort
https://gist.github.com/calmhandtitan/8119030

String hashing

can be used instead of suffix array and all (in some cases)
tutorial by lalit kundu

http://threads-iiith.quora.com/String-Hashing-for-competitive-programming


code of DISUBSTR

as per the instructions of quora (implementation of suffix array) , first link
time complexity( n^2 logn)

#include<stdio.h>
#include<string.h>
#include<algorithm>
#include<iostream>
typedef long long ll;
using namespace std;
int L;

int su[1005];
char arr[1005];

int mpp(int a,int b)
{

    while(arr[a]==arr[b] && a<L && b<L){a++;b++;}
    if(arr[a]==NULL || arr[b]==NULL)
    {
        a--;b--;
    }

        if(arr[a]!=arr[b])
    {
        return arr[a]<arr[b];
    }
    else// the one which will be shorter all be being equal in all respect will come first
    {
        return a>b;
    }
}

int str(int l)
{
    for(int i=0;i<=l;i++)
        su[i]=i;

    sort(su,su+l,mpp);

}

int main()
{
    int t;

       // freopen("kr.in","r",stdin);
    scanf("%d",&t);
    while(t--)
    {
        scanf("%s",arr);

    ll l=strlen(arr);
    L=l;
    str(l);
        ll s=l-su[0];
    for(int i=0;i<l-1;i++)
    {
        ll p=l-su[i+1];
        ll co=0;
        ll a=su[i];
        ll b=su[i+1];
        while(arr[a]==arr[b])
        {
            co++;
            a++;
            b++;
        }
        s+=p-co;
    }
    printf("%lld\n",s);
    }
}


explaination

This is one of the problems in SPOJ (Sphere Online Judge (SPOJ))

The solution consists of constructing the suffix array and then finding the number of distinct substrings based on the Longest Common Prefixes.

One key observation here is that:

If you look through the prefixes of each suffix of a string, you have covered all substrings of that string.


Let us take an example: BANANA

Suffixes are:
0) BANANA
1) ANANA
2) NANA
3) ANA
4) NA
5) A

It would be a lot easier to go through the prefixes if we sort the above set of suffixes, as we can skip the repeated prefixes easily.

Sorted set of suffixes:
5) A
3) ANA
1) ANANA
0) BANANA
4) NA
2) NANA

From now on, 

LCP = Longest Common Prefix of 2 strings.

Initialize

ans = length(first suffix) = length("A") = 1.


Now consider the consecutive pairs of suffixes, i.e, [A, ANA], [ANA, ANANA], [ANANA, BANANA], etc. from the above set of sorted suffixes.

We can see that,
LCP("A", "ANA") = "A".


All characters that are not part of the common prefix contribute to a distinct substring. In the above case, they are 'N' and 'A'. So they should be added toans.

So we have, 
1
2
ans += length("ANA") - LCP("A", "ANA") 
ans = ans + 3 - 1 = ans + 2 = 3


Do the same for the next pair of consecutive suffixes: ["ANA", "ANANA"]

1
2
3
4
LCP("ANA", "ANANA") = "ANA".
ans += length("ANANA") - length(LCP)
=> ans = ans + 5 - 3
=> ans = 3 + 2 = 5.


Similarly, we have:

1
2
LCP("ANANA", "BANANA") = 0
ans = ans + length("BANANA") - 0 = 11


1
2
LCP("BANANA", "NA") = 0
ans = ans + length("NA") - 0 = 13


1
2
LCP("NA", "NANA") = 2
ans = ans + length("NANA") - 2 = 15


Hence the number of distinct substrings for the string "BANANA" = 15.




lcp computation for lcs (longest common string etc)

 for (i = 0; i < len1 + len2 - 1; ++i) {
        if ((ap[i] - cstr >= len1) && (ap[i+1] - cstr >= len1)) {
            //both starts with suffix of second string
            continue;
        } else if ((ap[i] - cstr < len1) && (ap[i+1] - cstr < len1)) {
            //both starts with suffix of first string
            continue;
        } else {
            lcplen = lcp(ap[i], ap[i+1]);
            if (lcplen > lcslen) {
                lcslen = lcplen;
                lcssufpos = i;
            }
        }
we generally merge two strings s1 and s2 and then form the suffix array and do the lcp computation , 
in then lcp computation we needed to ensure that , the consecutive suffixes that we are comparing dont belong to the same string . and then do the needful. 
the above part of code explains how it is to be performed