面試題之10億正整數問題--完整解答

作者: cnyao 來源: 博客園發布時間: 2009-11-26 21:39 閱讀: 2190 次推薦: 0 原文鏈接 [收藏]

關于這個問題，經過這么久的討論，兩篇文章及大家的回復，已經比較很清楚了。這里就來完整的整理一下解答。其實本來已經整理得差不多了，不過很不幸，電腦忽然罷工，怎么也啟動不了，然后又感冒了，所以一直到現在才開始做這個解答。

好了，不說這個了。下面進入正題。

這個題目來源于某公司的面試題，是absolute同學在我的“面試題收集貼”中提出的，之后CMGS同學在回復中提到，騰訊今年的面試題中有類似題目，問題規模擴大了10倍，但是本質相同。下面我們來看一下題目：

10億個正整數，只有其中1個數重復出現過，要在O(n)的時間里面找出這個數，內存要盡可能少（小于100M）。

我們首先不急著來解決這個問題，而是從其來源來慢慢看。這里提到的題目與面試題有一些不同（在面試題中是要求找重復的數，而這里的題目是對其進行排序），所以不要感覺疑惑。

[來源]

這種類型的題目，其來源為《編程珠璣》這本書，這里推薦這本書一下，但是由于這本書比較薄，所以里面涉及到的知識只能簡單提到，但是并不能夠完全覆蓋，所以讀這本書還是要經歷將其讀厚的過程，對涉及到的知識點，通過其它參考書來得到關于其的全部知識。另外，盡管這本書的翻譯還不錯，但是還是推薦下載或得到其英文版進行參考，對于中文版中感覺了解不是很清楚的地方，再來對照英文版看看。

該題的原型是在《編程珠璣》的開頭，題目在“面試題之10億正整樹問題”中我已經詳細介紹過了，而且園子中應該大部分兄弟都有這本書吧，我就不詳細將書中內容重新打一遍了。

大概介紹一下，作者通過和程序員交談，了解到程序員需要在一個大系統中實現一個電話號碼的文本數據庫，讀入電話號碼，輸出排序好的以800開頭的文件。

通過作者的整理，得到精確的問題陳述如下

輸入：

所輸入的是一個文件，至多包含n個正整數，每個正整數都要小于n，這里的n為10^7。如果輸入時某一個整數出現了兩次，就會產生一個致命的錯誤。這些整數與其他任何數據都不關聯。

輸出：

以增序形式輸出經過排序的整數列表。（這里應該補充一下是文件形式嗎？）

約束：

至多（大概）只有1MB的可用主存，但是可用磁盤空間非常充足。運行時間至多只允許幾分鐘，最適宜的時間大概為10秒鐘。

[解答]

從上面的這些描述來看一下，題1為10億正整數的問題，題2為《編程珠璣》中的問題。

題1中問題規模為10億，也就是10^9，內存要求為小于100M，問題要求是找出重復的數；

題2中問題規模為10^7，而內存要求為1M左右，問題要求是對這些數進行排序。

同時題1中要求算法的時間復雜度為O(n)。

這里我們就來看一下《編程珠璣》中是怎樣來解決這個問題的。

《編程珠璣》中使用了三種方法來對其進行解決。一種為Merge Sort，一種為Multi-pass Sort，還有一種為作者起名的Wonder Sort。這里我們會一個個分析一下這些算法，包括時間復雜度和空間復雜度，同時在后面還會有實例的說明。

Merge Sort

Merge Sort大家應該很熟悉，是慣常使用的一種外排序方法，其主要思想是采用了分而治之的思想，該思想被應用于多個算法領域。其中文稱為歸并排序，在大部分的算法書中都會提到，同時也是外排序中經常使用到的排序方法。

歸并排序可以通過迭代和遞歸兩種方式來實現。

這些實現在殷人昆的那本黃書中可以找到（大家應該知道我說的是哪本書吧），但是他的那本書的實現有些不太好，使用了datalist和staticlinklist這兩個與歸并排序本身無關的數據結構，而這里我們用來示例的話只需要int的數組即可。（另外加上文件讀寫操作，針對題目的話），所以這里我們僅僅參考一部分黃書中的實現，來自己實現歸并排序方法。

歸并排序可以使用多種方法來進行實現，這里只談常規方式下的歸并排序，在這種情況下，我們需要和被排序數組同樣大小的一個額外空間來輔助進行排序。

1 #include <iostream>
2 using namespace std;
3
4 void merge(int initlist[], int mergedlist[],int l,int m, int n)
5 {
6     int i=l,j=m+1,k=l;
7     while(i<=m && j<=n)
8     {
9         if(initlist[i]<=initlist[j])
10         {
11             mergedlist[k]=initlist[i];
12             i++;
13             k++;
14         }
15         else
16         {
17             mergedlist[k]=initlist[j];
18             j++;
19             k++;
20         }
21     }
22     if(i<=m)
23     {
24         for(int n1=k,n2=i;n1<=n&&n2<=m;n1++,n2++)
25         {
26             mergedlist[n1]=initlist[n2];
27         }
28     }
29     else
30     {
31         for(int n1=k,n2=j;n1<=n&&n2<=n;n1++,n2++)
32         {
33             mergedlist[n1]=initlist[n2];
34         }
35     }
36 }
37
38 void mergepass(int initlist[], int mergedlist[], const int len,const int listlen)
39 {
40     int i=0;
41     while(i+2*len<=listlen-1)
42     {
43         merge(initlist,mergedlist,i,i+len-1,i+2*len-1);
44         i+=2*len;
45     }
46     if(i+len<=listlen-1)
47     {
48         merge(initlist,mergedlist,i,i+len-1,listlen-1);
49     }
50     else
51     {
52         for(int j=i;j<=listlen-1;j++)
53         {
54             mergedlist[j]=initlist[j];
55         }
56     }
57 }
58
59 void mergesort(int list[], int listlen)
60 {
61     int *templist=new int[listlen];
62     int len=1;
63     while(len<listlen-1)
64     {
65         mergepass(list,templist,len,listlen);
66         len*=2;
67         mergepass(templist,list,len,listlen);
68         len*=2;
69     }
70     delete []templist;
71 }
72
73 int main()
74 {
75     int initlist[]={21,25,49,25,93,62,72,8,37,16,54};
76
77     int i;
78     for(i=0;i<sizeof(initlist)/sizeof(int);i++)
79         cout<<initlist[i]<<" ";
80     cout<<endl;
81
82     mergesort(initlist,sizeof(initlist)/sizeof(int));
83     for(i=0;i<sizeof(initlist)/sizeof(int);i++)
84         cout<<initlist[i]<<" ";
85     cout<<endl;
86
87
88     return 0;
89 }

歸并排序的時間復雜度和空間復雜度:

時間復雜度為O(nlgn)，空間復雜度為O(n)。實例說明見后。

Multi-pass Sort

在中文翻譯中，稱其為多通道排序。但是從其描述來看，似乎是多趟排序更為確切。

補充：eaglet提到所謂多通道排序的原型就是B+樹。對于B+樹沒有具體研究過，不過multipass在算法書中似乎沒有查找到，倒是B+樹會出現，而且看描述似乎的確是一樣的。

偽碼(using tmp file)：

1 while(not the endof file)
2    read one record from input file      (I/O Operation)
3    if(record in the region)
4        put the record into memory
5    else
6        write the record into the file(s).  (I/O Operation)
7    end if
8 end while
9
10  sort the records in the memory using some inner sort method.
11  write the sorted records into the output file.  (I/O Operation)
12 while(there are tmp files left)
13    dump one tmp file (begin from the smallest) to memory
14    sort the records in the memory using some inner sort method.
15    write the sorted records into the output file.  (I/O Operation)
16 end while
17 DONE.

偽碼(not using tmp file)：

1 for(region++)
2     while(not the endof file)
3    read one record from input file      (I/O Operation)
4      if(record in the region)
5    put the record into memory
6    end if
7     end while
8    sort the records in the memory using some inner sort method.
9     write the sorted records into the output file.  (I/O Operation)
10 end for
11 DONE

多趟排序的時間復雜度和空間復雜度

可以看到，通過存儲到硬盤文件，我們可以控制使用空間的大小，但是這樣同樣也就加大了I/O讀寫的次數，而大家都很清楚，I/O操作的效率遠遠低于內存操作，這樣在整個耗時方面，I/O讀寫次數越多的，其實際運行時間就越長。要提高算法的效率，就需要減少I/O操作的次數。

實例說明見后。

Wonder Sort

OK，現在就進入最精彩的部分，也是這道題目最希望的解法。還記得張柏芝在有一部電影中叫wonderful，很喜歡這個名字，看來作者也很喜歡這個名字哈~

在上面讀入文件記錄的時候，我們可以使用string, int等類型來表示我們所讀入的這一條記錄的具體值。而對于32位的機器，其int值為32位，也就是4個byte。而1M空間則有1024*1024=1048576個字節，一共能夠表達的int值為262144。而對于10的7次方也就是1千萬這樣的數，我們除下來得到38.14697265625也就是差不多要做40次。

而Wonder Sort中使用位圖來做，以每一位是0還是1表示數是否存在，這樣1M空間共有8388608個位，也就能表示大概800萬個數。這里我們暫時考慮不存在一個對內存的嚴格限制。所以如果要表示1000萬，我們需要的內存空間大概是1.25M。

偽碼：

1 setup a bit-map which have 1000,0000 bits. And set all bits to zero.
2 while(not end of the input file)
3     read one record from input file
4     if(correspond bit in the bit-map is 0)
5     change the correspond bit in the bit-map to 1. (bit-map[record]=1)
6     else
7    //if we want to sort, just do nothing.
8    //if we want to find the num that came twice, just print it out and break.
9     end if
10 end while
11
12 write the records into the output file.  (I/O Operation)
13

精彩排序的時間復雜度和空間復雜度

空間復雜度肯定是最低的，而時間復雜度為O(n)。實例見后。

【在該問題上面的擴展】

上面三個解法中，當然是Wonder Sort最好。但是作者也提到了，使用Wonder Sort時，我們需要1.25M的空間（加上其他一些操作，還要更多些，不過最大的需求肯定是1.25M的用來進行位圖法的區域），但是如果嚴格要求使用1M空間的話，我們怎么做？

從上面解法的描述來看，我們可以將第二種解法與第三種解法結合起來，這樣就可以解決。

偽碼：

1 setup a bit-map which have 800,0000 bits. And set all bits to zero.
2 while(not the endof file)
3    read one record from input file      (I/O Operation)
4    if(record in the region, which means 0 – 800,0000)
5        if(correspond bit in the bit-map is 0)
6   change the correspond bit in the bit-map to 1. (bit-map[record]=1)
7   else
8            //if we want to sort, just do nothing.
9            //if we want to find the num that came twice, just print it out and break.
10   end if
11   else
12    write the record into the file(s).  (I/O Operation)
13   end if
14 end while
15
16 write the records into the output file.  (I/O Operation)
17
18 //dump the tmp file (800,0000 – 1000,0000) to memory
19 while(not the endof file)
20     if(correspond bit in the bit-map is 0)
21     change the correspond bit in the bit-map to 1. (bit-map[record]=1)
22     else
23    //if we want to sort, just do nothing.
24    //if we want to find the num that came twice, just print it out and break.
25     end if
26 end while
27 write the records into the output file.  (I/O Operation)
28
29 DONE.
30

下面就是實例了，在做實例之前，我們有一個問題，就是如何生成題目中要求的測試數據？

測試數據的生成，必須比較隨機，我們需要生成范圍從0-10000000的數，這里我生成為8000000個數。同時這些數要求不重復。

[回答]

在《編程珠璣》的第12章中給出了回答，這里我直接寫出代碼。我使用這個生成了一個random.txt文件，一共67.8M。

測試文件生成代碼：

1 #include <iostream>
2 #include <fstream>
3 using namespace std;
4
5 #include <ctime>
6 //total use 126 seconds
7
8 ofstream ofs("random.txt");
9
10 void myswap(int &a,int &b)
11 {
12     int tmp=a;
13     a=b;
14     b=tmp;
15 }
16
17 int bigrand()
18 {
19     return RAND_MAX*rand()+rand();
20 }
21
22 int randint(int l,int u)
23 {
24     return l+bigrand()%(u-l+1);
25 }
26
27 void generate(int x[],int k,int n)
28 {
29     int i=0;
30     for(i=0;i<n;i++)
31     {
32         x[i]=i;
33     }
34     for(i=0;i<k;i++)
35     {
36         //myswap(i,randint(i,n-1));
37         int j=randint(i,n-1);
38         int t=x[i];
39         x[i]=x[j];
40         x[j]=t;
41         ofs<<x[i]<<endl;
42     }
43 }
44
45 int main()
46 {
47     clock_t Start, Finish;
48     Start = clock();
49     int *x=new int[10000000];
50     generate(x,8000000,10000000);
51     //generate(x,100000,1000000);
52     delete []x;
53     Finish=clock();
54     int second=double(Finish-Start)/CLOCKS_PER_SEC;
55     cout<<"total use "<<second<<" seconds"<<endl;
56     return 0;
57 }

該代碼生成八百萬的測試數據共用時126秒，機器配置為1G內存，P4處理器。這里，能否更快生成測試數據，當然，要保證隨機性和正確性。

問題：

這里我們得到程序的時間是使用clock函數，那如何得到程序占用的內存呢？

還有一個問題就是，我們如何像linux中那樣得到用戶時間和系統調用時間？

這個還不清楚，這里提出來，大家有知道的可以共享一下。

在測試文件生成好之后，我們就可以來實際的看一下這3種方法所具體使用的時間了。

因為是使用debug版來測試，同時還有其他程序運行，所以可能不是十分精準。但是都是在同一臺機器上面測試，所以數量級上面還是可以參考的。

結果：

使用merge sort來排序800萬的測試數據，我們共用了337.7秒

使用multipass sort來排序800萬的測試數據，我們共用了581.67秒

使用wonder sort時，居然也用了很長時間，共為334.171秒。稍微換了一下，將函數調用去掉一層，最后還是得到total time is 333.328 seconds。

這和我們的預期相差還是比較大的。到底是什么原因？

是否是I/O操作歷時比較長的原因？

單純I/O操作total time is 354.343seconds。

下面列出我使用的代碼

1. merge sort

  1 //total running time is 337.751 seconds
  2
  3 #include <iostream>
  4 #include <fstream>
  5 using namespace std;
  6
  7 #include <ctime>
  8
  9 ifstream randfile("random.txt");
10 ofstream sortedfile("sorted.txt");
11
12 const int NUMS=8000000;
13
14 void merge(int initlist[], int mergedlist[],int l,int m, int n)
15 {
16     int i=l,j=m+1,k=l;
17     while(i<=m && j<=n)
18     {
19         if(initlist[i]<=initlist[j])
20         {
21             mergedlist[k]=initlist[i];
22             i++;
23             k++;
24         }
25         else
26         {
27             mergedlist[k]=initlist[j];
28             j++;
29             k++;
30         }
31     }
32     if(i<=m)
33     {
34         for(int n1=k,n2=i;n1<=n&&n2<=m;n1++,n2++)
35         {
36             mergedlist[n1]=initlist[n2];
37         }
38     }
39     else
40     {
41         for(int n1=k,n2=j;n1<=n&&n2<=n;n1++,n2++)
42         {
43             mergedlist[n1]=initlist[n2];
44         }
45     }
46 }
47
48 void mergepass(int initlist[], int mergedlist[], const int len,const int listlen)
49 {
50     int i=0;
51     while(i+2*len<=listlen-1)
52     {
53         merge(initlist,mergedlist,i,i+len-1,i+2*len-1);
54         i+=2*len;
55     }
56     if(i+len<=listlen-1)
57     {
58         merge(initlist,mergedlist,i,i+len-1,listlen-1);
59     }
60     else
61     {
62         for(int j=i;j<=listlen-1;j++)
63         {
64             mergedlist[j]=initlist[j];
65         }
66     }
67 }
68
69 void mergesort(int list[], int listlen)
70 {
71     int *templist=new int[listlen];
72     int len=1;
73     while(len<listlen-1)
74     {
75         mergepass(list,templist,len,listlen);
76         len*=2;
77         mergepass(templist,list,len,listlen);
78         len*=2;
79     }
80     delete []templist;
81 }
82
83 int main()
84 {
85     clock_t start,finish;
86     start=clock();
87     int *initlist=new int[NUMS];
88
89     int i;
90     for(i=0;i<NUMS;i++)
91         randfile>>initlist[i];
92
93     mergesort(initlist,NUMS);
94     for(i=0;i<NUMS;i++)
95         sortedfile<<initlist[i]<<endl;
96
97     delete []initlist;
98
99     finish=clock();
100     double seconds=(double)(finish-start)/CLOCKS_PER_SEC;
101     cout<<"total running time is "<<seconds<<" seconds"<<endl;
102     return 0;
103 }

2. multipass Sort

1 //total time is 581.67 seconds
2
3 #include <iostream>
4 #include <fstream>
5 using namespace std;
6
7 #include <ctime>
8
9 ifstream randfile("random.txt");
10 ofstream sortedfile("sorted.txt");
11 ifstream helpfileout;
12 ofstream helpfile("tmp.txt");
13
14 int cmp(const void *a, const void *b)
15 {
16     return *(int *)a - *(int *)b;
17 }
18
19 void multipassSort(int x[])
20 {
21     int record;
22     int i=0;
23     int j;
24     while(randfile>>record)
25     {
26         if(record<4000000)
27             x[i++]=record;
28         else
29             helpfile<<record<<endl;
30     }
31     qsort(x,i,sizeof(int),cmp);
32     for(j=0;j<i;j++)
33         sortedfile<<x[j]<<endl;
34     i=0;
35     helpfile.close();
36     helpfileout.open("tmp.txt");
37     while(helpfileout>>record)
38     {
39         x[i++]=record;
40     }
41     qsort(x,i,sizeof(int),cmp);
42     for(j=0;j<i;j++)
43         sortedfile<<x[j]<<endl;
44 }
45
46 int main()
47 {
48     clock_t start,finish;
49     start=clock();
50     int *x=new int[8000000];
51     multipassSort(x);
52     delete []x;
53
54     finish=clock();
55     double secs=(double)(finish-start)/CLOCKS_PER_SEC;
56     cout<<"total time is "<<secs<<" seconds"<<endl;
57     return 0;
58 }
59

3. wonder Sort

1 #include <iostream>
2 #include <fstream>
3 using namespace std;
4
5 #include <ctime>
6
7 ifstream randfile("random.txt");
8 ofstream sortedfile("sorted.txt");
9
10 const int NUMS=8000000;
11
12 const int BITSPERINT=32;
13 const int SHIFT=5;
14 const int MASK=0x1f;
15
16 void seti(int x[],int i)
17 {
18     x[i>>SHIFT] |= (1<<(i & MASK));
19 }
20
21 void clri(int x[],int i)
22 {
23     x[i>>SHIFT] &= ~(1<<(i & MASK));
24 }
25
26 int test(int x[],int i)
27 {
28     return x[i>>SHIFT] & (1<<(i & MASK));
29 }
30
31 void wonderSort(int x[])
32 {
33     int tmp;
34     int i=0;
35     for(i=0;i<NUMS;i++)
36     {
37         randfile>>tmp;
38         if(test(x,tmp)==0)
39         {
40             seti(x,tmp);
41         }
42     }
43     for(i=0;i<10000000;i++)
44     {
45         if(test(x,i)!=0)
46         {
47             sortedfile<<i<<endl;
48         }
49     }
50 }
51
52 int main()
53 {
54     clock_t start,finish;
55     start=clock();
56     int *x=new int[10000000/BITSPERINT];
57     memset(x,0,10000000/BITSPERINT*sizeof(int));
58
59     wonderSort(x);
60     delete[] x;
61     finish=clock();
62     double seconds=(double)(finish-start)/CLOCKS_PER_SEC;
63     cout<<"total use time is "<<seconds<<" seconds"<<endl;
64
65     return 0;
66 }

4. 單純讀寫文件

1 #include <iostream>
2 #include <fstream>
3 using namespace std;
4
5 #include <ctime>
6
7 ifstream randfile("random.txt");
8 ofstream outfile("out.txt");
9
10 #define NUMS 8000000
11
12 int main()
13 {
14     clock_t start,finish;
15     start=clock();
16     int i;
17     int tmp;
18     for(i=0;i<NUMS;i++)
19     {
20         randfile>>tmp;
21         outfile<<tmp;
22     }
23     finish=clock();
24     double secs=(double)(finish-start)/CLOCKS_PER_SEC;
25     cout<<"total time is "<<secs<<"seconds"<<endl;
26
27     return 0;
28 }

如果不使用文件讀寫，就可以看出算法的效率來了，但是如何做到呢？

排序部分結束，然后回到面試題，如果只是來查找是否有重復數的話，是否有其他解法，如何做？

在上次的回復中，winter-cn提到了字典樹和哈希表的解決方案。

字典樹的確是一個好辦法，而且僅僅是對于查找，如果是排序，字典樹就不合適了。

而哈希還沒有想到好的哈希函數，具體也沒有細想，但是應該也是可以的。

【在其基礎上衍生的面試問題】

其實上面的內容都解決了的話，那一開始的面試題也就解決了，因為其盡管問題規模變大了，但是其能夠使用的內存也一樣變大了，其基本方法還是一樣的。

現在很多公司，因為其篩選人員的目的，同時也由于其工作是處理海量數據，所以在面試的時候，會出一些這樣的關于海量數據處理的問題，但是我們很多人，包括我，都一般不會接觸到這樣海量的數據，所以，適量的減小問題的規模，但是同時將其內存限制變得更加嚴格，其實其本質還是一樣的。

【TODO – 海量數據處理的話題】

這里的話題可以歸入“海量數據處理”，關于海量數據處理，現在有很多公司的面試題會提相關的問題，如果沒有對這個話題思考過的話，是不太可能會有很好的答案的，關于這個話題，以后還可以找到其他的問題來進行討論，當問題規模不大的時候，體現不出算法的優勢，但是當問題的規模到達一定數量級的時候，O(n)或者O(lgn)的算法的優勢就能夠體現出來了。

因為內容比較多，整理也花了一些時間，還有編寫代碼，如果其中有錯誤，歡迎大家指出，本來想分為幾篇來寫的，最后還是一下子寫成一篇寫完算了，篇幅就比較長，而且代碼也較多，多謝大家能夠看到最后~