A Super Simple Java Crawler - zjutzz - IT工程師數位筆記本

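This is the simplest kind of crawler: no proxy server, no cookies, no HTTP connection pool. It issues a single HTTP GET request, and its only goal is to fetch a page's HTML source.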

A crawler that does only this is about as basic as it gets, but it is also the foundation for building more complex ones.

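It uses the HttpClient 4 API. Don't ask me how to stay compatible with all the HttpClient 3 code floating around online; the two versions are not that different, but we should pick the newer interface that still works!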

There are still plenty of details worth paying attention to, such as character encoding (I usually just force UTF-8).
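
To see why forcing UTF-8 matters, here is a minimal sketch that uses only the JDK. The class and method names (EncodingSketch, decodeUtf8, repairLatin1) are made up for illustration and are not part of the crawler. It decodes the same bytes with two different charsets and shows that only the matching one gives back the original text; it also shows how text that was wrongly decoded as ISO-8859-1 can be recovered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the crawler itself.
class EncodingSketch {

    // Decode raw bytes as UTF-8, regardless of the platform default charset.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Repair text that was wrongly decoded as ISO-8859-1 although the bytes were UTF-8.
    // ISO-8859-1 maps every byte to exactly one char, so the round trip loses nothing.
    static String repairLatin1(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters written as Unicode escapes, so the demo
        // does not depend on the encoding of this source file.
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset produces garbled text.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.equals(original));               // false
        System.out.println(decodeUtf8(raw).equals(original));     // true
        System.out.println(repairLatin1(wrong).equals(original)); // true
    }
}
```

The repair only works when the wrong decoding was ISO-8859-1; if the bytes went through a charset that drops or replaces unknown bytes, the original text is gone, which is a good reason to name the charset at the point where bytes first become a String.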

Here comes the code:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Easy {
    
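    // Convert an input stream to a String, decoding the bytes as UTF-8
    public static String inputStream2String(InputStream is)throws IOException{
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        byte[] buffer=new byte[4096];
        int n;
        // read in chunks rather than one byte at a time
        while((n=is.read(buffer))!=-1){
            baos.write(buffer,0,n);
        }
        // name the charset explicitly so the result does not depend on the platform default
        return baos.toString("UTF-8");
    }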

    // Core function that fetches the web page
    public static void doGrab() throws Exception {
        // the HttpClient can be thought of as a simulated browser
        CloseableHttpClient httpclient = HttpClients.createDefault();
        try {
            // URL of the target page
            String targetUrl="http://chriszz.sinaapp.com";
            // request the page with GET; a more complex crawler could use POST instead
            HttpGet httpGet = new HttpGet(targetUrl);
            CloseableHttpResponse response1 = httpclient.execute(httpGet);

            try {
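                String status=response1.getStatusLine().toString();
                // Use the status code to judge whether the request succeeded; 200 means the page was fetched.
                if(response1.getStatusLine().getStatusCode()==200){
                    System.out.println("The page was fetched successfully!");
                }else{
                    // Not a 200: print the full status line to help with diagnosis.
                    System.out.println(status);
                }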
                //System.out.println(response1.getStatusLine());
                HttpEntity entity1 = response1.getEntity();
                // do something useful with the response body
                // and ensure it is fully consumed
                InputStream input=entity1.getContent();

                String rawHtml=inputStream2String(input);
                System.out.println(rawHtml);

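                // Chinese text sometimes comes out garbled. This depends on the encoding configured for your Eclipse Java project,
                // the encoding of this source file, and the encoding of the fetched page.
                // For example, if the text was decoded as ISO-8859-1 but the page is really UTF-8, you can re-decode it via String.getBytes():
                //String html = new String(rawHtml.getBytes("ISO-8859-1"),"UTF-8");// the converted result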

                EntityUtils.consume(entity1);
            } finally {
                response1.close();// remember to close the response
            }
        } finally {
            httpclient.close();// the client has to be closed as well
        }
    }
    
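    /*
     * The simplest Java crawler: fetching the Baidu home page
     * memo:
     * 0. The target is Baidu's home page, which is a single HTML page; to fetch it, set targetUrl in doGrab() to http://www.baidu.com.
     *         (Why request http://www.baidu.com rather than http://www.baidu.com/xxx.html? That is configured on Baidu's side; either way we get back the page containing the HTML.)
     * 1. The HTTP GET method is all we need (more complex crawlers can use POST, set cookies, or even set up an HTTP connection pool; fetching JSON data or images works in much the same way).
     * 2. The code is written against the HttpClient packages (version 4); download the jars and add them to the build path.
     * 3. The code mainly follows the tutorial PDF that ships with HttpClient (http://hc.apache.org/).
     */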
    public static void main(String[] args) throws Exception{
        Easy.doGrab();// doGrab() is declared static for simplicity, so it can be called directly as Easy.doGrab()
    }

}
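
The comment inside doGrab() notes that garbled text depends partly on the encoding of the fetched page. One way to find that encoding is to look at the charset parameter of the response's Content-Type header. Below is a minimal sketch of that idea using only the JDK; CharsetSketch and charsetFrom are hypothetical names, not part of the original Easy class, and the UTF-8 fallback is simply this article's own preference. HttpClient 4 ships its own helpers for this, so treat the sketch as an illustration of the logic rather than the recommended way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original Easy class.
class CharsetSketch {

    // Pick the charset named in a Content-Type value such as "text/html; charset=ISO-8859-1".
    // Falls back to UTF-8 when the value is null, has no charset parameter,
    // or names a charset this JVM does not know.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // case-insensitive match on the parameter name
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // illegal or unsupported charset name: use the fallback below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // Bytes of a tiny ISO-8859-1 body, standing in for a downloaded page.
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = charsetFrom("text/html; charset=ISO-8859-1");

        System.out.println(new String(body, cs).equals("caf\u00e9")); // true
        System.out.println(charsetFrom("text/html"));                  // UTF-8
    }
}
```

In doGrab(), assuming HttpClient 4.x, the header value could be taken from entity1.getContentType(), which may be null when the server sends no Content-Type. Many pages also declare their encoding only in an HTML meta tag, which this sketch does not inspect.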
