Yesterday, something happened that left me very frustrated. The database of my personal site, 业余草, failed, and nearly 100 articles were lost.
The site's database is only backed up once a month, so the articles from last month, September, are currently all gone.
From what I know of search engines, I figured Google's page cache might still hold some of them, so I fetched the cached pages over https to recover what I could. That brings me to the focus of this article: how do you fetch https page content with HttpsClient?
Crawler frameworks such as jsoup generally do not handle https very gracefully, so here I rely on an HttpsClient utility class instead.
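For context, the failure mode with plain jsoup looks like the following (a minimal sketch, assuming jsoup is on the classpath; the URL is just an example). jsoup delegates certificate checks to the JDK's default trust store, so a site whose certificate chain the JDK cannot validate makes the call throw an SSLHandshakeException:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupHttpsDemo {
    public static void main(String[] args) {
        try {
            // Fails with javax.net.ssl.SSLHandshakeException when the JDK
            // cannot build a valid certificate chain for the target site.
            Document doc = Jsoup.connect("https://www.xttblog.com").get();
            System.out.println(doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}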
Note: if you run my example and fetching an https URL fails with "unable to find valid certification path to requested target" or a "peer not authenticated" exception, the likely cause is JDK 1.6; try JDK 1.7 instead. If the error persists, re-wrap the HttpClient used for fetching, which is exactly what the HttpsClient class below does.
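If you would rather not disable certificate validation entirely, an alternative is to build an SSLContext from a trust store that already contains the target site's certificate. A sketch using only standard JDK APIs (the trust store path and password are hypothetical):

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class CustomTrustStore {
    public static SSLContext fromJks(String path, char[] password) throws Exception {
        // Load a JKS trust store that already contains the site's certificate;
        // "truststore.jks" would be prepared beforehand with keytool.
        KeyStore ks = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream(path)) {
            ks.load(in, password);
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}

The resulting SSLContext can stand in for the all-trusting ctx built in the HttpsClient class below.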
Now let's move on to the code.
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import org.apache.http.client.HttpClient;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;

// 业余草: www.xttblog.com
public class HttpsClient {

    /**
     * Wraps an existing HttpClient so that it accepts any https certificate.
     * This disables certificate and hostname verification entirely, which is
     * fine for a one-off recovery crawl but unsafe for production use.
     */
    public static DefaultHttpClient getNewHttpsClient(HttpClient httpClient) {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            // A trust manager that accepts every certificate chain.
            X509TrustManager tm = new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                public void checkClientTrusted(X509Certificate[] chain, String authType)
                        throws CertificateException {
                }
                public void checkServerTrusted(X509Certificate[] chain, String authType)
                        throws CertificateException {
                }
            };
            ctx.init(null, new TrustManager[] { tm }, null);
            // Also skip hostname verification.
            SSLSocketFactory ssf =
                    new SSLSocketFactory(ctx, SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
            SchemeRegistry registry = new SchemeRegistry();
            registry.register(new Scheme("https", 443, ssf));
            ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager(registry);
            // Keep the parameters of the original client.
            return new DefaultHttpClient(mgr, httpClient.getParams());
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }
}
Before fetching, re-wrap the httpClient instance: httpClient = HttpsClient.getNewHttpsClient(httpClient);
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// 业余草: www.xttblog.com
public class Test {

    public static void main(String[] args) {
        String url = "https://baidu.com";
        String html = getPageHtml(url);
        System.out.println(html);
    }

    /**
     * Fetch the HTML of a page, https included.
     */
    public static String getPageHtml(String currentUrl) {
        // Wrap the default client so it accepts any https certificate.
        HttpClient httpClient = new DefaultHttpClient();
        httpClient = HttpsClient.getNewHttpsClient(httpClient);
        String html = "";
        HttpGet request = new HttpGet(currentUrl);
        try {
            HttpResponse response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity entity = response.getEntity();
                html = EntityUtils.toString(entity);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}
Jars used:
- commons-logging.jar
- httpclient-4.2.5.jar
- httpcore-4.2.4.jar
The code above was tested and works on JDK 1.7.
Download the source and jars here, import them into Eclipse, and they will run as-is.
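One caveat: DefaultHttpClient, SchemeRegistry, and ThreadSafeClientConnManager are all deprecated from HttpClient 4.3 onward. On a newer classpath (this sketch assumes HttpClient 4.5.x rather than the 4.2.5 jar listed above), a similarly permissive client can be built with the fluent builder API: TrustSelfSignedStrategy accepts self-signed certificates, and NoopHostnameVerifier disables hostname checks.

import javax.net.ssl.SSLContext;

import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;

public class ModernHttpsClient {
    public static CloseableHttpClient create() throws Exception {
        // Accept self-signed certificates in addition to the default trust store.
        SSLContext ctx = SSLContexts.custom()
                .loadTrustMaterial(null, new TrustSelfSignedStrategy())
                .build();
        // Skip hostname verification, matching ALLOW_ALL_HOSTNAME_VERIFIER above.
        return HttpClients.custom()
                .setSSLContext(ctx)
                .setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE)
                .build();
    }
}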

Originally published at 业余草 » Fetching https page content with HttpsClient