Looping Crawler Script

  • 2020-08-18

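I recently needed to crawl content from a website, so I wrote a script that crawls it automatically and registered it as a scheduled task (the crawl itself is single-threaded).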

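Requests are sent with the GuzzleHttp library (plain cURL works the same way). Once the task starts, it keeps running until all the content has been crawled, then exits.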
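To illustrate the remark that plain cURL works the same way, here is a minimal sketch of a GET request using PHP's cURL extension with the same 30-second timeouts, Referer, and User-Agent settings as the script. The function names `buildCurlOptions` and `fetchWithCurl` are hypothetical and are not part of the original script; it also assumes the cURL extension is installed.

```php
<?php
// Hypothetical helper (not in the original script): builds cURL options that
// mirror the request settings used by the Guzzle version.
function buildCurlOptions(string $url, string $referer, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 30,   // total request timeout in seconds
        CURLOPT_CONNECTTIMEOUT => 30,   // connection timeout in seconds
        CURLOPT_REFERER        => $referer,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_ENCODING       => '',   // advertise every encoding this cURL build can decode
        CURLOPT_HTTPHEADER     => ['Cache-Control: no-cache', 'Accept: */*'],
    ];
}

// Hypothetical helper: performs the GET request and returns the response body.
function fetchWithCurl(string $url, string $referer, string $userAgent): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $referer, $userAgent));
    $body = curl_exec($ch);
    if ($body === false) {
        // Surface transport errors as exceptions so a try/catch can log them.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}
```

Throwing on failure lets the caller's try/catch log the error in one place.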

Without further ado, here is the code:

use GuzzleHttp\Client;

class SearchController
{
    // User-Agent strings used to imitate different browsers
    protected static $userAgent = [
        1 => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
        2 => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
        3 => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41',
        4 => 'Opera/9.80 (Macintosh; Intel Mac OS X; U; en) Presto/2.2.15 Version/10.00',
        5 => 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
    ];

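    public static function startSearch()
    {
        // Target URL
        $refererUrl = 'http://www.baidu.com';
        // ID of the last record handled in this run (assumes an auto-incrementing integer id).
        // Paging past it keeps a record whose request always fails from looping forever.
        $lastId = 0;
        while (true) {
            // The records to crawl are prepared in advance, each with an is_search flag (0 = not crawled yet)
            $searchData = SearchData::query()
                ->where('is_search', 0)
                ->where('id', '>', $lastId)
                ->orderBy('id')
                ->limit(50)
                ->get(['id', 'search_name', 'is_search']);
            // Everything has been crawled: end the loop and exit
            if ($searchData->isEmpty()) {
                break;
            }
            // Process the records one by one
            foreach ($searchData as $search) {
                $lastId = $search->id;
                // try/catch captures and logs any exception raised while crawling so the script keeps running
                try {
                    if (empty($search->search_name)) {
                        $search->is_search = 1;
                        $search->save();
                        continue;
                    }
                    // URL-encode the keyword so spaces and non-ASCII characters survive the query string
                    $requestUrl = $refererUrl.'?keyword='.urlencode($search->search_name);
                    // Send the request
                    $html = self::getCurl($requestUrl, $refererUrl);
                    // Collapse the response onto a single line. To strip whitespace runs as well, add \s to the
                    // pattern, e.g. /\r|\n|\t|\s{3,}/ removes runs of three or more whitespace characters
                    $htmlOneLine = preg_replace("/\r|\n|\t/", "", $html);
                    // Extract the contents of the HTML table; adapt the regular expression to your own page
                    preg_match_all("/<tbody>(.*)<\/tbody>/iU", $htmlOneLine, $tableArr);
                    // Nothing found: skip this record and mark it as searched
                    $tbodyHtml = $tableArr[1][0] ?? '';
                    if (empty($tbodyHtml)) {
                        $search->is_search = 1;
                        $search->save();
                        continue;
                    }
                    // Extract every row of the table
                    preg_match_all("/<tr>(.*)<\/tr>/iU", $tbodyHtml, $trArr);
                    $tdArr = $trArr[1] ?? [];
                    foreach ($tdArr as $item) {
                        // Parse the row and save the result
                        preg_match_all("/<td>(.*)<a/iU", $item, $data);
                        $result = $data[1][0] ?? '';
                        if (!empty($result)) {
                            saveData::query()->create([
                                'search_data'   => $result,
                            ]);
                        }
                    }
                    $search->is_search = 1;
                    $search->save();
                } catch (\Exception $e) {
                    // Log the error and move on; the record stays unmarked, so the next scheduled run retries it
                    echo $e->getMessage().PHP_EOL;
                }
            }
        }

        return true;
    }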

    // Fetch content with a GET request
    protected static function getCurl($url, $referer = '')
    {
        $option = [
            'stream' => false,
            'timeout' => 30, // total request timeout in seconds
            'connect_timeout' => 30, // connection timeout in seconds; the attempt is dropped once it expires
            'http_errors' => false, // do not throw on 4xx/5xx responses
            // Guzzle option keys are case-sensitive: the key must be 'headers', otherwise the headers are never sent
            'headers' => [
                'Referer' => $referer,
                'User-Agent'=> self::$userAgent[array_rand(self::$userAgent)], // random browser User-Agent
                'Cache-Control' => 'no-cache',
                'Accept' => '*/*',
                'Accept-Encoding' => 'gzip, deflate, br',
            ]
        ];
        $client = new Client();
        $response = $client->get($url, $option);

        return $response->getBody()->getContents();
    }

    // POST request
    protected static function postCurl($url, $data = [], $referer = '')
    {
        $option = [
            'stream' => false,
            'timeout' => 30,
            'connect_timeout' => 30,
            'http_errors' => false,
            // Guzzle option keys are case-sensitive: the key must be 'headers'
            'headers' => [
                'Referer' => $referer,
                'User-Agent'=> self::$userAgent[array_rand(self::$userAgent)],
                'Cache-Control' => 'no-cache',
                'Accept' => '*/*',
                'Accept-Encoding' => 'gzip, deflate, br',
            ],
            // Use only one of the two body options below; they should not be combined.
            // Send the data as a URL-encoded form (application/x-www-form-urlencoded)
            'form_params' => $data,
            // Or send the data as JSON instead:
            // 'json' => $data,
        ];
        $client = new Client();
        $response = $client->post($url, $option);

        return $response->getBody()->getContents();
    }
}
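
The parsing steps in startSearch() can be tried on their own, without a database or network access. The sketch below runs the same three regular expressions against a small HTML string. The sample markup and the function name `extractCells` are made up for illustration; a real page will need patterns adapted to its own structure.

```php
<?php
// Hypothetical sample page; the real markup depends on the target site.
$html = <<<'HTML'
<table>
<tbody>
    <tr><td>alpha<a href="/1">more</a></td></tr>
    <tr><td>beta<a href="/2">more</a></td></tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns the text found between <td> and the first <a
// of every row inside the first <tbody>.
function extractCells(string $html): array
{
    // Collapse the document onto one line so "." can match across former line breaks.
    $oneLine = preg_replace("/\r|\n|\t/", '', $html);
    // Take the first <tbody>; the U flag makes the quantifier lazy.
    if (!preg_match("/<tbody>(.*)<\/tbody>/iU", $oneLine, $tbody)) {
        return [];
    }
    // Split the table body into rows.
    preg_match_all("/<tr>(.*)<\/tr>/iU", $tbody[1], $rows);
    $cells = [];
    foreach ($rows[1] as $row) {
        // Keep only rows where something precedes the link.
        if (preg_match("/<td>(.*)<a/iU", $row, $match) && $match[1] !== '') {
            $cells[] = $match[1];
        }
    }
    return $cells;
}

print_r(extractCells($html)); // lists "alpha" and "beta"
```

Note that these patterns only match bare `<tr>` and `<td>` tags. Tags that carry attributes, such as `<td class="x">`, would be skipped until the expressions are adjusted.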
