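Gecco: An Introduction to an Easy-to-Use, Lightweight Web Crawler

Technical Background

Crawler technology emerged to meet the demand for collecting data from the web. Gecco is an easy-to-use, lightweight web crawler written in Java. It integrates established frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redisson. It is designed around the open-closed principle, which makes it straightforward to extend, and it is released under the permissive MIT open-source license.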

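Implementation Steps

1. Download Gecco

Gecco can be pulled in through Maven by adding the following dependency to pom.xml, replacing x.x.x with the release you want to use: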

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>x.x.x</version>
</dependency>

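2. Quick Start

Define a crawler bean class. The example below matches GitHub project pages and extracts the star count, the fork count, and the README content: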

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Html;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
    private String user;

    @RequestParameter("project")
    private String project;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
    private String star;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
    private String fork;

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme;

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getStar() {
        return star;
    }

    public void setStar(String star) {
        this.star = star;
    }

    public String getFork() {
        return fork;
    }

    public void setFork(String fork) {
        this.fork = fork;
    }

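    public static void main(String[] args) {
        GeccoEngine.create()
            // Package to scan for @Gecco-annotated beans
            .classpath("com.geccocrawler.gecco.demo")
            // Page to start crawling from
            .start("https://github.com/xtuhcy/gecco")
            // Number of crawler threads
            .thread(1)
            // Pause after each request, in milliseconds
            .interval(2000)
            // Crawl in a loop
            .loop(true)
            // Use a desktop (non-mobile) user agent
            .mobile(false)
            // Start the engine
            .start();
    }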
}
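
As a rough illustration of how the {user} and {project} placeholders in matchUrl relate to the @RequestParameter fields above, here is a minimal sketch that uses only the Java standard library. It is not Gecco's internal matching code. The class name UrlTemplateSketch and the method match are hypothetical, and the rule that each placeholder matches exactly one path segment is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of Gecco.
class UrlTemplateSketch {

    // Finds {name} placeholders inside a URL template.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^/{}]+)\\}");

    // Returns the placeholder values found in url, or null if url does not fit the template.
    static Map<String, String> match(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so it matches verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a placeholder stands for one non-empty path segment.
            regex.append("([^/]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return null;
        }
        // Keep insertion order so parameters appear as they do in the template.
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = match(
            "https://github.com/{user}/{project}",
            "https://github.com/xtuhcy/gecco");
        System.out.println(params); // prints {user=xtuhcy, project=gecco}
    }
}
```

Under these assumptions, the start URL from the example yields user = xtuhcy and project = gecco, which are the values the user and project fields of MyGithub would be expected to receive.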

3. Runtime Configuration with DynamicGecco

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.dynamic.DynamicGecco;

public class DynamicGeccoDemo {
    public static void main(String[] args) {
        // Describe the bean at runtime instead of writing an annotated class
        DynamicGecco.html()
            .gecco("https://github.com/{user}/{project}", "consolePipeline")
            .requestField("request").request().build()
            .stringField("user").requestParameter("user").build()
            .stringField("project").requestParameter().build()
            .stringField("star").csspath(".pagehead-actions li:nth-child(2) .social-count").text(false).build()
            .stringField("fork").csspath(".pagehead-actions li:nth-child(3) .social-count").text().build()
            .stringField("contributors").csspath("ul.numbers-summary > li:nth-child(4) > a").href().build()
            .register();

        // Start crawling
        GeccoEngine.create()
            .classpath("com.geccocrawler.gecco.demo")
            .start("https://github.com/xtuhcy/gecco")
            .run();
    }
}