设计一个网页爬虫_代码007(未授权)

本文介绍: 系统设计面试题

我们的作用域只是处理以下Use Case：

作用域外：

状态假设：

使用更传统的系统进行练习 – 不要使用现存的系统比如 solr 或者 nutch

便利的转换指南

Component Design

我们假设我们有一个初始化list links_to_crawl 基于整体站点流行度排序初始化，如果这不是一个合理的假设的话，我们可以搜索这个爬虫伴随着流行度网站。link到外面的内容，比如 Yahoo, DMOZ等

我们使用一个表 crawled_links 去存储处理过的link和他们的页面签名.
我们可以存储 links_to_crawl 和 crawled_links 在一个 key-value NoSQL Database. 对于排名的link 我们存进 links_to_crawl, 我们可以使用 Redis 伴随着 sorted set去维护一个页面link的排名。我们应该讨论不同 use case之间的最优解。

PageDataStore 是一个抽象伴随着 Crawler Service，并且使用 NoSQL 数据库

class PagesDataStore(object):

    def __init__(self, db);
        self.db = db
        ...

    def add_link_to_crawl(self, url):
        """Add the given link to `links_to_crawl`."""
        ...

    def remove_link_to_crawl(self, url):
        """Remove the given link from `links_to_crawl`."""
        ...

    def reduce_priority_link_to_crawl(self, url)
        """Reduce the priority of a link in `links_to_crawl` to avoid cycles."""
        ...

    def extract_max_priority_page(self):
        """Return the highest priority link in `links_to_crawl`."""
        ...

    def insert_crawled_link(self, url, signature):
        """Add the given link to `crawled_links`."""
        ...

    def crawled_similar(self, signature):
        """Determine if we've already crawled a page matching the given signature"""
        ...

Page 是一个抽象伴随着 crawler service, 用来疯涨一个 Page, 他的内容， child urls,和签名

class Page(object):
	def __init__(self, url, contents, child_urls, signature):
		self.url = url
		self.contents = contents
		self.child_urls = child_urls
		self.signature = signature

Crawler 是Crawler Service中的主类，聚合 Page 和 PagesDataStore

class Crawler(object):
	def __init__(self, data_store, reverse_index_queue, doc_index_queue):
        self.data_store = data_store
        self.reverse_index_queue = reverse_index_queue
        self.doc_index_queue = doc_index_queue

    def create_signature(self, page):
        """Create signature based on url and contents."""
        ...

    def crawl_page(self, page):
        for url in page.child_urls:
            self.data_store.add_link_to_crawl(url)
        page.signature = self.create_signature(page)
        self.data_store.remove_link_to_crawl(page.url)
        self.data_store.insert_crawled_link(page.url, page.signature)

    def crawl(self):
        while True:
            page = self.data_store.extract_max_priority_page()
            if page is None:
                break
            if self.data_store.crawled_similar(page.signature):
                self.data_store.reduce_priority_link_to_crawl(page.url)
            else:
                self.crawl_page(page)

class RemoveDuplicateUrls(MRJob):

    def mapper(self, _, line):
        yield line, 1

    def reducer(self, key, values):
        total = sum(values)
        if total == 1:
            yield key, total

决定什么时候去更新爬虫的结果

$ curl https://search.com/api/v1/search?query=hello+world

{
    "title": "foo's title",
    "snippet": "foo's snippet",
    "link": "https://foo.com",
},
{
    "title": "bar's title",
    "snippet": "bar's snippet",
    "link": "https://bar.com",
},
{
    "title": "baz's title",
    "snippet": "baz's snippet",
    "link": "https://baz.com",
},