Luoyang Shovel's Log

November 6, 2009

Crawler collection

Filed under: Python | HackGou @ 18:15

Python-based crawlers:

  1. Atomisator: http://atomisator.ziade.org/ to build custom RSS feeds
  2. Orchid: http://pypi.python.org/pypi/Orchid/1.1
    Orchid is a generic multi-threaded web crawler written in Python, complete with documentation, that its author developed for a graduate course and used to locate web pages containing malicious code. The logic for what to do with the crawled pages is implemented in a separate class, so Orchid can easily be used for any application that requires crawling the web.
  3. Ruya: http://pypi.python.org/pypi/Ruya/1.0
    Ruya is a Python-based breadth-first, level-based, delayed, event-based crawler for English and Japanese websites. It is aimed solely at developers who want crawling functionality and crawl control in their projects through an API.
  4. HarvestMan:
    HarvestMan (with a capital ‘H’ and a capital ‘M’) is a web crawler program. It belongs to a family of programs commonly referred to as web crawlers, web bots, web robots, or offline browsers. These programs crawl a distributed network of computers, such as the Internet, and download files locally.
    1. http://code.google.com/p/harvestman-crawler/
    2. http://www.harvestmanontheweb.com/
  5. Scrapy: http://dev.scrapy.org/
    Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.
    Although Scrapy was originally designed for screen scraping, it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
    The Scrapy documentation introduces the concepts behind it, so you can get an idea of how it works and decide whether it is what you need.
  6. Webstemmer : http://www.unixuser.org/~euske/python/webstemmer/index.html
    Webstemmer is a web crawler and HTML layout analyzer that automatically extracts the main text of a news site, without banners, ads, or navigation links mixed in.
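
Several of the descriptions above share the same ideas: a breadth-first crawl, a limit on how many levels deep to follow links, a delay between requests, and page-handling logic kept in a separate class. The following is only a minimal standard-library sketch of those ideas, not the API of Orchid, Ruya, or any other project listed here. The names `crawl`, `PageHandler`, `LinkParser`, and `fetch_url` are hypothetical, and the sketch is single-threaded and ignores robots.txt.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class PageHandler:
    """Hypothetical handler: decides what to do with each crawled page."""

    def __init__(self):
        self.seen = []

    def handle(self, url, html, level):
        # Replace this with real processing (indexing, scanning, saving).
        self.seen.append((url, level))


def fetch_url(url):
    """Default fetcher: downloads a page over the network."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, handler, max_level=2, delay=1.0, fetch=fetch_url):
    """Breadth-first crawl, following links at most max_level hops away."""
    queue = deque([(start_url, 0)])
    visited = {start_url}
    while queue:
        url, level = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        handler.handle(url, html, level)
        if level >= max_level:
            continue  # do not follow links beyond the level limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in visited:
                visited.add(link)
                queue.append((link, level + 1))
        time.sleep(delay)  # pause before moving on to the next page
```

Because the `fetch` argument can be swapped out, the crawl order can be tried on an in-memory dictionary of pages before pointing it at a real site. Keeping the handler separate means the same loop can serve different applications.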

Other crawlers:

  1. Droids: http://incubator.apache.org/droids/
  2. Heritrix: http://crawler.archive.org/articles/user_manual/creating.html
