Some Hints for Nutch

I haven't followed Nutch for a while, but reading the mailing list taught me a few small Nutch tricks.

How do you index sites with dynamic URLs?
Adjust the regex-urlfilter.txt or crawl-urlfilter.txt file. See what follows the line "# skip URLs containing certain characters as probable queries" in that file.
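
As a rough illustration of that adjustment: in the stock regex-urlfilter.txt as I remember it, the comment is followed by a single exclusion rule, and commenting that rule out (or narrowing its character class) lets URLs with query strings through. Treat the exact rule text as an assumption and check it against your own copy of the file.

```text
# skip URLs containing certain characters as probable queries, etc.
# Original rule (rejects any URL containing ?, *, !, @ or =):
# -[?*!@=]
# Narrowed variant that still rejects *, ! and @ but allows ? and =:
-[*!@]
```

Allowing query strings can make a crawl much larger, so it is usually worth keeping the other rules in the file restrictive.
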
Building Nutch requires Ant 1.6 or later.

Verifying that regex-urlfilter works correctly (by Michael Nebel):

If you want to know whether your regex-urlfilter works as expected, you can
check it with the command:

	cat FILE-WITH-URLS | nutch net/nutch/net/RegexURLFilter

or by calling "nutch net/nutch/net/RegexURLFilter" and entering the URL 
by hand.

Every line beginning with a "+" is accepted; a line beginning with a "-" is
rejected. For example:

   $ echo "http://www.nutch.org" | nutch net/nutch/net/RegexURLFilter
   run with heapsize 256
   -Xmx256m
   050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-default.xml
   050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-site.xml
   050202 173520 found resource regex-urlfilter.txt at
   file:/home/nutch/nutch-0.7/conf/regex-urlfilter.txt
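
When checking a long list of URLs, the verdict lines get buried in log output. Here is a minimal shell sketch that tallies them, assuming each verdict line has the form "+<url>" or "-<url>". The helper name summarize_filter_output and the example.com URL are my own inventions for illustration, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper: summarize RegexURLFilter checker output read from stdin.
# Assumption: verdict lines are "+<url>" (accepted) or "-<url>" (rejected).
# Requiring a URL scheme after the sign skips log lines and the "-Xmx256m"
# line printed by the launcher script.
summarize_filter_output() {
  awk '
    /^\+[A-Za-z][A-Za-z0-9+.-]*:/ { accepted++; next }
    /^-[A-Za-z][A-Za-z0-9+.-]*:/  { rejected++; print "rejected: " substr($0, 2) }
    END { printf "accepted=%d rejected=%d\n", accepted, rejected }
  '
}
```

It could be used at the end of the pipeline shown earlier:

	cat FILE-WITH-URLS | nutch net/nutch/net/RegexURLFilter 2>&1 | summarize_filter_output
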
