<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>Chinese Word Segmentation</title>
    <description>Lucene Chinese word segmentation: news, technology, and discussion</description>
    <link>http://analysis.group.javaeye.com</link>
    <language>en</language>
    <copyright>Copyright 2003-2008, JavaEye.com</copyright>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <generator>JavaEye - the best community for software development discussion</generator>
          <item>
        <title>A bug in ictclas4j</title>
        <author>chencang</author>
        <description>
          <![CDATA[
          <br/>
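          Author: <a href="http://tinypig.javaeye.com">chencang</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/250926" style="color:red;">http://analysis.group.javaeye.com/group/blog/250926</a>&nbsp;
          Posted: October 9, 2008
          <br/><br/>
          Notice: This is an original blog post published on JavaEye. Reproduction by any website without the author's written permission is strictly prohibited; violators will be held legally liable.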
          <br/><br/>
          <p>I am not sure how many people use ictclas4j. The project lives at <a href="http://code.google.com/p/ictclas4j/">http://code.google.com/p/ictclas4j/</a>, and the discussion group for the ictclas segmentation system is at <a href="http://groups.google.com/group/ictclas">http://groups.google.com/group/ictclas</a>.</p>
<p>Someone reported a problem in the ictclas4j issue tracker: &ldquo;<span style="color: #ff0000;">the final segmentation result swallows some characters</span>&rdquo;. The issue is at <a href="http://code.google.com/p/ictclas4j/issues/detail?id=2">http://code.google.com/p/ictclas4j/issues/detail?id=2</a>, but nobody has answered it.</p>
<p>&nbsp;</p>
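<p>I ran into the same problem, so I had to look into it myself. After studying the ictclas4j source and comparing it with the source of the original C++ version (FreeICTCLAS), I finally found the error: it is in the personRecognize method, the person-name recognition part of PosTagger.java.</p>
<p>The ictclas4j code is:</p>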
<pre name="code" class="java">if (sn.getPos() &lt; 4 &amp;&amp; unknownDict.getFreq(sn.getWord(), sn.getPos()) &lt; Utility.LITTLE_FREQUENCY)
	personName += sn.getWord();</pre>
<p>&nbsp;The code in the original C++ version is:</p>
<pre name="code" class="cpp">				if(m_nBestTag[nPos]&lt;4 &amp;&amp; personDict.GetFrequency(m_sWords[nPos],m_nBestTag[nPos])&lt;LITTLE_FREQUENCY)
					nLittleFreqCount++;//The counter increase
				strcat(sPersonName,m_sWords[nPos]);</pre>
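<p>&nbsp;personName and sPersonName mean the same thing in the two snippets, so the error is now clear: in the C++ version the if statement only guards the counter increment and strcat always runs, whereas the Java port appends the word only when the condition holds.</p>
<p>My guess is that sinboy misread the original when porting it. The omitted nLittleFreqCount variable makes no difference to ictclas4j for now (whether it will matter in later versions I cannot say), so we can simply comment out the if statement.</p>
<p>In addition, I am not entirely comfortable with sn.getWord(): according to the comments in the SegNode class, sn.getSrcWord() is the method that returns the original word, so it is better to switch to that method as well.</p>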
<p>&nbsp;</p>
<p>In the end, my fix was to replace the snippet above, in the personRecognize method of the ictclas4j PosTagger class, with:</p>
<pre name="code" class="java">personName += sn.getSrcWord();</pre>
<p>&nbsp;With this change, my tests no longer show &ldquo;missing words&rdquo; or &ldquo;swallowed words&rdquo; in the segmentation results.</p>
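<p>The effect of the misplaced condition can be reproduced without ictclas4j. The following is only a minimal sketch: Node, PersonNameJoinDemo, buggyJoin, fixedJoin, and the threshold value 6 are hypothetical stand-ins, and Latin strings stand in for Chinese tokens. It shows how guarding the append with the frequency test drops tokens from the assembled name.</p>

```java
/** Illustrative stand-in for a segmented token; not the real ictclas4j SegNode. */
final class Node {
    final String word; // surface form of the token
    final int pos;     // role tag assigned during person-name tagging
    final int freq;    // frequency of (word, pos) in the person-name dictionary

    Node(String word, int pos, int freq) {
        this.word = word;
        this.pos = pos;
        this.freq = freq;
    }
}

final class PersonNameJoinDemo {
    // Hypothetical threshold; the real constant is Utility.LITTLE_FREQUENCY.
    static final int LITTLE_FREQUENCY = 6;

    /** Mirrors the buggy Java port: a token is appended only when the test holds. */
    static String buggyJoin(Node[] nodes) {
        StringBuilder name = new StringBuilder();
        for (Node n : nodes) {
            if (n.pos < 4 && n.freq < LITTLE_FREQUENCY) {
                name.append(n.word);
            }
        }
        return name.toString();
    }

    /** Mirrors the C++ original: the test only bumps a counter, every token is appended. */
    static String fixedJoin(Node[] nodes) {
        StringBuilder name = new StringBuilder();
        int littleFreqCount = 0; // kept only to mirror nLittleFreqCount
        for (Node n : nodes) {
            if (n.pos < 4 && n.freq < LITTLE_FREQUENCY) {
                littleFreqCount++;
            }
            name.append(n.word);
        }
        return name.toString();
    }

    public static void main(String[] args) {
        // A frequent token followed by a rare one
        Node[] nodes = { new Node("Zhang", 1, 900), new Node("San", 2, 3) };
        System.out.println("buggy: " + buggyJoin(nodes));
        System.out.println("fixed: " + fixedJoin(nodes));
    }
}
```

<p>In this sketch the frequent token is the one that disappears in the buggy variant, which matches the &ldquo;swallowed characters&rdquo; symptom described in the issue.</p>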
<p>&nbsp;</p>
<p>Also, I believe sinboy once said he plans to upgrade ictclas4j to version 1.0; the current version is 0.9.1. Something to look forward to.</p>
<p>&nbsp;</p>
          <br/>
          <span style="color:red;">
            <a href="http://analysis.group.javaeye.com/group/blog/250926#comments" style="color:red;">The discussion of this post is also worth reading. View the discussion >></a>
          </span>
          ]]>
        </description>
        <pubDate>Thu, 09 Oct 2008 21:27:21 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/250926</link>
        <guid>http://analysis.group.javaeye.com/group/blog/250926</guid>
      </item>
          <item>
        <title>IKAnalyzer vs. Paoding: a performance comparison</title>
        <author>keller</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://keller.javaeye.com">keller</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/203929" style="color:red;">http://analysis.group.javaeye.com/group/blog/203929</a>&nbsp;
          Posted: February 18, 2008
          <br/><br/>
          <p>&nbsp; Paoding and IK produce similar segmentation results; IK emits more tokens but is slower.</p>
<p>Original article:</p>
<p>&nbsp;<a href="http://www.zgkw.cn/FORUMS/blogs/dyx/archive/2008/02/18/59776.aspx">http://www.zgkw.cn/FORUMS/blogs/dyx/archive/2008/02/18/59776.aspx</a></p>
          <br/>
          <span style="color:red;">
            <a href="http://analysis.group.javaeye.com/group/blog/203929#comments" style="color:red;">The discussion of this post is also worth reading. View the discussion >></a>
          </span>
          ]]>
        </description>
        <pubDate>Mon, 18 Feb 2008 12:35:00 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/203929</link>
        <guid>http://analysis.group.javaeye.com/group/blog/203929</guid>
      </item>
          <item>
        <title>A problem with the JE analyzer</title>
        <author>zzxplayful</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://zzxplayful.javaeye.com">zzxplayful</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/143538" style="color:red;">http://analysis.group.javaeye.com/group/blog/143538</a>&nbsp;
          Posted: November 26, 2007
          <br/><br/>
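          <p><font face="Arial">I am using the analyzer from je-analysis-1.5.2.jar. After a few hundred documents have been indexed, the following exception is thrown. Could someone take a look and tell me what causes it? Thanks.</font></p>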
<p><font face="Arial">java.lang.ArrayIndexOutOfBoundsException: 1056<br />
&nbsp;at jeasy.analysis.lIIllIlIlIIIllll._$3(Unknown Source:264)<br />
&nbsp;at jeasy.analysis.lIIllIlIlIIIllll._$2(Unknown Source:143)<br />
&nbsp;at jeasy.analysis.lIIllIlIlIIIllll._$1(Unknown Source:58)<br />
&nbsp;at jeasy.analysis.lIIllIlIlIIIllll.next(Unknown Source:38)<br />
&nbsp;at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:107)<br />
&nbsp;at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:219)<br />
&nbsp;at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:95)<br />
&nbsp;at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:1013)<br />
&nbsp;at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1001)<br />
&nbsp;at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)<br />
&nbsp;at com.hotct.search.core.IndexProcesser.createIndex(IndexProcesser.java:125)<br />
&nbsp;at com.hotct.search.app.cms.index.ArticleIndexProcesser.createArticleIndex(ArticleIndexProcesser.java:49)<br />
&nbsp;at com.hotct.search.app.cms.index.ArticleIndexProcesser.getPageAritcle(ArticleIndexProcesser.java:74)<br />
&nbsp;at com.hotct.search.app.cms.index.ArticleIndexProcesser.main(ArticleIndexProcesser.java:82)</font></p>
          <br/>
          <span style="color:red;">
            <a href="http://analysis.group.javaeye.com/group/blog/143538#comments" style="color:red;">The discussion of this post is also worth reading. View the discussion >></a>
          </span>
          ]]>
        </description>
        <pubDate>Mon, 26 Nov 2007 11:00:58 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/143538</link>
        <guid>http://analysis.group.javaeye.com/group/blog/143538</guid>
      </item>
          <item>
        <title>Segmentation strategy of Paoding (2.0.4-alpha)</title>
        <author>Qieqie</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://qieqie.javaeye.com">Qieqie</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/126944" style="color:red;">http://analysis.group.javaeye.com/group/blog/126944</a>&nbsp;
          Posted: September 25, 2007
          <br/><br/>
          <pre name="code" class="java">import java.io.StringReader;

import junit.framework.TestCase;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerTest extends TestCase {

    protected PaodingAnalyzer analyzer = new PaodingAnalyzer();

    protected StringBuilder sb = new StringBuilder();

    /**
     * Runs the analyzer over the input and joins the resulting tokens with '/'.
     */
    protected String dissect(String input) {
        try {
            TokenStream ts = analyzer.tokenStream("", new StringReader(input));
            Token token;
            sb.setLength(0);
            while ((token = ts.next()) != null) {
                sb.append(token.termText()).append('/');
            }
            // Drop the trailing separator
            if (sb.length() > 0) {
                sb.setLength(sb.length() - 1);
            }
            return sb.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return "error";
        }
    }

    // --------------------------------------------------------------
    // Sentences that contain only dictionary words
    // --------------------------------------------------------------

    /**
     * The sentence consists entirely of dictionary words, with no
     * containment or overlap between them
     */
    public void test100() {
        String result = dissect("台北中文国际");
        assertEquals("台北/中文/国际", result);
    }

    /**
     * The sentence consists entirely of dictionary words, and some words
     * contain others
     */
    public void test101() {
        String result = dissect("北京首都机场");
        assertEquals("北京/首都/首都机场/机场", result);
    }

    /**
     * The sentence consists entirely of dictionary words, and some words
     * overlap
     */
    public void test102() {
        String result = dissect("东西已经拍卖了");
        assertEquals("东西/已经/拍卖/卖了", result);
    }

    /**
     * The sentence consists entirely of dictionary words, with both
     * containment and overlap between them
     */
    public void test103() {
        String result = dissect("羽毛球拍");
        assertEquals("羽毛/羽毛球/羽毛球拍/球拍", result);
    }

    // --------------------------------------------------------------
    // Noise words and single characters
    // --------------------------------------------------------------

    /**
     * A single noise character (的) between words
     */
    public void test200() {
        String result = dissect("足球的魅力");
        assertEquals("足球/魅力", result);
    }

    /**
     * A noise word (因之) between words
     */
    public void test201() {
        String result = dissect("主人因之生气");
        assertEquals("主人/生气", result);
    }

    /**
     * A single-character noise word before the word and a two-character
     * noise word after it (与, 有关)
     */
    public void test202() {
        String result = dissect("与谋杀有关");
        assertEquals("谋杀", result);
    }

    /**
     * A leading noise word (哪怕), followed later by consecutive
     * single-character noise (了, 你)
     */
    public void test203() {
        String result = dissect("哪怕朋友背叛了你");
        assertEquals("朋友/背叛", result);
    }

    /**
     * Consecutive leading noise words (虽然, 某些) and a single noise
     * character (很) between words
     */
    public void test204() {
        String result = dissect("虽然某些动物很凶恶");
        assertEquals("动物/凶恶", result);
    }

    // --------------------------------------------------------------
    // Strings that are not in the dictionary
    // --------------------------------------------------------------

    /**
     * Non-dictionary strings of one character each (东, 西, 南, 北)
     */
    public void test300() {
        String result = dissect("东&amp;&amp;西&amp;&amp;南&amp;&amp;北");
        assertEquals("东/西/南/北", result);
    }

    /**
     * Non-dictionary strings of two characters each (古哥, 谷歌, 收狗, 搜狗)
     */
    public void test302() {
        String result = dissect("古哥&amp;&amp;谷歌&amp;&amp;收狗&amp;&amp;搜狗");
        assertEquals("古哥/谷歌/收狗/搜狗", result);
    }

    /**
     * A non-dictionary string of several characters
     */
    public void test303() {
        String result = dissect("这是鸟语：玉鱼遇欲雨");
        assertEquals("这是/鸟语/玉鱼/鱼遇/遇欲/欲雨", result);
    }

    /**
     * A single non-dictionary character (真) between two words
     */
    public void test304() {
        String result = dissect("朋友真背叛了你了!");
        assertEquals("朋友/真/背叛", result);
    }

    /**
     * A two-character non-dictionary string (盒蟹) between two words
     */
    public void test305() {
        String result = dissect("建设盒蟹社会");
        assertEquals("建设/盒蟹/社会", result);
    }

    /**
     * A longer non-dictionary string (盒少蟹) between two words
     */
    public void test306() {
        String result = dissect("建设盒少蟹社会");
        assertEquals("建设/盒少/少蟹/社会", result);
    }

    // --------------------------------------------------------------
    // Chinese numerals without a decimal point
    // --------------------------------------------------------------

    /**
     * A single Chinese numeral
     */
    public void test400() {
        String result = dissect("二");
        assertEquals("2", result);
    }

    /**
     * Two Chinese numerals
     */
    public void test61() {
        String result = dissect("五六");
        assertEquals("56", result);
    }

    /**
     * Several Chinese numerals
     */
    public void test62() {
        String result = dissect("三四五六");
        assertEquals("3456", result);
    }

    /**
     * 十三 (thirteen)
     */
    public void test63() {
        String result = dissect("十三");
        assertEquals("13", result);
    }

    /**
     * 二千 (two thousand)
     */
    public void test65() {
        String result = dissect("二千");
        assertEquals("2000", result);
    }

    /**
     * 两千 (two thousand, written with 两)
     */
    public void test651() {
        String result = dissect("两千");
        assertEquals("2000", result);
    }

    /**
     * 2千 (Arabic digit followed by a Chinese unit)
     */
    public void test652() {
        String result = dissect("2千");
        assertEquals("2000", result);
    }

    /**
     * 3千万 (Arabic digit followed by the compound unit 千万)
     */
    public void test653() {
        String result = dissect("3千万");
        assertEquals("30000000", result);
    }

    /**
     * 3千万 followed by a measure word (个) and a noun
     */
    public void test654() {
        String result = dissect("3千万个案例");
        assertEquals("30000000/30000000个/案例", result);
    }

    /**
     * 千万 with no leading digit is kept as a word
     */
    public void test64() {
        String result = dissect("千万");
        assertEquals("千万", result);
    }

    public void test66() {
        String result = dissect("两两");
        assertEquals("两两", result);
    }

    public void test67() {
        String result = dissect("二二");
        assertEquals("22", result);
    }

    public void test68() {
        String result = dissect("2.2两");
        assertEquals("2.2/2.2两", result);
    }

    public void test69() {
        String result = dissect("二两");
        assertEquals("2/2两", result);
    }

    public void test7() {
        String result = dissect("哪怕二");
        assertEquals("2", result);
    }

}
 </pre>
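<p>Judging from test303 through test306, a run of characters that matches no dictionary word appears to be emitted as overlapping two-character tokens (盒少蟹 becomes 盒少/少蟹), while runs of one or two characters are emitted whole. The sketch below only illustrates that observed output pattern. It is not Paoding's implementation; OverlapBigramDemo and split are hypothetical names, and Latin letters stand in for Chinese characters.</p>

```java
final class OverlapBigramDemo {

    /**
     * Splits a run of unknown characters into overlapping two-character
     * tokens joined by '/'. Runs of length 0, 1, or 2 are returned unchanged.
     */
    static String split(String run) {
        if (run.length() <= 2) {
            return run;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 2 <= run.length(); i++) {
            if (i > 0) {
                out.append('/');
            }
            // Append the two characters starting at index i
            out.append(run, i, i + 2);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three-character run, analogous to the middle part of test306
        System.out.println(split("abc"));
        // Five-character run, analogous to the tail of test303
        System.out.println(split("abcde"));
        // One-character run, analogous to test304
        System.out.println(split("a"));
    }
}
```

<p>Overlapping pairs mean a query for either neighbouring pair of characters can still match the indexed text, at the cost of a larger index.</p>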
          <br/>
          <span style="color:red;">
            <a href="http://analysis.group.javaeye.com/group/blog/126944#comments" style="color:red;">The discussion of this post is also worth reading. View the discussion >></a>
          </span>
          ]]>
        </description>
        <pubDate>Tue, 25 Sep 2007 16:13:15 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/126944</link>
        <guid>http://analysis.group.javaeye.com/group/blog/126944</guid>
      </item>
          <item>
        <title>Using Paoding (2.0.4-alpha)</title>
        <author>Qieqie</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://qieqie.javaeye.com">Qieqie</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/126943" style="color:red;">http://analysis.group.javaeye.com/group/blog/126943</a>&nbsp;
          Posted: September 25, 2007
          <br/><br/>
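          <h2><span style="FONT-SIZE: 15pt; LINE-HEIGHT: 173%; FONT-FAMILY: 黑体">Getting started</span></h2>
<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">Paoding needs a set of dictionaries, all stored under a single directory called the dictionary installation directory. This can be any directory on the file system and does not depend on the directory the application runs from. Copying the dictionaries into it is called installing the dictionaries. Adding, removing, or modifying dictionaries in that directory is called customizing the dictionaries.</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 21pt"><span style="FONT-SIZE: 9pt; FONT-FAMILY: 华文细黑">On <span lang="EN-US">Linux</span>, consider installing the dictionaries in a directory on a partition dedicated to data. For example, I keep <span lang="EN-US">/data</span> as a separate partition and store the dictionaries under <span lang="EN-US">/data/paoding/dic</span>.</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 21pt"><span style="FONT-SIZE: 9pt; FONT-FAMILY: 华文细黑">On <span lang="EN-US">Windows</span>, consider installing the dictionaries in a directory on a partition other than the system drive. For example, I might store them under <span lang="EN-US">E:/data/paoding/dic</span>.</span></p>
<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">After installing the dictionaries, set the system environment variable </span><span lang="EN-US" style="FONT-FAMILY: 'Courier New'">PAODING_DIC_HOME</span><span style="FONT-FAMILY: 宋体"> to point to the dictionary installation directory.</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 21pt"><span style="FONT-SIZE: 9pt; FONT-FAMILY: 华文细黑">On <span lang="EN-US">Linux</span>, edit <span lang="EN-US">/etc/profile</span> and append the following <span lang="EN-US">2</span> lines to the end of the file, then save and exit. The setting takes effect in new login shells.</span></p>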
<p class="MsoNormal" style="MARGIN-LEFT: 42pt"><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">PAODING_DIC_HOME=/data/paoding/dic</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 42pt"><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">export PAODING_DIC_HOME</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 21pt"><span style="FONT-SIZE: 9pt; FONT-FAMILY: 华文细黑">On <span lang="EN-US">Windows</span>, open the &ldquo;Advanced&rdquo; tab of the &ldquo;My Computer&rdquo; properties, go to &ldquo;Environment Variables&rdquo;, and create a new variable with the name <span lang="EN-US">PAODING_DIC_HOME</span> and the value <span lang="EN-US">E:/data/paoding/dic</span>.</span></p>
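<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">One quick way to confirm that the variable is visible to the JVM is a small check like the one below. It is only a sketch: DicHomeCheck and describe are hypothetical names, not part of Paoding, and the check only verifies that the variable points at an existing directory, not that the dictionaries inside it are valid.</span></p>

```java
import java.io.File;

final class DicHomeCheck {

    /** Returns a short status message for the given PAODING_DIC_HOME value. */
    static String describe(String dicHome) {
        if (dicHome == null || dicHome.trim().isEmpty()) {
            return "PAODING_DIC_HOME is not set";
        }
        File dir = new File(dicHome);
        if (!dir.isDirectory()) {
            return "not a directory: " + dicHome;
        }
        return "dictionary directory found: " + dir.getAbsolutePath();
    }

    public static void main(String[] args) {
        // System.getenv returns null when the variable is not defined
        System.out.println(describe(System.getenv("PAODING_DIC_HOME")));
    }
}
```

<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">Run it from the same shell or service account that will run the application, since environment variables are inherited per process.</span></p>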
<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">Step </span><span lang="EN-US" style="FONT-FAMILY: 'Courier New'">3</span><span style="FONT-FAMILY: 宋体">: copy </span><span lang="EN-US" style="FONT-FAMILY: 'Courier New'">paoding-analysis.jar</span><span style="FONT-FAMILY: 宋体"> onto the application's runtime classpath. If you develop with an integrated development environment (IDE), copy </span><span lang="EN-US" style="FONT-FAMILY: 'Courier New'">paoding-analysis.jar</span><span style="FONT-FAMILY: 宋体"> into the project and use the IDE's wizard to add the JAR, so that the IDE recognizes it during development.</span></p>
<p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span style="FONT-FAMILY: 宋体">You can now use the Chinese analyzer provided by Paoding in your application code.</span></p>
<p class="MsoNormal" style="MARGIN: 0cm 0cm 7.8pt 21pt"><em><span style="FONT-SIZE: 9pt; FONT-FAMILY: 宋体">Reminder: </span></em><em><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">IDNEX_PATH</span></em><em><span style="FONT-SIZE: 9pt; FONT-FAMILY: 宋体"> in the sample code below is the location of the index. Before running the code, point it at a location that holds nothing important, such as </span></em><em><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">/data/paoding/test_index</span></em><em><span style="FONT-SIZE: 9pt; FONT-FAMILY: 宋体"> or </span></em><em><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">E:/paoding_test_index</span></em><em><span style="FONT-SIZE: 9pt; FONT-FAMILY: 宋体">, so that a moment of carelessness does not destroy important data.</span></em></p>
<table class="MsoTableGrid" cellspacing="0" border="1" width="568" cellpadding="0" style="BORDER-RIGHT: medium none; BORDER-TOP: medium none; MARGIN-LEFT: 5.4pt; BORDER-LEFT: medium none; WIDTH: 426.1pt; BORDER-BOTTOM: medium none; BORDER-COLLAPSE: collapse">
    <tbody>
        <tr style="HEIGHT: 15.1pt">
            <td valign="top" width="568" style="BORDER-RIGHT: windowtext 1pt solid; PADDING-RIGHT: 5.4pt; BORDER-TOP: windowtext 1pt solid; PADDING-LEFT: 5.4pt; BACKGROUND: #cccccc; PADDING-BOTTOM: 0cm; BORDER-LEFT: windowtext 1pt solid; WIDTH: 426.1pt; PADDING-TOP: 0cm; BORDER-BOTTOM: windowtext 1pt solid; HEIGHT: 15.1pt">
            <p class="MsoNormal"><span style="FONT-FAMILY: 宋体">Sample code: build an index, then query it</span></p>
            </td>
        </tr>
        <tr style="HEIGHT: 19.4pt">
            <td valign="top" width="568" style="BORDER-RIGHT: windowtext 1pt solid; PADDING-RIGHT: 5.4pt; BORDER-TOP: medium none; PADDING-LEFT: 5.4pt; PADDING-BOTTOM: 0cm; BORDER-LEFT: windowtext 1pt solid; WIDTH: 426.1pt; PADDING-TOP: 0cm; BORDER-BOTTOM: windowtext 1pt solid; HEIGHT: 19.4pt">
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">String IDNEX_PATH = &quot;E:/paoding_test_index&quot;;</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">// </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Get the Paoding Chinese analyzer</span></strong></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Analyzer analyzer = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> PaodingAnalyzer();</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">// </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Build the index</span></strong></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">IndexWriter writer = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> IndexWriter(</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">IDNEX_PATH</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">, analyzer, </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">true</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Document doc = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> Document();</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Field field = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> Field(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;content&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">, </span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;</span><span style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 宋体">你好，世界</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">!&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">, Field.Store.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">YES</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">,</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp; Field.Index.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">TOKENIZED</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">, Field.TermVector.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">WITH_POSITIONS_OFFSETS</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">doc.add(field);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">writer.addDocument(doc);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">writer.close();</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">System.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">out</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">.println(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;Indexed success!&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">&nbsp;</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">// </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'">Search</span></strong></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">IndexReader reader = IndexReader.<em>open</em>(</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">IDNEX_PATH</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">QueryParser parser = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> QueryParser(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;content&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">, analyzer);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Query query = parser.parse(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;</span><span style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 宋体">你好</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Searcher searcher = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> IndexSearcher(reader);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Hits hits = searcher.search(query);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">if</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> (hits.length() == 0) {</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp; System.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">out</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">.println(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;hits.length=0&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">}</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Document doc2 = hits.doc(0);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">// </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Highlighting</span></strong></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">String text = doc2.get(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;content&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0, </span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;content&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">TokenStream ts = TokenSources.<em>getTokenStream</em>(tpv);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Formatter formatter = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> Formatter() {</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp; </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">public</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> String highlightTerm(String srcText, TokenGroup g) {</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">if</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> (g.getTotalScore() &lt;= 0) {</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">return</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> srcText;</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">return</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> </span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;&lt;b&gt;&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> + srcText + </span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;&lt;/b&gt;&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">;</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp; }</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">};</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">Highlighter highlighter = </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> Highlighter(formatter, </span><strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #7f0055; FONT-FAMILY: 'Courier New'">new</span></strong><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> QueryScorer(</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; query));</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">String result = highlighter.getBestFragments(ts, text, 5, </span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;&hellip;&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">);</span></p>
            <p class="MsoNormal" align="left" style="TEXT-ALIGN: left"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">System.</span><em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #0000c0; FONT-FAMILY: 'Courier New'">out</span></em><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">.println(</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: #2a00ff; FONT-FAMILY: 'Courier New'">&quot;result:\n\t&quot;</span><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'"> + result);</span></p>
            <p class="MsoNormal" style="MARGIN-BOTTOM: 7.8pt"><span lang="EN-US" style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Courier New'">reader.close();</span></p>
            </td>
        </tr>
    </tbody>
</table>
          ]]>
        </description>
        <pubDate>Tue, 25 Sep 2007 16:11:47 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/126943</link>
        <guid>http://analysis.group.javaeye.com/group/blog/126943</guid>
      </item>
          <item>
        <title>Paoding 2.0.2 notes</title>
        <author>Qieqie</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://qieqie.javaeye.com">Qieqie</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/117592" style="color:red;">http://analysis.group.javaeye.com/group/blog/117592</a>&nbsp;
          Posted: 2007-08-28
          <br/><br/>
          Paoding 2.0.2记录<br /><br />paoding 现在在svn上的代码能够支持 自动动态装载词典，并检测词典是否发生了更新、删除。<br />也支持关闭自动监测(paoding.stopAutoDetecting)，而提供一个方法paoding.forceDetecting手动执行一次检测。<br /><br />现在这个版本为2.0.2，但是现在不打算打成jar包和zip包。<br /><u>待之后2.0.3支持简繁体、提供GBK->UTF-8;Big5->utf-8转化功能后再发包。</u><br /><br />-------------------------------<br />2007-9-19：<br />计划变更：简体繁体从2.0去除，推迟到2.1版；2.0.3版本号留空。下一个发布版本是2.0.4-alpha.<br />错误观点修正：因为lucene输入的是Reader，此时已经没有编码的问题了，全部都是符合unicode规范的字符了。不管是GBK还是BIG5存储的文件转化为Reader后，就没有编码的概念了。所以庖丁不存在GBK->UTF-8的变更。<br />-------------------------------<br /><br />2.0.3之后没有特殊原因，不会再增加新的特性或功能了。<br />之后便是完整测试，并持续发布2.0.4-alpha;-->2.0.4-beta;--><br />被**证明**稳定后最终发布2.0.5。<br /><br />之后除非有严重妨碍使用的bug，否则不再发布新版本。<br /><br />2.0.5之后的版本将直接跳到2.1.0开始(如果有新特性需要加入才会生版本)。<br />-------------------------------<br />2007-9-19：<br />计划调整：简繁体计划从2.1开始开发<br />-------------------------------<br /><br /><br /><br />一个使用手动检测词典变化的例子：<br /><pre name="code" class="java">	public static void main(String[] args) throws Exception {
		Paoding paoding = PaodingMaker.make();
		paoding.stopAutoDetecting(); // turn off automatic dictionary detection and detect manually instead
		PaodingAnalyzer analyzer = PaodingAnalyzer.defaultMode(paoding);
		int count = 1;
		while (true) {
			paoding.forceDetecting(); // force one manual detection pass before tokenizing
			TokenStream ts = analyzer.tokenStream(
					"", new StringReader("庖丁解牛词典检测"));
			Token token;
			while ((token = ts.next()) != null) {
				System.out.println(token);
			}
			System.out.println("--" + (count ++) + "--");
			Thread.sleep(1000 * 5);
		}
	}</pre><br /><br />如果要使用自动监测，应该保证有其他线程在运行，否则自动监测没办法进行<br />(其他线程如果不存在了，那么Paoding自动退出检测，所以一般只能在Web应用中测试Paoding的自动监测)<br />如果检测到词典变话，可以从日志/控制台中得到消息提示。
          ]]>
        </description>
        <pubDate>Tue, 28 Aug 2007 17:20:19 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/117592</link>
        <guid>http://analysis.group.javaeye.com/group/blog/117592</guid>
      </item>
          <item>
        <title>Chinese word segmentation: Paoding (庖丁解牛) version 2.0.1</title>
        <author>Qieqie</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://qieqie.javaeye.com">Qieqie</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/112164" style="color:red;">http://analysis.group.javaeye.com/group/blog/112164</a>&nbsp;
          Posted: 2007-08-14
          <br/><br/>
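          Paoding (庖丁解牛) Chinese word segmentation, version 2.0.1<br /><br />---------------------------------------------------<br />Changes since version 2.0.0:<br /><br /><strong>Refactoring (!)</strong>:<br />The code and dictionaries in svn were converted from GBK to UTF-8 (anyone checking out the code with Eclipse needs to change the project encoding)<br />->Statistically, more people use UTF-8 than GBK, hence the change; apologies for the inconvenience.<br /><br /><strong>Refactoring (!)</strong>:<br />PaodingMaker was refactored so that, when make is called to obtain a Paoding object, one configuration file produces only one Paoding by default (implemented by recording the absolute path of the file)<br />->In 2.0.0, calling PaodingMaker.make several times loaded the dictionaries several times, although that was intentional. With 2.0.1 this is no longer a concern: the Paoding for a given configuration file is not created more than once.<br /><br /><strong>Refactoring (!)</strong>:<br />PaodingMaker was refactored so that make can be called several times with different configuration files (classpath or ordinary file paths) to produce different Paoding objects<br />->The purpose of this feature is to let Paoding's segmentation be tailored to different application scenarios (Paoding can behave completely differently depending on which Knife objects are configured)<br />->2.0.0 could not produce Paoding objects from different configuration files at the same time<br /><br /><strong>Refactoring</strong>:<br />Removed the nearly useless class net.paoding.dictionary.support.Util (one of its functions was moved elsewhere)<br /><br /><strong>Refactoring</strong>:<br />Added a Constants interface that records the names of the configuration keys<br /><br /><strong>Enhancement</strong>:<br />When neither the specified dictionary installation directory nor its subdirectories contain any dictionary file, a PaodingAnalysisException is thrown with the message: Not found any dictionary files, have you set the 'paoding.dic.home' right?
<br /><br /><strong>Enhancement</strong>:<br />The charset used to read dictionary files can be specified in the configuration file; if it is not configured, UTF-8 is used. The configuration key is paoding.dic.charset<br /><br /><strong>Enhancement</strong>: added a build.xml file<br /><br /><strong>Bug</strong>:<br />Paoding could not be used when specific dictionaries such as noiseWord, noiseCharactor, unit or confucianFamilyName were missing; it should ignore them and work normally<br /><br /><strong>Bug</strong>:<br />The dictionary ignore-prefix setting had no effect for dictionaries that were not directly under the dictionary directory<br /><br /><strong>Bug</strong>:<br />The misnamed jar was renamed to paoding-analysis.jar<br />The earlier jar name was missing the letter shown in brackets: paoding-analy[s]is.jar<br />---------------------------------------------------<br />Task list (not yet implemented)<br />1. Simplified/Traditional Chinese support [priority: medium]<br />2. Dynamic reloading of changed dictionaries [priority: high]<br />3. Documentation for advanced users [priority: low]<br /><br /><br />---------------------------------------------------<br />Example:<br />See the '"Paoding" usage guide' section of <a href="http://groups.google.com/group/paoding/browse_thread/thread/9771c8d495786fee" target="_blank">Paoding 2.0.0 released</a><br /><br /><br />---------------------------------------------------<br />Related links<br /><br />svn: <a href="http://paoding.googlecode.com/svn/trunk/paoding-analysis" target="_blank">http://paoding.googlecode.com/svn/trunk/paoding-analysis</a><br /><br />zip download: <a href="http://code.google.com/p/paoding/downloads/list" target="_blank">http://code.google.com/p/paoding/downloads/list</a> <br /><br />Forum: <a href="http://groups.google.com/group/paoding" target="_blank">http://groups.google.com/group/paoding</a><br /><br />JavaEye: <a href="http://analysis.group.javaeye.com/" target="_blank">http://analysis.group.javaeye.com/</a>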
          ]]>
        </description>
        <pubDate>Tue, 14 Aug 2007 17:42:00 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/112164</link>
        <guid>http://analysis.group.javaeye.com/group/blog/112164</guid>
      </item>
          <item>
        <title>Chinese word segmentation: Paoding (庖丁解牛) 2.0.0 released</title>
        <author>Qieqie</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://qieqie.javaeye.com">Qieqie</a>&nbsp;
                    Link: <a href="http://analysis.group.javaeye.com/group/blog/110148" style="color:red;">http://analysis.group.javaeye.com/group/blog/110148</a>&nbsp;
          Posted: 2007-08-08
          <br/><br/>
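          <strong>Paoding (庖丁解牛) latest version 2.0.0 </strong><br /><br />Main changes: <br /><br />1) Package names now start with net.paoding.analysis, and some classes were renamed, mainly XAnalyzer to PaodingAnalyzer and the like. <br /><br />2) Some code was moved; the code is now concentrated in three packages: <br />net.paoding.analysis.dictionary: the dictionary abstraction, the first piece of core code <br />net.paoding.analysis.knife: the "knife" abstraction (the segmentation algorithms), the second piece of core code <br />net.paoding.analysis.analyzer: the adapter to the Lucene interfaces <br />The key code is unchanged; in particular, no bugs were found in CJKKnife. <br /><br />3) Dictionaries now have English names, to avoid unnecessary trouble from Chinese file names on some operating systems <br /><br />4) A configuration file was added; knives can be added or removed in the configuration file, and the dictionary installation path can be set freely. <br /><br />5) BUGFIX: wrong highlight positions<br /><br />Download: <a href="http://code.google.com/p/paoding/downloads/list" target="_blank">http://code.google.com/p/paoding/downloads/list</a><br />SVN: <a href="http://paoding.googlecode.com/svn/trunk/paoding-analysis/" target="_blank">http://paoding.googlecode.com/svn/trunk/paoding-analysis/</a> <br /><br />------------------------------------------------------------------- <br /><strong>Possible reasons to choose "Paoding" for Chinese word segmentation in Lucene: </strong><br /><br />@Elegant design: the Paoding (butcher) metaphor makes the code design easy to understand <br /><br />@Very efficient: a highly efficient dictionary lookup algorithm that avoids needless trial lookups wherever possible <br /><br />@Concise algorithm: simple and easy to understand, yet very efficient <br /><br />@Easy support for maximum/minimum word segmentation <br /><br />@Flexible dictionaries: <br />any number of dictionary files; <br />any name, as long as the extension is dic the file is treated as a dictionary <br />any directory depth (so dictionary directories, and the dictionaries in them, can be added or removed freely) <br />a simple dictionary format: no particular sort order is required and files can be edited by hand <br /><br />@The source code is open, under the <a href="http://www.apache.org/licenses/LICENSE-2.0" target="_blank">http://www.apache.org/licenses/LICENSE-2.0</a> license <br /><br />@Author: solid Java fundamentals and design skills, with continued attention to improvement <br /><br />------------------------------------------------------------------- <br /><strong>"Paoding" usage guide </strong><br /><br />1. Preparation <br />1) Put the binary package paoding-analyis.jar on your classpath <br /><br />2) Install (that is, copy) the dictionary files into some directory, for example /data/paoding/dic <br /><br />3) Put the configuration file paoding-analysis.properties on your classpath <br /><br />4) Open paoding-analysis.properties and set the paoding.dic.home property to the dictionary installation directory, for example paoding.dic.home=/data/paoding/dic. As a special case, if the dictionaries are installed in the dic directory on the classpath, the property can simply be set like this: <br />paoding.dic.home=classpath:dic <br /><br />2. Building the index 
<br />1) Wrap Paoding as an Analyzer that meets Lucene's requirements by obtaining a writer-mode Lucene analyzer; writer mode means that both maximum and minimum word segmentation are supported. <br />Paoding paoding = PaodingMaker.make(); <br />Analyzer writerAnalyzer = PaodingAnalyzer.writerMode(paoding); <br /><br />Paoding is best kept as a system-wide singleton so that it can be reused; it is thread-safe. <br /><br />2) Use the standard Lucene API to index the files. <br />IndexWriter writer = new IndexWriter(directory, writerAnalyzer); <br />... <br /><br />3. Searching <br />1) Use the standard Lucene API to search the files, with the same kind of Lucene analyzer that was used for indexing. <br />QueryParser parser = new QueryParser("content", writerAnalyzer ); <br />... <br /><br />For more detailed usage, see <br />the sample code in examples/net.paoding.analysis.examples.gettingstarted <br /><br />------------------------------------------------------------------ <br />"Paoding" Google group: <br /><a href="http://groups.google.com/group/paoding" target="_blank">http://groups.google.com/group/paoding</a><br /><br />"Chinese word segmentation" JavaEye group: <br /><a href="http://analysis.group.javaeye.com/" target="_blank">http://analysis.group.javaeye.com/</a><br /><br />svn: <br /><a href="http://paoding.googlecode.com/svn/trunk/paoding-analysis/" target="_blank">http://paoding.googlecode.com/svn/trunk/paoding-analysis/</a><br /><br />Old version: <br />http://paoding.googlecode.com/svn/trunk/paoding-analysis-1/ <br />Downloading the old version is not recommended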
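<br /><br />The dictionary rules given in the list of reasons above (any number of files, any name ending in dic, any directory depth) suggest a plain recursive directory scan. The sketch below only illustrates that rule and makes no claim about how Paoding itself loads dictionaries; DicFileScanDemo and countDictionaries are hypothetical names, and matching on the suffix ".dic" is an assumption about what "dic as the extension" means.

```java
import java.io.File;

/** Hypothetical sketch, not Paoding code: counts dictionary files under a dictionary home directory. */
final class DicFileScanDemo {

    /** Assumed file suffix for dictionaries. */
    static final String DIC_SUFFIX = ".dic";

    /** Counts files ending in .dic in dir and in all of its subdirectories. */
    static int countDictionaries(File dir) {
        File[] children = dir.listFiles();
        if (children == null) {
            return 0; // not a directory, missing, or unreadable
        }
        int count = 0;
        for (File child : children) {
            if (child.isDirectory()) {
                count += countDictionaries(child); // any directory depth
            } else if (child.getName().endsWith(DIC_SUFFIX)) {
                count++; // any file name, as long as the extension matches
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Pass the dictionary home as the first argument, e.g. /data/paoding/dic; defaults to the current directory.
        File home = new File(args.length == 0 ? "." : args[0]);
        System.out.println(countDictionaries(home) + " dictionary file(s) under " + home.getPath());
    }
}
```

A result of zero corresponds to the situation in which 2.0.1 throws PaodingAnalysisException with the hint to check paoding.dic.home.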
          ]]>
        </description>
        <pubDate>Wed, 08 Aug 2007 14:31:22 +0800</pubDate>
        <link>http://analysis.group.javaeye.com/group/blog/110148</link>
        <guid>http://analysis.group.javaeye.com/group/blog/110148</guid>
      </item>
      </channel>
</rss>