Lucene Tokenization Source Code Analysis

Analysis Flow

Let's use the Elasticsearch _analyze DSL to look at what an analyzer is made of:

# An analyzer consists of char_filter (zero or more) + tokenizer (exactly one) + filter (zero or more)
# char_filter: transforms characters of the original text
# tokenizer: splits the received text into a series of tokens
# filter: adds to, removes from, or modifies the tokens produced by the tokenizer
POST _analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5"
      ]
    },
    {
      "type": "pattern_replace",
      "pattern": """(\d+)-(?=\d)""",
      "replacement": "$1_"
    }
  ],
  "tokenizer": {
    "type": "whitespace"
  },
  "filter": [
    {
      "type": "lowercase"
    },
    {
      "type": "stop"
    }
  ],
  "text": "My license plate is ٢٥-٠١٥"
} 
# response
{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "license",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "plate",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "25_015",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    }
  ]
}
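In Lucene itself, a roughly equivalent pipeline can be assembled with CustomAnalyzer; the SPI names "whitespace", "lowercase" and "stop" refer to the corresponding tokenizer/filter factories. This is only a sketch of the tokenizer/filter part; char filters could be attached via addCharFilter in the same style:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerSketch {
  public static Analyzer build() throws IOException {
    // whitespace tokenizer + lowercase filter + stop filter,
    // mirroring the tokenizer/filter part of the _analyze request above
    return CustomAnalyzer.builder()
        .withTokenizer("whitespace")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop")
        .build();
  }
}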

Code Analysis

CharFilter

(Figure: CharFilter class diagram)

From the class hierarchy, CharFilter is itself a subclass of java.io.Reader, and its input field is also a java.io.Reader, so a CharFilter is essentially a delegating wrapper (proxy) around a java.io.Reader. The method being delegated is java.io.Reader#read(). Concrete subclasses apply their own logic in that method to transform the characters of the original text.

public abstract class CharFilter extends Reader {
  // holds the incoming text
  protected final Reader input;

  // constructor
  public CharFilter(Reader input) {
    super(input);
    this.input = input;
  }

  // subclasses change the original text, so offsets must be corrected back to positions in the original input
  protected abstract int correct(int currentOff);

  // get the corrected offset, chaining the correction up through any wrapped CharFilters
  public final int correctOffset(int currentOff) {
    final int corrected = correct(currentOff);
    return (input instanceof CharFilter)
        ? ((CharFilter) input).correctOffset(corrected)
        : corrected;
  }
}
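As a concrete illustration, below is a minimal sketch (a class invented for this article, not part of Lucene) of a CharFilter that deletes '_' characters and implements correct() so that token offsets reported downstream still point into the original text:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.CharFilter;

public final class DropUnderscoreCharFilter extends CharFilter {
  // for every deleted char, remember the output offset where the deletion
  // happened and the cumulative number of deleted chars up to that point
  private final List<Integer> correctionOffsets = new ArrayList<>();
  private final List<Integer> correctionDeltas = new ArrayList<>();
  private int outputCharCount = 0;
  private int removedCharCount = 0;

  public DropUnderscoreCharFilter(Reader input) {
    super(input);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int written = 0;
    while (written < len) {
      int c = input.read();
      if (c == -1) {
        return written == 0 ? -1 : written;
      }
      if (c == '_') {
        // drop the char and record where the deletion occurred
        removedCharCount++;
        correctionOffsets.add(outputCharCount);
        correctionDeltas.add(removedCharCount);
        continue;
      }
      cbuf[off + written++] = (char) c;
      outputCharCount++;
    }
    return written;
  }

  @Override
  protected int correct(int currentOff) {
    // shift the output offset right by the number of chars deleted before it
    int delta = 0;
    for (int i = 0; i < correctionOffsets.size() && correctionOffsets.get(i) <= currentOff; i++) {
      delta = correctionDeltas.get(i);
    }
    return currentOff + delta;
  }
}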

Tokenizer && TokenFilter

(Figure: Tokenizer and TokenFilter class diagram)

Tokenizer: extends TokenStream. Its input field receives input from a CharFilter or a Reader, and its output is the stream of tokens produced by tokenization.

TokenFilter: also extends TokenStream, but its input is a TokenStream (i.e., its upstream is a Tokenizer or another TokenFilter). Its job is to further process the tokens produced by that upstream Tokenizer or TokenFilter.

// This class adds extra fields to enforce the setReader -> reset -> close consuming workflow, preventing stale data from leaking in and avoiding memory leaks
public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input = ILLEGAL_STATE_READER;

  /** Pending reader: not actually assigned to input until reset() */
  private Reader inputPending = ILLEGAL_STATE_READER;

  /**
   * Expert: Set a new reader on the Tokenizer. Typically, an analyzer (in its tokenStream method)
   * will use this to re-use a previously created tokenizer.
   */
  public final void setReader(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    setReaderTestPoint();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }

  @Override
  public void close() throws IOException {
    input.close();
    // LUCENE-2387: don't hold onto Reader after close, so
    // GC can reclaim
    inputPending = input = ILLEGAL_STATE_READER;
  }

  private static final Reader ILLEGAL_STATE_READER =
      new Reader() {
        @Override
        public int read(char[] cbuf, int off, int len) {
          throw new IllegalStateException(
              "TokenStream contract violation: reset()/close() call missing, "
                  + "reset() called multiple times, or subclass does not call super.reset(). "
                  + "Please see Javadocs of TokenStream class for more information about the correct consuming workflow.");
        }

        @Override
        public void close() {}
      };
}
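Put together, the contract looks like this (a minimal sketch, using WhitespaceTokenizer purely as an example):

Tokenizer tokenizer = new WhitespaceTokenizer();
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.setReader(new StringReader("Hello Lucene"));  // 1. text goes into inputPending
tokenizer.reset();                                      // 2. inputPending is moved into input
while (tokenizer.incrementToken()) {                    // 3. consume the tokens
  System.out.println(term);
}
tokenizer.end();                                        // 4. end-of-stream handling
tokenizer.close();                                      // 5. input is closed and replaced by ILLEGAL_STATE_READER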

TokenFilter is simpler; a quick look, there is not much to say:

public abstract class TokenFilter extends TokenStream {
  /** The source of tokens for this filter. */
  protected final TokenStream input;

  /** Construct a token stream filtering the given input. */
  protected TokenFilter(TokenStream input) {
    super(input);
    this.input = input;
  }

  @Override
  public void end() throws IOException {
    input.end();
  }

  @Override
  public void close() throws IOException {
    input.close();
  }

  @Override
  public void reset() throws IOException {
    input.reset();
  }
}

XXXAttribute

(Figure: Attribute class diagram)

The kinds of Attribute (i.e. the subclasses of Attribute):

(Figure: the Attribute subclasses)

The various Attributes act as the medium through which token information is passed between the Tokenizer and any number of TokenFilters.

// TokenStream is a subclass of AttributeSource
public abstract class TokenStream extends AttributeSource implements Closeable {

   /** A TokenStream that uses the same attributes as the supplied one. */
  protected TokenStream(AttributeSource input) {
    super(input);
    assert assertFinal();
  }
}

// In the AttributeSource implementation, the attributes maps store the Attribute interface -> implementation mapping
public class AttributeSource {

  private final Map<Class<? extends Attribute>, AttributeImpl> attributes;
  private final Map<Class<? extends AttributeImpl>, AttributeImpl> attributeImpls;
  private final State[] currentState;

  private final AttributeFactory factory;

  // reference the upstream attributes directly: this.attributes = input.attributes;
  public AttributeSource(AttributeSource input) {
    Objects.requireNonNull(input, "input AttributeSource must not be null");
    this.attributes = input.attributes;
    this.attributeImpls = input.attributeImpls;
    this.currentState = input.currentState;
    this.factory = input.factory;
  }


  public final <T extends Attribute> T addAttribute(Class<T> attClass) {
    AttributeImpl attImpl = attributes.get(attClass);
    if (attImpl == null) {
      if (!(attClass.isInterface() && Attribute.class.isAssignableFrom(attClass))) {
        throw new IllegalArgumentException(
            "addAttribute() only accepts an interface that extends Attribute, but "
                + attClass.getName()
                + " does not fulfil this contract.");
      }
      addAttributeImpl(attImpl = this.factory.createAttributeInstance(attClass));
    }
    return attClass.cast(attImpl);
  }
}
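Because the constructor shares the maps by reference, a Tokenizer and every TokenFilter stacked on top of it see the same attribute instances. A small sketch to illustrate (WhitespaceTokenizer and LowerCaseFilter are just examples here):

Tokenizer source = new WhitespaceTokenizer();
TokenStream filtered = new LowerCaseFilter(source);
// both streams share one attribute map, so the same instance is returned
CharTermAttribute fromTokenizer = source.addAttribute(CharTermAttribute.class);
CharTermAttribute fromFilter = filtered.addAttribute(CharTermAttribute.class);
System.out.println(fromTokenizer == fromFilter);  // true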

So in practice:

The Tokenizer actively writes values into the Attributes:

public final class StandardTokenizer extends Tokenizer {

  // this tokenizer generates three attributes:
  // term offset, positionIncrement and type
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  // the Tokenizer writes the attribute values each time incrementToken() is called
  @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    skippedPositions = 0;

    while (true) {
      int tokenType = scanner.getNextToken();

      if (tokenType == StandardTokenizerImpl.YYEOF) {
        return false;
      }

      if (scanner.yylength() <= maxTokenLength) {
        posIncrAtt.setPositionIncrement(skippedPositions + 1);
        scanner.getText(termAtt);
        final int start = scanner.yychar();
        offsetAtt.setOffset(correctOffset(start), correctOffset(start + termAtt.length()));
        typeAtt.setType(StandardTokenizer.TOKEN_TYPES[tokenType]);
        return true;
      } else
        // When we skip a too-long term, we still increment the
        // position increment
        skippedPositions++;
    }
  }
    
}

A TokenFilter then modifies the data held in those Attributes:

public class LowerCaseFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public LowerCaseFilter(TokenStream in) {
    super(in);
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length());
      return true;
    } else {
      return false;
    }
  }
}

Analyzer

The Analyzer's job is to combine the various CharFilters, the Tokenizer, and the TokenFilters into a single composite.

// createComponents() in StandardAnalyzer chains the components together
public final class StandardAnalyzer extends StopwordAnalyzerBase {

  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(maxTokenLength);
    TokenStream tok = new LowerCaseFilter(src);
    tok = new StopFilter(tok, stopwords);
    return new TokenStreamComponents(
        r -> {
          src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
          src.setReader(r);
        },
        tok);
  }
}
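Note that CharFilters are not wired in through createComponents() but through Analyzer#initReader(), which tokenStream() applies to the Reader before handing it to the Tokenizer (see the tokenStream() code below). A minimal sketch of a custom Analyzer (the class name is invented) combining all three kinds of components:

public class HtmlLowercaseAnalyzer extends Analyzer {

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // CharFilters wrap the Reader before the Tokenizer ever sees the text
    return new HTMLStripCharFilter(reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    result = new StopFilter(result, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new TokenStreamComponents(source, result);
  }
}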

Thread Safety of Analyzer

public abstract class Analyzer implements Closeable {

  private final ReuseStrategy reuseStrategy;
  // the basis of thread safety: per-thread storage for reusable components
  CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<>();
  // defaults to the global strategy, suitable when there is only one field type
  public Analyzer() {
    this(GLOBAL_REUSE_STRATEGY);
  }

  public final TokenStream tokenStream(final String fieldName,
                                       final Reader reader) {
    TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
    final Reader r = initReader(fieldName, reader);
    if (components == null) {
      components = createComponents(fieldName);
      reuseStrategy.setReusableComponents(this, fieldName, components);
    }
    components.setReader(r);
    return components.getTokenStream();
  }
  

  @Override
  public void close() {
    if (storedValue != null) {
      storedValue.close();
      storedValue = null;
    }
  }


  public static abstract class ReuseStrategy {

    /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
    public ReuseStrategy() {}


    public abstract TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName);

    public abstract void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components);

    protected final Object getStoredValue(Analyzer analyzer) {
      if (analyzer.storedValue == null) {
        throw new AlreadyClosedException("this Analyzer is closed");
      }
      return analyzer.storedValue.get();
    }

    protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
      if (analyzer.storedValue == null) {
        throw new AlreadyClosedException("this Analyzer is closed");
      }
      analyzer.storedValue.set(storedValue);
    }

  }
  // a single set of components shared globally, suitable for one field type
  public static final ReuseStrategy GLOBAL_REUSE_STRATEGY = new ReuseStrategy() {

    @Override
    public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
      return (TokenStreamComponents) getStoredValue(analyzer);
    }

    @Override
    public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
      setStoredValue(analyzer, components);
    }
  };

  // one set of components per field, suitable for multiple field types
  public static final ReuseStrategy PER_FIELD_REUSE_STRATEGY = new ReuseStrategy() {

    @SuppressWarnings("unchecked")
    @Override
    public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
      Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
      return componentsPerField != null ? componentsPerField.get(fieldName) : null;
    }

    @SuppressWarnings("unchecked")
    @Override
    public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
      Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
      if (componentsPerField == null) {
        componentsPerField = new HashMap<>();
        setStoredValue(analyzer, componentsPerField);
      }
      componentsPerField.put(fieldName, components);
    }
  };

}
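A small sketch of a subclass (the class name is invented) that opts into PER_FIELD_REUSE_STRATEGY, so each field name keeps its own reusable components inside the thread-local storedValue:

public class PerFieldDemoAnalyzer extends Analyzer {

  public PerFieldDemoAnalyzer() {
    // components are cached per field name (and per thread via CloseableThreadLocal)
    super(PER_FIELD_REUSE_STRATEGY);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    return new TokenStreamComponents(source, new LowerCaseFilter(source));
  }
}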

Usage example:

public class FCPAnalyzerDemo {

    public static void main(String[] args) {
//        Analyzer ikAnalyzer = new IKAnalyzer(false);
        DefaultConfig.getInstance();
        Dictionary.initial(DefaultConfig.getInstance());

        Analyzer analyzer = null;
        analyzer = new StandardAnalyzer();
        analyzer = new FCPAnalyzer();
//        analyzer = new WhitespaceAnalyzer();
        // obtain the Lucene TokenStream
        TokenStream ts = null;
        try {

            String txt = "人 民*?解放軍" + System.getProperty("line.separator");
            ts = analyzer.tokenStream("myfield", new StringReader(txt));
            // get the token offset attribute
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            // get the token term text attribute
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            // get the token type attribute
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            // position
            PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
            // reset the TokenStream (resets the StringReader)
            ts.reset();
            // iterate over the tokenization results
            int position = -1;
            while (ts.incrementToken()) {
                position += posIncrAtt.getPositionIncrement();
                System.out.println(offset.startOffset() + " - " + offset.endOffset() + ", position : " + position+ " , " + term.toString() + " | " + type.type());
            }
            // end(): the tokenizer resets the state in the various Attributes
            ts.end();
            // close(): the tokenizer closes the input and drops the reference so it can be GC'd
            ts.close();

            System.out.println("------------------------------------");

            ts = analyzer.tokenStream("myfield", new StringReader(txt));
            // reset the TokenStream (resets the StringReader)
            ts.reset();
            // iterate over the tokenization results
            position = -1;
            while (ts.incrementToken()) {
                position += posIncrAtt.getPositionIncrement();
                System.out.println(offset.startOffset() + " - " + offset.endOffset() + ", position : " + position+ " , " + term.toString() + " | " + type.type());
            }
            // end(): the tokenizer resets the state in the various Attributes
            ts.end();
            // close(): the tokenizer closes the input and drops the reference so it can be GC'd
            ts.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // release the resources held for the current thread
            if(analyzer != null){
                analyzer.close();
            }
        }
    }
}