Lucene tokenization source code analysis
Flow analysis
Let's use the ES _analyze DSL to see what an analyzer is made of:
# As the request below shows, an analyzer = char_filter (zero or more) + tokenizer (exactly one) + filter (zero or more)
# char_filter: transforms characters of the original text
# tokenizer: splits the received text into a stream of tokens
# filter: adds, removes, or modifies the tokens produced by the tokenizer
POST _analyze
{
"char_filter": [
{
"type": "mapping",
"mappings": [
"٠ => 0",
"١ => 1",
"٢ => 2",
"٣ => 3",
"٤ => 4",
"٥ => 5"
]
},
{
"type": "pattern_replace",
"pattern": """(\d+)-(?=\d)""",
"replacement": "$1_"
}
],
"tokenizer": {
"type": "whitespace"
},
"filter": [
{
"type": "lowercase"
},
{
"type": "stop"
}
],
"text": "My license plate is ٢٥-٠١٥"
}
# response
{
"tokens" : [
{
"token" : "my",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "license",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 1
},
{
"token" : "plate",
"start_offset" : 11,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "25_015",
"start_offset" : 20,
"end_offset" : 26,
"type" : "word",
"position" : 4
}
]
}
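The same pipeline can be assembled directly on the Lucene side. Below is a minimal sketch of my own (not from the ES docs) using CustomAnalyzer from the lucene-analysis-common module; "whitespace", "lowercase" and "stop" are the standard factory SPI names, and char filters could be attached the same way via addCharFilter. Recent Lucene (8.x/9.x) packages are assumed here and in the other sketches below.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerSketch {
  public static void main(String[] args) throws IOException {
    // roughly the Lucene-side equivalent of the _analyze request above
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("whitespace")   // exactly one tokenizer
        .addTokenFilter("lowercase")   // any number of token filters
        .addTokenFilter("stop")
        .build();
    System.out.println(analyzer);
    analyzer.close();
  }
}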
Code analysis
CharFilter

Looking at the class hierarchy, CharFilter is itself a subclass of java.io.Reader, and its input field is also a java.io.Reader, so a CharFilter is essentially a delegating wrapper around java.io.Reader; the method it delegates (and intercepts) is java.io.Reader#read(). Each concrete subclass applies its own character-level transformation to the original text.
public abstract class CharFilter extends Reader {
// the wrapped Reader that supplies the incoming text
protected final Reader input;
// constructor
public CharFilter(Reader input) {
super(input);
this.input = input;
}
// Subclasses change the content of the original text; to keep offsets pointing at the original input, they must be corrected
protected abstract int correct(int currentOff);
// returns the corrected offset, chaining the correction up through any wrapped CharFilters
public final int correctOffset(int currentOff) {
final int corrected = correct(currentOff);
return (input instanceof CharFilter)
? ((CharFilter) input).correctOffset(corrected)
: corrected;
}
}
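A quick illustration (a sketch assuming Lucene's built-in MappingCharFilter, one concrete CharFilter) of how a char filter rewrites the character stream while correctOffset() maps positions in the rewritten text back to the original input:
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CharFilterSketch {
  public static void main(String[] args) throws IOException {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("&", " and ");  // length-changing replacement, so offsets need correction

    CharFilter filter = new MappingCharFilter(builder.build(), new StringReader("C&A shop"));

    // read() hands out the transformed chars: "C and A shop"
    StringBuilder out = new StringBuilder();
    char[] buf = new char[64];
    for (int n = filter.read(buf, 0, buf.length); n != -1; n = filter.read(buf, 0, buf.length)) {
      out.append(buf, 0, n);
    }
    System.out.println(out);

    // "shop" starts at offset 8 in the transformed text; correctOffset() maps it
    // back to its offset in the original text (4), so highlighting still lines up
    System.out.println(filter.correctOffset(8));
    filter.close();
  }
}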
Tokenizer && TokenFilter

Tokenizer: extends TokenStream; its input field receives a CharFilter or a plain Reader, and its output is the stream of tokens produced by tokenization.
TokenFilter: also extends TokenStream, but its input is another TokenStream (i.e. the upstream is a Tokenizer or another TokenFilter). Its job is to further process the tokens produced upstream.
// The class below adds extra bookkeeping fields to enforce the setReader -> reset -> ... -> close usage contract, preventing stale data from leaking between uses and avoiding memory leaks
public abstract class Tokenizer extends TokenStream {
/** The text source for this Tokenizer. */
protected Reader input = ILLEGAL_STATE_READER;
/** Pending reader: not actually assigned to input until reset() */
private Reader inputPending = ILLEGAL_STATE_READER;
/**
* Expert: Set a new reader on the Tokenizer. Typically, an analyzer (in its tokenStream method)
* will use this to re-use a previously created tokenizer.
*/
public final void setReader(Reader input) {
if (input == null) {
throw new NullPointerException("input must not be null");
} else if (this.input != ILLEGAL_STATE_READER) {
throw new IllegalStateException("TokenStream contract violation: close() call missing");
}
this.inputPending = input;
setReaderTestPoint();
}
@Override
public void reset() throws IOException {
super.reset();
input = inputPending;
inputPending = ILLEGAL_STATE_READER;
}
@Override
public void close() throws IOException {
input.close();
// LUCENE-2387: don't hold onto Reader after close, so
// GC can reclaim
inputPending = input = ILLEGAL_STATE_READER;
}
private static final Reader ILLEGAL_STATE_READER =
new Reader() {
@Override
public int read(char[] cbuf, int off, int len) {
throw new IllegalStateException(
"TokenStream contract violation: reset()/close() call missing, "
+ "reset() called multiple times, or subclass does not call super.reset(). "
+ "Please see Javadocs of TokenStream class for more information about the correct consuming workflow.");
}
@Override
public void close() {}
};
}
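To make the contract concrete, here is a minimal Tokenizer sketch of my own (essentially what Lucene's KeywordTokenizer does): it emits the entire input as a single token, pulling characters from the inherited input field and reporting offsets through correctOffset() so that any upstream CharFilter corrections are applied.
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class WholeInputTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean done = false;

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    done = true;
    clearAttributes();
    int length = 0;
    char[] buffer = termAtt.buffer();
    while (true) {
      // read from the (possibly char-filtered) input
      int read = input.read(buffer, length, buffer.length - length);
      if (read == -1) {
        break;
      }
      length += read;
      if (length == buffer.length) {
        buffer = termAtt.resizeBuffer(length + 256);  // grow the term buffer as needed
      }
    }
    termAtt.setLength(length);
    // map offsets back through any CharFilter in front of us
    offsetAtt.setOffset(correctOffset(0), correctOffset(length));
    return length > 0;
  }

  @Override
  public void reset() throws IOException {
    super.reset();  // must call super.reset(): it swaps inputPending into input
    done = false;
  }
}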
A quick look at TokenFilter; it is straightforward:
public abstract class TokenFilter extends TokenStream {
/** The source of tokens for this filter. */
protected final TokenStream input;
/** Construct a token stream filtering the given input. */
protected TokenFilter(TokenStream input) {
super(input);
this.input = input;
}
@Override
public void end() throws IOException {
input.end();
}
@Override
public void close() throws IOException {
input.close();
}
@Override
public void reset() throws IOException {
input.reset();
}
}
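A matching TokenFilter sketch (again my own, purely illustrative): it drops tokens shorter than a minimum length, following the usual pattern of pulling from input.incrementToken() and deciding token by token. A production-quality version would also bump PositionIncrementAttribute for the dropped tokens, which is what Lucene's FilteringTokenFilter base class takes care of.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MinLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int minLength;

  public MinLengthFilter(TokenStream in, int minLength) {
    super(in);
    this.minLength = minLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // keep pulling tokens from upstream until one is long enough (or the stream ends)
    while (input.incrementToken()) {
      if (termAtt.length() >= minLength) {
        return true;
      }
    }
    return false;
  }
}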
XXXAttribute

The kinds of Attribute (the various Attribute subclasses)

The various Attributes are the medium through which token information is passed between the tokenizer and any number of TokenFilters.
// TokenStream is a subclass of AttributeSource
public abstract class TokenStream extends AttributeSource implements Closeable {
/** A TokenStream that uses the same attributes as the supplied one. */
protected TokenStream(AttributeSource input) {
super(input);
assert assertFinal();
}
}
// Looking at AttributeSource, the attributes map simply stores the Attribute-interface-to-implementation mapping
public class AttributeSource {
private final Map<Class<? extends Attribute>, AttributeImpl> attributes;
private final Map<Class<? extends AttributeImpl>, AttributeImpl> attributeImpls;
private final State[] currentState;
private final AttributeFactory factory;
// shares the upstream attributes by reference: this.attributes = input.attributes;
public AttributeSource(AttributeSource input) {
Objects.requireNonNull(input, "input AttributeSource must not be null");
this.attributes = input.attributes;
this.attributeImpls = input.attributeImpls;
this.currentState = input.currentState;
this.factory = input.factory;
}
public final <T extends Attribute> T addAttribute(Class<T> attClass) {
AttributeImpl attImpl = attributes.get(attClass);
if (attImpl == null) {
if (!(attClass.isInterface() && Attribute.class.isAssignableFrom(attClass))) {
throw new IllegalArgumentException(
"addAttribute() only accepts an interface that extends Attribute, but "
+ attClass.getName()
+ " does not fulfil this contract.");
}
addAttributeImpl(attImpl = this.factory.createAttributeInstance(attClass));
}
return attClass.cast(attImpl);
}
}
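Because this constructor shares the attributes map by reference, a Tokenizer and every TokenFilter chained on top of it resolve to the very same attribute instances. A small sketch (assuming the stock WhitespaceTokenizer and LowerCaseFilter) makes this visible:
import java.io.IOException;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SharedAttributeSketch {
  public static void main(String[] args) throws IOException {
    Tokenizer tokenizer = new WhitespaceTokenizer();
    TokenStream chain = new LowerCaseFilter(tokenizer);

    // Both calls resolve to the same AttributeImpl instance because the
    // attributes map is shared by reference across the whole chain.
    CharTermAttribute fromTokenizer = tokenizer.addAttribute(CharTermAttribute.class);
    CharTermAttribute fromChain = chain.addAttribute(CharTermAttribute.class);
    System.out.println(fromTokenizer == fromChain);  // true
    chain.close();
  }
}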
So in practice:
the Tokenizer actively writes values into the Attributes,
public final class StandardTokenizer extends Tokenizer {
// this tokenizer generates three attributes:
// term offset, positionIncrement and type
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
private final PositionIncrementAttribute posIncrAtt =
addAttribute(PositionIncrementAttribute.class);
private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
// the Tokenizer fills in the attribute values on each incrementToken() call
@Override
public final boolean incrementToken() throws IOException {
clearAttributes();
skippedPositions = 0;
while (true) {
int tokenType = scanner.getNextToken();
if (tokenType == StandardTokenizerImpl.YYEOF) {
return false;
}
if (scanner.yylength() <= maxTokenLength) {
posIncrAtt.setPositionIncrement(skippedPositions + 1);
scanner.getText(termAtt);
final int start = scanner.yychar();
offsetAtt.setOffset(correctOffset(start), correctOffset(start + termAtt.length()));
typeAtt.setType(StandardTokenizer.TOKEN_TYPES[tokenType]);
return true;
} else
// When we skip a too-long term, we still increment the
// position increment
skippedPositions++;
}
}
}
and a TokenFilter then modifies those values:
public class LowerCaseFilter extends TokenFilter {
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
public LowerCaseFilter(TokenStream in) {
super(in);
}
@Override
public final boolean incrementToken() throws IOException {
if (input.incrementToken()) {
CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length());
return true;
} else {
return false;
}
}
}
Analyzer
The Analyzer's job is to compose the various CharFilters, the Tokenizer, and the TokenFilters into one composite pipeline.
// StandardAnalyzer's createComponents() wires the components together
public final class StandardAnalyzer extends StopwordAnalyzerBase {
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new LowerCaseFilter(src);
tok = new StopFilter(tok, stopwords);
return new TokenStreamComponents(
r -> {
src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
src.setReader(r);
},
tok);
}
}
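Note that StandardAnalyzer itself has no char filter. Char filters are hooked in by overriding Analyzer#initReader, which wraps the incoming Reader before it ever reaches the Tokenizer (you can see the call in tokenStream() below). A hypothetical analyzer of my own that mirrors the ES pipeline from the beginning of this post might look like this:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public final class MappingWhitespaceAnalyzer extends Analyzer {
  private final NormalizeCharMap normMap;

  public MappingWhitespaceAnalyzer() {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("٢", "2");
    builder.add("٥", "5");
    this.normMap = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // char filters wrap the raw Reader before tokenization
    return new MappingCharFilter(normMap, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream result = new LowerCaseFilter(source);  // token filters wrap the tokenizer
    return new TokenStreamComponents(source, result);
  }
}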
Thread safety of Analyzer
public abstract class Analyzer implements Closeable {
private final ReuseStrategy reuseStrategy;
// the thread-safety guarantee: components are stored per thread
CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<>();
// defaults to the global reuse strategy, suitable when there is only one field type
public Analyzer() {
this(GLOBAL_REUSE_STRATEGY);
}
public final TokenStream tokenStream(final String fieldName,
final Reader reader) {
TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
final Reader r = initReader(fieldName, reader);
if (components == null) {
components = createComponents(fieldName);
reuseStrategy.setReusableComponents(this, fieldName, components);
}
components.setReader(r);
return components.getTokenStream();
}
@Override
public void close() {
if (storedValue != null) {
storedValue.close();
storedValue = null;
}
}
public static abstract class ReuseStrategy {
/** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
public ReuseStrategy() {}
public abstract TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName);
public abstract void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components);
protected final Object getStoredValue(Analyzer analyzer) {
if (analyzer.storedValue == null) {
throw new AlreadyClosedException("this Analyzer is closed");
}
return analyzer.storedValue.get();
}
protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
if (analyzer.storedValue == null) {
throw new AlreadyClosedException("this Analyzer is closed");
}
analyzer.storedValue.set(storedValue);
}
}
// one set of components reused globally (per thread), suitable for a single field type
public static final ReuseStrategy GLOBAL_REUSE_STRATEGY = new ReuseStrategy() {
@Override
public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
return (TokenStreamComponents) getStoredValue(analyzer);
}
@Override
public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
setStoredValue(analyzer, components);
}
};
// one set of components per field, suitable when there are multiple field types
public static final ReuseStrategy PER_FIELD_REUSE_STRATEGY = new ReuseStrategy() {
@SuppressWarnings("unchecked")
@Override
public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
return componentsPerField != null ? componentsPerField.get(fieldName) : null;
}
@SuppressWarnings("unchecked")
@Override
public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
Map<String, TokenStreamComponents> componentsPerField = (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
if (componentsPerField == null) {
componentsPerField = new HashMap<>();
setStoredValue(analyzer, componentsPerField);
}
componentsPerField.put(fieldName, components);
}
};
}
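The reuse strategy is picked in the Analyzer constructor; a subclass that analyzes several field types differently can opt into per-field caching. A small illustrative sketch (mine, not part of Lucene):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public final class PerFieldReuseAnalyzer extends Analyzer {
  public PerFieldReuseAnalyzer() {
    // cache one TokenStreamComponents per (thread, field) instead of per thread
    super(PER_FIELD_REUSE_STRATEGY);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    return new TokenStreamComponents(source, new LowerCaseFilter(source));
  }
}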
Usage example:
public class FCPAnalyzerDemo {
public static void main(String[] args) {
// Analyzer ikAnalyzer = new IKAnalyzer(false);
DefaultConfig.getInstance();
Dictionary.initial(DefaultConfig.getInstance());
Analyzer analyzer = null;
analyzer = new StandardAnalyzer();
analyzer = new FCPAnalyzer();
// analyzer = new WhitespaceAnalyzer();
// obtain Lucene's TokenStream for the field
TokenStream ts = null;
try {
String txt = "人 民*?解放軍" + System.getProperty("line.separator");
ts = analyzer.tokenStream("myfield", new StringReader(txt));
// offset attribute (start/end offsets of each token)
OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
// term text attribute
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
// token type attribute
TypeAttribute type = ts.addAttribute(TypeAttribute.class);
// position
PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
// reset the TokenStream (also resets the underlying StringReader)
ts.reset();
// iterate over the tokenization results
int position = -1;
while (ts.incrementToken()) {
position += posIncrAtt.getPositionIncrement();
System.out.println(offset.startOffset() + " - " + offset.endOffset() + ", position : " + position+ " , " + term.toString() + " | " + type.type());
}
// end(): the tokenizer sets the end-of-stream attribute values (e.g. the final offset)
ts.end();
// close(): the tokenizer closes the input and drops the reference so it can be GC'd
ts.close();
System.out.println("------------------------------------");
ts = analyzer.tokenStream("myfield", new StringReader(txt));
// reset the TokenStream (also resets the underlying StringReader)
ts.reset();
// iterate over the tokenization results
position = -1;
while (ts.incrementToken()) {
position += posIncrAtt.getPositionIncrement();
System.out.println(offset.startOffset() + " - " + offset.endOffset() + ", position : " + position+ " , " + term.toString() + " | " + type.type());
}
// end(): the tokenizer sets the end-of-stream attribute values (e.g. the final offset)
ts.end();
// close(): the tokenizer closes the input and drops the reference so it can be GC'd
ts.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
// release the per-thread resources held by the analyzer
if(analyzer != null){
analyzer.close();
}
}
}
}