💎一站式轻松地调用各大LLM模型接口,支持GPT4、智谱、星火、月之暗面及文生图 广告
# HTML Strip Character Filter 原文链接 : [https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-htmlstrip-charfilter.html](https://www.elastic.co/guide/en/elasticsearch/reference/5.0/analysis-htmlstrip-charfilter.html) 译文链接 :[http://www.apache.wiki/display/Elasticsearch/HTML+Strip+Character+Filter](http://www.apache.wiki/display/Elasticsearch/HTML+Strip+Character+Filter) 贡献者 : [谢雄](/display/~xiexiong),[ApacheCN](/display/~apachecn),[Apache中文网](/display/~apachechina) **HTML Strip Character Filter ** 会删除文本中**的HTML**元素,并且将**HTML**实体替换成对应的解码值(例如用**&**替换**&amp;**)。 **案例** ``` POST _analyze { "tokenizer": "keyword", "char_filter": [ "html_strip" ], "text": "<p>I&apos;m so <b>happy</b>!</p>" } ```  keyword分词器只会返回一个词元(**term**)。 上面的案例将会返回如下的词元(**term**): ``` [ \nI'm so happy!\n ] ``` 同样的案例,如果使用标准分词器(**standar tokenizer**)将会返回的词元(**term**)如下: ``` [ I'm, so, happy ] ``` ## 配置 **HTML Strip Character Filter** 接收如下参数: | 参数名称 | 说明 | | --- | --- | | `escaped_tags` | 会原始文本中保留的一系列标签 | 配置案例: ``` PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": ["my_char_filter"] } }, "char_filter": { "my_char_filter": { "type": "html_strip", "escaped_tags": ["b"] } } } } } POST my_index/_analyze { "analyzer": "my_analyzer", "text": "<p>I&apos;m so <b>happy</b>!</p>" } ``` 上面案例将返回如下结果: ``` [ \nI'm so <b>happy</b>!\n ] ```