Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2118634/

Date: 2020-10-29 19:32:45  Source: igfitidea

How to determine field-type for SOLR indexing?

Tags: java, php, sql, mysql, solr

Asked by memnoch_proxy

I have two fields in a MySQL table. One is a VARCHAR and holds the "headline" of a classified ad (this is a classifieds website). The other is a TEXT field that contains the "text" of the ad.

Two questions:
How should I determine how to index these two fields? (What field type, what classes to use, etc.)

Currently I have an "ad_id" as a unique identifier for each ad, for example "bmw_m3_82398292".
How can I make SOLR return this identifier whenever SOLR finds a query match? (The first part of the identifier is actually the headline field's content; the second part is a randomly chosen number.)

Thanks

Answered by memnoch_proxy

1. Schema

Your Solr schema is very much determined by your intended search behavior. In your schema.xml file, you'll see a bunch of choices like "text" and "string". They behave differently.

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

The string field type is a literal string match: the whole value is indexed as a single term. It operates like an exact equality comparison (=) in a SQL statement.
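
For the ad_id from the question, a string field looks like a natural fit, since the identifier should only ever match as a whole. A minimal sketch, assuming the field is named ad_id as in the question and that it also serves as the unique key (an assumption, not something the question confirms); stored="true" is what lets Solr return the value with each match:

```xml
<!-- Sketch only: belongs inside <fields> in schema.xml; "ad_id" is the name used in the question -->
<field name="ad_id" type="string" indexed="true" stored="true" required="true"/>

<!-- Assumption: ad_id doubles as the document key; this element sits directly under <schema> -->
<uniqueKey>ad_id</uniqueKey>
```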

<fieldtype name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldtype>

The text_ws field type tokenizes on whitespace and does nothing else. The big difference in the text field type is its filters for stop words, delimiters, and lower-casing. Notice that these filters are designated for both the Lucene index and the Solr query, so when you search a text field, Solr adapts the query terms with these filters to help find a match.

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter ..... />
    <filter ..... />
    <filter ..... />
  </analyzer>
</fieldtype>
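
The type above shows only an index-time analyzer. As a hedged sketch of how a matching query-time analyzer might sit next to it, the fragment below spells out both sides. The type name text_example and the choice of filters are illustrative assumptions; they do not reproduce the filters elided above.

```xml
<!-- Sketch only: "text_example" is a hypothetical type name -->
<fieldtype name="text_example" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's query terms, so both sides are normalized the same way -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Keeping the two analyzers symmetric is the simplest choice: a term lower-cased at index time can only be found if the query term is lower-cased as well.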

When indexing things like news stories, for example, you probably want to search for company names and headlines differently.

<field name="headline" type="text" />
<field name="coname" type="string" indexed="true" multiValued="false" omitNorms="true" />

The above example would allow you to do a search like q=coname:Intel+AND+headline:(processor+specifications) and retrieve only stories whose company name is exactly Intel.
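
Applied to the classifieds from the question, both the headline and the ad body would probably use the tokenized text type, so that individual words can match. A minimal sketch, assuming the fields are named headline and text (hypothetical names based on the question's description) and that both should be returned in results:

```xml
<!-- Sketch only: field names are assumptions taken from the question's wording -->
<!-- VARCHAR headline: tokenized so a search for a single word in the headline matches -->
<field name="headline" type="text" indexed="true" stored="true"/>
<!-- TEXT body of the ad: same analysis chain as the headline -->
<field name="text" type="text" indexed="true" stored="true"/>
```

With fields like these, a query such as q=headline:bmw should match any ad whose headline contains that word, provided the text type's filters normalize case as described above.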

If you wanted to search a range, see the range syntax at the end of the next section.

2. Result Fields

You can define a standard set of return fields in your RequestHandler:

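<requestHandler name="mumble" class="solr.DisMaxRequestHandler">
    <!-- Default parameters must be wrapped in a "defaults" list to take effect -->
    <lst name="defaults">
        <str name="fl">category,coname,headline</str>
    </lst>
</requestHandler>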

You may also define the desired fields in your query string, using the fl parameter:

/select?indent=on&version=2.2&q=coname%3AIn*&start=0&rows=10&fl=coname%2Cid&qt=standard

You can also select ranges in your query terms using the field:[x TO *] syntax. If you wanted to select certain ads by their date, you might build a query with

ad_date:[20100101 TO 20100201]

in your query terms. (There are many ways to search ranges; I'm presenting a method that uses integers instead of the Date class.)
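
For that range to be compared numerically rather than as text, ad_date needs an integer field type. A sketch, assuming dates are stored as yyyyMMdd integers and a Solr release that still ships solr.TrieIntField (newer releases use point types such as solr.IntPointField instead); the type name tint is borrowed from the example schema of those releases:

```xml
<!-- Sketch only: integer type suited to range queries; the class depends on your Solr version -->
<fieldtype name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

<!-- Assumption: ad_date holds the ad's date as an integer such as 20100115 -->
<field name="ad_date" type="tint" indexed="true" stored="true"/>
```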
