Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/7358137/

Time: 2020-09-19 00:18:26  Source: igfitidea

Oracle: Full text search with condition

Tags: sql, oracle, indexing, full-text-indexing, oracle-text

Asked by Clinton

I've created an Oracle Text index like the following:

create index my_idx on my_table (text) indextype is ctxsys.context; 

And I can then do the following:

select * from my_table where contains(text, '%blah%') > 0;

But let's say we have another column in this table, say group_id, and I wanted to do the following query instead:

select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;

With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.

Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:

create index my_idx on my_table (group_id, text) indextype is ctxsys.context; 

Kind of like a normal index, so a separate text search can be done for each group_id.

Is there a way to do something like this in Oracle (I'm using 10g if that is important)?

Edit (clarification)

Consider a table with one million rows and the following two columns among others, A and B, both numeric. Let's say there are 500 different values of A and 2000 different values of B, and each (A, B) combination is unique.

Now let's consider select ... where A = x and B = y

With separate indexes on A and B, as far as I can tell, Oracle will do an index search on B, which will return 500 different rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (aside from the database being lucky and finding the required row early).

Whereas an index on (A,B) is much more effective: it finds the one row in one index search.
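
As a minimal sketch of the two setups being compared, the statements below create both kinds of index on a demo table. The table name ab_demo and the index names are hypothetical and do not come from the question.

```sql
-- Hypothetical demo table; names are illustrative only.
create table ab_demo (
  a number not null,
  b number not null
);

-- Setup 1: separate single-column indexes.
create index ab_demo_a_idx on ab_demo (a);
create index ab_demo_b_idx on ab_demo (b);

-- Setup 2: one composite index covering both predicates.
create index ab_demo_ab_idx on ab_demo (a, b);

-- The lookup under discussion; the optimizer decides which index to use.
select * from ab_demo where a = :x and b = :y;
```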

Putting separate indexes on group_id and the text, I feel, only leaves the query generator with three options:

(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.

Whereas I want:

(4) Use the (group_id, "text") index to find the text index under the particular group_id, and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).

Accepted answer by Jon Heller

Oracle Text

1 - You can improve performance by creating the CONTEXT index with FILTER BY:

create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;

In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
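
A query against a FILTER BY index might look like the sketch below. It assumes the my_idx index created with filter by group_id and Oracle 11g or later, since FILTER BY is not available in 10g. The first form relies on the optimizer pushing the group_id predicate into the text index; the second states the filter inside the text query with the SDATA operator. Whether either form is actually evaluated inside the index should be confirmed from the execution plan.

```sql
-- Form 1: ordinary predicate; the optimizer may push it into the text index.
select * from my_table
where contains(text, 'blah') > 0
  and group_id = 43;

-- Form 2: filter expressed inside the text query via SDATA.
select * from my_table
where contains(text, 'blah and sdata(group_id = 43)') > 0;
```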

2 - CTXCAT indexes use "sub-indexes", and seem to work similarly to a multi-column index. This seems to be the option (4) you're looking for:

begin
  ctx_ddl.create_index_set('my_table_index_set');
  ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/

create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
    parameters('index set my_table_index_set');

select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0;

This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.

(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes like @NullUserException said.)

Multi-column indexes vs. index joins

For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described, and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
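
Figures like these can be read from SQL*Plus autotrace output. The sketch below is one way to collect them, not necessarily how the tests above were run; it assumes a hypothetical table ab_demo with columns a and b already exists, and that the session has the privileges autotrace needs (typically the PLUSTRACE role).

```sql
-- SQL*Plus: run the query, suppress the rows, and print run-time statistics.
set autotrace traceonly statistics

-- ab_demo is a hypothetical table; compare the "consistent gets" line
-- with separate indexes in place and again with a composite index.
select a, b
from ab_demo
where a = 1 and b = 1;

set autotrace off
```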

The reason is that Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Joining together a few thousand rows costs an inconsequential amount of CPU time.

However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.

Answer by eaolson

I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.

Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.

Answer by NullUserException

Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, i.e.:

CREATE INDEX my_group_idx ON my_table (group_id); 


Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.

DESC my_table;

Name     Null     Type      
-------- -------- --------- 
ID       NOT NULL NUMBER(4) 
GROUP_ID          NUMBER(4) 
TEXT              CLOB      

There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:

EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);

Here's how many rows each group_id has and the corresponding percentage:

GROUP_ID               COUNT                  PCT                    
---------------------- ---------------------- ---------------------- 
1                      1                      1                      
2                      2                      1                      
3                      4                      3                      
4                      8                      6                      
5                      16                     12                     
6                      32                     24                     
7                      64                     47                     
8                      9                      7         
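
A breakdown like the one above could be produced with a query along these lines. This is a sketch, not necessarily the query the answer used; the column aliases row_count and pct are made up for illustration.

```sql
-- Rows per group_id and each group's share of the table, rounded to a whole percent.
select group_id,
       count(*) as row_count,
       round(100 * ratio_to_report(count(*)) over ()) as pct
from my_table
group by group_id
order by group_id;
```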


Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:

SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;

You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
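
One common way to ask for such a plan is EXPLAIN PLAN together with DBMS_XPLAN, sketched here for the first query. The plan you get depends on your data and statistics, so it may differ from what is described above.

```sql
-- Store the optimizer's plan for the statement without executing it.
explain plan for
  select * from my_table where group_id = 1;

-- Print the most recently explained plan.
select * from table(dbms_xplan.display);
```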

Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).

SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';

Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:

SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;

Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:

SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';

Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
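
To check when statistics were last gathered for the table, a query like the following sketch can be used. It assumes the table is owned by the current user.

```sql
-- last_analyzed is null if statistics have never been gathered.
select table_name, num_rows, last_analyzed
from user_tables
where table_name = 'MY_TABLE';
```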

Answer by Tim

I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?

This inline view gets you the PK values of the rows in group 43:

             (
             select T.pkcol
             from T
             where group_id = 43
             )

If group_id has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:

           select * from T
           inner join
            (
             select T.pkcol
             from T
             where group_id = 43
             ) MyGroup

           on T.pkcol = MyGroup.pkcol
           where contains(text, '%blah%') > 0

Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.