Java 在另一个字节数组中查找一个字节数组的 indexOf

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/21341027/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-13 08:09:08  来源:igfitidea点击:

Find indexOf a byte array within another byte array

javasearchbytearray

提问by CL22

Given a byte array, how can I find within it, the position of a (smaller) byte array?

给定一个字节数组,如何在其中找到(较小的)字节数组的位置?

This documentationlooked promising, using ArrayUtils, but if I'm correct it would only let me find an individual byte within the array to be searched.

这个文档看起来很有希望,使用ArrayUtils,但如果我是正确的,它只会让我在要搜索的数组中找到一个单独的字节。

(I can't see it mattering, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate)

(我认为这不重要,但以防万一:有时搜索字节数组将是常规 ASCII 字符,其他时候它将是控制字符或扩展 ASCII 字符。因此使用字符串操作并不总是合适的)

The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found within the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.

大数组可能在 10 到大约 10000 字节之间,较小的数组可能在 10 左右。在某些情况下,我会在一次搜索中在较大的数组中找到几个较小的数组。有时我会想找到实例的最后一个索引而不是第一个。

采纳答案by dasblinkenlight

Java strings are composed of 16-bit chars, not of 8-bit bytes. A charcan hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.

Java 字符串由 16 位组成char,而不是 8 位组成byte。Achar可以容纳 a byte,因此您始终可以将字节数组转换为字符串,并使用indexOf: ASCII 字符、控制字符,甚至零字符都可以正常工作。

Here is a demo:

这是一个演示:

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));

This prints 7.

这打印7.

However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:

但是,考虑到您的大数组可能多达 10,000 个字节,而小数组只有 10 个字节,此解决方案可能不是最有效的,原因有两个:

  • It requires copying your big array into an array that is twice as large (same capacity, but with charinstead of byte). This triples your memory requirements.
  • String search algorithm of Java is not the fastest one available. You may get sufficiently faster if you implement one of the advanced algorithms, for example, the Knuth–Morris–Prattone. This could potentially bring the execution speed down by a factor of up to ten (the length of the small string), and will require additional memory that is proportional to the length of the small string, not the big string.
  • 它需要将你的大数组复制到一个两倍大的数组中(相同的容量,但用char代替byte)。这会使您的内存需求增加三倍。
  • Java 的字符串搜索算法不是最快的一种。如果您实现其中一种高级算法,例如Knuth-Morris-Pratt算法,您可能会变得足够快。这可能会使执行速度降低多达 10 倍(小字符串的长度),并且需要与小字符串而不是大字符串的长度成正比的额外内存。

回答by morpheus05

The simpelst way would be to compare each element:

最简单的方法是比较每个元素:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}  

Some tests:

一些测试:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}

As you updated your question: Java Strings are UTF-16 Strings, they do not care about the extended ASCII set, so you could use string.indexOf()

当您更新问题时:Java 字符串是 UTF-16 字符串,它们不关心扩展的 ASCII 集,因此您可以使用 string.indexOf()

回答by MonoThreaded

Is this what you are looking for?

这是你想要的?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

回答by 18446744073709551615

To save your time in testing:

为了节省您的测试时间:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

gives you code that works if you make computeFailure()static:

如果您将computeFailure() 设为静态,则为您提供有效的代码:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
    int[] failure = computeFailure(pattern);

    int j = 0;

    for (int i = 0; i < data.length; i++) {
        while (j > 0 && pattern[j] != data[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == data[i]) { 
            j++; 
        }
        if (j == pattern.length) {
            return i - pattern.length + 1;
        }
    }
    return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
    int[] failure = new int[pattern.length];

    int j = 0;
    for (int i = 1; i < pattern.length; i++) {
        while (j>0 && pattern[j] != pattern[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == pattern[i]) {
            j++;
        }
        failure[i] = j;
    }

    return failure;
    }
}

Since it is always wise to test the code that you borrow, you may start with:

由于测试你借用的代码总是明智的,你可以从:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}

回答by ant-depalma

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Google 的 Guava 提供了一个 Bytes.indexOf(byte[] array, byte[] target)。

回答by riversun

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
 }

so it results in

所以它导致

indexList=[6, 25]

So,u can find the index of byte[] in byte[]

所以,你可以在 byte[] 中找到 byte[] 的索引

Example here on Github at: https://github.com/riversun/finbin

Github 上的示例:https: //github.com/riversun/finbin

回答by Gustavo Mendoza

Copied almost identical from java.lang.String.

从 java.lang.String 复制的几乎相同。

indexOf(char[],int,int,char[]int,int,int)

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

回答by BullyWiiPlaza

Using the Knuth–Morris–Pratt algorithmis the most efficient way.

使用Knuth–Morris–Pratt algorithm是最有效的方法。

StreamSearcher.javais an implementation of it and is part of Twitter's elephant-birdproject.

StreamSearcher.java是它的一个实现,是Twitter's elephant-birdproject 的一部分。

It is not recommended to include this library since it is rather sizable for using just a single class.

不建议包含这个库,因为它对于只使用一个类来说相当大。

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice, Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Moore-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

回答by Benjamin Gillhofer

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. Hoped to find a better solution here, but i guess I'll stick with it. I think it is as efficent as it can get (quite fast and uses not much ram). It uses the input bytes as ring buffer, reads the next byte as soon as it does not match the boundary and writes the data after the first full cycle into the output stream. Of course can it be changed for byte arrays instead of streams, as asked in the question.

对于我目前正在处理的一个小型 HTTP 服务器,我想出了以下代码来查找多部分/表单数据请求中的边界。希望在这里找到更好的解决方案,但我想我会坚持下去。我认为它尽可能高效(相当快并且使用的内存不多)。它使用输入字节作为环形缓冲区,在与边界不匹配时立即读取下一个字节,并在第一个完整周期后将数据写入输出流。当然,如问题中所问,它可以针对字节数组而不是流进行更改。

    private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
    {
        try
        {
            String n = "--"+boundary;
            byte[] bc = n.getBytes("UTF-8");
            int s = bc.length;
            byte[] b = new byte[s];
            int p = 0;
            long l = 0;
            int c;
            boolean r;
            while ((c = is.read()) != -1)
            {
                b[p] = (byte) c;
                l += 1;
                p = (int) (l % s);
                if (l>p)
                {
                    r = true;
                    for (int i = 0; i < s; i++)
                    {
                        if (b[(p + i) % s] != bc[i])
                        {
                            r = false;
                            break;
                        }
                    }
                    if (r)
                        break;
                    os.write(b[p]);
                }
            }
            os.flush();
            return true;
        } catch(IOException e) {e.printStackTrace();}
        return false;
    }