bash: Randomly distribute files into train/test given a ratio

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/39210765/

Date: 2020-09-18 15:06:25  Source: igfitidea

Randomly distribute files into train/test given a ratio

Tags: python, bash, text-files, file-handling, train-test-split

Asked by M?nster

I am currently trying to write a setup script that sets up a workspace for me, so that I don't need to do it manually. I started doing this in bash, but quickly realized that would not work that well.

My next idea was to do it in Python, but I can't seem to find a proper way. My idea was to make a list (a .txt file with the paths of all the data files), shuffle this list, and then move each file to either my train dir or my test dir, according to the ratio.

But this is Python; isn't there a simpler way to do it? It seems like I am building an unnecessary workaround just to split the files.

Bash code:

# Partition data randomly into train and test. 
cd ${PATH_TO_DATASET}
SPLIT=0.5 #train/test split
NUMBER_OF_FILES=$(ls ${PATH_TO_DATASET} |  wc -l) ## number of directories in the dataset
even=1
echo ${NUMBER_OF_FILES}

if [ `echo "${NUMBER_OF_FILES} % 2" | bc` -eq 0 ]
then    
        even=1
        echo "Even is true"
else
        even=0
        echo "Even is false"
fi

echo -e "${BLUE}Seperating files in to train and test set!${NC}"

for ((i=1; i<=${NUMBER_OF_FILES}; i++))
do
    ran=$(python -c "import random;print(random.uniform(0.0, 1.0))")    
    if [[ ${ran} < ${SPLIT} ]]
    then 
        ##echo "test ${ran}"
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/test/
    else
        ##echo "train ${ran}"       
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/train/
    fi

    ##echo $(ls -d */|sed "${i}q;d")
done    

cd ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data
NUMBER_TRAIN_FILES=$(ls train/ |  wc -l)
NUMBER_TEST_FILES=$(ls test/ |  wc -l)

echo "${NUMBER_TRAIN_FILES} and ${NUMBER_TEST_FILES}..."
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})

if [[ ${even} = 1  ]] && [[ ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES} != ${SPLIT} ]]
    then 
    echo "Something need to be fixed!"
    if [[  $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES}) > ${SPLIT} ]]
    then
        echo "Too many files in the TRAIN set move some to TEST"
        cd train
        echo $(pwd)
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../test/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    else
        echo "Too many files in the TEST set move some to TRAIN"
        cd test
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../train/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    fi

fi   

My problem was the last part. Since I am picking the numbers at random, I cannot be sure that the data will be partitioned as hoped. My last if statement was meant to check whether the partition was done right and, if not, fix it. This was not possible since I am comparing floating-point numbers, and the solution in general became more like a quick fix.
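To illustrate the difference, here is a minimal Python sketch (not the asker's code) contrasting the per-file random draw used in the bash loop with a shuffle followed by a single cut. The function names `split_by_coin_flip` and `split_by_shuffle` are made up for this example. With the shuffle, the group sizes follow directly from the ratio, so no floating-point check or after-the-fact correction is needed.

```python
import random

def split_by_coin_flip(items, ratio, rng):
    """Assign each item independently; group sizes vary from run to run."""
    first, second = [], []
    for item in items:
        (first if rng.random() < ratio else second).append(item)
    return first, second

def split_by_shuffle(items, ratio, rng):
    """Shuffle once, then cut at a fixed index; group sizes are exact."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)  # seeded only so the demo is repeatable
names = ["dir%02d" % i for i in range(10)]

for _ in range(3):
    a, b = split_by_coin_flip(names, 0.5, rng)
    print("coin flip:", len(a), len(b))   # sizes may differ on each pass

a, b = split_by_shuffle(names, 0.5, rng)
print("shuffle:  ", len(a), len(b))       # 5 5 for ten items at ratio 0.5
```

The trade-off is that the shuffle needs the whole list up front, which is already the case here since the directory listing is known before anything is copied.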


Answered by alvas

scikit-learn comes to the rescue =)

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> y
[0, 1, 2, 3, 4]


# If I want 1/4 of the data for testing
# and I set a random seed of 42.
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]

See http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html


To demonstrate:


alvas@ubi:~$ mkdir splitfileproblem
alvas@ubi:~$ cd splitfileproblem/
alvas@ubi:~/splitfileproblem$ mkdir original
alvas@ubi:~/splitfileproblem$ mkdir train
alvas@ubi:~/splitfileproblem$ mkdir test
alvas@ubi:~/splitfileproblem$ ls
original  train  test
alvas@ubi:~/splitfileproblem$ cd original/
alvas@ubi:~/splitfileproblem/original$ ls
alvas@ubi:~/splitfileproblem/original$ echo 'abc' > a.txt
alvas@ubi:~/splitfileproblem/original$ echo 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat a.txt 
abc
alvas@ubi:~/splitfileproblem/original$ echo -e 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat b.txt 
def
ghi
alvas@ubi:~/splitfileproblem/original$ echo -e 'jkl' > c.txt
alvas@ubi:~/splitfileproblem/original$ echo -e 'mno' > d.txt
alvas@ubi:~/splitfileproblem/original$ ls
a.txt  b.txt  c.txt  d.txt

In Python:


alvas@ubi:~/splitfileproblem$ ls
original  test  train
alvas@ubi:~/splitfileproblem$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
>>> os.listdir('original')
['b.txt', 'd.txt', 'c.txt', 'a.txt']
>>> X = y= os.listdir('original')
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
>>> X_train
['a.txt', 'd.txt', 'b.txt']
>>> X_test
['c.txt']

Now move the files:


>>> for x in X_train:
...     os.rename('original/'+x , 'train/'+x)
... 
>>> for x in X_test:
...     os.rename('original/'+x , 'test/'+x)
... 
>>> os.listdir('test')
['c.txt']
>>> os.listdir('train')
['b.txt', 'd.txt', 'a.txt']
>>> os.listdir('original')
[]

See also: How to move a file in Python
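For reference, the interactive steps above could be wrapped in one function. This is only a sketch under a few assumptions: the name `move_split` and its parameters are made up for illustration, the standard library's `random.Random(seed).shuffle` is swapped in for `train_test_split` so that nothing beyond the standard library is needed, and the test count is rounded to the nearest integer (scikit-learn's own rounding may differ for small inputs).

```python
import os
import random
import shutil

def move_split(src_dir, train_dir, test_dir, test_size=0.25, seed=0):
    """Move every entry of src_dir into train_dir or test_dir.

    test_size is the fraction sent to test_dir; seed makes the shuffle repeatable.
    Returns (train_names, test_names).
    """
    # Sort first so the result depends only on the seed, not on listdir order.
    names = sorted(os.listdir(src_dir))
    random.Random(seed).shuffle(names)
    n_test = int(round(len(names) * test_size))

    for d in (train_dir, test_dir):
        if not os.path.isdir(d):
            os.makedirs(d)

    for i, name in enumerate(names):
        dest = test_dir if i < n_test else train_dir
        # shutil.move also works across filesystems, where os.rename can fail.
        shutil.move(os.path.join(src_dir, name), os.path.join(dest, name))

    return names[n_test:], names[:n_test]
```

With the four files from the demonstration, `move_split('original', 'train', 'test', test_size=0.25)` would leave three entries in `train` and one in `test`; which file ends up where depends on the seed.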


Answered by thodnev

Here's a first rough-cut solution in pure Python:

import os
import random
import sys

def splitdirs(files, dir1, dir2, ratio):
    """Link a random `ratio` share of `files` into dir1 and the rest into dir2."""
    shuffled = files[:]
    random.shuffle(shuffled)
    # int() keeps the slice index valid on Python 2, where round() returns a float
    num = int(round(len(shuffled) * ratio))
    to_dir1, to_dir2 = shuffled[:num], shuffled[num:]
    for d in dir1, dir2:
        if not os.path.exists(d):
            os.mkdir(d)
    for target_dir, group in ((dir1, to_dir1), (dir2, to_dir2)):
        for path in group:
            # link to the absolute path so the symlink resolves from inside target_dir
            os.symlink(os.path.abspath(path),
                       os.path.join(target_dir, os.path.basename(path)))

if __name__ == '__main__':
    if len(sys.argv) != 5:
        sys.exit('Usage: {} files.txt dir1 dir2 ratio'.format(sys.argv[0]))
    list_file, dir1, dir2, ratio = sys.argv[1:]
    with open(list_file) as fh:
        files = fh.read().splitlines()
    splitdirs(files, dir1, dir2, float(ratio))

[thd@aspire ~]$ python ./test.py ./files.txt dev tst 0.4

Here 40% of the files listed in files.txt go to the dev dir, and 60% go to tst.

It makes symlinks instead of copies; if you need real files, change os.symlink to shutil.copy2.

Answered by ghoti

Here's a simple example that uses bash's $RANDOM to move things to one of two target directories.

$ touch {1..10}
$ mkdir red blue
$ a=(*/)
$ RANDOM=$$
$ for f in [0-9]*; do mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"; done
1 -> red/1
10 -> red/10
2 -> blue/2
3 -> red/3
4 -> red/4
5 -> red/5
6 -> red/6
7 -> blue/7
8 -> blue/8
9 -> blue/9

This example starts with the creation of 10 files and two target directories. It sets an array to */, which expands to "all the directories within the current directory". It then runs a for loop with what looks like line noise in it. I'll break it apart for ya.

"${a[$((RANDOM/(32768/${#a[@]})))]}" is:

  • ${a[ ... the array "a",
  • $((...)) ... whose subscript is an integer math expression.
  • $RANDOM is a bash variable that generates a random(ish) number from 0 to 32767, and our formula divides the denominator of that ratio by:
  • ${#a[@]}, effectively multiplying RANDOM/32768 by the number of elements in the array "a".

The result of all this is that we pick a random array element, a.k.a. a random directory.
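The arithmetic is easy to check outside the shell. Below is a small Python sketch of the same integer formula; the helper name `bucket` is made up for this illustration, and `//` stands in for bash's integer division. It also shows an edge case to be aware of: when the number of targets does not divide 32768 evenly, the largest values of $RANDOM can produce an index one past the end of the array.

```python
RANDOM_RANGE = 32768  # $RANDOM yields integers 0..32767

def bucket(r, n):
    """Mirror the shell arithmetic RANDOM/(32768/n) using integer division."""
    return r // (RANDOM_RANGE // n)

# Two targets: the lower half of the range maps to index 0, the upper half to index 1.
print(bucket(0, 2), bucket(16383, 2), bucket(16384, 2), bucket(32767, 2))  # 0 0 1 1

# Three targets: 32768 // 3 == 10922 and 3 * 10922 == 32766,
# so the two largest values land on index 3, outside the valid indices 0..2.
print(bucket(32765, 3), bucket(32766, 3), bucket(32767, 3))  # 2 3 3
```

For two targets, as in the example above, every value of $RANDOM maps to a valid index.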


If you really want to work from your "list of files", and assuming you leave your list of potential targets in the array "a", you could replace the for loop with a while loop:


while IFS= read -r f; do
  mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"
done < /dir/file.txt

Now ... these solutions split results "evenly". That's what happens when you multiply the denominator. And because they're random, there's no way to ensure that your random numbers won't put all your files into a single directory. So to get a specific split, you need to be more creative.

Let's assume we're dealing with only two targets (since I think that's what you're doing). If you're looking for a 25/75 split, slice up the random number range accordingly.


$ declare -a b=([0]="red/" [8192]="blue/")
$ for f in {1..10}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; mv -v "$f" "$o"; done

Broken out for easier reading, here's what we've got, with comments:


declare -a b=([0]="red/" [8192]="blue/")

for f in {1..10}; do         # Step through our files...
  n=$RANDOM                  # Pick a random number, 0-32767
  for i in "${!b[@]}"; do    # Step through the indices of the array of targets
    [ $i -gt $n ] && break   # If the current index is > than the random number, stop.
    o="${b[i]}"              # If we haven't stopped, name this as our target,
  done
  mv -v "$f" "$o"            # and move the file there.
done

We define our split using the indices of an array. 8192 is 25% of 32768, the number of values $RANDOM can take (0 to 32767). You could split things however you like within this range, including among more than two targets.
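The same threshold lookup can be sketched in Python to see exactly which share of the range each target receives. The names `pick_target` and `exact_shares` are made up for this illustration, and `bisect` replaces the inner for loop; the thresholds play the role of the indices of array "b".

```python
import bisect

RANDOM_RANGE = 32768  # $RANDOM yields integers 0..32767

def pick_target(n, thresholds, targets):
    """Return the target whose threshold is the largest one not exceeding n."""
    return targets[bisect.bisect_right(thresholds, n) - 1]

def exact_shares(thresholds, targets):
    """Fraction of the 0..32767 range that maps to each target."""
    bounds = list(thresholds) + [RANDOM_RANGE]
    return {t: (bounds[i + 1] - bounds[i]) / float(RANDOM_RANGE)
            for i, t in enumerate(targets)}

thresholds = [0, 8192]         # sorted lower bounds, same values as the indices of "b"
targets = ["red/", "blue/"]

print(pick_target(8191, thresholds, targets))   # red/
print(pick_target(8192, thresholds, targets))   # blue/
print(exact_shares(thresholds, targets))        # {'red/': 0.25, 'blue/': 0.75}
```

Adding a third threshold and target to the two lists gives a three-way split in the same way. Note that these shares are probabilities per file, so the actual counts in any one run will still vary around them.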


If you want to test the results of this method, counting results in an array is a way to do it. Let's build a shell function to help with testing.


$ tester() { declare -A c=(); for f in {1..10000}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; ((c[$o]++)); done; declare -p c; }
$ declare -a b='([0]="red/" [8192]="blue/")'
$ tester
declare -A c='([blue/]="7540" [red/]="2460" )'
$ b=([0]="red/" [10992]="blue/")
$ tester
declare -A c='([blue/]="6633" [red/]="3367" )'

On the first line, we define our function. The second line sets the "b" array with a 25/75 split, then we run the function, whose output is the counter array. Then we redefine the "b" array with a 33/67 split (or so), and run the function again to demonstrate results.

So... While you certainly could use python for this, you can almost certainly achieve what you need with bash and a little math.