Running zcat on multiple files using a for loop

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26351242/

Time: 2020-09-18 11:33:18  Source: igfitidea

Tags: bash, for-loop, terminal, filenames

Asked by stewart6

I'm very new to terminal/bash, and perhaps this has been asked before but I wasn't able to find what I'm looking for perhaps because I'm not sure exactly what to search for to answer my question.

I'm trying to format some files for genetic analysis and while I could write out the following command for every sample file, I know there is a better way:

zcat myfile.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fastq.gz
zcat myfile.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > myfile.1.fastq.gz

I have the following files:

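-bash-3.2$ ls
BB001.fastq     BB013.fastq.gz  IN014.fastq.gz  RV006.fastq.gz  SL083.fastq.gz
BB001.fastq.gz  BB014.fastq.gz  INA01.fastq.gz  RV007.fastq.gz  SL192.fastq.gz
BB003.fastq.gz  BB015.fastq.gz  INA02.fastq.gz  RV008.fastq.gz  SL218.fastq.gz
BB004.fastq.gz  IN001.fastq.gz  INA03.fastq.gz  RV009.fastq.gz  SL276.fastq.gz
BB006.fastq.gz  IN002.fastq.gz  INA04.fastq.gz  RV010.fastq.gz  SL277.fastq.gz
BB008.fastq.gz  IN007.fastq.gz  INA05.fastq.gz  RV011.fastq.gz  SL326.fastq.gz
BB009.fastq.gz  IN010.fastq.gz  INA1M.fastq.gz  RV012.fastq.gz  SL392.fastq.gz
BB010.fastq.gz  IN011.fastq.gz  RV003.fastq.gz  SL075.fastq.gz  SL393.fastq.gz
BB011.fastq.gz  IN012.fastq.gz  RV004.fastq.gz  SL080.fastq.gz  SL395.fastq.gz
BB012.fastq.gz  IN013.fastq.gz  RV005.fastq.gz  SL081.fastq.gz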

and I would like to apply the two zcat functions to each file, creating two new files from each one without writing it out 50 times. I've used for loops in R quite a bit but don't know where to start in bash. I can say in words what I want and hopefully someone can give me a hand coding it!:

for FILENAME.fastq.gz in all files in cd

zcat FILENAME.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > FILENAME.2.fastq.gz
zcat FILENAME.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > FILENAME.1.fastq.gz

Thanks a ton in advance for your help!

*****EDIT*****

My notation was a bit off, here's the final, correct for loop:

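for fname in *.fastq.gz
do
    gzcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.2.fastq.gz"
    gzcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.1.fastq.gz"
done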

*****FOLLOWUP QUESTION*****

When I run the following:

for fname in *.1.fastq.gz
do
    cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done

I get this error:

cat: ./CleanedSeparate/XhoI/*.1.fastq.gz: No such file or directory
cat: ./CleanedSeparate/MseI/*.2.fastq.gz: No such file or directory

Obviously I'm not using * correctly. Any tips on where I'm going wrong?

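Answered by John1024

for fname in *.fastq.gz
do
    zcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >"${fname%.fastq.gz}.2.fastq.gz"
    zcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >"${fname%.fastq.gz}.1.fastq.gz"
done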

Key points:

  • for fname in *.fastq.gz

    This loops over every file in the current directory ending in .fastq.gz. If the files are in a different directory, then use:

    for fname in /path/to/*.fastq.gz

    where /path/to/ is whatever the path should be to get to those files.

  • zcat "$fname"

    This part is straightforward. It substitutes in the file name as the argument for zcat.

  • "${fname%.fastq.gz}.1.fastq.gz"

    This is a little bit trickier. To get the desired output file name, we need to insert the .1 into the original filename. The easiest way to do this in bash is to remove the .fastq.gz suffix from the file name with ${fname%.fastq.gz}, where the % is bash-speak for "remove what follows from the end". Then we add on the new suffix .1.fastq.gz and we have the correct file name.
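As a quick illustration of the suffix removal described above, the following sketch builds both output names from one input name. The sample name BB003.fastq.gz comes from the question's listing; the variable names base, out1, out2 and path are only illustrative.

```shell
# Illustrative only: derive the two output names from one input name.
fname="BB003.fastq.gz"

base="${fname%.fastq.gz}"      # % strips the shortest matching suffix
out1="${base}.1.fastq.gz"
out2="${base}.2.fastq.gz"

echo "$out1"
echo "$out2"

# The # and ## operators work the same way from the front of the string.
# Here ## strips the longest match of */, i.e. any leading directories.
path="./CleanedSeparate/XhoI/BB003.1.fastq.gz"
echo "${path##*/}"
```

This should print BB003.1.fastq.gz, BB003.2.fastq.gz and BB003.1.fastq.gz, one per line.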

Creating the new files in a different directory

As per the follow-up question, this does not work:

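for fname in *.1.fastq.gz
do
    cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done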

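The problem is that, in the for statement, the shell is looking for *.1.fastq.gz files in the current directory, but they aren't there. They are in ./CleanedSeparate/XhoI/. Because nothing matches, the pattern is passed along unexpanded, which is why the error messages contain a literal *. Instead, run: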

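dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
    base=${fname#$dir1/}
    base=${base%.1.fastq.gz}
    echo "base=$base"
    cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done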

Notice here that the for statement is given the correct directory in which to find the files.
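To see why the failing loop reported a literal * in its error, here is a minimal sketch that runs the same kind of loop in an empty scratch directory. The directory and the variable names seen and count are assumptions for illustration, and the -e test is one common guard rather than the only option.

```shell
# Sketch: what a for loop sees when its glob matches nothing.
workdir=$(mktemp -d)           # empty scratch directory (illustrative)
cd "$workdir" || exit 1

seen=""
for fname in *.1.fastq.gz
do
    seen="$fname"              # no match, so fname is the pattern itself
done
echo "without guard: $seen"

count=0
for fname in *.1.fastq.gz
do
    [ -e "$fname" ] || continue    # skip the unexpanded pattern
    count=$((count + 1))
done
echo "with guard: $count file(s)"
```

In bash specifically, running shopt -s nullglob before the loop makes an unmatched pattern expand to nothing, so the loop body never runs.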

Answered by paxdiablo

You can use something like:

for fspec in *.fastq.gz ; do
    echo "${fspec}"
done

That will simply echo the file being processed, but you can do anything you want with ${fspec}, including using it for a couple of zcat commands.

In order to get the root of the file name (for creating the other files), you can use the pattern deletion feature of bash to remove the trailing bit:

for fspec in *.fastq.gz ; do
    froot=${fspec%%.fastq.gz}
    echo "Transform ${froot}.fastq.gz into ${froot}.1.fastq.gz"
done

In addition, for your specific need, it appears you want to send the first four lines of an eight-line group to one file and the other four lines to a second file.

I tend to just use sed for simple tasks like that since it's likely to be faster. You can get the first line group (first four lines of the eight) with:

sed -n 'p;n;p;n;p;n;p;n;n;n;n'

and the second (second four lines of the eight) with:

sed -n 'n;n;n;n;p;n;p;n;p;n;p'

using the p (print current line) and n (get next line) commands.
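A quick way to check the two sed scripts is to feed them numbered lines. The sketch below uses seq to generate 16 lines (two groups of eight) in place of real FASTQ data; the variable names first and second are illustrative.

```shell
# Number 16 lines so each one is easy to track (two groups of eight).
first=$(seq 1 16 | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | tr '\n' ' ')
second=$(seq 1 16 | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | tr '\n' ' ')

echo "first:  $first"
echo "second: $second"
```

The first script should report lines 1 2 3 4 9 10 11 12 and the second lines 5 6 7 8 13 14 15 16.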

Hence the code then becomes something like:

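for fsrc in *.fastq.gz ; do
    fdst1="${fsrc%%.fastq.gz}.1.fastq.gz"
    fdst2="${fsrc%%.fastq.gz}.2.fastq.gz"
    echo "Processing ${fsrc}"

    # For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
    zcat "${fsrc}" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip >"${fdst1}"
    zcat "${fsrc}" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip >"${fdst2}"
done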
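The asker's final loop used gzcat rather than zcat, which suggests the command name varies between systems. One possible way around that, sketched here as an assumption rather than as part of either answer, is gzip -dc, which decompresses to standard output. The function name split_fastq, its arguments and the scratch data are all hypothetical; the awk conditions are a compact equivalent of the NR % 8 tests in the question.

```shell
# split_fastq is a hypothetical helper, not taken from the answers above.
# Usage: split_fastq INPUT.fastq.gz OUTDIR
split_fastq() {
    src=$1
    outdir=$2
    base=${src##*/}            # drop any leading directory
    base=${base%.fastq.gz}     # drop the suffix

    # Lines 1-4 of each group of 8 go to .1, lines 5-8 go to .2.
    gzip -dc "$src" | awk '(NR - 1) % 8 < 4'  | gzip > "$outdir/$base.1.fastq.gz"
    gzip -dc "$src" | awk '(NR - 1) % 8 >= 4' | gzip > "$outdir/$base.2.fastq.gz"
}

# Small self-check on generated data: 16 numbered lines, not real reads.
tmp=$(mktemp -d)
mkdir "$tmp/out"
seq 1 16 | gzip > "$tmp/demo.fastq.gz"
split_fastq "$tmp/demo.fastq.gz" "$tmp/out"

gzip -dc "$tmp/out/demo.1.fastq.gz" | tr '\n' ' '; echo
gzip -dc "$tmp/out/demo.2.fastq.gz" | tr '\n' ' '; echo
```

For real data the function would be called from a loop such as for f in /path/to/*.fastq.gz, with the output directory created beforehand.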