read in bash on tab-delimited file without empty fields collapsing

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors, citing the original address: http://stackoverflow.com/questions/4622355/

Time: 2020-09-17 23:12:45  Source: igfitidea

bash

Asked by Charles Duffy

I'm trying to read a multi-line tab-separated file in bash. The format is such that empty fields are expected. Unfortunately, the shell is collapsing together field separators which are next to each other, as so:

# IFS=$'\t'
# read one two three <<<$'one\t\tthree'
# printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <three> <>

...as opposed to the desired output of <one> <> <three>.

Can this be resolved without resorting to a separate language (such as awk)?

Accepted answer by Alex North-Keys

Here's an approach with some niceties:

  • input data from wherever becomes a pseudo-2D array in the main code (avoiding a common problem where the data is only available within one stage of a pipeline).
  • no use of awk, tr, or other external progs
  • a get/put accessor pair to hide the hairier syntax
  • works on tab-delimited lines by using param matching instead of IFS=

The code. file_data and file_input are just for generating input as though from an external command called from the script. data and cols could be parameterized for the get and put calls, etc., but this script doesn't go that far.

#!/bin/bash

file_data=( $'\t\t'       $'\t\tbC'     $'\tcB\t'     $'\tdB\tdC'   \
            $'eA\t\t'     $'fA\t\tfC'   $'gA\tgB\t'   $'hA\thB\thC' )
file_input () { printf '%s\n' "${file_data[@]}" ; }  # simulated input file
delim=$'\t'

# the IFS=$'\n' has a side-effect of skipping blank lines; acceptable:
OIFS="$IFS" ; IFS=$'\n' ; oset="$-" ; set -f
lines=($(file_input))                    # read the "file"
# restore noglob (only if it was off before) and IFS:
[[ $oset == *f* ]] || set +f ; IFS="$OIFS" ; unset oset

# the read-in data has (rows * cols) fields, with cols as the stride:
data=()
cols=0
get () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; echo "${data[$i]}" ; }
put () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; data[$i]="$3" ; }

# convert the lines from input into the pseudo-2D data array:
i=0 ; row=0 ; col=0
for line in "${lines[@]}" ; do
    line="$line$delim"
    while [ -n "$line" ] ; do
        case "$line" in
            *${delim}*) data[$i]="${line%%${delim}*}" ; line="${line#*${delim}}" ;;
            *)          data[$i]="${line}"            ; line=                     ;;
        esac
        (( ++i ))
    done
    [ 0 = "$cols" ] && (( cols = i )) 
done
rows=${#lines[@]}

# output the data array as a matrix, using the get accessor
for    (( row=0 ; row < rows ; ++row )) ; do
   printf 'row %2d: ' $row
   for (( col=0 ; col < cols ; ++col )) ; do
       printf '%5s ' "$(get $row $col)"
   done
   printf '\n'
done

Output:

$ ./tabtest 
row  0:                   
row  1:                bC 
row  2:          cB       
row  3:          dB    dC 
row  4:    eA             
row  5:    fA          fC 
row  6:    gA    gB       
row  7:    hA    hB    hC 
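
The inner loop of the script above can be pulled out into a small helper, which may be easier to reuse than the full pseudo-2D array. The following is only a sketch of that idea: the names split_line and fields are made up for illustration and do not appear in the answer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the answer): split one line on a delimiter
# into the global array "fields", keeping empty fields.
split_line() {
    local line=$1 delim=$2
    [[ -n $delim ]] || delim=$'\t'        # default to tab
    fields=()
    line+=$delim                          # terminate the last field as well
    while [[ -n $line ]]; do
        fields+=( "${line%%"$delim"*}" )  # text before the first delimiter
        line=${line#*"$delim"}            # drop that field and its delimiter
    done
}

# Process multi-line input row by row; IFS= and -r keep each line intact.
while IFS= read -r line; do
    split_line "$line"
    printf '%d fields: ' "${#fields[@]}"
    printf '<%s> ' "${fields[@]}"; printf '\n'
done <<< $'\tbB\t\nfA\t\tfC'
```

Both sample rows should report three fields, with the empty ones shown as <>. Unlike the lines=($(file_input)) approach, a plain while/read loop does not skip blank lines, which may or may not be what you want.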

Answer by DigitalRoss

Sure

IFS=,
echo $'one\t\tthree' | tr '\011' , | (
  read one two three
  printf '<%s> ' "$one" "$two" "$three"; printf '\n'
)

I've rearranged the example just a bit, but only to make it work in any POSIX shell.
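
One caveat on portability: the $'...' quoting used to produce the sample input is not understood by every sh implementation. A variant that avoids it might look like the sketch below, which relies only on printf and tr and assumes the data contains no commas.

```sh
# Generate the tab-separated sample with printf instead of $'...'.
printf 'one\t\tthree\n' | tr '\t' , | {
    # Comma is not IFS whitespace, so the empty middle field survives.
    IFS=, read -r one two three
    printf '<%s> ' "$one" "$two" "$three"; printf '\n'
}
```

The variables are set inside the pipeline's last stage, so in most shells they are not visible after the closing brace.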

Update: Yeah, it seems that white space is special, at least if it's in IFS. See the second half of this paragraph from bash(1):

   The shell treats each character of IFS as a delimiter, and  splits  the
   results of the other expansions into words on these characters.  If IFS
   is unset, or its value is exactly <space><tab><newline>,  the  default,
   then  any  sequence  of IFS characters serves to delimit words.  If IFS
   has a value other than the default, then sequences  of  the  whitespace
   characters  space  and  tab are ignored at the beginning and end of the
   word, as long as the whitespace character is in the value  of  IFS  (an
   IFS whitespace character).  Any character in IFS that is not IFS white-
   space, along with any adjacent IFS whitespace  characters,  delimits  a
   field.   A  sequence  of IFS whitespace characters is also treated as a
   delimiter.  If the value of IFS is null, no word splitting occurs.
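
The difference the man page describes can be seen side by side. This is a minimal demonstration using the same sample data as the question; the variable names a, b and c are arbitrary.

```bash
#!/usr/bin/env bash
# Tab is IFS whitespace: the two adjacent tabs act as a single delimiter.
IFS=$'\t' read -r a b c <<< $'one\t\tthree'
printf '<%s> ' "$a" "$b" "$c"; printf '\n'    # <one> <three> <>

# Comma is not IFS whitespace: each comma delimits a field on its own.
IFS=, read -r a b c <<< 'one,,three'
printf '<%s> ' "$a" "$b" "$c"; printf '\n'    # <one> <> <three>
```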

Answer by Paused until further notice.

It's not necessary to use tr, but it is necessary that IFS is a non-whitespace character (otherwise multiples get collapsed to singles as you've seen).

$ IFS=, read -r one two three <<<'one,,three'
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>

$ var=$'one\t\tthree'
$ var=${var//$'\t'/,}
$ IFS=, read -r one two three <<< "$var"
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>

$ idel=$'\t' odel=','
$ var=$'one\t\tthree'
$ var=${var//$idel/$odel}
$ IFS=$odel read -r one two three <<< "$var"
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>
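
Substituting a comma breaks down as soon as a field itself contains a comma. One possible workaround, sketched here under the assumption that the data never contains the ASCII unit separator (0x1f), is to use that control character as the output delimiter instead. The sample value and the variable names name, middle and city are made up for illustration.

```bash
#!/usr/bin/env bash
# Assumption: the data never contains the ASCII unit separator (0x1f).
idel=$'\t' odel=$'\x1f'
var=$'Smith, John\t\tLondon'
var=${var//$idel/$odel}                 # tabs -> unit separators
IFS=$odel read -r name middle city <<< "$var"
printf '<%s> ' "$name" "$middle" "$city"; printf '\n'
```

The comma inside the first field is left alone, and the empty second field is preserved because 0x1f is not IFS whitespace.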

Answer by Stefan Kriwanek

Here's a fast and simple function I use that avoids calling external programs or restricting the range of input characters. It works in bash only (I guess).

If it is to allow for more variables than fields, though, it needs to be modified along the lines of Charles Duffy's answer.

# Substitute for `read -r' that doesn't merge adjacent delimiters.
myread() {
        local input
        IFS= read -r input || return $?
        while [[ "$#" -gt 1 ]]; do
                IFS= read -r "$1" <<< "${input%%[$IFS]*}"
                input="${input#*[$IFS]}"
                shift
        done
        IFS= read -r "$1" <<< "$input"
}
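
The answer does not show a call, so here is one way the function might be used. It splits on whatever IFS holds when it is called, so the caller sets IFS to a tab first. The wrapper show_fields is hypothetical and only serves to keep the IFS change local.

```bash
#!/usr/bin/env bash
# myread as in the answer; each target variable name is passed as "$1".
myread() {
    local input
    IFS= read -r input || return $?
    while [[ "$#" -gt 1 ]]; do
        IFS= read -r "$1" <<< "${input%%[$IFS]*}"
        input="${input#*[$IFS]}"
        shift
    done
    IFS= read -r "$1" <<< "$input"
}

# Hypothetical caller: IFS is made local so the tab setting does not leak.
show_fields() {
    local IFS=$'\t'
    local one two three
    myread one two three
    printf '<%s> ' "$one" "$two" "$three"; printf '\n'
}

show_fields <<< $'one\t\tthree'
```

This works because the here-string after <<< is expanded before the temporary IFS= assignment takes effect, so [$IFS] still sees the caller's tab.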

Answer by Charles Duffy

I've written a function which works around this issue. This particular implementation is particular about tab-separated columns and newline-separated rows, but that limitation could be removed as a straightforward exercise:

read_tdf_line() {
    local default_ifs=$' \t\n'
    local n line element at_end old_ifs
    old_ifs="${IFS:-${default_ifs}}"
    IFS=$'\n'

    if ! read -r line ; then
        IFS="$old_ifs"   # restore IFS on the early-exit path too
        return 1
    fi
    at_end=0
    while read -r element; do
        if (( $# > 1 )); then
            printf -v "$1" '%s' "$element"
            shift
        else
            if (( at_end )) ; then
                # replicate read behavior of assigning all excess content
                # to the last variable given on the command line
                printf -v "$1" '%s\t%s' "${!1}" "$element"
            else
                printf -v "$1" '%s' "$element"
                at_end=1
            fi
        fi
    done < <(tr '\t' '\n' <<<"$line")

    # if other arguments exist on the end of the line after all
    # input has been eaten, they need to be blanked
    if ! (( at_end )) ; then
        while (( $# )) ; do
            printf -v "$1" '%s' ''
            shift
        done
    fi

    # reset IFS to its original value (or the default, if it was
    # formerly unset)
    IFS="$old_ifs"
}

Usage as follows:

# read_tdf_line one two three rest <<<$'one\t\tthree\tfour\tfive'
# printf '<%s> ' "$one" "$two" "$three" "$rest"; printf '\n'
<one> <> <three> <four       five>