PHP: how to find out if CSV file fields are tab delimited or comma delimited

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3395267/

Time: 2020-08-25 09:35:38  Source: igfitidea

php

Asked by SowmyAnil

How can I find out whether the fields of a CSV file are tab delimited or comma delimited? I need PHP validation for this. Can anyone please help? Thanks in advance.

Answered by Jay Bhatt

It's late to answer this question, but I hope it will help someone.

Here's a simple function that returns the delimiter of a file.

function getFileDelimiter($file, $checkLines = 2){
    $file = new SplFileObject($file);
    // Double quotes, so that "\t" is a real tab character
    $delimiters = array(
        ',',
        "\t",
        ';',
        '|',
        ':'
    );
    $results = array();
    $i = 0;
    while($file->valid() && $i < $checkLines){
        $line = $file->fgets();
        foreach ($delimiters as $delimiter){
            $regExp = '/['.$delimiter.']/';
            $fields = preg_split($regExp, $line);
            // Count the lines that this delimiter splits into several fields
            if(count($fields) > 1){
                if(!empty($results[$delimiter])){
                    $results[$delimiter]++;
                } else {
                    $results[$delimiter] = 1;
                }
            }
        }
        $i++;
    }
    // None of the candidate delimiters occurs in the checked lines
    if(empty($results)){
        return null;
    }
    $results = array_keys($results, max($results));
    return $results[0];
}

Use this function as shown below:

$delimiter = getFileDelimiter('abc.csv'); //Check 2 lines to determine the delimiter
$delimiter = getFileDelimiter('abc.csv', 5); //Check 5 lines to determine the delimiter

P.S. I used preg_split() instead of explode() because explode('\t', $value) won't give proper results: in single quotes, '\t' is a literal backslash followed by the letter t, not a tab character. explode("\t", $value), with double quotes, does split on tabs.
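
To illustrate the quoting difference mentioned in the P.S., here is a minimal sketch. The sample line is made up for this example and is not part of the original answer.

```php
<?php
// '\t' in single quotes is two characters: a backslash and the letter t.
// "\t" in double quotes is one character: a real tab.
$line = "id\tname\temail";   // made-up tab-separated sample line

$singleQuoted = explode('\t', $line);  // looks for backslash + t, finds none
$doubleQuoted = explode("\t", $line);  // splits on real tab characters

echo strlen('\t'), "\n";          // prints 2
echo strlen("\t"), "\n";          // prints 1
echo count($singleQuoted), "\n";  // prints 1 (the line was not split)
echo count($doubleQuoted), "\n";  // prints 3
```

So explode() itself is fine for tabs as long as the delimiter is written in double quotes.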

UPDATE: Thanks to @RichardEB for pointing out a bug in the code. I have updated it now.

Answered by Dream Ideation

Here's what I do.

  1. Parse the first 5 lines of a CSV file
  2. Count the number of delimiters [commas, tabs, semicolons and colons] in each line
  3. Compare the number of delimiters in each line. If you have a properly formatted CSV, then one of the delimiter counts will match in each row.

This will not work 100% of the time, but it is a decent starting point. At minimum, it will reduce the number of possible delimiters (making it easier for your users to select the correct delimiter).

/* Rearrange this array to change the search priority of delimiters */
$delimiters = array('tab'       => "\t",
                    'comma'     => ",",
                    'semicolon' => ";"
                    );

$handle = file( $file, FILE_IGNORE_NEW_LINES );  # Loads the CSV file into an array of lines
$rows   = array_slice( $handle, 0, 5 );          # Only look at the first 5 lines

$line = array();            # Stores the count of delimiters in each row

$valid_delimiter = array(); # Stores valid delimiters

# Count the number of delimiters in each row
foreach ( $rows as $i => $row ){
    foreach ( $delimiters as $key => $value ){
        $line[$key][$i] = substr_count( $row, $value );
    }
}

# Compare the count of delimiters in each line
foreach ( $line as $delimiter => $count ){

    # A delimiter is valid if it occurs in the first row
    # and its count is the same in every row
    if ( $count[0] > 0 and count( array_unique( $count ) ) === 1 ){
        $valid_delimiter[] = $delimiter;
    }

}//foreach

# Set default delimiter to comma
$delimiter = !empty( $valid_delimiter ) ? $valid_delimiter[0] : "comma";


/*  !!!! This is good enough for my needs since I have the priority set to "tab"
!!!! but you will want to let the user select from the delimiters in $valid_delimiter
!!!! if multiple delimiter counts match
*/

# The delimiter for the CSV
echo $delimiters[$delimiter];

Answered by relet

There is no 100% reliable way to determine this. What you can do is:

  • If you have a method to validate the fields you read, try to read a few fields using either separator and validate them against your method. If it breaks, use the other one.
  • Count the occurrences of tabs and commas in the file. Usually one is significantly higher than the other.
  • Last but not least: ask the user, and allow them to override your guesses.
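
The three suggestions above can be combined roughly as follows. This is only a sketch under stated assumptions: guessSeparator() and its parameters are hypothetical names, and the "validation method" is assumed to be "every sampled line splits into the same number of fields, and more than one".

```php
<?php
/**
 * Hypothetical helper: guess "," or "\t" from a few sample lines.
 * Returns null when the guess is undecidable, so the caller can ask the user.
 */
function guessSeparator(array $lines, ?string $userChoice = null): ?string
{
    if ($userChoice !== null) {
        return $userChoice;                 // the user's choice overrides any guess
    }

    $valid  = array();                      // separators that pass the validation
    $totals = array("," => 0, "\t" => 0);   // raw occurrence counts

    foreach (array_keys($totals) as $sep) {
        $fieldCounts = array();
        foreach ($lines as $line) {
            $fieldCounts[] = count(str_getcsv($line, $sep, '"', '\\'));
            $totals[$sep] += substr_count($line, $sep);
        }
        // Assumed validation: more than one field, same field count on every line
        if ($fieldCounts && min($fieldCounts) > 1 && count(array_unique($fieldCounts)) === 1) {
            $valid[] = $sep;
        }
    }

    if (count($valid) === 1) {
        return $valid[0];                   // exactly one separator validates
    }
    if ($totals[","] === $totals["\t"]) {
        return null;                        // undecidable: ask the user
    }
    // Fall back on the raw counts: the more frequent character wins
    return $totals[","] > $totals["\t"] ? "," : "\t";
}

$sep = guessSeparator(array("name\tcity", "Ann\tParis, France"));
echo $sep === "\t" ? "tab" : "comma", "\n";   // prints "tab"
```

Returning null instead of a default keeps the "ask the user" step explicit for the cases the heuristics cannot settle.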

Answered by Thomas Lang

I just count the occurrences of the different delimiters in the CSV file; the one with the most occurrences is probably the correct delimiter:

//The delimiters array to look through
$delimiters = array(
    'semicolon' => ";",
    'tab'       => "\t",
    'comma'     => ",",
);

//Load the csv file into a string
$csv = file_get_contents($file);
$res = array();
foreach ($delimiters as $key => $delim) {
    $res[$key] = substr_count($csv, $delim);
}

//reverse sort the values, so the first element is the most frequent delimiter
arsort($res);

reset($res);
$first_key = key($res);

return $delimiters[$first_key]; 

Answered by keir

In my situation, users supply CSV files which are then entered into an SQL database. They may save an Excel spreadsheet as a comma-delimited or tab-delimited file. A program converting the spreadsheet to SQL needs to automatically identify whether the fields are tab separated or comma separated.

Many Excel CSV exports have field headings as the first line. The heading text is unlikely to contain commas except as delimiters. For my situation, I count the commas and tabs in the first line and use whichever is more frequent to decide whether the file is comma delimited or tab delimited.
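
A minimal sketch of this first-line heuristic. The function name delimiterFromHeader() is hypothetical, and defaulting to a comma on a tie is an assumption, not something the answer specifies.

```php
<?php
// Hypothetical helper: decide between comma and tab from the heading line only.
function delimiterFromHeader(string $headerLine): string
{
    $commas = substr_count($headerLine, ",");
    $tabs   = substr_count($headerLine, "\t");

    // Assumption: a tie (including no delimiter at all) falls back to comma
    return $tabs > $commas ? "\t" : ",";
}

// Made-up heading lines for illustration
echo delimiterFromHeader("id\tfirst name\tcity") === "\t" ? "tab" : "comma", "\n"; // prints "tab"
echo delimiterFromHeader("id,first name,city") === "\t" ? "tab" : "comma", "\n";   // prints "comma"
```

In practice the heading line would come from a single fgets() call on the opened file, so the rest of the file never has to be read.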

Answered by Aaron Marton

I used @Jay Bhatt's solution for finding out a CSV file's delimiter, but it didn't work for me, so I applied a few fixes and added comments to make the process more understandable.

See my version of @Jay Bhatt's function:

function decide_csv_delimiter($file, $checkLines = 10) {

    // use PHP's built-in file class for reading the csv or txt file line by line
    $file = new SplFileObject($file);

    // array of predefined delimiters. Add any more delimiters if you wish
    $delimiters = array(',', '\t', ';', '|', ':');

    // store the total number of occurrences of each delimiter in an associative array
    $number_of_delimiter_occurences = array_fill_keys($delimiters, 0);

    $i = 0; // using 'i' for counting the number of rows parsed so far
    while ($file->valid() && $i < $checkLines) {

        $line = $file->fgets();

        foreach ($delimiters as $delimiter){

            $regExp = '/['.$delimiter.']/';
            $fields = preg_split($regExp, $line);

            // add this line's occurrences to the running total
            // (n fields means n - 1 delimiters)
            $number_of_delimiter_occurences[$delimiter] += count($fields) - 1;

        }

        $i++;
    }

    // get the key of the largest value from the array (comparing only the array values)
    // in our case, the array keys are the delimiters
    $results = array_keys($number_of_delimiter_occurences, max($number_of_delimiter_occurences));

    // in case the delimiter happens to be a 'tab' character ('\t'), return it in double quotes
    // otherwise when using it as a delimiter it will give an error,
    // because it is not recognised as the special character for the 'tab' key,
    // it shows up as a simple string composed of the '\' and 't' characters, which is not accepted when parsing csv files
    return $results[0] == '\t' ? "\t" : $results[0];
}

I personally use this function to help automatically parse files with PHPExcel, and it works beautifully and fast.

I recommend parsing at least 10 lines for the results to be more accurate. I personally use it with 100 lines, and it works fast, with no delays or lags. The more lines you parse, the more accurate the result gets.

NOTE: This is just a modified version of @Jay Bhatt's solution to the question. All credit goes to @Jay Bhatt.

Answered by Arturo Rossodivita

You can simply use the native PHP function fgetcsv() in this way:

function getCsvDelimeter($file)
{
    $delimiter = null;

    if (($handle = fopen($file, "r")) !== FALSE) {
        $delimiters = array(',', ';', '|', ':'); //Put all that need check

        foreach ($delimiters AS $item) {
            //Go back to the start, so every delimiter is tested against the first line
            rewind($handle);
            $fields = fgetcsv($handle, 0, $item, '"', '\\');

            //fgetcsv() returns an array with a single element if the delimiter is not found
            if (is_array($fields) && count($fields) > 1) {
                $delimiter = $item;

                break;
            }
        }

        fclose($handle);
    }

    return $delimiter;
}

Answered by Cyril N.

Thanks for all your input. I made mine using your tricks: preg_split, fgetcsv, loop, etc.

But I implemented something that was surprisingly not here: using fgets instead of reading the whole file, which is way better if the file is heavy!

Here's the code:

// Only needed for old Mac (\r) line endings; this setting is deprecated as of PHP 8.1
ini_set("auto_detect_line_endings", true);
function guessCsvDelimiter($filePath, $limitLines = 5) {
    if (!is_readable($filePath) || !is_file($filePath)) {
        return false;
    }

    $delimiters = array(
        'tab'       => "\t",
        'comma'     => ",",
        'semicolon' => ";"
    );

    $fp = fopen($filePath, 'r', false);
    $lineResults = array(
        'tab'       => array(),
        'comma'     => array(),
        'semicolon' => array()
    );

    $lineIndex = 0;
    while ($lineIndex < $limitLines && ($line = fgets($fp)) !== false) {
        foreach ($delimiters as $key=>$delimiter) {
            // Parse the same line with each candidate delimiter
            $lineResults[$key][$lineIndex] = count(str_getcsv($line, $delimiter, '"', '\\')) - 1;
        }

        $lineIndex++;
    }
    fclose($fp);

    // Calculating average
    foreach ($lineResults as $key=>$entry) {
        $lineResults[$key] = count($entry) > 0 ? array_sum($entry)/count($entry) : 0;
    }

    arsort($lineResults);
    reset($lineResults);

    // If the two best averages are equal, fall back to comma
    $averages = array_values($lineResults);
    return ($averages[0] != $averages[1]) ? $delimiters[key($lineResults)] : $delimiters['comma'];
}

Answered by Douglas Leeder

Aside from the trivial answer that CSV files are always comma-separated (it's in the name), I don't think you can come up with any hard rules. Both TSV and CSV files are loosely enough specified that you can construct files that would be acceptable as either.

A\tB,C
1,2\t3

(Assuming \t == TAB)

How would you decide whether this is TSV or CSV?
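
To make the ambiguity concrete, here is a small sketch that parses the two example lines with both delimiters. It uses str_getcsv() with an explicit quote and escape character; the variable names are made up for this example.

```php
<?php
// The two ambiguous lines from the example, with real tab characters
$lines = array("A\tB,C", "1,2\t3");

$fieldCounts = array();
foreach (array("comma" => ",", "tab" => "\t") as $name => $sep) {
    foreach ($lines as $line) {
        $fieldCounts[$name][] = count(str_getcsv($line, $sep, '"', '\\'));
    }
    echo $name, ": ", implode(", ", $fieldCounts[$name]), " fields per line\n";
}
// prints:
// comma: 2, 2 fields per line
// tab: 2, 2 fields per line
```

Both readings give a consistent two-column table, so no count-based heuristic can tell them apart.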

Answered by SimonDowdles

When I output a TSV file, I write the tabs using \t, the same way one would write a line break as \n. That being said, I guess a method could be as follows:

<?php
// $file is the path to your file; get the source however you wish
$mysource = file_get_contents($file);
if (strpos($mysource, "\t") !== false) {
    // We have a tab separator
} else {
    // It might be CSV
}
?>

I guess this may not be the right approach, because you could have tabs and commas in the actual content as well. It's just an idea. Using regular expressions may be better, although I am not too clued up on that.