cannot properly parse tsv #35

Open
ondrejhlavacek opened this issue Jul 17, 2018 · 14 comments

@ondrejhlavacek
Member

php-csv says this file has only one line

demonstrated in #34

@ondrejhlavacek
Member Author

I'm digging into this further. It looks like the chr(0) enclosure is to blame:

$enclosure = !$this->getEnclosure() ? chr(0) : $this->getEnclosure();

null char enclosure

        $fh = fopen($filenameFrom, "r");
        while ($parsed = fgetcsv($fh, null, "\t", chr(0))) {
            var_dump($parsed);
        }
array(6) {
  [0]=>
  string(3) "218"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(18) "?
219	0	 	 	 	 
"
}

no enclosure

        $fh = fopen($filenameFrom, "r");
        while ($parsed = fgetcsv($fh, null, "\t")) {
            var_dump($parsed);
        }
array(6) {
  [0]=>
  string(3) "218"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(1) " "
}
array(6) {
  [0]=>
  string(3) "219"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(1) " "
}

@ondrejhlavacek
Member Author

When I split the script into fgets and str_getcsv, it produces a similarly odd error:

        $fh = fopen($filenameFrom, "r");
        while ($line = fgets($fh)) {
            var_dump($line);
            var_dump(str_getcsv($line, "\t", chr(0)));
        }
        fclose($fh);
array(6) {
  [0]=>
  string(3) "218"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(3) "?
"
}
string(15) "219	0	 	 	 	 
"
array(6) {
  [0]=>
  string(3) "219"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(3) "
"
}

The newlines and question marks in the last fields of the rows are odd.

@ondrejhlavacek
Member Author

str_getcsv accepts empty strings as arguments, so the chr(0) hack would not be needed:

        $fh = fopen($filenameFrom, "r");
        while ($line = fgets($fh)) {
            var_dump($line);
            var_dump(str_getcsv($line, "\t", "", ""));
        }
        fclose($fh);
string(15) "218	0	 	 	 	 
"
array(6) {
  [0]=>
  string(3) "218"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(1) " "
}
string(15) "219	0	 	 	 	 
"
array(6) {
  [0]=>
  string(3) "219"
  [1]=>
  string(1) "0"
  [2]=>
  string(1) " "
  [3]=>
  string(1) " "
  [4]=>
  string(1) " "
  [5]=>
  string(1) " "
}

@ondrejhlavacek
Member Author

Maybe the whole thing is worth reporting to the PHP bug tracker?

@Halama
Member

Halama commented Jul 17, 2018

But fgets can't be used, because it reads line by line, and a CSV can have newlines of any kind (CR, CR LF, LF) mixed inside the data. And fgetcsv does not accept an empty string as the enclosure.

Besides, several of these bugs have already been reported, e.g. https://bugs.php.net/bug.php?id=51496
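
The line-splitting problem mentioned above can be sketched roughly as follows. This is only an illustration on made-up in-memory data (the sample content and the names `$data`, `$lines` and `$records` are assumptions, not taken from the failing file): one quoted field contains a newline, so fgets and fgetcsv disagree on how many rows there are.

```php
<?php
// Hypothetical sample: a header row plus one record whose quoted field contains a newline.
$data = "id\tnote\n1\t\"first line\nsecond line\"\n";

$fh = fopen('php://memory', 'w+');
fwrite($fh, $data);

// Physical lines, as fgets sees them.
rewind($fh);
$lines = [];
while (($line = fgets($fh)) !== false) {
    $lines[] = $line;
}

// Logical records, as fgetcsv sees them with a real enclosure character.
rewind($fh);
$records = [];
while (($row = fgetcsv($fh, 0, "\t", '"', '\\')) !== false) {
    $records[] = $row;
}
fclose($fh);

printf("fgets lines: %d, fgetcsv records: %d\n", count($lines), count($records));
```

With this sample, fgets cuts the quoted field in half, while fgetcsv keeps it together as a single field of the second record.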

@Halama
Member

Halama commented Jul 17, 2018

https://csv.thephpleague.com/9.0/interoperability/rfc4180-field/ links to it as well and tries to fix it somehow. But it did not cope well with our test CSV: https://keboola.slack.com/archives/C02C3GZUS/p1518859662000040

@Halama
Member

Halama commented Jul 17, 2018

But in that test you are testing a CSV that has neither an enclosure nor an escape. That can never work reliably.

@Halama
Member

Halama commented Jul 17, 2018

If I set the enclosure to " it runs fine.

@Halama
Member

Halama commented Jul 17, 2018

We would need a larger sample where this caused a problem.

@ondrejhlavacek
Member Author

I'll try putting " into that configuration as the enclosure and see whether it passes.

@ondrejhlavacek
Member Author

Yes, it passed! OMG, I've been fighting with this all day :-(

@Halama
Member

Halama commented Jul 18, 2018

It depends on what comes later in that CSV, though. If it has no enclosure and a " suddenly appears somewhere in the data, it will break. But if they know the data is, for example, only numbers, then it's fine.
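
A small sketch of that failure mode, assuming a tab delimiter and " as the enclosure as in the workaround above. The two rows are invented for illustration, and the stray quote is placed at the start of a field, which is where the parser would take it for an opening enclosure:

```php
<?php
// Hypothetical rows; only the second one contains a stray quote.
$clean = "218\t0\tabc";
$dirty = "218\t\"0\tabc"; // quote at the start of the second field, never closed

$cleanFields = str_getcsv($clean, "\t", '"', '\\');
$dirtyFields = str_getcsv($dirty, "\t", '"', '\\');

printf("clean: %d fields, dirty: %d fields\n", count($cleanFields), count($dirtyFields));
```

The clean row splits into three fields. In the dirty row, the tab after the unclosed quote is read as data, so the row comes back with fewer fields.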

@ondrejhlavacek
Member Author

It did not occur in the current data, so hopefully it's fine!
