encoding/csv: add a way to limit bytes read per field/record #20169

Open
nussjustin opened this issue Apr 28, 2017 · 23 comments
@nussjustin
Contributor

nussjustin commented Apr 28, 2017

The csv.Reader has no way to limit the size of the fields read. In combination with LazyQuotes, this can lead to cases where a simple syntax error can make the Reader read everything until io.EOF into a single field.

Probably the simplest case (with LazyQuotes == true) is a quoted field where there is some other rune between the closing quote and the comma (e.g. a,"b" ,c). In this case the second field will contain all bytes until either a single quote followed by a comma or EOF is found. (See my comment in #8059)

This behaviour can lead to excessive memory usage or even OOM when reading broken CSV files.
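For illustration, a minimal sketch that reproduces the quoted-field case described above (the exact contents of the returned field depend on the Go version's csv implementation):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// The stray space after the closing quote in the second field triggers the
	// behaviour described above: with LazyQuotes the parser keeps consuming
	// bytes, including newlines, until it sees a quote followed by a comma or
	// newline, or hits EOF.
	input := "a,\"b\" ,c\nd,e,f\n"

	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = true

	rec, err := r.Read()
	fmt.Printf("%q %v\n", rec, err)
	// On a multi-gigabyte file with the same defect, the second field would
	// buffer (nearly) all of the remaining input in memory.
}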

To avoid this I propose to add a new, optional field to Reader to allow limiting the size of each field. When set, the Reader would return an error as soon as it hits the limit.

Alternatively the limit could be per record instead. This could help especially in situations where FieldsPerRecord == -1, because then a record can contain a different, basically unlimited number of fields, so a limit of 100 bytes per field doesn't help with e.g. missing newlines leading to records with many fields.

The new field would be specified as follows (as proposed by @bradfitz):

type Reader struct {
...
      // BytesPerField optionally specifies the maximum allowed number of bytes
      // per field. Zero means no limit.
      BytesPerField int
}
@nussjustin
Contributor Author

/cc @bradfitz @josharian

@josharian
Contributor

How about just hard-coding a limit, instead of adding API? I seriously doubt any reasonable csv file will contain a single field bigger than, say, 64k. We do something similar for, say, bufio.Scanner, with a note saying that those who find it problematic should switch to lower level reading.
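For reference, a sketch of how the bufio.Scanner limit referenced above behaves: tokens larger than bufio.MaxScanTokenSize (64 KiB by default) cause Scan to fail with bufio.ErrTooLong, and callers can opt in to larger tokens via Scanner.Buffer:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single "line" larger than the default limit (bufio.MaxScanTokenSize, 64 KiB).
	huge := strings.Repeat("x", bufio.MaxScanTokenSize+1)

	s := bufio.NewScanner(strings.NewReader(huge))
	for s.Scan() {
	}
	fmt.Println(s.Err()) // bufio.ErrTooLong: "token too long"

	// Callers who really need bigger tokens opt in explicitly via Buffer.
	s = bufio.NewScanner(strings.NewReader(huge))
	s.Buffer(make([]byte, 0, 64*1024), 1<<20) // allow tokens up to 1 MiB
	for s.Scan() {
	}
	fmt.Println(s.Err()) // <nil>
}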

@nussjustin
Contributor Author

That also works, but could break existing users (although I agree that there probably won't be any breaking code). In this case a per record limit would make even less sense (IMO) as CSV files can often have many fields (I have worked with files having hundreds or even thousands of fields per record) and then even small fields could trigger the limit.

In this case it's important to think about the limit. Too high and memory usage can still become "excessive"; too low and existing programs could break. I would guess that in most cases when reading CSV the memory usage isn't really important or restricted, and a limit of 32k or 64k would not be too high while still preventing most (if not all) cases of the problem.

A different variant of this could make the Reader only limit fields when LazyQuotes is set and a lazy quote was found. This would still solve the problem for the "common" case and not break users that either don't set LazyQuotes to true or don't have CSV files that trigger this behaviour.

@josharian
Contributor

A different variant of this could make the Reader only limit fields when LazyQuotes is set and a lazy quote was found.

This makes sense.

Related: #19019

@nussjustin
Contributor Author

I have a small patch ready that adds a limit for the LazyQuotes == true case when there is at least one lazy quote in a field, but, although I like the idea, I'm still not 100% sure this is the best approach.

I'd like to wait for a 3rd opinion on this.

@SamWhited SamWhited added this to the Proposal milestone May 10, 2017
@SamWhited SamWhited added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 10, 2017
@rsc
Contributor

rsc commented May 22, 2017

It seems OK to have a limit of some kind, but someone needs to propose a concrete API.

@rsc rsc added Proposal-Hold and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Jun 5, 2017
@rsc
Contributor

rsc commented Jun 5, 2017

On hold for concrete API.

@bradfitz
Contributor

bradfitz commented Jun 5, 2017

We already have encoding/csv.Reader.FieldsPerRecord int, so I propose:

type Reader struct {
...
      // BytesPerField optionally specifies the maximum allowed number of bytes
      // per field. Zero means no limit.
      BytesPerField int
}
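A hypothetical usage sketch of the proposed field (BytesPerField does not exist in encoding/csv; f, process, and the exact error returned are placeholders):

r := csv.NewReader(f)
r.BytesPerField = 64 << 10 // hypothetical: cap each field at 64 KiB

for {
	rec, err := r.Read()
	if err == io.EOF {
		break
	}
	if err != nil {
		// With the proposed field, a malformed file would fail here with
		// bounded memory use instead of buffering everything up to EOF
		// into a single field.
		log.Fatal(err)
	}
	process(rec) // process is a placeholder
}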

@nussjustin
Contributor Author

I'm in favor of Brad's proposal. It's probably the simplest solution and can be used even for cases where LazyQuotes is off (basically the case I always want and hope for...)

I can take this once the Go 1.10 tree is open.

(And sorry for the delay, I just found this in my list of issues to look at. Will set myself a reminder for after the Go 1.9 release...)

@nussjustin
Contributor Author

Updated the issue comment with what @bradfitz proposed.

@dsnet
Member

dsnet commented Oct 20, 2017

This class of issue is not specific to csv; it affects everything from tar to json and beyond. Each of these packages has cases where it can slurp in an unbounded amount of input. It would be a benefit if the proposed API is one that fits nicely with the other packages.

@ianlancetaylor
Contributor

What about something like a new encoding/limit package?

package limit

// Set sets the limit of the amount of data we should read for one reader for pkg.
// The pkg is an arbitrary string, conventionally the package name that needs a Reader.
// If pkg is the empty string, this sets the default limit for all packages.
// This controls the amount of data that Reader will permit.
// Setting the limit to zero permits any amount of data.
// The default for all packages is zero.
// This returns the old limit for the specified package.
func Set(pkg string, limit int) int

// Reader limits r for a given package.
// If there is a limit set for the package, this will return bytes from r up to the limit.
// If expected is not zero, and exceeds the limit, then after reading the limited number
// of bytes any remaining expected bytes from r will be read and discarded.
// If there is no limit for the given package, this simply returns r.
func Reader(pkg string, r io.Reader, expected int) io.Reader

This approach means that a client by default will read all data. A server can call limit.Set to impose a limit on the number of bytes that will be read by any limit-aware package. The pkg string serves as a key that permits a server to set different limits for different packages. The expected argument to Reader permits a package that knows how much data to expect to easily skip data past the limit. Packages that do not know how much data to expect may need a more complicated solution to skip past the data they are discarding.
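A hypothetical usage sketch of the proposed package (nothing here exists today; the sizes and package strings are arbitrary):

// In a server's setup code: cap any single encoding/csv read operation at
// 1 MiB, and give every other limit-aware package a 4 MiB default.
limit.Set("encoding/csv", 1<<20)
limit.Set("", 4<<20)

// Inside a limit-aware package (sketch): wrap the underlying reader before a
// potentially unbounded read. expected is 0 because csv does not know the
// size of a record in advance.
var underlying io.Reader // the caller-supplied reader
r := limit.Reader("encoding/csv", underlying, 0)
_ = r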

@nussjustin
Contributor Author

I think this is too loose, especially since "the amount of data we should read for one reader" depends on what you are working with. I imagine the Reader function would be used multiple times, depending on the package (e.g. for each record in a CSV file), and its behaviour would need to be documented for each package that uses it (e.g. does it limit the combined length of each csv record or of each field?).

Also, I don't like that this adds new global state that changes the behaviour of whole packages or types instead of only a single object. Maybe we can just define a new interface that all the concerned types can implement, which can then document their way of limiting/their behaviour on the method(s).

@ianlancetaylor
Contributor

New global state is awful but my guess is that having to add new API for every affected package is worse. Ultimately the issue is that servers need to limit the amount of memory they are going to allocate when reading potentially very large data. If the only way to limit that is to pass some argument, then that argument needs to be exposed at every API layer back up to the server.

If you don't like the reader idea then another approach would be to use a limited buffer to hold the incoming data. On the base assumption that this only applies to objects that are essentially []byte, we could invent a limited buffer that will start dropping bytes past some limit. But, again, using this implies global state or massive API changes.
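A minimal sketch of the kind of limited buffer described above (cappedBuffer is made up for illustration, not a proposal for a specific API):

// cappedBuffer accumulates bytes like a bytes.Buffer but silently drops
// anything written past max, so malformed input can never grow it unboundedly.
type cappedBuffer struct {
	buf []byte
	max int
}

func (b *cappedBuffer) Write(p []byte) (int, error) {
	n := len(p)
	if room := b.max - len(b.buf); room > 0 {
		if len(p) > room {
			p = p[:room]
		}
		b.buf = append(b.buf, p...)
	}
	// Report the full length so callers keep going; excess bytes are dropped,
	// matching the "start dropping bytes past some limit" idea above.
	return n, nil
}

func (b *cappedBuffer) Bytes() []byte { return b.buf }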

@dsnet
Member

dsnet commented Oct 20, 2017

I think addressing this somewhere on the io.Reader level is in the right direction, since that is the common denominator of every package that exhibits problems of this class.

I wonder if the approach is to define a special interface that satisfies io.Reader and allows the user of an io.Reader (i.e., csv, tar, json, etc.) to inform the io.Reader when it is about to perform an operation that may slurp an unbounded amount of memory.

type ReadLimiter interface {
    io.Reader

    // StartLimiter informs the Reader that the user intends to perform an operation
    // that may read an unbounded amount of memory from the Reader.
    StartLimiter()

    // StopLimiter informs the Reader that the potentially unbounded operation
    // has come to completion.
    StopLimiter()
}

Thus, the csv package will call r.StartLimiter() before reading each record, and call r.StopLimiter() after each record. If a given record exceeds the limit set in the implementation of ReadLimiter, then Read returns an error.

Similarly, the tar package can call StartLimiter and StopLimiter when it is about to perform operations that are potentially unbounded (within a tar.Reader.Next call). Same with gzip.Reader.readHeader (see #14639), and so on.

@ianlancetaylor
Contributor

@dsnet I'm not quite seeing it yet. Tell me more. What would an implementation of ReadLimiter look like? The goal here is to stop gzip.Reader, say, from allocating large amounts of memory. What action is the ReadLimiter going to take to prevent that from happening? Let me add that in some cases I think it may be desirable that we be able to discard the large object in the stream but still read subsequent objects.

@dsnet
Member

dsnet commented Oct 23, 2017

Consider the following simple implementation with the exported API:

package ioquota

// QuotaReader is an interface that consumers of io.Reader can type assert to
// in order to declare intent to perform read operations that may require
// an unbounded amount of resources.
type QuotaReader interface {
	io.Reader

	// WithQuota informs an io.Reader that the consumer intends to perform
	// an operation that may read an unbounded amount of memory from the reader.
	// It is the responsibility of the consumer to call done when the operation
	// is complete.
	//
	// When WithQuota is called, the io.Reader may impose some restriction on
	// the number of bytes that may be read until done is called.
	// When the quota is exceeded, ErrQuota is returned from Read.
	// Calling done lifts the quota restriction, even after ErrQuota has been returned.
	WithQuota() (done func())
}

func New(r io.Reader, quota int) QuotaReader

End users can do the following:

cr := csv.NewReader(ioquota.New(r, 64*kibiByte))
rec, err := cr.Read()
_ = err // err may be ErrQuota if the record is larger than 64KiB

zr, err := gzip.NewReader(ioquota.New(r, 64*kibiByte))
_ = err // err may be ErrTruncatedHeader, however zr can still be used

tr := tar.NewReader(ioquota.New(r, 64*kibiByte))
h, err := tr.Next()
_, _ = h, err // err may be ErrQuota if the header is larger than 64KiB
io.Copy(ioutil.Discard, tr) // This has no read limit, since the quota only applies to Next

The implementation within each standard library will be as follows:

package csv

func (r *Reader) readRecord(dst []string) ([]string, error) {
    // Every call to readRecord will be within the scope of a single WithQuota call
    // to ensure that no single record exceeds the given quota.
    if qr, ok := r.r.(ioquota.QuotaReader); ok {
        done := qr.WithQuota()
        defer done()
    }
    ...
}
package tar

func (tr *Reader) Next() (*Header, error) {
    // Every call to Next will be within the scope of a single WithQuota call
    // to ensure that no tar.Header exceeds the given quota.
    if qr, ok := tr.r.(ioquota.QuotaReader); ok {
        done := qr.WithQuota()
        defer done()
    }
    ...
}
package gzip

// ErrTruncatedHeader is returned by NewReader and Reader.Reset when the string fields in
// Header are truncated. Even when this error is returned, the Reader may still be used.
var ErrTruncatedHeader = errors.New("Header.Name or Header.Comment field is truncated")

// If the underlying reader is a QuotaReader, readString will call WithQuota prior to reading the string.
// If the quota is reached, readString will truncate the field and continue to discard the
// remainder of the string field.
func (z *Reader) readString() (s string, err error) {
    var done func()
    if qr, ok := z.r.(ioquota.QuotaReader); ok {
        done = qr.WithQuota()
    }

    var truncated bool
    for {
        b, err := z.r.ReadByte()
        if err == ioquota.ErrQuota {
            err = nil
            truncated = true
            done() // Allow future calls to ReadByte to succeed
            done = nil
        }
        if err != nil {
            ...        
        }
        if b == 0 {
            break // Null terminator
        }
        if truncated {
            continue // Avoid processing this byte
        }
        ...
    }
    if done != nil {
        done()
    }
    ...
}

This is the rough sketch I have; there are still a number of open questions. See the TODOs in my playground example. For example, the gzip example assumes that ReadByte is under the influence of WithQuota.
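For concreteness, here is a minimal sketch of one way ioquota.New could be implemented (an assumption on my part, not the playground code; it only intercepts Read, not ReadByte, and it assumes the QuotaReader interface above lives in the same package):

package ioquota

import (
	"errors"
	"io"
)

// ErrQuota is returned by Read while a quota window is active and exhausted.
var ErrQuota = errors.New("ioquota: read quota exceeded")

type quotaReader struct {
	r      io.Reader
	quota  int
	active bool
	used   int
}

// New wraps r so that, while a WithQuota window is active, at most quota
// bytes may be read before Read starts returning ErrQuota.
func New(r io.Reader, quota int) QuotaReader {
	return &quotaReader{r: r, quota: quota}
}

func (q *quotaReader) WithQuota() (done func()) {
	q.active, q.used = true, 0
	return func() { q.active = false }
}

func (q *quotaReader) Read(p []byte) (int, error) {
	if q.active {
		if q.used >= q.quota {
			return 0, ErrQuota
		}
		// Never hand the underlying reader more room than the quota allows.
		if rem := q.quota - q.used; len(p) > rem {
			p = p[:rem]
		}
	}
	n, err := q.r.Read(p)
	if q.active {
		q.used += n
	}
	return n, err
}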

@ianlancetaylor
Contributor

Seems promising. But also seems a bit awkward to use. And it's hard to translate the ioquota.New arguments from the original caller into a meaningful operation a few layers down. Are we measuring compressed data or uncompressed data? A streaming tar file with an unlimited number of small entries might be just as bad as one with a large entry of unlimited size. Can the same number fix both problems?

@dsnet
Member

dsnet commented May 17, 2018

This issue has since expanded beyond just csv, but I noticed that #5050 is more or less the same problem, but for image/{gif,jpeg,png}.

@JordyMoos

What is the status of this issue? Will this be fixed in csv?

@ianlancetaylor
Contributor

The status is that we haven't decided what to do here.

@markwaddle

markwaddle commented Sep 13, 2022

Considering that a file is buffered a record at a time, it seems that a useful control might be to limit the maximum record length. Has the following, or similar, been considered?

type Reader struct {
...
      // MaxBytesPerRecord optionally specifies the maximum allowed number of bytes
      // per record. Zero means no limit.
      MaxBytesPerRecord int
}

It is easy to implement, easy to understand as a package consumer, and it would mitigate the worst-case scenarios for processing large, malformed files.

@fsaintjacques

This is a very serious issue that leads to an OOM if an application uses encoding/csv on a "big" file with a missing quote.
