S3 parquet failures on Travis #22934

Closed
4 tasks
TomAugspurger opened this issue Oct 2, 2018 · 7 comments
Labels
CI Continuous Integration

Comments

@TomAugspurger
Contributor

TomAugspurger commented Oct 2, 2018

Edit: We've pinned moto to 1.3.4 for now, which seems to avoid the failures. Assuming the problem is in moto (which hasn't been verified), the remaining TODOs here are:

  • Reproduce locally with moto 1.3.6 or higher
  • Construct a minimal test case not using pandas, and ideally not using s3fs
  • report upstream to moto
  • unpin moto when it's fixed

https://travis-ci.org/pandas-dev/pandas/jobs/435861714#L2506

=================================== FAILURES ===================================
_____________________ TestParquetPyArrow.test_s3_roundtrip _____________________
[gw0] linux2 -- Python 2.7.15 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.io.test_parquet.TestParquetPyArrow object at 0x7f3966d04910>
df_compat =    A    B
0  1  foo
1  2  foo
2  3  foo
s3_resource = s3.ServiceResource(), pa = 'pyarrow'
    def test_s3_roundtrip(self, df_compat, s3_resource, pa):
        # GH #19134
        check_round_trip(df_compat, pa,
>                        path='s3://pandas-test/pyarrow.parquet')
pandas/tests/io/test_parquet.py:474: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:169: in check_round_trip
    compare(repeat)
pandas/tests/io/test_parquet.py:161: in compare
    actual = read_parquet(path, **read_kwargs)
pandas/io/parquet.py:303: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
pandas/io/parquet.py:132: in read
    path, _, _, should_close = get_filepath_or_buffer(path)
pandas/io/common.py:216: in get_filepath_or_buffer
    mode=mode)
pandas/io/s3.py:38: in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
../../../miniconda3/envs/pandas/lib/python2.7/site-packages/s3fs/core.py:335: in open
    s3_additional_kwargs=kw)
../../../miniconda3/envs/pandas/lib/python2.7/site-packages/s3fs/core.py:1143: in __init__
    info = self.info()
../../../miniconda3/envs/pandas/lib/python2.7/site-packages/s3fs/core.py:1161: in info
    refresh=refresh, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <s3fs.core.S3FileSystem object at 0x7f395fb5a890>
path = 'pandas-test/pyarrow.parquet', version_id = None, refresh = False
kwargs = {}, parent = 'pandas-test', bucket = 'pandas-test'
key = 'pyarrow.parquet'
    def info(self, path, version_id=None, refresh=False, **kwargs):
        """ Detail on the specific file pointed to by path.
    
            Gets details only for a specific key, directories/buckets cannot be
            used with info.
    
            Parameters
            ----------
            version_id : str, optional
                version of the key to perform the head_object on
            refresh : bool
                If true, don't look in the info cache
            """
        parent = path.rsplit('/', 1)[0]
    
        if not refresh:
            if path in self.dirs:
                files = self.dirs[path]
                if len(files) == 1:
                    return files[0]
            elif parent in self.dirs:
                for f in self.dirs[parent]:
                    if f['Key'] == path:
                        return f
    
        try:
            bucket, key = split_path(path)
            if version_id is not None:
                if not self.version_aware:
                    raise ValueError("version_id cannot be specified if the "
                                     "filesystem is not version aware")
                kwargs['VersionId'] = version_id
            out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
                                Key=key, **self.req_kw)
            out = {
                'ETag': out['ETag'],
                'Key': '/'.join([bucket, key]),
                'LastModified': out['LastModified'],
                'Size': out['ContentLength'],
                'StorageClass': "STANDARD",
                'VersionId': out.get('VersionId')
            }
            return out
        except (ClientError, ParamValidationError):
            logger.debug("Failed to head path %s", path, exc_info=True)
>           raise FileNotFoundError(path)
E           FileNotFoundError: pandas-test/pyarrow.parquet
../../../miniconda3/envs/pandas/lib/python2.7/site-packages/s3fs/core.py:478: FileNotFoundError
----------------------------- Captured stderr call -----------------------------
Exception requests.exceptions.ConnectionError: ConnectionError(u'Connection refused: PUT https://pandas-test.s3.amazonaws.com/pyarrow.parquet',) in <bound method S3File.__del__ of <S3File pandas-test/pyarrow.parquet>> ignored

Debugging now.

@TomAugspurger TomAugspurger added the CI Continuous Integration label Oct 2, 2018
@TomAugspurger
Contributor Author

Anyone able to reproduce this locally? This has failed twice on #22932, but I can't get a failure locally with what should be the same environment.

@pambot
Contributor

pambot commented Oct 2, 2018

I've gotten this error a number of times recently, always on CI. Honestly, I work around it with git commit --amend ; git push -f origin <branch>, which retriggers CI. My guess is that a connection timeout on S3 leaves the file that should have been written missing, and the read then raises the FileNotFoundError.

@TomAugspurger
Contributor Author

Right, there's some randomness to the failures.

We shouldn't be making a real HTTP request though. moto should be mocking HTTP requests at this point.
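
One way to check whether a request is escaping the mock is to make any real connection attempt fail loudly during the test. The following is only a minimal sketch using the standard library; `block_real_network` and `RealNetworkAccess` are hypothetical names, not part of pandas, s3fs, or moto, and the guard only covers `socket.socket.connect` (not `connect_ex` or other code paths).

```python
import socket
from contextlib import contextmanager
from unittest import mock


class RealNetworkAccess(RuntimeError):
    """Hypothetical error raised when code tries to open a real connection."""


@contextmanager
def block_real_network():
    """Make socket.socket.connect raise for the duration of the block."""

    def guarded_connect(self, address, *args, **kwargs):
        # Raise before any DNS lookup or TCP handshake takes place.
        raise RealNetworkAccess("unmocked connection attempt to %r" % (address,))

    with mock.patch.object(socket.socket, "connect", guarded_connect):
        yield


# Demonstration: the connect call is intercepted, so nothing leaves the machine.
with block_real_network():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("pandas-test.s3.amazonaws.com", 443))
    except RealNetworkAccess as exc:
        blocked = str(exc)
    finally:
        sock.close()
```

Wrapping the S3 round-trip test in such a guard would turn a silent fall-through to the real endpoint into an immediate error that names the host being contacted, which may be easier to debug than the later FileNotFoundError.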

@alimcmaster1
Member

alimcmaster1 commented Oct 2, 2018

Seems like quite a few PRs right now are experiencing this issue:
https://travis-ci.org/pandas-dev/pandas/jobs/436364580
https://travis-ci.org/pandas-dev/pandas/jobs/436358399

Possibly helpful:
#19134

@TomAugspurger
Contributor Author

Updated the original post with some details now that we've pinned moto.

Again, if anyone is able to reproduce the pandas failure locally with a newer moto version, let me know.

@h-vetinari
Contributor

h-vetinari commented Nov 12, 2018

Not sure this is related to the issues in the OP, but the title certainly fits the bill.

In #22225 and #23192, I've had a persistent ResourceWarning in the last few CI runs. At first I thought it was flaky, like those warnings used to be, but this time it stayed, and I can reproduce it locally (not with pytest pandas/tests/io/test_parquet.py, but at least with pytest pandas/tests/io).

For example in https://travis-ci.org/pandas-dev/pandas/jobs/453820311 or https://travis-ci.org/pandas-dev/pandas/jobs/453822449:

sys:1: ResourceWarning: unclosed <socket.socket fd=16, family=AddressFamily.AF_INET, type=2050, proto=0, laddr=('0.0.0.0', 0)>
sys:1: ResourceWarning: unclosed <socket.socket fd=15, family=AddressFamily.AF_INET, type=2050, proto=0, laddr=('0.0.0.0', 0)>

and

=============================== warnings summary ===============================
pandas/core/frame.py::pandas.core.frame.DataFrame.to_parquet
  /home/travis/build/pandas-dev/pandas/pandas/io/parquet.py:129: ResourceWarning: unclosed file <_io.BufferedReader name='df.parquet.gzip'>
    **kwargs).to_pandas()

There's also a stderr (or stdout) warning from the parser-tests surfacing somewhere:

..............................................................x...........................................................s....
........................................Skipping line 3: Expected 3 fields in line 3, saw 4
.......................................s.......................................................................................

I've narrowed the ResourceWarning down to the parquet-s3 tests, but couldn't figure out a way to fix it despite many tries.

(BTW, running moto 1.3.6 locally)
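
For narrowing down which call leaks a handle, a small helper that records ResourceWarning around a single callable can help. This is a sketch under assumptions: it relies on CPython's reference counting plus an explicit `gc.collect()` to finalize the unclosed object inside the recording block, and `resource_warnings_from`, `leaky_read`, and `clean_read` are hypothetical names used only for illustration.

```python
import gc
import os
import tempfile
import warnings


def resource_warnings_from(func):
    """Run func and return the ResourceWarning messages it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ResourceWarning)
        func()
        gc.collect()  # finalize unreachable objects so unclosed handles are reported
    return [str(w.message) for w in caught
            if issubclass(w.category, ResourceWarning)]


def leaky_read(path):
    handle = open(path, "rb")  # deliberately never closed
    handle.read()


def clean_read(path):
    with open(path, "rb") as handle:  # closed when the block exits
        handle.read()


# Compare a leaking reader with a well-behaved one on a throwaway file.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
try:
    leaks = resource_warnings_from(lambda: leaky_read(tmp_path))
    clean = resource_warnings_from(lambda: clean_read(tmp_path))
finally:
    os.remove(tmp_path)
```

Running individual test bodies through a helper like this, or running pytest with `-W error::ResourceWarning`, may show whether the unclosed socket and file come from the parquet-s3 tests themselves or from a fixture shared across pandas/tests/io.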

@jbrockmendel
Member

It looks like moto is no longer pinned in any of the ci/deps/ files. Is this closeable?

@jreback jreback closed this as completed Nov 6, 2019