Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GATK failing in FireCloud with google ExecutionException (caused by UnknownHostException) #5094

Closed
lbergelson opened this issue Aug 8, 2018 · 12 comments

Comments

@lbergelson
Copy link
Member

We're seeing a high rate of failures running BQSR pipelines in production with ExecutionExceptions.

The root cause seems to be an UnknownHostException thrown in the google storage api client.

java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.google.cloud.storage.StorageException: www.googleapis.com
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher.read(SeekableByteChannelPrefetcher.java:309)
	at htsjdk.samtools.seekablestream.SeekablePathStream.read(SeekablePathStream.java:86)
	at htsjdk.samtools.util.BlockCompressedInputStream.readBytes(BlockCompressedInputStream.java:567)
	at htsjdk.samtools.util.BlockCompressedInputStream.readBytes(BlockCompressedInputStream.java:556)
	at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:525)
	at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
	at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
	at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
	at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404)
	at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
	at htsjdk.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:490)
	at htsjdk.samtools.util.BinaryCodec.readInt(BinaryCodec.java:501)
	at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:198)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:829)
	at htsjdk.samtools.BAMFileReader$BAMFileIndexIterator.getNextRecord(BAMFileReader.java:981)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:803)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
	at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.advance(BAMFileReader.java:1034)
	at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.next(BAMFileReader.java:1024)
	at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.next(BAMFileReader.java:988)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
	at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextRecord(SamReaderQueryingIterator.java:114)
	at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.next(SamReaderQueryingIterator.java:151)
	at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.next(SamReaderQueryingIterator.java:29)
	at org.broadinstitute.hellbender.utils.iterators.SAMRecordToReadIterator.next(SAMRecordToReadIterator.java:29)
	at org.broadinstitute.hellbender.utils.iterators.SAMRecordToReadIterator.next(SAMRecordToReadIterator.java:15)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
	at org.broadinstitute.hellbender.Main.main(Main.java:233)
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.storage.StorageException: www.googleapis.com
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher$WorkUnit.getBuf(SeekableByteChannelPrefetcher.java:136)
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher.fetch(SeekableByteChannelPrefetcher.java:255)
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher.read(SeekableByteChannelPrefetcher.java:300)
	... 45 more
Caused by: com.google.cloud.storage.StorageException: www.googleapis.com
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:189)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.read(HttpStorageRpc.java:526)
	at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:127)
	at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:124)
	at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:94)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:54)
	at com.google.cloud.storage.BlobReadChannel.read(BlobReadChannel.java:124)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.read(CloudStorageReadChannel.java:114)
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher$WorkUnit.call(SeekableByteChannelPrefetcher.java:131)
	at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher$WorkUnit.call(SeekableByteChannelPrefetcher.java:104)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: www.googleapis.com
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
	at shaded.cloud_nio.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6133)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.read(HttpStorageRpc.java:505)
@lbergelson lbergelson changed the title GATK failing in production with google ExecutionException GATK failing in FireCloud with google ExecutionException Aug 8, 2018
@magicDGS
Copy link
Contributor

magicDGS commented Aug 8, 2018

Is this also related with the emails that we are receiving from Jenkins (e.g., https://gatk-jenkins.broadinstitute.org/job/GATK-BQSR/1359/)?

@lbergelson
Copy link
Member Author

@magicDGS I don't think so. Those tests have been failing forever. I think someone finally fixed the bug that was causing the email notifications to not happen. They're useless and annoying though, so I disabled that test... We haven't fixed it in years, so we're unlikely to do so now...

@magicDGS
Copy link
Contributor

magicDGS commented Aug 8, 2018

Ok, thanks for the feedback @lbergelson!

@ldgauthier
Copy link
Contributor

I've also seen UnknownHostException: metadata, which seems like it's probably related. My favorite part about that exception is This is likely because code is not running on Google Compute Engine.

java.util.concurrent.CompletionException: org.broadinstitute.hellbender.exceptions.UserException$CouldNotReadInputFile: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize  with exception: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from gs://fc-c64a0fcd-fb30-4a7e-bdc6-3c09a9286941/f6aeb0ce-044a-4b36-a5f0-d3fda62252a5/ReblockGVCF/66c24439-cfeb-4b85-bc02-aefe693177b6/call-GenotypeGVCF/NWD197223.vcf.gz
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.broadinstitute.hellbender.exceptions.UserException$CouldNotReadInputFile: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize  with exception: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from gs://fc-c64a0fcd-fb30-4a7e-bdc6-3c09a9286941/f6aeb0ce-044a-4b36-a5f0-d3fda62252a5/ReblockGVCF/66c24439-cfeb-4b85-bc02-aefe693177b6/call-GenotypeGVCF/NWD197223.vcf.gz
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$601(GenomicsDBImport.java:605)
	at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReadersInParallel(GenomicsDBImport.java:600)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.createSampleToReaderMap(GenomicsDBImport.java:491)
	at com.intel.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:602)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
	... 3 more
Caused by: java.util.concurrent.ExecutionException: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from gs://fc-c64a0fcd-fb30-4a7e-bdc6-3c09a9286941/f6aeb0ce-044a-4b36-a5f0-d3fda62252a5/ReblockGVCF/66c24439-cfeb-4b85-bc02-aefe693177b6/call-GenotypeGVCF/NWD197223.vcf.gz
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$601(GenomicsDBImport.java:602)
	... 8 more
Caused by: org.broadinstitute.hellbender.exceptions.UserException: Failed to create reader from gs://fc-c64a0fcd-fb30-4a7e-bdc6-3c09a9286941/f6aeb0ce-044a-4b36-a5f0-d3fda62252a5/ReblockGVCF/66c24439-cfeb-4b85-bc02-aefe693177b6/call-GenotypeGVCF/NWD197223.vcf.gz
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:640)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$600(GenomicsDBImport.java:593)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: ComputeEngineCredentials cannot find the metadata server. This is likely because code is not running on Google Compute Engine., for input source: gs://fc-c64a0fcd-fb30-4a7e-bdc6-3c09a9286941/f6aeb0ce-044a-4b36-a5f0-d3fda62252a5/ReblockGVCF/66c24439-cfeb-4b85-bc02-aefe693177b6/call-GenotypeGVCF/NWD197223.vcf.gz
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:97)
	at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:82)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:109)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:638)
	... 5 more
Caused by: com.google.cloud.storage.StorageException: ComputeEngineCredentials cannot find the metadata server. This is likely because code is not running on Google Compute Engine.
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:189)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:335)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:191)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:188)
	at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:94)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:54)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:188)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:202)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.fetchSize(CloudStorageReadChannel.java:183)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.<init>(CloudStorageReadChannel.java:78)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.create(CloudStorageReadChannel.java:68)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newReadChannel(CloudStorageFileSystemProvider.java:304)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:265)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at htsjdk.samtools.seekablestream.SeekablePathStream.<init>(SeekablePathStream.java:41)
	at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:101)
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:94)
	... 8 more
Caused by: java.io.IOException: ComputeEngineCredentials cannot find the metadata server. This is likely because code is not running on Google Compute Engine.
	at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:106)
	at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:149)
	at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:135)
	at shaded.cloud_nio.com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
	at com.google.cloud.http.HttpTransportOptions$1.initialize(HttpTransportOptions.java:157)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:300)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:333)
	... 24 more
Caused by: java.net.UnknownHostException: metadata
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
	at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
	at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:104)
	... 34 more

@droazen
Copy link
Collaborator

droazen commented Aug 10, 2018

See #4888, which is an older report for the same issue. As mentioned there, I think we should patch our fork of google-cloud-java to do a channel reopen on UnknownHostException for now as a quick fix. @jean-philippe-martin is eventually going to add an official configuration mechanism for clients of google-cloud-java to customize which errors should trigger a retry/reopen, which should provide a better way to deal with these errors as they crop up without having to modify the NIO library itself.

droazen added a commit that referenced this issue Aug 10, 2018
This brings in some additional retries for UnknownHostException and 502 errors,
and moves us from a fork in my personal github repository to the fork in
https://github.com/broadinstitute/google-cloud-java

Resolves #4888
Resolves #5094
droazen added a commit that referenced this issue Aug 13, 2018
This brings in some additional retries for UnknownHostException and 502 errors,
and moves us from a fork in my personal github repository to the fork in
https://github.com/broadinstitute/google-cloud-java

Resolves #4888
Resolves #5094
droazen added a commit to droazen/google-cloud-java that referenced this issue Aug 14, 2018
…nownHostException

We've frequently encountered both of these errors transiently in the wild (see
broadinstitute/gatk#4888 and
broadinstitute/gatk#5094

I also increased the maximum depth when inspecting nested exceptions looking for
reopenable errors from 10 to 20, as we've seen chains of exceptions that come very
close to the current limit of 10.
@droazen
Copy link
Collaborator

droazen commented Aug 14, 2018

Reopening -- @ldgauthier reports that this error still occurs even after the patch in #5099. With that patch, we are now retrying on UnknownHostException, but the retries are all failing:

[August 14, 2018 7:09:18 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 896.64 minutes.
Runtime.totalMemory()=3966238720
java.util.concurrent.CompletionException: org.broadinstitute.hellbender.exceptions.UserException$CouldNotReadInputFile: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize  with exception: com.google.cloud.storage.StorageException: All 20 reopens failed. Waited a total of 1918000 ms between attempts
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.broadinstitute.hellbender.exceptions.UserException$CouldNotReadInputFile: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize  with exception: com.google.cloud.storage.StorageException: All 20 reopens failed. Waited a total of 1918000 ms between attempts
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$614(GenomicsDBImport.java:605)
    at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReadersInParallel(GenomicsDBImport.java:600)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.createSampleToReaderMap(GenomicsDBImport.java:491)
    at com.intel.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:602)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    ... 3 more
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.storage.StorageException: All 20 reopens failed. Waited a total of 1918000 ms between attempts
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$614(GenomicsDBImport.java:602)
    ... 8 more
Caused by: com.google.cloud.storage.StorageException: All 20 reopens failed. Waited a total of 1918000 ms between attempts
    at com.google.cloud.storage.contrib.nio.CloudStorageRetryHandler.handleReopenForStorageException(CloudStorageRetryHandler.java:124)
    at com.google.cloud.storage.contrib.nio.CloudStorageRetryHandler.handleStorageException(CloudStorageRetryHandler.java:94)
    at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.checkAccess(CloudStorageFileSystemProvider.java:621)
    at java.nio.file.Files.exists(Files.java:2385)
    at htsjdk.tribble.util.ParsingUtils.resourceExists(ParsingUtils.java:419)
    at htsjdk.tribble.AbstractFeatureReader.isTabix(AbstractFeatureReader.java:222)
    at htsjdk.tribble.AbstractFeatureReader$ComponentMethods.isTabix(AbstractFeatureReader.java:228)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:638)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.lambda$getFeatureReadersInParallel$613(GenomicsDBImport.java:593)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
Caused by: com.google.cloud.storage.StorageException: ComputeEngineCredentials cannot find the metadata server. This is likely because code is not running on Google Compute Engine.
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:189)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:335)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:191)
    at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:188)
    at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:94)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:54)
    at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:188)
    at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.checkAccess(CloudStorageFileSystemProvider.java:614)
    ... 11 more
Caused by: java.io.IOException: ComputeEngineCredentials cannot find the metadata server. This is likely because code is not running on Google Compute Engine.
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:106)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:149)
    at shaded.cloud_nio.com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:135)
    at shaded.cloud_nio.com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
    at com.google.cloud.http.HttpTransportOptions$1.initialize(HttpTransportOptions.java:157)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:300)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:333)
    ... 17 more
Caused by: java.net.UnknownHostException: metadata
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
    at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:104)
    ... 27 more

@droazen droazen reopened this Aug 14, 2018
@droazen droazen changed the title GATK failing in FireCloud with google ExecutionException GATK failing in FireCloud with google ExecutionException (caused by UnknownHostException) Aug 14, 2018
@droazen
Copy link
Collaborator

droazen commented Aug 14, 2018

@jean-philippe-martin Can you comment on this error with your thoughts? Despite now doing a channel reopen on UnknownHostException in our fork of the NIO library, all reopens are failing, which implies that this error can't be recovered from via a simple retry. Could there be something wrong in our authentication setup?

@droazen
Copy link
Collaborator

droazen commented Aug 14, 2018

More info from @ldgauthier:

I’ve only been trying the same GenomicsDBImport over and over again.  I estimate it to take about
30 hours if it’s ever successful. The exception happens in different batches every time.  Sam F. 
said he saw the exception too but he could eventually resubmit his way through it.  His jobs are
shorter running.

So it seems like the error is nondeterministic, but can't be recovered from within the same VM instance / process.

@cwhelan
Copy link
Member

cwhelan commented Aug 14, 2018

@ldgauthier 's error sounds like what I saw before when trying to run the joint genotyping pipeline. When I spoke about it with @ruchim she said that based on some experiments she did and conversations with the production team she thought it was a symptom VM's running under PAPI; Slack excerpt:

rmunshi [9:59 AM]
Last night I reached out to the Pipelines API folks
I ran a script that outputs the external IP address of the VM every 5 mins
and I see that after about ~17 hours, the request fails with

`curl: (6) Could not resolve host: metadata.google.internal` (edited)

So we decided that there's an effective limit of about 17 hours for VMs managed by PAPI at which point either the network configuration or metadata process server on the VMs changes, causing these failures.

I worked around the issue by increasing my scatter interval count such that no tasks took longer than the ~17 hours that seems to be the critical point.

@ruchim
Copy link

ruchim commented Aug 15, 2018

@cwhelan @droazen we should co-ordinate trying to re-run Chris's workflow on a PAPI v2 backend.

@droazen
Copy link
Collaborator

droazen commented Aug 15, 2018

@ruchim Agreed -- feel free to grab some free time on our calendars.

@droazen droazen self-assigned this Aug 24, 2018
@droazen droazen modified the milestone: Engine-2Q2018 Aug 24, 2018
@droazen
Copy link
Collaborator

droazen commented Aug 24, 2018

This was resolved by a patch to PAPIv1 just released by Google (see https://partnerissuetracker.corp.google.com/issues/112704449). Closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants