Unable to read table using VM Managed Identity on Azure #1462

Closed
jhoekx opened this issue Jun 14, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@jhoekx
Contributor

jhoekx commented Jun 14, 2023

Environment

Delta-rs version: 0.10

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu 22.04
  • Other:

Bug

What happened:

>>> DeltaTable('abfss://mycontainer@mysa.dfs.core.windows.net/mypath')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sysadmin/venv/lib/python3.10/site-packages/deltalake/table.py", line 238, in __init__
    self._table = RawDeltaTable(
OSError: Generic MicrosoftAzure error: Error authorizing request: Error performing token request: response error "{"error":"invalid_resource","error_description":"AADSTS500011: The resource principal named https://storage.azure.com/.default was not found in the tenant named <MyTenant> This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You might have sent your authentication request to the wrong tenant.\r\nTrace ID: 947e54c9-1a2a-48f6-9872-42dfabaa7501\r\nCorrelation ID: 193525f9-926d-4c9c-bdd3-22ff90970715\r\nTimestamp: 2023-06-14 08:23:28Z","error_codes":[500011],"timestamp":"2023-06-14 08:23:28Z","trace_id":"947e54c9-1a2a-48f6-9872-42dfabaa7501","correlation_id":"193525f9-926d-4c9c-bdd3-22ff90970715","error_uri":"https://westeurope.login.microsoft.com/error?code=500011"}", after 0 retries: HTTP status client error (400 Bad Request) for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com%2F.default)

What you expected to happen:
The Delta Table can be opened.

How to reproduce it:
Create a VM on Azure with a Managed Identity
Assign the Storage Blob Data Reader role to the Identity in the container that contains the Delta Table
Create a virtualenv on the VM, try to open the table.

More details:
This was fixed in object_store already: apache/arrow-rs@2ec8571. An incorrect resource was requested.

The fixed version of object_store is already in DataFusion and should be in the 27.0 release. If it is not already planned, could we have a branch that prepares for the update to 27.0 and works with the current DataFusion master?

Reading the table using a SAS key works. Building the deltalake wheel with a patched object_store with the commit above also works.
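To make the "incorrect resource" concrete, here is a minimal sketch that rebuilds the token URL from the traceback using only the standard library. The endpoint and API version are copied from the error message; the corrected resource value (the storage identifier without the `/.default` suffix) is an assumption about what the upstream fix requests, and `imds_token_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint and API version as they appear in the error message above.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"
API_VERSION = "2019-08-01"


def imds_token_url(resource: str) -> str:
    """Build the token request URL for a given resource identifier."""
    query = urlencode({"api-version": API_VERSION, "resource": resource})
    return f"{IMDS_TOKEN_URL}?{query}"


# The value that was rejected with AADSTS500011 in the traceback.
rejected = imds_token_url("https://storage.azure.com/.default")

# Assumption: the plain storage resource identifier, without the scope-style suffix.
expected = imds_token_url("https://storage.azure.com")

print(rejected)
print(expected)
```

The first printed URL should match the one in the traceback character for character, which helps confirm that the failure comes from the `resource` parameter rather than from the table URI or the role assignment.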

@jhoekx jhoekx added the bug Something isn't working label Jun 14, 2023
@roeap
Collaborator

roeap commented Jun 14, 2023

Thanks for reporting @jhoekx!

Absolutely we can. Just to clarify, were you planning to contribute that, or hoping we set that up? :) Either case is fine, as I was looking to do that quite soon anyhow.

In the former case, though, one word of caution, which you may have figured out already: in the 0.6 release, direct support for aws-profile was dropped. That is a feature we likely want to keep supporting, so there may be a little extra work when migrating to the latest object_store.

jhoekx added a commit to jhoekx/delta-rs that referenced this issue Jun 16, 2023
This brings object_store 0.6, which fixes delta-io#1462.

Tokio needs to be updated, because DataFusion is using
JoinSet::spawn_blocking now.
@jhoekx
Contributor Author

jhoekx commented Jun 16, 2023

I got started updating to the current DataFusion master, but got stuck in JsonWriter.

   --> rust/src/writer/json.rs:320:25
    |
314 |         for (key, values) in self.divide_by_partition_values(values)? {
    |                              ---- has type `&JsonWriter` which is not `Send`
...
320 |                         .await,
    |                         ^^^^^^ await occurs here, with `self` maybe used later
...
337 |         }
    |         - `self` is later dropped here

I guess that is because JsonWriter is no longer Sync, which in turn is because ArrowWriter is no longer Sync in Parquet 41.

This is outside of my current Rust knowledge. Curious to see how one tackles that. If you have a clear pointer, I can try, otherwise I hope this was at least a bit helpful.

aws-profile is pretty much out of scope for me, as we don't have an AWS environment to test.

@Tom-Newton
Contributor

Tom-Newton commented Jul 4, 2023

It looks like #1504 should fix this (thanks @roeap). I'm very much looking forward to the first Python release that includes this.

@Tom-Newton
Contributor

Does anyone know when the next Python release might be? I might build a wheel from main to unblock myself if it's not expected particularly soon.

@guushoekman

@jhoekx In your initial comment you said you called DeltaTable('abfss://mycontainer@mysa.dfs.core.windows.net/mypath') without any storage_options. When using managed identities is this indeed how it should work?

I've been trying to figure out what the storage_options should be, trying this from an azure function. If I don't use any storage_options (like you did in your example), I get a different error message than you do:

Generic MicrosoftAzure error: Error authorizing request: Error performing token request:
response error "500 Internal Server Error", after 10 retries: HTTP status server error
(500 Internal Server Error) for url (http://169.254.130.4:8081/msi/token?api-version=2019-08-01
&resource=https://storage.azure.com/.default)

@Tom-Newton
Contributor

You don't need to set any storage_options. The problem is just a bug in a library that delta-rs depends on. #1504 upgraded delta-rs to use the fixed version of that library. The difficulty is that this fix is not in any Python release yet, but I can confirm that it works using a build from a recent main.

@jhoekx
Contributor Author

jhoekx commented Jul 18, 2023

> @jhoekx In your initial comment you said you called DeltaTable('abfss://mycontainer@mysa.dfs.core.windows.net/mypath') without any storage_options. When using managed identities is this indeed how it should work?

Yes, although it has always been a bit troublesome and required reading the source of this library and the object_store crate. Once I patched this particular issue for us, it worked without any options from a VM or a container on a VM.

Not sure which version of the Delta Lake Python library you're using, but for 0.5.5, we had this guidance in our documentation:

The recommended authentication method while running on Azure includes using a Managed Identity. To do so, the AZURE_STORAGE_ACCOUNT environment variable should duplicate the storage account given in the uri. Next to this, when not running on Azure App Service, the IDENTITY_HEADER environment variable should be set to any value, for example foo.
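The guidance quoted above can be sketched as follows. This is only an illustration of that 0.5.5-era advice, not something confirmed for newer releases; `managed_identity_env` and `open_table` are hypothetical helper names, and the storage account is derived from the URI on the assumption that it is the first label of the host name.

```python
import os
from urllib.parse import urlparse


def managed_identity_env(table_uri: str, on_app_service: bool = False) -> dict:
    """Return the environment variables described in the guidance (hypothetical helper)."""
    # For abfss://<container>@<account>.dfs.core.windows.net/<path>,
    # the storage account is assumed to be the first label of the host name.
    host = urlparse(table_uri).hostname or ""
    env = {"AZURE_STORAGE_ACCOUNT": host.split(".")[0]}
    if not on_app_service:
        # Any value works according to the guidance; "foo" is the example it gives.
        env["IDENTITY_HEADER"] = "foo"
    return env


def open_table(table_uri: str):
    """Apply the variables, then open the table without storage_options."""
    from deltalake import DeltaTable  # third-party, imported lazily

    os.environ.update(managed_identity_env(table_uri))
    return DeltaTable(table_uri)
```

Calling `open_table('abfss://mycontainer@mysa.dfs.core.windows.net/mypath')` would set `AZURE_STORAGE_ACCOUNT=mysa` and `IDENTITY_HEADER=foo` before constructing the table.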

We did not have much luck with newer versions (tested 0.8.1/0.10). Current master should be great as mentioned in the comments here. I intend to verify that this week.

> I've been trying to figure out what the storage_options should be, trying this from an azure function. If I don't use any storage_options (like you did in your example), I get a different error message than you do:
>
> Generic MicrosoftAzure error: Error authorizing request: Error performing token request:
> response error "500 Internal Server Error", after 10 retries: HTTP status server error
> (500 Internal Server Error) for url (http://169.254.130.4:8081/msi/token?api-version=2019-08-01
> &resource=https://storage.azure.com/.default)

In my situation it always fetches the token from 169.254.169.254. Not sure how running in an Azure Function influences this. To debug your issue, I would:

  • display all environment variables and see if some of them influence the object_store library
  • try a minimal example of authentication using managed identities with just the azure-identity and azure-storage-blob Python packages to determine if the basics work from within the Function.

Based on those two things you can know if you need to look deeper into this library or in your infrastructure.
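Both debugging steps can be sketched in a few lines. The prefix list is an assumption about which variables might matter, not a documented list of what object_store reads, and the helper names are hypothetical. The second function uses only azure-identity and azure-storage-blob, so a failure there would point at the infrastructure rather than at delta-rs.

```python
import os

# Assumed prefixes; adjust to whatever the object_store version in use actually reads.
SUSPECT_PREFIXES = ("AZURE_", "IDENTITY_", "MSI_")


def identity_related_env(environ=None) -> dict:
    """Return the environment variables whose names start with a suspect prefix."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(SUSPECT_PREFIXES)}


def list_with_managed_identity(account: str, container: str, prefix: str) -> list:
    """List blob names using only azure-identity and azure-storage-blob."""
    from azure.identity import ManagedIdentityCredential  # third-party
    from azure.storage.blob import ContainerClient  # third-party

    client = ContainerClient(
        account_url=f"https://{account}.blob.core.windows.net",
        container_name=container,
        credential=ManagedIdentityCredential(),
    )
    return [blob.name for blob in client.list_blobs(name_starts_with=prefix)]


if __name__ == "__main__":
    # Step 1: show the variables that could influence authentication.
    for name, value in identity_related_env().items():
        print(f"{name}={value}")
```

From inside the Function, something like `list_with_managed_identity("mysa", "mycontainer", "mypath/")` should return blob names if the identity and role assignment are set up correctly.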

@guushoekman

@Tom-Newton thank you for your reply and if this is a bug then I wasted a lot of time trying to make this work! But I do still wonder: if no storage_options are necessary then how does authentication happen? Does it automatically do some things based on my environment variables or something?

@jhoekx thank you for the information and the good suggestions to debug this.

@Tom-Newton
Contributor

Tom-Newton commented Jul 18, 2023

Azure managed identity is all about configuring the environment so that code running there can authenticate completely automatically, without needing to provide any credentials. (I'm not talking about environment variables. I don't really know how it works, but Azure does something a bit more complicated than that.)
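As a rough illustration of what "automatically" can mean on a VM: code asks a local metadata endpoint for a token and receives one without supplying any secret. The sketch below assumes the VM endpoint and API version seen in the first traceback, a `Metadata: true` request header, and a JSON response with an `access_token` field. The function names are hypothetical, and Azure Functions appear to use a different local endpoint, as the second error message in this thread suggests.

```python
import json
import urllib.request
from urllib.parse import urlencode

# VM metadata endpoint, as seen in the first traceback of this thread.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource: str, api_version: str = "2019-08-01") -> urllib.request.Request:
    """Prepare the token request without sending it."""
    query = urlencode({"api-version": api_version, "resource": resource})
    # The metadata service expects this header on every request.
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"})


def fetch_token(resource: str = "https://storage.azure.com") -> str:
    """Only works from inside an Azure VM that has a managed identity assigned."""
    with urllib.request.urlopen(build_token_request(resource), timeout=5) as response:
        return json.load(response)["access_token"]
```

No credentials appear anywhere in the request: the platform decides which identity to use based on where the request originates.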

@jhoekx
Contributor Author

jhoekx commented Sep 5, 2023

Reading using a VM Managed Identity has worked since Python deltalake 0.10.1, so this issue can be closed.

@jhoekx jhoekx closed this as completed Sep 5, 2023
4 participants