updating metadata on all draft assets #1450
Comments
Are you expecting some new metadata to be extracted from nwb that can't be achieved through a metadata record version upgrade?
yes. our metadata extraction routines have changed several times, so the older assets are definitely out of sync with what the current routines produce. to start with, we can limit updates to assets whose metadata schema version is lower than the current one.
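As a rough illustration of that filtering, here is a minimal sketch using the dandi-cli Python client. The cutoff value is illustrative (not something settled in this thread), and it assumes each asset's raw metadata carries a `schemaVersion` field:

```python
from dandi.dandiapi import DandiAPIClient
from packaging.version import Version

CURRENT_SCHEMA = Version("0.6.0")  # illustrative cutoff, not the decided value

with DandiAPIClient("https://api.dandiarchive.org/api") as client:
    for dandiset in client.get_dandisets():
        draft = client.get_dandiset(dandiset.identifier, "draft", lazy=False)
        for asset in draft.get_assets():
            schema_version = asset.get_raw_metadata().get("schemaVersion")
            # Flag assets with no recorded schema version or an older one.
            if schema_version is None or Version(schema_version) < CURRENT_SCHEMA:
                print(f"{dandiset.identifier}\t{asset.path}\t{schema_version}")
```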
support for that was already added in dandi-cli. I do not think it is an issue to keep in dandi-archive, since I don't think code should be developed here for it - it's just a matter of running that command/script. but I can be wrong ;)
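For the record, the dandi-cli entry point being referred to lives, as far as I can tell, under `dandi service-scripts`; the flag here is from memory, so treat the invocation as approximate and check `dandi service-scripts reextract-metadata --help`:

```
dandi service-scripts reextract-metadata --when newer-schema-version <ASSET-OR-DANDISET-URL>
```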
I believe this issue was filed as a task reminder, not as something to change in the archive codebase. If this can be carried out with dandi-cli, that's fantastic. I think we still need some clarity on exactly what to do, how to do it, and who will do it. Can we get those things decided so we can keep this moving forward?
we may need to do this via the management console, re-extracting and updating the metadata for each draft dandiset/asset.
since this is an update operation, i don't want the process to conflict with any upload or other actions (e.g. checksum calculation) that may be going on. this may not be something the CLI can do. also, the CLI won't have admin privileges unless we use an admin token.
or just finally expose dandiset locking as an admin-only operation in the API, and adjust the already-coded implementation, since it seems we would need a similar mechanism for nearly any archive-wide operation that might interfere with the original owners' actions.
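To make that concrete, here is a purely hypothetical sketch of what an admin-only lock guard could look like from the client side. The `/lock/` and `/unlock/` endpoints and the helper name are invented for illustration; dandiset locking is not currently exposed in the API:

```python
from contextlib import contextmanager

import requests

API = "https://api.dandiarchive.org/api"

@contextmanager
def locked_dandiset(dandiset_id: str, token: str):
    """Hypothetical helper: hold an admin-only lock for the duration of an update."""
    headers = {"Authorization": f"token {token}"}
    # These endpoints are illustrative only; they do not exist today.
    requests.post(f"{API}/dandisets/{dandiset_id}/lock/", headers=headers).raise_for_status()
    try:
        yield
    finally:
        requests.post(f"{API}/dandisets/{dandiset_id}/unlock/", headers=headers).raise_for_status()

# Usage sketch: re-extract metadata while owners' uploads are held off.
# with locked_dandiset("000027", token=ADMIN_TOKEN):
#     update_assets(...)
```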
Running the numbers against the current draft assets in prod (ignoring zarrs), I believe this will involve processing ~700 GB. We might be able to bring this down if we can define which assets are considered "old" and which are okay.
how did you get that number? it shouldn't actually have to download all the data for each asset to do this. the updated code uses cached/remote streaming instead.
Summing the size of all assets in draft versions, excluding zarr assets (roughly as in the sketch below).
So it only needs to access a portion of the stored s3 file? I was using the
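A sketch of that size estimate, under the same assumptions as the snippet above (public API instance, dandi-cli client). Excluding zarrs by path suffix is an approximation of mine; a real run would check the asset type in the API record instead:

```python
from dandi.dandiapi import DandiAPIClient

total_bytes = 0
with DandiAPIClient("https://api.dandiarchive.org/api") as client:
    for dandiset in client.get_dandisets():
        draft = client.get_dandiset(dandiset.identifier, "draft", lazy=False)
        for asset in draft.get_assets():
            if not asset.path.endswith((".zarr", ".ngff")):  # rough zarr exclusion
                total_bytes += asset.size

print(f"~{total_bytes / 1e9:.0f} GB across non-zarr draft assets")
```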
as long as the asset uses the
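The streaming arrangement under discussion, sketched with `fsspec` and `h5py`; this mirrors the caching/range-read idea rather than dandi-cli's literal code path, and the URL is a placeholder:

```python
import fsspec
import h5py

# Placeholder URL for an NWB asset's blob on S3.
url = "https://dandiarchive.s3.amazonaws.com/blobs/xxx/file.nwb"

# simplecache:: keeps fetched blocks on local disk; h5py then issues range
# reads for just the headers/attributes it needs, not the whole file.
with fsspec.open(f"simplecache::{url}", mode="rb") as f:
    with h5py.File(f, "r") as nwb:
        print(dict(nwb.attrs))  # metadata access without a full download
```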
i'm starting this issue so that we can do this within the next two weeks.
this requires re-extracting metadata from all draft assets and uploading it to the draft version of the dataset. we won't touch published versions here. this is to ensure that all draft datasets are compliant with the current schema.
i think this is best done at an admin process level. let's start with all nwb datasets. it would involve running a script on assets from all draft versions to update them.
@jwodder - could you please provide a dandi api-based function to re-extract metadata from a remote nwb file (i.e. a dandi asset url)? this is not about getting the stored asset metadata, but re-extracting it from the nwb file itself.
cc/ @yarikoptic
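A hedged sketch of what such a function might look like, built from pieces I believe exist in dandi-cli (`nwb2asset` has moved between modules across versions, so the import may need adjusting). The download-then-extract approach here trades bandwidth for simplicity compared with the streaming sketch above:

```python
import tempfile
from pathlib import Path

from dandi.dandiapi import DandiAPIClient
from dandi.metadata import nwb2asset  # module location varies across dandi-cli versions

def reextract_draft_asset_metadata(dandiset_id: str, asset_path: str):
    """Download a draft NWB asset and re-run metadata extraction on it (sketch)."""
    with DandiAPIClient("https://api.dandiarchive.org/api") as client:
        draft = client.get_dandiset(dandiset_id, "draft")
        asset = draft.get_asset_by_path(asset_path)
        with tempfile.TemporaryDirectory() as tmpdir:
            local = Path(tmpdir) / Path(asset_path).name
            asset.download(local)
            # Re-extract from the NWB file itself, not from the stored record.
            return nwb2asset(local)
```

Pushing the result back to the draft record would then presumably go through `set_raw_metadata()` on the same remote asset.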