Feature request: implement im.getxmp() to return all embedded XMP meta data as XML #5076

Closed
laynr opened this issue Dec 3, 2020 · 19 comments · Fixed by #5144

Comments

@laynr

laynr commented Dec 3, 2020

Implement image.getxmp() similar to image.getexif(), that returns all embedded XMP meta data out of an image as XML

Something like:

import xml.etree.ElementTree

def getxmp(self):
    # Return the parsed XMP tree from the first matching APP1 segment, or None.
    for segment, content in self.applist:
        if segment == 'APP1':
            marker, xmp_tags = content.rsplit(b'\x00', 1)
            if marker == b'http://ns.adobe.com/xap/1.0/':
                return xml.etree.ElementTree.fromstring(xmp_tags)
    return None

# Based off https://stackoverflow.com/a/32001778

FYI: This didn't work for me:
xmp_tags = self.info.get("XML:com.adobe.xmp")

I am sure this feature request has been asked before... but a search of 'XMP' in issues yielded nothing. Just asking for MVP, not write support, or tag comprehension.

XMP documentation:
Official: https://www.adobe.com/devnet/xmp.html
Helpful: https://exiftool.org/TagNames/XMP.html

Requesting output similar to the output of:
exiftool.exe -xmp:all -X image.jpg

@hugovk hugovk changed the title Feature request: implement image.getxmp() to return all imbedded XMP meta data as XML. Feature request: implement image.getxmp() to return all embedded XMP meta data as XML Dec 6, 2020
@UrielMaD
Contributor

Hello @laynr @hugovk I'd like to take this issue, I'm already working on this.

I already got the xmp tags out of the file, I'm just wondering which output structure would be best for the getxmp() to return

I was thinking it could return an object like getexif() does, but since I already have the tag names here, I could use the actual tag name and its value instead of the tag number.

@UrielMaD
Contributor

Btw I also could just simply return its xml tree

@laynr
Author

laynr commented Dec 24, 2020

Great thanks @UrielMaD!

It is probably best to keep it similar to getexif() if that returns an object, and using tag names instead of tag numbers would be awesome...

That said, there is definite value in just returning the XML tree. For one, returning the XML tree may be the most future-proof, as you wouldn't need to stay current on new tag names.

I guess if you return an object, one of the object's functions can be to return the XML tree - perhaps that would be the best of both worlds! (but more work)

For my personal project I just used the XML tree as the parsers for XML are well established.

One thing I am noticing is that there can be multiple XMP sections in one file and not necessarily adjacent.

Thank you for taking this on. I believe it will be very useful for many people!

@UrielMaD
Contributor

Thank you @laynr. I'll send a PR implementing an xml object; I can also return the whole xml tree as a string.

I get the tag names directly from what's in the xml tree, so I will return only the tags that come in that file. If more xmp tags are added in the future, it will still return the new ones, as they just come as tag attributes.
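As an illustration of reading tag names straight from the tree, here is a minimal sketch. The names `local_name` and `element_to_dict` are hypothetical and this is not the code from the PR; it only shows one way the names present in a file can become dictionary keys, with repeated tags collected into a list. It uses the standard-library ElementTree for brevity, and the sample packet is made up.

```python
import xml.etree.ElementTree as ET


def local_name(qualified):
    """Strip the '{namespace-uri}' part ElementTree puts in front of names."""
    return qualified.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Hypothetical converter: local tag names become keys, repeats become lists."""
    result = {local_name(key): value for key, value in element.attrib.items()}
    for child in element:
        key = local_name(child.tag)
        value = element_to_dict(child)
        if key in result:
            # A repeated tag: turn the existing entry into a list and append.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    text = (element.text or "").strip()
    if text:
        if not result:
            return text  # a leaf element is reduced to its text
        result["text"] = text
    return result


# Made-up sample packet, only for demonstration.
SAMPLE = (
    '<x:xmpmeta xmlns:x="adobe:ns:meta/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:format>image/jpeg</dc:format>"
    "<dc:creator>A</dc:creator><dc:creator>B</dc:creator>"
    "</x:xmpmeta>"
)

root = ET.fromstring(SAMPLE)
tags = {local_name(root.tag): element_to_dict(root)}
print(tags)
```

Because the keys are taken from whatever the file contains, no table of known tag names is needed. The namespace URIs are discarded along the way, so two tags with the same local name in different namespaces would collide.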

@radarhere radarhere changed the title Feature request: implement image.getxmp() to return all embedded XMP meta data as XML Feature request: implement im.getxmp() to return all embedded XMP meta data as XML Dec 29, 2020
@UrielMaD
Contributor

UrielMaD commented Dec 30, 2020

@hugovk Changes were merged into my PR and all tests have passed

@radarhere
Member

Hi. Something to be aware of. Since the xml module is not secure -
https://docs.python.org/3/library/xml.etree.elementtree.html

Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

in Pillow 8.3.0, we've added a new requirement - you will have to install defusedxml to get this method to work. See #5565 for more information

@kevinhendricks

kevinhendricks commented May 18, 2024

Any thoughts on adding a parameter to getxmp() to tell it to just return the xml as a string, rather than a nested dict of lists and dicts that is very hard to work with to extract anything useful (it is not a nice set of name:value pairs but instead an entire xml tree shoehorned in)? Even after flattening, it is an issue given the complex overuse of namespaces and the non-standardized prefixes in the IPTC spec.

There are a large number of xml parsers and tools such as bs4 that can properly be used to find and extract information from xml while properly handling namespace prefix differences across implementations. No defusedxml needed as bs4 can use lxml for parsing.

It would be extremely easy to add given the xmp xml string data is available when _getxmp() is called.

Right now I have to walk the bytes of all image files looking for xmpmeta and then backtracking to validly check that the proper namespace is used. The prefix used to represent the namespace is not always "x:".
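The byte-walking workaround described above might look roughly like the following sketch. The function name `find_xmp_packets` is hypothetical. It assumes the packet is stored uncompressed, that the namespace declaration sits on the opening xmpmeta tag, and that no attribute value contains a `>` character; ExtendedXMP and compressed PNG text chunks are not handled.

```python
import re

XMPMETA_NS = b"adobe:ns:meta/"
# Opening tag with any namespace prefix, e.g. <x:xmpmeta ...> or <foo:xmpmeta ...>.
OPEN_TAG = re.compile(rb"<([A-Za-z_][\w.\-]*):xmpmeta\b([^>]*)>")


def find_xmp_packets(data):
    """Return every <prefix:xmpmeta>...</prefix:xmpmeta> span found in raw bytes."""
    packets = []
    for match in OPEN_TAG.finditer(data):
        prefix, attributes = match.group(1), match.group(2)
        # Accept the element only if its prefix is bound to the XMP meta namespace.
        binding = re.compile(
            rb"xmlns:" + re.escape(prefix) + rb"\s*=\s*['\"]"
            + re.escape(XMPMETA_NS) + rb"['\"]"
        )
        if not binding.search(attributes):
            continue
        closing = b"</" + prefix + b":xmpmeta>"
        end = data.find(closing, match.end())
        if end != -1:
            packets.append(data[match.start():end + len(closing)])
    return packets


# Synthetic bytes: one valid packet with an unusual prefix, one decoy.
sample = (
    b"\xff\xd8junk"
    b'<foo:xmpmeta xmlns:foo="adobe:ns:meta/"><a/></foo:xmpmeta>'
    b"more"
    b'<x:xmpmeta xmlns:x="other"></x:xmpmeta>'
)
found = find_xmp_packets(sample)
print(found)
```

Checking the namespace binding rather than the literal `x:` prefix is what lets the second, decoy element be rejected while the `foo:` one is accepted.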

If you ever plan to support modifying or writing xmp metadata, accepting complete/validated xml as input would certainly be easier than fighting with changes to nested dictionaries and lists.

As Accessibility becomes more important to publishers of all sorts, getting the direct access to the xml as a string will simplify things for everyone.

Please consider making this slight change, thereby getting yourself out of the game of processing and repackaging xml.

Thank you for your time and consideration.

@kevinhendricks

kevinhendricks commented May 18, 2024

FWIW, I thought about writing a routine to convert your nested dict back to real xml but found that the use of dictionaries presents issues for sequences of identical tags (in this case "li" tags) being stored (same key overwriting the earlier key).

Here is an example, just a simple snippet from the official sample image from the IPTC.

<x:xmpmeta x:xmptk="Image::ExifTool 12.50" xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:Iptc4xmpCore="http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/">
   <Iptc4xmpCore:AltTextAccessibility>
    <rdf:Alt>
     <rdf:li xml:lang="x-default">
      This is the Alt Text description to support accessibility in 2022.1
     </rdf:li>
     <rdf:li xml:lang="en">
      This is the Alt Text description to support accessibility in 2022.1
     </rdf:li>
    </rdf:Alt>
   </Iptc4xmpCore:AltTextAccessibility>

And here is that same snippet extracted from the getxmp() command:

{'xmpmeta': {'xmptk': 'Image::ExifTool 12.50', 'RDF': { 'Description': [{'about': '', 'AltTextAccessibility': {'Alt': {'li': [{'lang': 'x-default', 'text': 'This is the Alt Text description to support accessibility in 2022.1'}, {'lang': 'en', 'text': 'This is the Alt Text description to support accessibility in 2022.1'}]}}, 
...

Notice that there is only one "li" tag shown in the dict version while the actual xml has two li tags with separate values. Also notice the missing namespace information.

So there is no easy way to walk what is returned by getxmp() to rebuild the actual xml, even for this very simple example that just happened to be near the top of the tree.

Trying to suss out anything farther down in the nested dict structure is an exercise in futility unless you know the entire structure in advance, which kind of defeats the whole purpose.

The actual xml is much much easier to work with and is simpler for you.
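To illustrate working with the xml directly, here is a minimal sketch that looks elements up by namespace URI, so the prefix chosen by the writer does not matter. The sample packet is a reduced, made-up variant of the snippet above (closed so that it parses, with a deliberately different prefix, `core`), and `alt_texts` is a hypothetical helper. It uses the standard-library ElementTree; for untrusted files, `defusedxml.ElementTree.fromstring` would be the safer parser.

```python
import xml.etree.ElementTree as ET

# Made-up, reduced packet; the IPTC namespace is bound to "core" on purpose.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:core="http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/">
   <core:AltTextAccessibility>
    <rdf:Alt>
     <rdf:li xml:lang="x-default">Default alt text</rdf:li>
     <rdf:li xml:lang="en">English alt text</rdf:li>
    </rdf:Alt>
   </core:AltTextAccessibility>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

# Prefixes here are local to this script; only the URIs have to match the file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "iptc": "http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def alt_texts(xmp):
    """Map each xml:lang value to its alt text, keeping every rdf:li entry."""
    root = ET.fromstring(xmp)
    items = root.findall(".//iptc:AltTextAccessibility/rdf:Alt/rdf:li", NS)
    return {li.get(XML_LANG): (li.text or "").strip() for li in items}


texts = alt_texts(XMP)
print(texts)
```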

Hope this helps.

@radarhere
Member

The following code shows how to retrieve the XMP string from each format that Pillow gathers it from -

from PIL import Image
with Image.open("Tests/images/flower2.webp") as im:
    print(im.info["xmp"])
with Image.open("Tests/images/color_snakes.png") as im:
    print(im.info["XML:com.adobe.xmp"])
with Image.open("Tests/images/lab.tif") as im:
    print(im.tag_v2[700])
with Image.open("Tests/images/xmp_test.jpg") as im:
    for segment, content in im.applist:
        if segment == "APP1":
            marker, xmp_tags = content.split(b"\x00")[:2]
            if marker == b"http://ns.adobe.com/xap/1.0/":
                print(xmp_tags)
                break

From that, I would think the best way to unify this is to add im.info["xmp"] for JPG, PNG and TIFF. Let us know what you think of that idea.
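Until something like that exists, the four lookups above could be wrapped in one helper. The sketch below is hypothetical (`get_xmp_packet` is not a Pillow function) and deliberately avoids importing Pillow: it only assumes an object exposing the `info`, `tag_v2` and `applist` attributes used in the snippet above, so it is exercised here with stand-in objects. It also uses `partition` rather than `split`, so everything after the first null byte is kept.

```python
from types import SimpleNamespace

XMP_JPEG_MARKER = b"http://ns.adobe.com/xap/1.0/"


def get_xmp_packet(im):
    """Hypothetical helper: return the raw XMP from an image-like object, or None.

    The return type follows whatever the format stores (str or bytes).
    """
    info = getattr(im, "info", {})
    for key in ("xmp", "XML:com.adobe.xmp"):  # WebP, then PNG
        if key in info:
            return info[key]
    tags = getattr(im, "tag_v2", None)  # TIFF keeps XMP in tag 700
    if tags is not None and 700 in tags:
        return tags[700]
    for segment, content in getattr(im, "applist", []):  # JPEG APP1 segments
        if segment == "APP1":
            marker, _, payload = content.partition(b"\x00")
            if marker == XMP_JPEG_MARKER:
                return payload
    return None


# Stand-in objects mimicking the attributes of opened images.
fake_jpeg = SimpleNamespace(
    info={},
    applist=[
        ("APP1", b"Exif\x00\x00II*\x00"),
        ("APP1", XMP_JPEG_MARKER + b"\x00<x:xmpmeta/>"),
    ],
)
fake_png = SimpleNamespace(info={"XML:com.adobe.xmp": "<x:xmpmeta/>"})
fake_tiff = SimpleNamespace(info={}, tag_v2={700: b"<x:xmpmeta/>"})

print(get_xmp_packet(fake_jpeg), get_xmp_packet(fake_png), get_xmp_packet(fake_tiff))
```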

@kevinhendricks

kevinhendricks commented May 20, 2024

That would work fine for me. Thank you.

I read that for jpg that if the xmp metadata was larger than the 64k segment limit it would be split to use multiple segments. If that is correct, breaking after the first may not quite be enough.

Thank you for your code snippet. It is nice to have something working with the current version of Pillow.

@radarhere
Member

I've created #8069

I read that for jpg that if the xmp metadata was larger than the 64k segment limit it would be split to use multiple segments.

We have seen multiple segments for EXIF. Do you know where you read that specifically about XMP? Or do you have an image that demonstrates this happening?

@kevinhendricks

kevinhendricks commented May 20, 2024

I have no test case other than the official IPTC one. I think, if it exists it would be quite rare. Although since some of these metadata values are open textfields with no spec'd size limits, you could create one easily enough.

I read it here:
https://stackoverflow.com/questions/6822693/read-image-xmp-data-in-python

[QUOTE]
This will break when XMP is in multiple parts due to the jpeg format only allowing 64k for each chunk of such data. –
hippietrail
Jul 23, 2019 at 10:25
[/QUOTE]

So it may just be a hypothetical case, but one someone will probably try to exploit if at all possible.

@kevinhendricks

Actually the issue of > 64k is real and the adobe spec describes how to deal with it:

See https://stackoverflow.com/questions/27383172/inserting-large-amount-of-xmp-data-into-jpg-using-multiple-packets

@kevinhendricks

Here is an adobe spec quote from that issue:

Quoting Adobe XMP Specification part 3:

Following the normal rules for JPEG sections, the header plus the following data can be at most 65535 bytes long. If the XMP packet is not split across multiple APP1 sections, the size of the XMP packet can be at most 65502 bytes. It is unusual for XMP to exceed this size; typically, it is around 2 KB.

If the serialized XMP packet becomes larger than the 64 KB limit, you can divide it into a main portion (StandardXMP) and an extended portion (ExtendedXMP), and store it in multiple JPEG marker segment. A reader must check for the existence of ExtendedXMP, and if it is present, integrate the data with the main XMP. Each portion (standard and extended) is a fully formed XMP metadata tree, although only the standard portion contains a complete packet wrapper. If the data is more than twice the 64 KB limit, the extended portion can also be split and stored in multiple marker segments; in this case, the split portions are not fully formed metadata trees.

When ExtendedXMP is required, the metadata must be split according to some algorithm that assigns more important data to the main portion, and less important data to the extended portions or portions.

@kevinhendricks

kevinhendricks commented May 20, 2024

So it sounds like just walking the applist and appending the xmp sections in the sequence found will work, since more than 2 sections' worth of xmp can result in split trees where a single section is no longer a valid tree on its own, meaning they could never be parsed separately.

@radarhere
Member

Page 20 of https://archimedespalimpsest.net/Documents/External/XMP/XMPSpecificationPart3.pdf states

Each chunk is written into the JPEG file within a separate APP1 marker segment. Each ExtendedXMP marker segment contains:
➤ A null-terminated signature string of "http://ns.adobe.com/xmp/extension/".
➤ A 128-bit GUID stored as a 32-byte ASCII hex string, capital A-F, no null termination. The GUID is a 128-bit MD5 digest of the full ExtendedXMP serialization.
➤ The full length of the ExtendedXMP serialization as a 32-bit unsigned integer
➤ The offset of this portion as a 32-bit unsigned integer.
➤ The portion of the ExtendedXMP
The GUID is also stored in the StandardXMP as the value of the xmpNote:HasExtendedXMP property. This allows detection of mismatched or modified ExtendedXMP. A reader must only incorporate ExtendedXMP blocks whose GUID matches the value of xmp

That's not as simple as just concatenating the chunks. I would prefer to work from an example of this type of file. If there is no example, then that sounds like an argument that this feature may not be so vital.
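To make the layout quoted above concrete, here is a rough sketch of how the chunks could be reassembled. It is not Pillow code: `parse_extended_chunk` and `assemble_extended_xmp` are hypothetical names, the two integers are assumed to be big-endian (the quoted passage does not state the byte order), and the expected GUID is assumed to have been read already from xmpNote:HasExtendedXMP in the StandardXMP. The data is synthetic.

```python
import hashlib
import struct

EXTENSION_SIGNATURE = b"http://ns.adobe.com/xmp/extension/\x00"


def parse_extended_chunk(content):
    """Split one APP1 payload into (guid, full_length, offset, portion), or None."""
    pos = len(EXTENSION_SIGNATURE)
    if not content.startswith(EXTENSION_SIGNATURE) or len(content) < pos + 40:
        return None
    guid = content[pos:pos + 32].decode("ascii", errors="replace")
    # Assumption: both 32-bit unsigned integers are big-endian.
    full_length, offset = struct.unpack(">II", content[pos + 32:pos + 40])
    return guid, full_length, offset, content[pos + 40:]


def assemble_extended_xmp(app1_payloads, expected_guid):
    """Rebuild the ExtendedXMP serialization whose GUID matches expected_guid."""
    buffer = None
    filled = 0
    for content in app1_payloads:
        parsed = parse_extended_chunk(content)
        if parsed is None or parsed[0] != expected_guid:
            continue  # not ExtendedXMP, or belongs to a different serialization
        _, full_length, offset, portion = parsed
        if buffer is None:
            buffer = bytearray(full_length)
        buffer[offset:offset + len(portion)] = portion  # place by offset, not order
        filled += len(portion)
    if buffer is None or filled != len(buffer):
        return None  # nothing found, or chunks missing
    # The GUID is the MD5 digest of the full serialization, as uppercase hex.
    if hashlib.md5(buffer).hexdigest().upper() != expected_guid:
        return None
    return bytes(buffer)


# Synthetic data: two chunks delivered out of order, plus an unrelated segment.
serialization = b"<x:xmpmeta>" + b"A" * 100 + b"</x:xmpmeta>"
guid = hashlib.md5(serialization).hexdigest().upper()


def make_chunk(offset, portion):
    header = struct.pack(">II", len(serialization), offset)
    return EXTENSION_SIGNATURE + guid.encode("ascii") + header + portion


payloads = [
    make_chunk(60, serialization[60:]),
    b"Exif\x00\x00",
    make_chunk(0, serialization[:60]),
]
rebuilt = assemble_extended_xmp(payloads, guid)
print(rebuilt == serialization)
```

Placing each portion by its stated offset, then verifying the digest, is what distinguishes this from plain concatenation in segment order. Merging the rebuilt ExtendedXMP tree into the StandardXMP is a separate step that this sketch does not attempt.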

@kevinhendricks

Yes, not that simple. Of course, returning a list of xmp segment strings and letting the user fight with them is always an option!

I will look for a couple of official or unofficial examples of images using ExtendedXMP and get back to you with links for them.

My understanding is some Android camera apps make liberal use of the xmp metadata to store some depth and other special effects info that are bigger than 64k so examples should be out there.

@kevinhendricks

Started checking github repos for software that manipulates ExtendedXmp and found a java project called icafe that supports xmp across a number of formats:

https://github.com/dragon66/icafe

and in their set of test images I found this sample, a depth image whose xmp information takes up a number of segments.

https://github.com/dragon66/icafe/blob/master/images/table.jpg

I will look for others.

@kevinhendricks

And here is a second image that uses ExtendedXmp in jpeg.

https://github.com/drewnoakes/metadata-extractor-images/

https://github.com/drewnoakes/metadata-extractor-images/blob/main/jpg/Google%20Cardboard.jpg

If you need more, please let me know.
