
KHR_audio_emitter #2137

Open · wants to merge 13 commits into base: main

Conversation

@robertlong (Contributor, PR author) commented Mar 31, 2022

KHR_audio_emitter

This extension adds the ability to store audio data and represent both global and positional audio emitters in a scene.

This extension is intended to unify the OMI_audio_emitter and MSFT_audio_emitter extensions into a common audio extension.

Members from the Open Metaverse Interoperability Group, Microsoft, and other Khronos members met on 3/23/2022 to discuss how we might unify OMI's work on audio emitters and Microsoft's previous efforts. OMI_audio_emitter has a subset of the MSFT_audio_emitter extension's features, eliminating features that might be out of scope for a more general extension. KHR_audio_emitter also has some spec changes from the original OMI_audio_emitter requested by Khronos members. There are still some outstanding questions on audio formats, MP3 vs WAV, and what features within these formats should be supported.

We're looking forward to working together on this extension and bringing audio to glTF!

@najadojo commented Apr 1, 2022

Love that this space is getting more attention. I am a little concerned about the elimination of a number of features that MSFT_audio_emitter has. Some limitations are understandable, such as the integration with animation; I expect the thinking is that the model and its audio are driven externally by the loading system, though this approach may give an artist less control, or at a minimum a more complicated workflow. Other limitations I'm not sure I understand, such as limiting to one emitter per node. This would just lead to having a number of child nodes for scenarios that require it.

I'd like to remind folks of a demo video we produced for MSFT_audio_emitter that was completely data driven.

@lyuma commented Apr 7, 2022

I wanted to chime in here on the randomized audio clips. I'm generally opposed to having specific fixed-function features in glTF where not necessary, because those features will have to be implemented by everyone and maintained, even in the future when extensions such as KHR_animation2 have been developed more.

That said, the demo you link is really cool. I can't remember the original document but I did see that randomizing environment audio adds a lot to immersion, compared to having a looping set of tracks that play one after the other.

So regarding randomized clips, I'm a bit torn here. I might also suggest a middle ground: rather than require a weighting/randomization system, allow multiple clips per audio emitter, but leave it up to the application or future extensions to implement the randomization / selection of audio clips (and otherwise allow just playing the first clip).

As for multiple emitters per node, I would suggest this is not necessary: it would be very easy to add a child node (with no translation, rotation or scale) with another emitter on it. This is similar to how each node only has one mesh, but multiple meshes can be easily added as child nodes.
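For illustration, a minimal sketch of that child-node workaround (the node-level extension name `KHR_audio` and property `emitter` are assumptions for this example, not confirmed schema):

"nodes": [
  {
    "name": "Radio",
    "children": [1],
    "extensions": { "KHR_audio": { "emitter": 0 } }
  },
  {
    "name": "Radio Second Emitter",
    "extensions": { "KHR_audio": { "emitter": 1 } }
  }
]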

@UX3D-nopper (Contributor)

Updated the example C++ source code for KHR_audio here: https://github.com/ux3d/OMI/tree/KHR_audio

@robertlong (Contributor, PR author)

Yeah we talked about this extensively in the OMI glTF meeting yesterday. I'm personally on the side of making this extension as simple as possible. I also realize now that it is a bit odd that we allow multiple emitters on the scene, but not nodes.

Given this feedback my recommendations are:

  1. Maybe we want to expand the spec to add basic mixing support via supporting multiple audio sources on an audio emitter, each with their own gain value.

Here's a proposal with multiple inputs per emitter. There is a gain value on the emitter as well as on each of the sources, as you would usually see in a mixer. In this proposal the only inputs are audio sources, but you could imagine other audio processing nodes in there as well, similar to the WebAudio API. I think we want to make this spec as simple as possible without limiting future extensions from adding more features, and adding mixing to the core spec isn't a huge ask.

{
  "emitters": [
    {
      "name": "Positional Emitter",
      "type": "positional",
      "gain": 0.8,
      "inputs": [0, 1],
      "positional": {
        "coneInnerAngle": 6.283185307179586,
        "coneOuterAngle": 6.283185307179586,
        "coneOuterGain": 0.0,
        "distanceModel": "inverse",
        "maxDistance": 10.0,
        "refDistance": 1.0,
        "rolloffFactor": 0.8
      }
    }
  ],
  "sources": [
    {
      "name": "Clip 1",
      "gain": 0.6,
      "playing": true,
      "loop": true,
      "audio": 0
    },
    {
      "name": "Clip 2",
      "gain": 0.6,
      "playing": true,
      "loop": true,
      "audio": 1
    }
  ],
  "audio": [
    {
      "uri": "audio1.mp3"
    },
    {
      "bufferView": 0,
      "mimeType": "audio/mpeg"
    }
  ]
}
  2. We may want to only allow for one global emitter in light of what I'm proposing with mixing audio. You'd just add multiple inputs for the global emitter.

  3. For clip randomization I do think we should be introducing that through some other, more generic, event-based system. Having bespoke behavioral features in the base spec doesn't make a whole lot of sense to me. Clip randomization sounds very useful, and the demo shown above is awesome, but we could also use this feature for randomizing active idle animations, or a material color.

  4. KHR_animation2 should be able to target some of the properties in this spec to play one-shot clips multiple times, or start looping a clip after a certain frame.

  5. We might want to add loopStart and loopEnd properties to audio sources such that a single audio file can be used and the source samples a section from that file (see the sketch after this list).

  6. We need to add language to the spec to clarify that positional audio emitters "MUST" use a mono audio source.
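A hypothetical sketch of point 5 on an audio source (`loopStart`/`loopEnd` are proposed names, expressed in seconds here; they are not part of the current draft):

"sources": [
  {
    "name": "Birds Section",
    "audio": 0,
    "loop": true,
    "loopStart": 2.0,
    "loopEnd": 4.5
  }
]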

@antpb commented Apr 20, 2022

Just want to voice a +1 to Robert's proposed changes above, particularly the bits around having multiple sources in the array. I've implemented the OMI audio spec in my WordPress plugin ( https://3ov.xyz ) and this change is early enough, and makes enough sense, that it does not impact my current users.

Also a +1 on having one global emitter and having inputs that feed into that global.

I don't feel strongly about loopStart and loopEnd but do see the benefits.

@UX3D-nopper (Contributor) commented Apr 21, 2022

We got feedback on KHR_animation2. It has been renamed to KHR_animation_pointer and we are making good progress on it.
#2147

Review thread on extensions/2.0/Khronos/KHR_audio/README.md (outdated, resolved):

#### `playing`

Whether or not the specified audio clip is playing. Setting this property to `true` will set the audio clip to play on load (autoplay).


Is there a particular reason for naming this like a state ("playing") vs. naming it as what it does ("autoplay")? Seems that hints at possible implementation details ("changing this should change play state") but that isn't mentioned.

Contributor Author

This is a bigger discussion about how animation, audio, etc. should be treated on load. Animations are currently just keyframe data and it's up to the implementation to figure out how to play the animations. https://www.khronos.org/registry/glTF/specs/2.0/glTF-2.0.html#animations

So this raises the question of whether playing, or autoPlay, or even loop should be included in the spec.

@najadojo @bghgary this also goes against the direction of the MSFT_audio_emitter where there are ways to specify runtime behavior.

Maybe runtime behavior should be left out of this spec and implemented in another?

@hybridherbst commented Apr 29, 2022

I see your point regarding playback behaviour!

If the glTF is purely seen as "data setup", then it might still be desirable to have a way to connect audio clips to animations - e.g. saying "this audio emitter belongs to that animation" (potentially: "at that point in time") would be pure data. This would be similar to how a mesh + material are connected to a node.

In which direction do you think this connection should be made? Attaching an emitter to a clip or vice versa? I think on the clip would make somewhat more sense:

"animations": [
    {
      "channels": [...],
      "samplers": [...],
      "name": "SimpleCubeAnim",
      "extensions": {
        "KHR_audio": {
          "emitters": [
            {
              "id": 0,
              "offset": 0.0
            },
            {
              "id": 1,
              "offset": 1.3
            }
          ]
        }
      }
    }
  ],

In this example, two emitters belong to this animation clip, viewers would interpret that as they "loop together", and one emitter has a slight offset. Note having the same emitter play multiple times during the loop would be possible by having multiple references to the same ID with separate offsets.

What do you think? I think that would preserve the "glTF is purely data" aspect by giving hints about how things are connected, not how things should be playing.

Contributor Author

During the OMI glTF meeting we discussed this a bit. I think synchronizing audio with an animation should also be made part of another spec. AFAIK synchronizing audio with animations isn't necessarily the best way to deal with this scenario. We should talk more about autoPlay and loop though. Should this be included in the spec? Should we go into more depth on playing/mixing audio? It'd be good to get some feedback from others on this.

@bghgary (Contributor) commented May 12, 2022

I think, at a minimum, we need to define both how the glTF carries the audio payload (including metadata) and how to play the audio, whether that is one or more specs. If this spec only defines the former and that's all we define, it will be hard to know if the spec works and it will be hard to demo.

@hybridherbst commented May 13, 2022

Would you agree that the options are

  • adding info about play behaviour to the audio data itself
  • connecting audio and animations in some way (either from audio to animation or from animation to audio)

or do you have another one in mind?

I agree that without any means of knowing how the audio is relevant to the scene, viewers won't be able to do anything with it - e.g. at a minimum I think tools like model-viewer should have a way to infer what to do with a file that has multiple animation clips and multiple audio assets (could be 1:1, could be different). A counter-argument would be saying "well for this case, if there's one animation and one audio clip that will be played back, everything else is undefined" (still allowing for cases such as this) but I'm not a huge fan of undefined-by-design...

Contributor Author

During the OMI glTF meeting we agreed that playing behavior (autoPlay and loop) should be in this spec to define the minimum behavior needed to play sounds. However, connecting audio and animations should be delegated to an extension like #2147

@bghgary (Contributor) commented May 19, 2022

I think we need to think about the corresponding spec(s) that define how the audio will play before completing this spec. KHR_animation_pointer spec will be able to cover some things (like modifying parameters for looping sounds), but it's probably not enough (e.g. one-shot sounds triggered from an animation).

Comment on lines 181 to 187
#### `coneInnerAngle`

The angle, in radians, of a cone inside of which there will be no volume reduction.

#### `coneOuterAngle`

The angle, in radians, of a cone outside of which the volume will be reduced to a constant value of `coneOuterGain`.


Might want to add a note here that setting this to some value > 2 * PI (the max allowed value) will turn this into a spherical audio source. It's implicit from the defaults but could be explained explicitly here.

Contributor Author

I agree it should act as a point / non-directional audio source when set that way. It's worth noting the WebAudio API specification doesn't specify this detail though. What's also missing is the behavior when coneOuterAngle is less than coneInnerAngle. We should check on this before adding these details.

@hybridherbst

One area that I'd love to see discussed here is behaviour under root scale for AR, as this is something currently broken in both SceneViewer and QuickLook (since it was unspecified, I guess). Happy to provide sample files.

Consider the typical scenario of placing a model in AR with model-viewer, and the model has spatial audio.
Should scaling the model down to 10% of the size

  • assume that the model is now small, but the distance to the viewer is approximately the same?
  • assume that the viewer is now large and the model hasn't changed scale, and thus the distance to the viewer is now much larger?

Unfortunately such considerations were omitted from lighting in glTF, and thus lighting in AR is also "broken everywhere" right now :)

To explain why that's not trivial, here's my take (and happy to take the lighting discussion elsewhere, here for context):

  • audio volume should work as if the model became smaller (if I scale an AR radio down, I want to hear it at the same loudness)
  • audio spatialization should work as if the viewer got bigger (if I have 3 spread out emitters in a space and scale that space down, I still want clear spatial separation between these sources)
  • lighting effects should work as if the viewer got bigger (if I scale a lamp with a spot light down, I don't want the light to become brighter, it should look the same just smaller)


An object containing the positional audio emitter properties. This may only be defined if `type` is set to `positional`.

### Positional Audio Emitter Properties
@hybridherbst commented Apr 25, 2022

Are doppler effects considered implementation details? Might want to explicitly call this out here (e.g. animating a node with an audio source, fast viewer movement, ... might cause doppler effects on positional sources).

Contributor Author

We may want to add a note, yeah. But perhaps audio-listener oriented properties / effects should be defined in another series of extensions?


I wouldn't think these should be specified here, and probably also not in another extension, as it's very application-specific. "Doppler effect for audio" is kind of in the same realm as "bloom effect for rendering" in my opinion. The note would be enough.

Contributor Author

I think we can add language such as "Implementors may choose to add effects to spatial audio such as simulating the doppler effect." 👍

@donmccurdy (Contributor)

One area that I'd love to see discussed here is behaviour under root scale for AR, as this is something currently broken in both SceneViewer and QuickLook (since it was unspecified, I guess).

If needed, perhaps this could be specified under the model3d XMP Namespace in KHR_xmp_json_ld, alongside model3d:preferredSurfaces for AR? I'm nervous to bring much wider-scoped issues ("how to compose the glTF asset within a larger scene of unknown format for unknown purpose") into KHR_ extensions, we try to define what the content "means" rather than how various categories of applications (AR, HTML viewer, game, etc.) might use it.

@hybridherbst commented Apr 25, 2022

I generally agree, but

we try to define what the content "means"

is exactly what I'm trying to say: I think an extension adding audio (or lights) should talk about how that audio works. AR is a strong use case for glTF, and this behaviour would simply be "undefined" (every viewer would do something different) if not explained in the extension that adds audio, in my opinion. It would have been better if such topics ("behaviour under different types of scale") had been part of the core spec, of course - though if they had been, responsibility would probably have been forwarded to extensions anyway...

I can just say: SceneViewer and QuickLook (glTF and USDZ, respectively) allow for the usage of audio and lights, and both haven't defined/thought about how it behaves under AR scale, so right now we can't use it for many projects where we'd like to. If it stays unspecified for KHR_audio, it's immediately the same issue.

I'll take a deeper look at KHR_xmp_json_ld! Could you let me know where I find more about model3d:preferredSurfaces and other agreed-upon properties? For me that extension so far sounded like "structured extras", but sounds like there's actually more to it?
(Edit: only hint I found was in the Khronos blog)

@donmccurdy (Contributor)

Could you let me know where I find more about model3d:preferredSurfaces and other agreed-upon properties?

Probably the place to start would be Recommendations for KHR_xmp_json_ld usage in glTF.

I think I'm worried that "different types of scale" as a concept has a lot to do with emerging ideas of an "AR viewer" vs. other application types, and that these definitions may change drastically on a timeline of 1-2 years or even months. Embedding these context-specific requirements into KHR_punctual_lights (for example) could cause the extension to become outdated far sooner than it might otherwise. With an XMP namespace there is a proper versioning system and more flexibility to evolve or adapt to specific contexts. I suspect the same applies to KHR_audio.

@UX3D-nopper (Contributor) commented May 19, 2022

We need to add language to the spec to clarify that positional audio emitters "MUST" use a mono audio source

Also, that global audio emitters "MUST" use a stereo source. Technically, a mono source would work as well, but this would probably just cause complaints from the end user.

@robertlong (Contributor, PR author)

Technically, a mono source would work as well, but this would probably just cause complaints from the end user.

Yup that makes sense. We could remove the stereo audio source requirement from the global emitter.

we try to define what the content "means" rather than how various categories of applications (AR, HTML viewer, game, etc.) might use it.

We spoke about this during the OMI glTF meeting, and KHR_audio should be defined within the frame of reference of the glTF document. We agree that the document should define what the content "means" but that the behavior of scaling the content should be up to your application.

However, if an animation in the document is controlling the scaling of the node we should define that behavior and maybe that should inform best practices in this AR viewer use-case.

So in the case of maxDistance, what should happen with it when the content is scaled down to 10% of its original scale?

maxDistance

The maximum distance between the emitter and listener, after which the volume will not be reduced any further. maximumDistance may only be applied when the distanceModel is set to linear. Otherwise, it should be ignored.

Should this respect node scale?

Note that in Unity, Audio Emitter scale is separated from the GameObject scale. But in the MSFT_audio_emitter spec, it does mention that audio attenuation should be calculated in the emitter space.

@UX3D-nopper (Contributor) commented May 20, 2022

So in the case of maxDistance what should happen with it when scaled down to 10% of it's original scale?

maxDistance
The maximum distance between the emitter and listener, after which the volume will not be reduced any further. maximumDistance may only be applied when the distanceModel is set to linear. Otherwise, it should be ignored.

Should this respect node scale?

I suggest that the behaviour be like in the KHR_lights_punctual extension:
https://github.com/KhronosGroup/glTF/tree/main/extensions/2.0/Khronos/KHR_lights_punctual
"The light's transform is affected by the node's world scale, but all properties of the light (such as range and intensity) are unaffected."
"Light properties are unaffected by node transforms — for example, range and intensity do not change with scale."

So, the final position of the audio is affected by scale, but not the properties of audio.
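To make that suggestion concrete, here is a small non-normative sketch (TypeScript; the inverse distance-model formula follows the Web Audio API, which this extension mirrors) of "the position is affected by scale, but the properties are not":

type Vec3 = [number, number, number];

const sub = (a: Vec3, b: Vec3): Vec3 => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const len = (v: Vec3): number => Math.hypot(v[0], v[1], v[2]);

// "inverse" distance model, as defined by the Web Audio API.
function inverseGain(distance: number, refDistance: number, rolloffFactor: number): number {
  return refDistance / (refDistance + rolloffFactor * (Math.max(distance, refDistance) - refDistance));
}

// emitterWorldPos already includes the node's (possibly scaled) world transform,
// but refDistance is used exactly as authored: no scale is applied to it.
function positionalGain(emitterWorldPos: Vec3, listenerPos: Vec3,
                        refDistance: number, rolloffFactor: number): number {
  return inverseGain(len(sub(listenerPos, emitterWorldPos)), refDistance, rolloffFactor);
}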

@hybridherbst

This is unfortunately exactly what breaks all existing viewers in Augmented Reality mode, where users can "make the content smaller or bigger" without any agreement on what that means – is it the viewer getting bigger or the content getting smaller or a mix of both. See my comment above for some more descriptive cases.
"node's world scale" is well defined inside one glTF scene but isn't well defined when placing that scene in another context. There's an additional "scene scale" of sorts.

@UX3D-nopper (Contributor) commented May 20, 2022

This is unfortunately exactly what breaks all existing viewers in Augmented Reality mode, where users can "make the content smaller or bigger" without any agreement on what that means – is it the viewer getting bigger or the content getting smaller or a mix of both. See my comment above for some more descriptive cases. "node's world scale" is well defined inside one glTF scene but isn't well defined when placing that scene in another context. There's an additional "scene scale" of sorts.

We should really define this consistently inside glTF. A pure glTF viewer should behave as I described.
I get your point: your use case is using glTF in another application. In that case it is not glTF that defines the behaviour, it is the AR viewer. And the next time, you import the glTF into a game engine, and there it is also different.

BTW, Blender behaves this way, e.g. you can try it out with a point light radius. It stays the same, independent of the scale.

@hybridherbst

Personally I think this says "we should only care about what happens inside the ivory tower"...
I'm not sure what a "pure" glTF Viewer is, everything has a context.

I created a new issue to track this problem outside of just KHR_audio:

@donmccurdy (Contributor) commented May 20, 2022

See #2137 (comment) — I think it is fine to try to define what you're asking for, but I do not think that requirement belongs in KHR_audio or KHR_lights_punctual. A specification that can properly deal with the ambiguities of novel and rapidly changing contexts is required to do what you suggest, and that will burden these extensions too much.

@hybridherbst commented May 20, 2022

Hm, I don't think it belongs tucked away in some unspecified metadata either. Otherwise a case could be made that 95% of extensions would be better suited as metadata.

If you read through the issue I opened, I explicitly mention there that

  • ideally the discussions and definitions around this belong elsewhere (not in either KHR_audio or KHR_lights_punctual)
  • extensions should be encouraged to call out how they intend to behave
  • I don't believe this needs to be normative.

@robertlong (Contributor, PR author) commented Jul 20, 2022

Hey everyone, we've been discussing this spec over the past couple months and we have a few changes that allow for some of the requested features above.

First is splitting audio data and audio sources. Audio Data is similar to an Image where it has the uri or bufferView and mimeType properties. Audio Sources are intended to be used for controlling playback state on one or more Audio Emitters. An Audio Emitter can also reference multiple Audio Sources which allows you to mix multiple audio tracks through an emitter.

We've also changed playing back to autoPlay which signifies that an audio source should play on load.
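A minimal sketch of what an asset might look like with this split, adapting the earlier proposal in this thread with `playing` renamed to `autoPlay` (the emitter's source references are shown here as `sources`, where the earlier proposal used `inputs`; exact property names may differ in the committed schema):

{
  "audio": [
    { "uri": "ambience.mp3" }
  ],
  "sources": [
    { "name": "Ambience", "autoPlay": true, "loop": true, "gain": 0.6, "audio": 0 }
  ],
  "emitters": [
    { "name": "Global Emitter", "type": "global", "gain": 1.0, "sources": [0] }
  ]
}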

We'd love to get everyone's feedback on these changes!

Also, I believe we (The Matrix Foundation) have submitted our application to join Khronos. So I will hopefully be around to participate in the working group meetings in the near future. Hopefully that helps move this spec (and others) forward a little faster.

Comment on lines 17 to 23
"description": "The audio's MIME type. Required if `bufferView` is defined. Unless specified by another extension, the only supported mimeType is `audio/mpeg`.",
"anyOf": [
{
"enum": [ "audio/mpeg" ]
},
{
"type": "string"
Contributor

We should consider supporting more formats than just MP3. For example, WAV files are preferred for short sound effects due to their rapid decode time. There's also Ogg Vorbis, which is a superior format for the same use cases as MP3. There's a long list of other formats, but I don't think we should try to support all of them, maybe just 2 or 3.


TODO (@antpb): update this description to include wav, mp3, and other supported types that are commonly implemented in current engines; state that others are non-standard. Set the type to JSON instead of string, to match https://github.com/KhronosGroup/glTF/blob/main/specification/2.0/schema/image.schema.json#L16C8-L27C15


Oh, don't forget OGG! Lack of support in Safari for OGG was blocking adding it when we first drafted the OMI_audio_emitter spec. Support has since been added, later in 2021.


Would be nice to have FLAC added, too.

Contributor

FLAC is a great format, but I don't think it's the right fit for the base glTF audio spec. FLAC has been supported in all browsers since 2017, and in Unreal, but not Unity or Godot. FLAC is great for lossless audio, but glTF is generally focused on being a last-mile format that is easy to parse by many importers,[1][2] in which case lossless audio is usually unnecessary. For improved portability, I think it's best for the base spec to only support MP3/OGG/WAV to encourage assets to use those, and then FLAC can be non-standard.

Footnotes

  1. https://github.com/KhronosGroup/glTF/issues/1051#issuecomment-544647316

  2. https://nickporcino.com/posts/last_mile_interchange.html

Member

When referring to Ogg audio, the Vorbis codec is usually implied, as Ogg is just a container format. The Ogg format is not supported in Safari, so it won't be added to this extension.


Yeah the main issue here is finding formats with widespread support. By listing a format here, it is mandatory to implement that codec in order to support this glTF extension.
The Safari columns here show the problem: https://caniuse.com/?search=ogg

I would like to say, however, that I don't see any requirement for a given browser rendering engine to support a particular format. As with other glTF extensions, when some feature is not natively used by a particular game engine or renderer, it is possible to transcode at load time.

In the case of Ogg Vorbis, it would be possible to transcode Ogg Vorbis to another supported audio format using widely available open code such as https://github.com/brion/ogv.js/ - would usage of a library like Ogv.js in WASM not address the possible concern of Safari support?

Member

As with other glTF extensions, when some feature is not natively used by a particular game engine or renderer, it is possible to transcode at load time.

The only current examples are the Draco, Meshopt, and Basis compression codecs. They always have to be implemented by engines (likely with the same open-source libraries) and do not depend on browser support.

That said, Safari does support Vorbis audio inside WebM files, so using the latter instead of Ogg could be explored.

@antpb commented Sep 29, 2023

Originally, when this PR spec was written, we opted to only define mp3 in the base spec, with any other types being "non-standard" but still capable of being used through a string definition of the desired MIME type, mainly because every implementation is expected to support every type defined in the base spec. We established that the majority of potential implementors had a common thread of supporting mp3, so everything else would be non-standard and most compatible going forward.

My comment earlier in this thread was about adding wav because of its wide usage for things like sound effects, and to make this more in line with the MSFT spec, which only supported wav. OGG has a similar situation with wide usage, but we get into a long list of MIME types with no real criteria for what should make the list of base-spec supported types.

Since this list could be ever-changing, we should maybe agree on a base list of types with precedent in existing specifications (wav/mp3), agreeing that anything else is non-standard but allowed. This puts us in a situation where any file type can be used, and commonly implemented types (maybe FLAC) will be supported and parsed by engines regardless of the base spec.

It feels unrealistic to expect every implementation to support every file type that we list here, and the list feels fragile, ever changing with many opinions on what should be supported.

TLDR, can we go [ "audio/wav", "audio/mpeg" ] with the spec allowing any other string being used for further supports (like OGG and FLAC)?
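For reference, a sketch of what the `mimeType` schema fragment could look like under that proposal, following the enum-plus-string pattern of image.schema.json (the descriptive wording here is illustrative only):

"mimeType": {
    "anyOf": [
        {
            "enum": [ "audio/wav", "audio/mpeg" ]
        },
        {
            "type": "string"
        }
    ],
    "description": "The audio's MIME type. Required if `bufferView` is defined. Other MIME types may be used but are considered non-standard."
}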

@DR-xR commented Sep 29, 2023

Would it be possible to define audio in a very compartmentalized manner? There would be the overall structure of audio that allows an audio source (a separate sub-extension for each media type) and several different emitters (also sub-extensions). There may even be the possibility of filters and mixers between the audio sources. This suggestion is somewhat along the lines of Web Audio (https://www.w3.org/TR/webaudio/).

It would mean that any implementation would need to implement at least one audio source and one audio emitter. A web-based implementation may be able to integrate with the WebAudio API.

I know this is much more complex than the original comments were discussing. Perhaps it would be good to have a special case audio extension (single source type, single non-spatialized emitter, no filtering) and a more general purpose one.

@capnm commented Sep 30, 2023

There are still some outstanding questions on audio formats, MP3 vs WAV, and what features within these formats should be supported.

My personal opinion on this subject:

Basically, you need at least one lossless codec (for any serious professional audio editing) and one lossy codec, mainly designed to minimize the size of the audio data. I suggest FLAC (wav is a bit outdated) and MP3 (the other lossy variants were more or less developed to work around patents that have since expired). Other common formats can be optionally supported (e.g. as in @antpb's comment).

@antpb commented Sep 30, 2023

@capnm is FLAC supported in major engines? From my quick research, Unity only supports .aif, .wav, .mp3, and .ogg.

We would be specifying that all engines would need to natively support FLAC. While I agree it is a superior file type in a lot of ways, the history of support in wav across engines and browser implementations feels safer. Implementors get it as a freebie vs having to get engines like Unity to support FLAC natively.

@capnm commented Sep 30, 2023

Some engines are at least already on the way to supporting it (slowly realizing the massive advantages of open standards like glTF ;)). You can convert and cache it locally to any format you need with relative ease. I would specify that mp3 format support is mandatory, and for the lossless option you can fall back to wav until the recommended FLAC support is implemented...

@docEdub commented Sep 30, 2023

@capnm is FLAC supported in major engines? From my quick research, Unity only supports .aif, .wav, .mp3, and .ogg.

We would be specifying that all engines would need to natively support FLAC. While I agree it is a superior file type in a lot of ways, the history of support in wav across engines and browser implementations feels safer. Implementors get it as a freebie vs having to get engines like Unity to support FLAC natively.

Unity has supported FLAC since version 2020.1.0. It's in the release notes but not in the documentation, even though it does work.

@antpb commented Jan 18, 2024

Just wanted to give an update on the previous commits made to this PR. In recent OMI Group glTF Extensions meetings we identified some points of clarification needed around some of the properties in KHR_audio as well as some errors in the diagrams in the current PR.

More details on these changes can be found here:
A high-level list of the changes:

  • Add property summary tables for all properties.
  • Specify default values.
  • Add details on the cone angle, specify this is an angular diameter and not a radius (this matches Web Audio).
  • Add note that a max distance of 0 should be treated as an infinite max distance.
  • Remove limitation of max distance only being available on linear, allow it for all.
  • Update the cone diagram to be clearer using radians and better highlighting the different areas of the cone properties.

[cone diagram image]
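As a companion to the diagram, here is a small non-normative sketch (TypeScript) of how an implementation might evaluate the cone gain, treating the cone angles as angular diameters as clarified in the list above; the interpolation follows the Web Audio API cone algorithm that this extension mirrors:

// angleToListener: angle in radians between the emitter's forward direction
// and the direction from the emitter to the listener.
function coneGain(
  angleToListener: number,
  coneInnerAngle: number,  // angular diameter in radians, default 2*PI
  coneOuterAngle: number,  // angular diameter in radians, default 2*PI
  coneOuterGain: number
): number {
  const absAngle = Math.abs(angleToListener);
  const innerHalf = coneInnerAngle / 2; // diameters, so halve them
  const outerHalf = coneOuterAngle / 2;
  if (absAngle <= innerHalf) return 1.0;            // inside the inner cone: no reduction
  if (absAngle >= outerHalf) return coneOuterGain;  // outside the outer cone: constant gain
  // Between the cones: interpolate linearly from 1.0 down to coneOuterGain.
  const t = (absAngle - innerHalf) / (outerHalf - innerHalf);
  return (1 - t) + coneOuterGain * t;
}

With both defaults at 2π, every direction falls inside the inner cone, so the emitter behaves as omnidirectional.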

@antpb commented Feb 8, 2024

Hello again! I would like to get some feedback about an open issue in the KHR_audio extension that this PR originates from. There has been a frequently recurring topic around the naming of KHR_audio and how it is not very descriptive of the actual behavior of the extension. In the original draft of this extension in OMI we went with the name OMI_audio_emitter, and after a year of reflection I've been feeling more and more that it is a better and more descriptive name for this extension.

Also noting, it would be nice to use the name KHR_audio in the future as a more global collection of KHR_audio_* extensions, emitter being the first.

If anyone has strong opinions on the naming, please give this issue a read and please leave any feedback you might have there. We'd like to get general agreement from folks here in this PR before making this change and pushing it to the repo it originates from.

More context: omigroup/gltf-extensions#205

@mikeskydev

I agree with the name change.

Is it worth looking at MPEG_audio_spatial for similarities/contradictions?

@aaronfranke (Contributor)

@mikeskydev Yes, we have looked there. MPEG_audio_spatial makes sense as-is because it is purely focused on spatial audio; however, KHR_audio_emitter also includes global audio. MPEG_audio_spatial also includes some spatial audio features like listener and reverb, which are not present in KHR_audio_emitter, which is focused on emission. Those features could be in future KHR_audio_* extensions, perhaps called KHR_audio_listener and/or KHR_audio_reverb.

Rename from KHR_audio to KHR_audio_emitter and add example file
Comment on lines +228 to +229
| **coneInnerAngle** | `number` | The angular diameter of a cone inside of which there will be no angular volume reduction. | 6.2831853... (τ or 2π rad, 360 deg) |
| **coneOuterAngle** | `number` | The angular diameter of a cone outside of which the volume will be reduced to a constant value of `coneOuterGain`. | 6.2831853... (τ or 2π rad, 360 deg) |
@donmccurdy (Contributor) commented May 27, 2024

Consider naming these innerConeAngle and outerConeAngle for consistency with KHR_lights_punctual? See:

https://github.com/KhronosGroup/glTF/blob/main/specification/2.0/ObjectModel.adoc#52-khr_lights_punctual

@antpb commented May 31, 2024

The original spec was drafted to mirror Web Audio API given the wide acceptance of the property names and behaviors. If there is a KHR reason to align these property names I can get behind it, but my personal feelings are that this goes against what a lot of people have mostly agreed on and understand relative to audio properties.

Here's the PannerNode docs we referenced when drafting the original OMI_audio_emitter spec https://developer.mozilla.org/en-US/docs/Web/API/PannerNode

I'm also partial to being able to quickly see the word cone at the beginning of the name, which is much easier to identify for a human reading a file.

Contributor

Are these names used verbatim outside of the web specification? I do feel that clear inconsistencies among glTF specifications should outweigh borrowing terms verbatim from a particular existing specification.

I have no strong preference about distanceMaximum or range in the comment thread below – these aren't strong matches to existing glTF terms, and there's already some inconsistency in accessor.max vs. .iridescenceThicknessMaximum. OK with me to keep those as-is if you prefer.

But asking implementors to use both innerConeAngle and coneInnerAngle when implementing lights and audio ... I do not love that. 😅

Contributor

I'm not sure that parity with lights is actually going to be an issue for implementers. The code for these should be very separated. I can see it as a cosmetic issue for consistency between specs, but not much more than that.

I find coneInnerAngle easier to read and it's consistent with Web Audio. However, if Khronos wants us to switch to innerConeAngle for consistency with lights, it's easy to make that change. We just need the people in charge to make the decision one way or the other. We also might want to get input from Mozilla or Google.

| ------------------ | -------- | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
| **coneInnerAngle** | `number` | The angular diameter of a cone inside of which there will be no angular volume reduction. | 6.2831853... (τ or 2π rad, 360 deg) |
| **coneOuterAngle** | `number` | The angular diameter of a cone outside of which the volume will be reduced to a constant value of `coneOuterGain`. | 6.2831853... (τ or 2π rad, 360 deg) |
| **coneOuterGain** | `number` | The linear volume gain of the audio emitter set when outside the cone defined by the `coneOuterAngle` property. | 0.0 |
@donmccurdy (Contributor) commented May 27, 2024

See above, perhaps outerConeGain?

| **coneOuterAngle** | `number` | The angular diameter of a cone outside of which the volume will be reduced to a constant value of `coneOuterGain`. | 6.2831853... (τ or 2π rad, 360 deg) |
| **coneOuterGain** | `number` | The linear volume gain of the audio emitter set when outside the cone defined by the `coneOuterAngle` property. | 0.0 |
| **distanceModel** | `string` | Specifies the distance model for the audio emitter. | `"inverse"` |
| **maxDistance** | `number` | The maximum distance between the emitter and listener, beyond which the audio cannot be heard. | 0.0 |
@donmccurdy (Contributor) commented May 27, 2024

Consider naming this property range, for further consistency with KHR_lights_punctual? Or perhaps distanceMaximum for greater consistency with iridescence?

https://github.com/KhronosGroup/glTF/blob/main/specification/2.0/ObjectModel.adoc#58-khr_materials_iridescence

Contributor

maxDistance is currently named this way to clarify the type of range, to be unambiguous with refDistance. If maxDistance is renamed, should refDistance be renamed too?


This one was also done to be consistent with the Web Audio API that this spec was inspired by. Wide acceptance of this API is my only hesitation in renaming these properties.

maxDistance: https://developer.mozilla.org/en-US/docs/Web/API/PannerNode/maxDistance

Contributor

See comment above. I don't feel strongly on this one; there's also some precedent for '.max' in the existing accessor specification.

Comment on lines +186 to +188
### Audio Emitter

Audio emitters define how audio sources are played back. Emitter properties are defined at the document level and are referenced by nodes. Audio may be played globally or positionally. Positional audio has further properties that define how audio volume scales with distance and angle.
Contributor

Should a node's scale affect volume?

Contributor

It would make sense to me if the global scale of the node affected the refDistance and range / maxDistance, but not the volume. This way if you scale up a glTF scene with audio, the audio can be heard at the same volume in the same places relative to that scene, and that scene's audio can be heard from farther away.


I struggle to know how to implement this in a way that content creators can anticipate. It's very common that I will attach some component to an item and scale it to fit into a space. It's also very common that I'll make something in Blender and forget to apply those scales. Making audio volume change based on that scale seems like a high risk of people ending up with unintentionally loud or quiet audio.

I do like the idea outside of this spec and would implement something like a KHR_audio_behavior or KHR_audio_affector that can define these behaviors and maybe be a solid place to reference things like effect chains. Would be cool to use things like the node structure to do scale->volume->reverb(more if bigger, less if smaller)->distortion(when super big)

Contributor

Understood! I don't have a preference, and I think explicitly leaving this behavior undefined could be OK. The question has come up with KHR_lights_punctual, particularly in AR/VR/XR applications, so I just wanted to anticipate it if possible here.

Contributor

Randomly stumbling over this, from some GitHub notification, completely without context:
Iff a node scale should affect an audio volume, then non-uniform scaling will raise questions that should be addressed clearly and explicitly early in the process.

@antpb commented May 31, 2024

For folks following this PR who have concerns about different file types being supported in this spec, @aaronfranke drafted excellent examples of extensions that build on KHR_audio_emitter to support Ogg Vorbis and Opus. We just recently merged these in as draft OMI specs. Anyone with file-type reservations about this spec, please give these a look to see if they solve your concerns 😄
https://github.com/omigroup/gltf-extensions/tree/main/extensions/2.0/OMI_audio_opus
https://github.com/omigroup/gltf-extensions/tree/main/extensions/2.0/OMI_audio_ogg_vorbis

@npolys commented May 31, 2024 via email

@aaronfranke (Contributor)

@npolys What specifically would you like to point out from that spec? What comparisons are you making, and what changes are you proposing?

I see that the specification you linked supports the MIDI and AAC audio formats. What are the intended use cases of these formats? For MIDI, those files just list information about musical notes; their interpretation depends on the system (like how emojis are displayed differently on different operating systems and browsers). Is it intentional for an X3D file using MIDI audio to not be reproducible the same way on all systems? For AAC, please write a summary of how it compares to other formats such as MP3, Ogg Vorbis, and Opus.

I also see that the specification details a lot of things that are not audio emission, such as spatialization, effects, and listener information. For glTF, those can be left to separate extensions. For example, we might have KHR_audio_listener, KHR_audio_reverb, etc. If you want to influence the feature set of these or any other extensions, please give specific feedback, use cases, and other information so that we can plan accordingly.

@cashah commented Jun 28, 2024

Sharing the KHR Audio Framework Proposal here for broader visibility.

During the recent 3D Formats working group meeting, we reviewed the proposal to define the KHR audio glTF specification using an audio graph framework. The purpose of the current document is to delve deeper into that proposal, offering a comprehensive design of the KHR audio graph. This includes a detailed description of each node object within the graph along with functionality, the specific properties associated with it, and how it interacts with other nodes in the graph. The document is structured to facilitate clear understanding and to solicit feedback on the proposed design. Based on the feedback we will update and finalize the design before it is formally schematized into KHR core audio spec, extensions, animations, and interactivity.

cc: @rudybear

@aaronfranke (Contributor) commented Jun 28, 2024

@cashah KHR_audio_graph appears to be a superset of KHR_audio_emitter. Many of its properties are the same, including attenuation properties (distance model, ref distance, max distance, rolloff factor), shape properties (cone inner angle, cone outer angle, cone outer gain), emitter/sink properties (global/positional, gain, source), and playback properties (autoplay, loop, gain).

As for the differences, KHR_audio_emitter defines omnidirectional as a cone angle of Tau radians, while KHR_audio_graph defines omnidirectional explicitly as the shape "type" enum, which also allows for custom types. This may be worth changing in KHR_audio_emitter to be like KHR_audio_graph. For emitting sources, KHR_audio_emitter allows for one emitter to emit multiple sources, but KHR_audio_graph only allows one, which makes sense considering that one input could be the result of a graph that combines multiple. There are also many more properties in KHR_audio_graph such as loop start/end time that are not present in KHR_audio_emitter.

What is the purpose of defining bits per sample, sample count, sample rate, and channels? Those seem redundant with the data already present in the audio file, so I would not expect them to be included in the glTF JSON data.

Which audio file types are allowed? KHR_audio_emitter defines MP3 and WAV as allowed types, similar to how base glTF defines PNG and JPEG as allowed types. KHR_audio_graph doesn't define this.

Is there a plan to rework KHR_audio_graph to be a superset of KHR_audio_emitter, such that graph depends on emitter? As it is now, KHR_audio_emitter will be much easier to implement in game engines compared to an audio graph system, leading to wider adoption. Most 3D assets that emit audio will only need to define "here's what it sounds like and where it emits to", and do not need to define audio graphs. For example, the boom box sample asset is this simple.

Is KHR_audio_graph implemented anywhere? Are there any sample assets? Where are the JSON schemas? Or is it still in the prototype stage, pending rework to be a superset of KHR_audio_emitter?

A scene has a single listener node.

Why impose this restriction? I can imagine situations where you'd want multiple listeners, such as a scene with security cameras that listen for audio at different locations and then either record that audio or feed it through to a screen in a security monitoring room. In KHR_audio_emitter we simply do not define the listener, leaving it up to the implementation or future extensions. We expect that most implementations will by default treat the camera as the point that audio is listened from; for example, this is the case in Unity, Unreal, and Godot.
