1D inverse type-II DCT AVX2 optimisations #114

frankplow · 2023-07-19T11:11:33Z

This PR adds AVX2 optimisations for the type-II DCT. For now, these optimisations are only implemented at the 1D level.

Performance results:

| Bitstream | Before | After | Delta |
| --- | --- | --- | --- |
| RitualDance_1920x1080_60_10_420_32_LD | 101.0 | 104.0 | +3.0% |
| RitualDance_1920x1080_60_10_420_37_RA | 89.0 | 90.0 | +1.1% |
| Tango2_3840x2160_60_10_420_27_LD | 24.0 | 25.0 | +4.2% |

benchmarking with Linux Perf Monitoring API
nop: 86.0
checkasm: using random seed 1542246724
AVX2:
 - vvc_itx_1d.idct2 [OK]
checkasm: all 6 tests passed
vvc_inv_dct2_2_c: 16.0
vvc_inv_dct2_2_avx2: 16.2
vvc_inv_dct2_4_c: 19.7
vvc_inv_dct2_4_avx2: 17.0
vvc_inv_dct2_8_c: 39.7
vvc_inv_dct2_8_avx2: 29.2
vvc_inv_dct2_16_c: 132.7
vvc_inv_dct2_16_avx2: 54.5
vvc_inv_dct2_32_c: 379.0
vvc_inv_dct2_32_avx2: 115.7
vvc_inv_dct2_64_c: 1527.7
vvc_inv_dct2_64_avx2: 385.0

frankplow · 2023-07-19T11:50:03Z

libavcodec/vvc/vvcdsp.h

@@ -172,16 +172,22 @@ typedef struct VVCDSPContext {
    VVCALFDSPContext alf;
 } VVCDSPContext;

-void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
+void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth,


IMO it would be more elegant to pass VVCFrameParamSets, VVCFrameContext or something else more general here (and the call site for ff_vvc_dsp_init supports it), however including vvc_ps.h in vvcdsp.h introduced a lot of compilation errors.

Yes, at this level we had better include only a small set of headers.

frankplow · 2023-07-19T11:51:52Z

629db28 disables the optimisations when sps_extended_precision_flag is set. A new set of functions will need to be written in order to support transform coefficients larger than 16 bits.
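A minimal sketch of how such a gate could look at init time. All names here (`SketchDSPContext`, `sketch_dsp_init`, the `have_avx2` and `extended_precision` parameters, and the stand-in "AVX2" function) are hypothetical and not taken from the FFVVC tree; the point is only that the C fallback stays in place whenever extended precision is enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1D transform signature with element strides. */
typedef void (*itx_1d_fn)(int *out, ptrdiff_t out_step,
                          const int *in, ptrdiff_t in_step);

/* Hypothetical, cut-down DSP context holding a single function pointer. */
typedef struct SketchDSPContext {
    itx_1d_fn inv_dct2_4;
} SketchDSPContext;

/* Plain C 4-point inverse DCT-II (no shift or clipping). */
static void inv_dct2_4_c(int *out, ptrdiff_t out_step,
                         const int *in, ptrdiff_t in_step)
{
    static const int m[4][4] = {
        { 64,  64,  64,  64 },
        { 83,  36, -36, -83 },
        { 64, -64, -64,  64 },
        { 36, -83,  83, -36 },
    };
    for (int n = 0; n < 4; n++) {
        int sum = 0;
        for (int k = 0; k < 4; k++)
            sum += in[k * in_step] * m[k][n];
        out[n * out_step] = sum;
    }
}

/* Stand-in for an assembly routine: it simply forwards to the C version. */
static void inv_dct2_4_avx2_standin(int *out, ptrdiff_t out_step,
                                    const int *in, ptrdiff_t in_step)
{
    inv_dct2_4_c(out, out_step, in, in_step);
}

static void sketch_dsp_init(SketchDSPContext *c, int have_avx2,
                            int extended_precision)
{
    assert(c != NULL);
    c->inv_dct2_4 = inv_dct2_4_c; /* always-valid fallback */
    /* Assumption: the SIMD path only handles 16-bit coefficients,
     * so it is skipped when extended precision is signalled. */
    if (have_avx2 && !extended_precision)
        c->inv_dct2_4 = inv_dct2_4_avx2_standin;
}
```

Calling `sketch_dsp_init(&c, 1, 1)` leaves the C function installed, while `sketch_dsp_init(&c, 1, 0)` selects the stand-in.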

nuomi2021 · 2023-07-19T12:33:18Z

Did you check the HEVC IDCT checkasm output? Is it in line with your results?
Thank you.

vvc_inv_dct2_2_c: 16.0
vvc_inv_dct2_2_avx2: 16.2
vvc_inv_dct2_4_c: 19.7
vvc_inv_dct2_4_avx2: 17.0
vvc_inv_dct2_8_c: 39.7
vvc_inv_dct2_8_avx2: 29.2
vvc_inv_dct2_16_c: 132.7
vvc_inv_dct2_16_avx2: 54.5
vvc_inv_dct2_32_c: 379.0
vvc_inv_dct2_32_avx2: 115.7
vvc_inv_dct2_64_c: 1527.7
vvc_inv_dct2_64_avx2: 385.0

frankplow · 2023-07-19T12:49:29Z

Here are the relevant entries from the HEVC IDCT checkasm benchmark:

hevc_idct_4x4_8_c: 141.5
hevc_idct_4x4_8_avx: 44.7
hevc_idct_4x4_10_c: 133.2
hevc_idct_4x4_10_avx: 43.5
hevc_idct_8x8_8_c: 870.7
hevc_idct_8x8_8_avx: 134.2
hevc_idct_8x8_10_c: 879.2
hevc_idct_8x8_10_avx: 137.2
hevc_idct_16x16_8_c: 5861.0
hevc_idct_16x16_8_avx: 696.2
hevc_idct_16x16_10_c: 5835.5
hevc_idct_16x16_10_avx: 695.5
hevc_idct_32x32_8_c: 47877.5
hevc_idct_32x32_8_avx: 3863.0
hevc_idct_32x32_10_c: 47965.5
hevc_idct_32x32_10_avx: 3856.2

Note that the HEVC optimisations are performed at the 2D level rather than the 1D level. Many of the instructions in the SIMD optimisations are spent loading data into and extracting data from the SIMD registers. This is all the more true for FFVVC due to the strides in the IDCT function signature. The FFVVC IDCT can be optimised at the 2D level in the future to get performance gains closer to HEVC's, but for now the 1D optimisations work alone and they provide the backbone needed for any future optimisation.
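To illustrate the 1D versus 2D distinction, here is a rough sketch of a 4x4 inverse transform built from a strided 1D routine. The function names and the `(out, out_step, in, in_step)` signature are assumptions made for illustration, not the actual FFVVC prototypes, and the shifts and clipping a real decoder applies between passes are left out.

```c
#include <assert.h>
#include <stddef.h>

/* 4-point integer DCT-II basis (rows are basis functions). */
static const int dct2_4[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* 1D inverse transform: out[n] = sum_k in[k] * dct2_4[k][n].
 * Strides are in elements, so the same routine can walk a row or a column. */
static void inv_dct2_4_1d(int *out, ptrdiff_t out_step,
                          const int *in, ptrdiff_t in_step)
{
    assert(out != NULL && in != NULL);
    for (int n = 0; n < 4; n++) {
        int sum = 0;
        for (int k = 0; k < 4; k++)
            sum += in[k * in_step] * dct2_4[k][n];
        out[n * out_step] = sum;
    }
}

/* 2D inverse transform as two 1D passes over a row-major 4x4 block. */
static void inv_dct2_4x4(int *dst, const int *coeffs)
{
    int tmp[16];
    for (int x = 0; x < 4; x++)      /* vertical pass: step 4 walks a column */
        inv_dct2_4_1d(tmp + x, 4, coeffs + x, 4);
    for (int y = 0; y < 4; y++)      /* horizontal pass: step 1 walks a row */
        inv_dct2_4_1d(dst + 4 * y, 1, tmp + 4 * y, 1);
}
```

In the vertical pass every element access is strided, which is the kind of load and store overhead a 1D SIMD kernel has to pay on each call; a 2D kernel can keep the whole block in registers between the two passes.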

nuomi2021 · 2023-07-19T13:15:11Z

> but for now the 1D optimisations work alone and they provide the backbone needed for any future optimisation.

How about dav1d? Does it have a similar 1D function, or is it 2D only?

frankplow · 2023-07-19T13:20:01Z

> > but for now the 1D optimisations work alone and they provide the backbone needed for any future optimisation.
>
> How about dav1d? Does it have a similar 1D function, or is it 2D only?

dav1d uses 2D and then some, incorporating some of the vectorisation as well to save a transpose operation. According to this lecture, this allowed them to double performance compared to only 1D SIMD optimisations. It's worth noting that doing these higher-level optimisations comes at a cost in terms of complexity though. dav1d has over 10,000 lines of inverse transform assembly for AVX2 alone!

nuomi2021 · 2023-07-19T13:31:42Z

> dav1d has over 10,000 lines of inverse transform assembly for AVX2 alone!

It was worth it. dav1d is the fastest decoder we have seen so far, and the current VVC transform functions cost 10% of CPU time for some files.
Is it possible to just use their code directly? The DCT functions are similar; we may only need to change some parameters (asm tables).

frankplow · 2023-07-19T13:56:45Z

> Is it possible to just use their code directly? The DCT functions are similar; we may only need to change some parameters (asm tables).

I will look into this. I don't think it will be quite this simple - some internal data representations in FFVVC will need to be changed as dav1d relies on packed input data but it looks like there is only one place non-packed transform coefficients are actually used in FFVVC.
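As a small illustration of the representation change, the sketch below gathers strided coefficients into a contiguous buffer. It assumes that "packed" here means contiguous in memory; the helper name `pack_strided` is hypothetical and does not exist in FFVVC or dav1d.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n coefficients that sit src_step elements apart into a
 * contiguous (packed) destination buffer. */
static void pack_strided(int *dst, const int *src, ptrdiff_t src_step, int n)
{
    assert(dst != NULL && src != NULL && n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = src[i * src_step];
}
```

For a row-major 4x4 block, `pack_strided(col, block + 1, 4, 4)` would collect the second column into `col`. Storing coefficients packed in the first place would avoid this copy altogether.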

nuomi2021 · 2023-07-19T14:09:41Z

> I will look into this

👍 We can start with a 2x2 or 4x4 block: zero the entire block and set the first coefficient to 1.
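A minimal sketch of that kind of impulse test, shown for a 1D size-4 transform rather than a full 2D block. The names (`itx_1d_fn`, `ref_inv_dct2_4`, `impulse_mismatches`) and the signature are assumptions for illustration; in practice the candidate would be the ported assembly routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TX_SIZE 64

/* Hypothetical 1D transform signature with element strides. */
typedef void (*itx_1d_fn)(int *out, ptrdiff_t out_step,
                          const int *in, ptrdiff_t in_step);

/* Reference 4-point inverse DCT-II (no shift or clipping). */
static void ref_inv_dct2_4(int *out, ptrdiff_t out_step,
                           const int *in, ptrdiff_t in_step)
{
    static const int m[4][4] = {
        { 64,  64,  64,  64 },
        { 83,  36, -36, -83 },
        { 64, -64, -64,  64 },
        { 36, -83,  83, -36 },
    };
    for (int n = 0; n < 4; n++) {
        int sum = 0;
        for (int k = 0; k < 4; k++)
            sum += in[k * in_step] * m[k][n];
        out[n * out_step] = sum;
    }
}

/* Zero the input, set one coefficient to 1, run both implementations
 * and return the number of output positions where they disagree. */
static int impulse_mismatches(itx_1d_fn ref, itx_1d_fn cand, int size, int pos)
{
    int in[MAX_TX_SIZE], out_ref[MAX_TX_SIZE], out_cand[MAX_TX_SIZE];
    int bad = 0;

    assert(size > 0 && size <= MAX_TX_SIZE && pos >= 0 && pos < size);
    memset(in, 0, sizeof(in));
    memset(out_ref, 0, sizeof(out_ref));
    memset(out_cand, 0, sizeof(out_cand));
    in[pos] = 1;

    ref(out_ref, 1, in, 1);
    cand(out_cand, 1, in, 1);

    for (int i = 0; i < size; i++)
        bad += out_ref[i] != out_cand[i];
    return bad;
}
```

With only the first coefficient set, each output of the reference equals the first basis row, so a mismatch points at a scaling or table difference rather than at the data layout.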

> some internal data representations in FFVVC will need to be changed

No problem. You can make any reasonable change.

Jamaika1 · 2023-07-20T04:12:21Z

libavcodec/x86/vvc_itx_1d.asm

+    m21, m22, m23, m24, \
+    m11, m12, m13, m14
+
+const vvc_dct2_8_odd_mat, dw matvec_mul_4_permute(dct2_8_odd_mat_permute( \


vvc_itx_1d.asm:49: warning: single-line macro `matvec_mul_4_permute' exists, but not taking 1 parameter [-w+pp-macro-params-single]

I think this is a bug in NASM - the inner macro is expanded. These macros could be removed and these permutations applied directly to the transform matrices at the cost of readability, or we could look at adding -w-pp-macro-params-single, if we really want to get rid of the warning.

frankplow · 2023-08-21T13:28:21Z

Rebase and re-target onto main.

frankplow · 2023-08-27T14:20:27Z

Reset to 6105322. Work done porting FFmpeg HEVC ASM can now be found at frankplow:ffmpeg-hevc-idct/#130. This has been done as there is little in common between the two trees.

frankplow force-pushed the idct-asm branch from 7644506 to fba5cc2 Compare July 19, 2023 11:15

frankplow mentioned this pull request Aug 21, 2023

WIP: dav1d ITX AVX2 Assembly Port #117

Closed

frankplow added 6 commits August 21, 2023 14:15

Add VVC 1D ITX checkasm

5917740

lavc/x86/vvc_itx_1d: Initial AVX2 optimisations for VVC IDCT2

32070fa

lavc/x86/vvc_itx_1d.asm: Add size 64 AVX2 IDCT2 optimisation

f24c912

lavc/x86/vvc_itx_1d: Don't use AVX2 optimisations when extended_precision_flag set

ac07a5a

lavc/x86/vvc_itx_1d: Fix assembly with YASM

2036efb

YASM does not supported unnamed contexts, so give all contexts names

lavc/x86/vvc_itx_1d: Fix YASM build

6105322

Replace `x%[y]` with `x %+ y`

frankplow force-pushed the idct-asm branch from dabc1e6 to 6105322 Compare August 21, 2023 13:28

frankplow changed the base branch from 20230811 to main August 21, 2023 13:28

frankplow force-pushed the idct-asm branch from 4bb4279 to 6105322 Compare August 27, 2023 14:20
