Optimize using align for 64bit, set strict enum sizes, int to uint16_t, reduce structs sizes

Herman Semenoff requested to merge GermanAizek/dav1d:optimize-align-64bit into master

@another, @mstorsjo, @gramner

Introduction

I noticed a very clickbait bounty. I quickly realized that the company's actual goal was not to overtake the implementation, but to advertise that Rust is only 5% slower than C. Whether it actually pays out or not is another matter. The main thing for Prossimo was to make a fuss about the current rav1d implementation being only 5% slower, so that the general public would think the two languages are equal in speed.

I also noticed a contributor's blog post about trying to optimize rav1d, but he didn't get beyond 1%. Actually, I solved his problem: he came out at 0%. CHICKEN JOCKEY

Well, the first thing I decided to do was look at how dav1d's memory is organized in CPU cache lines, and I noticed that dav1d uses really large structures. It is desirable to keep structures at 64 bytes or less, so that they can fit in a single 64-byte cache line. Since reworking the structures is very difficult, I solved the problem more simply.

At first, out of habit, I tried to align the fields, but I couldn't because some of the data stuck out. Then I remembered that enums can be restricted to strict values so that they fit into 1 byte and can be conveniently aligned by hand. I defined a maximum value for each enum and strictly specified a size of 1 byte for them. I also realized that int in structures is wasteful and decided to shrink it to uint16_t (2 bytes).

If you know how to use pahole, you can inspect the object files of release, debug, non-optimized debug builds, and so on. By default, C/C++ compilers do not change the size of structures unless the programmer specifies a packing or alignment attribute (keyword). The compiler also usually does not shrink an enum to 1 byte on its own unless a flag such as -fshort-enums is set, but I had to align manually to optimize the space anyway, so I did this optimization in advance. Also, don't expect the compiler to change int to uint16_t by itself.

dav1d-vs-rav1d-c-cpp-vs-rust

Summary of changes

  • 1080p: performance up ~3%
  • 4K: performance up ~1%

This PR decreases the cost of copying, moving, and creating structure objects, but only on common 64-bit processors, thanks to the 8-byte data alignment.

The smaller a structure or class is, the higher the chance that it fits into the CPU cache. Most processors are already 64-bit, so the change won't make things any worse.

In each commit message, I described the steps in detail.

Pahole example output for struct Dav1dFrameContext:

  • The comment /* XXX {n} bytes hole, try to pack */ shows where optimization is possible by rearranging the order of fields in structures and classes

Master branch

struct Dav1dFrameContext {
        Dav1dRef *                 seq_hdr_ref;          /*     0     8 */
        Dav1dSequenceHeader *      seq_hdr;              /*     8     8 */
        Dav1dRef *                 frame_hdr_ref;        /*    16     8 */
        Dav1dFrameHeader *         frame_hdr;            /*    24     8 */
        Dav1dThreadPicture         refp[7];              /*    32  2072 */
        /* --- cacheline 32 boundary (2048 bytes) was 56 bytes ago --- */
        Dav1dPicture               cur;                  /*  2104   272 */
        /* --- cacheline 37 boundary (2368 bytes) was 8 bytes ago --- */
        Dav1dThreadPicture         sr_cur;               /*  2376   296 */
        /* --- cacheline 41 boundary (2624 bytes) was 48 bytes ago --- */
        Dav1dRef *                 mvs_ref;              /*  2672     8 */
        refmvs_temporal_block *    mvs;                  /*  2680     8 */
        /* --- cacheline 42 boundary (2688 bytes) --- */
        refmvs_temporal_block *    ref_mvs[7];           /*  2688    56 */
        Dav1dRef *                 ref_mvs_ref[7];       /*  2744    56 */
        /* --- cacheline 43 boundary (2752 bytes) was 48 bytes ago --- */
        Dav1dRef *                 cur_segmap_ref;       /*  2800     8 */
        Dav1dRef *                 prev_segmap_ref;      /*  2808     8 */
        /* --- cacheline 44 boundary (2816 bytes) --- */
        uint8_t *                  cur_segmap;           /*  2816     8 */
        const uint8_t  *           prev_segmap;          /*  2824     8 */
        unsigned int               refpoc[7];            /*  2832    28 */
        unsigned int               refrefpoc[7][7];      /*  2860   196 */
        /* --- cacheline 47 boundary (3008 bytes) was 48 bytes ago --- */
        uint8_t                    gmv_warp_allowed[7];  /*  3056     7 */

        /* XXX 1 byte hole, try to pack */

        CdfThreadContext           in_cdf;               /*  3064    24 */
        /* --- cacheline 48 boundary (3072 bytes) was 16 bytes ago --- */
        CdfThreadContext           out_cdf;              /*  3088    24 */
        struct Dav1dTileGroup *    tile;                 /*  3112     8 */
        int                        n_tile_data_alloc;    /*  3120     4 */
        int                        n_tile_data;          /*  3124     4 */
        struct ScalableMotionParams svc[7][2];           /*  3128   112 */
        /* --- cacheline 50 boundary (3200 bytes) was 40 bytes ago --- */
        int                        resize_step[2];       /*  3240     8 */
        int                        resize_start[2];      /*  3248     8 */
        const Dav1dContext  *      c;                    /*  3256     8 */
        /* --- cacheline 51 boundary (3264 bytes) --- */
        Dav1dTileState *           ts;                   /*  3264     8 */
        int                        n_ts;                 /*  3272     4 */

        /* XXX 4 bytes hole, try to pack */

        const Dav1dDSPContext  *   dsp;                  /*  3280     8 */
        struct {
                recon_b_intra_fn   recon_b_intra;        /*  3288     8 */
                recon_b_inter_fn   recon_b_inter;        /*  3296     8 */
                filter_sbrow_fn    filter_sbrow;         /*  3304     8 */
                filter_sbrow_fn    filter_sbrow_deblock_cols; /*  3312     8 */
                filter_sbrow_fn    filter_sbrow_deblock_rows; /*  3320     8 */
                /* --- cacheline 52 boundary (3328 bytes) --- */
                void               (*filter_sbrow_cdef)(Dav1dTaskContext *, int); /*  3328     8 */
                filter_sbrow_fn    filter_sbrow_resize;  /*  3336     8 */
                filter_sbrow_fn    filter_sbrow_lr;      /*  3344     8 */
                backup_ipred_edge_fn backup_ipred_edge;  /*  3352     8 */
                read_coef_blocks_fn read_coef_blocks;    /*  3360     8 */
                copy_pal_block_fn  copy_pal_block_y;     /*  3368     8 */
                copy_pal_block_fn  copy_pal_block_uv;    /*  3376     8 */
                read_pal_plane_fn  read_pal_plane;       /*  3384     8 */
                /* --- cacheline 53 boundary (3392 bytes) --- */
                read_pal_uv_fn     read_pal_uv;          /*  3392     8 */
        } bd_fn;                                         /*  3288   112 */
        int                        ipred_edge_sz;        /*  3400     4 */

        /* XXX 4 bytes hole, try to pack */

        pixel *                    ipred_edge[3];        /*  3408    24 */
        ptrdiff_t                  b4_stride;            /*  3432     8 */
        int                        w4;                   /*  3440     4 */
        int                        h4;                   /*  3444     4 */
        int                        bw;                   /*  3448     4 */
        int                        bh;                   /*  3452     4 */
        /* --- cacheline 54 boundary (3456 bytes) --- */
        int                        sb128w;               /*  3456     4 */
        int                        sb128h;               /*  3460     4 */
        int                        sbh;                  /*  3464     4 */
        int                        sb_shift;             /*  3468     4 */
        int                        sb_step;              /*  3472     4 */
        int                        sr_sb128w;            /*  3476     4 */
        uint16_t                   dq[8][3][2];          /*  3480    96 */
        /* --- cacheline 55 boundary (3520 bytes) was 56 bytes ago --- */
        const uint8_t  *           qm[19][3];            /*  3576   456 */
        /* --- cacheline 63 boundary (4032 bytes) --- */
        BlockContext *             a;                    /*  4032     8 */
        int                        a_sz;                 /*  4040     4 */

        /* XXX 4 bytes hole, try to pack */

        refmvs_frame               rf;                   /*  4048   208 */
        /* --- cacheline 66 boundary (4224 bytes) was 32 bytes ago --- */
        uint8_t                    jnt_weights[7][7];    /*  4256    49 */

        /* XXX 3 bytes hole, try to pack */

        /* --- cacheline 67 boundary (4288 bytes) was 20 bytes ago --- */
        int                        bitdepth_max;         /*  4308     4 */
        struct {
                int                next_tile_row[2];     /*  4312     8 */
                atomic_int         entropy_progress;     /*  4320     4 */
                atomic_int         deblock_progress;     /*  4324     4 */
                atomic_uint *      frame_progress;       /*  4328     8 */
                atomic_uint *      copy_lpf_progress;    /*  4336     8 */
                Av1Block *         b;                    /*  4344     8 */
                /* --- cacheline 68 boundary (4352 bytes) --- */
                int16_t *          cbi;                  /*  4352     8 */
                pixel *            pal;                  /*  4360     8 */
                uint8_t *          pal_idx;              /*  4368     8 */
                coef *             cf;                   /*  4376     8 */
                int                prog_sz;              /*  4384     4 */
                int                cbi_sz;               /*  4388     4 */
                int                pal_sz;               /*  4392     4 */
                int                pal_idx_sz;           /*  4396     4 */
                int                cf_sz;                /*  4400     4 */

                /* XXX 4 bytes hole, try to pack */

                unsigned int *     tile_start_off;       /*  4408     8 */
        } frame_thread;                                  /*  4312   104 */

        /* XXX last struct has 1 hole */

        /* --- cacheline 69 boundary (4416 bytes) --- */
        struct {
                uint8_t *          level;                /*  4416     8 */
                Av1Filter *        mask;                 /*  4424     8 */
                Av1Restoration *   lr_mask;              /*  4432     8 */
                int                mask_sz;              /*  4440     4 */
                int                lr_mask_sz;           /*  4444     4 */
                int                cdef_buf_plane_sz[2]; /*  4448     8 */
                int                cdef_buf_sbh;         /*  4456     4 */
                int                lr_buf_plane_sz[2];   /*  4460     8 */
                int                re_sz;                /*  4468     4 */

                /* XXX 8 bytes hole, try to pack */

                /* --- cacheline 70 boundary (4480 bytes) --- */
                Av1FilterLUT       lim_lut __attribute__((__aligned__(16))); /*  4480   144 */
                /* --- cacheline 72 boundary (4608 bytes) was 16 bytes ago --- */
                uint8_t            lvl[8][4][8][2] __attribute__((__aligned__(16))); /*  4624   512 */
                /* --- cacheline 80 boundary (5120 bytes) was 16 bytes ago --- */
                int                last_sharpness;       /*  5136     4 */

                /* XXX 4 bytes hole, try to pack */

                uint8_t *          tx_lpf_right_edge[2]; /*  5144    16 */
                uint8_t *          cdef_line_buf;        /*  5160     8 */
                uint8_t *          lr_line_buf;          /*  5168     8 */
                pixel *            cdef_line[2][3];      /*  5176    48 */
                /* --- cacheline 81 boundary (5184 bytes) was 40 bytes ago --- */
                pixel *            cdef_lpf_line[3];     /*  5224    24 */
                /* --- cacheline 82 boundary (5248 bytes) --- */
                pixel *            lr_lpf_line[3];       /*  5248    24 */
                uint8_t *          start_of_tile_row;    /*  5272     8 */
                int                start_of_tile_row_sz; /*  5280     4 */
                int                need_cdef_lpf_copy;   /*  5284     4 */
                pixel *            p[3];                 /*  5288    24 */
                /* --- cacheline 83 boundary (5312 bytes) --- */
                pixel *            sr_p[3];              /*  5312    24 */
                int                restore_planes;       /*  5336     4 */
        } __attribute__((__aligned__(16))) lf __attribute__((__aligned__(16)));           /*  4416   928 */

        /* XXX last struct has 4 bytes of padding, 2 holes */

        struct {
                pthread_mutex_t    lock;                 /*  5344    40 */
                /* --- cacheline 84 boundary (5376 bytes) was 8 bytes ago --- */
                pthread_cond_t     cond;                 /*  5384    48 */
                struct TaskThreadData * ttd;             /*  5432     8 */
                /* --- cacheline 85 boundary (5440 bytes) --- */
                struct Dav1dTask * tasks;                /*  5440     8 */
                struct Dav1dTask * tile_tasks[2];        /*  5448    16 */
                struct Dav1dTask   init_task;            /*  5464    32 */
                int                num_tasks;            /*  5496     4 */
                int                num_tile_tasks;       /*  5500     4 */
                /* --- cacheline 86 boundary (5504 bytes) --- */
                atomic_int         init_done;            /*  5504     4 */
                atomic_int         done[2];              /*  5508     8 */
                int                retval;               /*  5516     4 */
                int                update_set;           /*  5520     4 */
                atomic_int         error;                /*  5524     4 */
                atomic_int         task_counter;         /*  5528     4 */

                /* XXX 4 bytes hole, try to pack */

                struct Dav1dTask * task_head;            /*  5536     8 */
                struct Dav1dTask * task_tail;            /*  5544     8 */
                struct Dav1dTask * task_cur_prev;        /*  5552     8 */
                struct {
                        atomic_int merge;                /*  5560     4 */

                        /* XXX 4 bytes hole, try to pack */

                        /* --- cacheline 87 boundary (5568 bytes) --- */
                        pthread_mutex_t lock;            /*  5568    40 */
                        Dav1dTask * head;                /*  5608     8 */
                        Dav1dTask * tail;                /*  5616     8 */
                } pending_tasks;                         /*  5560    64 */

                /* XXX last struct has 1 hole */
        } task_thread;                                   /*  5344   280 */

        /* XXX last struct has 1 hole */

        struct FrameTileThreadData tile_thread;          /*  5624    16 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 5648, cachelines: 89, members: 55 */
        /* sum members: 5624, holes: 5, sum holes: 16 */
        /* padding: 8 */
        /* member types with holes: 3, total: 4 */
        /* paddings: 2, sum paddings: 8 */
        /* forced alignments: 1 */
        /* last cacheline: 16 bytes */
} __attribute__((__aligned__(16)));

PR

struct Dav1dFrameContext {
        Dav1dRef *                 seq_hdr_ref;          /*     0     8 */
        Dav1dSequenceHeader *      seq_hdr;              /*     8     8 */
        Dav1dRef *                 frame_hdr_ref;        /*    16     8 */
        Dav1dFrameHeader *         frame_hdr;            /*    24     8 */
        Dav1dThreadPicture         refp[7];              /*    32  2016 */
        /* --- cacheline 32 boundary (2048 bytes) --- */
        Dav1dPicture               cur;                  /*  2048   264 */
        /* --- cacheline 36 boundary (2304 bytes) was 8 bytes ago --- */
        Dav1dThreadPicture         sr_cur;               /*  2312   288 */
        /* --- cacheline 40 boundary (2560 bytes) was 40 bytes ago --- */
        Dav1dRef *                 mvs_ref;              /*  2600     8 */
        refmvs_temporal_block *    mvs;                  /*  2608     8 */
        refmvs_temporal_block *    ref_mvs[7];           /*  2616    56 */
        /* --- cacheline 41 boundary (2624 bytes) was 48 bytes ago --- */
        Dav1dRef *                 ref_mvs_ref[7];       /*  2672    56 */
        /* --- cacheline 42 boundary (2688 bytes) was 40 bytes ago --- */
        Dav1dRef *                 cur_segmap_ref;       /*  2728     8 */
        Dav1dRef *                 prev_segmap_ref;      /*  2736     8 */
        uint8_t *                  cur_segmap;           /*  2744     8 */
        /* --- cacheline 43 boundary (2752 bytes) --- */
        const uint8_t  *           prev_segmap;          /*  2752     8 */
        unsigned int               refpoc[7];            /*  2760    28 */
        unsigned int               refrefpoc[7][7];      /*  2788   196 */
        /* --- cacheline 46 boundary (2944 bytes) was 40 bytes ago --- */
        CdfThreadContext           in_cdf;               /*  2984    24 */
        /* --- cacheline 47 boundary (3008 bytes) --- */
        CdfThreadContext           out_cdf;              /*  3008    24 */
        struct Dav1dTileGroup *    tile;                 /*  3032     8 */
        uint16_t                   n_tile_data_alloc;    /*  3040     2 */
        uint16_t                   n_tile_data;          /*  3042     2 */
        struct ScalableMotionParams svc[7][2];           /*  3044    56 */
        /* --- cacheline 48 boundary (3072 bytes) was 28 bytes ago --- */
        uint16_t                   resize_step[2];       /*  3100     4 */
        uint16_t                   resize_start[2];      /*  3104     4 */
        uint16_t                   ipred_edge_sz;        /*  3108     2 */
        uint16_t                   n_ts;                 /*  3110     2 */
        const Dav1dContext  *      c;                    /*  3112     8 */
        Dav1dTileState *           ts;                   /*  3120     8 */
        const Dav1dDSPContext  *   dsp;                  /*  3128     8 */
        /* --- cacheline 49 boundary (3136 bytes) --- */
        struct {
                recon_b_intra_fn   recon_b_intra;        /*  3136     8 */
                recon_b_inter_fn   recon_b_inter;        /*  3144     8 */
                filter_sbrow_fn    filter_sbrow;         /*  3152     8 */
                filter_sbrow_fn    filter_sbrow_deblock_cols; /*  3160     8 */
                filter_sbrow_fn    filter_sbrow_deblock_rows; /*  3168     8 */
                void               (*filter_sbrow_cdef)(Dav1dTaskContext *, int); /*  3176     8 */
                filter_sbrow_fn    filter_sbrow_resize;  /*  3184     8 */
                filter_sbrow_fn    filter_sbrow_lr;      /*  3192     8 */
                /* --- cacheline 50 boundary (3200 bytes) --- */
                backup_ipred_edge_fn backup_ipred_edge;  /*  3200     8 */
                read_coef_blocks_fn read_coef_blocks;    /*  3208     8 */
                copy_pal_block_fn  copy_pal_block_y;     /*  3216     8 */
                copy_pal_block_fn  copy_pal_block_uv;    /*  3224     8 */
                read_pal_plane_fn  read_pal_plane;       /*  3232     8 */
                read_pal_uv_fn     read_pal_uv;          /*  3240     8 */
        } bd_fn;                                         /*  3136   112 */
        pixel *                    ipred_edge[3];        /*  3248    24 */
        /* --- cacheline 51 boundary (3264 bytes) was 8 bytes ago --- */
        ptrdiff_t                  b4_stride;            /*  3272     8 */
        uint16_t                   w4;                   /*  3280     2 */
        uint16_t                   h4;                   /*  3282     2 */
        uint16_t                   bw;                   /*  3284     2 */
        uint16_t                   bh;                   /*  3286     2 */
        uint16_t                   sb128w;               /*  3288     2 */
        uint16_t                   sb128h;               /*  3290     2 */
        uint16_t                   sbh;                  /*  3292     2 */
        uint16_t                   sb_shift;             /*  3294     2 */
        uint16_t                   sb_step;              /*  3296     2 */
        uint16_t                   sr_sb128w;            /*  3298     2 */
        uint16_t                   a_sz;                 /*  3300     2 */
        uint16_t                   bitdepth_max;         /*  3302     2 */
        uint16_t                   dq[8][3][2];          /*  3304    96 */
        /* --- cacheline 53 boundary (3392 bytes) was 8 bytes ago --- */
        const uint8_t  *           qm[19][3];            /*  3400   456 */
        /* --- cacheline 60 boundary (3840 bytes) was 16 bytes ago --- */
        BlockContext *             a;                    /*  3856     8 */
        refmvs_frame               rf;                   /*  3864   208 */
        /* --- cacheline 63 boundary (4032 bytes) was 40 bytes ago --- */
        uint8_t                    jnt_weights[7][7];    /*  4072    49 */
        /* --- cacheline 64 boundary (4096 bytes) was 25 bytes ago --- */
        uint8_t                    gmv_warp_allowed[7];  /*  4121     7 */
        struct {
                atomic_int         entropy_progress;     /*  4128     4 */
                atomic_int         deblock_progress;     /*  4132     4 */
                atomic_uint *      frame_progress;       /*  4136     8 */
                atomic_uint *      copy_lpf_progress;    /*  4144     8 */
                Av1Block *         b;                    /*  4152     8 */
                /* --- cacheline 65 boundary (4160 bytes) --- */
                int16_t *          cbi;                  /*  4160     8 */
                pixel *            pal;                  /*  4168     8 */
                uint8_t *          pal_idx;              /*  4176     8 */
                coef *             cf;                   /*  4184     8 */
                unsigned int *     tile_start_off;       /*  4192     8 */
                uint16_t           next_tile_row[2];     /*  4200     4 */
                uint16_t           prog_sz;              /*  4204     2 */
                uint16_t           cbi_sz;               /*  4206     2 */
                uint16_t           cf_sz;                /*  4208     2 */
                uint16_t           pal_sz;               /*  4210     2 */
                uint16_t           pal_idx_sz;           /*  4212     2 */
        } frame_thread;                                  /*  4128    88 */

        /* XXX last struct has 2 bytes of padding */

        struct {
                uint8_t *          level;                /*  4216     8 */
                /* --- cacheline 66 boundary (4224 bytes) --- */
                Av1Filter *        mask;                 /*  4224     8 */
                Av1Restoration *   lr_mask;              /*  4232     8 */
                uint16_t           mask_sz;              /*  4240     2 */
                uint16_t           lr_mask_sz;           /*  4242     2 */
                uint16_t           cdef_buf_plane_sz[2]; /*  4244     4 */
                uint16_t           cdef_buf_sbh;         /*  4248     2 */
                uint16_t           lr_buf_plane_sz[2];   /*  4250     4 */
                uint16_t           re_sz;                /*  4254     2 */
                Av1FilterLUT       lim_lut;              /*  4256   144 */
                /* --- cacheline 68 boundary (4352 bytes) was 48 bytes ago --- */
                uint8_t            lvl[8][4][8][2];      /*  4400   512 */
                /* --- cacheline 76 boundary (4864 bytes) was 48 bytes ago --- */
                uint8_t *          tx_lpf_right_edge[2]; /*  4912    16 */
                /* --- cacheline 77 boundary (4928 bytes) --- */
                uint8_t *          cdef_line_buf;        /*  4928     8 */
                uint8_t *          lr_line_buf;          /*  4936     8 */
                pixel *            cdef_line[2][3];      /*  4944    48 */
                /* --- cacheline 78 boundary (4992 bytes) --- */
                pixel *            cdef_lpf_line[3];     /*  4992    24 */
                pixel *            lr_lpf_line[3];       /*  5016    24 */
                uint16_t           last_sharpness;       /*  5040     2 */
                uint16_t           start_of_tile_row_sz; /*  5042     2 */
                uint16_t           need_cdef_lpf_copy;   /*  5044     2 */
                uint8_t            restore_planes;       /*  5046     1 */

                /* XXX 1 byte hole, try to pack */

                uint8_t *          start_of_tile_row;    /*  5048     8 */
                /* --- cacheline 79 boundary (5056 bytes) --- */
                pixel *            p[3];                 /*  5056    24 */
                pixel *            sr_p[3];              /*  5080    24 */
        } lf;                                            /*  4216   888 */

        /* XXX last struct has 1 hole */

        struct {
                pthread_mutex_t    lock;                 /*  5104    40 */
                /* --- cacheline 80 boundary (5120 bytes) was 24 bytes ago --- */
                pthread_cond_t     cond;                 /*  5144    48 */
                /* --- cacheline 81 boundary (5184 bytes) was 8 bytes ago --- */
                struct TaskThreadData * ttd;             /*  5192     8 */
                struct Dav1dTask * tasks;                /*  5200     8 */
                struct Dav1dTask * tile_tasks[2];        /*  5208    16 */
                struct Dav1dTask   init_task;            /*  5224    24 */

                /* XXX last struct has 2 holes */

                /* --- cacheline 82 boundary (5248 bytes) --- */
                atomic_int         init_done;            /*  5248     4 */
                atomic_int         done[2];              /*  5252     8 */
                atomic_int         error;                /*  5260     4 */
                atomic_int         task_counter;         /*  5264     4 */
                uint16_t           num_tasks;            /*  5268     2 */
                uint16_t           num_tile_tasks;       /*  5270     2 */
                uint16_t           retval;               /*  5272     2 */
                uint16_t           update_set;           /*  5274     2 */

                /* XXX 4 bytes hole, try to pack */

                struct Dav1dTask * task_head;            /*  5280     8 */
                struct Dav1dTask * task_tail;            /*  5288     8 */
                struct Dav1dTask * task_cur_prev;        /*  5296     8 */
                struct {
                        pthread_mutex_t lock;            /*  5304    40 */
                        /* --- cacheline 83 boundary (5312 bytes) was 32 bytes ago --- */
                        Dav1dTask * head;                /*  5344     8 */
                        Dav1dTask * tail;                /*  5352     8 */
                        atomic_int merge;                /*  5360     4 */
                } pending_tasks;                         /*  5304    64 */

                /* XXX last struct has 4 bytes of padding */
        } task_thread;                                   /*  5104   264 */

        /* XXX last struct has 1 hole */

        struct FrameTileThreadData tile_thread;          /*  5368    16 */

        /* XXX last struct has 6 bytes of padding */

        /* size: 5384, cachelines: 85, members: 55 */
        /* member types with holes: 2, total: 2 */
        /* paddings: 2, sum paddings: 8 */
        /* last cacheline: 8 bytes */
};

Results of alignment and structure reduction

/* size: 5648, cachelines: 89, members: 55 */ -----> /* size: 5384, cachelines: 85, members: 55 */

Literally, 4 cachelines were saved on this one structure. I have given only the Dav1dFrameContext example, because it would take a long time to collect how much the size and the number of cachelines changed for every structure between master and my pull request (merge request).

Benchmark

I used the (old) 1080p Chimera clip and Stream2_AV1_4K_22.7mbps.webm, from which I produced summer_nature_4k.ivf using ffmpeg.

I tested with gcc version 14.2.0 (Debian 14.2.0-19), using the Meson release configuration and the -O3 optimization level.

I have a very old server with 72 threads and two processors on NUMA, and a newer 1U 88-thread Supermicro that also has 2 CPUs on NUMA. I did not test on the Supermicro.

To make the measurements accurate, I used the hyperfine package with 2 warm-up runs and its default of 10 timed runs of dav1d.

P.S. about hyperfine: making tools in Rust (as in Python) for more convenient work with C/C++ projects is more useful than trying to replace an already working solution like dav1d (or any FOSS infrastructure) that has been tested by time and by many contributors, and in which most vulnerabilities have been closed. New projects will have performance and vulnerability issues too. There is no need to waste time reinventing the wheel; there are much more useful things to do.

Master

debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72
  Time (mean ± σ):     30.603 s ±  0.445 s    [User: 728.285 s, System: 18.614 s]
  Range (min … max):   30.026 s … 31.233 s    10 runs
 
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72
  Time (mean ± σ):     46.421 s ±  0.411 s    [User: 562.544 s, System: 13.866 s]
  Range (min … max):   45.373 s … 46.789 s    10 runs

PR

debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72
  Time (mean ± σ):     30.447 s ±  0.384 s    [User: 725.203 s, System: 19.300 s]
  Range (min … max):   29.819 s … 30.988 s    10 runs
 
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72
  Time (mean ± σ):     45.103 s ±  0.268 s    [User: 555.466 s, System: 14.086 s]
  Range (min … max):   44.777 s … 45.498 s    10 runs

References:

https://hpc.rz.rptu.de/Tutorials/AVX/alignment.shtml

https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2013_2014/epc-14-haase-svenhendrik-alignmentinc-presentation.pdf

https://en.wikipedia.org/wiki/Data_structure_alignment

https://stackoverflow.com/a/20882083

https://zijishi.xyz/post/optimization-technique/learning-to-use-data-alignment/

My home lab (camera from FOSS Ubuntu Touch):

game-server-and-firewall gpl-altar

Edited by Jean-Baptiste Kempf
