SQL optimization using LISTAGG – Avoid repeat self join on child table.

I have scenario where there is OBJECTS table and it has child table ATTRIBUTES. The each row in OBJECTS can have different sets (ie different count, and types) of ATTRIBUTES.  (To make it easy to understand, i keep the tables simple).

id INT,
name VARCHAR2(256),

CREATE TABLE attributes
id INT,
object_id INT,
name VARCHAR2(256),
value VARCHAR2(256),
CONSTRAINT fk_order FOREIGN KEY (object_id) REFERENCES objects(id)

INSERT into objects values(1, ‘toy’);
INSERT into attributes values(1, 1, ‘color’, ‘blue’);
INSERT into attributes values(2, 1, ‘weight’, ‘2lb’);
INSERT into attributes values(3, 1, ‘price’, ‘$25’);


INSERT into objects values(1, ‘food’);
INSERT into attributes values(2, 1, ‘weight’, ‘2lb’);
INSERT into attributes values(3, 1, ‘price’, ‘$25’);

Now we want list each OBJECT and corresponding attribute like.

1 toy blue 2lb $25

o.id id,
o.name name,
a1.value color,
a2.value weight,
a3.value price
objects o,
attributes a1,
attributes a2,
attributes a3
(o.id = a1.object_id and a1.name=’color’)
AND        (o.id = a2.object_id and a2.name=’weight’)
AND       (o.id = a3.object_id and a3.name =’price’)

This worked and gave us the required output. When i looked at this in code review i had some concerns with this approach. The ATTRIBUTES table is self joined 3 time and if  developer needed more attributes then he\she would have to rejoin ATTRIBUTES table that many more times. In production attribute table would have thousands of rows, this will turn into performance issue.

If we had this data in de-normalized form we would not have run into this issue. De-normalizing 2 tables is minor change, but in actual scenario it was connected to lot of other data, the de-normalization would lead to cascading effort. So i approached our DBA to pick their brains on this. He suggest use of LISTAGG, and with his help optimized query looked like below. 

select object_attr.id,
LISTAGG(object_attr.value, ‘|—-|’) WITHIN GROUP (ORDER BY object_attr.attr_name) attribute
from (select o.id, o.name, a.name attr_name, a.value
from objects o,
attributes a
where a.name in (‘color’,’weight’,’price’)
and o.id = a.object_id
) object_attr
GROUP BY object_attr.id, object_attr.name
ORDER BY object_attr.id

1 toy blue|—-|$25|—-|2lb

Explain plan for self joining un-optimized query was

Operation                            Node Cost             Cost CPU           Cost I/O         Cost Optimizer            Cardinality        Bytes Position          Partition Start       Partition Stop              Partition Id
SELECT STATEMENT       0.0 %                      1160                   107,161,454          1150                           ALL_ROWS                  39                            34671                1160

Explain plan for optimized query was

Operation                           Node Cost              Cost CPU             Cost I/O         Cost Optimizer             Cardinality          Bytes Position         Partition Start         Partition Stop        Partition Id
SELECT STATEMENT       0.0 %                     391                       47,603,589               387                            ALL_ROWS                193                              61374                        391

As always, with time we updated the query to get more attributes from ATTRIBUTES Table. The un-optimized query’s CPU and IO cost would have gone up exponentially with each new attribute fetch, but in our optimized query cost was pretty much constant. We had to do little more to parse the attribute field in java.