Skyline Query in T-SQL

January 5, 2012 by · Leave a Comment
Filed under: Programming, SQL 

Here is the sample database schema as an example to illustrate the procedure and process of implementing a Skyline Query that works with Microsoft Sql Server, this should work with all SQL Server that support nested queries. 

 
 
CREATE TABLE [dbo].[Housing](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [DistanceFromWork] [numeric](13, 4) NOT NULL,
    [Price] [money] NOT NULL,
    [SquareFootage] [int] NOT NULL,
    [Name] [nvarchar](50) NOT NULL,
    [City] [nchar](10) NULL,
 CONSTRAINT [PK_Housing] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
 
 
SET IDENTITY_INSERT [dbo].[Housing] ON
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (1, CAST(1.2000 AS Numeric(13, 4)), 320394.0000, 1600, N'Brooke', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (2, CAST(8.8000 AS Numeric(13, 4)), 179000.0000, 1924, N'Mica', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (3, CAST(1.1000 AS Numeric(13, 4)), 280000.0000, 1800, N'Blackbone', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (4, CAST(12.2000 AS Numeric(13, 4)), 180000.0000, 1923, N'Vienna', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (5, CAST(0.1000 AS Numeric(13, 4)), 500000.0000, 1200, N'Zappod', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (6, CAST(14.2000 AS Numeric(13, 4)), 88000.0000, 3200, N'Rancher', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (7, CAST(1.3000 AS Numeric(13, 4)), 310000.0000, 2300, N'Joomba', N'Boston    ')
INSERT [dbo].[Housing] ([ID], [DistanceFromWork], [Price], [SquareFootage], [Name], [City]) 
VALUES (8, CAST(0.9000 AS Numeric(13, 4)), 312222.0000, 1900, N'Dumbo', N'Boston    ')
SET IDENTITY_INSERT [dbo].[Housing] OFF
 

Here is the query that will do a three-dimensional skyline query here is an output before and after.

------listing for compare------
SELECT *
  FROM [Test].[dbo].[Housing]
order by distancefromwork, price
 
--------optimization query---------
 
    SELECT *
FROM [Housing] h
WHERE h.city = N'Boston' AND NOT EXISTS(
SELECT *
FROM [Housing] h1
WHERE h1.city = N'Boston' AND h1.DistanceFromWork <= h.DistanceFromWork AND
h1.price <= h.price AND  h1.SquareFootage >= h.SquareFootage and
(h1.DistanceFromWork < h.DistanceFromWork OR h1.price < h.price OR h1.squarefootage > h.squarefootage))
order by distancefromwork, price 

 

The performance of this is not very good as it cannot be easily optimized since you are nesting queries to compare and remove other records that are dominated by the ones that you care about. 

BEFORE

ID    DistanceFromWork    Price    SquareFootage    Name    City
5    0.1000    500000.00    1200    Zappod    Boston    
8    0.9000    312222.00    1900    Dumbo    Boston    
3    1.1000    280000.00    1800    Blackbone    Boston    
1    1.2000    320394.00    1600    Brooke    Boston    
7    1.3000    310000.00    2300    Joomba    Boston    
2    8.8000    179000.00    1924    Mica    Boston    
4    12.2000    180000.00    1923    Vienna    Boston    
6    14.2000    88000.00    3200    Rancher    Boston    

 

AFTER

ID    DistanceFromWork    Price    SquareFootage    Name    City
5    0.1000    500000.00    1200    Zappod    Boston    
8    0.9000    312222.00    1900    Dumbo    Boston    
3    1.1000    280000.00    1800    Blackbone    Boston    
7    1.3000    310000.00    2300    Joomba    Boston    
2    8.8000    179000.00    1924    Mica    Boston    
6    14.2000    88000.00    3200    Rancher    Boston    

As you can see it removed item # 1 and #4, as these were dominated by other properties in the system.

 

I will explore this in the future to look at other methods for achieving similar results, getting what you really want out of a system.

Skyline Query

January 4, 2012 by · 1 Comment
Filed under: Design, Programming 

A Skyline query is one of the most useful but least obvious data programming patterns.  When you have a set of data and a query with 2 constraints that are conflicting, the output of this is the “Skyline” Query.  If you want to thing about what skyline means and why it describes this issue take the example of searching for cars to drive, and lets say you want the biggest car with the best gas mileage.  These two constraints are conflicting since as cars get bigger, gas mileage usually goes down, but a skyline query essentially runs this query across a set of data and then grabs the tops of the skylines. This would take a cluster of data and grab the top 2 or 3 in that set, then move along to find the next set, this way you can give the user choices without having to come up with an artificial weight or

 

A more realistic example would be searching for a house that is the cheapest cost, that is the biggest, and say closest to where you work.  This is one of the rare data patterns that has so many real world applications it is not commonly known below we will show an implementation of an efficient algorithm shown in this paper to offer stellar performance.

The Skyline Operator

They showed that the best performance for a multi-dimensional skyline processor is a Block – Nested Loop.

Here is some psudo-code with implementation coming next. I’ll show brute-force then also

an implementation with the block-nested loop for T-SQL..

   1:  M – Input of dimensional points
   2:  R output of dimensional points
   3:  T temporary file for dimensional points
   4:  S set of dimensional points
   5:  p < q point p is dominated by point q
   6:   
   7:  function SkylineBNL(Input)
   8:  {
   9:  Output = 0; 
  10:  Temp = 0; 
  11:  Set=0;
  12:  CountIn = 0;
  13:  CountOut = 0;
  14:   
  15:  while(input)
  16:  do
  17:  {
  18:  foreach p in Set
  19:  {
  20:  if(TimeStamp(p)=countin 
  21:  {
  22:  save(Output, p), Release(p)
  23:  }
  24:  Load(Input,p)
  25:  TimeStamp(p) = CountOut;
  26:  CountIn++
  27:  foreach(q in Set)
  28:  {
  29:  if(p>q) release (p) break;
  30:  if(p<q) then reelase(q);
  31:  }
  32:  if(MemoryAvailable)
  33:  {
  34:  save(Temp,p), release(p)
  35:  CountOut ++;
  36:  }
  37:  //keep it running to iterate through entire sets.
  38:  if(Main == empty)
  39:  {
  40:  Main = Temp;
  41:  Temp=0;CountIn=0;CountOut=0;
  42:  }
  43:  }
  44:  }
  45:   
  46:  //flush
  47:  foreach(p in Set)
  48:  {
  49:  save(Output,p), release(p)
  50:  }
  51:   
  52:  return Output;