Friday, December 19, 2014

Using filtered SQL queries for building big data dashboards


    Dashbuilder is a tool I like to describe as a “micro” BI. It lets the user create dashboards and showcase their data using pie, bar or line charts, as well as display data in tabular form. Data can be loaded from plain text files like CSV or queried through a database connection. When the data is small enough, Dashbuilder can handle the whole set in memory pretty well, as long as it doesn't exceed the 2MB size limit. However, most of the time our data sets are bigger and we can't upload all the data for Dashbuilder to handle on its own. It is in these cases where database-backed queries can help us implement nice drill-down reports and charts without preloading all the data.

   Let's take as an example a very simple stock exchange dashboard which is fed from two database tables:
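The exact table definitions ship with the sample's H2 script, but a minimal sketch of the two tables, inferred from the columns used in the queries below, could look like this (column types are assumptions; here modeled with SQLite just for illustration):

```python
import sqlite3

# Sketch of the two tables backing the dashboard, inferred from the
# columns referenced in the SQL queries below. Types are assumptions;
# the sample ships with its own H2 script (stocktrade-h2.sql).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COMPANY (
    ID       INTEGER PRIMARY KEY,
    NAME     VARCHAR(100),
    COUNTRY  VARCHAR(100)
);
CREATE TABLE STOCK (
    ID_COMPANY       INTEGER REFERENCES COMPANY(ID),
    PRICE_PER_SHARE  NUMERIC,
    CLOSING_DATE     DATE
);
""")

# A sample row per table, joined the same way the dashboard queries do.
conn.execute("INSERT INTO COMPANY VALUES (1, 'ACME', 'USA')")
conn.execute("INSERT INTO STOCK VALUES (1, 10.5, '2014-12-01')")
rows = conn.execute(
    "SELECT C.NAME, S.PRICE_PER_SHARE FROM COMPANY C "
    "JOIN STOCK S ON (C.ID = S.ID_COMPANY)").fetchall()
print(rows)  # [('ACME', 10.5)]
```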


    The dashboard displays some indicators about several companies from several countries selling their shares at a given price at each day's closing date. It is made up of 4 KPIs (Key Performance Indicators), as you can see in the following screenshot:



    All the indicators display data coming from the two database tables defined above.


  • Bar chart - Average price per company
  • Area chart - Sales price evolution
  • Pie chart - Companies per country
  • Table report - Stock prices at closing date 

   At the end of this article [1] you'll find detailed instructions on how to download and install this example dashboard. What we're going to discuss next is the two strategies we can use for building a dashboard. This is an important aspect to consider, especially if we're facing big data scenarios.


The in-memory strategy


   This strategy consists of creating a data provider which loads all the data set rows by executing a single SQL query over the two tables.

   SELECT C.NAME, C.COUNTRY, S.PRICE_PER_SHARE, S.CLOSING_DATE
  FROM COMPANY C JOIN STOCK S ON (C.ID=S.ID_COMPANY)

  Every single indicator on the dashboard consumes the same data set. When filters are executed from the UI, no further SQL queries are issued since all the calculations are done over the in-memory data set. The following video shows a browser window alongside a bash console, demonstrating that only a single SQL query is executed when the dashboard is initialized.
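In rough terms (a sketch of the idea, not Dashbuilder's actual code), the in-memory strategy amounts to fetching the joined rows once and then computing every indicator from that single result set, with no further round trips to the database:

```python
import sqlite3
from collections import defaultdict

# Sketch of the in-memory strategy (not Dashbuilder's actual code):
# one SQL query loads the full joined data set at initialization time,
# and every indicator is then computed over the in-memory rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COMPANY (ID INTEGER, NAME TEXT, COUNTRY TEXT);
CREATE TABLE STOCK (ID_COMPANY INTEGER, PRICE_PER_SHARE REAL, CLOSING_DATE TEXT);
INSERT INTO COMPANY VALUES (1, 'ACME', 'USA'), (2, 'FOO', 'UK');
INSERT INTO STOCK VALUES (1, 10.0, '2014-12-01'), (1, 12.0, '2014-12-02'),
                         (2, 20.0, '2014-12-01');
""")

# The single query executed when the dashboard is initialized.
dataset = conn.execute("""
    SELECT C.NAME, C.COUNTRY, S.PRICE_PER_SHARE, S.CLOSING_DATE
    FROM COMPANY C JOIN STOCK S ON (C.ID = S.ID_COMPANY)
""").fetchall()

# Bar chart indicator: average price per company, computed entirely
# in memory -- filtering or regrouping would issue no further SQL.
prices = defaultdict(list)
for name, country, price, closing_date in dataset:
    prices[name].append(price)
avg_per_company = {name: sum(p) / len(p) for name, p in prices.items()}
print(avg_per_company)  # {'ACME': 11.0, 'FOO': 20.0}
```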





  Pros:
  • Data retrieval logic stays very simple
  • Only a single data provider is needed
  • Faster configuration of KPIs, since all the data set properties are available at design time
  • Multiple indicators can feed from a single data provider
  Cons:
  • Can't be applied to medium/large data sets due to poor performance


The native strategy


  The native approach consists of having a data provider for every indicator in the dashboard, instead of loading and handling the whole data set in memory. Every KPI is told exactly what data it has to display. The next video shows a fully SQL-based version of the stock sales dashboard. As you can see, every time the user filters on the dashboard, some SQL queries are executed. No data is held in memory; the dashboard always asks the DB for the data.



  On every filter request the SQL queries are parsed, the filter values are injected, and the queries are re-executed. The SQL data providers are the following:

  Bar chart - Average price per company

    SELECT C.NAME, AVG(S.PRICE_PER_SHARE)
  FROM COMPANY C JOIN STOCK S ON (C.ID=S.ID_COMPANY)
  WHERE {sql_condition, optional, c.country, country}
  AND {sql_condition, optional, c.name, name}
  GROUP BY C.NAME

  Area chart - Sales price evolution

  SELECT S.CLOSING_DATE, AVG(S.PRICE_PER_SHARE)
  FROM COMPANY C JOIN STOCK S ON (C.ID=S.ID_COMPANY)
  WHERE {sql_condition, optional, c.country, country}
  AND {sql_condition, optional, c.name, name}
  GROUP BY CLOSING_DATE

  Pie chart - Companies per country

  SELECT COUNTRY, COUNT(ID)
  FROM COMPANY
  WHERE {sql_condition, optional, country, country}
  AND {sql_condition, optional, name, name}
  GROUP BY COUNTRY

  Table report

  SELECT C.NAME, C.COUNTRY, S.PRICE_PER_SHARE, S.CLOSING_DATE
  FROM COMPANY C JOIN STOCK S ON (C.ID=S.ID_COMPANY)
  WHERE {sql_condition, optional, c.country, country}
  AND {sql_condition, optional, c.name, name}


   As you can see, every KPI delegates the filter & group-by operations to the database. The filter magic happens thanks to the {sql_condition} statements. Every time a filter occurs in the UI, the Dashbuilder core gets all the SQL data providers referenced by the KPIs and parses those SQLs, injecting the current filter selections made by the user. The signature of the sql_condition clause is the following:

    {sql_condition, [optional | required], [db column], [filter property]}  where:

  • optional: if no filter exists for the given property then the condition is ignored.
  • required: if no filter is present then the SQL returns no data.
  • db column: the database column the current filter is applied to.
  • filter property: the UI property whose selected values are taken.
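A rough sketch of what such a pre-processing step could look like (an illustration of the idea, not Dashbuilder's actual parser) is a text substitution over the query template: each marker becomes a real condition when the UI holds a filter for its property, an optional marker with no active filter collapses to a neutral `1=1`, and a required one collapses to `1=0` so the query returns no data:

```python
import re

# Illustrative sketch of the {sql_condition, ...} substitution idea
# (not Dashbuilder's actual parser). Real code would use bind
# parameters instead of interpolating values into the SQL string.
PATTERN = re.compile(
    r"\{sql_condition,\s*(optional|required),\s*([\w.]+),\s*(\w+)\}")

def inject_filters(sql, ui_filters):
    """Replace every {sql_condition, ...} marker using the current UI selections."""
    def substitute(match):
        mode, db_column, filter_prop = match.groups()
        values = ui_filters.get(filter_prop)
        if not values:
            # No active filter: optional markers vanish, required ones
            # force an empty result set.
            return "1=1" if mode == "optional" else "1=0"
        placeholders = ", ".join("'%s'" % v for v in values)
        return "%s IN (%s)" % (db_column, placeholders)
    return PATTERN.sub(substitute, sql)

template = ("SELECT C.NAME, AVG(S.PRICE_PER_SHARE) "
            "FROM COMPANY C JOIN STOCK S ON (C.ID=S.ID_COMPANY) "
            "WHERE {sql_condition, optional, c.country, country} "
            "AND {sql_condition, optional, c.name, name} "
            "GROUP BY C.NAME")

# The user filtered on country only: the name condition collapses to 1=1.
print(inject_filters(template, {"country": ["USA"]}))
```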

 Pros:
  • Support for high volumes of data. The database tables need to be properly indexed though.
 Cons:
  • The set-up of the data providers is a bit trickier, as it requires creating SQL queries with the required filter, group-by and sort operations for every KPI.

   When designing a dashboard, never forget to think thoroughly about the origin, type and volume of the data we want to display, in order to choose the right strategy.


-----------------------------------------------------------------------------------------------------------

  
[1]    These are the steps to download and deploy the Stock Trade sample dashboard:
  1. Download & deploy the Dashbuilder webapp on your favorite app server - Installation instructions.
          (You can also build from sources)
  2. Extract the contents of the following zip file into the dashbuilder.war/WEB-INF/deployments folder.
  3. Create the stock trade database tables. Use or adapt the H2 script file stocktrade-h2.sql provided.
  4. Start the app. The dashboard should be automatically deployed.