Wednesday, June 3, 2015

UF Dashbuilder - Data set authoring

This article introduces the new Dashbuilder Data Set authoring user interface, which allows the user to list, create, update and remove data sets from a web browser. Please note that this is an end-user oriented guide, so do not expect it to dive deep into technical details.

A Data set is one of the main components in Dashbuilder's architecture; all of the different visualizations use Data sets to get the data they need. So if you are not familiar with the Data set API & architecture, it's highly recommended to read this previous article.

Data set authoring is the name given to the web interface that provides a set of screens to manage your Data sets in a user-friendly way. In the following video you can get a quick preview of how the new interface looks (do not forget to select HD) and how easy it is to register a new data set.

Data Set authoring perspective

Note: this authoring perspective gives the user a new and much easier way of managing Data sets than the default deployment scanner (see the section Data set deployment in this article).

Refreshing some concepts ...

To be able to create and edit data sets, it's important to be familiar with the Data set API and some other concepts. Here is a quick review (all the details are in this previous article).

(If you are already familiar with the Data set API & concepts you can skip this section)

Data set & Data set definition


The most obvious assumption would be that data set authoring is about the management of data sets, so the underlying model should be a data set. Is it...? Almost... but strictly speaking, it does not allow the management of Data set instances; it allows the management of Data set definitions.

Remember that a Data set definition is just the representation of a Data set's attributes and columns. It provides the information needed to look up data from remote systems and to collect and perform data operations, resulting in Data sets. Looking deeper into the architecture, the definition is a persistent entity that uses the JSON exchange format. Thus you can consider the Data set authoring a web editor for JSON Data set definitions.

Data set definition class members (see the JSON sketch after this list):
  • A name and unique identifier (UUID)
  • A type. It defines the storage type that provides the remote data to be consumed.
    Currently the editor supports the Bean, SQL, CSV and ElasticSearch types. These types allow for looking up data from a Java class, a DBMS, a CSV file or an ElasticSearch storage system, respectively.
  • Specific attributes. For example, if using an external DBMS the JDBC url and connection credentials are mandatory user input attributes.
  • Data columns. Define which columns will be present in the Data Set when a look-up is performed. See next section Data columns.
  • Initial data filter. Minimize the look-up result size by providing a filter. 
  • Cache and refresh settings. Some other attributes related to client & backend cache sizes and the data refresh policy.
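Putting these members together, a minimal Data set definition for an SQL type might look like the following JSON sketch. Note that the attribute names and values here are illustrative assumptions, not the exact schema; the authoritative format is the JSON that the editor itself generates:

    {
        "uuid": "expense_reports",
        "name": "Expense reports",
        "provider": "SQL",
        "dataSource": "java:jboss/datasources/ExampleDS",
        "dbTable": "EXPENSE_REPORTS",
        "columns": [
            {"id": "OFFICE", "type": "label"},
            {"id": "AMOUNT", "type": "number"}
        ],
        "cacheEnabled": true,
        "cacheMaxRows": 10000,
        "refreshTime": "1hour"
    }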

Data columns

Data columns is the name given to the columns of the resulting data set when a look-up is performed.

A data column has a unique identifier within the data set and a data type. Dashbuilder supports 4 column types:
  • Number - The row values for the column are considered numbers, so you can use the column in further column functions (sum, min, max, average, etc.).
  • Date - The row values for the column are considered dates, so you can use the column in further date-related column functions (timeframe, intervals, etc.).
  • Text - The row values for the column are considered plain text. The column cannot be used in numeric functions nor grouped (this column will never be indexed in the internal registry).
  • Label - The row values for the column are considered text literals. The column can be grouped, as the values are considered discrete.
No matter which remote system you look up, the resulting data set will contain columns of one of the four default types above. So there exists, by default, a mapping between the remote system's column types and Dashbuilder's types. The user is able to modify the type of some columns, depending on the data provider and the column type in the remote system.
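For instance, a VARCHAR column coming from a DBMS will typically map to Text or Label, while numeric and date/time columns map to Number and Date. In the definition's JSON, the column declarations might look like this sketch (again, attribute names are illustrative assumptions):

    "columns": [
        {"id": "AMOUNT", "type": "number"},
        {"id": "CREATION_DATE", "type": "date"},
        {"id": "NOTES", "type": "text"},
        {"id": "DEPARTMENT", "type": "label"}
    ]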

The data set authoring perspective allows you to manipulate data columns, as you will see in the next sections.


Initial data filter

It's important to remember that a Data set definition can define a filter. It's named the initial data filter because it's part of the data set definition itself, so all data displayers and other components that use this definition will be working on the subset of data that satisfies the filter conditions.

The goal of the initial filter is to remove from the data those rows that the user does not consider necessary. The filter works on any data provider type.
Note: For the SQL data provider type, you can either use the initial filter or add custom criteria to the SQL sentence. The former is more appropriate for non-technical users, since they might not have the required SQL language skills.
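For example, restricting an SQL data set to the London office could be done either way. A hedged sketch of both approaches follows; the attribute names, and especially the filter syntax, are illustrative assumptions rather than the exact schema:

    Custom criteria in the SQL sentence:

        "dbSQL": "SELECT * FROM EXPENSE_REPORTS WHERE OFFICE = 'London'"

    Initial filter on the definition:

        "dbTable": "EXPENSE_REPORTS",
        "filter": [
            {"column": "OFFICE", "function": "EQUALS_TO", "args": ["London"]}
        ]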
So it's important to note that you can specify a data filter at two levels:
  • In a Data set definition
  • In a Data displayer definition
Keeping in mind that a Data displayer consumes data from a Data set, there are some implications when deciding at which level to specify the data filter. For instance, you may have a data set getting the expense reports only from the London office, and then several displayers feeding from that data set. Another option is to define a data set with no initial filter and then let the individual displayers specify their own filters. It's up to the user to decide on the best approach; depending on the case, it might be better to define the filter at the data set level so it can be reused across all the displayers. The decision may also have an impact on performance, since a filtered, cached data set will perform far better than a lot of individual non-cached data set look-up requests per displayer (cache settings are described at the end of the article).

The authoring perspective

Once familiar with the API and some other basic concepts, let's see the authoring perspective in detail: its components and the main use cases of the tooling.


You can navigate to the perspective at Main menu -> Authoring -> Data Set authoring:

Data Set authoring menu item
The following screenshot shows the perspective screen:


Data Set authoring perspective
This view defines two panels/sections: the Data set explorer and the Data set editor.
  • Data set explorer
    It allows the user to explore and remove the data sets currently in the system. See next section Data set explorer.
  • Data set editor
    It allows the user to create, read or update a Data set. See next section Data set editor.

Note: For more information about UberFire perspectives and how to use them, please take a look at the official documentation.

Data set explorer

The Data Set explorer is a client-side component whose main goal is to list all public data sets present in the system and let the user perform authoring actions.

Data set explorer
It provides:
  • (1) A button for creating a new Data set.
  • (2) The list of currently available public Data sets.
  • (3) An icon that represents the Data set's provider type (Bean, SQL, CSV, etc.).
  • (4) Details of the current cache and refresh policy status.
  • (5) Details of the current size on the backend (in rows) and on the client side (in bytes).
  • (6) The button for reading or updating a Data set. It opens the Data set editor for interacting with the given instance.
  • (7) The button for removing a Data set.

Data set editor

The Data set editor is a client-side component that allows the user to create, read or update a data set.

Data Set editor home screen
The user interacts with the editor by:
  • Clicking on the Edit button in the Data Set explorer
  • Clicking on the New Data Set button in the Data Set explorer
  • Clicking on the New data set link in the Data Set editor's home screen


Basic creation & edition workflow

The interaction with the editor, for both the create and edit goals, follows a workflow with three steps:

Data Set creation & edition workflow
  1. Data provider type selection

    Specify the kind of remote storage system (BEAN, SQL, CSV, ElasticSearch)
     
  2. Data configuration - Editing of basic and provider-specific attributes

    Specify the attributes needed to perform the look-up against the remote system. These attributes vary depending on the Data provider type selected in the previous step.

  3. Advanced configuration - Table preview & editing of the data set's columns, initial filter, cache and refresh settings

    Configure the structure, data and other settings for the resulting data set.

Workflow step 1 - Type selection

Allows the user to specify the type of data provider for the data set to be created.

The screen lists all the currently available data provider types, with helper popovers showing descriptions. Each data provider is represented by a descriptive image:
Data provider type selection screen
This screen is only present when creating a Data set; it's not allowed to modify the provider type of an already existing data set.

Four data provider types are currently supported:
  • Bean (Java class)
  • SQL
  • CSV
  • Elastic Search
Once a type is selected, click on the Next button to continue with the next step of the creation workflow.

Workflow step 2 - Provider specific attributes

Once the kind of storage to look up has been specified in the previous step, the next one is the configuration of the specific attributes needed to use it.

The following picture shows the configuration screen for an SQL data provider type:
New SQL Data set creation screen
Once the data set name and connection attribute inputs are filled in (for this SQL-specific case), click on the Test button to perform an initial connection to the source SQL storage. This process fetches a small set of data and continues to the next workflow step.
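For reference, the inputs on this screen roughly correspond to the SQL-specific fragment of the definition's JSON. A sketch, assuming illustrative attribute names:

    "provider": "SQL",
    "dataSource": "java:jboss/datasources/ExampleDS",
    "dbSchema": "public",
    "dbTable": "EXPENSE_REPORTS"

Alternatively, a dbSQL attribute holding a custom query could take the place of dbTable, as mentioned in the initial data filter section.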

The usage is similar for the other data provider types:

BEAN Data set type
CSV Data set type
Elastic Search Data set type
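The provider-specific fragments for these types follow the same pattern. The attribute names below are illustrative assumptions based on the inputs each screen asks for:

    BEAN:

        "provider": "BEAN",
        "generatorClass": "org.mycompany.dataset.MyDataSetGenerator"

    CSV:

        "provider": "CSV",
        "filePath": "expenses.csv",
        "separatorChar": ";",
        "quoteChar": "\"",
        "datePattern": "MM-dd-yyyy",
        "numberPattern": "#,###.##"

    ELASTICSEARCH:

        "provider": "ELASTICSEARCH",
        "serverURL": "http://localhost:9200",
        "clusterName": "elasticsearch",
        "index": "expenses",
        "type": "expense"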
Notes:
  • The UUID attribute is a read-only field generated by the system; it's kept for later use in the remote API or specific operations, and you cannot edit it.
  • You can go back to the configuration tab at any time while creating or editing a data set, but if you modify any of the input values on this tab, you have to click on the Test button to apply the changes and perform a new look-up. In doing so, you will lose any column or filter configuration, as the look-up result can have different data and/or structure.

Workflow step 3 - Data set preview and advanced settings

At this point, the system is able to perform a look-up against the remote system and return a data set. In this workflow step you can check the resulting data and customize its structure and rows to suit your needs.

This step is presented using the screens of the Preview and Advanced tabs:

Preview tab

Preview tab
This tab contains three main sections:

Data set preview

A data table is located in the central area of the editor screen. This table displays the data that comes back from the remote system look up process.

Data set preview


You can apply some operations on this table such as filtering and sorting.

Data set columns

You can manage your data set columns in the Columns tab area:
Data set columns

Use the checkbox (1) to add or remove columns from the data set. Select only those columns you want to be visible and accessible to dashboard displayers.

Use the drop-down image selector (2) to change the column type. This has some implications for further column operations, as explained in previous sections.

Note: The BEAN Data provider type does not support changing column types, as it's up to the developer to decide the concrete type of each column.

Data set filter

In the Filter tab area you can specify the Data set definition's initial filter:

Data set filter

While adding or removing filter conditions and operations, the preview table in the central area is updated with the new subset of data.

Note: The use of the filter interface is already detailed in this previous article.

Advanced tab

The last settings to configure for a Data set definition are found in the Advanced tab:

Advanced settings tab
In this screen you can specify caching and refresh settings. These settings are very important for making the most of the system's capabilities, improving performance and application responsiveness.

At (1) you can enable or disable the client cache for the Data set and specify its maximum size (in bytes).

At (2) you can enable or disable the backend cache for the Data set and specify the maximum cache size (expressed in data set rows).

At (3) you can enable or disable automatic refresh for the Data set and specify the refresh period.

At (4) you can enable or disable the refresh on stale data setting.
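These four settings end up stored in the Data set definition's JSON. As a hedged sketch of how they might map (attribute names are illustrative; refreshOnStaleData in particular is a made-up name standing in for setting (4)):

    "pushEnabled": true,
    "pushMaxSize": 1024,
    "cacheEnabled": true,
    "cacheMaxRows": 10000,
    "refreshTime": "1hour",
    "refreshOnStaleData": false

Here pushEnabled/pushMaxSize would correspond to the client cache (1), cacheEnabled/cacheMaxRows to the backend cache (2), and refreshTime to the refresh period (3).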

Let's dig into the details of these settings in the following paragraphs (reading this article beforehand is recommended, as it introduces the basic concepts behind caching & refresh).

Caching

Dashbuilder is built with caching mechanisms for holding data sets and performing data operations using in-memory strategies. The use of these features has lots of advantages, such as reducing network traffic, remote system load, processing times, etc. On the other hand, the user is responsible for the proper use of caching and cache sizes to avoid running into performance issues.

Two levels of caching are provided:

  • The client cache
  • The backend cache
The following diagram shows how caching is involved in any data set look-up, group, filter and/or sort operation:

Caches
Any data look-up operation produces a resulting data set, so the caching configuration determines where that look-up operation is executed and where the resulting data set is located.

Client cache

If ON, the data set coming from a look-up operation is pushed to the web browser, so that all the data displayers or other components that feed from this data set do not need to perform requests to the backend; everything is resolved on the client side:

  • The data set is stored in the web browser's memory
  • The related displayers feed from the data set stored in the browser
  • Grouping, aggregations (sum, max, min, etc.), filters and sort operations are processed within the web browser, by means of a JavaScript data set operation engine.

If you know beforehand that your data set will remain small, you can enable the client cache. It will reduce the number of backend requests, not only the requests to Dashbuilder's backend, but also the requests to your backend storage system. On the other hand, if you expect your data set to be quite big, disable the client cache so as to avoid browser issues such as slow performance or intermittent hangs.

Backend cache

Its goal is to provide a caching mechanism for data sets on the backend side.

This feature reduces the number of requests to the external storage system by holding the data set in memory and performing group, filter and sort operations using the in-memory engine.

It's useful for data sets that do not change very often and whose size is acceptable to hold and process in memory. It can also help when the network connection to the remote system has high latency. On the other hand, if your data set is going to be updated frequently, it's better to disable the backend cache and perform a request to the external system on each look-up, so the external system is responsible for executing group, filter and sort operations on the latest data.

Note: BEAN and CSV data provider types rely on the backend cache by default, as in both cases the data set must always be loaded into memory in order to resolve any data look-up operation using the in-memory engine. This is the reason why the backend cache settings are not visible in the Advanced settings tab.


Refresh policy

Dashbuilder provides a data set refresh feature. Its goal is to invalidate any cached data when certain conditions are met.

Refresh policy settings
At (1) you can enable or disable the data refresh feature.
At (2) you can specify the refresh interval.
At (3) you can enable or disable refreshing only when the data is outdated.
The data set refresh policy is tightly related to data set caching, detailed in the previous section. This invalidation mechanism determines the cache life-cycle.

Depending on the nature of the source data, there are three main refresh use cases (the first two are sketched in JSON after this list):
  • Predictable source data changes

    Imagine a database being updated every night. In that case, the suggested configuration is to use a refresh interval = 1 day (2) and disable refreshing on stale data (3), so the system will invalidate the cached data set every day. This is the right configuration when we know in advance that the data is going to change (predictable changes).
  • Unpredictable source data changes

    On the other hand, if we do not know whether the database is updated every day, the suggested configuration is to use a refresh interval = 1 day (2) and enable refreshing on stale data (3), so the system, before invalidating any data, will check whether it has been updated. If so, the system will invalidate the current stale data set and populate the cache with fresh data.
  • Real-time scenarios

    In real-time scenarios caching makes no sense, as the data is going to be updated constantly. In this kind of scenario the data sent to the client has to be constantly updated, so rather than managing the refresh settings for the data set in the Data set Editor (remember these settings affect caching, and caching is not enabled here), you have to define when to update your dashboard displayers by modifying the refresh settings in the Displayer Editor configuration screen. For more information on the Displayer Editor and real-time dashboards, please refer to the Dashbuilder Displayer Editor & API and Real time dashboards articles.
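As a rough JSON sketch of the first two scenarios (same caveat as before: refreshOnStaleData is a made-up attribute name standing in for the checkbox at (3), and the interval format is illustrative):

    Predictable changes (e.g. nightly update):

        "refreshTime": "1day",
        "refreshOnStaleData": false

    Unpredictable changes:

        "refreshTime": "1day",
        "refreshOnStaleData": true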