Extract transform load (ETL) process in distributed database academic data warehouse

While a data warehouse is designed to support the decision-making function, the most time-consuming part of building one is the Extract Transform Load (ETL) process. In the case of an academic data warehouse whose data sources are the faculties' distributed databases, integration is not made easier by the fact that every faculty uses the same typical database structure. This paper presents the ETL process for a distributed database academic data warehouse. Following the Data Flow Thread in the data staging area, a deep analysis is performed to identify all tables in each data source, including content profiling. The cleaning, conforming, and data delivery steps then pour the different data sources into the data warehouse (DW). Since the DW is developed with Kimball's bottom-up multidimensional approach, we found three types of extraction activities on the data source tables: merge, merge-union, and union. The cleaning and conforming steps are settled by creating conformed dimensions based on data source analysis, refinement, and hierarchy structure. The final ETL step loads the data into integrated dimension and fact tables through the generation of surrogate keys. These processes run gradually over each distributed database data source until everything is incorporated. The technical activities of this distributed database ETL process can generally be adopted in other industries, provided the designer has advanced knowledge of the structure and content of the data sources.

By this design, the database server is installed on a single machine and consists of several separate academic databases, one per faculty. Every faculty therefore has its own database with an identical structure (tables and attributes), but there is no guarantee that the reference tables hold the same data. This information system runs well as a transactional system, and to prepare an enhanced academic business intelligence layer we propose the construction of an academic DW, including the data integration process shown in Figure 1; this integration is essential for the academic data warehouse to hold truthful and comprehensive data. The first objective of this study is the implementation of the bottom-up dimensional data warehouse design approach. The design is initiated from the business requirements provided by the university executives and then proceeds to the dimension and fact tables with measurement data, as the representation of the multidimensional technique. The other objective of this research is to identify the related attributes in each data source and to classify the tables as master tables or operational (transaction) tables. In addition, several activities are performed in the ETL process, such as extracting, cleaning, and conforming tables from the sources and loading them into the DW. This process relies on data quality and on the analysis of the data from the different faculties' academic databases.

Research Method
With distributed database data sources, the ETL process becomes more complicated because there is no single control mechanism over the contents of each database. A gradual and careful process to identify which tables should be merged or not, to profile the tables' attributes, and to analyze the data in the tables is implemented in detail in this work. Knowing the data structure helps in cleaning and conforming the dimensions and in building the fact tables. In the case of an academic data warehouse, all steps are arranged around the business process: declaring the grain and defining the dimensions and the measurement data in the fact tables are the steps for building the data warehouse.
In building the data warehouse, we adapted Kimball's bottom-up approach for developing a multidimensional model [8]. Based on Kimball's four-step dimensional design process, the model is designed through the following steps: a. Select the business process to be modeled, taking into account the business requirements and the available data sources. b. Declare the grain: define exactly what an individual fact table row represents. c. Identify the dimensions that describe each fact. d. Identify the facts (the numeric measurements). The important advantage of this approach is that it consists of data marts representing business units, which have a very quick initial set-up and are effortless to integrate with other data marts. Considering an enterprise-scale DW, this method is also easy to extend from the existing DW and already provides reporting capability.
In the ETL process there are four prominent staging steps, during which data is staged (written to disk) in parallel with being transferred to the next stage, as described in Figure 2. Research on ETL performance has already been carried out. Igor Mekterović et al. [9] proposed a robust ETL procedure with three agents, changer, cleaner, and loader; their work focuses on handling horizontal data segmentation and versioning when the same fact tables are used by various users and must not influence one another. Vishal Gour et al. [10] proposed another ETL performance improvement, a query cache technique that reduces the response time of ETL. E. Tute et al. [11] describe an approach for modeling ETL processes to serve as a kind of interface between regular IT management, the DW, and its users, in the case of a clinical data warehouse. Abid Ahmad et al. [12] report on the use of a distributed database in the development of a DW; their study concentrates on the architecture and the change detection component. Sonali Vyas et al. [13] survey the various ETL processes and testing techniques to facilitate the selection of the best ETL technique and test.
Examining the studies above, it is clear that the research gap addressed by this work is an advanced technique for analyzing distributed database data sources in the ETL process. In addition, the specific academic case is handled with a dedicated strategy in each ETL step after analyzing the academic data.

Designing Multidimensional Model
Following Kimball's four-step dimensional design, the academic data warehouse is built as shown in Table 1. In integration and cross-process work, creating conformed dimensions helps to identify and map every individual fact to its dimensions [14]. Once the right dimensions are found, we can focus on them in each single database data source; the resulting data are shown in Table 2.
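To illustrate the resulting multidimensional model, the fragment below sketches one conformed dimension and a fact table at the registration grain. It is only an illustration under assumed MySQL-style syntax; the table and column names (dim_program_studi, fact_registrasi, and so on) are hypothetical rather than the exact schema of this study.

    -- Conformed dimension: surrogate key plus descriptive attributes and hierarchy.
    CREATE TABLE dim_program_studi (
      sk_program_studi   INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
      kode_program_studi VARCHAR(10),                     -- natural (business) key
      nama_program_studi VARCHAR(100),
      nama_fakultas      VARCHAR(100)                     -- hierarchy: faculty > study program
    );

    -- Fact table at the registration grain: dimension keys plus numeric measures.
    CREATE TABLE fact_registrasi (
      sk_program_studi  INT,
      sk_mahasiswa      INT,
      sk_tahun_akademik INT,
      jumlah_sks        INT,   -- example measure: credits registered
      FOREIGN KEY (sk_program_studi) REFERENCES dim_program_studi (sk_program_studi)
    );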

ETL Process
The ETL tool used in this research is Pentaho Data Integration (PDI) v6.1, part of Pentaho, an open source commercial business intelligence suite. The most common use of PDI is to perform ETL, and it is also powerful for obtaining and manipulating data [15]. The Pentaho Business Analytics Server provides several plugins, such as Saiku Analytics, which offers interactive analysis with tables and graphs.

Extraction
In the case of a distributed database, the extraction step is initiated by data source identification: first, investigating the table structure and description in each faculty database; the resulting division of the source tables is shown in Table 3. Second comes the integration of the heterogeneous data sources. Each single database must be connected and can then be merged or unioned based on the identification and analysis of the source, as sketched below. This point also concerns connectivity to the databases. To create a connection, the connection settings must be known, at least the following [16]: the type of database connection; the hostname (domain name or IP address of the database server); the database name (the schema or other database identifier); the port; and the username and password to access the data source. Figure 3 shows the integration transformation process in PDI.
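A minimal sketch of the union-type and merge-type extraction activities, assuming each faculty database is a separate MySQL schema; the schema names (fakultas_teknik, fakultas_ekonomi) and the staging tables are hypothetical.

    -- Union: the same operational table from every faculty schema is appended into one
    -- staging table; an extra column records which database each row came from.
    INSERT INTO staging.mahasiswa (nim, nama, kode_program_studi, sumber_db)
    SELECT nim, nama, kode_program_studi, 'fakultas_teknik'  FROM fakultas_teknik.mahasiswa
    UNION ALL
    SELECT nim, nama, kode_program_studi, 'fakultas_ekonomi' FROM fakultas_ekonomi.mahasiswa;

    -- Merge: reference tables that should contain the same rows in every schema are
    -- combined and de-duplicated so that each business key appears only once.
    INSERT INTO staging.program_studi (kode_program_studi, nama_program_studi)
    SELECT kode_program_studi, nama_program_studi FROM fakultas_teknik.program_studi
    UNION
    SELECT kode_program_studi, nama_program_studi FROM fakultas_ekonomi.program_studi;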
Third: data content analysis. In this analysis we focused on NULL values, and dealing with NULL is another task in this step. We defined two serial solutions. The first is to use the IFNULL() expression in the SQL scripts that create the dimension tables or fact tables, as sketched below. The second is to add a new column to the master table to handle NULL or 'Not Available' data in a transaction table; this solution needs the intervention of the database administrator and the privilege to decide on it.
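A short sketch of the first solution, using MySQL's IFNULL() while populating a dimension table; the table and column names are illustrative assumptions.

    -- NULL attributes are replaced by an explicit 'Not Available' value so that every
    -- fact row can still be joined to a meaningful dimension row.
    INSERT INTO dw.dim_sekolah_asal (kode_sekolah, nama_sekolah, kota)
    SELECT IFNULL(kode_sekolah, 'N/A'),
           IFNULL(nama_sekolah, 'Not Available'),
           IFNULL(kota, 'Not Available')
    FROM   staging.sekolah_asal;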

Cleaning and Conforming
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. The data cleansing approach involves three steps: data analysis, data refinement, and data verification [17]. The problems to resolve come both from single sources and from multiple sources, and include general data and value-rule reasonability checks [3]. Under these conditions, after re-analysis, the ETL team can confirm, join, and integrate the data as a simplification.
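As an illustration of the data analysis and refinement steps, the hedged sketch below screens a staging table for values outside an agreed domain and then standardizes them; the table, column, and code values are assumptions for illustration only.

    -- Data analysis (value-rule screen): list rows whose gender code falls outside the domain.
    SELECT nim, jenis_kelamin
    FROM   staging.mahasiswa
    WHERE  jenis_kelamin NOT IN ('L', 'P');

    -- Data refinement: trim stray whitespace and standardize the casing in place.
    UPDATE staging.mahasiswa
    SET    jenis_kelamin = UPPER(TRIM(jenis_kelamin));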
Create conformed dimensions. Conforming here refers to two notions: hierarchies and conformed dimensions. A hierarchy within a dimension is one of many potential drill paths; dimensions are not limited to a single hierarchy, and it is common to find several hierarchies within one dimension [14]. A conformed dimension means the same thing to every possible fact table to which it can be joined [3], as described in Figure 4.
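Because a conformed dimension carries exactly the same keys and labels for every fact table it joins, results from different facts can be compared through it. The drill-across sketch below assumes two hypothetical fact tables (fact_pendaftaran and fact_registrasi) sharing dim_program_studi.

    -- Aggregate each fact table separately, then align the results on the shared
    -- conformed dimension so admissions and registrations can be compared per program.
    SELECT d.nama_program_studi, p.jumlah_pendaftar, r.jumlah_registrasi
    FROM dw.dim_program_studi d
    LEFT JOIN (SELECT sk_program_studi, COUNT(*) AS jumlah_pendaftar
               FROM dw.fact_pendaftaran GROUP BY sk_program_studi) p
           ON p.sk_program_studi = d.sk_program_studi
    LEFT JOIN (SELECT sk_program_studi, COUNT(*) AS jumlah_registrasi
               FROM dw.fact_registrasi GROUP BY sk_program_studi) r
           ON r.sk_program_studi = d.sk_program_studi;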

Delivery/Loading
Surrogate key generation [18]. The surrogate key is a meaningless integer key that carries no business meaning and is used only for data analysis, so we generate a surrogate key in every dimension table and fact table. Besides being best practice in a data warehouse, this technique is also needed to implement slowly changing dimensions. Star schema dimension tables. These activities include denormalizing the data, since a star schema table is characterized as being "not normalized," and then delivering the data to the target dimension table.
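A minimal sketch of surrogate key generation, assuming the surrogate column is declared AUTO_INCREMENT as in the earlier fragment; dim_mahasiswa and its staging source are hypothetical names.

    -- Because sk_mahasiswa is AUTO_INCREMENT, a new meaningless integer is assigned to
    -- every row that is loaded; the natural key nim is kept only as a lookup attribute.
    INSERT INTO dw.dim_mahasiswa (nim, nama, jenis_kelamin)
    SELECT s.nim, s.nama, s.jenis_kelamin
    FROM   staging.mahasiswa s
    WHERE  NOT EXISTS (SELECT 1 FROM dw.dim_mahasiswa d WHERE d.nim = s.nim);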
Slowly Changing Dimension (SCD). This step accommodates history by adding two new columns to the OLTP master tables (created datetime and updated datetime) and a column in the DW that holds the last update datetime. The current and historical changes of the data over time are tracked through this field.
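One possible reading of this approach, sketched as an overwrite keyed on the source update timestamp; the column names (tanggal_update, update_datetime) and tables are assumptions.

    -- Rows whose source update time is newer than the time recorded in the warehouse are
    -- treated as changed; the dimension row is refreshed and keeps the latest update time.
    UPDATE dw.dim_mahasiswa d
    JOIN   staging.mahasiswa s ON s.nim = d.nim
    SET    d.nama           = s.nama,
           d.jenis_kelamin  = s.jenis_kelamin,
           d.tanggal_update = s.update_datetime
    WHERE  s.update_datetime > d.tanggal_update;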
Loading the fact table. One of the most important aspects of preparing fact data for loading is performing the dimension lookups that fetch the correct surrogate keys [1].
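A sketch of that lookup, again with hypothetical names: each natural key coming from the transaction source is exchanged for the surrogate key of the matching dimension row before the measure is written to the fact table.

    INSERT INTO dw.fact_registrasi (sk_mahasiswa, sk_program_studi, sk_tahun_akademik, jumlah_sks)
    SELECT m.sk_mahasiswa, p.sk_program_studi, t.sk_tahun_akademik, r.jumlah_sks
    FROM staging.registrasi r
    JOIN dw.dim_mahasiswa      m ON m.nim                = r.nim
    JOIN dw.dim_program_studi  p ON p.kode_program_studi = r.kode_program_studi
    JOIN dw.dim_tahun_akademik t ON t.tahun_akademik     = r.tahun_akademik;
    -- Rows that fail a lookup would silently disappear with inner joins; in practice the
    -- ETL routes them to an 'unknown' dimension member or an error table instead.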
Working with Online Analytical Processing (OLAP) data. With the integrated DW database as the target, this activity concerns working with multidimensional data (the OLAP cube). The cube in the data presentation area is created with the Pentaho Schema Workbench tool.

ETL Testing
The main objective of ETL testing is to verify and resolve data defects and general errors before the data is processed for analysis and reporting. Several tests can be conducted depending on their function; here we performed the following scenarios: data completeness validation, metadata testing, and incremental ETL testing [19]. As a sample we picked one target table, named dim_program_studi_dikti, in the OLAP DW database.
The goal of data completeness validation is to make sure that all data is loaded as expected. Metadata testing verifies that the table definitions conform to the model and application design; a mapping document covering data type, data length, and index/constraint checks between the source and target tables supports this validation. One form of incremental ETL testing is duplication validation, which measures whether unwanted duplicates exist. Table 5 and Table 6 illustrate the ETL testing.
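Two of these checks can be expressed directly in SQL; the sketch below assumes the staging source and an OLAP target schema (the schema names are illustrative, the target table name follows the sample above).

    -- Data completeness validation: the row counts of source and target must match.
    SELECT (SELECT COUNT(*) FROM staging.program_studi)        AS jumlah_sumber,
           (SELECT COUNT(*) FROM olap.dim_program_studi_dikti) AS jumlah_target;

    -- Duplication validation (incremental ETL testing): any business key that appears
    -- more than once in the target table signals unwanted duplication.
    SELECT kode_program_studi, COUNT(*) AS jumlah
    FROM   olap.dim_program_studi_dikti
    GROUP  BY kode_program_studi
    HAVING COUNT(*) > 1;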

Discussion
The ETL process, as a main component of DW development, should receive more attention from the DW manager. To guide DW development so that it meets the institution's business requirements, Kimball's bottom-up approach was put into practice, starting from a single functional area instead of the enterprise scale. For distributed data sources, applying a data staging area provides benefits before the data is loaded into the OLAP database: all ETL steps are covered in the staging area to ensure that the data transformation from the source to the target database stays on track. This option can also be realized widely in other circumstances concerned with data quality issues.
In this setting, advanced knowledge of the data sources, their structure and their function, is required, and it involves many technical parts; fortunately, Pentaho Data Integration, as an open source BI tool, has the capability to deal with them. Testing was conducted to verify the accuracy of the data loading against the signed-off business requirements and rules. From the general DW testing view, the extensibility aspect is confirmed by designing the data marts so that they can be taken up again in subsequent data marts. The low elapsed time of the DW process increases DW performance compared with the transactional system. The multidimensional design makes it easy to drill down and drill across the data in reports; for the analysis of new students, for example, the multidimensional view readily provides data along dimensions such as gender, study program, faculty, year, and the origin of the senior high school.

Conclusion
The heterogeneity of the data sources can be overcome by designing the ETL process along the ETL data flow: Extract, Clean, Conform, and Deliver/Load. The extraction step performs data source analysis and content profiling, which establishes whether a table should be treated as a master table or a transaction table. In a distributed database system with data integration as the goal, three activities were carried out on each database: merge, merge-union, and union. The cleaning and conforming steps are settled by creating conformed dimensions based on data source analysis, refinement, and hierarchy structure. The final ETL step loads the data into integrated dimension and fact tables through the generation of surrogate keys. With distributed database data sources, the ETL designer must have advanced knowledge of the data structure and content of each database, and can then apply the complete ETL process easily. Several ETL tests were held to prove the correctness of the data transformation from the data sources to the target. This procedure can generally be employed for DW implementations in other industries. Pentaho Data Integration (PDI), as an open source ETL tool, can handle all of the steps.