
TrendyPaws Group


Data Warehouse Toolkit Epub 15

Background: In recent years, research data warehouses have increasingly become a focus of medical research. Such warehouses aim to provide a consolidated view of medical data from various sources such as clinical trials, electronic health records, epidemiological registries or longitudinal cohorts. Nevertheless, only a few center-independent infrastructure solutions are available. The i2b2 framework is a well-established solution for such repositories, but it lacks support for importing and integrating clinical data and metadata.



I really like how neatly the book is structured. It covers most of the topics related to data architecture and its underlying challenges, shows how to take an existing system and build a data warehouse around it, and presents best practices for justifying the expense in a very practical manner.

Although SOLAP technology offers a level of speed and ease of use not achievable by today's GIS, it also introduces new issues. First, it doesn't replace a GIS or spatially-enabled DBMS, since SOLAP is not meant to support transactional (OLTP) activities such as operational data storage and integrity checking. In other words, for an organisation that collects and processes its own data, SOLAP is an add-on product rather than a replacement (thus involving additional cost and work), although this is not the case when an organisation uses only data from other agencies.

Second, one must accept a fact that is well recognized by the Data Warehousing and Business Intelligence community: data will be duplicated. In other words, some data will be contained in several datacubes in addition to being contained in the data warehouse (if one's system architecture uses a data warehouse as explained in [23]). Indirect duplication also takes place when the results of aggregation calculations are stored, as is the case with GIS and DBMS when views are materialized or result sets are stored in place of SQL query commands. It is the traditional optimization trade-off between speed and storage, where speed is favoured. This duplication of data leads to concerns about refreshing the SOLAP datacubes when the source databases are updated (remember that datacubes are read-only databases and can be fed solely from transactional database sources). Today's solution relies mostly on recalculating the datacubes periodically (e.g. every month, every survey) or when a given threshold is reached (e.g. 2000 source updates). This additional process is necessary to add the new data once there are enough updates to become meaningful at the finest-grained level of aggregation in the datacubes, while keeping past data for trend analysis. Nevertheless, as mentioned earlier, a more frequent and incremental addition of the new source data into the datacubes with incremental aggregation is possible.

A third issue relates to the need for a development team to learn a different database paradigm, i.e. the paradigm related to multidimensional databases or datacubes. This has proven to be one of the most challenging issues, since the vocabulary, the concepts and the technology used in the Data Warehousing and BI community are typically unknown to the traditional OLTP database management community. One must remember that the same issue plagued the object-oriented world in the 1980s in spite of its numerous advantages.

Finally, a last issue concerns the recent introduction of SOLAP technology in the commercial market. In spite of numerous commercial technologies appearing in the market over the last few months, including open-source offerings, SOLAP has not yet achieved the level of maturity that exists for non-spatial OLAP products with regard to ETL and OLAP servers. A lot of research still goes on in university labs while the private industry works hard on delivering new spatially-enabled products targeted for 2009 and 2010.
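The refresh rule described above for the duplication issue (recalculate periodically, or once enough source updates have accumulated) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `RefreshPolicy`, its fields and the 30-day approximation of "every month" are hypothetical and do not come from any SOLAP product; only the 2000-update threshold is taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RefreshPolicy:
    """Decide when a read-only datacube should be recalculated (hypothetical helper)."""

    max_pending_updates: int = 2000          # threshold example quoted in the text
    max_age: timedelta = timedelta(days=30)  # assumption: "every month" taken as 30 days

    def should_refresh(self, pending_updates: int,
                       last_refresh: datetime, now: datetime) -> bool:
        # Refresh once enough source updates have accumulated...
        if pending_updates >= self.max_pending_updates:
            return True
        # ...or once the datacube is older than the chosen period.
        return now - last_refresh >= self.max_age
```

A scheduler could call `should_refresh` with the current count of unpropagated source updates and trigger the recalculation only when it returns `True`.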

Once these indicators were defined, the second phase of the project was to collect data from sources among different organizations (provincial and federal departments, Health/Climate/Statistics/Natural Resources agencies) and to integrate them into a spatial multidimensional structure (also called spatial datacube), as typically done for analytical tools [27]. Several operations must be applied to the data sources to integrate them in a coherent manner into the same structure. These operations are typical of data warehousing architectures and are known as ETL (Extract, Transform and Load) processes [14, 16, 27]. When dealing with spatial data (such as with spatial data warehouse architectures), these processes become more complex and time-consuming since the spatial nature of the data brings several new issues that must be taken into account. For example, the data may be incompatible regarding different aspects such as their geodetic reference systems, measurement units, cartographic shape definitions, spatial resolution, symbolization, spatial accuracy, data format, temporal period and geometric evolution, to name a few. Since no traditional commercial ETL software can tackle these issues, this leads to a need for specialized integration and access tools to support the ETL phase, such as FME (Feature Manipulation Engine) [28], which provides operators for spatial data transformation and file translation. Though such tools can significantly facilitate ETL by allowing batch processes, there is always a need for "programming" such a tool as well as a need for manual operations that can only be done using GIS (Geographic Information System) tools. However, this ETL work is performed once to provide users with the power necessary for Spatial OLAP.
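To make the integration problem concrete, here is a minimal sketch of one transform step: harmonizing measurement units before loading. It is not how FME or any other ETL tool works; the record layout, the source names and the function names are assumptions chosen for illustration, and the harder spatial issues listed above (reference systems, resolution, geometry) are deliberately left out.

```python
# Conversion factors to a common unit (metres); the set of units is illustrative.
TO_METRES = {"m": 1.0, "km": 1000.0, "ft": 0.3048}


def transform(record: dict) -> dict:
    """Return a new record with its length expressed in metres."""
    unit = record["unit"]
    if unit not in TO_METRES:
        raise ValueError(f"unknown unit: {unit!r}")
    return {
        "source": record["source"],
        "length_m": record["length"] * TO_METRES[unit],
    }


def run_etl(extracted: list[dict]) -> list[dict]:
    """Transform every extracted record; the returned list stands in for the load step."""
    return [transform(r) for r in extracted]


# Hypothetical records from two sources that disagree on units.
rows = [
    {"source": "agency_a", "length": 1.5, "unit": "km"},
    {"source": "agency_b", "length": 250.0, "unit": "m"},
]
loaded = run_etl(rows)
```

Rejecting unknown units instead of passing them through keeps incompatible values out of the datacube, at the cost of stopping the batch on the first bad record.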

The performance of the incremental techniques adapted to spatial data was also tested for a datacube that contains (only) 50,000 facts. Again, the processing time for the incremental updating of the datacube is rather constant and remains between 5 and 10 seconds. Only 40 to 45 seconds are necessary to rebuild the datacube entirely. This is due to the small size of the datacube. However, it emphasizes that the difference between the processing times of these two techniques is significant and will grow with the size of the datacube. In order to allow the propagation of updates stemming from possibly distributed OLTP sources into the data warehouse, these updating techniques have been implemented and deployed through a standardized Internet service. This service relies on the procedures provided by GeoKettle [32]. GeoKettle is an open source geospatial ETL tool developed by the GeoSOA research group of the Centre for Research in Geomatics (CRG) at Laval University. Specific transformations in the ETL tool have thus been designed to incrementally propagate updates in the datacubes, and are exposed through dedicated methods in the service contract: listjobs() lists all available transformations and provides details about each transformation, and executejob() starts a specific updating transformation. Updating procedures can thus be triggered remotely and in a standard way (based on the SOAP protocol) not only by a human but also by a machine. This allows the machine-to-machine propagation of updates and a high level of automation in complex distributed architectures.
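The contrast between incremental updating and a full rebuild, and the idea of exposing both as named jobs, can be sketched as follows. This is an in-process stand-in only: it is not GeoKettle's API and has no SOAP layer. The class `CubeService`, the fact layout and the job names are hypothetical; `list_jobs` and `execute_job` are Python-style analogues of the listjobs() and executejob() methods mentioned in the text.

```python
from collections import defaultdict

Fact = tuple[str, int, float]  # (region, year, measure): a hypothetical fact layout


class CubeService:
    """Toy update service holding one aggregated datacube (sum of the measure)."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []    # facts already propagated
        self.pending: list[Fact] = []  # new source facts not yet propagated
        self.cube: dict[tuple[str, int], float] = defaultdict(float)
        self._jobs = {
            "incremental_update": self._incremental_update,
            "full_rebuild": self._full_rebuild,
        }

    def list_jobs(self) -> list[str]:
        """Names of the available updating jobs."""
        return sorted(self._jobs)

    def execute_job(self, name: str) -> None:
        """Run one updating job by name; raises KeyError for an unknown name."""
        self._jobs[name]()

    def _incremental_update(self) -> None:
        # Touch only the cells affected by the new facts.
        for region, year, value in self.pending:
            self.cube[(region, year)] += value
        self.facts.extend(self.pending)
        self.pending.clear()

    def _full_rebuild(self) -> None:
        # Recompute every cell from all facts, old and new.
        self.facts.extend(self.pending)
        self.pending.clear()
        self.cube.clear()
        for region, year, value in self.facts:
            self.cube[(region, year)] += value
```

The incremental job does work proportional to the number of new facts, while the rebuild scans every fact, which is consistent with the growing gap in processing time reported above. Both jobs leave the cube in the same state for an additive measure such as a sum.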

