One way to solve the «vertical silos» problem is to build a data lake: a place where data coming from different sources can be stored and accessed in a consistent way. Data are stored in their natural/raw format, which keeps future data manipulation possible. Sometimes transformed data are stored in the data lake as well; they are needed for tasks such as reporting, visualization, advanced analytics, and machine learning.
Usually the data «lineage» (where the data comes from, when it was acquired and manipulated, and by whom/what) is saved together with the datasets, as metadata or in some other easy-to-search form.
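As a minimal sketch of this idea, the snippet below stores a raw payload next to a small JSON «lineage» sidecar. The field names, the worker name, and the directory layout are illustrative assumptions, not a standard.

```python
import datetime
import hashlib
import json
import pathlib


def save_with_lineage(raw_bytes: bytes, dataset: str, source: str,
                      base_dir: str = "lake") -> dict:
    """Store raw data plus a lineage sidecar (where from, when, by whom/what)."""
    path = pathlib.Path(base_dir) / dataset
    path.mkdir(parents=True, exist_ok=True)
    (path / "data.raw").write_bytes(raw_bytes)          # data kept in raw format
    lineage = {
        "source": source,                               # where the data comes from
        "acquired_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "acquired_by": "ingest-worker-v1",              # hypothetical worker name
        "sha256": hashlib.sha256(raw_bytes).hexdigest() # integrity fingerprint
    }
    (path / "lineage.json").write_text(json.dumps(lineage, indent=2))
    return lineage


meta = save_with_lineage(b'{"temp_c": 21.5}', "sensors/room1", "mqtt://plant-a")
```

In a real lake the sidecar would more likely live in a searchable metadata catalog than in a plain file, but the recorded fields would be the same.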
To «hydrate» a data lake, workers that acquire data from the sources (silos) need to be provisioned. Each worker is specific to a data source and is attached to services that allow further near-real-time data analysis. This set of functionalities (worker + services) is commonly referred to as a «data pipeline». By analogy with water pumps and pipes, data are extracted («pumped») from the sources and «flow» through services until they land in the data lake or in specific storage/database solutions.
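The worker-plus-services flow can be sketched as a chain of extract, transform, and load stages. This is a toy illustration under obvious simplifications: a list of JSON strings stands in for the source silo, a Celsius-to-Fahrenheit enrichment stands in for the attached services, and a plain list stands in for the lake.

```python
import json
from typing import Iterable, Iterator


def extract(source_records: Iterable[str]) -> Iterator[dict]:
    """Worker stage: pull raw records out of a source silo."""
    for raw in source_records:
        yield json.loads(raw)


def transform(records: Iterator[dict]) -> Iterator[dict]:
    """Service stage: light enrichment applied while records flow through."""
    for rec in records:
        rec["temp_f"] = rec["temp_c"] * 9 / 5 + 32
        yield rec


def load(records: Iterator[dict], sink: list) -> None:
    """Landing stage: records arrive in the lake (a list stands in for storage)."""
    sink.extend(records)


lake: list = []
source = ['{"temp_c": 20}', '{"temp_c": 25}']
load(transform(extract(source)), lake)   # data is «pumped» from source to lake
```

Because the stages are generators, records stream through one at a time rather than being buffered, which mirrors how near-real-time pipelines behave.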
The following picture illustrates these concepts.
Building a data lake and hydrating it through data pipelines can be achieved using:
a) On-prem services
b) Cloud based self-managed services
c) Cloud based fully managed services
d) Cloud based serverless infrastructure
The level of commitment required to operate these architectures decreases significantly from a) to d).
With a), all the ops work falls on the IT team, including hardware management.
With d), the only things the IT team must focus on are:
- developing the application logic (the worker logic);
- choosing the right tools for the job and using their APIs, following proper security best practices.
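In the serverless case d), the worker logic shrinks to a function invoked by the platform. The sketch below assumes a generic event-driven runtime; the `handler(event, context)` signature and the event envelope shape are illustrative assumptions modeled on common FaaS platforms, not a specific provider's API.

```python
import base64
import json


def handler(event: dict, context=None) -> dict:
    """Hypothetical serverless entry point: only this worker logic is ours to
    write; provisioning, scaling, and the trigger belong to the platform."""
    out = []
    for record in event.get("records", []):
        # Platforms often deliver payloads base64-encoded inside an envelope.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["validated"] = "temp_c" in payload   # the application logic
        out.append(payload)
    return {"processed": len(out), "records": out}


# Simulated invocation with a platform-style event envelope (shape is illustrative)
event = {"records": [{"data": base64.b64encode(b'{"temp_c": 19}').decode()}]}
result = handler(event)
```

Everything outside the function body (queues, triggers, retries, scaling) is the managed infrastructure the IT team no longer operates.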
Since data lakes involve a lot of moving parts and different technologies, going fully serverless is a choice worth considering to reduce operational complexity.