All of this is correct, but it misses the main point of the new table formats - they are open-source and the data can be stored on very low-cost storage (S3).
So, having a data warehouse that stores TBs or even PBs of data is not as expensive as it used to be (by an order of magnitude or more). And the formats for storing the data (Parquet), its metadata (Iceberg, Hudi, Delta Lake), and its query engines (DuckDB, Polars, Ibis) are all open-source.
> it misses the main point of the new table formats
I didn't miss it; it's irrelevant.
In practice, there is almost no difference between a competent implementation in one and a competent implementation in the other.
It makes absolutely no difference that they are open source.
Understanding the details of each of the individual components will give you no meaningful insight into how to build a lakehouse.
...because, when you slap all those parts together, in whatever configuration you've picked, what you end up with is a database.
A big, powerful cloud database.
Well, you have a database now, and you still have zero insights and zero idea how to get any of them. That's because you didn't understand that you need to build some kind of data warehouse on top of that database. You need to load the data. You need to transform the data. You need to visualize the data and build reports on it. If you're good, you probably need to preprocess the data to use as training inputs.
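To make that concrete, here is a minimal sketch of the load, transform, report steps. It assumes an in-memory SQLite database as a stand-in for whatever storage/format/engine stack you picked; the table names and sample rows are made up purely for illustration.

```python
import sqlite3

# Stand-in for the "big cloud database" (assumption: SQLite instead of a real lakehouse stack).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount_cents INTEGER)"
)

# The infra exists, but an empty database has nothing to report on.
empty_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]

# Load: hypothetical sample rows, including one with a missing amount.
rows = [(1, "emea", 1250), (2, "emea", 700), (3, "apac", 3100), (4, "apac", None)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: drop incomplete rows, normalize region names, convert cents to units.
conn.execute(
    """
    CREATE TABLE orders_clean AS
    SELECT order_id, UPPER(region) AS region, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE amount_cents IS NOT NULL
    """
)

# Report: the part someone actually wanted in the first place.
report = conn.execute(
    "SELECT region, SUM(amount) FROM orders_clean GROUP BY region ORDER BY region"
).fetchall()

print("rows before loading:", empty_count)
for region, total in report:
    print(region, total)
```

None of these steps come for free with the database itself; they are the work that sits on top of it.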
I'll say it more clearly and explicitly one. more. time:
- Having a database != having a data warehouse.
- Having a big cloud database built out of cloud storage, table formats, metadata engines and query engines != a lakehouse.
Having an empty database is of no value to anyone, no matter how good it is.
All of those parts, all of those things, are only the first step. It's like installing Postgres. Right, good job. We're done here? Reports? Oh, you can probably import something or something, or I know, Power BI is good, let's install that. It'll tell you you have no data... but... we've got the infra now, right? Basically done.