Parquet/arrow is a great format for Pandas - fast and nicely compact in terms of file size for those of us who don't have the luxury of directly attached NVMe SSDs and where I/O bottlenecks are a consideration.
Even after hitting issues with its inability to map datetime64 values properly, I was reasonably happy with my design choice.
I became less happy on discovering that it's very weak as an interchange format for cross-language work.
In my case I wanted to use some existing JVM-based tooling, and this caused huge pain. The JVM/Java library/API is a complete mess, sorry, and if people are complaining about the Python/C++ documentation, there's basically nothing for the Java library.
It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
And the API is barely above exposing the file format. There's nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
The other issue is caused by its flexibility - for example, Pandas dataframes are written with what's effectively a bunch of "extension metadata", which means reading and writing pandas from Python works great, but don't expect anything to work with the files out-of-the-box in other languages.
In the end, the only way I could get reliable reading and writing from the JVM was to store only numeric and string data from the Python side. Even then it feels flaky - with a bunch of Hadoop warnings and deprecation warnings. I know the JVM has little appreciation in the data-science world, which is maybe a reason for the sorry state of the Java library.
Edit: to be specific, I am talking about my experiences with Arrow/Parquet.
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said
> It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree.
So, I'm not sure what to tell you, we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem.
I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.
But your landing page claims that "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs." and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.
The JVM isn't completely irrelevant in data science given the position of Spark/Scala. This also raised my expectations of Arrow/Parquet, because it seems to be the de-facto standard for table storage on that JVM platform. And I experienced no issues on that platform.
To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think Parquet/Arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.
Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.
FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.
Similarly, the reference implementation of Parquet may be in Java, but consuming it from a Java language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.
Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.
I spent a couple of years doing consultancy for life-sciences research labs; most people were just using Excel and Tableau, plugged into OLAP and SQL servers, alongside Java- and .NET-based stores.
Stuff like Arrow doesn't even come onto the radar of IT.
You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out, after it was too late, that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.
Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.
That could be improved without fixing the whole JVM data ecosystem, but that's mostly up to JVM developers. It's unfortunate if the Spark developers using Arrow aren't contributing in this area (especially since many of them are being paid), but it's all open source and undoubtedly pull requests are welcome.
Congratulations on the 1.0 release, it's only going to keep getting better! Really exciting to be able to share data in-memory across languages.
I can't speak to the relative merits of Julia, but I am honestly interested in anything that seeks to produce a less memory-hungry alternative to Spark.
Rust, to me, seems like a natural enough choice. It is easy to mate it to other languages, including all the major data science ones, so it would theoretically work well as the basis for a distributed compute engine that has good support for all of them as client languages. Would the same work for Julia? IIRC, it's a bytecode compiled language, which I imagine would make it difficult to link Julia libraries from other technology stacks.
I experienced all of these issues with JVM<->Python interop with Parquet. It does not surprise me that they extend to Arrow as well. It’s incredibly frustrating to say the least.
> And the API is barely above exposing the file-format. Nothing like "load this parquet file" into some object which you can then query for it's contents - you're dealing with blocks and sections and other file-format level entities.
Yes, it's just a building block. The easiest way to use Parquet on Java is to use Spark's integration with it, because it provides the query engine for you. But, if I'm not mistaken, it's much bigger than Pandas.
I also strongly dislike the Hadoop coupling (hey, it's called parquet-mr for a reason...), but it's more or less an invisible annoyance if you use Spark in an environment like Amazon EMR.
I think you'll get a lot of confused responses from people if you want to try to build an application that reads Parquet directly. It's the disk format for a bunch of distributed database engines. They'll wonder how you plan on querying it.
I found parquet files to be slower than just serializing and compressing the pandas data frames. Haven't looked back since. Of course this was an older version of pandas so the DF to parquet functionality may be much improved.
A serialization format for Pandas isn't really the core use case of Parquet. Pandas will slurp the whole thing into memory, which doesn't take advantage of the columnar format or pushdown filtering features. It gets more interesting if you use it with a tool that uses them to avoid some large fraction of disk I/O when performing a selective query.
Considering the number of segfaults I've witnessed that appear to originate in relatively recent versions of the Parquet library, I don't think your instincts are misplaced. I genuinely worry about data corruption.
I've been meaning to take a closer look at ORC. As a Spark user, I just sort of defaulted into Parquet. ORC is very similar, though, and seemingly gives every indication of being the more mature product.
On balance I find writing your own tools is useful when your use cases are narrow but becoming a wizard in other tools is useful otherwise. I'm lucky that I get to define my own use cases.