Parquet/arrow is a great format for Pandas - fast and nicely compact in terms of file size for those of us who don't have the luxury of directly attached NVMe SSDs and where I/O bottlenecks are a consideration.
Even after hitting issues with its inability to map datetime64 values properly, I was reasonably happy with my design choice.
I became less happy on discovering that it's very weak as an interchange format for cross-language work.
In my case I wanted to use some existing JVM-based tooling, and this caused huge pain. The JVM/Java library/API is a complete mess, sorry, and if people are complaining about the Python/C++ documentation, there's basically nothing for the Java library.
It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
And the API is barely above exposing the file format. There's nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
The other issue is caused by its flexibility - for example, Pandas dataframes are written with what's effectively a bunch of "extension metadata", which means reading and writing pandas from Python works great, but don't expect anything to work with the files out-of-the-box in other languages.
In the end, the only way I could get reliable reading and writing from the JVM was to store only numeric and string data from the Python side. Even then it feels flaky - with a bunch of Hadoop warnings and deprecation warnings. I know the JVM has little appreciation in the data-science world, which is maybe a reason for the sorry state of the Java library.
Edit: to be specific, I am talking about my experiences with Arrow/Parquet.
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said
> It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree.
So, I'm not sure what to tell you, we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem.
I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.
But your landing page claims that "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs." and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.
The JVM isn't completely irrelevant in data science given the position of Spark/Scala. This also raised my expectations of Arrow/Parquet, because it seems to be the de-facto standard for table storage on that JVM platform. And I experienced no issues on that platform.
To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think Parquet/Arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.
Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.
FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.
Similarly, the reference implementation of Parquet may be in Java, but consuming it from a Java language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.
Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.
I spent a couple of years doing consultancy for life-sciences research labs; most people were just using Excel and Tableau, plugged into OLAP and SQL servers, alongside Java- and .NET-based stores.
Stuff like Arrow doesn't even come onto the radar of IT.
You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out, after it was too late, that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.
Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.
That could be improved without fixing the whole JVM data ecosystem, but that's mostly up to JVM developers. It's unfortunate if the Spark developers using Arrow aren't contributing in this area (especially since many of them are being paid), but it's all open source and undoubtedly pull requests are welcome.
Congratulations on the 1.0 release, it's only going to keep getting better! Really exciting to be able to share data in-memory across languages.
I can't speak to the relative merits of Julia, but I am honestly interested in anything that seeks to produce a less memory-hungry alternative to Spark.
Rust, to me, seems like a natural enough choice. It is easy to mate it to other languages, including all the major data science ones, so it would theoretically work well as the basis for a distributed compute engine that has good support for all of them as client languages. Would the same work for Julia? IIRC, it's a bytecode compiled language, which I imagine would make it difficult to link Julia libraries from other technology stacks.
I experienced all of these issues with JVM<->Python interop with Parquet. It does not surprise me that they extend to Arrow as well. It’s incredibly frustrating to say the least.
> And the API is barely above exposing the file-format. Nothing like "load this parquet file" into some object which you can then query for it's contents - you're dealing with blocks and sections and other file-format level entities.
Yes, it's just a building block. The easiest way to use Parquet on Java is to use Spark's integration with it, because it provides the query engine for you. But, if I'm not mistaken, it's much bigger than Pandas.
I also strongly dislike the Hadoop coupling (hey, it's called parquet-mr for a reason...), but it's more or less an invisible annoyance if you use Spark in an environment like Amazon EMR.
I think you'll get a lot of confused responses from people if you want to try to build an application that reads Parquet directly. It's the disk format for a bunch of distributed database engines. They'll wonder how you plan on querying it.
I found parquet files to be slower than just serializing and compressing the pandas data frames. Haven't looked back since. Of course this was an older version of pandas so the DF to parquet functionality may be much improved.
A serialization format for Pandas isn't really the core use case of Parquet. Pandas will slurp the whole thing into memory, which doesn't take advantage of the columnar format or pushdown filtering features. It gets more interesting if you use it with a tool that uses them to avoid some large fraction of disk I/O when performing a selective query.
Considering the number of segfaults I've witnessed that appear to originate in relatively recent versions of the Parquet library, I don't think your instincts are misplaced. I genuinely worry about data corruption.
I've been meaning to take a closer look at ORC. As a Spark user, I just sort of defaulted into Parquet. ORC is very similar, though, and seemingly gives every indication of being the more mature product.
On balance I find writing your own tools is useful when your use cases are narrow but becoming a wizard in other tools is useful otherwise. I'm lucky that I get to define my own use cases.