
I'm in the exploratory stages of starting a noncommercial project focused on hosting, processing and serving a vast amount of climate data for scientists and researchers. I've "seeded" this project with my own bare metal hardware, electricity and internet access. This includes 136 vCPUs, 512GB of RAM and 128TB of hard drive storage capacity (with room for another 456TB). As my own money permits, I plan to slowly add to this lab over time.

My initial consideration is to develop and host a sort of public Wolfram Alpha, but specialized for climate data queries and visualization. I'm open to other suggestions, as I'm currently in the process of architecting the system and reaching out to climate scientists for feedback on what they need that isn't well-served yet.

Once all the infrastructure work is done, I'd be happy to contribute to climate modeling more theoretically. But for now my priority is to develop a central repository for as much climate data as possible, normalize it where reasonable, and store it in a queryable format.

There are interesting technical problems here due to the scale of the data, the variety of its native (raw) formats, the frequency with which each source updates, and the types of information involved. But speaking directly to the title of this post: I don't think anything I'm doing is strictly an unsolved technical problem; it's just a complex technical undertaking. There are plenty of data serving and processing pipelines that have already been proven capable of this.



This sounds similar to Radiant Earth (https://www.radiant.earth/). They are also spearheading the Cloud Optimized GeoTIFF project (http://www.cogeo.org/) and the SpatioTemporal Asset Catalog spec (https://github.com/radiantearth/stac-spec).


Have you seen https://github.com/pangeo-data/pangeo? It's pretty amazing.

There's still a lot left to figure out, even storage formats aren't fully solved.


Hey, I'd love to help out in any way I can. I'm a data scientist by training, also involved in a startup that helps people learn data science (with a pretty popular data science blog too).


I love the focus. Starting at the base level data and working to normalize it is definitely the most difficult piece. How far back have you assembled data?

Feel free to reach out to me individually (email in profile), I'm working on operationalizing climate data like this and would like to look at this together!


Your email is not in your profile :)


Where are you getting this weather data?


All over. It's not strictly weather data, either. The majority of the data comes from programs within DSCOVR, LPDAAC, NASA, NOAA, NCDC, USGS, ARGO, CDC, CDIAC (ORNL), DOE, EIA, EPA, FDA, EOSWEB, USFS, PHMSA and USDA. The snapshots I have already total around 50TB; I'm working on setting up updates at daily resolution for each of these sources, documenting it all, and transforming it into something queryable.
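As a rough sketch of the change-detection step a daily-update job like that needs (assuming each source's snapshot lands as plain files on disk; the names and layout here are illustrative, not the actual pipeline):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large snapshots don't load into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(snapshot_dir: Path, manifest_path: Path) -> list[Path]:
    """Compare current checksums against the previous run's manifest and
    return only the files that need re-ingestion today."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new, dirty = {}, []
    for path in sorted(snapshot_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = sha256_of(path)
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            dirty.append(path)
    manifest_path.write_text(json.dumps(new, indent=2))
    return dirty
```

The point of hashing rather than trusting mtimes is that many agency FTP/HTTP mirrors re-publish identical files with fresh timestamps, and at 50TB you really don't want to re-ingest unchanged data.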


Is there a possibility that there is just too much data for one person to handle in a meaningful way?


Of course, that may very well be a possibility. In that case I'll have to scale up the effort.


Wow :)

High-level questions:

- At the risk of birthing a page-long subthread on ZFS-vs-everything-else... what storage solution are you using, and why?

- What sort of hardware are you using? (This is a non-catalyzing question, and is just out of curiosity)

- How did this get started, and how are you managing this?

- Will you ever be interested in accepting donations or funding? (Including on a voluntary basis; and including with clear stipulations/structure about management)

- What sorts of decisions/motivations led to this initiative?

More focused:

- Do you have any interest in creating this as a "community hub", with a central code repo that data scientists can push updates to that then get run on the cluster? If the visualization data (or prerendered bits of it) are openly cached/accessible, having the code that generated the data equivalently open/available could be interesting.

- What kind of availability/openness are you looking at with data? (TL;DR translation: rate limiting)

You may already be aware (very likely), but AFAIK archive.org is interested in this kind of thing. They're an interesting bunch of people, their various projects do cover climate data, and they have a few hundred PB of space, FWIW.


Take a look at quiltdata.com (open source), YC S17


Thanks for expressing interest... I'll respond in order.

> - At the risk of birthing a page-long subthread on ZFS-vs-everything-else... what storage solution are you using, and why?

Right now, XFS and HDFS. This is primarily due to my prior experience with them; I have considered ZFS, and might move to it later on. I may also dispense with HDFS in favor of shared storage, which would reduce the redundant JVM memory footprint. I'm attracted to ZFS for its compression and error-correction features; I just don't have any actual experience using it.
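One way to gauge what filesystem-level compression would buy before migrating is to run a sample of the actual data through stdlib codecs. This is only a rough proxy (ZFS typically uses lz4 or gzip inline, not these exact codecs), but the ratios are usually in the same ballpark:

```python
import bz2
import lzma
import zlib


def compression_report(payload: bytes) -> dict[str, float]:
    """Compress a data sample with several stdlib codecs and report the
    compressed-size / original-size ratio for each (lower is better)."""
    codecs = {
        "zlib (gzip-like)": zlib.compress(payload, level=6),
        "bz2": bz2.compress(payload),
        "lzma (xz)": lzma.compress(payload),
    }
    return {name: len(blob) / len(payload) for name, blob in codecs.items()}


# Gridded/tabular climate data is often highly repetitive,
# so the ratios can be dramatic on a sample like this:
sample = b"temperature,273.15\n" * 10_000
for name, ratio in compression_report(sample).items():
    print(f"{name}: {ratio:.3f}")
```

If the sample ratios look good, that's a decent signal that enabling `compression=lz4` on a ZFS dataset would pay for itself in both disk space and effective read throughput.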

> - What sort of hardware are you using? (This is a non-catalyzing question, and is just out of curiosity)

Four bare metal servers, three of which have dual Xeon E5-2630 v4s (10 cores, 20 threads, 2.2 GHz) and one of which has an i7-6900K (8 cores, 16 threads, 3.2 GHz). The latter server has 10TB of SATA SSD capacity and four GTX 1080 GPUs; the other three each have 2TB of NVMe SSD capacity. Each of the four has 128GB of DDR4-2400 RAM. Currently all storage is local to the hardware, but there is a SuperMicro SC847 chassis for future expansion. Every server has dual 10Gb/s SFP+ connections bonded with 802.3ad LACP, and all are connected via a 16-port 10G network switch in the same rack. I think that's everything at a high level, off the top of my head.

> - How did this get started, and how are you managing this?

I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized. Not sure what you mean by "managing this" though.

> - Will you ever be interested in accepting donations or funding? (Including on a voluntary basis; and including with clear stipulations/structure about management)

I initiated the process of starting a formal nonprofit, but not for the purpose of soliciting donations; rather, just so that it would be clear it's a noncommercial activity. I might be interested in that kind of thing - in either of the ways you mentioned - once I have exhausted my own reasonable resources for the task and have to significantly expand.

> - What sorts of decisions/motivations led to this initiative?

It began with reading worrydream's blog post, "What can a technologist do about climate change?": http://worrydream.com/ClimateChange/. Not much more to it than that.

> - Do you have any interest in creating this as a "community hub", with a central code repo that data scientists can push updates to that then get run on the cluster? If the visualization data (or prerendered bits of it) are openly cached/accessible, having the code that generated the data equivalently open/available could be interesting.

Yes, that's an interesting idea.

> - What kind of availability/openness are you looking at with data? (TL;DR translation: rate limiting)

Well, everything is intended to be extremely transparent, so I'm going to open-source all software (infrastructure, development, research, etc.). I'm also going to keep all data open. In practice there will probably be a limit of 1,000 or so queries per day per IP address, but I'll burn that bridge when I get to it. It really depends on how "real time" queries end up being, and how much abuse the system actually receives.
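For what it's worth, a per-IP daily limit like that can start as something very simple, e.g. a fixed-window counter keyed on IP that resets at each UTC day boundary. This is only a sketch under those assumptions (in-memory, single process); a real deployment would persist counts and enforce them at the load balancer:

```python
import time
from collections import defaultdict


class DailyQuota:
    """Fixed-window quota: allow up to `limit` queries per IP per UTC day.
    In-memory sketch only; counts vanish on restart by design here."""

    def __init__(self, limit: int = 1000, clock=time.time):
        self.limit = limit
        self.clock = clock          # injectable for testing
        self.window = None          # current UTC day number
        self.counts = defaultdict(int)

    def allow(self, ip: str) -> bool:
        day = int(self.clock() // 86_400)
        if day != self.window:      # new day: reset all counters
            self.window = day
            self.counts.clear()
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit


quota = DailyQuota(limit=3)
print([quota.allow("203.0.113.7") for _ in range(4)])  # → [True, True, True, False]
```

The fixed window has the usual caveat that a client can burst 2x the limit across a midnight boundary; a sliding window or token bucket fixes that if it ever matters.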


Thanks for the reply! Not sure if you'll see this; had some unexpected delays finishing this comment.

Everything is duly noted; I'll just expand on some bits.

> I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized.

Archive.org is very probably interested in this sort of thing then, FWIW.

> Not sure what you mean by "managing this" though.

Heh :) I was curious what sort of work you do in order for this initiative, which appears(?) to be somewhat of a side project, to remain viable in terms of budget and spare time. I'm also interested in storing/working with somewhat large amounts of data (one project I want to try at some point is implementing an infinite browser cache so I can "Google" the content of every version of every webpage I've ever visited), so I definitely want to optimize for something that doesn't require much time :) (don't we all)

> It began with reading worrydream's blog post, "What can a technologist do about climate change?" ... Not much more to it than that.

After having read that page I now understand the sentiment of that statement. Major TIL :)

And if only all webpages were that well designed... (The interactivity was a bit of an information overload, but I really liked the layout.)



