
I'm in the exploratory stages of starting a noncommercial project focused on hosting, processing and serving a vast amount of climate data for scientists and researchers. I've "seeded" this project with my own bare metal hardware, electricity and internet access. This includes 136 vCPUs, 512GB of RAM and 128TB of hard drive storage capacity (with room for another 456TB). As my own money permits, I plan to slowly add to this lab over time.

My initial consideration is to develop and host a sort of public Wolfram Alpha, but specialized for climate data queries and visualization. I'm open to other suggestions, as I'm currently in the process of architecting the system and reaching out to climate scientists for feedback on what they need that isn't well-served yet.

Once all the infrastructure work is done, I'd be happy to contribute to climate modeling more theoretically. But for now my priority is to develop a central repository for as much climate data as possible, normalize it where reasonable, and store it in a queryable format.

There are interesting technical problems here due to the scale of the data, the variety of its native (raw) formats, the frequency with which each source updates, and the types of information involved. But speaking directly to the title of this post: I don't think anything I'm doing is strictly an unsolved technical problem; it's just a complex technical undertaking. There are plenty of data serving and processing pipelines that have already been proven capable of this.



This sounds similar to Radiant Earth (https://www.radiant.earth/). They are also spearheading the Cloud Optimized GeoTIFF project (http://www.cogeo.org/) and the SpatioTemporal Asset Catalog spec (https://github.com/radiantearth/stac-spec).


Have you seen https://github.com/pangeo-data/pangeo? It's pretty amazing.

There's still a lot left to figure out, even storage formats aren't fully solved.


Hey, I'd love to help out in any way I can. I'm a data scientist by training, also involved in a startup that helps people learn data science (with a pretty popular data science blog too).


I love the focus. Starting at the base level data and working to normalize it is definitely the most difficult piece. How far back have you assembled data?

Feel free to reach out to me individually (email in profile), I'm working on operationalizing climate data like this and would like to look at this together!


Your email is not in your profile :)


Where are you getting this weather data?


All over. It's not strictly weather data, either. The majority of the data comes from programs within DSCOVR, LPDAAC, NASA, NOAA, NCDC, USGS, ARGO, CDC, CDIAC (ORNL), DOE, EIA, EPA, FDA, EOSWEB, USFS, PHMSA and USDA. The snapshots I have already total around 50TB; I'm working on setting up updates at daily resolution for each of these sources, documenting it all, and transforming it into something queryable.
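As a rough sketch of the change-detection step a daily-update job like that needs (assuming each source's snapshot lands as plain files on disk; the names and layout here are illustrative, not the actual pipeline):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large snapshots don't load into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(snapshot_dir: Path, manifest_path: Path) -> list[Path]:
    """Compare current checksums against the previous run's manifest and
    return only the files that need re-ingestion today."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new, dirty = {}, []
    for path in sorted(snapshot_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = sha256_of(path)
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            dirty.append(path)
    manifest_path.write_text(json.dumps(new, indent=2))
    return dirty
```

The point of hashing rather than trusting mtimes is that many agency FTP/HTTP mirrors re-publish identical files with fresh timestamps, and at 50TB you really don't want to re-ingest unchanged data.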


Is there a possibility that there is just too much data for one person to handle in a meaningful way?


Of course, that may very well be a possibility. In that case I'll have to scale up the effort.


Wow :)

High-level questions:

- At the risk of birthing a page-long subthread on ZFS-vs-everything-else... what storage solution are you using, and why?

- What sort of hardware are you using? (This is a non-catalyzing question, and is just out of curiosity)

- How did this get started, and how are you managing this?

- Will you ever be interested in accepting donations or funding? (Including on a voluntary basis; and including with clear stipulations/structure about management)

- What sorts of decisions/motivations led to this initiative?

More focused:

- Do you have any interest in creating this as a "community hub", with a central code repo that data scientists can push updates to that then get run on the cluster? If the visualization data (or prerendered bits of it) are openly cached/accessible, having the code that generated the data equivalently open/available could be interesting.

- What kind of availability/openness are you looking at with data? (TL;DR translation: rate limiting)

You may already be aware (very likely), but AFAIK archive.org is interested in this kind of thing. They're an interesting bunch of people, their various projects do cover climate data, and they have a few hundred PB of space, FWIW.


Take a look at quiltdata.com (open source), YC S17


Thanks for expressing interest... I'll respond in order.

> - At the risk of birthing a page-long subthread on ZFS-vs-everything-else... what storage solution are you using, and why?

Right now, XFS and HDFS. This is primarily due to my prior experience with them; I have considered ZFS, and might move to it later on. I may also dispense with HDFS in favor of shared storage, which would reduce the redundant JVM memory footprint. I'm attracted to ZFS for its compression and error-correction features; I just don't have any actual experience using it.
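One way to gauge what filesystem-level compression would buy before migrating is to run a sample of the actual data through stdlib codecs. This is only a rough proxy (ZFS typically uses lz4 or gzip inline, not these exact codecs), but the ratios are usually in the same ballpark:

```python
import bz2
import lzma
import zlib


def compression_report(payload: bytes) -> dict[str, float]:
    """Compress a data sample with several stdlib codecs and report the
    compressed-size / original-size ratio for each (lower is better)."""
    codecs = {
        "zlib (gzip-like)": zlib.compress(payload, level=6),
        "bz2": bz2.compress(payload),
        "lzma (xz)": lzma.compress(payload),
    }
    return {name: len(blob) / len(payload) for name, blob in codecs.items()}


# Gridded/tabular climate data is often highly repetitive,
# so the ratios can be dramatic on a sample like this:
sample = b"temperature,273.15\n" * 10_000
for name, ratio in compression_report(sample).items():
    print(f"{name}: {ratio:.3f}")
```

If the sample ratios look good, that's a decent signal that enabling `compression=lz4` on a ZFS dataset would pay for itself in both disk space and effective read throughput.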

> - What sort of hardware are you using? (This is a non-catalyzing question, and is just out of curiosity)

Four bare metal servers, three of which have dual Xeon E5-2630 v4s (10 cores, 20 threads, 2.2 GHz) and one of which has an i7-6900K (8 cores, 16 threads, 3.2 GHz). The latter server has 10TB of SATA SSD capacity and four GTX 1080 GPUs; the other three each have 2TB of NVMe SSD capacity. Each of the four has 128GB of DDR4-2400 RAM. Currently all storage is local to the hardware, but there is a SuperMicro SC847 chassis for future expansion. Every server has dual 10Gb/s SFP+ connections bonded with 802.3ad LACP, and all are connected via a 16-port 10G network switch in the same rack. I think that's everything at a high level, off the top of my head.

> - How did this get started, and how are you managing this?

I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized. Not sure what you mean by "managing this" though.

> - Will you ever be interested in accepting donations or funding? (Including on a voluntary basis; and including with clear stipulations/structure about management)

I initiated the process of starting a formal nonprofit, but not for the purpose of soliciting donations; rather, just so that it would be clear it's a noncommercial activity. I might be interested in that kind of thing - in either of the ways you mentioned - once I have exhausted my own reasonable resources for the task and have to significantly expand.

> - What sorts of decisions/motivations led to this initiative?

It began with reading worrydream's blog post, "What can a technologist do about climate change?": http://worrydream.com/ClimateChange/. Not much more to it than that.

> - Do you have any interest in creating this as a "community hub", with a central code repo that data scientists can push updates to that then get run on the cluster? If the visualization data (or prerendered bits of it) are openly cached/accessible, having the code that generated the data equivalently open/available could be interesting.

Yes, that's an interesting idea.

> - What kind of availability/openness are you looking at with data? (TL;DR translation: rate limiting)

Well, everything is intended to be extremely transparent, so I'm going to open-source all software (infrastructure, development, research, etc.). I'm also going to keep all data open. In practice there will probably be a limit of 1,000 or so queries per day per IP address, but I'll burn that bridge when I get to it. It really depends on how "real time" queries end up being, and how much abuse the system actually receives.
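For what it's worth, a per-IP daily limit like that can start as something very simple, e.g. a fixed-window counter keyed on IP that resets at each UTC day boundary. This is only a sketch under those assumptions (in-memory, single process); a real deployment would persist counts and enforce them at the load balancer:

```python
import time
from collections import defaultdict


class DailyQuota:
    """Fixed-window quota: allow up to `limit` queries per IP per UTC day.
    In-memory sketch only; counts vanish on restart by design here."""

    def __init__(self, limit: int = 1000, clock=time.time):
        self.limit = limit
        self.clock = clock          # injectable for testing
        self.window = None          # current UTC day number
        self.counts = defaultdict(int)

    def allow(self, ip: str) -> bool:
        day = int(self.clock() // 86_400)
        if day != self.window:      # new day: reset all counters
            self.window = day
            self.counts.clear()
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit


quota = DailyQuota(limit=3)
print([quota.allow("203.0.113.7") for _ in range(4)])  # → [True, True, True, False]
```

The fixed window has the usual caveat that a client can burst 2x the limit across a midnight boundary; a sliding window or token bucket fixes that if it ever matters.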


Thanks for the reply! Not sure if you'll see this; had some unexpected delays finishing this comment.

Everything is duly noted; I'll just expand on some bits.

> I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized.

Archive.org is very probably interested in this sort of thing then, FWIW.

> Not sure what you mean by "managing this" though.

Heh :) I was curious what sort of work you do in order for this initiative, which appears(?) to be somewhat of a side project, to remain viable in terms of budget and spare time. I'm also interested in storing/working with somewhat large amounts of data (one project I want to try at some point is implementing an infinite browser cache so I can "Google" the content of every version of every webpage I've ever visited), so I definitely want to optimize for something that doesn't require much time :) (don't we all)

> It began with reading worrydream's blog post, "What can a technologist do about climate change?" ... Not much more to it than that.

After having read that page I now understand the sentiment of that statement. Major TIL :)

And if only all webpages were that well designed... (The interactivity was a bit of an information overload, but I really liked the layout.)



