Dat: non-profit, secure, and distributed package manager for data (datproject.org)
245 points by tambourine_man on Sept 26, 2017 | 68 comments


Dat is both a protocol and a toolset. The protocol is basically an improved version of BitTorrent that supports changes and versioning (here's a short page on its technical merits [1]). The toolset is a command-line tool and some desktop apps, and we use it in the Beaker Browser.

Dat couldn't be made by nicer people [2]. It's a non-profit led by Max Ogden [3] and the protocol work is led by Mafintosh [4], both of whom are pretty well known in the nodejs community. The project started as a way for academics to work on tabular datasets, but they pretty quickly found out that academics more often want to work on unstructured (or custom-structured) files. So Max recruited Mafintosh and they started working on the p2p protocol, which focused on improving archival and data-sharing flows within labs.

There's a pretty simple CLI you can install from NPM (npm i -g dat). Give it a try. Also, they need help securing grants, so if you have a talent or a connection in that area, definitely get in touch with them. They're doing good work.

1. https://beakerbrowser.com/docs/inside-beaker/

2. https://datproject.org/team

3. https://twitter.com/denormalize

4. https://twitter.com/mafintosh


I’ve contributed to a couple of Mafintosh’s projects. the dude is impressively prolific and writes good code. Nice to see him get some recognition here.


Hi everyone, I'm one of the core contributors to Dat, @joeahand. Happy to answer any questions. It's an interesting time to see this posted because I've been working on a new datproject.org site recently =).

Dat Project started with a focus on increasing access to research & public data. To support the data tools, we built a peer-to-peer protocol. People are doing some really cool stuff on top of Dat (such as Beaker Browser); we're really excited about it and want to make sure to support all the neat use cases.

We'll be launching an updated site soon to highlight more of the work around the protocol and what the community is building. Our main use case will still be data management but most of what you see on the current site will shift to a new domain.


Care to explain how this is different from Resilio? https://www.resilio.com/

I use this with encryption for my data folders on my projects.


Ya, there are a few other related questions below. Resilio is BitTorrent-based. But I'm not 100% familiar with how Resilio differs from BitTorrent.

The core difference is in our approach. We're all open source and a non-profit. We're also really focused on the research data use case where BitTorrent is less easily deployed.

We hope that making an open and easy to use p2p protocol will enable other developers to build applications on top, and something like Resilio could be one.


I've glanced at the docs before already, and dat seemed super cool except for one little thing: it seems that there can only be one source that is allowed to modify the dataset. Is this by design, or am I missing something?


That’s correct. But the protocol author mafintosh is currently working on a major new release which introduces multi-writer capabilities.


It looks to me, from the sources, like you use plain HTTP/TCP connections. BitTorrent uses UDP, which can be a huge gain in many situations. Do you plan to add some reliable-UDP overlay to speed things up? Something like Aspera but more open? Maybe QUIC is mature enough?


We use TCP or UDP for all the connections. UDP is especially helpful for hole-punching as well.

Direct HTTP support is not finished yet. But we're excited about it because it'll allow you to use S3 or other static file servers as peers.

Dat works over any protocol, so it's just a matter of implementing it.


Thanks! I assume it is implemented in mafintosh/hyperdrive?


How are you going to beat Globus for research data of appreciable size?


We'd love to work with them! We really like what they've done and we have a few partners that use Globus.

Underneath, all the transfers in Globus use GridFTP. Using Dat could help distribute bandwidth and speed up transfers. It'll also add version control for free, which (I think) Globus does not have yet.


So how does this compare with Quilt? From what I see:

1. Quilt is for profit while Dat is non-profit.

2. Dat has ~20 public datasets; Quilt has 50+.

3. Dat is on a shared network while Quilt is hosted on a centralized server

4. Both of them offer version control and hosting. Quilt has private hosting for a fee; Dat seems to have only public hosting.

5. Quilt is funded by YC; Dat is funded by non-profits.

6. Quilt has a Python interface while Dat has one in Javascript

I understand who Quilt is targeting, but I'm having trouble understanding who Dat is targeting.


I'm one of the creators of the Beaker browser[1] and the reason we use Dat is that as a p2p protocol, it offers a lot of neat properties, including making datasets more resilient. As long as one peer on the network is hosting a dataset, it will be reachable, even if the original author has stopped hosting it.

I won't speak authoritatively on behalf of the Dat team, but I believe one of their goals is to make it difficult for public scientific datasets to be lost, and data living on a centralized server is particularly vulnerable to that.

1. https://github.com/beakerbrowser/beaker


The use case really speaks to me, but I'm not convinced that decentralization is going to help datasets not to get lost.

I spent a while trying to download recent updates to the Reddit comment corpus [1], which is hosted on BitTorrent. The downloads never seem to finish.

It seems to me that decentralization means that, when a dataset stops being new and exciting, it will disappear. How will Dat counter this?

[1] https://www.reddit.com/r/datasets/comments/65o7py/updated_re...


Because Dat is just a protocol, decentralization is a choice. For quick, ephemeral exchanges direct P2P works brilliantly. For longer lived data sets, sharing it with a (commercial) mirror might make sense. Or perhaps you host it yourself. The beauty is that you, as a user of the protocol, get to decide what works best for you.


We have a few approaches to the disappearing data.

First, we are working with libraries, universities, or other groups with large amounts of storage/bandwidth. They'd help provide hosting for datasets used inside their institutes or other essential datasets.

Second, we started to work on at-home data hosting with Project Svalbard[1]. This is kind of a SETI@home idea where people could donate server space at home to help backup "unhealthy" data (data that doesn't have many peers).

Finally, for "published" data (such as data on Zenodo or Dataverse), we can use those sites as a permanent HTTP peer. So if the data isn't available from any p2p peers, you can still get it directly from the published source.

As others said, decentralization is an approach but not a solution. It gives you the flexibility to centralize or distribute data as necessary without being tied to a specific service. But we still need to solve the problem!

[1] https://medium.com/@maxogden/project-svalbard-a-metadata-vau...


That’s something we think about a lot, and decentralization isn’t a silver bullet solution to data loss, but I do think it’s more resilient than what we typically do now.

To counter that, you can take measures to mirror important datasets with a dedicated peer. It requires effort, but it at least makes it much, much harder for, say, a government agency to take down public data without warning.


Why dat or quilt and not blockchain?


A blockchain is an over-engineered solution to the problems we’re trying to solve. Blockchains provide shared global state. We don’t need that.

https://beakerbrowser.com/docs/inside-beaker/other-technolog...


A blockchain is a rather weak database in itself. However, using one to store pointers into a DHT-based system like Dat would be fine.


A rather weak database? How do you figure?


This may not always be the case, but, so far, blockchains have low throughput and fat datasets that you have to sync. Compared to other databases, they don't perform that well, so if you don't need decentralized strict consensus, a blockchain isn't a good choice.


Ah cool! I hadn't seen quilt before. You are spot on with the differences.

Dat is targeting similar users to Quilt. But we are also looking more broadly at libraries, labs, or other larger academic/gov't organizations managing data. There are a lot of data publishing tools in the sciences such as Zenodo. We'd love for it to be easier to download/publish data to those places. Dat is decentralized, so it really fits well in integrating other data tools.

You can use Dat to replace file transfer software like rsync, so it is a bit more general purpose.

Another difference not mentioned is that Dat really starts at the protocol level while Quilt is more software-focused. Dat protocol is a peer-to-peer protocol for syncing files, modeled off Git and BitTorrent. We built the data management software on top of the protocol.

Edit: I should mention we don't offer any hosting right now, all the data up there is temporarily cached. There is public hosting via Hashbase[1] from the Beaker team. The cool part about Dat being p2p is that it's really easy to switch hosts or use multiple hosts.

[1] https://hashbase.io/


If I uploaded a bunch of data that was obtained illegally there is nothing stopping me from doing that right?

Also, is your peer to peer network able to be attacked by nefarious users like a sybil attack? Is there a situation where I could alter or forge data?


> If I uploaded a bunch of data that was obtained illegally there is nothing stopping me from doing that right?

The hosting provider is responsible for removing illegal content. Dat itself doesn't track any content. datproject.org is more of a registry than a host.

> Also, is your peer to peer network able to be attacked by nefarious users like a sybil attack? Is there a situation where I could alter or forge data?

No, only authorized people can write to each dat key (currently only the owner, but multi-writer is coming soon). All the writes are signed with the writer's private key and then verified whenever content is downloaded.
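A rough sketch of that model (the entry layout is invented, and Python's HMAC stands in for the Ed25519 signatures the protocol actually uses; note HMAC is symmetric, so here the same secret signs and verifies, whereas in Dat readers only need the public key):

```python
import hashlib
import hmac

SECRET = b"writer-private-key"  # stand-in for the owner's Ed25519 private key

def sign_entry(data: bytes) -> bytes:
    """The writer signs each appended entry before publishing it."""
    return hmac.new(SECRET, data, hashlib.sha256).digest()

def verify_entry(data: bytes, sig: bytes) -> bool:
    """Downloaders re-check the signature before accepting content."""
    expected = hmac.new(SECRET, data, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sig)

entry = b"chunk 42 of results.csv"
sig = sign_entry(entry)
assert verify_entry(entry, sig)            # authentic content verifies
assert not verify_entry(b"tampered", sig)  # altered or forged content is rejected
```

The point is that a Sybil attacker can flood the swarm with peers, but any data they serve that wasn't signed by the writer simply fails verification and is discarded.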


> I understand who Quilt is targeting but I'm having trouble understanding who Dat is targeting

Academics, open data enthusiasts, hackers


The p2p and security aspects look very interesting[1] - apart from the obvious network effect/existing user base and tooling - are there any reasons to not prefer dat to bittorrent for all the things?

It looks like an interesting way to store/share backups and server images, easily scaling bandwidth and availability by spinning up new instances, or shifting across data centers?

Maybe also as an apt back-end similar to:

https://wiki.debian.org/DebTorrent

[1] https://docs.datproject.org/security


To draw a quick contrast between Dat and BitTorrent, BitTorrent magnet links are static, meaning if you change the content, you get an entirely new magnet link. Dat archives (a networked directory, essentially) are mutable, so you can publish modifications at a consistent address.
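To make the contrast concrete, here's a toy illustration (the `dat://` key value is made up, and the dict is a crude stand-in for Dat's signed append-only log):

```python
import hashlib

def magnet_style_id(content: bytes) -> str:
    """BitTorrent-style addressing: the address IS a hash of the content."""
    return hashlib.sha1(content).hexdigest()

# Change one byte and you get a brand-new magnet link:
assert magnet_style_id(b"dataset v1") != magnet_style_id(b"dataset v2")

# Dat-style addressing: the address is a public key that never changes;
# updates are appended under the same key.
archive_key = "dat://made-up-key"  # hypothetical; stays constant across updates
archive = {archive_key: []}
archive[archive_key].append(b"dataset v1")
archive[archive_key].append(b"dataset v2")
assert archive[archive_key][-1] == b"dataset v2"  # latest version
assert archive[archive_key][0] == b"dataset v1"   # history retained
```

So readers bookmark one address forever, and the signature chain (not the hash of the bytes) is what guarantees they're getting the author's real updates.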

I’m not familiar with DebTorrent, but if you’re interested to learn more about the innards of Dat, this post by pfraze is a good place to start:

https://beakerbrowser.com/2017/06/19/cryptographically-secur...


I can think of one use for this. I run computational plasma physics simulations for a living, and while some of us have gotten better about sharing code, sharing simulation results for published papers would be beneficial. Will have to think about this.


You should jump on #dat or get in touch with someone on the team via Twitter. They'd be happy to help you get set up.


This is very cool, but I think you need to go down to fine-grained permissions to be truly effective. Otherwise, it is pretty much BitTorrent with a private tracker.

For example, I should be able to give unique URLs (for the same data) to different users and expire one but continue for the other, etc.


Everything you put into dat is encrypted, and unique key pairs are used for each shared item. Only the people you share the public key with can access the data. That doesn't address the expiry use case, but it allows you to completely control who has access to what. Does that address your fine-grained permission needs?

Edit: The keys are very short (64 characters), so they can easily be copy/pasted, tweeted and what have you :)
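If I understand the design, the network itself never learns the read key: peers announce a "discovery key" derived by hashing the public key, so only people holding the actual link can find and decrypt the content. A toy sketch of that derivation (the key value is made up, and this simplifies hypercore's real construction, which I believe is a keyed BLAKE2b hash):

```python
import hashlib

# Hypothetical 32-byte public key; its hex form is the 64-character link you share.
public_key = bytes.fromhex("ab" * 32)

# Peers announce only a hash of the key to the network, so observers can see
# that *something* is being shared without learning the key required to
# locate and read the actual content.
discovery_key = hashlib.blake2b(public_key, digest_size=32).hexdigest()

assert len(public_key.hex()) == 64        # what you copy/paste or tweet
assert discovery_key != public_key.hex()  # what the swarm actually sees
```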


I was not able to express myself properly - sorry about that.

You generate one key pair for one shared item. I'm talking about multiple key pairs for each shared item, so that I can give access to individual users for the same data and revoke them when necessary.

Fundamentally, if you don't have that, then your case is trivially solved by a private BitTorrent tracker.

This is the fundamental difference between things like Quilt and BitTorrent.


One other crucial difference between Dat and BitTorrent is that Dat allows you to update datasets. With BitTorrent, you can't change or add files once you've shared your torrent.


The DHT mutable data proposal, BEP 46, aims to take care of that.

http://www.libtorrent.org/dht_store.html

previous HN discussion - https://news.ycombinator.com/item?id=12257065


No, BEP 46 only allows you to mutate DHT items, not torrents. By construction, torrents are immutable: if you change one bit, or the name of a file, the torrent's identity changes.

What you need to mimic dat is a more integrated way to tell other peers that the torrent changed and to check the new one... which is not there yet.


I'm not an expert at this, but this is what the BEP says:

http://www.bittorrent.org/beps/bep_0046.html

>The intention is to allow publishers to serve content that might change over time in a more decentralized fashion. Consumers interested in the publisher's content only need to know their public key + optional salt. For instance, entities like Archive.org could publish their database dumps, and benefit from not having to maintain a central HTTP feed server to notify consumers about updates.

You are technically right that the torrent file is immutable, but basically this lets clients know that the torrent is updated using the DHT data. The outcome is the same.
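In other words, the publisher signs a small mutable record whose payload is just the current infohash plus a sequence number; clients re-fetch that record to discover the newest (still immutable) torrent. A toy model (a dict stands in for the DHT, and HMAC for BEP 46's actual Ed25519 signatures; all names and values are invented):

```python
import hashlib
import hmac

PUBLISHER_SECRET = b"publisher-key"  # stand-in for the publisher's private key
dht = {}                             # toy DHT: pubkey -> (seq, infohash, sig)

def publish(seq: int, infohash: str) -> None:
    """The publisher updates the mutable item to point at the newest torrent."""
    payload = f"{seq}:{infohash}".encode()
    sig = hmac.new(PUBLISHER_SECRET, payload, hashlib.sha256).digest()
    dht["publisher-pubkey"] = (seq, infohash, sig)

def lookup() -> str:
    """Clients read the mutable item, verify it, and get the latest infohash."""
    seq, infohash, sig = dht["publisher-pubkey"]
    payload = f"{seq}:{infohash}".encode()
    assert hmac.compare_digest(
        hmac.new(PUBLISHER_SECRET, payload, hashlib.sha256).digest(), sig
    )
    return infohash

publish(1, "infohash-of-march-dump")
publish(2, "infohash-of-april-dump")  # higher seq supersedes the old pointer
assert lookup() == "infohash-of-april-dump"
```

The practical difference from Dat remains what's noted above: the pointer is mutable, but each referenced torrent is still a separate immutable download rather than an incrementally synced archive.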


Resilio (aka BitTorrent Sync) is for that purpose. https://www.resilio.com/


It looks cool, but I don't completely understand what this is about. Is it a way to share files over a P2P network? Isn't this basically the same as BitTorrent or IPFS?


It's very similar to BitTorrent but with a few key differences. For one, BitTorrent doesn't allow you to update or add files in a dataset once you've shared the torrent. Dat does. Dat also has versioning built in.

IPFS seems to me to be a bit over-engineered whereas dat is a lot more simple/low level - something that suits my way of working really well.
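A toy model of the built-in versioning (invented API, nothing like the real hyperdrive interface): writes append a new version under the same address instead of minting a new identity, and old versions stay addressable, similar in spirit to checking out a `dat://<key>+<version>` URL (syntax as I recall it; treat it as illustrative):

```python
class ToyArchive:
    """Append-only versioned store, loosely in the spirit of a Dat archive."""

    def __init__(self):
        self.history = []  # list of {filename: contents} snapshots

    def write(self, files: dict) -> int:
        """Adding/updating files creates a new version; the address is unchanged."""
        self.history.append(dict(files))
        return len(self.history)  # version number

    def checkout(self, version: int) -> dict:
        """Old versions stay readable."""
        return self.history[version - 1]

archive = ToyArchive()
archive.write({"data.csv": "a,b\n1,2"})
archive.write({"data.csv": "a,b\n1,2\n3,4"})  # updating after sharing: fine in Dat
assert archive.checkout(1)["data.csv"] == "a,b\n1,2"
assert archive.checkout(2)["data.csv"].endswith("3,4")
```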


> IPFS seems to me to be a bit over-engineered whereas dat is a lot more simple/low level

I also love simpler things. I'd be curious to understand where IPFS is over-engineered and where Dat is better instead.


I probably didn't give IPFS enough credit. What I should have said was that IPFS does way more than I need. Dat seems to hit the sweet spot for me


> For one, bittorrent doesn't allow you to update/add files in a dataset once you've shared the torrent. Dat does. Dat also has versioning built in.

Could this be used as a "distributed archive" to store any kind of information? Or does Dat only store some type of data? For example, could I have a "Dat dataset" containing a local Git repo, and every time I update my repo the new files are distributed to whoever is "watching" my dataset?


Dat allows you to store any type of data. Yes, I think you could do what you're suggesting (though I don't follow 100%). You can use it like you would Dropbox, except only one person can write at the moment.

Under the hood, there are two implementations: hyperdrive and hypercore. Hyperdrive is a filesystem for storing any kinds of files. Hypercore can store any kind of data and is really great for streaming data (hyperdrive is built on top of it).


> Yes, I think you could do what you're suggesting (though I don't follow 100%).

I was thinking of sharing a Git directory, directly from my computer instead of using any centralized provider such as GitHub/NotABug.

> Hypercore can store any kind of data and is really great for streaming data

Is there any web "bridge" available? I'm thinking in particular about streaming data over the P2P network, but also making it accessible on the web (for example, for streaming a video).


This seems like my setup, but I use Resilio (BitTorrent Sync) with encryption. I don't need version control for my data, but you get archives if you want them.


I recall at least two other "data package managers" out there being shared on HN a couple of months back.


You happen to remember which ones? Would be interested in checking them out.


I was thinking it would take a while to look back, but the new "upvoted submissions" tool is more useful than I thought. The project was Quilt[0] and here[1] are the comments on it. The other project I saw referenced in the comments was Pachyderm[2]. It looks like Dat indeed was referenced in the comments, and the main difference is that Dat is distributed while Quilt is centralized (like GitHub)...and Pachyderm is more like git.

[0] https://quiltdata.com/

[1] https://news.ycombinator.com/item?id=14771406

[2] http://www.pachyderm.io/

EDIT: Annnd...looking up, I see others have already referenced quilt in other comments here.


The other one I know of is datahub [1], which used to be Frictionless Data. It is from the same group that built CKAN.

[1] https://datahub.io/


Naive me's first thought would be: why not use git? Can someone with more experience in this area explain?


Git is not good for large files.

If the suggested solution is to use git-annex, I will say that git-annex is so poor in usability that the majority response to "you can get the data via git-annex" is "oh well, I'll try some other data then".


The git toolset is, out of the box, weak on tabular data. Also, out of the box, large-file support requires one central repository.

None of that is true for Dat.


> out of the box large file support requires one central repository.

How so? As far as I know, git doesn't treat large files any differently than small ones.


I'll try to answer your question, but it's been a while since I last looked at it, so I might get some/most of this wrong - so don't shoot me:

As far as I know git isn't good at storing binary data. Git depends on line breaks to be able to diff and make change sets. If you store a binary file in git and make an update to it - even though that update only changed 1 byte, the entire new version of the file is stored again. Dat uses Rabin fingerprinting to intelligently slice binary files into chunks that are less likely to change. That make dat a lot more efficient at storing, versioning, and syncing videos, images, and other large binary files.


Conceptually, git doesn't use change-sets; each commit is a snapshot of all the files in the current version. For storage and transmission efficiency, though, it can store and send them in packfiles, which use delta compression based on LibXDiff, which in turn uses Rabin's fingerprint algorithm (as well as another algorithm by Joshua P. MacDonald) for binary files.


GitHub definitely treats large files differently from small files (it rejects them), and a Git repository you can't push to GitHub is very different from a typical Git repository.

Other than that, Git treating large files the same as small ones is part of the problem, and the reason that centralized extensions exist. You wouldn't want "git clone" to clone the complete history of every large file, while that's not a concern at all for small files.


> a Git repository you can't push to GitHub is very different from a typical Git repository.

When comparing to Dat, I don't see how; you can't push any Dat repository to Github.

> Other than that, Git treating large files the same as small ones is part of the problem, and the reason that centralized extensions exist. You wouldn't want "git clone" to clone the complete history of every large file, while that's not a concern at all for small files.

You don't need centralized extensions for that, though. I use git-annex, which is completely P2P.


I was responding to the claim that "git doesn't treat large files any differently than small ones". I was not saying anything about dat or git-annex. Now it's clear that you are treating your large files differently from small ones.

Good luck with git-annex.


Right, but the claim was about git and you responded with Github, which is a different thing.


So, git is not good, but Dat isn't good either?


Can the fetch command be "get-dat"?...

Please?!


You could easily make an alias.


There was a startup I came across recently doing about the same thing... I don't recall the name.


Do you mean Quilt Data, Inc., with the product Quilt? It's being mentioned all over this thread.


Dat being short for data, would a package for the state assemblies be "Dat Ass"?


Astute.



