Interesting read, I did not know about Common Crawl. I feel like RTBF is kind of a lost battle these days with more and more crawlers for AI and whatnot. Once on the internet there is no way back, for better or for worse. This tangent aside, 8TB is really not a lot of data, it's just 8 consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online", maybe the largest public one. Not sure how representative it is of "the whole internet".


>I feel like RTBF is kind of a lost battle these days

For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.


RTBF was a ludicrous concept before AI and these new crawlers.

Only EU bureaucrats would have the hubris to believe you could actually, comprehensively remove information from the Internet. Once something is spread, it is there, forever.


RTBF isn't about having your information wiped from the internet. It's a safe assumption that any public information about you is completely out of your control as soon as it's public.

RTBF is about getting companies to get rid of any trace of you so they cannot use that data, not removing all traces about you across the internet.


>RTBF isn't about having your information wiped from the internet.

Your take is misleading enough to be considered wrong. It's "don't use public information about me in search engines, I don't want people to find that information about me", not simply "don't use my information for marketing purposes".

https://en.wikipedia.org/wiki/Right_to_be_forgotten

first paragraph of the article: The right to be forgotten (RTBF) is the right to have private information about a person be removed from Internet searches and other directories in some circumstances. The issue has arisen from desires of individuals to "determine the development of their life in an autonomous way, without being perpetually or periodically stigmatized as a consequence of a specific action performed in the past". The right entitles a person to have data about them deleted so that it can no longer be discovered by third parties, particularly through search engines.


Once demographic data cannot be crawled or cached by 3rd parties, we get RTBF for free.


RTBF does not ban crawling or caching. It bans opening up those archives to the public via search engines.


It bans having them in the first place. Not just looking at them.


> Once something is spread, it is there, forever.

Really depends on the content. Tons of websites go down every day; link rot is a real thing. Neither the Internet Archive nor individual users come close to saving everything.

Something I should do more often is saving mhtml copies of webpages I find interesting.


  > Something I should do more often is saving mhtml copies of webpages I find interesting.
They consume so much disk space. I wish that there were some intermediate format that would have a file size only two orders of magnitude larger than the webpage text, yet provide enough formatting to be useful.


Correct me if I'm wrong, but I always took RTBF to mean you have the right to be forgotten by any specific service provider: that you can request they delete the data they have that relates to you, and that they forward the request to any subprocessors. That's fairly reasonable and doable, it is enforced by GDPR and a number of other wide-reaching laws already, and it is a relatively common practice nowadays to allow users to make such requests with certain guarantees.

It never meant that you have the right to ask "the Internet" as a whole to scrub you from all possible records; that's indeed ludicrous. And if someone took it to mean that and pushed for it, they were just confused: no serious law ever proposed that.


There is a whole business sector for "online reputation fixers".

https://www.mycleanslate.co.uk/

What they usually do

- Spam Google with the name to bury content

- Send legal threats and use GDPR

They have legit use cases, but are often used by convicted or shady businessmen, politicians, and scammers to hide their earlier misdeeds.


Also, as a neurodivergent person, I feel very much discriminated against when a whole continent weaponizes the law to protect scam artists who weaponize their social skills to steal from people. It makes me feel unwelcome going to Europe, and for all the hand-wringing about Europe's poor economic performance, it is yet another explanation of why Europe is falling behind: their wealth is being stolen by people who can't be held accountable.


Which scam artists are you referring to?


The ones who have filed lawsuits to try to get people in Europe to forget about their crimes.


Do you have some examples? I was not aware that this was a thing. And are we talking about sentences fully served, or before that time?


> RTBF

Right to be forgotten, not the Belgian public service broadcaster (https://en.wikipedia.org/wiki/RTBF)?


Living in Belgium, I first thought that it was about the TV/radio service. Never saw the acronym R.T.B.F.


Doesn't sound like a lot, but where I am now we routinely work on very large infrastructure projects and the plans, documents and stuff mostly come as PDF. We are talking of thousands of documents, often with thousands of pages, per project and even very big projects almost never break 20 GB.

If you like, you could say PDFs are information-dense but data-sparse. After all, a PDF is mostly white space ;)


They often aren't like you're describing, though. For example, PDFs with embedded high-res images that are drafts of future book or pamphlet prints. These can be hundreds of MB for a single PDF with fewer than 100 pages, and are so common in marketing departments that it's hard to imagine that you could fit anywhere close to all the PDFs on 8TB.


True, we get plenty of high-res pictures of film in PDF here and some of them are ridiculously large, easily approaching gigabyte sizes, like you said. But that's more a problem of the user creating the PDF than something inherent to PDFs. A raw 36-megapixel reproduction of an ISO 400 film frame (our fancy 4K displays are only 8.3 megapixels, for comparison) takes only about 70 MB, which tells us that something went wrong in the transfer if a PDF containing 10 pages of them cracks 1 GB.
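
For anyone who wants to check those numbers, here is a rough Python sketch. The 2 bytes per pixel (one 16-bit sample per pixel, as in an uncompressed Bayer-style raw) is my assumption about where the roughly 70 MB figure comes from, and the helper names are made up for illustration.

```python
# Rough size check for raw images. Assumes 2 bytes per pixel
# (one 16-bit sample per pixel); real scans may use other layouts.

def megapixels(width: int, height: int) -> float:
    """Pixel count in millions."""
    return width * height / 1_000_000

def raw_size_mb(mp: float, bytes_per_pixel: int = 2) -> float:
    """Uncompressed size in decimal megabytes for an image of `mp` megapixels."""
    return mp * 1_000_000 * bytes_per_pixel / 1_000_000

uhd_mp = megapixels(3840, 2160)   # a "4K" UHD display
one_frame_mb = raw_size_mb(36)    # one 36-megapixel frame
ten_frames_mb = 10 * one_frame_mb # ten such frames in one PDF

print(round(uhd_mp, 1))  # 8.3
print(one_frame_mb)      # 72.0
print(ten_frames_mb)     # 720.0
```

Under that assumption, ten uncompressed frames still come in under 1 GB, before any compression is applied.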

So, yeah, there are these monsters that send even beefy computers thrashing. But in my experience something went wrong in the creation process, and that is appallingly common for a trade where PDFs are the go-to transfer format (I'm looking at you, AutoCAD users!). I'd guess that the archive is doing the same as we do: reprocessing them for sensible results and storing them. I assume you think the archive does not, and then I'd agree with you. One determined civil engineer with AutoCAD can fill 8 TB in a week ;)


I'm doing some work for a company that handles scanned documents (PDFs which are purely images) and they accumulate about 15 TB / year. Of course the actual amount of information is relatively small, just inflated by being scanned. Probably 80% of them were typed up, printed, and then scanned or faxed, and of course the first thing we do is OCR them to try to recover the original text and formatting...


I've been doing some work for an infrastructure company as well. They have a total of about 1 billion pages of PDF documents in their archives. If we assume even just 30 KB per page (which is quite low, all the PDFs I just randomly checked were higher, sometimes quite a bit so), that's already 30 TB of PDFs, just for that one company with 1B in annual sales.
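
As a quick sanity check of that arithmetic, here is a minimal Python sketch. The constants are just the round numbers from this comment (I'm treating 30 KB as decimal kilobytes), and `archive_size_tb` is a made-up helper name.

```python
# Back-of-envelope check of the estimate above.
# Both inputs are rough figures from the comment, not measured values.
PAGES = 1_000_000_000        # about 1 billion pages in the archive
BYTES_PER_PAGE = 30 * 1000   # 30 KB per page (decimal kilobytes)

def archive_size_tb(pages: int, bytes_per_page: int) -> float:
    """Total size in decimal terabytes (1 TB = 10**12 bytes)."""
    return pages * bytes_per_page / 10**12

print(archive_size_tb(PAGES, BYTES_PER_PAGE))  # 30.0
```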


The common crawl only pulls documents less than a small limit (1MiB last I checked). Without special handling in this project, bigger documents than that would be missing.

So indeed, not representative of the whole Internet.


From the article:

>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.

This is where SafeDocs or CC-MAIN-2021-31-PDF-UNTRUNCATED enters the picture. This corpus was originally created by the DARPA SafeDocs program and what it did was refetch all the different pdfs from a snapshot of Common Crawl to have untruncated versions of them.
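
For illustration, here is a hedged Python sketch of how one might flag a fetched PDF as probably truncated. It is only a heuristic of my own, not how Common Crawl or SafeDocs actually do it: it assumes the 1 MiB cap mentioned in this thread and relies on the fact that a complete PDF normally ends with a `%%EOF` marker near the end of the file. `looks_truncated` is a hypothetical helper name.

```python
# Heuristic truncation check (an assumption-laden sketch, not an official method).
ONE_MIB = 1024 * 1024  # size cap as described in this thread

def looks_truncated(data: bytes, limit: int = ONE_MIB) -> bool:
    """Return True if the payload reached the size cap and has no %%EOF
    marker in its last 1024 bytes."""
    hit_limit = len(data) >= limit
    has_eof = b"%%EOF" in data[-1024:]
    return hit_limit and not has_eof

# A small, complete file ends with %%EOF and is below the cap.
complete = b"%PDF-1.4\n" + b"x" * 100 + b"\n%%EOF\n"
# A large file cut off at the cap loses its trailer.
cut_off = (b"%PDF-1.4\n" + b"x" * (2 * ONE_MIB) + b"\n%%EOF\n")[:ONE_MIB]

print(looks_truncated(complete))  # False
print(looks_truncated(cut_off))   # True
```

Files flagged this way would be the candidates to refetch in full.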


Tangentially related, I was once handed a single PDF between 2 and 5 GB in size and asked to run inference on it. This was the result of a miscommunication with the data provider, but I think it's funny and almost impressive that this file even exists.


Is it possible that the 8 TB is just the extracted text?


No, the SafeDocs dataset is unprocessed PDFs.


Yeah, 8TB is really tiny. Google Scholar was estimated to index 160,000,000 PDFs in 2015.[0] If we assume that a third of those are not behind paywalls, and the average PDF size is 1 MB, it ends up as something above 50TB of documents. Almost ten years later, the number of available PDFs of just scholarly communication should be substantially higher.

[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
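
Spelled out as a small Python sketch, the estimate looks like this. The one-third open-access share and the 1 MB average are the guesses from this comment, not measured values, and `open_corpus_tb` is a name made up for the example.

```python
# Rough estimate of openly available scholarly PDFs, using the guesses above.
INDEXED_PDFS = 160_000_000   # Google Scholar estimate for 2015, per [0]
OPEN_FRACTION = 1 / 3        # assumed share not behind paywalls
AVG_PDF_BYTES = 1_000_000    # assumed average size: 1 MB (decimal)

def open_corpus_tb(n_pdfs: int, open_fraction: float, avg_bytes: int) -> float:
    """Estimated corpus size in decimal terabytes (1 TB = 10**12 bytes)."""
    return n_pdfs * open_fraction * avg_bytes / 10**12

estimate = open_corpus_tb(INDEXED_PDFS, OPEN_FRACTION, AVG_PDF_BYTES)
print(round(estimate, 1))  # 53.3
```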


Anna's Archive has some 300M PDFs.


We're talking about the open web here. But yeah that's the point, the dataset is unreasonably small.


Libgen size is ~33TB so, no, it's not "the largest corpus of PDFs online".

(Although you could argue libgen is not really "public" in the legal sense of the word, lol).

Disregarding that, the article is great!

(edit: why would someone downvote this, HN is becoming quite hostile lately)


I think Libgen is ~100TB, and the full Anna's Archive is near a PB.

They all probably contain lots of duplicates but...

https://annas-archive.se/datasets


It's being downvoted because your number is really off. Libgen's corpus is 100+ TB.


8TB - ~8,000GB - is more than 33GB.


Whoops, typo!

But that's what the comments are for, not the downvotes.


I upvoted this comment because, though the number is wrong, it proves the point. The fact that the correct number proves the point even more is a reason _not_ to downvote the comment.


I haven't downvoted you but it is presumably because of your hasty typing or lack of proofreading/research.

33TB (first Google result from 5 years ago), not 33GB. More recent figures are larger.


>hasty typing or lack of proofreading/research

This is exactly what I meant with "HN is becoming quite hostile"

* I brought up something I looked up to support GP's argument.

* The argument is correct.

* I do it in good faith.

* G is literally next to T.

* I even praise the article, while at it.

"Oh, but you made a typo!".

Good luck, guys. I'm out.

PS. I will give my whole 7 figure net worth, no questions asked, transferred immediately to any account of their choice, to anyone here who has not ever made a typo in their life.


  > I will give all my 7 figure net worth, no questions asked, transferred immediately to any account of their choice, to anyone here who has not ever made a typo in their life.
My greatest typo was saying "I Do" when it should have been "I Go".


> I will give my whole 7 figure net worth

You sound deeply unpleasant to talk to.

Imaginary internet points are just that.


Don't take it too personally. Downvoting/flagging it makes it clear to people who come across it in the future that it's wrong.


I haven't ever made a typo, all of my mispelings are intended and therefore not mistakes


Some days it's worth it to burn some imaginary internet points for the good of the discussion and article. People downvote for various reasons, and we will never be able to figure out exactly why. Each person is different, and they all have days where they swing one way or another.


Like I said, I didn't downvote and took the time to answer your question. I didn't take the time to sugarcoat it.

You are interpreting bluntness as hostility; that's ultimately an issue for you to resolve.


You don't have to sugarcoat it.

You just have to read this site's guidelines and follow them.

Ez pz.


> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html


Have been throughout. Anyway, I hope you are able to reconsider and move on within HN.


> (edit: why would someone downvote this, HN is becoming quite hostile lately)

Also, there are browser extensions that will automatically downvote and/or hide HN comments that use words like "lol," or start with "So..." or include any of a number of words that the user considers indicative of low-grade content.



