
Repo maintainer here.

...can someone explain how the repo keeps resurfacing? I haven’t promoted it in a long time. (Looking at the repo traffic, it recently spiked on the 6th, but nothing since then.)



Since you are here: thanks for making this. I recently used it to prove to a client that the API I delivered could take any content they cared to throw at it. They were especially impressed considering they were coming from a 35-year-old system that only allowed ASCII.

The BLNS allowed me to prove it, and I hooked it into our integration and fuzz tests, which managed to shake out a few bugs.
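For anyone wondering what hooking the list into a test suite might look like, here is a minimal sketch, not the setup described above. It assumes you have a local copy of the repo's JSON file (a JSON array of strings), and `round_trip` is a hypothetical stand-in for whatever API is under test.

```python
import json
from pathlib import Path


def load_naughty_strings(path):
    """Load a JSON array of strings, e.g. a local copy of the repo's blns.json."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Keep only string entries so a malformed file does not break the test loop.
    return [s for s in data if isinstance(s, str)]


def round_trip(text):
    # Hypothetical stand-in for the system under test: encode to JSON and back.
    return json.loads(json.dumps(text))


def find_failures(strings, func):
    """Return the strings that func either rejects or fails to return unchanged."""
    failures = []
    for s in strings:
        try:
            if func(s) != s:
                failures.append(s)
        except Exception:
            failures.append(s)
    return failures
```

In a real suite you would replace `round_trip` with a call that stores the string through your API and reads it back, then assert that `find_failures(...)` is empty.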


You're very welcome! :D


It was brought up during VMware's internal security conference called "MooseCon" earlier today during a talk on Unicode.

No idea if it's just coincidental resonance though.


Yes, a few people from VMware made a PR (which I just merged).


Tangentially related to the original project intent:

Is there a place where common things in the dev world like this are accumulated? For example, a list of all countries or a list of the US states, for use with an HTML dropdown. I know there are various repos on GitHub that maintain these types of lists, such as English stop words, profanity word lists, etc., but is there a service that accumulates these in a familiar, structured API?


Look at Wikipedia's lists of things. For your particular examples:

https://en.wikipedia.org/wiki/List_of_sovereign_states

https://en.wikipedia.org/wiki/U.S._state

Some of them are quite meta, such as https://en.wikipedia.org/wiki/List_of_lists_of_lists

For a more structured source, Wikidata aims to be that, but I cannot comment on its completeness.


Oftentimes, instead of the list of US states, you actually want the list of US states and territories:

https://en.wikipedia.org/wiki/List_of_states_and_territories...

For example, when the intent is "list of places where the USPS ships" or "list of state-level political jurisdictions where US residents live".


Let’s move this tidbit to a structured API of common knowledge! Dewey Decimal for data: not just a generic search engine for datasets in different formats (like the recent Google datasets site), but a familiar, go-to resource.



Have you ever used Wikidata? It's kind of a shitshow.


Yes! Surprisingly often I am unable to complete a form because there is no option in the State field for Washington, DC.


Structured, maintained API though, not general knowledge. I personally see it as a problem that everyone has to accumulate their own stash of structured data for common knowledge (random examples): countries, zip codes, valid HTML5 element names, CSS properties, hex colors, common name prefixes/suffixes/professional titles, etc. It's a growing list of work repeated by each dev team/company for really no reason. No complaint about this repo at all, just asking whether a solution exists.



> Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.


Thanks, will check it out.


Here's a platform-specific example: https://github.com/SmileyChris/django-countries/


Wikidata has a SPARQL API though
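A rough sketch of what a request for the "list of US states" case could look like, building only the URL and not sending it. The endpoint is the public Wikidata Query Service; the helper names are made up for illustration, and the class ID used in the usage note (`Q35657`, which I believe is "U.S. state") is an assumption you should verify on Wikidata before relying on it.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_instances_query(class_qid, limit=100):
    """Build a SPARQL query listing items that are 'instance of' (P31) class_qid."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Usage would be something like `build_request_url(build_instances_query("Q35657"))`, fetched with any HTTP client. Completeness of the results is a separate question, as noted elsewhere in the thread.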


I've used faker [0] for stuff like this. I think it was originally a Perl package, and there are similar packages in other languages as well. I've used the Python implementation and enjoy it, along with its localization feature.

It looks like 1.0 was just released as well :D

[0] https://github.com/joke2k/faker/releases


SecLists (https://github.com/danielmiessler/SecLists) contains a wealth of security-related lists of this sort, including a useful section containing the most common passwords.


That's a cool idea. I've seen individual packages for things like US states and HTTP status codes but I don't think I've ever seen them all packaged together.


Would it essentially be like a graph with multiple nested nodes with different strands of info?


Somebody needed to find strings to test his/her app with, saw your repo, found it interesting and posted it here.

About the repo: nice job, I've used it a lot when testing sites/apps I did, good job on providing different formats too so it's easy to automate testing!


I imagine because it's useful and has a fun name. So when someone stumbles across it, they post it. I've seen it on here a number of times and I still upvote it...because it's useful and has a fun name.


It's a pretty useful list. I do wonder how many people actually end up having to rebuild their databases after running a test!


It sounds like the VMware comment is the most likely explanation, but I thought I would share how I learned of the project just yesterday. There was an HN post yesterday about https://sr.ht/ and in looking at that I noticed the project used a blacklist of usernames that I thought was cool, so when I took a look at that project it had a link to this repo.


The current spike might be due to a recent post [1] on the programming subreddit.

[1] https://www.reddit.com/r/programming/comments/9xla2j/naughty...


That Reddit post was made after this HN submission hit the top.


Thanks for this list! And I appreciate that the RTL naughty string contains Hebrew for the first line of Genesis 1. :-)


I would guess because it's a useful tool that gets shared whenever people think about these issues. It's a nice reminder that not everything needs to be regularly updated or promoted to be useful. :)


It's the kind of thing that sticks in your mind when you think of weird things going wrong with string input or Unicode rendering.


Because you have created something great and novel.

I stumbled on it for the first time and already saved it for future testing. Thanks :)


It's a great testing resource. Way more concrete than 'well just test all the different strings'


Have you gotten any interesting or offensive pull requests?



