
It's crazy how poor the financial data provider offerings out there are. Most financial data is riddled with inconsistencies, wildly overpriced, and delivered in esoteric formats. Simply ingesting financial data in a reliable manner requires significant engineering.

For something so important to the economy, it's amazing that there isn't a better solution, or that an open standard hasn't been mandated.



I feel this; I email Quandl regularly to fix data errors that the simplest of automated checks should catch ("why is this price 1200% higher than the previous one?").
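
Something as simple as this would catch that kind of error (a pandas sketch, purely illustrative; the threshold and column handling are arbitrary, not anyone's actual pipeline):

    import pandas as pd

    def flag_price_spikes(prices: pd.Series, threshold: float = 5.0) -> pd.Series:
        """Flag observations whose day-over-day change exceeds `threshold`
        (e.g. 5.0 = 500%). Returns a boolean Series; True marks a print
        the vendor probably should have caught before publishing."""
        pct_change = prices.pct_change().abs()
        return pct_change > threshold

    # Example: a 1200% jump on the third observation gets flagged.
    prices = pd.Series([100.0, 101.0, 1313.0, 102.0])
    print(flag_price_spikes(prices))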

But, they do have a mostly-decent API (tables; timeseries is pretty bad).

Something that always bugs me is properly adjusting prices when backtesting. The "right" way seems to be how Quantopian now handles it [1], in a just-in-time fashion, but that code isn't in their public libraries, and over email they declined to tell me where they get the data.

[1] https://www.quantopian.com/quantopian2/adjustments


Always store unadjusted prices and volumes.

Keep an updated corp action table with date, corp action type, and adjustment factor.

Corp action type is important because divs adjust prices but not volumes, for example. Splits adjust both.

When you're ready to use an adjusted time series: select the corp actions you care about and calculate a running product of (1 + adjustment factor). As-of join the adjustment factors onto your price series, multiply, and you're done.
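
A rough pandas sketch of the above, assuming a made-up corp action table and the convention that (1 + factor) is the multiplier applied to prices before the action date; a plain filter stands in for a true as-of join, and volume adjustment for splits is omitted for brevity:

    import numpy as np
    import pandas as pd

    # Unadjusted daily closes, stored exactly as the vendor delivered them.
    prices = pd.DataFrame({
        "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]),
        "close": [100.0, 101.0, 50.5, 51.0],
    })

    # Corp action table.  `factor` is the fractional change such that
    # (1 + factor) multiplies all prices BEFORE the action date.
    # A 2-for-1 split on 2020-01-06 halves earlier prices, so factor = -0.5.
    corp_actions = pd.DataFrame({
        "date": pd.to_datetime(["2020-01-06"]),
        "type": ["split"],
        "factor": [-0.5],
    })

    def adjust(prices, corp_actions, kinds=("split", "dividend")):
        acts = corp_actions[corp_actions["type"].isin(kinds)]
        out = prices.sort_values("date").copy()
        # Running product of (1 + factor) over all actions strictly after each date.
        out["adj"] = [
            np.prod(1.0 + acts.loc[acts["date"] > d, "factor"]) for d in out["date"]
        ]
        out["adj_close"] = out["close"] * out["adj"]
        return out

    print(adjust(prices, corp_actions))  # the first two closes become 50.0 and 50.5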


I stand corrected; the code for on-the-fly adjusting _is_ in Zipline, but you have to know that stock splits are treated like dividends, which wasn't obvious to me.

At least some of the data they use comes from a vendor they aren't able to name publicly.


For my current job, we wanted to get a mapping of stock tickers and exchanges to CUSIPs. Every provider we looked at — and this is fundamental trade data — was full of errors and missing values. Couple that with the extortion that is CUSIP (you can't use CUSIP values without a license from them, and licenses start at $xx,xxx+). It's criminally inept. And when you do fix it up, you don't want to publish it, because you spent all your time and resources fixing it… and it becomes a trade secret.


This is why finance is lucrative, similar to esoteric codes in various types of law. Nothing to do with math models or superior prediction, just paying for someone else to fight through identifier hell, exchange protocol hell, etc., and be able to do some mickey mouse math at the end of it.

Honestly, this stuff is so bad that the headache alone might fully justify the huge compensation in finance. I've had colleagues turn down huge bonuses and raises in order to leave finance companies, solely to avoid this type of work and take a lower-paid career where the headaches bother them less.


Data cleaning/transformation ends up being a huge percentage of the work in pretty much any real-world ML context I'm familiar with. Not unique to finance at all.


I’ve worked for over a decade in industry machine learning, about half of that in quant finance. It is definitely much worse in finance than in other fields.

Even medical records do not present the same degree of esoteric data formatting and mismatching. It’s not really even a matter of data cleaning. It’s that there is _no_ way to clean the data; the only useful approach is to pay tens of thousands of dollars to data vendors whose products have intractable errors, and then build huge data validation and imputation systems around them.

When it boils down to fiduciary duty to the client, and you have a contractual obligation regarding portfolio composition, you can’t live with “good enough” data cleaning. Even a single asset with an incorrect identifier from your data vendor can cause you to, e.g., invest in an Israeli company in a portfolio with a client obligation to hold no Israeli companies (a real example I encountered).


I come from the non-technical side of things. Do you know of any resources that would cover this issue, but for someone on the business side?

Not an engineer, so while I understand this in a general/abstract sense, my understanding is limited to, "Cleaning/transformation is messy and a time sink due to non-standardization of data."


One good example I uncovered a while back was that Bloomberg timestamped its crude oil futures data by finding the last trade to occur in a given second and rounding down. This means that the user of the data had no idea whether the trade behind the 10:30:30 AM print occurred at 10:30:30.001 or 10:30:30.999. Obviously, this could create problems if you thought you had found a lead/lag relationship between, say, oil and oil stocks.
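
To make the truncation concrete (made-up timestamps): flooring to the second makes two very different sub-second trade times indistinguishable, so any lead/lag measured finer than a second is on shaky ground.

    import pandas as pd

    # Two hypothetical trades within the same second.
    trades = pd.Series(pd.to_datetime([
        "2019-05-01 10:30:30.001",
        "2019-05-01 10:30:30.999",
    ]))

    # Rounding down to the second, as the vendor does, collapses both
    # trades onto the same timestamp.
    print(trades.dt.floor("s"))  # both become 2019-05-01 10:30:30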

Similarly, say a vendor aggregated website visits/pageviews but didn't account for the fact that 1/3 of the traffic was coming from click-bots in developing countries. If they presented you with the raw data you could figure it out and filter those countries out, but if it is aggregated, you might not discover the issue.

Then there are even simpler ones, like determining the opening price for a stock. If, say, the first print of stock XYZ trades 10 shares at a price of $20, but a millisecond later 100k shares trade at $20.11, which print should you use as the opening print in your simulation algorithms?


Did you look at Factset's datafeed? I've found its reference data and symbology to be pretty reliable. CUSIPs will cost a lot with redistribution charges, though. You're better off avoiding them if possible.


Yeah, we did look at Factset. Ultimately we found repeated gaps in their symbology, since we needed a full set, including less commonly used symbols.


I agree. CUSIP is also a problem for the private individual (for whom all data needs to be free to use). While I have found a mapping online, I have no idea how accurate it is and have to trust that the (unaware) provider QAs the data.


I really would like to see something like Bloomberg's OpenFIGI take the place of CUSIPs, but it's not nearly as widely used: https://www.openfigi.com/ The API does allow you to convert from a CUSIP, though.
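
For example, a minimal sketch of an OpenFIGI mapping request (the CUSIP below is Apple's, used only as an example; the API key header is optional but gets you higher rate limits):

    import json
    import urllib.request

    # One mapping job: look up the FIGI(s) for a CUSIP.
    jobs = [{"idType": "ID_CUSIP", "idValue": "037833100"}]

    req = urllib.request.Request(
        "https://api.openfigi.com/v3/mapping",
        data=json.dumps(jobs).encode(),
        headers={"Content-Type": "application/json"},
        # Add "X-OPENFIGI-APIKEY": "<your key>" for higher rate limits.
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # list of results, one entry per job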


Yep. OpenFIGI plus LEI codes seems like the best practice going forward.


For anyone curious about esoteric formats, check out some of the documentation for financial data providers.

CRSP[1] is pretty much regarded as the highest-quality pricing data in the US, with stock prices going back to 1925. The database API is written for C and Fortran 95.

Data providers also have a habit of providing their own proprietary security IDs, or just mapping to tickers. So if you're trying to build a database with several providers, you have to wrangle together 15 different security identifiers, taking care of mergers/acquisitions, delistings, ticker recycling, etc. It is a fun exercise.
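
As a toy illustration of what that wrangling ends up looking like (all identifiers made up): the crosswalk needs validity windows, because a recycled ticker can point at different securities over time.

    import pandas as pd

    # Toy security-master crosswalk.  Each row maps a vendor identifier to an
    # internal ID over a validity window; ticker recycling means the same ticker
    # can belong to two different companies in different periods.
    crosswalk = pd.DataFrame({
        "ticker":      ["ABC", "ABC"],
        "vendor_id":   ["V-1001", "V-2044"],
        "internal_id": ["SEC-17", "SEC-93"],
        "valid_from":  pd.to_datetime(["2001-01-01", "2012-06-01"]),
        "valid_to":    pd.to_datetime(["2010-03-31", "2099-12-31"]),
    })

    def resolve(ticker: str, as_of: str):
        """Resolve a ticker to the internal ID that was valid on `as_of`."""
        d = pd.Timestamp(as_of)
        hit = crosswalk[
            (crosswalk["ticker"] == ticker)
            & (crosswalk["valid_from"] <= d)
            & (d <= crosswalk["valid_to"])
        ]
        return hit["internal_id"].iloc[0] if len(hit) else None

    print(resolve("ABC", "2005-06-30"))  # SEC-17
    print(resolve("ABC", "2020-06-30"))  # SEC-93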

[1] http://www.crsp.com/files/programmers-guide.pdf


Any advice on where an individual could purchase (even limited) access to CRSP data?

I'm working on a data-driven financial analysis blog and can't seem to find decent time-series fundamentals data now that Yahoo and Google have taken down their APIs. Everything I find seems to be a $1000+ yearly subscription.



