I'm waiting on my GPT-4 API access so I can use gpt-4-32k which maybe can soak up 10k LOC?
Clearly this will break eventually, but I am playing around with some ideas to extend how much context I can give it. One is to do something like base64-encode file contents. I've seen some early success showing that GPT-4 knows how to decode it, so that'll allow me to stuff more characters into it. I'm also hoping that with the use of .gptignore, I can just selectively give it the files I think are relevant for whatever prompt I'm writing.
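A minimal sketch of that idea, assuming a `.gptignore` file of glob patterns (one per line) — both function names here are made up for illustration. Worth noting in passing that base64 output is about 33% larger than the raw bytes, so any savings would have to come from how the tokenizer handles it:

```python
import base64
import fnmatch
from pathlib import Path

def load_ignore_patterns(path=".gptignore"):
    """Read glob patterns (one per line) to exclude from the context."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines()
            if line.strip() and not line.startswith("#")]

def pack_files(root=".", ignore_patterns=()):
    """Base64-encode every non-ignored file so it can be pasted into a prompt.

    Caveat: base64 is ~4/3 the size of the input bytes, so this only helps
    if the model tokenizes it more densely than the original text.
    """
    packed = {}
    for f in Path(root).rglob("*"):
        if not f.is_file():
            continue
        rel = str(f.relative_to(root))
        if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
            continue
        packed[rel] = base64.b64encode(f.read_bytes()).decode("ascii")
    return packed
```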
I wonder if you could teach it to understand a binary encoding using the raw bytestream, feed it compressed text, and just tell it to decompress it first.
Here is what GPT-4 says about it. "As an AI language model, I can understand and work with various text encoding schemes and compression algorithms. However, to work with a raw bytestream, you would need to provide specific details about the encoding and compression used.
To teach me to understand a particular binary encoding and compressed text format, you should provide the following information:
The binary encoding used (e.g., ASCII, UTF-8, UTF-16, etc.).
The compression algorithm employed (e.g., gzip, Lempel-Ziv-Welch (LZW), Huffman coding, etc.).
Once you provide these details, I can help you process the raw bytestream and decompress the text. However, keep in mind that my primary focus is on natural language understanding and generation, and I might not be as efficient at handling compressed data as a dedicated compression/decompression tool."
When GPT gives an answer like that, is it actually a meaningful description of its capabilities? Does it have that kind of self-awareness? Or is it just a plausible answer based on the training corpus?
My guess is that the training data includes things specifically about the GPT itself and its capabilities, so it would be somewhat correct. But it's also known to just make shit up when it feels like it, so you can't 100% trust it, same as with all other prompts/responses.
Unfortunately GPT is not yet aware of what LangChain is or how it works, and the docs are too long to feed the whole thing to GPT.
But you can still ask it to figure something out for you.
For example: “write pseudo-code that can read documents in chunks of 800 tokens at a time, then for each chunk create a prompt for GPT to summarize the chunk, then save the responses per document and finally aggregate+summarize all the responses per document”
Basically a kind of recursive map/reduce process to get, process and aggregate GPT responses about the data.
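The prompt above can be sketched roughly like this. `call_gpt` is a hypothetical stand-in for whatever API call you use, and the chunking here crudely approximates tokens by whitespace-splitting into words (a real tokenizer would count actual tokens):

```python
def chunk_tokens(text, chunk_size=800):
    """Split text into chunks of roughly chunk_size tokens.

    Crude approximation: treats each whitespace-separated word as one token.
    """
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

def summarize_document(text, call_gpt, chunk_size=800):
    """Map step: summarize each chunk independently with its own prompt.
    Reduce step: ask the model to merge the per-chunk summaries."""
    chunk_summaries = [
        call_gpt(f"Summarize the following text:\n\n{chunk}")
        for chunk in chunk_tokens(text, chunk_size)
    ]
    combined = "\n".join(chunk_summaries)
    return call_gpt(f"Combine these partial summaries into one summary:\n\n{combined}")
```

For very long documents you could apply the reduce step recursively until the combined summaries fit in one prompt.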
LangChain provides tooling to do the above and even allows the model to use tools, like search or other actions.
GPT-4 is currently limited to 8k tokens, which is about 6,000 words.
You can use our repo (which we are currently updating to include QuickStart tutorials, coming in the next few days) to do embedding retrieval and queries.
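The retrieval step of that approach can be sketched as follows. This assumes you have already computed embedding vectors for the query and each document (e.g. via an embeddings API); the function names are illustrative, not taken from any particular library:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=3):
    """Return the ids of the top_k documents most similar to the query.

    doc_vecs maps a document id to its precomputed embedding vector.
    The retrieved documents are then pasted into the prompt as context.
    """
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```

The point is that only the few most relevant chunks go into the prompt, so the corpus can be far larger than the context window.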
Our codebase is 1 million lines of code.
Can we feed the documentation to it? What are the limits?
Is it possible to train it on our data without doing prompt engineering? How?
Otherwise, are we supposed to use embeddings? Can someone explain how these all work and the tradeoffs?