I'm waiting on my GPT-4 API access so I can use gpt-4-32k which maybe can soak up 10k LOC?
Clearly this will break eventually, but I am playing around with some ideas to extend how much context I can give it. One is to do something like base64-encode file contents. I've seen some early success showing that GPT-4 knows how to decode it, so that'll allow me to stuff more characters into it. I'm also hoping that with the use of .gptignore, I can just selectively give it the files I think are relevant for whatever prompt I'm writing.
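A minimal sketch of that idea, assuming a `.gptignore` file of glob patterns (one per line) — both function names here are made up for illustration. Worth noting in passing that base64 output is about 33% larger than the raw bytes, so any savings would have to come from how the tokenizer handles it:

```python
import base64
import fnmatch
from pathlib import Path

def load_ignore_patterns(path=".gptignore"):
    """Read glob patterns (one per line) to exclude from the context."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines()
            if line.strip() and not line.startswith("#")]

def pack_files(root=".", ignore_patterns=()):
    """Base64-encode every non-ignored file so it can be pasted into a prompt.

    Caveat: base64 is ~4/3 the size of the input bytes, so this only helps
    if the model tokenizes it more densely than the original text.
    """
    packed = {}
    for f in Path(root).rglob("*"):
        if not f.is_file():
            continue
        rel = str(f.relative_to(root))
        if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
            continue
        packed[rel] = base64.b64encode(f.read_bytes()).decode("ascii")
    return packed
```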
I wonder if you could teach it to understand a binary encoding using the raw bytestream, feed it compressed text, and just tell it to decompress it first.
Here is what GPT-4 says about it. "As an AI language model, I can understand and work with various text encoding schemes and compression algorithms. However, to work with a raw bytestream, you would need to provide specific details about the encoding and compression used.
To teach me to understand a particular binary encoding and compressed text format, you should provide the following information:
The binary encoding used (e.g., ASCII, UTF-8, UTF-16, etc.).
The compression algorithm employed (e.g., gzip, Lempel-Ziv-Welch (LZW), Huffman coding, etc.).
Once you provide these details, I can help you process the raw bytestream and decompress the text. However, keep in mind that my primary focus is on natural language understanding and generation, and I might not be as efficient at handling compressed data as a dedicated compression/decompression tool."
When GPT gives an answer like that, is it actually a meaningful description of its capabilities? Does it have that kind of self-awareness? Or is it just a plausible answer based on the training corpus?
My guess is that the training data includes things specifically about the GPT itself and its capabilities, so it would be somewhat correct. But it's also known to just make shit up when it feels like it, so you can't 100% trust it, same as with all other prompts/responses.
Unfortunately GPT is not yet aware of what LangChain is or how it works, and the docs are too long to feed the whole thing to GPT.
But you can still ask it to figure something out for you.
For example: “write pseudo-code that can read documents in chunks of 800 tokens at a time, then for each chunk create a prompt for GPT to summarize the chunk, then save the responses per document and finally aggregate+summarize all the responses per document”
Basically a kind of recursive map/reduce process to get, process and aggregate GPT responses about the data.
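The prompt above can be sketched roughly like this. `call_gpt` is a hypothetical stand-in for whatever API call you use, and the chunking here crudely approximates tokens by whitespace-splitting into words (a real tokenizer would count actual tokens):

```python
def chunk_tokens(text, chunk_size=800):
    """Split text into chunks of roughly chunk_size tokens.

    Crude approximation: treats each whitespace-separated word as one token.
    """
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

def summarize_document(text, call_gpt, chunk_size=800):
    """Map step: summarize each chunk independently with its own prompt.
    Reduce step: ask the model to merge the per-chunk summaries."""
    chunk_summaries = [
        call_gpt(f"Summarize the following text:\n\n{chunk}")
        for chunk in chunk_tokens(text, chunk_size)
    ]
    combined = "\n".join(chunk_summaries)
    return call_gpt(f"Combine these partial summaries into one summary:\n\n{combined}")
```

For very long documents you could apply the reduce step recursively until the combined summaries fit in one prompt.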
LangChain provides tooling to do the above and even allows the model to use tools, like search or other actions.
GPT-4 is currently limited to 8k tokens, which is about 6,000 words.
You can use our repo (which we are currently updating to include QuickStart tutorials, coming in the next few days) to do embedding retrieval and queries.
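The retrieval step of that approach can be sketched as follows. This assumes you have already computed embedding vectors for the query and each document (e.g. via an embeddings API); the function names are illustrative, not taken from any particular library:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=3):
    """Return the ids of the top_k documents most similar to the query.

    doc_vecs maps a document id to its precomputed embedding vector.
    The retrieved documents are then pasted into the prompt as context.
    """
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```

The point is that only the few most relevant chunks go into the prompt, so the corpus can be far larger than the context window.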
Our codebase is 1 million lines of code.
Can we feed the documentation to it? What are the limits?
Is it possible to train it on our data without doing prompt engineering? How?
Otherwise, are we supposed to use embeddings? Can someone explain how these all work and the tradeoffs?