26 Comments
Gail:

So fascinating, cannot wait to see the results!

Ethan Blagrove:

Hey bro, any updates on this? I've been keeping an eye on this space and I'm interested in getting into it.

If you need more computing power and there's a way to help remotely, I've got a 3080 Ti.

Maxmustermann:

Yes, I am also interested in this! 😄

Maxmustermann:

Is there any news on this?

Maxmustermann:

Is it still training? ☺️

Lukas Nel:

Will make a post about this soon, but the TL;DR is: I trained it on and off, but it kept crashing with CUDA out-of-memory errors, so this week or next I'll need to bite the bullet and really dive deep into the distributed training code so I can a) run it continuously for a week and b) use larger models. The initial results always look promising, but when I leave it running for a bit, it crashes, which is super frustrating.

Lukas Nel:

Specifically, the way the GRPO trainer class is written makes it very prone to out-of-memory errors, so I might have to extract the training code from it altogether.
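For anyone hitting the same wall, these are the settings that usually drive GRPO memory use. This is a minimal sketch assuming trl's GRPOTrainer/GRPOConfig; field names can differ across trl versions, so treat it as illustrative, not as the actual fix:

```python
# Sketch only -- assumes trl's GRPOConfig; check your trl version's docs.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen_stock_advisor",
    per_device_train_batch_size=1,   # smallest per-step batch
    gradient_accumulation_steps=8,   # recover the effective batch size
    num_generations=8,               # fewer completions per prompt = smaller KV cache
    max_prompt_length=512,           # hard-cap prompt tokens
    max_completion_length=256,       # hard-cap generated tokens
    gradient_checkpointing=True,     # trade compute for activation memory
    bf16=True,
)
```

Fewer completions per prompt and a shorter completion cap shrink the generation-time KV cache, which is usually where the memory spike happens.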

idunno:

I got it working; it's been training for the last 23.43 hours and I'm almost done. I used the V2 dataset. I can provide the code I used to get it working if you'd like...

Lukas Nel:

Yeah, that would be great!

idunno:

Sure thing. It won't let me paste the code here since it's too long, so I made a GitHub repository at https://github.com/IYamHim/Stonk-Market. That is the exact code I used for this run, but admittedly I need to update a couple of variables: "learning_rate=2e-4" should be "learning_rate=5e-4", and "dataloader_num_workers=0" should be "dataloader_num_workers=4". I just uploaded environment.yml and requirements.txt to make things easier for whatever virtual environment you prefer.

P.S. here is my current progress, I'm in the home stretch!:

{'loss': 22.4182, 'grad_norm': 0.0, 'learning_rate': 3.127171646977067e-05, 'epoch': 0.85}

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 3.057678943710911e-05, 'epoch': 0.85}

{'loss': 0.0196, 'grad_norm': 0.0001766916102496907, 'learning_rate': 2.9881862404447535e-05, 'epoch': 0.86}

{'loss': 22.8465, 'grad_norm': 0.0, 'learning_rate': 2.918693537178596e-05, 'epoch': 0.86}

86%|██████████████████████████████████████████████████▋ | 2550/2968 [25:20:24<3:27:22, 29.77s/it

Lukas Nel:

Dude, thanks so much man, I'll take a look at it.

idunno:

Finished, but I don't know how good it is yet. I'll need to do some testing:

100%|█████████████████████████████████████████████████████████████| 2968/2968 [28:47:41<00:00, 34.93s/it]

Training completed successfully

Saving model...

Model saved to: /home/2084Collective_deepstock-sp500-companies-with-info-and-user-prompt/train/qwen_stock_advisor_final1

Final GPU memory:

GPU Memory allocated: 1.53GB

GPU Memory reserved: 4.35GB

GPU 0 Memory:

Total: 11.00GB

Free: 4.35GB

Used: 1.53GB

CUDA available: True

Current device: 0

Testing model...

/home/ai/miniconda3/envs/trading_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None

warnings.warn(

[Edit: I removed the AI response for clarity; I don't want people to think these picks are real (they definitely are not).]

Henry Wertz:

I put this on Reddit but thought it'd be nice to note here too.

I've just been looking at some of the distilled models for sentiment analysis for the brokerage firm I'm working for. One major thing to watch out for: repeatability. With an (admittedly heavily cut-down) R1 distill, the 14B Q4 Llama (7B behaved the same), I found it was not repeatable. I ran the same prompt and press release through it 5 times (a press release on a medication; the stock got a nice price bump when it came out) and told it to rate the likelihood of a price bump out of 10, with 0 being the worst and 10 the best. The 5 runs gave a different score each time: 9.0, 8.0, 8.5, 7.5, and even a 5.0 on one run. Just something to look out for; obviously that much variability makes a signal from that model rather useless.

I tested the R1 14B Q6_L Qwen distill (I thought going from Q4 to Q6 might help, since I've read that reduces perplexity quite a bit) and it also gave inconsistent ratings from run to run.
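That spread is easy to quantify. A minimal sketch using the five scores quoted above (plain Python, no model calls):

```python
# Run the same prompt N times and summarize how consistent the ratings are.
from statistics import mean, pstdev

def repeatability(scores):
    """Mean, population stdev, and range of repeated ratings."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "range": max(scores) - min(scores),
    }

# The five runs reported above:
runs = [9.0, 8.0, 8.5, 7.5, 5.0]
summary = repeatability(runs)
print(summary)  # range is 4.0 points out of 10 -- too noisy to trade on
```

Anything with a range that wide is noise; a usable signal needs the range well inside the gap between your buy and don't-buy thresholds.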

I of course don't know whether this is because the model I ran is so shrunk down, or whether it's just an intentional characteristic of R1; after all, for creative writing or coding, getting different text or a different coding solution (as long as it's correct!) would be a positive sign of creativity, not a negative. And for chat, of course, having the responses vary would be a positive too.

So watch out, whether you get your model nicely trained or not, to make sure its buy/don't-buy calls are at least reasonably consistent from one run to the next.

I found Qwen 2.5 14B Q6_L at least consistently gave an 8.5 (at least over 5 runs). I don't know yet whether it's actually good at this kind of thing, but at least it's consistent.

I then made the press release negative but rather deranged: I put "not"s and "in"s into it so the positive results became negative (the drug was ineffective, side effects were not tolerable, etc.), but left in the bit at the end where they were pursuing stage 2 trials and FDA approval. It rated that a 2.5, and commented that it didn't rate it even lower because the plans for stage 2 trials and FDA approval might keep the stock price steady rather than dropping.

I second Gail's comment... I'm also fascinated to see the results, and I subscribed to your Substack to see how it goes.

Lukas Nel:

That's totally fair, yeah - the RL objective actually generates 20 completions from the same prompt and takes the average reward over all of them, so hopefully that helps with the variance.
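For intuition, a minimal sketch of that group averaging (an illustration of the GRPO idea, not the actual trainer code): each completion's advantage is its reward minus the mean reward of its group, so per-prompt noise mostly cancels out of the baseline.

```python
# GRPO-style group baselining: advantage = reward - group mean.
def group_advantages(rewards):
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# e.g. 4 completions sampled from one prompt (the run above uses 20)
rewards = [1.0, 0.0, 0.5, 0.5]
print(group_advantages(rewards))  # advantages sum to zero by construction
```

Because the baseline is recomputed per prompt, a model that's noisy from run to run still gets a stable training signal as long as the group is large enough.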

Henry Wertz:

That was my plan B, and I don't see why it wouldn't work well: run it several times and average the results. I mean, even if your most positive ends up with a 7.5 or 8, and your negative with a 2 or 2.5, that's still a solid 5 or 6 points of range to pick out the buys and strong buys.

idunno:

Try a larger quant size; I've read that anything under quant 8 on the distilled models has adverse effects on the output. Other models are okay at lower quants (phi4, for example, gives consistent responses at q3KL).

Henry Wertz:

I had wondered about that too (and went from Q4 to Q6, or actually a Q6K_L at least). I could see a model having trouble keeping a "head for numbers" if it's quantized too far; after all, Q4 only allows 16 distinct values per weight, and Q2 only 4. I could see a Q4 model having trouble with "rate from 0.0 to 10.0", where going by 0.5s there are 21 different values. Of course, a model designed for heavy quantization could just represent numerical values in binary, I suppose, but as far as I know these models aren't trying to do that. Running them at Q1.58 or whatever, Q2, or Q4 is analogous to reasoning through these things when your head's a bit fuzzy: you'll get an answer, but it might not be correct, and you might not get the same answer twice.
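The arithmetic behind that, as a quick sketch: an n-bit quant gives 2^n distinct levels per weight (though the model's outputs are tokens, not raw weights, so this is only loose intuition about capacity, not a hard limit):

```python
# Distinct representable levels for an n-bit quantization.
def quant_levels(bits: int) -> int:
    return 2 ** bits

for bits in (2, 4, 6, 8):
    print(f"Q{bits}: {quant_levels(bits)} levels")

# A 0.0-10.0 scale in 0.5 steps needs 21 distinct outputs, which is
# already more than Q4's 16 per-weight levels.
```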

idunno:

TBH, it took me a couple of readings to fully grasp the analogy, but I agree. I've noticed more nonsense thinking loops when using q4, as if their thoughts are foggy; with the q8 version they have a clearer mind and give much better results.

I also got this code running locally with a Qwen2.5 1.5B model on a GTX 1080 Ti (11GB VRAM). Training 2,000 epochs on a sample of 10,000 (not the full ~300,000) examples takes about 8 hours. From the GRPO training documentation on Unsloth, around 1.5k epochs looks like the ceiling for RL training models. I'm doing a small test run of 296 epochs now to see how it goes. Unfortunately, I had to use 4-bit instead of 8-bit due to memory constraints. Here are my outputs so far (wandb was causing issues; I'll re-enable it after this test run):

(trading_env) ai@home:/home/2084Collective_deepstock-sp500-companies-with-info-and-user-prompt/train$ python train_qwen_grpo.py

Initial GPU memory:

GPU Memory allocated: 0.00GB

GPU Memory reserved: 0.00GB

GPU 0 Memory:

Total: 11.00GB

Free: 0.00GB

Used: 0.00GB

CUDA available: True

Current device: 0

Loading model and tokenizer...

Initializing model...

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

Preparing model for training...

Preparing training data...

Loading dataset...

Generating train split: 100%|███████████████████████████████████| 305860/305860 [00:02<00:00, 126530.40 examples/s]

Dataset loaded with 305860 examples

Processing example 0/305860

First example processed successfully!

Input IDs length: 512

Labels length: 512

Processing example 1000/305860

Processing example 2000/305860

Processing example 3000/305860

Processing example 4000/305860

Processing example 5000/305860

Processing example 6000/305860

Processing example 7000/305860

Processing example 8000/305860

Processing example 9000/305860

Reached example limit

Successfully processed 10000 examples

Training dataset size: 9500

Evaluation dataset size: 500

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.

Starting training...

0%| | 0/296 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.0}

{'loss': 39.4286, 'grad_norm': 0.0, 'learning_rate': 0.00019930313588850174, 'epoch': 0.03}

5%|███▊ | 15/296 [07:22<2:18:11, 29.51s/it
