So fascinating, cannot wait to see the results!
Hey bro, any updates on this? I've been sort of keeping my eye on this space and I'm interested in getting into it.
If you need more computing power and there's a way to help remotely, I've got a 3080 Ti.
Yes, I am also interested in this! 😄
Is there any news on this?
Is it still training? ☺️
Will make a post soon about this, but the TL;DR is: I trained it on and off, but it kept crashing with CUDA out-of-memory errors, so this week or next I'll need to bite the bullet and really dive into the distributed training code so I can a) run it continuously for a week and b) use larger models. The initial results always look promising, but then I leave it running for a bit and it crashes, which is super frustrating.
Specifically, the way the GRPO trainer class is written makes it very prone to out-of-memory errors, so I might have to extract the training loop from it altogether.
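(For anyone fighting the same OOM errors, these are the knobs I'd poke at first. This is a rough sketch against TRL's GRPOConfig; the values are guesses, not what's in my actual script:)

# Memory-leaning GRPO settings -- illustrative values, not my real config.
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo_oom_test",
    per_device_train_batch_size=4,     # TRL wants the batch divisible by the group size below
    num_generations=4,                 # fewer completions per prompt = fewer logits in memory at once
    gradient_accumulation_steps=4,     # recover effective batch size without extra VRAM
    gradient_checkpointing=True,       # trade compute for memory
    max_prompt_length=256,             # hard caps on sequence length dominate peak memory
    max_completion_length=128,
    bf16=True,                         # or fp16=True, depending on the GPU
)
# trainer = GRPOTrainer(model=model, args=config, reward_funcs=[reward_fn], train_dataset=train_ds)
# trainer.train()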
I got it working, it's been training for the last 23.43 hours and I'm almost done. I used the V2 dataset. I can provide you the code I used to get it to work if you'd like...
Yeah, that would be great!
Sure thing. It won't let me paste the code here since it's too long, so I made a GitHub repository at https://github.com/IYamHim/Stonk-Market. That is the exact code I used for this run, but admittedly I need to update a couple of variables: "learning_rate=2e-4" should be "learning_rate=5e-4" and "dataloader_num_workers=0" should be "dataloader_num_workers=4". I just uploaded the environment.yml and requirements.txt to make things easier for whatever virtual environment you prefer.
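(In other words, the relevant bit of the training arguments ends up looking roughly like this; everything besides those two values is a placeholder, not the repo's exact config:)

# Only the two changed values matter here; the rest is illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen_stock_advisor_final1",   # placeholder output dir
    learning_rate=5e-4,                        # was 2e-4 in the uploaded script
    dataloader_num_workers=4,                  # was 0; parallel workers keep the GPU fed
    # ...the rest of the arguments as in train_qwen_grpo.py...
)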
P.S. here is my current progress, I'm in the home stretch!:
{'loss': 22.4182, 'grad_norm': 0.0, 'learning_rate': 3.127171646977067e-05, 'epoch': 0.85}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 3.057678943710911e-05, 'epoch': 0.85}
{'loss': 0.0196, 'grad_norm': 0.0001766916102496907, 'learning_rate': 2.9881862404447535e-05, 'epoch': 0.86}
{'loss': 22.8465, 'grad_norm': 0.0, 'learning_rate': 2.918693537178596e-05, 'epoch': 0.86}
86%|██████████████████████████████████████████████████▋ | 2550/2968 [25:20:24<3:27:22, 29.77s/it
Dude, thanks so much man, I'll take a look at it.
Finished, but don't know how good it is. I'll need to do some testing:
100%|█████████████████████████████████████████████████████████████| 2968/2968 [28:47:41<00:00, 34.93s/it]
Training completed successfully
Saving model...
Model saved to: /home/2084Collective_deepstock-sp500-companies-with-info-and-user-prompt/train/qwen_stock_advisor_final1
Final GPU memory:
GPU Memory allocated: 1.53GB
GPU Memory reserved: 4.35GB
GPU 0 Memory:
Total: 11.00GB
Free: 4.35GB
Used: 1.53GB
CUDA available: True
Current device: 0
Testing model...
/home/ai/miniconda3/envs/trading_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
<edited out the AI response> I don't want people to think these outputs are real (they definitely are not), so I just removed them for clarity.
I put this on Reddit but thought it'd be nice to note here too...
I've just been looking at some of the distilled models for sentiment analysis for the brokerage firm I'm working for. One major thing to watch out for: repeatability. With an (admittedly heavily cut-down) R1 distill, 14B Q4 Llama (the 7B behaved the same), I was finding it not to be repeatable. I ran the same prompt and press release through it 5 times (a press release on a medication; the stock got a nice price bump when it came out). I told it to rate the likelihood of a price bump out of 10, with 0 being worst and 10 being best, and I got 5 different scores: 9.0, 8.0, 8.5, 7.5, and even a 5.0 on one run. Just something to look out for; obviously that much variability makes a signal from that model rather useless.
I tested the R1 Qwen 14B Q6_L distill (I thought going from Q4 to Q6 might help, since I've read that reduces perplexity quite a bit) and it also gave inconsistent ratings from run to run.
I of course don't know if this is because the model I ran is so shrunk down, or if it's just an intentional characteristic of R1; after all, for creative writing or coding, getting different text or a different coding solution (as long as it's correct!) would be a positive sign of creativity, not a negative. And for chat, having the responses vary would of course be a positive too.
So watch out, and make sure (whether you get your model nicely trained or not) that its buy/don't-buy calls are at least reasonably consistent from one run to the next.
I found Qwen 2.5 14B Q6_L at least consistently gave an 8.5 (over 5 runs, at least). I don't know yet if it's actually good at this kind of thing, but at least it's consistent.
I then made the press release negative but rather deranged (I put "not"s and "in"s into it so the positive results became negative: it was ineffective, side effects were not tolerable, etc.), but left in the bit at the end where they were pursuing stage 2 trials and FDA approval. It rated that a 2.5, and commented that it didn't rate it even lower because the plans for stage 2 trials and FDA approval might keep the stock price steady rather than dropping.
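(If anyone wants to run the same sanity check, here's roughly the loop I mean, pointed at a local OpenAI-compatible endpoint; the URL, model name, and score parsing are placeholders, not my exact setup:)

# Rough repeatability check -- endpoint, model name, and regex are placeholders.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # e.g. a local server

PROMPT = (
    "Rate the likelihood of a short-term price bump from 0.0 (worst) to 10.0 (best). "
    "Reply with just the number.\n\nPress release:\n{press_release}"
)

def score_once(press_release: str) -> float:
    resp = client.chat.completions.create(
        model="qwen2.5:14b",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(press_release=press_release)}],
        temperature=0.0,
    )
    # Pull the first number out of the reply; real parsing would need to be more defensive.
    return float(re.search(r"\d+(?:\.\d+)?", resp.choices[0].message.content).group())

scores = [score_once(open("press_release.txt").read()) for _ in range(5)]
print(scores, "spread:", max(scores) - min(scores))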
I second Gail's comment... I'm also fascinated to see the results, and I subscribed to your substack to see how it goes.
That's totally fair, yeah - the RL objective actually generates 20 completions from the same prompt and takes the average reward over all of them, so hopefully that helps with the variance.
That was my plan B, and I don't see why it wouldn't work well: run it several times and average the scores. Even if your most positive cases end up at 7.5 or 8 and the negative ones at 2 or 2.5, that's still a solid 5 or 6 points of range to pick out the buys and strong buys.
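(Conceptually it's something like the sketch below: rewards get averaged within each group of completions and every completion is scored relative to its group mean. This is just my shorthand for the idea, not the trainer's actual code.)

# Toy sketch of group-relative rewards (the idea behind GRPO), not the trainer's implementation.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, num_generations) raw rewards, e.g. num_generations=20 in the run above."""
    mean = rewards.mean(dim=1, keepdim=True)             # per-prompt average over the group
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-4)
    return (rewards - mean) / std                         # each completion scored relative to its group

rewards = torch.tensor([[7.5, 8.0, 9.0, 5.0, 8.5]])       # noisy per-completion rewards for one prompt
print(group_relative_advantages(rewards))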
Try a larger quant size. I've read that anything under Q8 on the distilled models has adverse effects on the output. Other models are okay at lower quants (Phi-4, for example, gives consistent responses at Q3_K_L).
I had wondered about that too (and went from Q4 to Q6, or actually a Q6_K_L at least). I could see a model having trouble keeping a "head for numbers" if it's quantized too far. After all, Q4 only allows 16 distinct values per weight and Q2 only 4. I could see a Q4 model having trouble with "rate from 0.0 to 10.0", where going by 0.5s there are 21 different values. Of course, a model designed with heavy quantization in mind could just represent numerical values in binary, I suppose, but these models aren't trying to do that as far as I know. I imagine running these at Q1.58 or whatever, Q2, or Q4 is analogous to reasoning through these things when your head's a bit fuzzy: you'll get an answer, but it might not be correct, and you might not get the same answer twice.
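(The back-of-the-envelope version of that argument, treating each bit-width naively as uniform levels, which real K-quants with their block scales aren't exactly:)

# Naive levels-per-weight arithmetic; only the intuition, not how K-quants actually work.
rating_steps = int(10.0 / 0.5) + 1     # 0.0, 0.5, ..., 10.0 -> 21 distinct ratings
for bits in (2, 4, 6, 8):
    print(f"Q{bits}: {2**bits} levels per weight vs {rating_steps} rating steps")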
TBH it took me a couple of reads of your comment to fully grasp the analogy, but I agree. I have noticed more nonsense thinking loops when using Q4, as if their thoughts are foggy. With the Q8 version they have a clearer mind and give much better results.
I also got this code running locally with a Qwen2.5 1.5B model on a GTX 1080 Ti (11GB VRAM). Getting through 2000 epochs on a sample of 10,000 (not the full ~300,000) training examples takes about 8 hours. From the GRPO training documentation on Unsloth, it looks like around 1.5k epochs is about the ceiling for RL-training models. I'm doing a small test run of 296 epochs now to see how it goes. Unfortunately, I had to use 4-bit instead of 8-bit due to memory constraints. Here are my outputs so far (wandb was causing issues; I'll re-enable it after this test run):
(trading_env) ai@home:/home/2084Collective_deepstock-sp500-companies-with-info-and-user-prompt/train$ python train_qwen_grpo.py
Initial GPU memory:
GPU Memory allocated: 0.00GB
GPU Memory reserved: 0.00GB
GPU 0 Memory:
Total: 11.00GB
Free: 0.00GB
Used: 0.00GB
CUDA available: True
Current device: 0
Loading model and tokenizer...
Initializing model...
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Preparing model for training...
Preparing training data...
Loading dataset...
Generating train split: 100%|███████████████████████████████████| 305860/305860 [00:02<00:00, 126530.40 examples/s]
Dataset loaded with 305860 examples
Processing example 0/305860
First example processed successfully!
Input IDs length: 512
Labels length: 512
Processing example 1000/305860
Processing example 2000/305860
Processing example 3000/305860
Processing example 4000/305860
Processing example 5000/305860
Processing example 6000/305860
Processing example 7000/305860
Processing example 8000/305860
Processing example 9000/305860
Reached example limit
Successfully processed 10000 examples
Training dataset size: 9500
Evaluation dataset size: 500
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
Starting training...
0%| | 0/296 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.0}
{'loss': 39.4286, 'grad_norm': 0.0, 'learning_rate': 0.00019930313588850174, 'epoch': 0.03}
5%|███▊ | 15/296 [07:22<2:18:11, 29.51s/it
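(For anyone on a similar 11GB card, the 4-bit setup I mean is roughly the following. Treat it as a sketch: the model and dataset ids are guesses based on the local folder name, and the exact arguments in train_qwen_grpo.py may differ:)

# Rough 4-bit + subsample setup for an 11GB card; values are illustrative, not the repo's exact config.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 8-bit would be nicer, but it didn't fit in 11GB
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "Qwen/Qwen2.5-1.5B-Instruct"      # placeholder; whichever 1.5B variant you're using
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Sample 10,000 of the ~305,860 examples so one run fits in a reasonable time on the 1080 Ti.
# Dataset id guessed from the local folder name; point this at whichever copy you're using.
dataset = load_dataset("2084Collective/deepstock-sp500-companies-with-info-and-user-prompt", split="train")
dataset = dataset.shuffle(seed=42).select(range(10_000))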