7 Comments

So fascinating, cannot wait to see the results!

Any update on this? How did the model do after training for almost 2 weeks?

I put this on Reddit but thought it'd be nice to note here too...

I've just been looking at some of the distilled models for sentiment analysis for the brokerage firm I'm working for. One major thing to watch out for: repeatability. With an (admittedly heavily cut-down) R1 distill, the 14B Q4 Llama (the 7B behaved the same), I found it wasn't repeatable. I ran the same prompt and press release through it 5 times (a press release on a medication; the stock got a nice price bump when it came out). I told it to rate the likelihood of a price bump out of 10, with 0 being worst and 10 best, and got a different score on each of the 5 runs: 9.0, 8.0, 8.5, 7.5, and even a 5.0 on one run. Just something to look out for; obviously that much variability makes a signal from that model rather useless.
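To put a number on that spread, here's a minimal sketch of the repeatability check described above, using the five scores reported in the comment (how the model is actually queried is left out; the list stands in for five runs of the same prompt):

```python
import statistics

# The five ratings reported for identical prompt + press release runs.
scores = [9.0, 8.0, 8.5, 7.5, 5.0]

spread = max(scores) - min(scores)      # worst-case disagreement between runs
stdev = statistics.pstdev(scores)       # population std dev across the runs

print(f"range={spread:.1f}, stdev={stdev:.2f}")
```

A 4-point range on a 0-10 scale means two runs of the same input can land in entirely different "buy" bands, which is what makes the signal unusable.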

I tested R1 14B Q6_L Qwen (I thought going from Q4 to Q6 might help, since I've read that reduces perplexity quite a bit) and it also gave inconsistent ratings from run to run.

I of course don't know if this is because the model I ran is so shrunk down, or if it's just an intentional characteristic of R1; after all, for creative writing or coding, getting different text or a different coding solution (as long as it's correct!) would be a positive sign of creativity, not a negative. And for chat, of course, having the responses vary would be a positive too.

So look out: make sure (whether you get your model nicely trained or not) that its buy/don't-buy calls are at least reasonably consistent from one run to the next.

I found Qwen 2.5 14B Q6_L at least consistently gave an 8.5 (over 5 runs, anyway). I don't know yet if it's actually good at this kind of thing, but at least it's consistent.

I then made the press release negative but rather deranged: I put "not"s and "in"s into it so the positive results became negative (the drug was ineffective, side effects were not tolerable, etc.), but left in the bit at the end where they were pursuing stage 2 trials and FDA approval. It rated that a 2.5, and commented that it didn't rate it even lower because the plans for stage 2 trials and FDA approval might keep the stock price steady rather than dropping.

I second Gail's comment... I'm also fascinated to see the results, and I subscribed to your Substack to see how it goes.

That's totally fair, yeah - the RL objective actually generates 20 completions from the same prompt and takes the average reward over all of them, so hopefully that helps with the variance.
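The variance-reduction effect of averaging over many completions can be sketched quickly. This is a toy simulation, not the training code: `sample_rating` is a hypothetical stand-in for one noisy model run, with noise roughly matching the spread reported in the comment above.

```python
import random
import statistics

def sample_rating():
    # Placeholder for one model run: a noisy rating around a "true" 8.0.
    return random.gauss(8.0, 1.4)

def averaged_rating(n=20):
    # Average the rating over n completions of the same prompt,
    # mirroring the 20-completion averaging in the RL objective.
    return sum(sample_rating() for _ in range(n)) / n

random.seed(0)
single = [sample_rating() for _ in range(5)]     # five one-shot ratings
averaged = [averaged_rating() for _ in range(5)]  # five 20-run averages

print("one-shot spread:", statistics.pstdev(single))
print("averaged spread:", statistics.pstdev(averaged))
```

Averaging n independent samples shrinks the standard deviation by roughly a factor of sqrt(n), so 20 completions should cut the run-to-run noise by about 4.5x.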

That was my plan B, and I don't see why it wouldn't work well: do several runs and average them. Even if your most positive releases end up at a 7.5 or 8, and the negative ones at a 2 or 2.5, that's still a solid 5 or 6 points of range to pick out the buys and strong buys.

Try a larger quant size; I've read that anything under quant 8 on the distilled models has adverse effects on the output. Other models are okay at lower quants (Phi-4, for example, gives consistent responses at Q3_K_L).

I had wondered about that too (and went from Q4 to Q6, or actually a Q6K_L at least). I could see a model having trouble keeping a "head for numbers" if it's quantized too far. After all, Q4 only allows 16 distinct values per weight, and Q2 only 4. I could see a Q4 model having trouble with "rate from 0.0 to 10.0", where going by 0.5s gives 21 different values. Of course a model designed for heavy quantization could just represent numerical values in binary, I suppose, but these models aren't trying to do that as far as I know. For models running at Q1.58 or whatever, Q2, or Q4, I imagine it's analogous to reasoning through these things when your head's a bit fuzzy: you'll get an answer, but it might not be correct, and you might not get the same answer twice.
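The back-of-envelope arithmetic above can be written out directly: levels per weight at each bit width versus the number of half-point scores on a 0.0-10.0 scale. (This is only an intuition pump; a rating actually emerges from many quantized weights acting together, not any single one.)

```python
# Distinct representable levels for a b-bit quantized weight: 2 ** b.
levels = {bits: 2 ** bits for bits in (2, 4, 6, 8)}

# Scores from 0.0 to 10.0 in steps of 0.5: 0.0, 0.5, ..., 10.0.
half_point_scores = len([x / 2 for x in range(0, 21)])

print(levels)             # levels per weight at Q2, Q4, Q6, Q8
print(half_point_scores)  # number of half-point scores on the scale
```

So a single Q4 weight has fewer distinct levels (16) than there are half-point scores on the scale (21), while Q6 and Q8 comfortably exceed it.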
