DeepStock V2 - Predicting the market better than 50%
Training AI to Predict the Market with GRPO
Thanks to the legendary ./install_AI, we've completely rewritten our training code for DeepStock and introduced a number of improvements. The key change allows the various outputs required for Group Relative Policy Optimization (GRPO) to be computed sequentially rather than simultaneously. This architectural change enables training of larger models and makes smaller models trainable on more modest hardware; in fact, ./install_AI reported impressive results even on a GTX 1080 Ti.
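To make the memory trade-off concrete, here is a minimal sketch of how we read that change: generating the GRPO candidate group one completion at a time instead of in a single batched call. This assumes a Hugging Face-style generate API; all names here (sample_group_sequentially, num_generations, and so on) are illustrative, not taken from the actual DeepStock code.

import torch

def sample_group_sequentially(model, tokenizer, prompt,
                              num_generations=8, max_new_tokens=512):
    """Generate the GRPO candidate group one completion at a time.

    Sequential generation keeps only one completion's activations in
    memory at once, trading throughput for a much smaller peak memory
    footprint than a single batched generate() over the whole group.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    completions = []
    for _ in range(num_generations):
        with torch.no_grad():
            out = model.generate(
                prompt_ids,
                max_new_tokens=max_new_tokens,
                do_sample=True,   # GRPO needs diverse samples per prompt
                temperature=1.0,
            )
        completions.append(tokenizer.decode(out[0, prompt_ids.shape[1]:],
                                            skip_special_tokens=True))
    return completions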
Specifically, the biggest change has been to the reward system, which now scores predictions across five dimensions (a sketch of how these weights might combine follows the list):
Format Adherence (20%): Rewards proper structuring with explicit reasoning in the <thinking> section and clear answers in the <answer> section.
Direction Prediction (30%): The largest component, simply evaluating whether the model correctly predicted if a stock would rise or fall.
Magnitude Accuracy (20%): Evaluates how close the predicted percentage change is to the actual change, with tiered scoring based on error margins.
Confidence Calibration (15%): Analyzes how well the model's expressed confidence aligns with its actual accuracy, penalizing overconfidence in incorrect predictions.
Data Utilization (15%): Rewards the model for grounding predictions in specific data points provided, discouraging unfounded claims.
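As a rough illustration of how these components could combine into a single scalar reward, here is a hedged Python sketch. Only the weights (20/30/20/15/15) come from the description above; the per-dimension scorers are placeholders, and the actual reward code may differ.

# Illustrative only: component scores are assumed to lie in [0, 1].
WEIGHTS = {
    "format": 0.20,      # <thinking> / <answer> structure present
    "direction": 0.30,   # predicted up/down matches the actual move
    "magnitude": 0.20,   # tiered score on |predicted_pct - actual_pct|
    "confidence": 0.15,  # expressed confidence vs. actual correctness
    "data_use": 0.15,    # prediction cites the provided data points
}

def combined_reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example: well-formatted, directionally correct, but miscalibrated.
print(combined_reward({"format": 1.0, "direction": 1.0, "magnitude": 0.5,
                       "confidence": 0.2, "data_use": 0.8}))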
Promising Results
The training shows a clear pattern of improvement. As the attached graph shows, we're seeing:
Overall accuracy of 56.34% across all predictions
Recent accuracy of 66.67% over the last 30 predictions
A distinctive "aha moment" around step 100 where the model seems to develop a deeper understanding
The graph shows both cumulative accuracy (blue line) and rolling accuracy with a 30-prediction window (red line), demonstrating consistent performance above the random guess baseline of 50%.
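For reference, both curves can be computed from the per-step logs roughly like this (a minimal sketch; the `correct` flags would be derived by comparing predicted_direction against the sign of actual_change):

def accuracy_curves(correct: list[bool], window: int = 30):
    """Cumulative accuracy and rolling accuracy over the last `window` predictions."""
    cumulative, rolling = [], []
    for i in range(1, len(correct) + 1):
        cumulative.append(sum(correct[:i]) / i)          # blue line
        tail = correct[max(0, i - window):i]
        rolling.append(sum(tail) / len(tail))            # red line
    return cumulative, rolling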
The intellectual implications of this work extend far beyond stock prediction. What we're witnessing may be a fundamental approach to cognitive development in AI systems that parallels human expertise acquisition.
Reasoning as an Emergent Property
The DeepSeek paper demonstrated that training on mathematical and programming tasks led to strong reasoning capabilities that generalized beyond the training domain. Our findings with DeepStock suggest something equally profound: the reasoning patterns required for business evaluation might similarly transfer to broader analytical thinking.
The "aha moment" we observed around step 100 is particularly fascinating. Like a student who suddenly grasps a difficult concept, the model appears to undergo a cognitive shift where it reconceptualizes its approach to the problem. This understanding isn't explicitly programmed, but rather developed through repeated exposure to feedback that rewards rational analysis.
Training Evolution: Before and After Examples
Compare these two analysis samples from our training process:
Early Training (Step 1):
{"step": 1, "ticker": "MMM", "current_date": "2023-11-28", "actual_change": -0.040597149957399825, "predicted_direction": "up", "predicted_pct": 2.5, "reward": 1.899999976158142, "loss": -0.10356301069259644, "thinking": "The recent news indicates that 3M (MMM) has been recognized as one of the most undervalued industrial stocks and blue-chip stocks to buy, which suggests positive market sentiment. Additionally, the stock has already seen a significant increase since the last earnings report, indicating that investors might be cautious about further gains. The current price of $79.61 is quite high compared to its previous close of $75.95, which implies that the stock may be due for a correction. However, considering the recent positive recognition and the potential for continued market rally, there is still room for growth."}
Later Training (Step 17):
{"step": 17, "ticker": "MMM", "current_date": "2021-10-12", "actual_change": 3.6517989085434293, "predicted_direction": "up", "predicted_pct": 1.5, "reward": 2.1600000858306885, "loss": -0.1374872475862503, "thinking": "The stock has experienced a decline of 1.32% today, which suggests some selling pressure. However, the volume is still relatively high at 579 million shares, indicating that the selling might be driven by profit-taking rather than a significant shift in sentiment. Recent news indicates that US stocks are showing mixed trends, with some positive developments for cyclicals but also concerns around inflation and tapering fears. 3M, being a diversified company, is less likely to be heavily affected by these broad market sentiments compared to more specialized firms. The P/E ratio of 12.44 suggests the stock is fairly valued, and the low debt-to-equity ratio of 0.08 indicates strong financial health. Additionally, there are positive sentiments towards 3M in the recent news, such as \"Why 3M Stock Is a Buy,\" which could help counteract any negative sentiment."}
Note the evolution in analysis quality. The later example shows more nuanced reasoning, considering multiple factors (volume analysis, broader market context, specific financial ratios) rather than simply relying on news sentiment. The model has developed a more sophisticated analytical framework.
What's particularly intriguing is how domain expertise might be developed through tangentially related reward functions. Rather than training directly on the target skill, we can build expertise through related tasks that force the development of the underlying cognitive frameworks.
A model could develop sophisticated financial reasoning not by being trained to "think like a banker" but by repeatedly building and refining cash flow models, valuation frameworks, and sensitivity analyses. The practice of connecting financial variables in structured ways builds the cognitive framework needed for higher-level financial reasoning.
Similarly, rather than directly training a model on legal opinions, training it to find appropriate case citations for specific legal questions could build the underlying skill of legal reasoning. The repeated practice of matching fact patterns to precedent could develop the pattern recognition crucial to legal analysis.
One could multiply such examples endlessly: any task with an objective evaluation can be used to train transferable reasoning patterns.
What we're witnessing with DeepStock and reasoning models in general may represent a fundamental shift in how we approach AI development. While much of current AI progress focuses on scaling existing architectures or fine-tuning on diverse datasets, our work suggests an alternative path: deliberate practice in specialized domains as a route to developing robust reasoning capabilities.
The parallels to human expertise development are striking. Just as master craftspeople develop their skills through thousands of hours of deliberate practice on specific tasks, AI systems might develop deeper cognitive capabilities through intensive, focused training with carefully designed reward functions that encourage systematic thinking.
DeepStock's journey from simplistic market-sentiment analysis to nuanced multi-factor evaluation mirrors the journey from novice to expert in human learning. The difference is that we can precisely engineer the feedback mechanisms that drive this development, potentially accelerating the acquisition of expertise.
As we continue our work, we're increasingly convinced that the future of AI specialization may not lie in building separate models for different domains, but rather in designing training regimes that develop transferable reasoning patterns through domain-specific practice. The investment banker AI of the future might be trained, just like current investment banking analysts, on building discounted cash flow models.
We're still working on improving DeepStock by refining the reward function even further, and by adding extra SFT steps before the GRPO stage to max out how well we can do; early results are promising. A sketch of what such a two-stage pipeline might look like is below.
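The following is a hedged sketch of an SFT-then-GRPO pipeline using Hugging Face's TRL library; we don't know the authors' actual stack, and the model name, file names, dataset fields, and reward stub are all assumptions for illustration.

# Hypothetical two-stage pipeline: SFT first, then GRPO on top.
from trl import SFTTrainer, SFTConfig, GRPOTrainer, GRPOConfig
from datasets import load_dataset

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder base model

# Stage 1: supervised fine-tuning on formatted <thinking>/<answer>
# examples (assumed JSONL with a "text" field).
sft_data = load_dataset("json", data_files="sft_examples.jsonl")["train"]
sft_trainer = SFTTrainer(model=MODEL, train_dataset=sft_data,
                         args=SFTConfig(output_dir="sft-out"))
sft_trainer.train()
sft_trainer.save_model("sft-ckpt")  # reuse this checkpoint for GRPO

def reward_fn(completions, **kwargs):
    # Placeholder for the five-dimension reward described above.
    return [0.0 for _ in completions]

# Stage 2: GRPO on the SFT checkpoint (dataset assumed to have a
# "prompt" column, as TRL's GRPOTrainer expects).
rl_data = load_dataset("json", data_files="prompts.jsonl")["train"]
GRPOTrainer(model="sft-ckpt", reward_funcs=reward_fn,
            train_dataset=rl_data,
            args=GRPOConfig(output_dir="grpo-out")).train()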
If you're considering using more/better data, I would recommend this source:
https://www.financialdatasets.ai/
Did you use separate train, validation, and test datasets, or did you use the test dataset for both validation and testing?