
The investment universe consists of stocks that can be matched to news articles: the dataset is filtered so that each article is linked to a single security. The sample spans 1989 through April 2020. Articles tagged with more than one stock are removed, and the other filters described in Table 1 are applied, leaving 6.3M articles for training. Text is cleaned and normalized using standard procedures such as lowercasing, stemming, and lemmatizing. A bag-of-words model then encodes each article's word counts as a vector.
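The bag-of-words encoding described above can be illustrated with a minimal sketch. It covers only lowercasing and tokenization; stemming and lemmatizing are omitted to keep the example dependency-free. The function names `tokenize` and `bag_of_words` and the two sample headlines are hypothetical, chosen for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(articles):
    """Return (vocabulary, count_vectors) for a list of article strings."""
    token_lists = [tokenize(article) for article in articles]
    # Sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for toks in token_lists:
        vec = [0] * len(vocabulary)
        for word, count in Counter(toks).items():
            vec[index[word]] = count
        vectors.append(vec)
    return vocabulary, vectors


# Made-up headlines, not taken from the news dataset.
articles = [
    "Profit beats forecast; profit outlook raised.",
    "Forecast cut after loss.",
]
vocab, vectors = bag_of_words(articles)
print(vocab)
print(vectors)
```

Each row of `vectors` is one article and each column is one vocabulary word, so the word order within an article is discarded and only the counts remain.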
ASSET CLASS: stocks | REGION: United States | FREQUENCY: Daily | MARKET: equities | KEYWORD: Machine Learning, Stock Returns
I. STRATEGY IN A NUTSHELL
This strategy constructs a daily long-short portfolio using news-based sentiment signals from 6.3 million articles spanning 1989–2020. Text data are cleaned, normalized, and encoded using a bag-of-words model. A supervised screening step identifies sentiment-charged words by their correlation with positive returns, and a two-topic model then estimates each word's probability under positive and negative sentiment. Out-of-sample article sentiment is estimated via penalized likelihood with a Beta prior. Each day, the strategy goes long the 50 stocks with the most positive sentiment scores and shorts the 50 stocks with the most negative scores, forming a zero-net-investment portfolio that trades at the market open with a 30-minute delay.
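The daily portfolio formation step can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: equal weighting within each leg is assumed, the function name `long_short_weights` and the tickers are hypothetical, and the leg size is capped so that the long and short legs never overlap on days with few scored stocks.

```python
def long_short_weights(scores, n_leg=50):
    """Map {ticker: sentiment score} to equal weights per leg.

    The n_leg highest scores get +1/n each and the n_leg lowest get -1/n
    each, so the weights sum to zero (zero net investment).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: shrink the legs when fewer than 2 * n_leg stocks are scored.
    n = min(n_leg, len(ranked) // 2)
    if n == 0:
        return {}
    weights = {ticker: 1.0 / n for ticker in ranked[:n]}
    weights.update({ticker: -1.0 / n for ticker in ranked[-n:]})
    return weights


# Hypothetical scores for six made-up tickers, with two stocks per leg.
scores = {"AAA": 0.91, "BBB": 0.75, "CCC": 0.55, "DDD": 0.48, "EEE": 0.30, "FFF": 0.12}
weights = long_short_weights(scores, n_leg=2)
print(weights)
```

With `n_leg=50` and at least 100 scored stocks, each long position carries a weight of 2% and each short position a weight of -2%.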
II. ECONOMIC RATIONALE
The sentiment model demonstrates predictive power, particularly for smaller stocks, which incorporate sentiment information more slowly than large firms. Risk attribution shows minimal correlation with standard Fama-French factors, indicating genuine alpha generation. By leveraging news-derived sentiment, the strategy captures mispricings and market inefficiencies overlooked by conventional factors, making it a robust approach to extracting returns from textual data.
III. SOURCE PAPER
Predicting Returns with Text Data
Zheng Tracy Ke, Department of Statistics, Harvard University; Bryan Kelly, Yale University, AQR Capital Management, and NBER; Dacheng Xiu, Booth School of Business, University of Chicago
<Abstract>
We introduce a new text-mining methodology that extracts information from news articles to predict asset returns. Unlike more common sentiment scores used for stock return prediction (e.g., those sold by commercial vendors or built with dictionary-based methods), our supervised learning framework constructs a score that is specifically adapted to the problem of return prediction. Our method proceeds in three steps: 1) isolating a list of terms via predictive screening, 2) assigning prediction weights to these words via topic modeling, and 3) aggregating terms into an article-level predictive score via penalized likelihood. We derive theoretical guarantees on the accuracy of estimates from our model with minimal assumptions. In our empirical analysis, we study one of the most actively monitored streams of news articles in the financial system (the Dow Jones Newswires) and show that our supervised text model excels at extracting return-predictive signals in this context. Information in newswires is assimilated into prices with an inefficient delay that is broadly consistent with limits-to-arbitrage (i.e., more severe for smaller and more volatile firms) yet can be exploited in a real-time trading strategy with reasonable turnover and net of transaction costs.
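Steps 1 and 3 of the abstract can be illustrated with a rough sketch. Several things here are assumptions rather than the paper's exact specification: the screening statistic (the share of articles containing a word that coincide with a positive return, with a single symmetric threshold `alpha` and a minimum article count), the grid search used to maximize the penalized likelihood, and all function names, parameter values, and toy data. Step 2 (topic modeling) is not implemented; the word probabilities `o_pos` and `o_neg` are made-up inputs standing in for its output.

```python
import math
from collections import Counter


def screen_words(articles, alpha=0.1, min_count=2):
    """Step 1 sketch: keep words whose positive-return share is far from 1/2.

    articles is a list of (tokens, realized_return) pairs.
    """
    n_with_word, n_with_word_pos = Counter(), Counter()
    for tokens, ret in articles:
        for word in set(tokens):
            n_with_word[word] += 1
            if ret > 0:
                n_with_word_pos[word] += 1
    kept = {}
    for word, n in n_with_word.items():
        share = n_with_word_pos[word] / n
        if n >= min_count and abs(share - 0.5) >= alpha:
            kept[word] = share
    return kept


def score_article(counts, o_pos, o_neg, lam=0.5, grid=999):
    """Step 3 sketch: sentiment p in (0, 1) maximizing a penalized likelihood.

    Word counts are modeled as draws from the mixture p*o_pos + (1-p)*o_neg.
    The penalty lam*log(p*(1-p)) acts like a Beta prior pulling p toward 1/2.
    Assumes every word probability in o_pos and o_neg is strictly positive.
    """
    used = {w: c for w, c in counts.items() if w in o_pos and w in o_neg}
    total = sum(used.values())
    if total == 0:
        return 0.5  # no sentiment-charged words: neutral score
    best_p, best_val = 0.5, -math.inf
    for k in range(1, grid + 1):
        p = k / (grid + 1)
        loglik = sum(c * math.log(p * o_pos[w] + (1 - p) * o_neg[w])
                     for w, c in used.items()) / total
        val = loglik + lam * math.log(p * (1 - p))
        if val > best_val:
            best_p, best_val = p, val
    return best_p


# Toy training sample: (tokens, realized return). All values are made up.
sample = [
    (["surge", "profit"], 0.02), (["surge", "the"], 0.01),
    (["plunge", "the"], -0.03), (["plunge", "loss"], -0.01),
    (["the", "profit"], 0.005), (["the"], -0.002),
]
kept = screen_words(sample)

# Hypothetical step-2 output: word probabilities under each sentiment topic.
o_pos = {"surge": 0.5, "profit": 0.4, "plunge": 0.1}
o_neg = {"surge": 0.1, "profit": 0.2, "plunge": 0.7}
p_good = score_article(Counter(["surge", "surge", "profit"]), o_pos, o_neg)
p_bad = score_article(Counter(["plunge", "plunge", "plunge"]), o_pos, o_neg)
print(sorted(kept), p_good, p_bad)
```

In the toy sample, "the" is dropped because it appears equally often with positive and negative returns, and "loss" is dropped because it appears in only one article. A larger `lam` shrinks every score toward the neutral value of 0.5.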


IV. BACKTEST PERFORMANCE
| Annualised Return | 9% |
| Volatility | 7.26% |
| Beta | N/A |
| Sharpe Ratio | 1.24 |
| Sortino Ratio | N/A |
| Maximum Drawdown | 22.07% |
| Win Rate | 71% |
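
As a quick consistency check on the table above, the reported Sharpe ratio matches the annualised return divided by the volatility. The zero risk-free rate is an assumption made for this check; the source does not state how the ratio was computed.

```python
annual_return = 0.09       # Annualised Return from the table
annual_volatility = 0.0726  # Volatility from the table
risk_free_rate = 0.0        # assumption: no risk-free adjustment

sharpe = (annual_return - risk_free_rate) / annual_volatility
print(round(sharpe, 2))
```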