An Introduction to Feature Stores
By Wojciech Gryc on April 8, 2021
Data science is undergoing an evolution in how modeling projects are architected and organized. One development that will impact numerous companies – both those doing internal analytics and those building data-driven products – is the feature store. A feature store is both an architecture component and a product category, and adopting one can substantially change your workflows and your team's productivity.
What is a Feature Store?
Building statistical and machine learning models requires you to take the raw data you’re analyzing and convert it into features. Features can be the raw data itself or can be derived via simple changes to the raw data – for example, ensuring that all numerical values are in the same format. Feature engineering can also involve larger transformations: filling missing values, normalizing data to within a certain range or distribution, or creating completely new variables that are combinations or functions of other variables.
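As a rough illustration of the transformations described above, here is a minimal sketch in plain Python. The data, the helper names (`fill_missing`, `min_max_scale`), and the choice of mean imputation and min-max scaling are assumptions made for this example, not a prescribed method.

```python
from statistics import mean


def fill_missing(values, fill=None):
    """Replace None entries with `fill` (defaults to the mean of observed values)."""
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw data: monthly income and debt for four customers.
income = [4000.0, None, 6000.0, 8000.0]
debt = [1000.0, 3000.0, 1500.0, 2000.0]

# Fill the missing income, normalize it, and derive a brand new variable.
income_filled = fill_missing(income)
income_scaled = min_max_scale(income_filled)
debt_to_income = [d / i for d, i in zip(debt, income_filled)]

print(income_scaled)   # [0.0, 0.5, 0.5, 1.0]
print(debt_to_income)  # [0.25, 0.5, 0.25, 0.25]
```

Each of the three outputs is a feature in the sense used here: one cleaned, one normalized, and one derived from two raw variables.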
A nightmare scenario for data teams is when everyone manages their own code for features, generating them in their own unique ways. This makes it difficult to repeat experiments, track changes to features, or remember important assumptions.
This is where feature stores come in.
You avoid the scenario above by abstracting the feature engineering away from the models themselves. In other words, models consume pre-generated features, and someone else – or another system – is responsible for generating and tracking them.
In an ideal scenario, you have a completely separate database (or API, or other service) where features are generated and simply served up when needed for the model. The model doesn’t need to know where these features came from – just that they are correct, acceptable for use within the model, and always available.
This is what the feature store does.
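To make the separation concrete, here is a small in-memory sketch. The `FeatureStore` class, its `register` and `get_features` methods, and the sample customer record are all hypothetical names invented for this example. A real feature store would sit behind a database or API, but the idea is the same: the model asks for features by name and never sees how they were built.

```python
from typing import Callable, Dict, List


class FeatureStore:
    """Toy feature store: owns the raw data and the feature definitions."""

    def __init__(self, raw_data: Dict[str, dict]):
        self._raw = raw_data  # entity id -> raw record
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        """Define a feature once, in one place, for every model to reuse."""
        self._features[name] = fn

    def get_features(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        """Serve the requested features for a single entity."""
        record = self._raw[entity_id]
        return {name: self._features[name](record) for name in names}


# Feature definitions live with the store, not inside any model's code.
store = FeatureStore({"cust_1": {"income": 5000.0, "debt": 1000.0}})
store.register("debt_to_income", lambda r: r["debt"] / r["income"])
store.register("income_thousands", lambda r: r["income"] / 1000.0)

# The model side only needs feature names and an entity id.
features = store.get_features("cust_1", ["debt_to_income", "income_thousands"])
print(features)  # {'debt_to_income': 0.2, 'income_thousands': 5.0}
```

Because every model goes through `get_features`, a change to how `debt_to_income` is computed happens in one place rather than in each team member's own code.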
Elements of a Feature Store
You can build a basic feature store on your own, though there are now multiple vendors and open source projects that provide this solution out of the box. Either way, ensure your feature store has the following capabilities.
- Import or access primary source data. This is the core component of a feature store. Make sure the one you’re building or selecting actually connects to the data sources you typically use or import.
- Feature engineering. Be able to transform or convert the imported data into the format, values, or other elements you require for your models. Common workflows that could be facilitated here include:
  - One-hot encoding for categorical features.
  - Normalization techniques.
  - Factor analysis, clustering, and other techniques to group features or reduce dimensionality.
  - Custom coding support to convert features into new values.
  - Conversion of unstructured data (voice, text, or images) to standardized feature vectors.
- Serving. Finally, the features themselves need to be served to models via API or other methods. Make sure the serving approach is one that supports your production systems.
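One-hot encoding, mentioned in the list above, is a good example of why engineering and serving belong together: the encoded vector must have the same layout in production as it had in training. The sketch below assumes a hypothetical `plan` column and two made-up helpers, `fit_categories` and `one_hot`. It is one simple way to keep that layout stable, not the only one.

```python
def fit_categories(values):
    """Learn a fixed, sorted vocabulary from the training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 vector; unseen values become all zeros."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical categorical column from the source data.
plans = ["basic", "pro", "basic", "enterprise"]

# Fit once, then reuse the same vocabulary for every request that is served.
categories = fit_categories(plans)
encoded = [one_hot(plan, categories) for plan in plans]

print(categories)  # ['basic', 'enterprise', 'pro']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

# A category that was never seen in training still yields a valid vector.
print(one_hot("trial", categories))  # [0, 0, 0]
```

Mapping unseen categories to an all-zero vector is a design choice made for this sketch. Depending on the model, raising an error or using a dedicated "other" slot may be more appropriate.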
Optional Features for Feature Stores
Beyond the key components above, feature stores come in a variety of flavors. Given the nascent nature of the product category, these still need to be standardized, but they are important to consider as you determine requirements for your models.
- Data quality monitoring. Most data sets experience some form of data set drift or changes in data quality. Some feature stores monitor for this behavior to warn you when features seem to be changing or might require you to revisit them.
- Data cleaning. This could include inferring missing values, enabling business rules for dealing with missing data, and so on.
- Version control. Features themselves change over time. How you clean features and how you build them can evolve. Version control of features means you can go back to earlier versions in case customers or data science team members need to compare changes to features or run backtests on older data.
- Documentation catalog. List all features that are available and document how they were built, so others can simply review the documentation rather than rebuild the features or reach out directly to data engineers.
- Permissions. Not every person needs to access all features, and not every person needs to be able to modify or publish features to production. Having permissions enables you to control who is allowed to do what.
- Governance. In some business situations, certain variables or features cannot be used in every model. An example of this is in banking, where certain fields like race and gender are not allowed to be used in some financial models, like mortgage lead lists. Feature and model governance is something that can be incorporated into the feature store.
- Logging. Finally, since the feature store serves features, it can also track who requested them, when, and for which models. This logging can help with tracking usage, gathering product feedback, auditing security, or improving the speed of feature delivery.
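Several of the optional capabilities above can be pictured together in one small sketch: version control, governance, and logging. Everything here is hypothetical, including the `GovernedFeatureStore` class, its method names, the model name `mortgage_leads`, and the sample values. It is meant only to show how these concerns might fit into the serving path.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedFeatureStore:
    # (feature name, version) -> {entity id: value}
    values: dict = field(default_factory=dict)
    # model name -> set of feature names that model must not use
    blocked: dict = field(default_factory=dict)
    access_log: list = field(default_factory=list)

    def publish(self, name, version, data):
        """Store a new version of a feature without overwriting older ones."""
        self.values[(name, version)] = dict(data)

    def latest_version(self, name):
        return max(v for (n, v) in self.values if n == name)

    def get(self, name, entity_id, model, user, version=None):
        """Serve one feature value, enforcing governance and logging the request."""
        allowed = name not in self.blocked.get(model, set())
        self.access_log.append({
            "when": datetime.now(timezone.utc),
            "user": user,
            "model": model,
            "feature": name,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{name!r} may not be used in model {model!r}")
        if version is None:
            version = self.latest_version(name)
        return self.values[(name, version)][entity_id]


# Governance rule: the mortgage lead model may not use gender.
store = GovernedFeatureStore(blocked={"mortgage_leads": {"gender"}})
store.publish("income_scaled", 1, {"cust_1": 0.40})
store.publish("income_scaled", 2, {"cust_1": 0.55})
store.publish("gender", 1, {"cust_1": "F"})

# Latest version by default, or an older version for a backtest.
latest = store.get("income_scaled", "cust_1", model="mortgage_leads", user="alice")
first = store.get("income_scaled", "cust_1", model="mortgage_leads", user="alice", version=1)

# A blocked feature is refused, and the attempt is still logged.
try:
    store.get("gender", "cust_1", model="mortgage_leads", user="alice")
    denied = False
except PermissionError:
    denied = True

print(latest, first, denied, len(store.access_log))  # 0.55 0.4 True 3
```

Logging the request before the governance check means denied attempts also appear in the log, which is useful for the security use case mentioned above.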
Conclusion
Feature stores are a new type of architecture component and product for data science. If you’re working on any advanced statistical or ML modeling tasks in an enterprise, there’s a good chance you’ll be using feature stores in the coming years. Consider the components above when choosing which to use or when designing your own architecture.