From industrial equipment and vehicles to wearables and smart infrastructure, sensors capture the raw signals that enable machines to perceive and interact with the physical world. Yet as powerful as sensors in modern intelligent systems are, they still suffer from a shortcoming: they can’t accelerate the creation of data. Thankfully, there is a solution: synthetic data.
With a data synthesis platform grounded in AI, signal processing and complementary innovations that directly address this challenge, engineers can keep up with the demands of today’s rapidly evolving AI-driven applications.
Naturally, sensors can’t make time go faster to accumulate data more quickly. However, by combining advanced generative AI techniques, statistical signal processing and physics-based simulations with real-world measurements, engineers can generate synthetic data, overcome data pipeline constraints and dramatically improve the scalability, performance and reliability of their intelligent systems.
Overcoming the Data Bottleneck
Sensor data is collected in controlled environments and across a wide range of operating conditions. Ultimately, the effectiveness of any AI model trained for a given application is defined by both the quality and quantity of that data.
According to Forbes data drawn from a survey of data scientists, 80 percent of AI development time is spent on data preparation (collecting, cleaning, curating and labeling data) rather than on actual model development. Factor in the high cost of gathering sensor data, building specialized hardware setups, long collection and test cycles, and operational downtime, and the problem compounds.
Real-world data often lacks diversity, with gaps and biases that can cause models to perform poorly. This challenge is even more significant in emerging edge AI applications, where the volume of available real-world data remains largely insufficient.
Data quantity and quality remain the fundamental problems to solve, and the projected growth in edge computing adoption makes them even harder to ignore. Driven by demand for real-time analytics, increased automation, and enhanced customer experiences, global edge computing spending will reach $378 billion in 2028, according to the International Data Corporation. Gartner also projects that at least 50 percent of edge computing deployments will involve machine learning by 2026, up from just 5 percent in 2022.
In short, data is the foundation of intelligence in smart edge systems. For edge computing to deliver scalable and effective AI applications, it needs more data of higher quality, sourced faster and more efficiently.
Plugging Holes to Create Better Data
Good, real data is unquestionably valuable to smart edge systems. Unfortunately, even a large amount of good, optimized data can only cover what has happened; it doesn’t cover what may happen. This leaves even expansive data sets potentially lacking. For example, in industrial condition-based monitoring, a model trained only on normal operating conditions may fail to detect early signs of failure simply because it has never “seen” them before.
This is where AI augmentation of sensor data comes in. By synthesizing data, models can be trained on scenarios that have happened, as well as on those that might happen, helping expand the dataset’s diversity and size at scale.
Synthesizing Sensor Data with AI
By using advanced techniques to expand and enhance existing datasets, engineers can cut edge AI model building time from months to weeks. By tapping into generative AI modeling, statistical signal processing, simulation models, and more, they can use AI to generate additional, high-quality data that reflects real-world conditions, turning data into a scalable resource.
Here are a few data synthesis advancements that can produce new data that closely mimics real-world behavior:
- AI-generated data produced by generative models — trained on limited real-world data to capture the underlying patterns — can reflect realistic system dynamics, noise characteristics, and environmental interactions.
- Signal processing methods employ mathematical and computational techniques to simulate data that reflects the dynamics and characteristics of real sensor outputs.
- Data augmentation can automatically transform real sensor data to artificially create new data for various conditions and scenarios. It can also translate data across sensor modalities, such as deriving vibration data from audio signals, which introduces greater diversity and variation into training data and helps models better reflect real-world conditions.
- Physics-based simulation builds simulation models that generate data grounded in real-world conditions by combining physical models and mathematical equations with domain and sensor expertise.
- Assisted annotation streamlines training data labeling, increasing its usefulness and improving the overall quality of data for model training.
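To make two of these ideas concrete, here is a minimal Python sketch of signal-processing-style synthesis followed by simple data augmentation. It is an illustration only, not how any particular platform works: the function names (`simulate_vibration`, `augment`, `build_dataset`), the 50 Hz sinusoid standing in for a vibration signal, and the noise, gain and shift ranges are all assumptions chosen for the example.

```python
import numpy as np


def simulate_vibration(n_samples=2048, fs=1000.0, freq_hz=50.0,
                       noise_std=0.05, rng=None):
    """Synthesize a toy 'vibration' trace: one sinusoid plus Gaussian noise.

    All parameter values are illustrative assumptions, not measured ones.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n_samples) / fs                      # time axis in seconds
    clean = np.sin(2.0 * np.pi * freq_hz * t)          # idealized periodic component
    return clean + rng.normal(0.0, noise_std, size=n_samples)


def augment(signal, rng, scale_range=(0.8, 1.2), jitter_std=0.02, max_shift=100):
    """Return one augmented copy: random gain, circular time shift, added jitter."""
    gain = rng.uniform(*scale_range)                   # mimics sensor gain variation
    shift = int(rng.integers(-max_shift, max_shift + 1))  # mimics timing offset
    jitter = rng.normal(0.0, jitter_std, size=signal.shape)  # extra measurement noise
    return np.roll(signal * gain, shift) + jitter


def build_dataset(base, n_copies, seed=0):
    """Stack n_copies augmented variants of one base recording."""
    rng = np.random.default_rng(seed)
    return np.stack([augment(base, rng) for _ in range(n_copies)])


# Usage: one simulated recording expanded into eight training examples.
base = simulate_vibration(rng=np.random.default_rng(42))
dataset = build_dataset(base, n_copies=8)
print(dataset.shape)  # (8, 2048)
```

In practice the base recording would be a real sensor capture, and the transformations would be chosen to match variations the deployed sensor is actually expected to see.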
Synthesized data allows engineers to enhance models by exposing them to a wider range of scenarios. From equipment faults to unusual user behaviors, synthetic data helps models better recognize uncommon situations and introduces variability to improve real-world performance in both familiar and unseen scenarios and conditions.
Synthetic data accelerates AI model time-to-deployment, and once in the field, real-world data feeds back in to refine the data synthesis models, improving them over time and creating a virtuous cycle of ever-improving performance.
The Advantages of Synthetic Data
By reducing the need for extensive data collection, engineers can move from concept to deployment much faster. Quickly generating datasets for initial model development can lead to faster prototyping. Testing models against a wider range of scenarios before real-world deployment can lead to earlier validation. Rather than being constrained by the availability of large volumes of real-world data, engineers can now actively shape and expand their datasets to meet their specific application needs.
Transforming How AI Models Are Built, Trained, and Deployed
Edge AI systems must operate in dynamic environments with limited computational resources. AI-augmented sensor data directly supports this by diversifying training data and simulating deployment-specific conditions, improving model performance and generalization all while reducing reliance on large-scale, real-world data collection. This makes it possible to scale AI solutions across different devices, environments, and use cases without starting from scratch each time. This can be particularly valuable given how much AI solution development time is spent on data collection and curation.
TDK’s SensorGPT addresses this concern by reducing the reliance on real-world data. The platform combines generative AI, physics-based modeling, signal processing and deep sensor expertise to cut data collection efforts from around 80 percent of development time to a mere 10 percent. SensorGPT improves scalability by generating large and diverse datasets. It also accelerates development by providing quick access to data for prototyping, testing, and deploying initial models. It provides the tools for customization, enabling engineers to tailor synthetic data to specific sensors, smart IoT applications, scenarios and conditions. And as the demand for AI models grows to power smart edge applications, SensorGPT delivers the data needed to build those models faster and better.
By expanding training dataset size by orders of magnitude and aiming for 90 percent similarity between synthetic and real-world sensor data, engineers can significantly reduce edge AI model building time by months, helping meet the growing demand for reliable, high-quality data required to build and scale intelligent edge applications.
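The article does not say how similarity between synthetic and real-world signals is measured. As one possible illustration, the Python sketch below scores two equal-length signals by the cosine similarity of their magnitude spectra. The function name `spectral_similarity` and the choice of metric are assumptions made for this example, and a score from this metric should not be read as the 90 percent figure cited above.

```python
import numpy as np


def spectral_similarity(real, synthetic):
    """Cosine similarity between magnitude spectra of two equal-length signals.

    Returns a value in [0, 1]; 1 means identical spectral shape.
    This metric is an assumption for illustration only.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Remove the mean so a constant sensor offset does not dominate the score.
    real_spec = np.abs(np.fft.rfft(real - real.mean()))
    synth_spec = np.abs(np.fft.rfft(synthetic - synthetic.mean()))
    denom = np.linalg.norm(real_spec) * np.linalg.norm(synth_spec)
    if denom == 0.0:
        return 0.0  # at least one signal is constant; no spectrum to compare
    return float(np.dot(real_spec, synth_spec) / denom)


# Usage: compare a 50 Hz tone with itself and with a 200 Hz tone.
t = np.arange(1000) / 1000.0
tone_50 = np.sin(2.0 * np.pi * 50.0 * t)
tone_200 = np.sin(2.0 * np.pi * 200.0 * t)
same_score = spectral_similarity(tone_50, tone_50)
different_score = spectral_similarity(tone_50, tone_200)
```

A spectrum-only score ignores phase and timing, so a real evaluation would likely combine several measures, including how well a model trained on synthetic data performs on held-out real data.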
Abbas Ataya is senior director of AI, Systems and Software at TDK USA. Juan Mejia Santamaria is principal team lead at TDK, heading a team developing sensors-based AI/ML algorithms and sensor time series synthetic data pipelines.