Revolutionizing sensor development and machine learning

Synthetic data is emerging as a game-changing solution in the worlds of sensor technology and machine learning.

Brian Geisel, CEO of Geisel Software, recently sat down with us at the Sensors Converge event in Santa Clara to explain why.

Synthetic data, as Geisel explains, is artificially generated information used to train machine learning models when real-world data is scarce, dangerous, or expensive to obtain. This technique is particularly valuable in sensor development, where vast amounts of diverse data are crucial for creating accurate and robust systems.

Geisel highlights a fascinating case study involving NASA's Mars exploration efforts. Here, his company worked on creating synthetic data to train rovers for autonomous navigation on the Red Planet, accounting for unique Martian conditions such as atmospheric variations and dust storms. This application demonstrates how synthetic data can overcome the limitations of data scarcity, physical distance and harsh environments in space exploration.

The discussion also touches on the challenges of creating photorealistic synthetic data that not only looks real to the human eye, but also provides valuable training input for machine learning algorithms. Geisel emphasizes that while synthetic data is not always the best choice – real data should be used when readily available – it offers immense potential in scenarios where obtaining real-world data is impractical or impossible.

For those involved in sensor development, robotics or machine learning applications, this discussion provides valuable insights into the future of data generation and model training.

Tune in to learn how synthetic data could revolutionize your approach to complex technological challenges.
 


Matt Hamblen:

How are you doing, Brian?
 

Brian Geisel:

I'm doing great, thanks. Thanks for having me.
 

Matt Hamblen:

Here you are at Sensors. Talk to me a little bit about what you talked about in an earlier presentation today, synthetic data. What is that and why is that important?
 

Brian Geisel:

Yeah, so synthetic data is basically when we're training models, 'cause machine learning is used in so many sensors today, and so we can't always get the data we wish we could get when we need gobs of data to do machine learning.

So as we're trying to get that data set that we want to perform on, we can't always get that data set. So synthetic data is a way for us to get that data when it's either not accessible to us otherwise, or there's danger in obtaining the data, those kinds of things.
 

Matt Hamblen:

How does the machine figure it out? I mean, how do they put in the synthetic data?
 

Brian Geisel:

Yeah, so one of the things that we're doing as a company is creating that data, and so you're creating data that is relevant to the training set that you're going to try to do. So-
 

Matt Hamblen:

It's also an inference system, it sounds like.
 

Brian Geisel:

So it can be used in a lot of different applications, but it's really about creating the data that then you can use for the model to learn from.
 

Matt Hamblen:

So, how does it help make sensors more accurate?
 

Brian Geisel:

So the real-
 

Matt Hamblen:

'Cause that's why we're here.
 

Brian Geisel:

Yeah, so the real thing is that so many people are using machine learning in what they're doing with their sensors, and so you can really improve the machine learning algorithm by using synthetic data.

There are times where, so for example, if you're trying to do car crashes and you need 100,000 of those, your application might save lives, but let's not kill 100,000 people to get the data for it.
 

Matt Hamblen:

Yeah, that's a digital twin concept. Yeah.
 

Brian Geisel:

That's right, that's right. So if you can create that data, and we're getting to the point now, this is what has made it interesting is we're getting to the point now where the applications can be photorealistic enough with the data that we can now actually use it to train.

So it is, it's a digital twin, but with photorealism so good that now you can train your model on it and improve your model.
 

Matt Hamblen:

So you said photorealism, it could be other types of data, not video or-
 

Brian Geisel:

Correct. We tend to focus on video or image data, but you can also have synthetic data that is time series data on whatever it is or text or anything. Yeah, absolutely. Absolutely. Absolutely.
 

Matt Hamblen:

Hang on. So you mentioned an interesting case study during your presentation, something about Mars, I think.

Can you tell us about that?
 

Brian Geisel:

Yeah, so that was some work that we did with NASA to create synthetic data because they want to have a rover on Mars be able to do certain things that they can't do today.

So, at the speed of light, Mars is somewhere between three and 22 minutes away. So to control a robot there, if we're driving a robot, we have to wait maybe 22 minutes before it gets the command.
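
The three-to-22-minute range follows from dividing the Earth-Mars distance by the speed of light. A minimal sketch of that arithmetic, assuming approximate closest and farthest distances of about 54.6 million km and 401 million km (these round figures and the function name are assumptions for illustration, not from the interview):

```python
# Speed of light in vacuum, in km/s.
SPEED_OF_LIGHT_KM_S = 299_792.458

# Approximate Earth-Mars distances in km (assumed round figures).
CLOSEST_KM = 54.6e6
FARTHEST_KM = 401e6


def one_way_delay_minutes(distance_km: float) -> float:
    """Return the one-way signal travel time in minutes."""
    return distance_km / SPEED_OF_LIGHT_KM_S / 60.0


if __name__ == "__main__":
    for label, km in (("closest", CLOSEST_KM), ("farthest", FARTHEST_KM)):
        print(f"{label}: {one_way_delay_minutes(km):.1f} min one way")
```

A round trip (command out, confirmation back) doubles these figures, which is why joystick-style driving from Earth is impractical.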
 

Matt Hamblen:

So, they've made movies about that, or the woman doesn't know her husband died until... never mind.
 

Brian Geisel:

Yeah, yeah, yeah, yeah. So growing potatoes, all that fun.
 

Matt Hamblen:

Yeah, yeah, yeah.
 

Brian Geisel:

So one of the things that they want to be able to do is say, look, just tell the robot go two kilometers over there. We want to go explore that crater, but if on the way we discovered potential signs of life, we don't want to miss that. And so how can we have both?

And so what they asked us was how can we train a model so that we could have the robot autonomously look for things that we deem signs of life as it goes, given we have some photos from Mars, but we don't have hundreds of years of driving and data to be able to train a model realistically for Mars. And so then we had to synthesize the data, which includes things like atmospheric conditions.

The sun is further away on Mars, and so that changes the way that lighting effects work. There are dust storms and more. So you have to account for all those different things when you create the synthetic data in that environment. But it gives us a really cool application: now, all of a sudden, the robot can go, and if on the way it discovers a potential sign of life, then it can stop and send that message back to Earth and go, "Hey, there's something here. You should take a look at this."
 

Matt Hamblen:

So has that been passed on to NASA? Or that project that you described, I mean, did you learn some of the hardships? I guess that's the whole point of these research projects.
 

Brian Geisel:

Yeah, and there's always this sim-to-real gap that you've got to deal with.
 

Matt Hamblen:

Yeah, yeah.
 

Brian Geisel:

And there's this aspect of it that when you and I look at photorealism, we look at Call of Duty or we look at other things that we might think of, video games, Madden, those kinds of things where it looks real, it looks real to our eye, but now we need to make it look real to the training algorithm.

And so there's that gap in maybe not even just the sim-to-real, but in how we perceive real. "Oh, that looks realistic to me," and it may or may not be realistic or fit well for the training.
 

Matt Hamblen:

So in terms of practical applications, when is synthetic data better than actual, real-world data? Or is it for developing sensors?
 

Brian Geisel:

Yeah, so if you can get real data, do that.
 

Matt Hamblen:

Oh, it's just that it's hard, I think.
 

Brian Geisel:

Yeah, I mean if you want to train on pictures of giraffes, there's a bazillion pictures of giraffes on the internet. Do that. Don't try to synthesize giraffes, just go get the pictures. There's really good data for that already.
 

Matt Hamblen:

I see.
 

Brian Geisel:

But anytime that data, we talked about the car crash incident where it's going to be harmful to get that data, anytime where-
 

Matt Hamblen:

I get it.
 

Brian Geisel:

... sometimes it might be really expensive to get that, or maybe you created a unique device and you only have one prototype, but you need to train an ML model on it.

If you just put it on this white table, it will only recognize your device if it's on a white table. So we need to put it in a forest, we need to put it in an office park.

And so when it's harder than taking your device out to a hundred thousand places to generate the training data, those are good instances to use synthetic data.
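
The white-table problem can be illustrated with a toy compositing sketch: one cut-out of the device is pasted onto many different backgrounds so a model cannot key on the backdrop. Everything here (the `DEVICE` sprite, `make_background`, `composite`, `generate_dataset`, the 16x16 image size) is hypothetical and for illustration only. It is not Geisel Software's pipeline, and the photorealistic rendering discussed in this interview goes well beyond pasting pixels.

```python
import random

# Hypothetical 3x3 grayscale "device" sprite; None marks transparent pixels.
DEVICE = [
    [None, 200, None],
    [200, 255, 200],
    [None, 200, None],
]


def make_background(rng, width, height):
    """Stand-in for a real scene photo: a base brightness plus per-pixel noise."""
    base = rng.randrange(0, 180)
    return [[base + rng.randrange(0, 20) for _ in range(width)] for _ in range(height)]


def composite(rng, sprite, width=16, height=16):
    """Paste the sprite at a random spot on a fresh background.

    Returns the image and its bounding box (left, top, right, bottom) as a label.
    """
    image = make_background(rng, width, height)
    sprite_h, sprite_w = len(sprite), len(sprite[0])
    left = rng.randrange(0, width - sprite_w + 1)
    top = rng.randrange(0, height - sprite_h + 1)
    for dy, row in enumerate(sprite):
        for dx, value in enumerate(row):
            if value is not None:  # keep the background where the sprite is transparent
                image[top + dy][left + dx] = value
    return image, (left, top, left + sprite_w, top + sprite_h)


def generate_dataset(count, seed=0):
    """Produce `count` labeled samples, each with its own background and placement."""
    rng = random.Random(seed)
    return [composite(rng, DEVICE) for _ in range(count)]


if __name__ == "__main__":
    for _, box in generate_dataset(5):
        print(box)
```

Because the label (the bounding box) is produced at the same moment as the image, every synthetic sample comes pre-annotated, which is part of what makes this approach cheaper than carrying a single prototype to many locations.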
 

Matt Hamblen:

So what does Geisel Software do with bringing... What is the expertise you need?
 

Brian Geisel:

Yeah, so we've been around for 13 years now and working with a lot of different companies, especially in the robotics space. And the robotics space is interesting because in a lot of spaces you can constrain the problem set and you should. As much as you can, constrain this problem set. In the robotics space, a lot of times the problem set is infinite. You could move anywhere, you can move in any number of different ways.

And so as we've had to work through those challenges and doing things from Mars to some of the fastest warehouses in the world, it's really given us a good background in simulation and emulating those kinds of things. We do a lot of work in computer vision, so already doing some of the ML on the side of giving the robot perception has given us the opposite side of this to where we can create synthetic data and then try it and see how it's working and things that we're using.

And then we've done a lot of things where we've put things into production with millions of things that are out there in the wild. And so getting this from research to prototype to actually used in the wild, I think those are all different pieces that fit us well as a company, just things that we had done organically that fit really well for us in this space.
 

Matt Hamblen:

That's wonderful. It was really nice to meet you.
 

Brian Geisel:

Great to meet you, Matt.
 

Matt Hamblen:

Yeah, thank you.
 

Brian Geisel:

Yeah, thanks so much.
 

Matt Hamblen:

Good luck with everything.
 

Brian Geisel:

Yeah, appreciate it.

The editorial staff had no role in this post's creation.