Synthetic Data For AI Evolution

Imagine that data could be shared seamlessly with partners, governments, and other organisations to facilitate innovation, without breaking any data protection law. How can closely guarded customer data be put to use while still maintaining the highest privacy and safety standards? Is it possible to monetise data without compromising the sensitivity of the information it contains? The following write-up spills it all.


Data is the fuel for the rapidly progressing Artificial Intelligence (AI) industry, as it is for almost every other industry. Digitisation, interconnected network channels, and IoT generate mountainous volumes of data at an unimagined and unprecedented scale. IDC predicts that more than 175 zettabytes of data will exist by 2025, growing at an exponential rate. An almost unbelievable amount of data would thus be available and accessible to all. However, a huge number of AI innovations and projects still never reach viability, owing to an insufficient amount of data.


The inaccessibility of authentic data for innovation and AI


Accurate data collection and processing is extremely demanding in terms of expense and time. While sectors such as financial services, telecom, healthcare, internet companies, and retail are in direct contact with customer data, access remains restricted to those touchpoints. Another category of data is fragmented and siloed. Stringent regulations like GDPR watch over the sharing of data, which can be processed only with user consent and strictly for lawful purposes. Data security measures further restrict access to data at any larger scale. Research and development work, meanwhile, requires regular hypothesis testing, which becomes yet another challenge when little or no real data from the field is available.

At the same time, AI projects built on deep learning and innovation are constantly ravenous for large volumes of data, both structured and unstructured, to train models. The scarcity and huge expense of labelled training data make it difficult to supply AI with what it needs. These challenges decelerate data monetisation and the realisation of benefits from innovation and business, and they are a major reason why many great projects never see the light of day. While alternatives like data masking, anonymisation, and obfuscation ensure security and privacy when data is abundant, synthetic data is the preferred choice when data is scarce.


Synthesising data


Synthetic data is data generated by applying computer algorithms and simulations to real-world data. Digitally synthesised data reflects the real world statistically and mathematically. Simply put, the data is generated by reproducing the statistical properties and patterns of existing real-world datasets: the probability distribution of those datasets is modelled, and new records are sampled from it. Essentially, the algorithm creates a new dataset with the same characteristics as the original data. Although this synthesised data leads to the same answers, it is almost impossible to ever reconstruct the original data from either the algorithm or the synthesised data. Synthetic data is thus almost as potent as the original data, with equal predictive power, and it carries no baggage of privacy concerns or restricted usage of any kind.

Synthetic data is being employed for an increasingly wide range of applications. For instance, Syntegra is using its synthetic data generator to create and validate an anonymous replica of NIH’s database, which holds records of more than 2.7 million individuals screened for COVID-19 and more than 413,000 patients who have tested positive. The synthetic dataset duplicates the real data quite precisely, and because it is anonymous it can be conveniently shared and used by researchers and medical professionals worldwide, a remarkable step towards accelerating research on COVID-19 treatments and vaccines. The AI team at Amazon uses synthetic data to train Alexa’s Natural Language Understanding (NLU) system; as a result, new versions of Alexa have come out in three new languages, Hindi, Brazilian Portuguese, and US Spanish, without relying on large volumes of customer interaction data. Synthetic data is also being used by Waymo, a Google company, to train its autonomous vehicles, and it has proved extremely useful to American Express for enhancing its fraud detection capabilities.

Synthetic data could, therefore, fill the gaps that are hindering the evolution of AI technology, and it is proving increasingly helpful in creating inexpensive yet accurate AI models. According to MIT Technology Review, synthetic data for AI is among the top 10 breakthrough technologies of 2022. Analyst firm Gartner predicts: “By 2024, 60% of the data used to develop AI and analytics projects will be synthetically generated. The fact is you won’t be able to build high-quality, high-value AI models without synthetic data.”
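To make the “model the distribution, then sample” idea above concrete, here is a minimal, illustrative Python sketch. It deliberately uses a plain multivariate Gaussian as the model and made-up customer columns; real synthetic-data generators rely on far richer models such as copulas, Bayesian networks, or generative neural networks.

    # A deliberately simple sketch of "model the distribution, then sample".
    # The columns and numbers below are hypothetical stand-ins for real records.
    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Stand-in for a table of real customer records: age, income, balance.
    real_data = rng.multivariate_normal(
        mean=[40, 55_000, 12_000],
        cov=[[90, 20_000, 5_000],
             [20_000, 4e7, 1e6],
             [5_000, 1e6, 4e6]],
        size=1_000,
    )

    # Step 1: model the probability distribution of the real data.
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)

    # Step 2: sample a brand-new dataset with the same statistical properties.
    synthetic_data = rng.multivariate_normal(mean, cov, size=1_000)

    # No synthetic row copies a real record, yet aggregate statistics
    # (means, correlations) closely track the original.
    print(real_data.mean(axis=0), synthetic_data.mean(axis=0))

The same two-step pattern, fit a generative model and then sample from it, underlies far more sophisticated generators; only the model changes.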


Potential challenges


For all its compelling benefits, authentic and accurate synthetic data generation requires truly advanced knowledge and specialised skill sets. Sophisticated frameworks must also be put in place to validate the synthetic data and confirm its alignment with the objective. It is critical that the generated data does not relate to or expose the original dataset in any way, while still matching the important patterns in the original. Failing on either count results in overlooking potentially large opportunities or generating inaccurate insights in any subsequent effort to model the data. For AI models trained on synthetic data generated by simply copying the original, there is always a risk that inherent historical biases creep in; complex adjustments are therefore necessary for a fairer and more representative synthetic dataset. Hard, yet achievable. When the synthetic data is instead generated and optimised against a predetermined definition of fairness, the resulting dataset accurately reflects the original while maintaining that fairness, so no separate bias mitigation strategies are needed and predictive accuracy is not compromised.
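As a rough illustration of the validation point above, the sketch below shows two simple checks one might run on numeric data: that the synthetic data preserves the correlation structure of the original, and that no synthetic record sits suspiciously close to a real one. The functions and thresholds are hypothetical examples, not a standard recipe.

    # Illustrative fidelity and privacy checks for numeric synthetic data.
    # Thresholds are placeholders and would need tuning for any real project.
    import numpy as np

    def correlations_match(real, synthetic, tol=0.1):
        """Fidelity check: pairwise correlations should agree within tol."""
        gap = np.abs(np.corrcoef(real, rowvar=False)
                     - np.corrcoef(synthetic, rowvar=False))
        return float(gap.max()) < tol

    def no_record_leakage(real, synthetic, min_distance=1e-6):
        """Privacy check: no synthetic row should (near-)duplicate a real row."""
        diffs = synthetic[:, None, :] - real[None, :, :]    # pairwise differences
        nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min()  # closest synthetic-real pair
        return float(nearest) > min_distance

Real-world validation goes much further, with per-column distribution tests, membership-inference checks, and comparisons of models trained on real versus synthetic data, but these two functions capture the fidelity-versus-privacy tension described above.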


The certainty of synthetic data in AI’s future


“There is a risk of false, early-stage perceptions surrounding the use of synthetic data in some circles. This is most likely due to the naming of the term itself, as anything ‘synthetic’ might naturally be thought of as plasticized, non-organic, or in some way fake. But, of course, there should be nothing more natural than machine learning tuition being driven by machine intelligence. Properly generated, managed, maintained, and secured, synthetic data’s level of bias handling, safety, privacy, and cadence represent a significant accelerator and enabler for the AI capabilities of tomorrow,” says Nelson Petracek, CTO of TIBCO Software.

Already being used in healthcare (training machines to monitor a patient’s post-op recovery), in security and surveillance (detecting suspicious objects or behavioural patterns), and in delivery drones, synthetic data is advancing at an accelerating pace. Synthetic data is, surely, synthetic in origin, but it has real-world DNA in it. Its validation and applications are remarkably tangible, pragmatic, and multifarious. It is, in fact, a reality we already live in.