Embracing the open source ecosystem to empower digital transformation
The open source industry in China has achieved remarkable results in recent years, and Chinese developers have grown from open source users into mainstream contributors to global open source. The open, collaborative development model has become a source of continuous innovation for the software industry. Domestic enthusiasm for open source is undoubtedly at an all-time high, but open source is more than just "opening the source code": when government and enterprise organizations adopt it, there are rules to follow, since the people, communities, and governance behind open source all have their own norms. And because open source started late in China, governments and enterprises have run into unprecedented challenges while adopting it to accelerate innovation.
The challenges come mainly from three directions. First, for governments and traditional enterprises, an efficient IoT data processing platform is the foundation of digital-intelligence transformation; for China to move from a manufacturing power to an intelligent-manufacturing power, building digital industrial internet platforms is also critical. Second, as the pace of digital transformation quickens, there is a huge global shortage of AI talent, so AI education systems urgently need development to consolidate the field's foundations; as AI becomes infrastructure like the Internet, the supply of talent will determine how the industry develops. Third, data correctness and aggregation remain hard: volumes are too large to collect easily, data is scattered, standards and identifiers are missing, and collaboration among personnel is complicated. Efficient data collection and processing is the core foundation of digital-intelligence transformation.
Against this backdrop, AI application scenarios keep multiplying. Many intelligent scenarios have already entered daily life, such as facial recognition, offline stores, smart homes, smart retail, and, further out, smart cities. Behind all of them lies the comprehensive processing and analysis of large volumes of data. The digital-intelligence transformation of governments and enterprises therefore has three core requirements: collecting and efficiently storing data, flexibly supporting real-time analysis, and building a data-platform base that supports other systems. Academician Tan Jianrong of the Chinese Academy of Engineering stated that "to master core technologies and high technology, we must start from basic research." At the conference he also proposed five approaches for realizing the digital economy and digital transformation: intelligent manufacturing + innovative design, intelligent manufacturing + process improvement, intelligent manufacturing + quality enhancement, intelligent manufacturing + derivative services, and intelligent manufacturing + market expansion. Since its founding, DataCanvas has worked deeply on automation technologies such as AutoML (automatic machine learning) and AutoDL (automatic deep learning), insisting on independent R&D and open source, continuously turning open source technology into innovative applications across industry scenarios, and driving the development of the data science industry. Under the banner of "realizing AI empowerment and expanding infinite possibilities," the release of the DAT automatic machine learning toolkit and the DingoDB real-time interactive analysis database further strengthens its open source commitment and expands the infinite possibilities of AI.
Meeting the challenges of the era: DAT and DingoDB as a dual-core drive
Data is the oil of the new era; without data, there is no data intelligence to speak of. At present, in government and most enterprises, the value of data is realized mainly through surface-level analysis: turning data into visual reports such as pie charts and line charts, and then using them to guide the business. But as governments and enterprises accumulate richer data, their requirements for analysis keep rising, and past analysis methods can no longer meet their needs.
Lei Fang, chairman of Jiuzhang Yunji DataCanvas, said that the value of government and enterprise data is changing: data analysis has entered the "augmented analytics" stage, in which analysis capability is enhanced through machine learning and artificial intelligence. Rooted in "hard technology", Jiuzhang Yunji DataCanvas will keep innovating and doing R&D in the AutoML field, using AutoML and AutoDL technologies to provide professional technical services to industries such as finance, telecommunications, manufacturing, and government, and to meet governments' and enterprises' demand for real-time capability in their digital-intelligence upgrades. It is under this original intention that today's open source releases arrive: the DAT products for autonomous and automatic modeling, and the DingoDB database for high-concurrency, real-time analysis.
DataCanvas AutoML Toolkit (DAT)
DataCanvas AutoML Toolkit (DAT) is an automated machine learning toolkit containing a series of powerful open source AutoML tools, ranging from a general-purpose underlying AutoML framework to end-to-end automatic modeling tools for structured and unstructured data. All DAT projects are developed in the open; they have so far earned more than 2,600 stars on GitHub, and community installations and downloads exceed 60,000.
The DAT toolkit can be divided by task orientation, covering both structured and unstructured data, and by audience, serving both professional AI practitioners and users without an AI background. AutoML users will find ready-made tools that meet their needs, and AutoML tool developers will find a corresponding framework.
DAT is therefore not a tool built for a single scenario. The hope is that AutoML can serve different groups of people, releasing AutoML capability from different angles and levels to empower users.
DAT's tool stack has three layers: at the bottom, the AutoML framework Hypernets together with the underlying machine learning and deep learning frameworks; in the middle, AutoML tools such as DeepTables; and at the top, the application tools HyperGBM, HyperDT, HyperKeras, and Cooka.
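To make the division of labor concrete, here is a minimal, framework-agnostic sketch of what the bottom layer of such a stack automates: sampling candidate hyperparameters from a search space, evaluating each trial, and keeping the best one. This is purely illustrative Python; it does not use the actual Hypernets API, and the objective function and search space are invented for the example.

```python
import random

# Toy objective standing in for a model's validation error: the searcher
# does not know its shape and can only sample parameters and observe scores.
def validation_error(params):
    lr, depth = params["learning_rate"], params["max_depth"]
    return (lr - 0.1) ** 2 + 0.01 * abs(depth - 6)

# A search space maps each hyperparameter name to a sampling function.
SEARCH_SPACE = {
    "learning_rate": lambda rng: rng.uniform(0.01, 0.5),
    "max_depth": lambda rng: rng.randint(2, 12),
}

def random_search(n_trials=200, seed=42):
    """Sample hyperparameters, evaluate each trial, keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        score = validation_error(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
print(best_params, best_score)
```

Real AutoML frameworks replace random sampling with smarter strategies (evolutionary search, Bayesian optimization, and so on), but the trial loop above is the contract the upper-layer tools build on.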
DingoDB real-time interactive analysis database

DingoDB is a new-generation real-time interactive analysis database that provides highly concurrent data services. Today's government and enterprise data architectures are mostly built on the Lambda architecture, which is also the mainstream data architecture at many Internet companies. However, it carries latent risks and problems in several respects:
(1) Scattered data storage. With multiple storage engines in play, fusing data becomes very difficult, which has spawned a new field: federated queries across multiple databases.
(2) When data sits in multiple storage engines, keeping it consistent and accurate becomes very hard, and it must be repeatedly verified and corrected.
(3) Data services need high concurrency but the engines update slowly, so teams usually add assorted caches and KV databases in the serving layer to speed up responses and raise concurrency.
In short, the coexistence of multiple storage engines, compute engines, and assorted caches has made government and enterprise data platform architectures extremely complex, and learning and operations costs extremely high. A new data architecture was urgently needed, and so DingoDB was born.
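The serving-layer workaround in point (3) is worth a concrete illustration. Below is a minimal cache-aside sketch in plain Python; all class and key names are hypothetical, invented for the example.

```python
class SlowStore:
    """Stands in for a batch/analytical storage engine with expensive reads."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, key):
        self.reads += 1  # count each expensive read against the engine
        return self.rows.get(key)

class CachedService:
    """Serving layer: check the KV cache first, fall back to the slow store."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, key):
        if key not in self.cache:           # cache miss: go to the engine
            self.cache[key] = self.store.get(key)
        return self.cache[key]              # cache hit: answer immediately

store = SlowStore({"user:1": {"name": "alice"}})
svc = CachedService(store)
svc.get("user:1"); svc.get("user:1"); svc.get("user:1")
print(store.reads)  # only the first request reaches the slow store
```

The cache makes serving fast, but it also creates one more copy of the data to keep consistent with the source engine, which is exactly the consistency problem points (1) and (2) describe. This is the complexity a unified engine like DingoDB aims to remove.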
Together, these two open source products make data analysis fast and simple, helping more people who are not professional data scientists use them for data modeling and analysis.
Facing the modeling dilemma: what more can be done on the data side?
Data exists to serve AI's machine learning models, but how does DAT handle the four major modeling difficulties of class imbalance, concept drift, generalization, and large-scale data? It applies the following optimizations:
For class imbalance, DAT uses down-sampling to prevent over-fitting to the majority class, and at the same time uses a variety of sample-generation methods to restore the true distribution of the minority class and prevent under-fitting on it.
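As a toy illustration of the rebalancing idea, the sketch below down-samples the majority class and resamples the minority class with replacement (a simple stand-in for the more sophisticated sample-generation methods mentioned above). It is illustrative Python, not DAT's implementation.

```python
import random

def rebalance(samples, labels, target_per_class=100, seed=0):
    """Return a rebalanced dataset with target_per_class examples per label."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            xs = rng.sample(xs, target_per_class)              # down-sample majority
        else:
            xs = [rng.choice(xs) for _ in range(target_per_class)]  # oversample minority
        out_x.extend(xs)
        out_y.extend([y] * target_per_class)
    return out_x, out_y

X = list(range(1000))
y = [0] * 950 + [1] * 50          # a 19:1 imbalanced label distribution
Xb, yb = rebalance(X, y)
print(yb.count(0), yb.count(1))   # → 100 100
```

In practice, naive duplication of minority examples risks over-fitting them, which is why techniques that synthesize new minority samples (rather than copying) are preferred at scale.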
For concept drift, DAT applies "adversarial validation", a semi-supervised technique inspired by Generative Adversarial Networks (GANs), to identify which features have drifted before modeling and then treat them specifically, which improves the stability of the whole model and effectively prevents model degradation.
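The core trick of adversarial validation can be shown in a few lines of plain Python: label the older batch 0 and the newer batch 1, then measure how well each feature on its own separates the two. A per-feature AUC far from 0.5 flags drift. A real implementation fits a full classifier over all features; this per-feature sketch with synthetic data is only illustrative.

```python
import random

def feature_auc(old_vals, new_vals):
    """Rank-based AUC of one feature at distinguishing new (1) from old (0)."""
    pooled = sorted([(v, 0) for v in old_vals] + [(v, 1) for v in new_vals])
    rank_sum = sum(rank for rank, (_, is_new) in enumerate(pooled, 1) if is_new)
    n_old, n_new = len(old_vals), len(new_vals)
    # Mann-Whitney U statistic normalized to [0, 1]
    return (rank_sum - n_new * (n_new + 1) / 2) / (n_new * n_old)

rng = random.Random(7)
stable_old = [rng.uniform(0, 1) for _ in range(200)]
stable_new = [rng.uniform(0, 1) for _ in range(200)]   # same distribution
drift_old  = [rng.uniform(0, 1) for _ in range(200)]
drift_new  = [rng.uniform(2, 3) for _ in range(200)]   # shifted distribution

print(feature_auc(stable_old, stable_new))  # close to 0.5: no drift
print(feature_auc(drift_old, drift_new))    # 1.0: fully separable, drifted
```

Once a feature is flagged this way, the targeted treatments mentioned above (dropping it, re-weighting samples, and so on) can be applied before training the real model.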
To improve generalization, DAT performs targeted feature screening in automatic feature engineering, tunes regularization parameters during modeling, and raises overall generalization through a combination of techniques such as model ensembling. It also introduces semi-supervised methods such as pseudo-label learning; applying pseudo-labeling to structured data is a relatively advanced approach.
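The pseudo-labeling idea can be sketched in a few lines: fit a model on the labeled data, let it label only the unlabeled points it is confident about, then refit on the enlarged set. The toy below uses a trivial one-dimensional threshold classifier purely for illustration; none of these names come from DAT.

```python
def fit_threshold(xs, ys):
    """Trivial classifier: the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def pseudo_label(xs, ys, unlabeled, margin=2.0):
    """Adopt only confident predictions (far from the decision boundary)."""
    thr = fit_threshold(xs, ys)
    new_x, new_y = list(xs), list(ys)
    for x in unlabeled:
        if abs(x - thr) >= margin:        # skip ambiguous points near the boundary
            new_x.append(x)
            new_y.append(1 if x > thr else 0)
    return fit_threshold(new_x, new_y), len(new_x) - len(xs)

labeled_x = [0.0, 1.0, 9.0, 10.0]
labeled_y = [0, 0, 1, 1]
unlabeled = [0.5, 4.9, 5.1, 9.5]          # the middle two are ambiguous

thr, n_added = pseudo_label(labeled_x, labeled_y, unlabeled)
print(n_added)  # → 2: only the two confident points are adopted
```

The confidence margin is the crucial design choice: adopting low-confidence pseudo-labels feeds the model its own mistakes and degrades rather than improves generalization.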
For large-scale data, the underlying compute engine and the whole system use a distributed architecture: training can run in stand-alone mode or on a distributed cluster, and the system scales horizontally to data of any size.
DingoDB combines the strengths of TP (transaction processing) and AP (analytical processing) systems: while storing massive amounts of data, it can serve highly concurrent queries and perform real-time analysis. Data flows into DingoDB from all kinds of channels, and its high-concurrency queries, real-time analysis, and multidimensional analysis capabilities support a range of government and enterprise business applications.
Compared with running two independent open source products, one for OLTP and one for OLAP, what advantages does a hybrid HSAP (Hybrid Serving & Analytical Processing) product such as DingoDB offer?
Hybrid row-column storage: a unified storage design supports row storage, column storage, and mixed row-column storage.
Standard SQL: supports ANSI SQL syntax and connects seamlessly to Calcite clients and BI reporting tools.
Real-time high-frequency updates: DingoDB can perform Upsert and Delete operations on records by primary key; because data uses a multi-partition, multi-replica mechanism, Upsert and Delete can be translated into Key-Value operations, enabling high-frequency updates.
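The third point, reducing primary-key Upsert/Delete to Key-Value operations over partitioned replicas, can be sketched in plain Python. This mirrors the idea only; the partition counts, hashing scheme, and function names are invented for the example and are not DingoDB's storage code.

```python
import hashlib

NUM_PARTITIONS = 4
REPLICAS_PER_PARTITION = 3

# Each partition holds several replica dicts standing in for replica nodes.
partitions = [[{} for _ in range(REPLICAS_PER_PARTITION)]
              for _ in range(NUM_PARTITIONS)]

def partition_of(pk):
    """Hash the primary key to pick a partition."""
    digest = hashlib.md5(pk.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def upsert(pk, row):
    """Upsert becomes a KV put applied to every replica of one partition."""
    for replica in partitions[partition_of(pk)]:
        replica[pk] = row

def delete(pk):
    """Delete becomes a KV delete on the same replicas."""
    for replica in partitions[partition_of(pk)]:
        replica.pop(pk, None)

def get(pk):
    """Point lookup: read from any replica of the key's partition."""
    return partitions[partition_of(pk)][0].get(pk)

upsert("user:42", {"name": "alice", "score": 1})
upsert("user:42", {"name": "alice", "score": 2})   # high-frequency update
print(get("user:42"))
delete("user:42")
print(get("user:42"))  # → None
```

Because every mutation is a cheap keyed put or delete rather than a table rewrite, updates can run at high frequency even while analytical scans read the same storage.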
For a database product, outstanding performance alone is not enough: the ease of adoption and learning cost that developers value, and the stability and business compatibility that governments and enterprises value, matter just as much. While addressing the problems above, DingoDB also provides full product and technical support for government and enterprise users, delivering innovative capabilities such as interactive analysis backed by an intelligent optimizer, a multi-replica mechanism, elastic scaling of storage and compute, and high-frequency query, update, and delete operations.
The AI industry is still maturing, and the future ecosystem is taking shape
Artificial intelligence has now been developing in the Chinese market for three to five years. Although technological innovation has gradually changed the industry, the challenges remain prominent, especially at the data level. Super-large pre-trained models are one of the hallmarks of the Chinese market this year, addressing the current scarcity of data resources; yet at the basic-technology level, problems remain, such as model generalization, which still needs improvement and is one of the important open problems in AI.
Behind the challenges lie opportunities. In recent years, the state has issued a series of policy guidelines to stimulate innovation in the artificial intelligence industry, focusing on three key directions: core intelligent foundations, public intelligent support, and intelligent product applications. The goal is to cultivate domestic players that master key core technologies with strong innovation capabilities, and to achieve breakthroughs in landmark domestic AI products. The platform strategy for industrial intelligence upgrading is also being pushed forward, organically combining the technology and value of AI with industry. Experts studying AI development trends judge that AutoML will become one of the important technological trends in artificial intelligence.
Viewed from a development perspective, in today's Chinese AI ecosystem "open source and openness" is no longer a brand-new concept or a novel technical move; under the influence of the worldwide wave of AI development, it has become the general trend in China's AI field. AI infrastructure software centered on "automation, cloud native, open source and openness" will drive enterprises across industries to accelerate their digital-intelligence upgrades. CSDN believes that from last year's automatic structured deep learning tool DeepTables and self-searching neural network framework Hypernets to this year's automatic machine learning toolkit DAT and real-time interactive analysis database DingoDB, Jiuzhang Yunji DataCanvas has brought us endless surprises in the open source field over the past two years. We will continue to follow its latest moves in the open source ecosystem; stay tuned.