1.1 What is VPR?
VPR is the abbreviation of voiceprint recognition, sometimes called speaker recognition. Everyone's fingerprint is unique: only among millions of people might two be found with the same fingerprint. Similarly, a voiceprint is a personal characteristic, and it is difficult to find two people with exactly the same voiceprint. Voiceprint recognition identifies who spoke a given segment of speech based on that person's pronunciation characteristics.
In terms of the content the user must speak, voiceprint recognition can be divided into text-dependent and text-independent. The former requires the user to say, during recognition, the same content as the speech used for training; the latter has no such limitation. In terms of the recognition task, it can be divided into voiceprint identification and voiceprint verification. The former determines which one of several people spoke a given utterance; the latter confirms whether a given utterance was spoken by a designated person. In terms of application, it can be divided into closed-set recognition and open-set recognition. The former requires that the speech to be recognized come from one of the known speakers; the latter allows the speech to come from an unknown speaker, so the recognition system needs a "rejection" function. The latter obviously has a wider range of application.
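The difference between identification and verification, and between closed-set and open-set operation, can be illustrated with a minimal sketch. It assumes that some scoring step has already produced one similarity score per enrolled speaker; the function names, the score values and the thresholds are hypothetical and are not part of the engine's API.

```python
from typing import Dict, Optional


def identify(scores: Dict[str, float], threshold: Optional[float] = None) -> Optional[str]:
    """Identification: decide which enrolled speaker produced the utterance.

    With threshold=None this is closed-set: the best match is always returned.
    With a threshold it is open-set: None means "rejected as an unknown speaker".
    """
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best


def verify(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Verification: accept or reject one claimed identity."""
    return scores[claimed] >= threshold


# Hypothetical similarity scores for one utterance against three enrolled voiceprints.
scores = {"alice": 0.82, "bob": 0.35, "carol": 0.41}

print(identify(scores))                 # closed-set identification
print(identify(scores, threshold=0.9))  # open-set identification with a strict threshold
print(verify(scores, "alice", 0.5))     # verification of a claimed identity
```

Raising the threshold makes the system stricter (more rejections of unknown speakers, but also more rejections of genuine ones), which is why an adjustable threshold matters in open-set use.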
1.2 Introduction to the "Pride Voiceprint Recognition" technology and software development tools
1.2.1 Features and advantages of the "Pride Voiceprint Recognition" technology
The company's voiceprint recognition engine comes in voiceprint identification and voiceprint verification versions, each of which can be text-independent or text-dependent, and both support open-set recognition. The text-independent version is independent of both text and language, and its requirements on speech length are very low: training usually needs only tens of seconds of effective speech, and recognition only a few seconds. Along with high recognition accuracy, the operating point parameters can be flexibly adjusted to suit the needs of different applications.
1.2.2 Description of the text-dependent voiceprint verification engine
The latest version of the "Pride Text-Dependent Voiceprint Verification" engine and its API is 3.0.
The system requirements are: a PC or server with an Intel Pentium II 400 MHz processor or above; 128 MB or more of memory; Microsoft Windows 9x/Me/NT/2000/XP; and Microsoft Visual C++ 6.0 or higher. Alternatively, for the Pocket PC version of the text-dependent voiceprint verification engine: an ARM-compatible Pocket PC (Windows CE 3.0 or higher) and eMbedded Visual C++ 3.0 or higher.
The software development kit contains: function declaration header files (*.h), static link libraries (*.lib), dynamic link libraries (*.dll), initial model files, a programming reference manual (*.doc / *.pdf), sample source programs, and so on.
The features of the text-dependent voiceprint verification API v3.0 are: it works in a text-dependent speaker (voiceprint) mode; it requires few training passes and supports cumulative training; it places no restrictions on the user's accent or language; it operates in open-set mode (that is, with a rejection function); the strictness of rejection is controlled by an adjustable threshold; it has built-in control of concurrent operations and supports multi-threaded calls; and it offers reliability and flexibility together with high efficiency and high accuracy.
1.2.3 Description of the text-independent voiceprint identification and voiceprint verification engines
The latest versions of the "Pride Text-Independent Voiceprint Identification" and "Pride Text-Independent Voiceprint Verification" engines and their APIs are both 3.0.
The system requirements are: a PC or server with an Intel Pentium II 400 MHz processor or above; 128 MB or more of memory; Microsoft Windows 9x/Me/NT/2000/XP; and Microsoft Visual C++ 6.0 or higher.
The software development kit contains: function declaration header files (*.h), static link libraries (*.lib), dynamic link libraries (*.dll), initial model files, a programming reference manual (*.doc / *.pdf), sample source programs, and so on.
The features of the text-independent voiceprint identification and voiceprint verification API v3.0 are: support for both speaker identification and speaker verification; independence from text (content) and language; operation in open-set mode (that is, with a rejection function); an adjustable voiceprint recognition threshold with adaptive capability; unsupervised estimation of the open-set rejection threshold; incremental speaker identification and verification; reliability and flexibility together with high efficiency and high accuracy; and a client/server-based framework (multi-threaded and multi-instance).
1.3 Example applications of VPR
Voiceprint identification: criminal investigation, criminal tracking, national defense monitoring, personalized applications, etc. Voiceprint verification: securities trading, bank transactions, public security forensics, voice-controlled locks for personal computers, voice-controlled car locks, identity card and credit card authentication, etc.
1.4 How to use the Pride VPR technology
The Pride voiceprint identification and voiceprint verification technologies both provide a set of easy-to-use application programming interfaces (APIs) and runtime files that application developers can call directly. The API follows a standard pure C style and comes with function declaration header files, so it can be called from a variety of programming languages and environments. The runtime files include the dynamic link library and the pre-trained initial data files. For a specific channel used by a specific application, we can tune parameters and customize the initial channel model.
1.5 Voice formats supported by the Pride voiceprint recognition engine
Like the ASR engine, each Pride voiceprint recognition engine supports speech collected over the PC sound card channel and the telephone channel, with sampling rates of 16 kHz and 8 kHz respectively. Voice streams at other sampling rates must be converted before use. Samples can be in 8-bit or 16-bit PCM format, or compressed with A-law or μ-law.
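To make the μ-law option concrete, the following sketch expands 8-bit μ-law bytes into 16-bit linear PCM values using the standard G.711 expansion. It only illustrates the format itself; whether an application needs to do this conversion, or the engine accepts μ-law data directly, depends on the engine's interface, and the function names here are hypothetical.

```python
from typing import List

ULAW_BIAS = 0x84  # bias added during G.711 mu-law compression


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one 8-bit mu-law code into a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07    # segment number
    mantissa = u & 0x0F           # position within the segment
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> List[int]:
    """Expand a whole mu-law byte stream."""
    return [ulaw_to_pcm16(b) for b in data]


# 0xFF encodes silence; 0x80 and 0x00 encode the largest positive and negative values.
print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```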
If the voice stream is stored in a voice file (such as *.wav), the application must read the voice stream from the file into memory before calling the recognition engine's API, and then call the corresponding programming interface to send the voice data to the recognition engine.
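The file-reading step can be sketched with Python's standard wave module, which reads uncompressed PCM files. The helper names are hypothetical, and the final hand-off to the engine is left out because its function names are not given in this document. The demo writes a short silent file first so that the example runs on its own.

```python
import os
import tempfile
import wave

SUPPORTED_RATES = (8000, 16000)   # telephone channel, PC sound card channel
SUPPORTED_WIDTHS = (1, 2)         # 8-bit or 16-bit PCM, in bytes per sample


def load_wav(path):
    """Read a PCM WAV file into memory: (raw bytes, sample rate, sample width, channels)."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        return frames, wf.getframerate(), wf.getsampwidth(), wf.getnchannels()


def is_supported(rate, width):
    """Check the format against the rates and sample sizes listed above."""
    return rate in SUPPORTED_RATES and width in SUPPORTED_WIDTHS


# Demo: write 10 ms of 16 kHz, 16-bit mono silence, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 160)
    pcm, rate, width, channels = load_wav(demo_path)

print(len(pcm), rate, width, channels, is_supported(rate, width))
# The in-memory buffer `pcm` is what would then be passed to the engine.
```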
1.6 Does the programmer need to preprocess the speech?
Our existing voiceprint recognition interface already includes speech preprocessing. For example, before recognition, the voice data must be placed into an internal data structure. During this step, silence and noise are discarded and voice features are extracted automatically, so that only the truly "valid" speech is kept for subsequent recognition. If necessary, the system developer can add further preprocessing before this step. For example, low signal-to-noise-ratio speech with a particular noise distribution can be specially denoised so that the subsequent modeling and recognition achieve better overall performance.
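As an illustration of what discarding silence means, here is a minimal energy-based sketch. The engine's internal method is not described in this document, so this is only a simple stand-in technique; the frame length and threshold are assumed values that would need tuning for real recordings.

```python
def drop_silence(samples, frame_len=160, threshold=500.0):
    """Keep only frames whose mean absolute amplitude exceeds the threshold.

    samples: sequence of signed 16-bit PCM values.
    frame_len: samples per frame (160 is 10 ms at 16 kHz).
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level > threshold:
            kept.extend(frame)
    return kept


# Synthetic signal: two silent frames, one loud frame, two silent frames.
silence = [0] * 320
speech = [3000, -3000] * 80
signal = silence + speech + silence

voiced = drop_silence(signal)
print(len(signal), len(voiced))
```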
1.7 What is the working mode of multi-machine coordination?
In voiceprint identification, the comparison time is roughly proportional to the length of the speech and the size of the voiceprint database, so when the speech is long and the database is huge, comparison within a single thread becomes very time-consuming. In that case, multiple machines can work in coordination. For example, with five machines, the master control program distributes the voice data stream to be compared to every machine, and each machine compares it against only one-fifth of the voiceprint models in the database. Each machine submits its candidates to the master control program for unified sorting and output, so the overall recognition time is reduced to about one-fifth of that on a single machine. This is how multiple machines work together.
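The split-and-merge scheme described above can be sketched as follows. The workers are simulated by a loop inside one process, and the scoring function is a toy stand-in for a real voiceprint comparison; all names are hypothetical. The point is that merging each worker's top candidates gives the same ranking as scoring the whole database on one machine.

```python
import heapq


def split_models(model_ids, n_workers):
    """Give each worker an equal share of the voiceprint models."""
    return [model_ids[i::n_workers] for i in range(n_workers)]


def worker_top_k(utterance, shard, score_fn, k):
    """Work done by one machine: score its shard and keep the k best candidates."""
    scored = [(score_fn(utterance, m), m) for m in shard]
    return heapq.nlargest(k, scored)


def master_identify(utterance, model_ids, score_fn, n_workers=5, k=3):
    """Master control program: distribute, collect candidates, sort them together."""
    candidates = []
    for shard in split_models(model_ids, n_workers):
        # In a real deployment each shard would be scored on its own machine.
        candidates.extend(worker_top_k(utterance, shard, score_fn, k))
    return heapq.nlargest(k, candidates)


def toy_score(utterance, model_id):
    """Toy similarity: the closer the numbers, the higher the score."""
    return -abs(utterance - model_id)


models = list(range(100))
result = master_identify(42, models, toy_score)
single = heapq.nlargest(3, [(toy_score(42, m), m) for m in models])
print(result)
print(result == single)
```

Each worker scores only its share of the models, so with equal shares the comparison work per machine drops to about one-fifth, plus the cost of distributing the speech and merging the candidates.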
1.8 What is the general background model of the channel?
For text-independent, open-set voiceprint identification and verification, we use a "general background model" trained on massive data to normalize the scores of each voiceprint model and to make rejection decisions. Strictly speaking, PC sound cards, landline phones, GSM or CDMA mobile phones, voice recorders, tapes, monitoring equipment, televisions, radio equipment and so on are all different channels, and the parameters of the "background models" of different channels vary greatly, which affects the performance of the recognizer. At present, only one background model is embedded in our engine by default. Therefore, when voices from multiple channels (such as mobile phones, landline phones, voice recorders, tapes, etc.) must be recognized at the same time, we can train a background model for each channel and apply the matching one to each voice during recognition. The existing programming interface can also be customized or adjusted to the user's specific situation.
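One common way to normalize a score with a background model is a log-likelihood ratio: the utterance's likelihood under the speaker's model minus its likelihood under the background model for the same channel. The sketch below uses single one-dimensional Gaussians as toy models. It is an assumption about the general technique, not a description of the engine's internals, and the model parameters and channel names are invented for illustration.

```python
import math


def gaussian_loglik(x, mean, std):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def avg_loglik(features, model):
    """Average log-likelihood of the feature values under a (mean, std) model."""
    mean, std = model
    return sum(gaussian_loglik(x, mean, std) for x in features) / len(features)


def normalized_score(features, speaker_model, background_model):
    """Log-likelihood ratio: speaker model versus background model."""
    return avg_loglik(features, speaker_model) - avg_loglik(features, background_model)


# Hypothetical per-channel background models, given as (mean, std).
BACKGROUND_MODELS = {"landline": (0.0, 2.0), "mobile": (1.0, 3.0)}


def verify_with_background(features, speaker_model, channel, threshold=0.0):
    """Accept only if the normalized score reaches the rejection threshold."""
    background = BACKGROUND_MODELS[channel]
    return normalized_score(features, speaker_model, background) >= threshold


speaker = (2.0, 0.5)            # toy enrolled voiceprint
genuine = [1.9, 2.1, 2.0]       # features close to the enrolled speaker
impostor = [-0.2, 0.1, 0.3]     # features typical of the background population

print(verify_with_background(genuine, speaker, "landline"))
print(verify_with_background(impostor, speaker, "landline"))
```

Selecting the background model by channel, as in the BACKGROUND_MODELS lookup, mirrors the idea of training one background model per channel and applying the matching one at recognition time.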