§ ~Vji<ã ó(—ddlZddlmZddlmZmZmZmZmZm Z ddl Z ddl mZddlm Z ddlmZddlmZmZddlmZmZd d lmZd dlmZgZdZGd „dej¦«ZGd„dej¦«ZGd„de jj ej!¦«Z"Gd„de jj ej!¦«Z#Gd„d¦«Z$Gd„d¦«Z%eGd„d¦«¦«Z&eGd„d¦«¦«Z'Gd„d¦«Z(eGd„d e'e&e$e¦«¦«Z)eGd!„d"e'e&e%e¦«¦«Z*eGd#„d$e(e&e$e¦«¦«Z+eGd%„d&e(e&e%e¦«¦«Z,e+d'ej-d(¬)¦«¬*¦«Z.d+e._/e,d,ej-d-¬)¦«¬*¦«Z0d.e0_/e)d/ej-d(¬)¦«d0ej1¦«¬1¦«Z2d2e2_/e*d3ej-d-¬)¦«d0ej1¦«¬1¦«Z3d4e3_/dS)5éN)Ú dataclass)ÚAnyÚDictÚListÚOptionalÚTupleÚUnion)ÚTensor)Úload_state_dict_from_url)Úmu_law_decoding)Ú Tacotron2ÚWaveRNN)Ú GriffinLimÚInverseMelScaleé)Úutils)ÚTacotron2TTSBundlez.https://download.pytorch.org/torchaudio/modelscóp‡—eZdZˆfd„Zed„¦«Zdeeeefde e e ffd„ZˆxZS)Ú_EnglishCharProcessorcó¾•—t¦« ¦«tj¦«|_d„t|j¦«D¦«|_dS)Ncó—i|]\}}||“Œ S©r)Ú.0ÚiÚss úX/root/voice-cloning/.venv/lib/python3.11/site-packages/torchaudio/pipelines/_tts/impl.pyú z2_EnglishCharProcessor.__init__..ó€ÐBÐBÐB¡$ ! Q˜˜AÐBÐBÐBó)ÚsuperÚ__init__rÚ _get_charsÚ_tokensÚ enumerateÚ_mapping©ÚselfÚ __class__s €rr!z_EnglishCharProcessor.__init__sLø€Ý ‰Œ×ÒÑÔÐÝÔ'Ñ)Ô)ˆŒØBÐB)°D´LÑ*AÔ*AÐBÑBÔBˆŒ ˆ ˆ rcó—|jS©N©r#©r's rÚtokensz_EnglishCharProcessor.tokensó €àŒ|ÐrÚtextsÚreturncóx‡—t|t¦«r|g}ˆfd„|D¦«}tj|¦«S)NcóP•—g|]"}ˆfd„| ¦«D¦«‘Œ#S)có<•—g|]}|‰jv¯‰j|‘ŒSr©r%)rÚcr's €rú z=_EnglishCharProcessor.__call__...&s,ø€ÐNÐNÐN¨¸1ÀÄ Ð;MÐ;MD”M !Ô$Ð;MÐ;MÐ;Mr)Úlower)rÚtr's €rr6z2_EnglishCharProcessor.__call__..&s7ø€Ð^Ð^Ð^ÐSTÐNÐNÐNÐN¨a¯gªg©i¬iÐNÑNÔNÐ^Ð^Ð^r)Ú isinstanceÚstrrÚ _to_tensor)r'r/Úindicess` rÚ__call__z_EnglishCharProcessor.__call__#sGø€ÝeSÑ!Ô!ð ØGˆEØ^Ð^Ð^Ð^ÐX]Ð^Ñ^Ô^ˆÝÔ Ñ(Ô(Ð(r© Ú__name__Ú __module__Ú__qualname__r!Úpropertyr-r r:rrr r=Ú __classcell__©r(s@rrrsø€€€€€ðCðCðCðCðCð ððñ„Xðð)˜e C¨¨c¬ NÔ3ð)¸¸fÀf¸nÔ8Mð)ð)ð)ð)ð)ð)ð)ð)rrcóv‡—eZdZddœˆfd„ Zed„¦«Zdeeeefde e e ffd„ZˆxZS)Ú_EnglishPhoneProcessorN©Ú dl_kwargscó•—t¦« ¦«tj¦«|_d„t|j¦«D¦«|_tjd|¬¦«|_d|_ dS)Ncó—i|]\}}||“Œ Srr)rrÚps rrz3_EnglishPhoneProcessor.__init__...rrzen_us_cmudict_forward.ptrGz(\[[A-Z]+?\]|[_!'(),.:;? -])) r r!rÚ_get_phonesr#r$r%Ú_load_phonemizerÚ_phonemizerÚ_pattern)r'rHr(s €rr!z_EnglishPhoneProcessor.__init__+smø€Ý ‰Œ×ÒÑÔÐÝÔ(Ñ*Ô*ˆŒØBÐB)°D´LÑ*AÔ*AÐBÑBÔBˆŒ Ý Ô1Ð2LÐXaÐbÑbÔbˆÔØ7ˆŒ ˆ ˆ rcó—|jSr*r+r,s rr-z_EnglishPhoneProcessor.tokens2r.rr/r0có‡—t|t¦«r|g}g}‰ |d¬¦«D]G}d„tj‰j|¦«D¦«}| ˆfd„|D¦«¦«ŒHtj|¦«S)NÚen_us)Úlangcó:—g|]}tjdd|¦«‘ŒS)z[\[\]]Ú)ÚreÚsub)rÚrs rr6z3_EnglishPhoneProcessor.__call__..=s&€ÐWÐWÐW°•2”6˜) R¨Ñ+Ô+ÐWÐWÐWrcó*•—g|]}‰j|‘ŒSrr4)rrKr's €rr6z3_EnglishPhoneProcessor.__call__..>s ø€Ð:Ð:Ð:°˜DœM¨!Ô,Ð:Ð:Ð:r) r9r:rNrVÚfindallrOÚappendrr;)r'r/r<ÚphonesÚrets` rr=z_EnglishPhoneProcessor.__call__6sŸø€ÝeSÑ!Ô!ð ØGˆEàˆØ×&Ò& u°7Ð&Ñ;Ô;ð <ð <ˆFàWÐWµR´ZÀÄ ÈvÑ5VÔ5VÐWÑWÔWˆCØNŠNÐ:Ð:Ð:Ð:°cÐ:Ñ:Ô:Ñ;Ô;Ð;Ð;ÝÔ Ñ(Ô(Ð(rr>rDs@rrFrF*s˜ø€€€€€Ø$(ð8ð8ð8ð8ð8ð8ð8ðððñ„Xðð )˜e C¨¨c¬ NÔ3ð )¸¸fÀf¸nÔ8Mð )ð )ð )ð )ð )ð )ð )ð )rrFcóT‡—eZdZddedeefˆfd„ Zed„¦«Zd d„Z ˆxZ S) Ú_WaveRNNVocoderéœÿÿÿÚmodelÚmin_level_dbcór•—t¦« ¦«d|_||_||_dS)Né"V)r r!Ú_sample_rateÚ_modelÚ _min_level_db)r'rarbr(s €rr!z_WaveRNNVocoder.__init__Hs6ø€Ý ‰Œ×ÒÑÔÐØ!ˆÔØˆŒØ)ˆÔÐÐrcó—|jSr*©rer,s rÚsample_ratez_WaveRNNVocoder.sample_rateNó€àÔ Ð rNcóÀ—tj|¦«}dtjtj|d¬¦«¦«z}|j)|j|z |jz}tj|dd¬¦«}|j ||¦«\}}tj||jj ¦«}t||jj¦«}| d¦«}||fS)Négñhãˆµøä>)Úminrr)rnÚmax) ÚtorchÚexpÚlog10ÚclamprgrfÚinferrÚ_unnormalize_waveformÚn_bitsrÚ n_classesÚsqueeze)r'Úmel_specÚlengthsÚwaveforms rÚforwardz_WaveRNNVocoder.forwardRsÍ€Ý”9˜XÑ&Ô&ˆØœ¥E¤K°¸dÐ$CÑ$CÔ$CÑDÔDÑDˆØÔÐ)ØÔ*¨XÑ5¸Ô9KÑKˆHÝ”{ 8°¸Ð:Ñ:Ô:ˆHØ œK×-Ò-¨h¸Ñ@Ô@Ñˆ'ÝÔ.¨x¸¼Ô9KÑLÔLˆÝ" 8¨T¬[Ô-BÑCÔCˆØ×#Ò# AÑ&Ô&ˆØ˜Ð Ð r)r`r*)r?r@rArrÚfloatr!rBrjr|rCrDs@rr_r_Gsƒø€€€€€ð*ð*˜gð*°X¸e´_ð*ð*ð*ð*ð*ð*ðð!ð!ñ„Xð!ð !ð !ð !ð !ð !ð !ð !ð !rr_có<‡—eZdZˆfd„Zed„¦«Zdd„ZˆxZS)Ú_GriffinLimVocoderc óÆ•—t¦« ¦«d|_tdd|jdddd¬¦«|_t dd d d¬¦«|_dS)NrdiéPgg@¿@Úslaney)Ún_stftÚn_melsrjÚf_minÚf_maxÚ mel_scaleÚnormiré)Ún_fftÚpowerÚ hop_lengthÚ win_length)r r!rerrjÚ_inv_melrÚ_griffin_limr&s €rr!z_GriffinLimVocoder.__init__`szø€Ý ‰Œ×ÒÑÔÐØ!ˆÔÝ'Ø!ØØÔ(ØØØØð ñ ô ˆŒ õ'ØØØØð ñ ô ˆÔÐÐrcó—|jSr*rir,s rrjz_GriffinLimVocoder.sample_ratesrkrNcóF—tj|¦«}| ¦« ¦« d¦«}| |¦«}| ¦« d¦«}| |¦«}||fS)NTF)rprqÚcloneÚdetachÚrequires_grad_rŽr)r'ryrzÚspecÚ waveformss rr|z_GriffinLimVocoder.forwardws„€Ý”9˜XÑ&Ô&ˆØ—>’>Ñ#Ô#×*Ò*Ñ,Ô,×;Ò;¸DÑAÔAˆØ}Š}˜XÑ&Ô&ˆØ{Š{‰}Œ}×+Ò+¨EÑ2Ô2ˆØ×%Ò% dÑ+Ô+ˆ Ø˜'Ð!Ð!rr*)r?r@rAr!rBrjr|rCrDs@rrr_sgø€€€€€ð ð ð ð ð ð&ð!ð!ñ„Xð!ð"ð"ð"ð"ð"ð"ð"ð"rrcó$—eZdZdejfd„ZdS)Ú _CharMixinr0có—t¦«Sr*)rr,s rÚget_text_processorz_CharMixin.get_text_processor†s€Ý$Ñ&Ô&Ð&rN©r?r@rArÚ TextProcessorršrrrr˜r˜…s3€€€€€ð'Ð$6Ô$Dð'ð'ð'ð'ð'ð'rr˜có*—eZdZddœdejfd„ZdS)Ú_PhoneMixinNrGr0có"—t|¬¦«S©NrG)rF)r'rHs rršz_PhoneMixin.get_text_processor‹s€Ý%° Ð:Ñ:Ô:Ð:rr›rrrržržŠs@€€€€€Ø.2ð;ð;ð;Ð7IÔ7Wð;ð;ð;ð;ð;ð;rržcóF—eZdZUeed<eeefed<ddœdefd„ZdS)Ú_Tacotron2MixinÚ_tacotron2_pathÚ_tacotron2_paramsNrGr0cóÂ—tdi|j¤Ž}t›d|j›}|€in|}t |fi|¤Ž}| |¦«| ¦«|S©Nú/r)r r¤Ú _BASE_URLr£rÚload_state_dictÚeval©r'rHraÚurlÚ state_dicts rÚ get_tacotron2z_Tacotron2Mixin.get_tacotron2”sw€ÝÐ3Ð3˜DÔ2Ð3Ð3ˆÝÐ3Ð3˜TÔ1Ð3Ð3ˆØ#Ð+BB°ˆ Ý-¨cÐ?Ð?°YÐ?Ð?ˆ Ø ×Ò˜jÑ)Ô)Ð)Ø Š ‰ŒˆØˆr) r?r@rAr:Ú__annotations__rrr r®rrrr¢r¢s^€€€€€€àÐÐÑØ˜C ˜H”~Ð%Ð%Ñ%à)-ððð°)ððððððrr¢cód—eZdZUeeed<eeeefed<ddœd„Zddœd„Z dS)Ú _WaveRNNMixinÚ _wavernn_pathÚ_wavernn_paramsNrGcóL—| |¬¦«}t|¦«Sr )Ú_get_wavernnr_)r'rHÚwavernns rÚget_vocoderz_WaveRNNMixin.get_vocoder£s&€Ø×#Ò#¨iÐ#Ñ8Ô8ˆÝ˜wÑ'Ô'Ð'rcóÂ—tdi|j¤Ž}t›d|j›}|€in|}t |fi|¤Ž}| |¦«| ¦«|Sr¦)rr³r¨r²rr©rªr«s rrµz_WaveRNNMixin._get_wavernn§sw€ÝÐ/Ð/˜$Ô.Ð/Ð/ˆÝÐ1Ð1˜TÔ/Ð1Ð1ˆØ#Ð+BB°ˆ Ý-¨cÐ?Ð?°YÐ?Ð?ˆ Ø ×Ò˜jÑ)Ô)Ð)Ø Š ‰ŒˆØˆr) r?r@rArr:r¯rrr·rµrrrr±r±žsy€€€€€€à˜C”=Ð Ð Ñ Ø˜d 3¨ 8œnÔ-Ð-Ð-Ñ-à'+ð(ð(ð(ð(ð(ð)-ðððððððrr±có—eZdZd„ZdS)Ú_GriffinLimMixincó—t¦«Sr*)r)r'Ú_s rr·z_GriffinLimMixin.get_vocoder²s€Ý!Ñ#Ô#Ð#rN)r?r@rAr·rrrrºrº±s#€€€€€ð$ð$ð$ð$ð$rrºcó—eZdZdS)Ú_Tacotron2WaveRNNCharBundleN©r?r@rArrrr¾r¾»ó€€€€€à€Drr¾có—eZdZdS)Ú_Tacotron2WaveRNNPhoneBundleNr¿rrrrÂrÂÀrÀrrÂcó—eZdZdS)Ú_Tacotron2GriffinLimCharBundleNr¿rrrrÄrÄÅrÀrrÄcó—eZdZdS)Ú_Tacotron2GriffinLimPhoneBundleNr¿rrrrÆrÆÊrÀrrÆz5tacotron2_english_characters_1500_epochs_ljspeech.pthé&)Ú n_symbols)r£r¤aþCharacter-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.transforms.GriffinLim` as vocoder. The text processor encodes the input texts character-by-character. You can find the training script `here `__. The default parameters were used. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z3tacotron2_english_phonemes_1500_epochs_ljspeech.pthé`aèPhoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and :py:class:`~torchaudio.transforms.GriffinLim` as vocoder. The text processor encodes the input texts based on phoneme. It uses `DeepPhonemizer `__ to convert graphemes to phonemes. The model (*en_us_cmudict_forward*) was trained on `CMUDict `__. You can find the training script `here `__. The text processor is set to the *"english_phonemes"*. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z=tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pthz%wavernn_10k_epochs_8bits_ljspeech.pth)r£r¤r²r³aCharacter-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs. The text processor encodes the input texts character-by-character. You can find the training script `here `__. The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``, ``mel_fmin=40``, and ``mel_fmax=11025``. You can find the training script `here `__. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_CHAR_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_CHAR_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

z;tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pthaPhoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs. The text processor encodes the input texts based on phoneme. It uses `DeepPhonemizer `__ to convert graphemes to phonemes. The model (*en_us_cmudict_forward*) was trained on `CMUDict `__. You can find the training script for Tacotron2 `here `__. The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``, ``mel_fmin=40``, and ``mel_fmax=11025``. You can find the training script for WaveRNN `here `__. Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage. Example - "Hello world! T T S stands for Text to Speech!" .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_PHONE_LJSPEECH.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

Example - "The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired," .. image:: https://download.pytorch.org/torchaudio/doc-assets/TACOTRON2_WAVERNN_PHONE_LJSPEECH_v2.png :alt: Spectrogram generated by Tacotron2 .. raw:: html

)4rVÚdataclassesrÚtypingrrrrrr rpr Útorchaudio._internalrÚtorchaudio.functionalrÚtorchaudio.modelsr rÚtorchaudio.transformsrrrUrÚ interfacerÚ__all__r¨rœrrFÚnnÚModuleÚVocoderr_rr˜ržr¢r±rºr¾rÂrÄrÆÚ_get_taco_paramsÚ"TACOTRON2_GRIFFINLIM_CHAR_LJSPEECHÚ__doc__Ú#TACOTRON2_GRIFFINLIM_PHONE_LJSPEECHÚ_get_wrnn_paramsÚTACOTRON2_WAVERNN_CHAR_LJSPEECHÚ TACOTRON2_WAVERNN_PHONE_LJSPEECHrrrúrÜs¢ðØ € € € Ø!Ð!Ð!Ð!Ð!Ð!Ø:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:Ð:à€€€ØÐÐÐÐÐØ9Ð9Ð9Ð9Ð9Ð9Ø1Ð1Ð1Ð1Ð1Ð1Ø0Ð0Ð0Ð0Ð0Ð0Ð0Ð0Ø=Ð=Ð=Ð=Ð=Ð=Ð=Ð=àÐÐÐÐÐØ)Ð)Ð)Ð)Ð)Ð)à €à<€ ð)ð)ð)ð)ð)Ð.Ô<ñ)ô)ð)ð")ð)ð)ð)ð)Ð/Ô=ñ)ô)ð)ð:!ð!ð!ð!ð!e”h”oÐ'9Ô'Añ!ô!ð!ð0"ð"ð"ð"ð"˜œœÐ*<Ô*Dñ"ô"ð"ðL'ð'ð'ð'ð'ñ'ô'ð'ð ;ð;ð;ð;ð;ñ;ô;ð;ð ðððððñôñ„ðððððððñôñ„ðð$$ð$ð$ð$ð$ñ$ô$ð$ðð ð ð ð ð -°À*ÐN`ñ ô ñ„ð ðð ð ð ð ð =°/À;ÐPbñ ô ñ„ð ðð ð ð ð ð Ð%5°È ÐTfñ ô ñ„ð ðð ð ð ð ð Ð&6¸ÈÐVhñ ô ñ„ð ð&DÐ%CØKØ,eÔ,°rÐ:Ñ:Ô:ð&ñ&ô&Ð"ð!.Ð"Ô*ðF'FÐ&EØIØ,eÔ,°rÐ:Ñ:Ô:ð'ñ'ô'Ð#ð&/Ð#Ô+ðP#>Ð"=ØSØ,eÔ,°rÐ:Ñ:Ô:Ø9Ø*EÔ*Ñ,Ô,ð #ñ#ô#Ðð#+ÐÔ'ðJ$@Ð#?ØQØ,eÔ,°rÐ:Ñ:Ô:Ø9Ø*EÔ*Ñ,Ô,ð $ñ$ô$Ð ð),Ð Ô(Ð(Ð(r