§ ~VjiV<ãó—ddlZddlmZmZmZddlZddlmcmZ ddlmZm Z gd¢ZGd„dej¦«Z Gd„dej¦«ZGd „d ej¦«ZGd„dej¦«ZGd „dej¦«ZdS)éN)ÚListÚOptionalÚTuple)ÚnnÚTensor)ÚResBlockÚ MelResNetÚ Stretch2dÚUpsampleNetworkÚWaveRNNcó>‡—eZdZdZd deddfˆfd„ Zdedefd„ZˆxZS) rafResNet block based on *Efficient Neural Audio Synthesis* :cite:`kalchbrenner2018efficient`. Args: n_freq: the number of bins in a spectrogram. (Default: ``128``) Examples >>> resblock = ResBlock() >>> input = torch.rand(10, 128, 512) # a random spectrogram >>> output = resblock(input) # shape: (10, 128, 512) é€Ún_freqÚreturnNcóH•—t¦« ¦«tjtj||dd¬¦«tj|¦«tjd¬¦«tj||dd¬¦«tj|¦«¦«|_dS)NéF©Úin_channelsÚout_channelsÚkernel_sizeÚbiasT©Úinplace)ÚsuperÚ__init__rÚ SequentialÚConv1dÚBatchNorm1dÚReLUÚresblock_model)ÚselfrÚ __class__s €úS/root/voice-cloning/.venv/lib/python3.11/site-packages/torchaudio/models/wavernn.pyrzResBlock.__init__s‹ø€Ý ‰Œ×ÒÑÔÐå œmÝŒI &°vÈ1ÐSXÐYÑYÔYÝŒN˜6Ñ"Ô"ÝŒG˜DÐ!Ñ!Ô!ÝŒI &°vÈ1ÐSXÐYÑYÔYÝŒN˜6Ñ"Ô"ñ ô ˆÔÐÐóÚspecgramcó2—| |¦«|zS)zéPass the input through the ResBlock layer. Args: specgram (Tensor): the input sequence to the ResBlock layer (n_batch, n_freq, n_time). Return: Tensor shape: (n_batch, n_freq, n_time) )r ©r!r%s r#ÚforwardzResBlock.forward(s€ð×"Ò" 8Ñ,Ô,¨xÑ7Ð7r$)r© Ú__name__Ú __module__Ú__qualname__Ú__doc__Úintrrr(Ú __classcell__©r"s@r#rrs|ø€€€€€ð ð ð ð ˜sð ¨Tð ð ð ð ð ð ð 8 ð 8¨6ð 8ð 8ð 8ð 8ð 8ð 8ð 8ð 8r$rc óP‡—eZdZdZ ddedededed ed dfˆfd„ Zd ed efd„ZˆxZS)r aMelResNet layer uses a stack of ResBlocks on spectrogram. Args: n_res_block: the number of ResBlock in stack. (Default: ``10``) n_freq: the number of bins in a spectrogram. (Default: ``128``) n_hidden: the number of hidden dimensions of resblock. (Default: ``128``) n_output: the number of output dimensions of melresnet. (Default: ``128``) kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``) Examples >>> melresnet = MelResNet() >>> input = torch.rand(10, 128, 512) # a random spectrogram >>> output = melresnet(input) # shape: (10, 128, 508) é réÚn_res_blockrÚn_hiddenÚn_outputrrNcóV•‡—t¦« ¦«ˆfd„t|¦«D¦«}tjtj|‰|d¬¦«tj‰¦«tjd¬¦«g|¢tj‰|d¬¦«‘RŽ|_dS)Ncó.•—g|]}t‰¦«‘ŒS©)r)Ú.0Ú_r5s €r#ú z&MelResNet.__init__..Is!ø€ÐDÐDÐD¨A•X˜hÑ'Ô'ÐDÐDÐDr$FrTrr)rrr) rrÚrangerrrrrÚmelresnet_model)r!r4rr5r6rÚ ResBlocksr"s ` €r#rzMelResNet.__init__Dsµøø€õ ‰Œ×ÒÑÔÐàDÐDÐDÐDµ°{Ñ1CÔ1CÐDÑDÔDˆ å!œ}ÝŒI &°xÈ[Ð_dÐeÑeÔeÝŒN˜8Ñ$Ô$ÝŒG˜DÐ!Ñ!Ô!ð ðð õ ŒI (¸ÈqÐQÑQÔQð ð ð ˆÔÐÐr$r%có,—| |¦«S)zÿPass the input through the MelResNet layer. Args: specgram (Tensor): the input sequence to the MelResNet layer (n_batch, n_freq, n_time). Return: Tensor shape: (n_batch, n_output, n_time - kernel_size + 1) )r>r's r#r(zMelResNet.forwardSs€ð×#Ò# HÑ-Ô-Ð-r$©r2rrrr3r)r0s@r#r r 4s¥ø€€€€€ð ð ð vwð ð Øð Ø-0ð ØBEð ØWZð Øorð à ð ð ð ð ð ð ð . ð .¨6ð .ð .ð .ð .ð .ð .ð .ð .r$r có@‡—eZdZdZdededdfˆfd„Zdedefd„ZˆxZS) r a‘Upscale the frequency and time dimensions of a spectrogram. Args: time_scale: the scale factor in time dimension freq_scale: the scale factor in frequency dimension Examples >>> stretch2d = Stretch2d(time_scale=10, freq_scale=5) >>> input = torch.rand(10, 100, 512) # a random spectrogram >>> output = stretch2d(input) # shape: (10, 500, 5120) Ú time_scaleÚ freq_scalerNcód•—t¦« ¦«||_||_dS©N)rrrDrC)r!rCrDr"s €r#rzStretch2d.__init__ms+ø€Ý ‰Œ×ÒÑÔÐà$ˆŒØ$ˆŒˆˆr$r%cój—| |jd¦« |jd¦«S)zþPass the input through the Stretch2d layer. Args: specgram (Tensor): the input sequence to the Stretch2d layer (..., n_freq, n_time). Return: Tensor shape: (..., n_freq * freq_scale, n_time * time_scale) éþÿÿÿéÿÿÿÿ)Úrepeat_interleaverDrCr's r#r(zStretch2d.forwardss1€ð×)Ò)¨$¬/¸2Ñ>Ô>×PÒPÐQUÔQ`ÐbdÑeÔeÐer$r)r0s@r#r r _sˆø€€€€€ððð% 3ð%°Cð%¸Dð%ð%ð%ð%ð%ð%ð f ð f¨6ð fð fð fð fð fð fð fð fr$r cóx‡—eZdZdZ ddeedededed ed eddfˆfd „ Zdedeeeffd„Z ˆxZ S)rañUpscale the dimensions of a spectrogram. Args: upsample_scales: the list of upsample scales. n_res_block: the number of ResBlock in stack. (Default: ``10``) n_freq: the number of bins in a spectrogram. (Default: ``128``) n_hidden: the number of hidden dimensions of resblock. (Default: ``128``) n_output: the number of output dimensions of melresnet. (Default: ``128``) kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``) Examples >>> upsamplenetwork = UpsampleNetwork(upsample_scales=[4, 4, 16]) >>> input = torch.rand(10, 128, 10) # a random spectrogram >>> output = upsamplenetwork(input) # shape: (10, 128, 1536), (10, 128, 1536) r2rr3Úupsample_scalesr4rr5r6rrNcó<•—t¦« ¦«d}|D]}||z}Œ||_|dz dz|z|_t |||||¦«|_t |d¦«|_g} |D]’} t | d¦«}tj ddd| dzdzfd| fd¬¦«}tjj |j d| dzdzz¦«| |¦«| |¦«Œ“tj| Ž|_dS)NrérF)rrrÚpaddingrçð?)rrÚtotal_scaleÚindentr Úresnetr Úresnet_stretchrÚConv2dÚtorchÚinitÚ constant_ÚweightÚappendrÚupsample_layers)r!rLr4rr5r6rrQÚupsample_scaleÚ up_layersÚscaleÚstretchÚconvr"s €r#rzUpsampleNetwork.__init__‘sEø€õ ‰Œ×ÒÑÔÐàˆØ-ð *ð *ˆNØ˜>Ñ)ˆKˆKØ +ˆÔà" Q‘¨1Ñ,¨{Ñ:ˆŒÝ ¨V°X¸xÈÑUÔUˆŒÝ'¨°QÑ7Ô7ˆÔàˆ Ø$ð #ð #ˆEÝ qÑ)Ô)ˆGÝ”9Ø¨A¸A¸uÀq¹yÈ1¹}Ð;MÐXYÐ[`ÐWaÐhmðñôˆDõ ŒHŒM×#Ò# D¤K°¸À¹ ÀA¹ Ñ1FÑGÔGÐGØ×Ò˜WÑ%Ô%Ð%Ø×Ò˜TÑ"Ô"Ð"Ð"Ý!œ}¨iÐ8ˆÔÐÐr$r%có`—| |¦« d¦«}| |¦«}| d¦«}| d¦«}| |¦«}| d¦«dd…dd…|j|j…f}||fS)a¿Pass the input through the UpsampleNetwork layer. Args: specgram (Tensor): the input sequence to the UpsampleNetwork layer (n_batch, n_freq, n_time) Return: Tensor shape: (n_batch, n_freq, (n_time - kernel_size + 1) * total_scale), (n_batch, n_output, (n_time - kernel_size + 1) * total_scale) where total_scale is the product of all elements in upsample_scales. rN)rSÚ unsqueezerTÚsqueezer[rR)r!r%Ú resnet_outputÚupsampling_outputs r#r(zUpsampleNetwork.forward°s®€ðŸš HÑ-Ô-×7Ò7¸Ñ:Ô:ˆ Ø×+Ò+¨MÑ:Ô:ˆ Ø%×-Ò-¨aÑ0Ô0ˆ à×%Ò% aÑ(Ô(ˆØ ×0Ò0°Ñ:Ô:ÐØ-×5Ò5°aÑ8Ô8¸¸¸¸A¸A¸A¸t¼{ÈdÌkÈ\Ð?YÐ9YÔZÐà -Ð/Ð/r$rA)r*r+r,r-rr.rrrr(r/r0s@r#rr€sÐø€€€€€ððð&ØØØØð9ð9à˜cœð9ðð9ðð 9ð ð9ðð 9ðð9ð ð9ð9ð9ð9ð9ð9ð>0 ð0¨5°¸°Ô+@ð0ð0ð0ð0ð0ð0ð0ð0r$rcóâ‡—eZdZdZ ddeededed ed ededed edededdfˆfd„ Zdededefd„Ze j jddedeede eeeffd„¦«ZˆxZS)raWWaveRNN model from *Efficient Neural Audio Synthesis* :cite:`wavernn` based on the implementation from `fatchord/WaveRNN `_. The original implementation was introduced in *Efficient Neural Audio Synthesis* :cite:`kalchbrenner2018efficient`. The input channels of waveform and spectrogram have to be 1. The product of `upsample_scales` must equal `hop_length`. See Also: * `Training example `__ * :class:`torchaudio.pipelines.Tacotron2TTSBundle`: TTS pipeline with pretrained model. Args: upsample_scales: the list of upsample scales. n_classes: the number of output classes. hop_length: the number of samples between the starts of consecutive frames. n_res_block: the number of ResBlock in stack. (Default: ``10``) n_rnn: the dimension of RNN layer. (Default: ``512``) n_fc: the dimension of fully connected layer. (Default: ``512``) kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``) n_freq: the number of bins in a spectrogram. (Default: ``128``) n_hidden: the number of hidden dimensions of resblock. (Default: ``128``) n_output: the number of output dimensions of melresnet. (Default: ``128``) Example >>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200) >>> waveform, sample_rate = torchaudio.load(file) >>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length) >>> specgram = MelSpectrogram(sample_rate)(waveform) # shape: (n_batch, n_channel, n_freq, n_time) >>> output = wavernn(waveform, specgram) >>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes) r2ér3rrLÚ n_classesÚ hop_lengthr4Ún_rnnÚn_fcrrr5r6rNcó”•—t¦« ¦«||_|dzr|dz n|dz|_||_| dz|_||_||_ttj |j¦«¦«|_d}|D]}||z}Œ||jkrtd|›d|›¦«‚t|||| | |¦«|_tj||jzdz|¦«|_tj||d¬¦«|_tj||jz|d¬¦«|_tjd¬¦«|_tjd¬¦«|_tj||jz|¦«|_tj||jz|¦«|_tj||j¦«|_dS) NrNréz/Expected: total_scale == hop_length, but found z != T)Úbatch_firstr)rrrÚ_padrjÚn_auxrirhr.ÚmathÚlog2Ún_bitsÚ ValueErrorrÚupsamplerÚLinearÚfcÚGRUÚrnn1Úrnn2rÚrelu1Úrelu2Úfc1Úfc2Úfc3)r!rLrhrir4rjrkrrr5r6rQr\r"s €r#rzWaveRNN.__init__ès®ø€õ ‰Œ×ÒÑÔÐà&ˆÔØ(3°a©ÐH[ 1‘__¸[ÈQÑNˆŒ ØˆŒ Ø ‘]ˆŒ Ø$ˆŒØ"ˆŒÝtœy¨¬Ñ8Ô8Ñ9Ô9ˆŒàˆØ-ð *ð *ˆNØ˜>Ñ)ˆKˆKØ˜$œ/Ò)Ð)ÝÐlÈ{ÐlÐlÐ`jÐlÐlÑmÔmÐmå'¨¸ÀfÈhÐX`ÐbmÑnÔnˆŒ Ý”)˜F T¤ZÑ/°!Ñ3°UÑ;Ô;ˆŒå”F˜5 %°TÐ:Ñ:Ô:ˆŒ Ý”F˜5 4¤:Ñ-¨uÀ$ÐGÑGÔGˆŒ å”W TÐ*Ñ*Ô*ˆŒ Ý”W TÐ*Ñ*Ô*ˆŒ å”9˜U T¤ZÑ/°Ñ6Ô6ˆŒÝ”9˜T D¤JÑ.°Ñ5Ô5ˆŒÝ”9˜T 4¤>Ñ2Ô2ˆŒˆˆr$Úwaveformr%cóö‡—| d¦«dkrtd¦«‚| d¦«dkrtd¦«‚| d¦«| d¦«}}| d¦«}tjd|‰j|j|j¬¦«}tjd|‰j|j|j¬¦«}‰ |¦«\}}| dd¦«}| dd¦«}ˆfd„td¦«D¦«}|d d …d d …|d|d…f}|d d …d d …|d|d…f} |d d …d d …|d|d …f} |d d …d d …|d |d…f}tj| d¦«||gd¬ ¦«}‰ |¦«}|} ‰ ||¦«\}}|| z}|} tj|| gd¬ ¦«}‰ ||¦«\}}|| z}tj|| gd¬ ¦«}‰ |¦«}‰ |¦«}tj||gd¬ ¦«}‰ |¦«}‰ |¦«}‰ |¦«}| d¦«S)aPass the input through the WaveRNN model. Args: waveform: the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length) specgram: the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time) Return: Tensor: shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes) rz*Require the input channel of waveform is 1z*Require the input channel of specgram is 1r)ÚdtypeÚdevicerNcó$•—g|]}‰j|z‘Œ Sr9©rp)r:Úir!s €r#r<z#WaveRNN.forward...sø€Ð4Ð4Ð4 a4”: ‘>Ð4Ð4Ð4r$r3NérmrI©Údim)ÚsizertrcrVÚzerosrjr‚rƒruÚ transposer=Úcatrbrwryrzr}r{r~r|r)r!r€r%Ú batch_sizeÚh1Úh2ÚauxÚaux_idxÚa1Úa2Úa3Úa4ÚxÚresr;s` r#r(zWaveRNN.forwardsèø€ð=Š=˜ÑÔ˜qÒ Ð ÝÐIÑJÔJÐJØ=Š=˜ÑÔ˜qÒ Ð ÝÐIÑJÔJÐJà%×-Ò-¨aÑ0Ô0°(×2BÒ2BÀ1Ñ2EÔ2E(ˆà—]’] 1Ñ%Ô%ˆ Ý Œ[˜˜J¨¬ ¸(¼.ÐQYÔQ`Ð aÑ aÔ aˆÝ Œ[˜˜J¨¬ ¸(¼.ÐQYÔQ`Ð aÑ aÔ aˆðŸ š hÑ/Ô/‰ ˆ#Ø×%Ò% a¨Ñ+Ô+ˆØmŠm˜A˜qÑ!Ô!ˆà4Ð4Ð4Ð45°©8¬8Ð4Ñ4Ô4ˆØ AAAw˜q”z G¨A¤JÐ.Ð.Ô /ˆØ AAAw˜q”z G¨A¤JÐ.Ð.Ô /ˆØ AAAw˜q”z G¨A¤JÐ.Ð.Ô /ˆØ AAAw˜q”z G¨A¤JÐ.Ð.Ô /ˆåŒIx×)Ò)¨"Ñ-Ô-¨x¸Ð<À"ÐEÑEÔEˆØGŠGA‰JŒJˆØˆØyŠy˜˜BÑÔ‰ˆˆ1à ‰GˆØˆÝŒIq˜"g 2Ð&Ñ&Ô&ˆØyŠy˜˜BÑÔ‰ˆˆ1à ‰GˆÝŒIq˜"g 2Ð&Ñ&Ô&ˆØHŠHQ‰KŒKˆØJŠJq‰MŒMˆåŒIq˜"g 2Ð&Ñ&Ô&ˆØHŠHQ‰KŒKˆØJŠJq‰MŒMˆØHŠHQ‰KŒKˆð{Š{˜1‰~Œ~Ðr$Úlengthscó‡‡‡—|j}|j}tjj |‰j‰jf¦«}‰ |¦«\}Š||‰jjz}g}| ¦«\}}}tj d|‰jf||¬¦«} tj d|‰jf||¬¦«} tj |df||¬¦«}ˆˆfd„td¦«D¦«}t|¦«D]ÔŠ|dd…dd…‰f} ˆfd„|D¦«\}}}}tj || |gd¬¦«}‰ |¦«}‰ | d¦«| ¦«\}} || dz}tj ||gd¬¦«}‰ | d¦«| ¦«\}} || dz}tj ||gd¬¦«}t%j‰ |¦«¦«}tj ||gd¬¦«}t%j‰ |¦«¦«}‰ |¦«}t%j|d¬¦«}tj|d¦« ¦«}d |zd ‰jzd z zd z }| |¦«ŒÖtj|¦« dd d¦«|fS)a¾Inference method of WaveRNN. This function currently only supports multinomial sampling, which assumes the network is trained on cross entropy loss. Args: specgram (Tensor): Batch of spectrograms. Shape: `(n_batch, n_freq, n_time)`. lengths (Tensor or None, optional): Indicates the valid length of each audio in the batch. Shape: `(batch, )`. When the ``specgram`` contains spectrograms with different durations, by providing ``lengths`` argument, the model will compute the corresponding valid output lengths. If ``None``, it is assumed that all the audio in ``waveforms`` have valid length. Default: ``None``. Returns: (Tensor, Optional[Tensor]): Tensor The inferred waveform of size `(n_batch, 1, n_time)`. 1 stands for a single channel. Tensor or None If ``lengths`` argument was provided, a Tensor of shape `(batch, )` is returned. It indicates the valid length in time axis of the output Tensor. Nr)rƒr‚cóX•—g|]&}‰dd…‰j|z‰j|dzz…dd…f‘Œ'S)Nrr…)r:r†r‘r!s €€r#r<z!WaveRNN.infer..xsCø€ÐXÐXÐXÈ!S˜˜˜˜DœJ¨™N¨T¬Z¸1¸q¹5Ñ-AÐAÀ1À1À1ÐDÔEÐXÐXÐXr$rmcó.•—g|]}|dd…dd…‰f‘ŒSrFr9)r:Úar†s €r#r<z!WaveRNN.infer..~s+ø€Ð%DÐ%DÐ%D°Q a¨¨¨¨1¨1¨1¨a¨¤jÐ%DÐ%DÐ%Dr$rˆrrNrP)rƒr‚rVrÚ functionalÚpadrorurQrŠr‹rjr=rrwryrbrzÚFÚrelur}r~rÚsoftmaxÚmultinomialÚfloatrsrZÚstackÚpermute)r!r%r™rƒr‚ÚoutputÚb_sizer;Úseq_lenrrr—Ú aux_splitÚm_tÚa1_tÚa2_tÚa3_tÚa4_tÚinpÚlogitsÚ posteriorr‘r†s` @@r#Úinferz WaveRNN.inferKsáøøø€ð<”ˆØ”ˆå”8Ô&×*Ò*¨8°d´iÀÄÐ5KÑLÔLˆØŸ š hÑ/Ô/‰ ˆ#ØÐØ ¤ Ô 9Ñ9ˆGà!ˆØ%Ÿ]š]™_œ_Ñˆ7å Œ[˜!˜V T¤ZÐ0¸ÀuÐ MÑ MÔ MˆÝ Œ[˜!˜V T¤ZÐ0¸ÀuÐ MÑ MÔ MˆÝŒK˜ ˜¨F¸%Ð@Ñ@Ô@ˆàXÐXÐXÐXÐXÍuÐUVÉxÌxÐXÑXÔXˆ åw‘”ð ñ ˆAà˜1˜1˜1˜a˜a˜a ˜7Ô#ˆCà%DÐ%DÐ%DÐ%D¸)Ð%DÑ%DÔ%DÑ"ˆD$˜˜då” ˜1˜c 4˜.¨aÐ0Ñ0Ô0ˆAØ—’˜‘ ” ˆAØ—I’I˜aŸkšk¨!™nœn¨bÑ1Ô1‰EˆAˆràBq”E‘ ˆAÝ”)˜Q ˜I¨1Ð-Ñ-Ô-ˆCØ—I’I˜cŸmšm¨AÑ.Ô.°Ñ3Ô3‰EˆAˆràBq”E‘ ˆAÝ” ˜1˜d˜)¨Ð+Ñ+Ô+ˆAÝ”t—x’x ‘{”{Ñ#Ô#ˆAå” ˜1˜d˜)¨Ð+Ñ+Ô+ˆAÝ”t—x’x ‘{”{Ñ#Ô#ˆAà—X’X˜a‘[”[ˆFåœ &¨aÐ0Ñ0Ô0ˆIåÔ! )¨QÑ/Ô/×5Ò5Ñ7Ô7ˆAàA‘˜˜DœK™¨#Ñ-Ñ.°Ñ4ˆAàMŠM˜!ÑÔÐÑåŒ{˜6Ñ"Ô"×*Ò*¨1¨a°Ñ3Ô3°WÐ<Ðrºs}ðØ€€€Ø(Ð(Ð(Ð(Ð(Ð(Ð(Ð(Ð(Ð(à€€€ØÐÐÐÐÐÐÐÐØÐÐÐÐÐÐÐððð€ð 8ð 8ð 8ð 8ð 8ˆrŒyñ 8ô 8ð 8ðF(.ð(.ð(.ð(.ð(.” ñ(.ô(.ð(.ðVfðfðfðfðf” ñfôfðfðBD0ðD0ðD0ðD0ðD0b”iñD0ôD0ðD0ðNR=ðR=ðR=ðR=ðR=ˆbŒiñR=ôR=ðR=ðR=ðR=r$