Blind signal separation (BSS) has been studied extensively as a way to tackle the cocktail party problem. It exploits the spatial diversity of the source mixtures received by different sensors. Using a kurtosis measure, the source of interest can be selected from among the separated BSS outputs. Further noise cancellation can be achieved by adding an adaptive noise canceller (ANC) as a postprocessing stage. However, the computation is intensive, and an online implementation of the overall system is not straightforward. This paper fills that gap by developing an FPGA hardware architecture that implements the system. Subband processing is exploited, and the detailed functional operations are carefully profiled. The proposed FPGA system can handle signals at sample rates above 20000 samples per second.

Speech enhancement has found numerous applications in human machine interfaces, hearing aids, and even hearing protection devices [

To resolve the nonstationarity issue, spatial filtering or beamforming can be used to spatially filter out the speech signal from the noisy signal [

Low et al. [

In the literature, the implementation of a time-delay sonar beamformer on reconfigurable devices has been reported [

In the FPGA hardware architecture, fixed-point arithmetic is applied, with a careful bitwidth analysis to find a suitable bitwidth for the system. Optimizing the integer and fraction sizes of the fixed-point representation can reduce the overall circuit size significantly compared with a basic implementation of the algorithm on the FPGA.
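A bitwidth analysis of this kind can be illustrated with a short experiment. The following is a minimal sketch, not the procedure used in this work: it assumes a signed fixed-point format with saturation, a hypothetical sine test signal, and the signal-to-quantization-noise ratio (SQNR) as the quality measure. The function names `quantize` and `sqnr_db` are introduced here for illustration only.

```python
import math

def quantize(x, int_bits, frac_bits):
    """Round x to a signed fixed-point value with int_bits integer bits and
    frac_bits fraction bits (plus a sign bit), saturating on overflow."""
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1
    min_code = -(1 << (int_bits + frac_bits))
    code = int(round(x * scale))
    code = max(min_code, min(max_code, code))
    return code / scale

def sqnr_db(signal, int_bits, frac_bits):
    """Signal-to-quantization-noise ratio in dB for one candidate format."""
    sig_pow = sum(s * s for s in signal)
    err_pow = sum((s - quantize(s, int_bits, frac_bits)) ** 2 for s in signal)
    if err_pow == 0.0:
        return math.inf
    return 10.0 * math.log10(sig_pow / err_pow)

# Hypothetical test signal: a sine wave whose amplitude stays below 1,
# so that zero integer bits are enough to avoid saturation.
signal = [0.9 * math.sin(2 * math.pi * 5 * n / 256) for n in range(256)]

# Sweep candidate fraction widths and report the resulting SQNR.
for frac_bits in (7, 11, 15, 23):
    print(frac_bits, round(sqnr_db(signal, 0, frac_bits), 1))
```

In a sweep like this, one would typically pick the smallest fraction width whose SQNR meets the application's target, since every extra bit increases the size of the multipliers and memories.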

The hardware accelerator performs the most time-consuming part of the algorithm. We implement the algorithm and evaluate it on a Virtex-4 platform. Measured by the number of samples handled per second, the proposed FPGA-based architecture can process a maximum of 22758 samples per second, which exceeds the 16 kHz sampling rate used in the experiments and therefore achieves real-time operation.

Section

Figure

The spatiotemporal processor with

Second-order decorrelation is incapable of performing BSS as decorrelation does not imply independence. However, additional assumptions about the system can be incorporated to achieve separation. For instance, if the sources are nonstationary, then their respective covariances at different time intervals are linearly independent. This is consistent with the observation that speech is highly nonstationary. A typical speech signal consists of approximately ten to fifteen phonemes per second and each of these phonemes has varying spectral characteristics [

Consider a convolutive mixture of

Following the approach in [

As explained previously, the estimation of the frequency domain unmixing weights

BSS as it is algorithmically defined attempts to recover

The following modified frequency domain leaky LMS algorithm for the frequency
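For orientation, the textbook complex leaky LMS update for a single frequency bin can be sketched as follows. This is the standard form and not the modified algorithm of this paper, whose exact update is not reproduced here. The step size `mu`, the leakage factor `leak`, the helper `cgauss`, and the synthetic data are all assumptions chosen for illustration; only the tap count of 5 matches the experimental settings reported later.

```python
import random

def leaky_lms(inputs, desired, n_taps, mu, leak):
    """Textbook complex leaky LMS.

    inputs  : list of length-n_taps complex input vectors, one per iteration
    desired : list of complex desired samples
    Returns the final weights and the error sequence.
    """
    w = [0j] * n_taps
    errors = []
    for x, d in zip(inputs, desired):
        y = sum(wi * xi for wi, xi in zip(w, x))   # filter output
        e = d - y                                  # estimation error
        # The leakage term (1 - mu * leak) shrinks the weights at every step.
        w = [(1.0 - mu * leak) * wi + mu * e * xi.conjugate()
             for wi, xi in zip(w, x)]
        errors.append(e)
    return w, errors

random.seed(0)

def cgauss():
    """Hypothetical helper: unit-variance circular complex Gaussian sample."""
    return complex(random.gauss(0, 0.5 ** 0.5), random.gauss(0, 0.5 ** 0.5))

n_taps = 5
w_true = [cgauss() for _ in range(n_taps)]          # unknown system to identify
inputs = [[cgauss() for _ in range(n_taps)] for _ in range(3000)]
desired = [sum(a * b for a, b in zip(w_true, x)) for x in inputs]

w, errors = leaky_lms(inputs, desired, n_taps, mu=0.05, leak=1e-3)
```

The leakage introduces a small bias in exchange for keeping the weights bounded, which is often useful in fixed-point hardware where unbounded weight drift would cause overflow.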

As explained in Section

Transform the input signals to their frequency domain representations via the short-time FFT;

Filter the frequency transformed signals by the weight estimates from the Complex Matrix Multiplier;

Reconstruct the signal estimates back to the time domain via short time IFFT (inverse FFT).
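The three steps above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the hardware design: it processes one non-overlapping block without windowing, uses two channels and an 8-point transform instead of four microphones and a 256-point FFT, and takes the unmixing weights as given. The names `fft`, `ifft`, and `separate_block` are introduced here for illustration.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def ifft(X):
    """Inverse FFT with the 1/N scaling applied."""
    n = len(X)
    return [v / n for v in fft(X, inverse=True)]

def separate_block(block, weights):
    """block[j] is channel j (N samples); weights[k][i][j] is the weight
    from input j to output i at frequency bin k."""
    n_bins = len(block[0])
    n_in = len(block)
    n_out = len(weights[0])
    spectra = [fft(ch) for ch in block]                    # step 1: FFT
    out = [[0j] * n_bins for _ in range(n_out)]
    for k in range(n_bins):                                # step 2: per-bin
        for i in range(n_out):                             # complex matrix multiply
            out[i][k] = sum(weights[k][i][j] * spectra[j][k]
                            for j in range(n_in))
    # Step 3: IFFT. Taking the real part assumes conjugate-symmetric weights.
    return [[v.real for v in ifft(Y)] for Y in out]

# Hypothetical instantaneous mixture of two sources, undone by its inverse.
n = 8
s1 = [float(i % 3) for i in range(n)]
s2 = [float((i * i) % 5) for i in range(n)]
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]
det = 1.0 * 1.0 - 0.5 * 0.3
unmix = [[1.0 / det, -0.5 / det], [-0.3 / det, 1.0 / det]]
weights = [unmix for _ in range(n)]                        # same matrix in every bin
y1, y2 = separate_block([x1, x2], weights)
```

Because each frequency bin is handled by its own small matrix product, the inner loop over `k` has no dependencies between iterations, which is what makes this step attractive for parallel hardware.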

Dataflow of the main operations.

The architecture exploits the parallelism that the algorithm exhibits in the frequency domain. In summary, one can explore implementing parallelism at several levels, including [

loop level parallelism, where consecutive loop iterations can be run in parallel,

task level parallelism, where entire procedures inside the program can be run in parallel,

data parallelism.

Since the algorithm is made up of a control part and a computation part, the first stage consists of locating its computational kernels. Algorithmic profiling is then performed to determine the time consumed by each kernel. The profiling exercise can be summarised in Tables

Profiling results of overall operations.

| Operation | Time (s) | Share of total time |
|---|---|---|
| Perform BSS | 152.1 | 86.9% |
| Calculate BSS Output | 14.2 | 8.1% |
| Post Processing ANC | 4.9 | 2.8% |
| OTHERS | 3.9 | 2.2% |

Profiling results of detailed operations.

| Operation | Share of time |
|---|---|
| 24-bit FFT/IFFT (256 pt) | 43.8% |
| Complex Matrix Multiplier | 55.1% |
| OTHERS | 1.1% |
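The profiling figures also give a rough upper bound on what offloading can achieve. The sketch below applies Amdahl's law to the overall timings reported above; the law itself and the function name `max_speedup` are not part of the original analysis and are introduced here only as a sanity check, assuming that the "Perform BSS" step is the only part accelerated and that the remaining time is unchanged.

```python
def max_speedup(accelerated_fraction, kernel_speedup=float("inf")):
    """Amdahl's law: overall speedup when a fraction of the runtime is sped
    up by kernel_speedup and the remainder is left unchanged."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)

# Timings in seconds, taken from the overall profiling table.
profile = {
    "Perform BSS": 152.1,
    "Calculate BSS Output": 14.2,
    "Post Processing ANC": 4.9,
    "OTHERS": 3.9,
}
total = sum(profile.values())
bss_fraction = profile["Perform BSS"] / total

# Upper bound if the BSS step took no time at all.
bound = max_speedup(bss_fraction)
print(round(bss_fraction, 3), round(bound, 2))
```

Under these assumptions the bound is about 7.6, so any larger end-to-end gain would require accelerating the remaining steps as well.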

Figure

Block diagram of the proposed FPGA architecture for BSS.

The block diagram of the hardware accelerator is given in Figure

Block diagram of the hardware accelerator.

While the FFT/IFFT can be implemented by using the core generator LogiCORE IP FFT provided by the vendor tools [

Block diagram of the Complex Matrix Multiplier module.

The data flow in the proposed architecture can be explained by the following [

In the first instance, the filtering process starts with the processor forwarding a load instruction for filter input data to the APU;

The instruction set is passed to the interface logic via the APU, which decodes the instruction and waits for data from memory to arrive;

The input data is sent to the interface logic;

The processor forwards the store instruction to the APU in anticipation of the filter output once the load instructions are completed;

The interface logic decodes the store instruction and waits for data from the filter module;

Once processing is performed, the operation module then returns results to the interface logic;

The interface logic returns the output data to the processor via the APU, and the data are written to memory.

Figure

The data flow of the main state machine.

When the hardware state machine asserts a ready flag, it is ready to accept data. The role of the PowerPC is to provide the data and its address. It also issues a valid signal, which indicates that a write should take place. Once the hardware receives the valid signal, it writes the data at the address provided by the PowerPC. Once the write is complete, the hardware asserts a flag that informs the PowerPC that the data have been written. This triggers a wait state in which the system waits for the result. Once the result is ready, the hardware asserts a result ready flag. The PowerPC can detect the completion in two ways [
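The handshake described in this paragraph can be modelled in software as a small state machine. The following is a behavioural sketch only: the state names, the class `AcceleratorModel`, and the single-word write are assumptions made for illustration, and the way the processor learns of completion is deliberately left out.

```python
from enum import Enum, auto

class State(Enum):
    READY = auto()         # hardware can accept data
    WAIT = auto()          # data written, result being computed
    RESULT_READY = auto()  # result available to the processor

class AcceleratorModel:
    """Hypothetical software model of the hardware side of the handshake."""

    def __init__(self, operation):
        self.operation = operation   # stands in for the processing module
        self.memory = {}
        self.state = State.READY
        self.result = None

    @property
    def ready(self):
        return self.state is State.READY

    @property
    def result_ready(self):
        return self.state is State.RESULT_READY

    def write(self, address, data, valid):
        """Processor presents data, address, and valid. Returns the
        'written' flag and enters the wait state on success."""
        if not (self.ready and valid):
            return False
        self.memory[address] = data
        self.state = State.WAIT
        return True

    def step(self):
        """Advance the hardware: compute the result while in the wait state."""
        if self.state is State.WAIT:
            self.result = self.operation(self.memory)
            self.state = State.RESULT_READY

    def read(self):
        """Processor collects the result; the hardware becomes ready again."""
        if not self.result_ready:
            return None
        self.state = State.READY
        return self.result

hw = AcceleratorModel(lambda mem: sum(mem.values()))
written = hw.write(0x10, 7, valid=True)
hw.step()
out = hw.read()
```

A model like this is mainly useful for checking the ordering of flags before committing the protocol to HDL.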

The performance of the proposed hardware architecture was evaluated by simulation on the Xilinx XC4VSX55-12-FF1148 chip. The settings for the experiment were as follows:

Input sequences were English speech from male and female speakers, captured with four microphones;

Sampling frequency was 16 kHz;

Prototype filter length in the analysis and synthesis filter banks was 128;

Number of taps in the adaptive filters was 5;

Parameter

Number of subbands was chosen as

Number of iterations for the BSS was 2000.

The first task is to figure out a suitable bitwidth for the fixed point arithmetic. The input speech and noise signals are displayed in Figures

Input signal and results of BSS. Panels: input speech signal; input noise signal; filtered signal using floating-point representation; filtered signal using fixed-point representation.

Table

Implementation results of BSS.

| FPGA device | XC4VSX55-12 | XC2VP30-7 |
|---|---|---|
| Slices used | 5937 (12%) | 8916 (32%) |
| DSP48/MULT used | 72 (14%) | 72 (52%) |
| Block RAM used | 8 (2%) | 8 (5%) |
| Frequency (MHz) | 184.8 | 165.9 |

To estimate the performance of the proposed FPGA-based BSS system, we first incorporate one instance of the FFT/IFFT hardware accelerator and one instance of the Complex Matrix Multiplier hardware accelerator. For one data block of 256 samples at a sampling rate of 16 kHz, the number of clock cycles required to process the block in the frequency domain is measured as 15,421,718. Therefore, given that the period of one clock cycle is
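The conversion from a cycle count to a throughput figure is simple arithmetic and can be sketched as below. The clock frequency is an assumption: the sketch uses the 184.8 MHz reported for the XC4VSX55-12 in the implementation table, which may differ from the clock at which the cycle count was measured. The function name `samples_per_second` is introduced for illustration.

```python
def samples_per_second(block_size, cycles_per_block, clock_hz):
    """Throughput when one block of block_size samples takes
    cycles_per_block clock cycles at a clock of clock_hz."""
    seconds_per_block = cycles_per_block / clock_hz
    return block_size / seconds_per_block

BLOCK_SIZE = 256                 # samples per block (from the text)
CYCLES_PER_BLOCK = 15_421_718    # measured cycle count (from the text)
CLOCK_HZ = 184.8e6               # assumed clock: table value for XC4VSX55-12
SAMPLING_RATE = 16_000           # samples per second (from the settings)

rate = samples_per_second(BLOCK_SIZE, CYCLES_PER_BLOCK, CLOCK_HZ)
real_time = rate >= SAMPLING_RATE
print(round(rate), real_time)
```

Under this assumed clock, a single pair of accelerator instances falls well short of the 16 kHz input rate, which is consistent with the motivation for instantiating multiple accelerators.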

Table

Maximum speedup with multiple instances in the FPGA device.

| Samples/s | FFT/IFFT instances | Complex Matrix Multiplier instances | Slices used | DSP used |
|---|---|---|---|---|
| | 1 | 1 | 24% | 14% |
| | 1 | 2 | 37% | 24% |
| | 1 | 3 | 50% | 33% |
| | 1 | 4 | 63% | 42% |
| | 1 | 5 | 73% | 52% |
| | 1 | 6 | 86% | 61% |
| | 2 | 1 | 35% | 18% |
| | 2 | 2 | 48% | 27% |
| | 2 | 3 | 61% | 35% |
| | 2 | 4 | 74% | 46% |
| | 2 | 5 | 87% | 53% |
| | 3 | 1 | 46% | 22% |
| | 3 | 2 | 59% | 31% |
| | 3 | 3 | 72% | 40% |
| | 3 | 4 | 85% | 49% |
| | 4 | 1 | 57% | 26% |
| | 4 | 2 | 70% | 34% |
| | 4 | 3 | 83% | 43% |

In this paper, an online blind signal separation system has been proposed, which involves designing the separation matrix and a postfiltering noise canceller. A hardware implementation of the algorithms on a Virtex-4 FPGA system has been described. To achieve computational efficiency, a frequency domain implementation is employed to speed up the convergence of the beamformers. The complete architecture is simulated in hardware, and the results show that real-time performance can be achieved when an FPGA-based hardware accelerator performs the critical parts of the algorithm. The resulting embedded system is expected to find applications in modern multimedia systems. As a future extension, it would be of interest to investigate the power consumption of the final design based on the technique in [

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This paper is supported by RGC Grant PolyU 152200/14E and PolyU Grant 4-ZZGS and G-YBVQ. The authors would like to thank Mr. Xiaoxiang Shi for carrying out the implementation on FPGA.