Intelligent Systems and Applications

Abstract: This paper presents a text recognition solution intended mainly for street view images. Two different approaches are employed to detect text-based regions and recognise the corresponding text fields. The first approach utilises maximally stable extremal regions (MSER), whereas the second relies on the class specific extremal regions (CSER) algorithm. Two separate frameworks, designed around these methods, are applied to street view images to extract text-based regions. Numerous experiments were performed to evaluate and compare both approaches. The results obtained from the CSER-based approach are particularly encouraging and verify the system’s ability to detect text-based regions and recognise the corresponding text fields.


Introduction
The analysis and evaluation of signboards is a critical issue, especially in developing countries, in terms of environmental transformation and renewal processes. Furthermore, signboard detection for taxation purposes is also an important problem. Accordingly, considerable government resources are consumed in detecting signboards and recognising the text fields located on each signboard [1][2][3]. The text in an image provides very useful information about the image, which offers appropriate clues for a wide variety of applications. In daily life, we encounter signboards and signboard posts. The autonomous detection of signboards and recognition of posts will open the door to many different application areas. For instance, the location of shops can be determined using signboards and GPS. With this approach, a map of the environment can be obtained using only the text fields written on signboards [4,5]. An example of a complex street view image including numerous signboards is shown in Figure 1.
Direct extraction of text regions from street view images is a challenging task because the text appears in different formats and against varying, interfering backgrounds [6]. Signboard detection mainly involves two sub-processes, namely, text detection from images and recognition of characters, which may have different fonts and sizes. As opposed to other natural scene images, street view images have quite complex content due to the aforementioned issues.
In the literature, there exist several different schemes for classifying text detection and recognition methodologies, which have been summarised in review papers [7,8]. One of these schemes categorises the methodologies into two groups, namely, stepwise and integrated. Algorithms using the stepwise methodology employ localisation, validation, segmentation and recognition steps, respectively. Frameworks relying on this methodology primarily follow a coarse-to-fine strategy, which first estimates the position of text candidates and then performs the validation, segmentation and recognition steps [7,9,10]. Integrated methodologies, on the other hand, first utilise a character classification module, which is the most critical step, and the results of this procedure are then employed by the detection and recognition modules [11]. This methodology must separate characters not only from the background but also from each other. This is a challenging problem, requiring a reliable feature detector and a robust classifier. For instance, well-known approaches in this category employ the HOG algorithm to extract features and a nearest neighbour or SVM classifier [12]. An interesting alternative study proposes a multi-layer CNN-based design to handle both the detection and recognition phases of text recognition problems [13]. The advent of deep learning is expected to take the text recognition capabilities of integrated methodologies to a more advanced level [14,15].
Besides, text detection methods can also be classified into two categories, namely, texture-based approaches and connected component (CC)-based approaches [7,8]. Texture-based approaches consider the text as a special pattern that is particularly different from the background. Typically, features are extracted over a certain region using well-known feature extractors, and then a classifier is utilised to identify the existence of text [16]. Alternatively, CC-based methods extract regions from the image and employ different geometric parameters or statistical approaches to exclude non-text fields. One recent study applies this approach to an image processed with the stroke width transform [17]. In another study, K-means clustering is employed to detect connected components, in which straightness and edge density parameters are used to eliminate false positives [18].
This paper proposes two solutions for the signboard detection and text recognition problems. The first employs the MSER-based text recognition approach, which is essentially intensity-based blob detection. The second is an extremal region (ER)-based scene text detection system, which is also a CC-based method and estimates connected components whose intensity is higher or lower than that of the nearby pixels. This method is called CSER (class specific extremal regions), which is a generalisation of the MSER detector equipped with a learning capacity. Section 2 addresses the design of both approaches, whereas section 3 presents the experiments. The paper is concluded in section 4.

Text Recognition Frameworks
This section details both solutions for extracting text from street view images. As previously mentioned, two different methods were employed to address this critical computer vision problem. Essentially, two different frameworks were built on two comprehensive methods. Section 2.1 details the framework using the MSER method, whereas section 2.2 addresses the ER-based method.

MSER-based text recognition system
MSER is one of the most popular and efficient blob detectors due to its robustness against scale changes and lighting conditions. It is, in essence, a natural choice for text detection problems [16]. MSER is an intensity-based algorithm that selects regions whose size remains virtually unchanged over a range of thresholds. The method works well but has problems, especially on blurry or low-contrast images [19]. According to the MSER algorithm, a series of threshold values sweeping from black to white is first applied, and afterwards connected components are extracted. A threshold value at which the region is maximally stable is then estimated using the stability criterion (1). Finally, each region is considered as a feature and may be approximated by an ellipse. Extremal denotes that all pixels inside an MSER region have either higher or lower intensity values than all the pixels on its outer boundary.
$$q(i) = \frac{|Q_{i+\Delta} \setminus Q_{i-\Delta}|}{|Q_i|} \qquad (1)$$
where $Q_1, Q_2, \ldots, Q_i, \ldots$ are nested extremal regions and $\Delta$ is a parameter. $Q_{i^*}$ is an MSER if $q(i)$ has a local minimum at $i^*$ on the nested chain $Q_1, Q_2, \ldots, Q_{\max}$ along the threshold variable. Extremal regions are rejected if they are too small, too large or too similar to their parent MSER. Details can be seen in [16,19]. MSER works well with images having homogeneous regions with distinctive boundaries as well as with small regions, whereas it cannot tolerate motion blur.
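As a rough illustration of the stability test, the sketch below assumes the standard MSER measure q(i) = |Q_{i+Δ} \ Q_{i-Δ}| / |Q_i| and works only on the pixel counts of one nested chain of regions. The function names and the sample areas are hypothetical and are not taken from the paper or from OpenCV.

```python
def stability(areas, delta):
    """Return {i: q(i)} for every index where i - delta and i + delta exist.

    areas[i] is the pixel count of the i-th region in a nested chain, one
    region per threshold level. Because the regions are nested, the size of
    the set difference between Q_{i+delta} and Q_{i-delta} equals
    areas[i + delta] - areas[i - delta].
    """
    return {
        i: (areas[i + delta] - areas[i - delta]) / areas[i]
        for i in range(delta, len(areas) - delta)
    }


def maximally_stable(areas, delta):
    """Indices at which q(i) is a strict local minimum along the chain."""
    q = stability(areas, delta)
    idx = sorted(q)
    return [
        i
        for prev, i, nxt in zip(idx, idx[1:], idx[2:])
        if q[i] < q[prev] and q[i] < q[nxt]
    ]


# Hypothetical areas of one nested chain as the threshold sweeps upward.
areas = [10, 40, 90, 100, 102, 104, 106, 150, 300, 600]
print(maximally_stable(areas, delta=1))  # [5]
```

The region at index 5 is selected because its area barely changes between neighbouring thresholds, which is exactly the plateau that the criterion is meant to find.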
The MSER framework is illustrated in Figure 2. As the figure shows, the proposed framework detects signboards in street view images and then recognises text-based regions. First, MSER is employed to detect regions of interest (ROI) that may cover text fields. Afterwards, a Canny edge detector is applied to localise edge pixels more efficiently. A connected component analysis algorithm is then applied to estimate transitions between meaningful pixel blocks, and the most stable pixel blocks are obtained. Subsequently, the stroke width transform, an image operator that computes, for each pixel, the width of the most likely stroke containing that pixel, is applied to remove pixel-based defects [17]. Finally, a reliable OCR library is employed to recognise characters, and a dictionary module is employed both to remove unlisted characters and to complete missing words. Figure 3 depicts the results of an example scenario using the MSER-based signboard detection system on street view images.
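The connected component analysis and the size-based rejection of regions can be illustrated with a minimal sketch. It assumes a binary mask, 4-connectivity and hypothetical area bounds; none of these choices is specified in the paper, and the actual framework relies on OpenCV rather than on code like this.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected foreground components of a binary grid.

    mask is a list of equal-length rows of 0/1 values. The result is a list
    of components, each a list of (row, col) pixels, in scan order.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def filter_by_area(components, min_area, max_area):
    """Keep components whose pixel count lies within [min_area, max_area]."""
    return [p for p in components if min_area <= len(p) <= max_area]


# Hypothetical mask: a 4-pixel blob, a 1-pixel speck and an 8-pixel blob.
mask = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
components = connected_components(mask)
kept = filter_by_area(components, min_area=2, max_area=6)
print([len(p) for p in components])  # [4, 1, 8]
print([len(p) for p in kept])        # [4]
```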

CSER-based text recognition system
The class specific extremal regions (CSER) algorithm is similar to the MSER algorithm in that appropriate extremal regions are calculated using an intensity-based approach. The main difference, however, is that the CSER algorithm relies on a sequential classifier trained for character recognition, which drops the stability requirement of MSER and instead selects class-specific regions [20]. The CSER-based text recognition algorithm first estimates the probability that each extremal region (ER) contains a character. ERs with locally maximal probability pass to the second stage, in which the classification is supported by more computationally expensive features. Finally, an exhaustive search using a feedback mechanism is applied to group the regions so as to extract probable character regions, and then an OCR module is applied to recognise characters. The details of the algorithm can be seen in [20], and its pseudocode is shown in Algorithm 1.
The CSER algorithm has a cascade structure (sequential classifier) with two stages. In the first stage, the following descriptors are employed: 'area', 'bounding box', 'perimeter' and 'Euler number'. A real AdaBoost classifier using decision trees is then applied to those features [21]. In the second stage, an SVM classifier additionally employs further features such as 'hole area ratio' and 'convex hull ratio'. For the grouping step, an efficient, pruned exhaustive search is employed, which searches the space of character sequences in real time. Details of this search can be seen in [22]. Afterwards, a reliable OCR library is utilised to identify characters, and a dictionary module is employed both to remove unknown characters and to complete missing words. Figure 5 shows the results of an example scenario using the CSER-based signboard detection system on street view images.
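The first-stage descriptors named above can be computed directly from a region's pixel set. The sketch below is one plausible reading, not the implementation of [20] or of OpenCV: it assumes a single 4-connected region, measures the perimeter as the number of pixel sides facing the background, and takes the Euler number as one minus the number of holes. The function name `descriptors` is hypothetical.

```python
def descriptors(pixels):
    """Area, bounding box, perimeter and Euler number of one region.

    pixels is an iterable of (row, col) tuples forming a single 4-connected
    region (an assumption of this sketch).
    """
    region = set(pixels)
    ys = [y for y, _ in region]
    xs = [x for _, x in region]
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)

    # Perimeter: count pixel sides that face a non-region pixel.
    perimeter = sum(
        (ny, nx) not in region
        for y, x in region
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
    )

    # Holes: flood-fill the background (8-connected) inside the bounding box
    # padded by one pixel. The padding makes the outer background a single
    # component, so every additional background component is a hole.
    background = {
        (y, x)
        for y in range(top - 1, bottom + 2)
        for x in range(left - 1, right + 2)
    } - region
    holes = -1  # the outer background component is not a hole
    while background:
        stack = [background.pop()]
        holes += 1
        while stack:
            y, x = stack.pop()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    n = (y + dy, x + dx)
                    if n in background:
                        background.remove(n)
                        stack.append(n)

    return {
        "area": len(region),
        "bounding_box": (top, left, bottom, right),
        "perimeter": perimeter,
        "euler_number": 1 - holes,
    }


# Hypothetical 3x3 ring, roughly the shape of the letter "O".
ring = [(y, x) for y in range(3) for x in range(3) if (y, x) != (1, 1)]
print(descriptors(ring))
```

For the ring, the sketch reports an area of 8, a perimeter of 16 and an Euler number of 0, since the region encloses one hole. All four descriptors are cheap to compute, which fits their use in the first stage of the cascade.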

Experimental Section
This section details and compares the experimental results of the proposed MSER-based and CSER-based frameworks for signboard detection and text recognition using street view images. The experiments were run on a computer with a 2.2 GHz Intel Core i7 processor and 8 GB of RAM. The frameworks were developed using OpenCV 3.2 on the Windows operating system. The main motivation behind this study is to develop a signboard recognition system for cluttered street view images, especially in Turkey. Consequently, instead of utilising a well-known benchmark dataset, which cannot meet the requirements of commercial applications, a dataset of 400 images was collected. This dataset was obtained using open source mapping and imaging services and includes images from different municipalities all over Turkey. The open source Tesseract OCR library is employed as the recognition library. For this experimental part, a small dataset is drawn from the image corpus, and the precision measure (2) is employed to compare the accuracy of both architectures over this dataset.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
where TP is the number of true positives and FP is the number of false positives.
While TP counts correctly identified samples, FP counts incorrectly identified ones. Figure 6 illustrates the results of the MSER-based architecture over a randomly selected dataset, which includes six high-contrast, detailed images. Unexpectedly, this approach achieves only 50% precision on the signboard detection task. Figure 7, in turn, presents the results of the CSER-based architecture over the same dataset, which achieves 80% precision on the signboard detection task. As expected, the results of both methods depend heavily on the quality of the acquired images. Nevertheless, within the given corpus, the overall performance of the CSER-based approach is almost 30% better than that of the MSER-based approach. The results reveal that, although the MSER-based detection approach achieves solid performance, integrated systems using an AI-based learning phase yield better recognition performance [7].
With respect to end-to-end text detection, as illustrated in Table 1, the MSER-based approaches yield a low precision rate, especially on the SVT dataset, whereas the CSER-based framework yields better detection performance, especially on the RSD dataset.
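As a small worked example of the precision measure, defined as TP / (TP + FP), the sketch below evaluates it for two sets of counts. The counts are hypothetical, chosen only so that the ratios match the 50% and 80% figures reported above; they are not the detection counts of the paper.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    detected = true_positives + false_positives
    if detected == 0:
        raise ValueError("precision is undefined when nothing is detected")
    return true_positives / detected


# Hypothetical counts, not taken from the paper's experiments.
print(precision(3, 3))  # 0.5
print(precision(4, 1))  # 0.8
```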

Conclusion
The detection of signboards in street view images is a challenging task and requires obtaining a region of interest (ROI) that includes texts and characters. Accordingly, two different architectures were designed, based on two different segmentation algorithms. Both architectures are supported by a powerful OCR library and a dictionary module. The first architecture is based on a well-known and efficient segmentation algorithm, namely, maximally stable extremal regions (MSER). The corresponding system and segmentation algorithm narrow the search field and increase the overall probability of correctly detecting signboards in street view images. However, street view images may include different fonts, which reduce the overall performance of the first approach; it reaches at most 50% precision in the detection of signboards. The second method, class specific extremal regions (CSER), on the other hand, employs training data to detect the ROI, which the trained set produces using the rotation and orientation models of each character. Therefore, CSER detects text-based regions in a more robust and efficient manner. Furthermore, the CSER-based system employs an advanced grouping method that achieves better performance in detecting text-based regions. A series of experiments was conducted to evaluate both approaches in detecting signboards in street view images. The results reveal that the CSER-based approach is superior to the MSER-based approach and can be used efficiently to detect text-based regions, even in cluttered images.

Figure 1: An example street view image including signboards.

Figure 3: The MSER-based framework applied to an example signboard detection scenario obtained from street view images.

Figure 4: The CSER-based signboard detection system used for street view images.

Figure 5: The CSER-based framework applied to an example signboard detection scenario obtained from street view images.

Figure 6: A randomly selected dataset; from top to bottom, the process time, false positives (FP) and true positives (TP) are shown (MSER).

Figure 11: CSER-based signboard detection.
An example scenario using the image shown in Figure 1 is used to reveal the text recognition skills of both systems. Figures 8-11 include the results of both approaches: Figures 8 and 9 illustrate the text recognition and signboard detection results of the MSER method, whereas Figures 10 and 11 illustrate the characters identified by the CSER method. As mentioned previously, both architectures employ the same OCR and dictionary modules. Therefore, a critical comparison can be made in terms of the segmentation and signboard detection of both architectures, in which the CSER-based architecture achieves far more than the MSER-based one. For a better comparison, SVT (Street View Text), a public benchmark dataset, was also employed to compare both approaches. One recent and leading paper compares the end-to-end text detection performance of comprehensive text recognition algorithms [7].

Table 1: End-to-End Text Detection Performance.