Wednesday, November 27, 2013

A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

Posted by thirumalai kumar at 5:07 PM

Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
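
The two steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's exact procedure: the relevance scores and the pairwise correlation matrix are taken as given, the edge weight is assumed to be 1 minus the correlation, and the `min_relevance` and `max_distance` thresholds, like the function names, are hypothetical. Features are clustered by building a minimum spanning tree with Prim's algorithm and cutting its long edges, then the most relevant feature of each cluster is kept.

```python
from typing import List, Tuple


def mst_edges(dist: List[List[float]], nodes: List[int]) -> List[Tuple[int, int]]:
    """Prim's algorithm: minimum spanning tree over `nodes`, weights dist[i][j]."""
    in_tree = [nodes[0]]
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge joining the current tree to a node outside it.
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.append(v)
    return edges


def select_features(relevance, corr, min_relevance=0.05, max_distance=0.5):
    """Return indices of one representative feature per cluster (illustrative)."""
    # Discard features with negligible relevance to the class (assumed threshold).
    kept = [i for i, r in enumerate(relevance) if r > min_relevance]
    if not kept:
        return []

    # Step 1: cluster. Assumed distance = 1 - correlation, so redundant
    # features end up close together.
    dist = [[1.0 - c for c in row] for row in corr]
    parent = {i: i for i in kept}  # union-find parent map

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for u, v in mst_edges(dist, kept):
        if dist[u][v] <= max_distance:  # keep short MST edges, cut long ones
            parent[find(u)] = find(v)

    # Step 2: keep the most relevant feature of each cluster.
    best = {}
    for i in kept:
        root = find(i)
        if root not in best or relevance[i] > relevance[best[root]]:
            best[root] = i
    return sorted(best.values())


# Made-up scores: features 0 and 1 are redundant, 2 is independent, 3 is irrelevant.
relevance = [0.8, 0.7, 0.6, 0.01]
corr = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
selected = select_features(relevance, corr)
print(selected)
```

With these made-up numbers, features 0 and 1 fall into one cluster and only the more relevant of the two survives, while feature 3 is dropped as irrelevant before clustering.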

Existing System

Feature subset selection methods fall into four categories: embedded, wrapper, filter, and hybrid. The embedded methods incorporate feature selection as part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms like decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms and have good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space that the subsequent wrapper will consider. They mainly aim to achieve the best possible performance with a particular learning algorithm at a time complexity similar to that of the filter methods.
1.     Wrapper methods: the generality of the selected features is limited and the computational complexity is large.
2.     Filter methods: the computational complexity is low, but the accuracy of the learning algorithms is not guaranteed.

Proposed System

Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. Irrelevant features do not contribute to predictive accuracy, and redundant features do not help produce a better predictor because they mostly provide information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant ones, while others can eliminate irrelevant features and also take care of redundant ones. Our proposed FAST algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate between instances of different target classes, based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, because two predictive but highly correlated features are both likely to be highly weighted. Relief-F extends Relief so that it works with noisy and incomplete data sets and handles multiclass problems, but it still cannot identify redundant features.
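
To see why Relief cannot separate redundant features, consider the minimal sketch below. It is a simplified, deterministic variant written for illustration: it visits every instance once instead of sampling at random, uses Manhattan distance, and assumes numeric features already scaled to [0, 1] with at least two instances per class. The function name `relief_weights` and the toy data are hypothetical.

```python
from typing import List


def relief_weights(X: List[List[float]], y: List[int]) -> List[float]:
    """Simplified Relief: one pass over all instances, nearest hit and nearest miss."""
    n, m = len(X), len(X[0])

    def distance(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))  # Manhattan distance

    weights = [0.0] * m
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Nearest neighbour with the same class (hit) and a different class (miss).
        hit = min((j for j in others if y[j] == y[i]),
                  key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]),
                   key=lambda j: distance(X[i], X[j]))
        for f in range(m):
            # Reward features that differ on the miss, penalise those that differ on the hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Toy data: feature 1 is an exact copy of feature 0; feature 2 is unrelated to the class.
X = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
print(weights)
```

On this toy data the duplicated features 0 and 1 receive the same high weight, so a ranking by weight alone would keep both, while the unrelated feature 2 is pushed to the bottom.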

1.     Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with each other.
2.     FAST efficiently and effectively deals with both irrelevant and redundant features and obtains a good feature subset.


Implementation is the stage of the project when the theoretical design is turned into a working system. It can therefore be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.

The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve changeover, and evaluation of changeover methods.
Main Modules:-

1.    User Module :
In this module, users are authenticated before they can access the details presented in the ontology system. A user must have an account before accessing or searching the details; otherwise, they must register first.

2.    Distributed Clustering :

Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words (Pereira et al.) or on the distribution of class labels associated with each word (Baker and McCallum). Because distributional clustering of words is agglomerative in nature and results in suboptimal word clusters and high computational cost, a new information-theoretic divisive algorithm for word clustering was proposed and applied to text classification. Another approach clusters features using a special distance metric and then uses the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the distance-based cluster evaluation measure does not identify a feature subset that allows the classifiers to improve their original accuracy. Furthermore, even compared with other feature selection methods, the obtained accuracy is lower.
3.    Subset Selection Algorithm
Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines. Thus, feature subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, “good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” Keeping these in mind, we develop a novel algorithm that can efficiently and effectively deal with both irrelevant and redundant features and obtain a good feature subset.

4.    Time Complexity :
The major amount of work for Algorithm 1 involves the computation of SU (symmetric uncertainty) values for T-Relevance and F-Correlation, which has linear complexity in terms of the number of instances in a given data set. The first part of the algorithm has a linear time complexity in terms of the number of features m. Assuming k features are selected as relevant ones in the first part, when k = 1, only one feature is selected.
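
The linear cost in the number of instances can be seen in a small sketch of the SU computation. It assumes SU denotes symmetric uncertainty in its usual form, SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), over discrete (or already discretized) values; the function names are illustrative only. Each entropy is computed in a single counting pass over the instances.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy in bits, from one counting pass over the values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """SU(X, Y) in [0, 1]: 0 for independent variables, 1 for fully dependent ones."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    gain = hx + hy - hxy            # information gain, equal to H(X) - H(X|Y)
    return 2.0 * gain / (hx + hy)


feature = [0, 0, 1, 1]
target = [0, 0, 1, 1]   # fully determined by the feature
noise = [0, 1, 0, 1]    # independent of the feature

su_relevant = symmetric_uncertainty(feature, target)
su_noise = symmetric_uncertainty(feature, noise)
print(su_relevant, su_noise)
```

Applied to a feature and the class, this gives a relevance score; applied to a pair of features, it gives a correlation score between them.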


H/W System Configuration:-

Processor        -    Pentium III
Speed            -    1.1 GHz
RAM              -    256 MB (min)
Hard Disk        -    20 GB
Floppy Drive     -    1.44 MB
Keyboard         -    Standard Windows Keyboard
Mouse            -    Two or Three Button Mouse
Monitor          -    SVGA


S/W System Configuration:-

Operating System         :   Windows 95/98/2000/XP
Application Server       :   Tomcat 5.0/6.x
Front End                :   HTML, Java, JSP
Scripts                  :   JavaScript
Server-side Script       :   Java Server Pages
Database                 :   MySQL 5.0
Database Connectivity    :   JDBC
