
Vitorino Ramos - Citations2016Jan

2016 – Up to now, a total of 1567 citations among 74 works (including 3 books) on GOOGLE SCHOLAR (https://scholar.google.com/citations?user=gSyQ-g8AAAAJ&hl=en) [with a Hirsch h-index of 19, and an average of 160.2 citations for each of my top five works] + 900 citations among 57 works on the new RESEARCH GATE site (https://www.researchgate.net/profile/Vitorino_Ramos).

Refs.: Science, Artificial Intelligence, Swarm Intelligence, Data-Mining, Big-Data, Evolutionary Computation, Complex Systems, Image Analysis, Pattern Recognition, Data Analysis.

David M.S. Rodrigues, Reading the News Through its Structure: New Hybrid Connectivity Based Approaches.

Figure – Two simplices a and b connected by the 2-dimensional face, the triangle {1;2;3}.

In the analysis of the timeline of The Guardian newspaper, the system used feature vectors based on word frequencies and then computed the similarity between documents from those feature vectors. This is a purely statistical approach that requires great computational power and becomes difficult for problems with large feature vectors and many documents. Feature vectors with 100,000 or more items are common, and computing similarities between these documents becomes cumbersome. Instead of computing distance (or similarity) matrices between documents from feature vectors, the present approach explores the possibility of inferring the distance between documents from the Q-analysis description. Q-analysis provides a very natural notion of connectivity between the simplices of the structure, and in the relation studied, documents are connected to each other through shared sets of tags entered by the journalists. Also in this framework, eccentricity is defined as a measure of the relatedness of one simplex in relation to another [7].
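To make the connectivity idea concrete, here is a minimal sketch of how two documents, each seen as a simplex whose vertices are its tags, share a face. The document names, the tag ids and both helper functions are hypothetical and only mirror the figure caption (two simplices joined by the triangle {1;2;3}); this is not the code used in the work cited above.

```python
from itertools import combinations


def shared_face_dimension(tags_a, tags_b):
    """Dimension of the face shared by two simplices (-1 if no vertex is shared)."""
    return len(set(tags_a) & set(tags_b)) - 1


def q_near_pairs(docs, q):
    """Pairs of documents that share a face of dimension q or higher."""
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if shared_face_dimension(docs[x], docs[y]) >= q
    ]


# Hypothetical documents: each one is the set of tag ids attached to it.
documents = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {5, 6},
}

print(shared_face_dimension(documents["a"], documents["b"]))  # 2 (the triangle {1, 2, 3})
print(q_near_pairs(documents, 0))  # [('a', 'b'), ('b', 'c')]
```

Only set intersections over the tags are needed here, so no document-word feature vector is ever built.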

David M.S. Rodrigues and Vitorino Ramos, “Traversing News with Ant Colony Optimisation and Negative Pheromones” [PDF], accepted as preprint for oral presentation at the European Conference on Complex Systems (ECCS14) in Lucca, Italy, Sept. 22-26, 2014.

Abstract: The past decade has seen the rapid development of the online newsroom. News published online is now the main news outlet, surpassing traditional printed newspapers. This poses challenges to the production and to the consumption of that news. With so many sources of information available, it is important to find ways to cluster and organise the documents if one wants to understand this new system. Traditional approaches to the problem of clustering documents usually embed the documents in a suitable similarity space. Previous studies have reported on the impact of the similarity measures used for clustering of textual corpora [1]. These similarity measures are usually calculated for bag-of-words representations of the documents. This makes the final document-word matrix high dimensional. Feature vectors with more than 10,000 dimensions are common, and algorithms have severe problems with the high dimensionality of the data. A novel bio-inspired approach to the problem of traversing the news is presented. It finds Hamiltonian cycles over documents published by the newspaper The Guardian. A Second Order Swarm Intelligence algorithm based on Ant Colony Optimisation was developed [2, 3] that uses a negative pheromone to mark unrewarding paths with a “no-entry” signal. This approach follows recent findings of negative pheromone usage in real ants [4].

In this case study the corpus of data is represented as a bipartite relation between documents and keywords entered by the journalists to characterise the news. A new similarity measure between documents is presented, based on the Q-analysis description [5, 6, 7] of the simplicial complex formed between documents and keywords. The eccentricity between documents (two simplices) is then used as a novel measure of similarity between documents. The results show that the Second Order Swarm Intelligence algorithm performs better in benchmark problems of the travelling salesman problem, with faster convergence and optimal results. The addition of the negative pheromone as a no-entry signal improves the quality of the results. The application of the algorithm to the corpus of news of The Guardian creates a coherent navigation system among the news. This allows users to navigate the news published during a certain period of time in a semantic sequence instead of a time sequence. This work has broader application, as it can be applied to many cases where the data is mapped to bipartite relations (e.g. protein expressions in cells, sentiment analysis, brand awareness in social media, routing problems), since it highlights the connectivity of the underlying complex system.
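The abstract describes ants that build Hamiltonian cycles while a negative pheromone marks unrewarding paths with a “no-entry” signal. Below is a minimal sketch of that idea on a toy travelling salesman instance. It is not the Second Order Swarm Intelligence algorithm of [2, 3]: the transition rule, the way the worst ant deposits negative pheromone, and every parameter (alpha, beta, gamma, rho, the number of ants) are assumptions chosen only for illustration.

```python
import math
import random


def tour_length(tour, dist):
    """Length of the closed tour (it returns to the starting node)."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def tour_edges(tour):
    """Undirected edges of a closed tour."""
    n = len(tour)
    return {frozenset((tour[k], tour[(k + 1) % n])) for k in range(n)}


def build_tour(dist, pos, neg, rng, alpha=1.0, beta=2.0, gamma=1.0):
    """One ant builds a Hamiltonian cycle node by node."""
    n = len(dist)
    current = rng.randrange(n)
    tour, unvisited = [current], set(range(n)) - {current}
    while unvisited:
        candidates = sorted(unvisited)
        weights = [
            pos[current][j] ** alpha
            * (1.0 / dist[current][j]) ** beta
            / (1.0 + neg[current][j]) ** gamma  # assumed form of the "no-entry" penalty
            for j in candidates
        ]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(current)
        unvisited.remove(current)
    return tour


def aco_negative_pheromone(dist, n_ants=10, n_iter=50, rho=0.1, seed=0):
    """Toy ant colony with a positive and a negative pheromone field (needs >= 3 nodes)."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[1.0] * n for _ in range(n)]
    neg = [[0.0] * n for _ in range(n)]
    best_tour, best_len = None, math.inf
    for _ in range(n_iter):
        tours = sorted(
            (build_tour(dist, pos, neg, rng) for _ in range(n_ants)),
            key=lambda t: tour_length(t, dist),
        )
        it_best, it_worst = tours[0], tours[-1]
        it_best_len = tour_length(it_best, dist)
        if it_best_len < best_len:
            best_tour, best_len = it_best, it_best_len
        # Evaporate both fields; the floor keeps every weight strictly positive.
        for i in range(n):
            for j in range(n):
                pos[i][j] = max(1e-6, (1.0 - rho) * pos[i][j])
                neg[i][j] *= 1.0 - rho
        good = tour_edges(it_best)
        for i, j in map(tuple, good):  # reward the edges of the best ant
            pos[i][j] += 1.0 / it_best_len
            pos[j][i] += 1.0 / it_best_len
        for i, j in map(tuple, tour_edges(it_worst) - good):  # mark "no-entry" edges
            neg[i][j] += 1.0
            neg[j][i] += 1.0
    return best_tour, best_len


# Hypothetical instance: the four corners of a unit square.
points = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
tour, length = aco_negative_pheromone(dist)
print(tour, length)
```

For the news corpus, the Euclidean `dist` matrix would be replaced by a document-to-document distance, for example one derived from the Q-analysis description mentioned in the abstract.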

Keywords: Self-Organization, Stigmergy, Co-Evolution, Swarm Intelligence, Dynamic Optimization, Foraging, Cooperative Learning, Hamiltonian cycles, Text Mining, Textual Corpora, Information Retrieval, Knowledge Discovery, Sentiment Analysis, Q-Analysis, Data Mining, Journalism, The Guardian.

References:

[1] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000), pages 58-64, 2000.
[2] David M. S. Rodrigues, Jorge Louçã, and Vitorino Ramos. From standard to second-order Swarm Intelligence phase-space maps. In Stefan Thurner, editor, 8th European Conference on Complex Systems, Vienna, Austria, September 2011.
[3] Vitorino Ramos, David M. S. Rodrigues, and Jorge Louçã. Second order Swarm Intelligence. In Jeng-Shyang Pan, Marios M. Polycarpou, Michał Woźniak, André C.P.L.F. Carvalho, Hector Quintian, and Emilio Corchado, editors, HAIS’13. 8th International Conference on Hybrid Artificial Intelligence Systems, volume 8073 of Lecture Notes in Computer Science, pages 411-420. Springer Berlin Heidelberg, Salamanca, Spain, September 2013.
[4] Elva J.H. Robinson, Duncan Jackson, Mike Holcombe, and Francis L.W. Ratnieks. No entry signal in ant foraging (Hymenoptera: Formicidae): new insights from an agent-based model. Myrmecological News, 10(120), 2007.
[5] Ronald Harry Atkin. Mathematical Structure in Human Affairs. Heinemann Educational Publishers, 48 Charles Street, London, 1st edition, 1974.
[6] J. H. Johnson. A survey of Q-analysis, part 1: The past and present. In Proceedings of the Seminar on Q-analysis and the Social Sciences, University of Leeds, September 1983.
[7] David M. S. Rodrigues. Identifying news clusters using Q-analysis and modularity. In Albert Diaz-Guilera, Alex Arenas, and Alvaro Corral, editors, Proceedings of the European Conference on Complex Systems 2013, Barcelona, September 2013.

Figure – Book cover of Toby Segaran’s “Programming Collective Intelligence – Building Smart Web 2.0 Applications”, O’Reilly Media, 368 pp., August 2007.

{scopus online description} Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting data-sets from other web sites, collect data from users of your own applications, and analyze and understand the data once you’ve found it. Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general — all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application.

{even if I don’t totally agree, here’s an “over-rated” description – especially on the scientific side, by someone “dwa” – link above} Programming Collective Intelligence is a new book from O’Reilly, written by Toby Segaran. The author graduated from MIT and is currently working at Metaweb Technologies. He develops ways to put large public data-sets into Freebase, a free online semantic database. You can find more information about him on his blog: http://blog.kiwitobes.com/. Web 2.0 cannot exist without Collective Intelligence. The “giants” use it everywhere: YouTube recommends similar movies, Last.fm knows what you would like to listen to, Flickr knows which photos are your favorites, etc. This technology empowers intelligent search, clustering, price modelling and ranking on the web. I cannot imagine a modern service without data analysis. That is why it is worth starting to read about it. There are many titles about collective intelligence, but recently I have read two: this one and “Collective Intelligence in Action”. Both are very pragmatic, but the O’Reilly one is more focused on the merits of CI. The code listings are much shorter (the examples are written in Python, so that was easy). In general, comparing these books is like comparing Java with Python. If you would like to build a recommendation engine the “in Action”/Java way, you would have to read the whole book, attach extra jars and design dozens of classes. The rapid Python way requires reading only 15 pages and voilà, you have got the first recommendations. It is awesome!

So how about the rest of the book? There are still 319 pages! Further chapters cover discovering groups, searching, ranking, optimization, document filtering, decision trees, price models and genetic algorithms. The book explains how to implement Simulated Annealing, k-Nearest Neighbors, a Bayesian Classifier and many more. Take a look at the table of contents (here: http://oreilly.com/catalog/9780596529321/preview.html); it does not list all the algorithms, but you can find more information there. Each chapter has about 20-30 pages. You do not have to read them all; you can choose the most important ones and still know what is going on. Every chapter contains a minimal amount of theoretical introduction, which might not be enough for total beginners. I recommend this book to students who have had a statistics course (not only in IT or computing science): it will show you how to use your knowledge in practice, and there are many inspiring examples. For those who do not know Python, do not be afraid: at the beginning you will find a short introduction to the language syntax. All listings are very short and well described by the author, sometimes line by line. The book also contains the necessary information about the basic standard libraries responsible for XML processing and downloading web pages. If you would like to start learning about collective intelligence, I would strongly recommend reading “Programming Collective Intelligence” first, then “Collective Intelligence in Action”. The first one shows how easy it is to implement basic algorithms; the second one shows you how to use existing open source projects related to machine learning.

With Africa’s ongoing dramatic need for contemporary maps (currently, Google promises to launch its first exhaustive, world-wide, open-access digital cartography of the African continent very soon), back in 1999-2000 we turned a very simple idea into a research project (at my previous lab, CVRM IST). Instead of producing new maps in the regular standard way, which is costly (especially for African countries) as well as time consuming (imagine the amount of money and time needed to cover the whole continent with high-resolution aerial photos), the idea was to hybridize, through an automatic procedure (with the help of Artificial Intelligence), new data coming from satellites with old data coming from the computational analysis of images of old colonial maps. For instance, old roads segmented in old maps would help us find the new ones in current satellite images, as well as those that were lost. The same goes for bridges, buildings, numbers, letters on the map, etc. However, in order to do this, several preparatory steps were needed. One of those crucial steps was to obtain (segment, known to be one of the hardest procedures in image processing) the old roads, buildings and airports in the old maps. Back in 1999-2000, while dealing with several tasks in this research project (AUTOCARTIS – Automatic Methods for Updating Cartographic Maps), I started to think of using evolutionary computation to tackle and overcome this precise problem, in what later became one of the first uses of Genetic Algorithms in image analysis. The result can be seen below. Meanwhile, the experience gained with AUTOCARTIS was later useful not only for digital old books (Visão Magazine, March 2002), but also for helping us find water on Mars (in the MARS EXPRESS European project – Expresso newspaper, May 2003), in which the CVRM lab was one of the European partners.
Very often in life, simple ideas (I owe this one to Prof. Fernando Muge and Prof. Pedro Pina) are the best ones. This is particularly true in science.

Figure – One original image (left – Luanda, Angola map) and two segmentation examples, rivers and roads respectively, obtained through the proposed Genetic Algorithm (low resolution images). [At the same time, this precise map of Luanda was used by me, along with the face of Einstein, to benchmark several dynamic image adaptive perception versus memory experiments via ant-like artificial life systems, over what I then entitled Digital Image Habitats]

Vitorino Ramos, Fernando Muge, Map Segmentation by Colour Cube Genetic K-Mean Clustering, Proc. of ECDL’2000 – 4th European Conference on Research and Advanced Technology for Digital Libraries, J. Borbinha and T. Baker (Eds.), ISBN 3-540-41023-6, Lecture Notes in Computer Science, Vol. 1923, pp. 319-323, Springer-Verlag, Heidelberg, Lisbon, Portugal, 18-20 Sep. 2000.

Segmentation of a colour image composed of different kinds of texture regions can be a hard problem, namely computing exact texture fields and deciding the optimum number of segmentation areas in an image when it contains similar and/or non-stationary texture fields. In this work, a method is described for evolving adaptive procedures for these problems. In many real-world applications, data clustering constitutes a fundamental issue whenever behavioural or feature domains can be mapped into topological domains. We formulate the segmentation problem for such images as an optimisation problem and adopt the evolutionary strategy of Genetic Algorithms for the clustering of small regions in colour feature space. The present approach embeds k-Means unsupervised clustering methods into Genetic Algorithms, namely to guide this Evolutionary Algorithm in its search for the optimal or sub-optimal data partition, a task that, as we know, requires a non-trivial search because of its NP-complete nature. To solve this task, the appropriate genetic coding is also discussed, since this is a key aspect of the implementation. Our purpose is to demonstrate the efficiency of Genetic Algorithms for automatic and unsupervised texture segmentation. Some examples on colour maps are presented and overall results discussed.
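As a rough illustration of how k-Means can guide a Genetic Algorithm in colour space, here is a minimal sketch. It is not the genetic coding discussed in the paper: the chromosome (a list of k colour centres), the fitness (within-cluster squared error), the uniform crossover, the Gaussian mutation and all parameter values are assumptions made for this example, and the six “pixels” are made up.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two colours."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(pixels, centres):
    """Index of the nearest centre for every pixel."""
    return [min(range(len(centres)), key=lambda c: sq_dist(p, centres[c])) for p in pixels]


def within_cluster_error(pixels, centres):
    """Fitness to minimise: summed squared distance to the nearest centre."""
    return sum(min(sq_dist(p, c) for c in centres) for p in pixels)


def kmeans_step(pixels, centres):
    """One k-Means update: move each centre to the mean of its pixels."""
    labels = assign(pixels, centres)
    updated = []
    for c, centre in enumerate(centres):
        members = [p for p, label in zip(pixels, labels) if label == c]
        if members:
            updated.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
        else:
            updated.append(centre)  # an empty cluster keeps its old centre
    return updated


def genetic_kmeans(pixels, k, pop_size=20, generations=30, sigma=10.0, seed=0):
    """Evolve sets of k colour centres; each generation starts with a k-Means step."""
    rng = random.Random(seed)
    population = [rng.sample(pixels, k) for _ in range(pop_size)]
    for _ in range(generations):
        population = [kmeans_step(pixels, ind) for ind in population]
        population.sort(key=lambda ind: within_cluster_error(pixels, ind))
        survivors = population[: pop_size // 2]  # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]  # uniform crossover
            child = [
                tuple(min(255.0, max(0.0, ch + rng.gauss(0.0, sigma))) for ch in centre)
                for centre in child
            ]  # Gaussian mutation, clipped to the colour cube
            children.append(child)
        population = survivors + children
    return min(population, key=lambda ind: within_cluster_error(pixels, ind))


# Hypothetical RGB pixels: three reddish and three bluish ones.
pixels = [(250, 10, 10), (240, 20, 15), (245, 5, 20),
          (10, 10, 250), (20, 15, 240), (5, 20, 245)]
centres = genetic_kmeans(pixels, k=2)
print(assign(pixels, centres))
```

On a real map the pixel list would come from the image itself, and the labels returned by `assign` would give the segmentation, one colour class per pixel.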

(to obtain the respective PDF file follow link above or visit chemoton.org)

[...] People should learn how to play Lego with their minds. Concepts are building bricks [...] V. Ramos, 2002.
