Much of our data sits in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text in a meaningful way.
One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.
Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests. His results, co-authored with Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.
Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.
When Amaral explored how LDA worked, he found that the algorithm produced different results each time for the same set of data, and it often did so inaccurately. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. By doing this, they were able to prevent text overlap among documents.
"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stem (so "star" and "stars" would be considered the same word). It then builds a network of connecting words and identifies a "community" of related words (just as one could look for communities of people in Facebook). The words within a given community define a topic.
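To make the three steps concrete, here is a minimal Python sketch of the general idea, not the published TopicMapping code. Everything in it is an assumption made for illustration: the toy `stem` function only strips a plural "s", the network simply links words that share a document, and plain connected components stand in for the community detection step, which the article does not specify.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    """Toy stemmer (assumption): lowercase and strip a trailing plural 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_word_network(documents):
    """Link every pair of stemmed words that appear in the same document."""
    graph = defaultdict(set)
    for text in documents:
        stems = sorted({stem(w) for w in text.split()})
        for a, b in combinations(stems, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def communities(graph):
    """Connected components, a crude stand-in for real community detection."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical mini-corpus: two English and two French documents.
docs = [
    "stars shine at night",
    "the star is bright at night",
    "les chats dorment",
    "les chiens dorment",
]
topics = communities(build_word_network(docs))
for topic in topics:
    print(sorted(topic))
```

Because the English and French documents share no words, the sketch puts them in two separate communities, which mirrors the language-separation test described above. A real corpus has overlapping vocabulary, so connected components would merge everything into one group; that is exactly where a proper community detection method is needed.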
The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.
These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.
"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."
Thursday, January 29, 2015
Friday, January 9, 2015
10. Use Google to Search Certain Sites
If you really like a web site but its search tool isn't very good, fret not: Google almost always does a better job, and you can use it to search that site with a simple operator. For example, if you want to find an old Lifehacker article, just type site:lifehacker.com before your search terms (e.g. site:lifehacker.com hackintosh). The same goes for your favorite forums, blogs, and even web services. In fact, it's actually really good for finding free audiobooks, searching for free stuff without the spam, and more.
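If you run site-restricted searches often, you could build the search URL in a few lines of Python. This is only a sketch: the helper name `site_search_url` is made up for illustration, and it assumes Google's usual `https://www.google.com/search?q=...` address format.

```python
from urllib.parse import urlencode


def site_search_url(site, terms):
    """Build a Google search URL limited to one site via the site: operator."""
    query = f"site:{site} {terms}"
    # urlencode percent-encodes the colon and turns the space into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("lifehacker.com", "hackintosh"))
# -> https://www.google.com/search?q=site%3Alifehacker.com+hackintosh
```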
9. Find Product Names, Recipes, and More with Reverse Image Search
Google's reverse image search is great if you're looking for the source of a photo, a wallpaper, or more images like it. However, reverse image search is also great for tracking down information, like who makes the chair in a picture or how to make the meal in a photo. Just punch in an image like you normally would, but look at Google's regular results instead of the image results. You'll probably find a lot.
8. Get "Wildcard" Suggestions Through Autocomplete
A lot of advanced search engines let you put a * in the middle of your terms to denote "anything." Google does too, but it doesn't always work the way you want. However, you can still get wildcard suggestions, of a sort, by typing in a full phrase in Google and then deleting the word you want to replace. For example, you can search for how to jailbreak an iphone and remove one word to see all the suggestions for how to ____ an iphone.
7. Find Free Downloads of Any Type
Ever needed an old Android app but couldn't find the APK you were looking for? Or wanted an MP3 but couldn't find the right version? Google has a few search tools that, when used together, can unlock a plethora of downloads: inurl, intitle, and filetype. For example, to find free Android APKs, you'd search for -inurl:htm -inurl:html intitle:"index of" apk to see site indexes of stored APK files. You can use this to find Android apps, music files, free ebooks, comic books, and more.
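The query above is just a handful of operators joined by spaces, so swapping in another file type is easy to script. A small sketch, with `index_of_query` as a made-up helper name:

```python
def index_of_query(file_type):
    """Compose the directory-listing query from the tip for one file type."""
    parts = [
        "-inurl:htm",          # drop results whose URL contains "htm"
        "-inurl:html",         # drop results whose URL contains "html"
        'intitle:"index of"',  # keep pages titled like server directory listings
        file_type,             # the kind of file you are after, e.g. apk or mp3
    ]
    return " ".join(parts)


print(index_of_query("apk"))
# -> -inurl:htm -inurl:html intitle:"index of" apk
```

Calling `index_of_query("mp3")` gives the same query with mp3 in place of apk.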
6. Discover Alternatives to Popular Sites, Apps, and Products
You've probably searched for comparisons on Google before, like roku vs apple tv. But what if you don't know what you want to compare a product to, or you want to see what other competitors are out there? Just type in roku vs and see what Google's autocomplete adds. It'll most likely list the most popular competitors to the Roku so you know what else to check out. You can also search for better than roku to see alternatives.
5. Access Google Cache Directly from the Search Bar
We all know Google Cache can be a great tool, but there's no need to search for the page and then hunt for that "Cached" link: just type cache: before that site's URL (e.g. cache:http://lifehacker.com). If Google has the site in its cache, it'll pull it right up for you. If you want to simplify the process even more, a bookmarklet is handy to have around. It's great for seeing an old version of a page, accessing a site when it's down, or getting past something like the SOPA blackout.
4. Bypass Paywalls, Blocked Sites, and More with a Google Proxy
You may already know that you can sometimes bypass paywalls, get around blocked sites, and download files by funneling a site through Google Translate or Google Mobilizer. That's a clever search trick in and of itself, but just like Google Cache, you can make the process a lot faster by keeping a few URLs on hand. Just add the URL you want to visit to the end of the Google URL (e.g. http://translate.google.com/translate?sl=ja&tl=en&u=http://example.com/) and you're good to go.
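Here is a rough sketch of how that Translate URL could be assembled in Python. The function name `translate_proxy_url` and its defaults are assumptions for illustration; the `sl`, `tl`, and `u` parameters come straight from the example URL above, and whether the service still accepts them today is not something this snippet can promise.

```python
from urllib.parse import urlencode


def translate_proxy_url(target, source_lang="ja", target_lang="en"):
    """Wrap a target URL in the Google Translate address from the tip."""
    # sl = source language, tl = target language, u = page to fetch.
    params = {"sl": source_lang, "tl": target_lang, "u": target}
    return "http://translate.google.com/translate?" + urlencode(params)


print(translate_proxy_url("http://example.com/"))
# -> http://translate.google.com/translate?sl=ja&tl=en&u=http%3A%2F%2Fexample.com%2F
```

The target URL comes out percent-encoded here, which is the safer form of the hand-typed example when the target has its own query string.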
3. Search for People on Google Images
2. Get More Precise Time-Based Search Results
You've probably seen the option in Google that lets you filter results by time, such as the past hour, day, or week. But if you want something more specific, like the past 10 minutes, you can do so with a URL hack. Just add &tbs=qdr: to the end of the URL, along with the time you want to search, which can be h5 for 5 hours, n5 for 5 minutes, or s5 for 5 seconds (substituting any number you want). So, to search within the past 10 minutes, you'd add &tbs=qdr:n10 to your URL. It's handy for getting the most up-to-the-minute news.
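The same URL hack can be scripted. This is a sketch only: `add_time_filter` and the `UNITS` table are made-up names, and the unit letters are limited to the three listed in the tip.

```python
# Unit letters taken from the tip: s = seconds, n = minutes, h = hours.
UNITS = {"seconds": "s", "minutes": "n", "hours": "h"}


def add_time_filter(search_url, amount, unit):
    """Append a tbs=qdr: time filter to an existing Google search URL."""
    if amount <= 0:
        raise ValueError("amount must be a positive number")
    # Use '&' if the URL already has a query string, otherwise start one.
    separator = "&" if "?" in search_url else "?"
    return f"{search_url}{separator}tbs=qdr:{UNITS[unit]}{amount}"


print(add_time_filter("https://www.google.com/search?q=news", 10, "minutes"))
# -> https://www.google.com/search?q=news&tbs=qdr:n10
```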
1. Refine Your Search Terms with Advanced Operators
If your Google search just isn't returning the quality content you want, this little URL trick might find more in-depth articles on the subject you're searching for.