What is a basic yet efficient UX for "cleaning up a table of similar records"?
I am interested in curating data of various sorts, one example of which is company metadata. Another example is movie metadata, or book metadata, or sports player metadata. Basically, a spreadsheet of thousands of rows (if not millions or billions of rows), with 5-10 properties (just to keep it simple for now).
So for a company, you might want:
- the name
- the wikipedia url
- the founding date
- the website
- the twitter url
All of this is simple text data (for this question at least, i.e. not bringing finding logos or images into the mix just yet for this).
Usually, I "bootstrap" the data collection and curation project by finding already-somewhat-curated CSV files or other downloads from government websites and the like. These lists are usually fairly small (< 1000 items), and each one varies widely in structure and in the cleanliness of its data. For example, you might have "Google", "Google Inc", "Google, Inc." (notice the punctuation in the last one), "google.com", etc. All mean the same thing, but they are written down differently. Or you might just plain have the exact same thing twice or more.
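As a concrete illustration of the normalization involved, here is a minimal stdlib-only sketch (the suffix list is my own assumption, not exhaustive) that maps variants like "Google", "Google Inc" and "Google, Inc." to the same key:

```python
import re

# Common legal suffixes to strip (illustrative list, not exhaustive).
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "co"}

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes
    so spelling variants of the same company collapse to one key."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

A key like this is only a first pass; a variant like "google.com" is a domain rather than a misspelling and would still need separate handling.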
The question is: what is the most efficient and effective way to quickly and correctly mass-edit this large list of companies? I might collect 20,000 company records, or in other contexts, millions or billions of records (like movies or book titles). Some of the things you want to accomplish are:
- Remove duplicates.
- Merge partial data from several records into a cohesive single record (name and website from one, twitter url and founding date and name from another).
- Clean up the names so they are properly punctuated/capitalized/formatted.
- Double check the URLs work and are correct.
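The first two bullets (dedupe and merge) can be sketched as a key-based merge that keeps the first non-empty value per field. This is a hypothetical sketch, and it assumes the names have already been normalized into a usable grouping key:

```python
def merge_records(records, key_field="name"):
    """Collapse records that share a key, keeping the first
    non-empty value seen for each field."""
    merged = {}
    for rec in records:
        key = rec.get(key_field, "").strip().lower()
        slot = merged.setdefault(key, {})
        for field, value in rec.items():
            if value and not slot.get(field):
                slot[field] = value
    return list(merged.values())

# Two partial records for the same company...
rows = [
    {"name": "Acme", "website": "acme.com", "twitter": ""},
    {"name": "acme", "twitter": "https://twitter.com/acme", "founded": "1947"},
]
# ...merge into one record carrying name, website, twitter, and founding date.
```

In a real tool you would surface conflicting non-empty values to a human instead of silently preferring the first one.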
Some of this work is "data cleaning"; other parts involve search-engine lookups to check the quality of the content. You also need to be able to easily sort the records by individual columns (which would make it easier to spot duplicate companies in some cases). (Note: it doesn't actually need to be laid out like a spreadsheet at all. In fact, that is why I am asking the question in the first place.)
In my head, you would show one field/column at a time, be able to scroll through thousands of records, and search and sort by the data in the currently viewed slice. For example, when checking whether the wikipedia urls are correct, you might paste one into the app to see if there are any duplicates (hence, fast duplicate checking), and then paste it into the browser to verify it. Likewise, if you are editing the name, you would sort by name and somehow do a fuzzy search to find potentially similar candidates (although from a technical/coding perspective, that seems very hard to implement well).
But that is all just the cleaning / deduping / quality-checking process. There is still the necessary step of double-checking the work: doing it all over again, a second time, by someone else.
Then there is making sure the contributors are doing it right (producing high-quality results). Maybe gamification, like leaderboards, would motivate people to put their public face on the work and back it with their reputation.
What is the most efficient way to get high quality clean data results in this sort of situation?