How can I fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! Assume that cleaning has already been applied and that there may be typos and insertions.



So far agrep is the closest tool I've found that might work. I can use Levenshtein distances with agrep, which count the number of deletions, insertions, and substitutions between two strings. agrep will return the string with the smallest distance (the most similar).



However, I've been having trouble going from matching a single value to applying this across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there has to be an easier way.



See the following code:



# example data: a has prices, b has quantities, and the names differ slightly
a <- data.frame(name  = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty  = c(9, 99, 10))

# crude row-by-row lookup: x gets the matched string, Y its index in b$name
for (i in 1:6) {
  a$x[i] <- agrep(a$name[i], b$name, value = TRUE,  max = list(del = 0.2, ins = 0.3, sub = 0.4))
  a$Y[i] <- agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
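
As a point of comparison, here is a minimal loop-free sketch of the same lookup using base R's adist(), which returns the whole Levenshtein distance matrix in one call; the closest b$name can then be picked for every a$name without an explicit loop:

# Levenshtein distances between every a$name and every b$name (6 x 3 matrix)
dists <- adist(as.character(a$name), as.character(b$name))

# closest b entry and its distance, for each row of a
a$closest <- b$name[apply(dists, 1, which.min)]
a$dist    <- apply(dists, 1, min)
a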

  • Based on everyone's feedback and some poking around of my own, I created a function that solves my exact problem. Code can be found here: github.com/Adamishere/Fuzzymatching/blob/master/…

    – A L
    Mar 1 '17 at 15:59

  • Thank you for this function, it is quite useful. However, I am not able to pass my columns as string1, string2, and id2. My data is in a data.table, so I am not sure how I should pass them when calling the function. Could you please suggest? Sorry if my question is very basic; I have started learning R and still have a long way to go.

    – user1412
    Mar 11 '17 at 17:34

  • I would just use data.frame(), then once the match is done, convert to data.table().

    – A L
    Mar 21 '17 at 20:46

  • The fuzzyjoin package might help - see the answer below using fuzzyjoin::stringdist_left_join.

    – Arthur Yip
    Jun 6 '17 at 4:03

r string-matching fuzzy-search fuzzy-comparison

asked Oct 16 '14 at 13:37 by A L; edited Nov 1 '15 at 3:03 by pnuts

6 Answers

Answer (score 9) – C8H10N4O2, answered Oct 16 '14 at 15:44:

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.



One-to-one case (requires assignment algorithm):



When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.



I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.



library(stringdist)

d <- expand.grid(a$name, b$name)  # distance matrix in long form
names(d) <- c("a_name", "b_name")
d$dist <- stringdist(d$a_name, d$b_name, method = "jw")  # string edit distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a))  # assgn variable: 0 for unassigned but assignable,
                           # 1 for already assigned, -1 for unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0])  # identify closest pair, arbitrarily selecting the 1st if there are multiple
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}

data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))


Produces the assignment:



        a          b       d
1  Ace Co    Ace Co. 0.04762
2   Bayes Bayes Inc. 0.16667
3     asd       asdf 0.08333


I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
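
For the "optimal" route mentioned above, here is a sketch of an exact one-to-one assignment using the Hungarian method via clue::solve_LSAP (this assumes the clue package; stringdistmatrix() builds the distance matrix, and solve_LSAP() requires the shorter list on the rows):

library(stringdist)
library(clue)

a_names <- as.character(a$name)
b_names <- as.character(b$name)

# Jaro-Winkler distances with b on the rows (solve_LSAP needs nrow <= ncol)
dmat <- stringdistmatrix(b_names, a_names, method = "jw")

# Hungarian algorithm: one a per b, minimising the total assigned distance
idx <- as.integer(solve_LSAP(dmat))

data.frame(a = a_names[idx],
           b = b_names,
           d = dmat[cbind(seq_along(b_names), idx)])

On this example it should return the same three pairs as the greedy heuristic, but it guarantees the minimum total distance in cases where the greedy choices would conflict.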



Many-to-one case (not an assignment problem):



do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))


Produces the result:



   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333


Edit: use method="jw" to produce desired results. See help("stringdist-package")






  • Thanks! This is very helpful. Although I am curious: in the many-to-one case, the results do not seem correct, as they are not returning the best matches after the first row.

    – A L
    Oct 16 '14 at 19:47

  • @Adam Lee It depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

    – C8H10N4O2
    Oct 16 '14 at 20:28

  • @Adam Lee try it as edited...

    – C8H10N4O2
    Oct 16 '14 at 20:42

  • Ah, I see! Thank you, so it was the distance metric used that was causing that. Again, this is very helpful!

    – A L
    Oct 17 '14 at 19:07

  • This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2, ]. After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

    – Mike Honey
    Sep 4 '17 at 1:00

Answer (score 8) – Arthur Yip, answered Jun 6 '17 at 4:01:

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.



As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.



As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.




a <- data.frame(name  = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty  = c(9, 99, 10))

library(fuzzyjoin); library(dplyr)

stringdist_join(a, b,
                by = "name",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 99,
                distance_col = "dist") %>%
  group_by(name.x) %>%
  top_n(1, -dist)

#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price     name.y   qty       dist
#>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
#> 1 Ace Co    10    Ace Co.     9 0.04761905
#> 2  Bayes    13 Bayes Inc.    99 0.16666667
#> 3    asd     2       asdf    10 0.08333333
#> 4    Bcy     1 Bayes Inc.    99 0.37777778
#> 5   Baes    15 Bayes Inc.    99 0.20000000
#> 6   Bays     1 Bayes Inc.    99 0.20000000
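
As a variation (a sketch, with an arbitrary 0.2 cutoff): if you only want matches below some distance threshold rather than the single best match, set max_dist to that cutoff instead of 99; in left mode, rows of a with no match within the cutoff are kept with NA on the b side.

stringdist_join(a, b,
                by = "name",
                mode = "left",
                method = "jw",
                max_dist = 0.2,        # arbitrary cutoff; anything more distant is not joined
                distance_col = "dist")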





Answer (score 2) – lawyeR, answered Oct 16 '14 at 20:36:

    I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.



    install.packages("ipred")
    install.packages("evd")
    install.packages("RSQLite")
    install.packages("ff")
    install.packages("ffbase")
    install.packages("ada")
    install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

    require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

    compareJW <- function(string, vec, cutoff) {
    require(RecordLinkage)
    jarowinkler(string, vec) > cutoff
    }

    a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
    b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
    a$name <- as.character(a$name)
    b$name <- as.character(b$name)

    test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
    data.frame(name = a$name, price = a$price, test = test)

    > data.frame(name = a$name, price = a$price, test = test)
    name price test
    1 Ace Co 10 TRUE
    2 Bayes 13 TRUE
    3 asd 2 TRUE
    4 Bcy 1 FALSE
    5 Baes 15 TRUE
    6 Bays 1 FALSE
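
Note that compareJW() above appears to compare the two name vectors element-wise (recycling the shorter one), so it flags whether a name has a close counterpart but not which one. A small sketch that picks the best-scoring b name for each a name with the same jarowinkler() function:

# for each name in a, the b name with the highest Jaro-Winkler similarity
best_b <- sapply(a$name, function(nm) {
  scores <- jarowinkler(nm, b$name)   # one score per candidate in b
  b$name[which.max(scores)]
})
data.frame(name = a$name, best_match = best_b)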





    • RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

      – Kayle Sawyer
      Mar 4 at 4:38

Answer (score 1) – YummyLin Yang, answered Nov 12 '15 at 12:01:

I agree with the answer above ("Not familiar with AGREP but here's example using stringdist for your distance matrix"), but adding the signature function below, from Merging Data Sets Based on Partially Matched Data Elements, will be more accurate, since the Levenshtein calculation is based on position/addition/deletion.



## Here's where the algorithm starts...
## I'm going to generate a signature from country names to reduce some of the minor differences between strings.
## In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
## So for example, United Kingdom would become kingdomunited.
## We might also remove stopwords such as 'the' and 'of'.
signature <- function(x) {
  sig <- paste(sort(unlist(strsplit(tolower(x), " "))), collapse = '')
  return(sig)
}
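
A short usage sketch (the country-name example comes from the comments above; applying it to the company columns is just an illustration): compute a signature for each name and then join or compute distances on the signatures instead of the raw strings.

signature("United Kingdom")
# [1] "kingdomunited"

a$sig <- sapply(as.character(a$name), signature)
b$sig <- sapply(as.character(b$name), signature)
# the sig columns can then be fed into the stringdist/fuzzyjoin matching shown above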





Answer (score 1):

      I use lapply for those circumstances:



yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance = 0.01)


Then, to write it as a CSV, it's not so straightforward:



      write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
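
A related sketch, using the data frames a and b from the question (the 0.3 threshold is arbitrary): wrapping agrep in sapply() returns a plain character vector, taking the first hit and NA when there is none, which writes to CSV without the matrix() workaround.

closest <- sapply(as.character(a$name), function(nm) {
  hits <- agrep(nm, b$name, value = TRUE, max.distance = 0.3)
  if (length(hits) > 0) hits[1] else NA_character_
})

write.csv(data.frame(name = a$name, match = closest),
          file = "matches.csv", row.names = FALSE)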





Answer (score -1):

Here is what I used to get the number of times a company appears in a list even though the company names are inexact matches:



step.1 Install the phonics package

step.2 Create a new column called "soundexcodes" in "mylistofcompanynames"

step.3 Use the soundex function to return the soundex codes of the company names in "soundexcodes"

step.4 Copy the company names AND the corresponding soundex codes into a new file (2 columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile"

step.5 Remove duplicates of soundexcodes in "companysoundexcodestrainingfile"

step.6 Go through the list of remaining company names and change the names as you want them to appear in your original company list

example:
Amazon Inc A625 can be Amazon A625
Accenture Limited A455 can be Accenture A455

step.7 Perform a left_join or (simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes"

step.8 The result should have the original list with a new column called "co.y", which has the name of the company the way you left it in the training file

step.9 Sort "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the vlookup of the soundex code
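
A rough code sketch of steps 1-7 (this assumes the phonics package's soundex() function; the object and column names are made up for illustration, and non-letter characters are stripped so soundex() only sees alphabetic input):

library(phonics)

mylistofcompanynames <- data.frame(
  companynames = c("Amazon Inc", "Amazon", "Accenture Limited"),
  stringsAsFactors = FALSE)

# steps 2-3: soundex code for each (letters-only) company name
mylistofcompanynames$soundexcodes <-
  soundex(gsub("[^A-Za-z]", "", mylistofcompanynames$companynames))

# steps 4-6: keep one "canonical" name per soundex code (edit these by hand as described)
training <- mylistofcompanynames[!duplicated(mylistofcompanynames$soundexcodes), ]
names(training)[1] <- "canonicalname"

# step 7: join the original list back to the training file by soundex code
merge(mylistofcompanynames, training, by = "soundexcodes", all.x = TRUE)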






        share|improve this answer

























          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f26405895%2fhow-can-i-match-fuzzy-match-strings-from-two-datasets%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          6 Answers
          6






          active

          oldest

          votes








          6 Answers
          6






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          9














          The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.



          One-to-one case (requires assignment algorithm):



          When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.



          Not familiar with AGREP but here's example using stringdist for your distance matrix.



          library(stringdist)
          d <- expand.grid(a$name,b$name) # Distance matrix in long form
          names(d) <- c("a_name","b_name")
          d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

          # Greedy assignment heuristic (Your favorite heuristic here)
          greedyAssign <- function(a,b,d){
          x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
          # 1 for already assigned, -1 for unassigned and unassignable
          while(any(x==0)){
          min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
          a_sel <- a[d==min_d & x==0][1]
          b_sel <- b[d==min_d & a == a_sel & x==0][1]
          x[a==a_sel & b == b_sel] <- 1
          x[x==0 & (a==a_sel|b==b_sel)] <- -1
          }
          cbind(a=a[x==1],b=b[x==1],d=d[x==1])
          }
          data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))


          Produces the assignment:



                 a          b       d
          1 Ace Co Ace Co. 0.04762
          2 Bayes Bayes Inc. 0.16667
          3 asd asdf 0.08333


          I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.



          Many-to-one case (not an assignment problem):



          do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))


          Produces the result:



             a_name     b_name    dist
          1 Ace Co Ace Co. 0.04762
          11 Baes Bayes Inc. 0.20000
          8 Bayes Bayes Inc. 0.16667
          12 Bays Bayes Inc. 0.20000
          10 Bcy Bayes Inc. 0.37778
          15 asd asdf 0.08333


          Edit: use method="jw" to produce desired results. See help("stringdist-package")






          share|improve this answer


























          • Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

            – A L
            Oct 16 '14 at 19:47













          • @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

            – C8H10N4O2
            Oct 16 '14 at 20:28











          • @Adam Lee try it as edited...

            – C8H10N4O2
            Oct 16 '14 at 20:42











          • Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

            – A L
            Oct 17 '14 at 19:07






          • 1





            This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

            – Mike Honey
            Sep 4 '17 at 1:00


















          9














          The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.



          One-to-one case (requires assignment algorithm):



          When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.



          Not familiar with AGREP but here's example using stringdist for your distance matrix.



          library(stringdist)
          d <- expand.grid(a$name,b$name) # Distance matrix in long form
          names(d) <- c("a_name","b_name")
          d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

          # Greedy assignment heuristic (Your favorite heuristic here)
          greedyAssign <- function(a,b,d){
          x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
          # 1 for already assigned, -1 for unassigned and unassignable
          while(any(x==0)){
          min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
          a_sel <- a[d==min_d & x==0][1]
          b_sel <- b[d==min_d & a == a_sel & x==0][1]
          x[a==a_sel & b == b_sel] <- 1
          x[x==0 & (a==a_sel|b==b_sel)] <- -1
          }
          cbind(a=a[x==1],b=b[x==1],d=d[x==1])
          }
          data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))


          Produces the assignment:



                 a          b       d
          1 Ace Co Ace Co. 0.04762
          2 Bayes Bayes Inc. 0.16667
          3 asd asdf 0.08333


          I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.



          Many-to-one case (not an assignment problem):



          do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))


          Produces the result:



             a_name     b_name    dist
          1 Ace Co Ace Co. 0.04762
          11 Baes Bayes Inc. 0.20000
          8 Bayes Bayes Inc. 0.16667
          12 Bays Bayes Inc. 0.20000
          10 Bcy Bayes Inc. 0.37778
          15 asd asdf 0.08333


          Edit: use method="jw" to produce desired results. See help("stringdist-package")






          share|improve this answer


























          • Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

            – A L
            Oct 16 '14 at 19:47













          • @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

            – C8H10N4O2
            Oct 16 '14 at 20:28











          • @Adam Lee try it as edited...

            – C8H10N4O2
            Oct 16 '14 at 20:42











          • Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

            – A L
            Oct 17 '14 at 19:07






          • 1





            This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

            – Mike Honey
            Sep 4 '17 at 1:00
















          9












          9








          9







          The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.



          One-to-one case (requires assignment algorithm):



          When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.



          Not familiar with AGREP but here's example using stringdist for your distance matrix.



          library(stringdist)
          d <- expand.grid(a$name,b$name) # Distance matrix in long form
          names(d) <- c("a_name","b_name")
          d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

          # Greedy assignment heuristic (Your favorite heuristic here)
          greedyAssign <- function(a,b,d){
          x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
          # 1 for already assigned, -1 for unassigned and unassignable
          while(any(x==0)){
          min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
          a_sel <- a[d==min_d & x==0][1]
          b_sel <- b[d==min_d & a == a_sel & x==0][1]
          x[a==a_sel & b == b_sel] <- 1
          x[x==0 & (a==a_sel|b==b_sel)] <- -1
          }
          cbind(a=a[x==1],b=b[x==1],d=d[x==1])
          }
          data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))


          Produces the assignment:



                 a          b       d
          1 Ace Co Ace Co. 0.04762
          2 Bayes Bayes Inc. 0.16667
          3 asd asdf 0.08333


          I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.



          Many-to-one case (not an assignment problem):



          do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))


          Produces the result:



             a_name     b_name    dist
          1 Ace Co Ace Co. 0.04762
          11 Baes Bayes Inc. 0.20000
          8 Bayes Bayes Inc. 0.16667
          12 Bays Bayes Inc. 0.20000
          10 Bcy Bayes Inc. 0.37778
          15 asd asdf 0.08333


          Edit: use method="jw" to produce desired results. See help("stringdist-package")






          share|improve this answer















          The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.



          One-to-one case (requires assignment algorithm):



          When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.



          Not familiar with AGREP but here's example using stringdist for your distance matrix.



          library(stringdist)
          d <- expand.grid(a$name,b$name) # Distance matrix in long form
          names(d) <- c("a_name","b_name")
          d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

          # Greedy assignment heuristic (Your favorite heuristic here)
          greedyAssign <- function(a,b,d){
          x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
          # 1 for already assigned, -1 for unassigned and unassignable
          while(any(x==0)){
          min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
          a_sel <- a[d==min_d & x==0][1]
          b_sel <- b[d==min_d & a == a_sel & x==0][1]
          x[a==a_sel & b == b_sel] <- 1
          x[x==0 & (a==a_sel|b==b_sel)] <- -1
          }
          cbind(a=a[x==1],b=b[x==1],d=d[x==1])
          }
          data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))


          Produces the assignment:



                 a          b       d
          1 Ace Co Ace Co. 0.04762
          2 Bayes Bayes Inc. 0.16667
          3 asd asdf 0.08333


          I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.



          Many-to-one case (not an assignment problem):



          do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))


          Produces the result:



             a_name     b_name    dist
          1 Ace Co Ace Co. 0.04762
          11 Baes Bayes Inc. 0.20000
          8 Bayes Bayes Inc. 0.16667
          12 Bays Bayes Inc. 0.20000
          10 Bcy Bayes Inc. 0.37778
          15 asd asdf 0.08333


          Edit: use method="jw" to produce desired results. See help("stringdist-package")







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Oct 16 '14 at 20:40

























          answered Oct 16 '14 at 15:44









          C8H10N4O2C8H10N4O2

          9,88255285




          9,88255285













          • Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

            – A L
            Oct 16 '14 at 19:47













          • @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

            – C8H10N4O2
            Oct 16 '14 at 20:28











          • @Adam Lee try it as edited...

            – C8H10N4O2
            Oct 16 '14 at 20:42











          • Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

            – A L
            Oct 17 '14 at 19:07






          • 1





            This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

            – Mike Honey
            Sep 4 '17 at 1:00





















          • Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

            – A L
            Oct 16 '14 at 19:47













          • @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

            – C8H10N4O2
            Oct 16 '14 at 20:28











          • @Adam Lee try it as edited...

            – C8H10N4O2
            Oct 16 '14 at 20:42











          • Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

            – A L
            Oct 17 '14 at 19:07






          • 1





            This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

            – Mike Honey
            Sep 4 '17 at 1:00



















          Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

          – A L
          Oct 16 '14 at 19:47







          Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row.

          – A L
          Oct 16 '14 at 19:47















          @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

          – C8H10N4O2
          Oct 16 '14 at 20:28





          @Adam Lee depends on how you define "best" matches. See ?stringdist or ?adist for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc."

          – C8H10N4O2
          Oct 16 '14 at 20:28













          @Adam Lee try it as edited...

          – C8H10N4O2
          Oct 16 '14 at 20:42





          @Adam Lee try it as edited...

          – C8H10N4O2
          Oct 16 '14 at 20:42













          Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

          – A L
          Oct 17 '14 at 19:07





          Ah I see! Thank you, so it was a matter of the distance metrics used that causing that. Again this is very helpful!

          – A L
          Oct 17 '14 at 19:07




          1




          1





          This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

          – Mike Honey
          Sep 4 '17 at 1:00







          This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. d <- d[d$dist < 0.2,] . After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful.

          – Mike Honey
          Sep 4 '17 at 1:00















          8














          Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.



          As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.



          As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.




          a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
          price = c(10, 13, 2, 1, 15, 1))
          b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
          qty = c(9, 99, 10))

          library(fuzzyjoin); library(dplyr);

          stringdist_join(a, b,
          by = "name",
          mode = "left",
          ignore_case = FALSE,
          method = "jw",
          max_dist = 99,
          distance_col = "dist") %>%
          group_by(name.x) %>%
          top_n(1, -dist)

          #> # A tibble: 6 x 5
          #> # Groups: name.x [6]
          #> name.x price name.y qty dist
          #> <fctr> <dbl> <fctr> <dbl> <dbl>
          #> 1 Ace Co 10 Ace Co. 9 0.04761905
          #> 2 Bayes 13 Bayes Inc. 99 0.16666667
          #> 3 asd 2 asdf 10 0.08333333
          #> 4 Bcy 1 Bayes Inc. 99 0.37777778
          #> 5 Baes 15 Bayes Inc. 99 0.20000000
          #> 6 Bays 1 Bayes Inc. 99 0.20000000





          share|improve this answer






























            8














            Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.



            As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.



            As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.




            a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
            price = c(10, 13, 2, 1, 15, 1))
            b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
            qty = c(9, 99, 10))

            library(fuzzyjoin); library(dplyr);

            stringdist_join(a, b,
            by = "name",
            mode = "left",
            ignore_case = FALSE,
            method = "jw",
            max_dist = 99,
            distance_col = "dist") %>%
            group_by(name.x) %>%
            top_n(1, -dist)

            #> # A tibble: 6 x 5
            #> # Groups: name.x [6]
            #> name.x price name.y qty dist
            #> <fctr> <dbl> <fctr> <dbl> <dbl>
            #> 1 Ace Co 10 Ace Co. 9 0.04761905
            #> 2 Bayes 13 Bayes Inc. 99 0.16666667
            #> 3 asd 2 asdf 10 0.08333333
            #> 4 Bcy 1 Bayes Inc. 99 0.37777778
            #> 5 Baes 15 Bayes Inc. 99 0.20000000
            #> 6 Bays 1 Bayes Inc. 99 0.20000000





            share|improve this answer




























              8












              8








              8







              Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.



              As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.



              As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.




              a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
              price = c(10, 13, 2, 1, 15, 1))
              b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
              qty = c(9, 99, 10))

              library(fuzzyjoin); library(dplyr);

              stringdist_join(a, b,
              by = "name",
              mode = "left",
              ignore_case = FALSE,
              method = "jw",
              max_dist = 99,
              distance_col = "dist") %>%
              group_by(name.x) %>%
              top_n(1, -dist)

              #> # A tibble: 6 x 5
              #> # Groups: name.x [6]
              #> name.x price name.y qty dist
              #> <fctr> <dbl> <fctr> <dbl> <dbl>
              #> 1 Ace Co 10 Ace Co. 9 0.04761905
              #> 2 Bayes 13 Bayes Inc. 99 0.16666667
              #> 3 asd 2 asdf 10 0.08333333
              #> 4 Bcy 1 Bayes Inc. 99 0.37777778
              #> 5 Baes 15 Bayes Inc. 99 0.20000000
              #> 6 Bays 1 Bayes Inc. 99 0.20000000





              share|improve this answer















              Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.



              As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.



              As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.




              a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
              price = c(10, 13, 2, 1, 15, 1))
              b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
              qty = c(9, 99, 10))

              library(fuzzyjoin); library(dplyr);

              stringdist_join(a, b,
              by = "name",
              mode = "left",
              ignore_case = FALSE,
              method = "jw",
              max_dist = 99,
              distance_col = "dist") %>%
              group_by(name.x) %>%
              top_n(1, -dist)

              #> # A tibble: 6 x 5
              #> # Groups: name.x [6]
              #> name.x price name.y qty dist
              #> <fctr> <dbl> <fctr> <dbl> <dbl>
              #> 1 Ace Co 10 Ace Co. 9 0.04761905
              #> 2 Bayes 13 Bayes Inc. 99 0.16666667
              #> 3 asd 2 asdf 10 0.08333333
              #> 4 Bcy 1 Bayes Inc. 99 0.37777778
              #> 5 Baes 15 Bayes Inc. 99 0.20000000
              #> 6 Bays 1 Bayes Inc. 99 0.20000000






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Mar 4 at 5:15

























              answered Jun 6 '17 at 4:01









              Arthur YipArthur Yip

              1,5201227




              1,5201227























                  2














                  I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.



                  install.packages("ipred")
                  install.packages("evd")
                  install.packages("RSQLite")
                  install.packages("ff")
                  install.packages("ffbase")
                  install.packages("ada")
                  install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

                  require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

                  compareJW <- function(string, vec, cutoff) {
                  require(RecordLinkage)
                  jarowinkler(string, vec) > cutoff
                  }

                  a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
                  b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
                  a$name <- as.character(a$name)
                  b$name <- as.character(b$name)

                  test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
                  data.frame(name = a$name, price = a$price, test = test)

                  > data.frame(name = a$name, price = a$price, test = test)
                  name price test
                  1 Ace Co 10 TRUE
                  2 Bayes 13 TRUE
                  3 asd 2 TRUE
                  4 Bcy 1 FALSE
                  5 Baes 15 TRUE
                  6 Bays 1 FALSE





                  share|improve this answer
























                  • RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                    – Kayle Sawyer
                    Mar 4 at 4:38
















                  2














                  I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.



                  install.packages("ipred")
                  install.packages("evd")
                  install.packages("RSQLite")
                  install.packages("ff")
                  install.packages("ffbase")
                  install.packages("ada")
                  install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

                  require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

                  compareJW <- function(string, vec, cutoff) {
                  require(RecordLinkage)
                  jarowinkler(string, vec) > cutoff
                  }

                  a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
                  b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
                  a$name <- as.character(a$name)
                  b$name <- as.character(b$name)

                  test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
                  data.frame(name = a$name, price = a$price, test = test)

                  > data.frame(name = a$name, price = a$price, test = test)
                  name price test
                  1 Ace Co 10 TRUE
                  2 Bayes 13 TRUE
                  3 asd 2 TRUE
                  4 Bcy 1 FALSE
                  5 Baes 15 TRUE
                  6 Bays 1 FALSE





                  share|improve this answer
























                  • RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                    – Kayle Sawyer
                    Mar 4 at 4:38














                  2












                  2








                  2







                  I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.



                  install.packages("ipred")
                  install.packages("evd")
                  install.packages("RSQLite")
                  install.packages("ff")
                  install.packages("ffbase")
                  install.packages("ada")
                  install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

                  require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

                  compareJW <- function(string, vec, cutoff) {
                  require(RecordLinkage)
                  jarowinkler(string, vec) > cutoff
                  }

                  a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
                  b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
                  a$name <- as.character(a$name)
                  b$name <- as.character(b$name)

                  test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
                  data.frame(name = a$name, price = a$price, test = test)

                  > data.frame(name = a$name, price = a$price, test = test)
                  name price test
                  1 Ace Co 10 TRUE
                  2 Bayes 13 TRUE
                  3 asd 2 TRUE
                  4 Bcy 1 FALSE
                  5 Baes 15 TRUE
                  6 Bays 1 FALSE





                  share|improve this answer













                  I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.



                  install.packages("ipred")
                  install.packages("evd")
                  install.packages("RSQLite")
                  install.packages("ff")
                  install.packages("ffbase")
                  install.packages("ada")
                  install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

                  require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

                  compareJW <- function(string, vec, cutoff) {
                  require(RecordLinkage)
                  jarowinkler(string, vec) > cutoff
                  }

                  a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
                  b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
                  a$name <- as.character(a$name)
                  b$name <- as.character(b$name)

                  test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8) # pick your level of cutoff, of course
                  data.frame(name = a$name, price = a$price, test = test)

                  > data.frame(name = a$name, price = a$price, test = test)
                  name price test
                  1 Ace Co 10 TRUE
                  2 Bayes 13 TRUE
                  3 asd 2 TRUE
                  4 Bcy 1 FALSE
                  5 Baes 15 TRUE
                  6 Bays 1 FALSE






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Oct 16 '14 at 20:36









                  lawyeRlawyeR

                  5,04731646




                  5,04731646













                  • RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                    – Kayle Sawyer
                    Mar 4 at 4:38



















                  • RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                    – Kayle Sawyer
                    Mar 4 at 4:38

















                  RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                  – Kayle Sawyer
                  Mar 4 at 4:38





                  RecordLinkage was put back on CRAN in 2015: cran.r-project.org/web/packages/RecordLinkage/index.html

                  – Kayle Sawyer
                  Mar 4 at 4:38











                  1














                  Agreed with above answer "Not familiar with AGREP but here's example using stringdist for your distance matrix." but add-on the signature function as below from Merging Data Sets Based on Partially Matched Data Elements will be more accurate since the calculation of LV is based on position/addition/deletion



                  ##Here's where the algorithm starts...
                  ##I'm going to generate a signature from country names to reduce some of the minor differences between strings
                  ##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
                  ##So for example, United Kingdom would become kingdomunited
                  ##We might also remove stopwords such as 'the' and 'of'.
                  signature=function(x){
                  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
                  return(sig)
                  }





answered Nov 12 '15 at 12:01 – YummyLin Yang
I use lapply for those circumstances:

yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance = 0.01)

Then, to write it as a csv, it's not so straightforward:

write.csv(matrix(yournewvector, ncol = 1), file = "yournewvector.csv", row.names = FALSE)
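Applied to the question's a and b data frames, that approach would look roughly like the sketch below (the max.distance value, the file name, and the NA handling are illustrative assumptions; agrep can return zero or several hits per name, which is what makes the csv step awkward):

# Make sure the name columns are character, in case they were read in as factors.
a$name <- as.character(a$name); b$name <- as.character(b$name)

# lapply() runs agrep() once per name in a, matching against all of b$name.
matches <- lapply(a$name, agrep, b$name, max.distance = 0.3, value = TRUE)

# Flatten to one match per row (first hit, or NA when nothing matched) before writing out.
a$match <- sapply(matches, function(m) if (length(m) == 0) NA_character_ else m[1])
write.csv(a, file = "matched.csv", row.names = FALSE)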





answered Apr 25 '17 at 17:45 – user3909910
Here is what I used to get the number of times a company appears in a list even though the company names are inexact matches:

step.1 Install the phonics package.

step.2 Create a new column called "soundexcodes" in "mylistofcompanynames".

step.3 Use the soundex function to put the soundex codes of the company names into "soundexcodes".

step.4 Copy the company names AND their corresponding soundex codes into a new file (two columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile".

step.5 Remove duplicate soundexcodes in "companysoundexcodestrainingfile".

step.6 Go through the list of remaining company names and change each name to the way you want it to appear in your original list, for example:
Amazon Inc A625 can be Amazon A625
Accenture Limited A455 can be Accenture A455

step.7 Perform a left_join (or simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes".

step.8 The result should have the original list with a new column called "co.y", which holds the name of the company the way you left it in the training file.

step.9 Sort by "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the vlookup of the soundex code.
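A minimal sketch of those steps in code (not from the original answer; the object names and example companies are hypothetical, and the hand-editing of the training file in step.6 is skipped):

library(phonics)

# Hypothetical list of messy company names
mylistofcompanynames <- data.frame(
  companynames = c("Amazon Inc", "Amazon", "Accenture Limited", "Accenture Ltd"),
  stringsAsFactors = FALSE
)

# steps 2-3: soundex codes (strip non-letters first, since soundex expects plain words)
mylistofcompanynames$soundexcodes <-
  soundex(gsub("[^A-Za-z]", "", mylistofcompanynames$companynames))

# steps 4-6: keep one "clean" name per soundex code (edit these by hand in practice)
trainingfile <- mylistofcompanynames[!duplicated(mylistofcompanynames$soundexcodes), ]
names(trainingfile)[1] <- "companyname_clean"

# steps 7-8: join the cleaned names back on the soundex code (a left join / vlookup)
result <- merge(mylistofcompanynames, trainingfile, by = "soundexcodes", all.x = TRUE)
table(result$companyname_clean)   # how often each cleaned company name appears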






answered Jan 24 '18 at 8:50 – Nikhil Muthukrishnan (edited Jan 29 '18 at 11:20 by marc_s)