Extracting speaker interventions from a text using R? Or something else?
We're working on a text mining project for school on the proportion of environment-oriented speech in Quebec's National Assembly. We wanna extract a list of every speaker's interventions throughout the years.
Our documents are all formatted this way:
Mr. Smith : Blablabla
Mrs. Jones : Blablabla
What I would like to do is write the simplest thing possible that would allow me to extract these interventions. I'm thinking something along the lines of:
"Every time you see [Mr. **** : ] OR [Mrs. **** : ], extract ALL the text until you see another occurrence of [Mr. **** : ] OR [Mrs. **** : ]." And, ideally, extract all the Mr. Smiths, the Mrs. Joneses and the Mr. Williamses into separate files while keeping track of which file the interventions came from.
I started writing a very basic gsub line which allowed me to replace the occurrences I wanted to replace with an @, only to realize I don't want to replace them completely but rather maybe just add an @ in front which would probably make it easier to write something that would just separate the @s in distinct files.
gsub("(Mr.|Mrs.)\\s\\w*\\s:\\s", "@", test)
I've just started teaching myself R for this project and I need some insight on how I should proceed next. Or should I use something else instead?
r text-mining
edited Nov 19 at 10:39 by snoram
asked Nov 19 at 4:12 by François Côté
Probably tokenize words and then cumsum(grepl(...)), like how chapters are IDed here: tidytextmining.com/tidytext.html#tidyausten
– alistaire
Nov 19 at 4:36
Can you provide a link to an actual document?
– hrbrmstr
Nov 19 at 8:16
The "." you use in your regex is actually a metacharacter; as a metacharacter it does not mean "period" but "any character". If you want to match a literal period, you have to escape it thus: "\\." (inside an R string the backslash itself must be doubled).
– Chris Ruehlemann
Nov 19 at 10:52
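The two suggestions in the comments above (a running count with cumsum(grepl(...)) and escaping the period) can be sketched together in base R. This is only an illustration: the `lines` vector is made-up data standing in for what readLines() might return, and the sketch assumes each intervention starts at the beginning of a line with a one-word surname.

```r
# Hypothetical transcript, one element per line of the source file
lines <- c(
  "Mr. Smith : First point.",
  "Continuation of the first point.",
  "Mrs. Jones : A reply.",
  "Mr. Smith : A rejoinder."
)

# TRUE where a line opens a new intervention; "\\." matches a literal period
starts <- grepl("^(Mr\\.|Mrs\\.)\\s\\w+\\s:", lines)

# The running count of speaker markers gives every line an intervention ID
turn_id <- cumsum(starts)

# Paste together the lines that share an ID
turns <- as.vector(tapply(lines, turn_id, paste, collapse = " "))
turns
```

Lines that do not open with a speaker marker inherit the ID of the previous marker, so multi-line speeches stay attached to their speaker.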
1 Answer (accepted)
If you don't want to replace the speaker names, you can use what is called a 'positive lookahead', like this:
# some example data:
bla <- c("Mr. X : blablabla bla bla bla. Mrs. Y : bla bla blablablab Mr. XY : bla bla balblabla blabl abl" )
# insert "@ " in front of each speaker marker with a lookahead
# (periods escaped so they match literally; result assigned back to bla for the next step):
bla <- gsub("(?=(Mr\\.|Mrs\\.))", "@ ", bla, perl = TRUE)
bla
[1] "@ Mr. X : blablabla bla bla bla. @ Mrs. Y : bla bla blablablab @ Mr. XY : bla bla balblabla blabl abl"
The @ is a good starting point for extracting the individual interventions. This can be done thus:
pattern <- "@.[^@]*"
matches <- gregexpr(pattern, bla)
interventions <- regmatches(bla, matches)
interventions <- unlist(interventions)
interventions
[1] "@ Mr. X : blablabla bla bla bla. " "@ Mrs. Y : bla bla blablablab " "@ Mr. XY : bla bla balblabla blabl abl"
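A variant that is not part of the original answer: if the transcripts may already contain an @ (an e-mail address, say), the marker step can be skipped by matching each intervention directly, from one speaker marker up to the next marker or the end of the string. The sketch below assumes single-line input and one-word speaker names; the `marker` and `pattern` names are illustrative.

```r
bla <- "Mr. X : blablabla bla bla bla. Mrs. Y : bla bla blablablab Mr. XY : bla bla balblabla blabl abl"

# A speaker marker: title, escaped period, one-word name, then " :"
marker <- "(?:Mr|Mrs)\\.\\s\\w+\\s:"

# A marker, then as little text as possible, up to (but not including)
# the next marker or the end of the string
pattern <- paste0(marker, ".*?(?=", marker, "|$)")

interventions <- regmatches(bla, gregexpr(pattern, bla, perl = TRUE))[[1]]
interventions <- trimws(interventions)  # drop the trailing spaces
interventions
```

Because "." does not match a newline by default under perl = TRUE, multi-line documents would need the lines collapsed first or a (?s) flag at the start of the pattern.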
edited Nov 19 at 10:36
answered Nov 19 at 10:05 by Chris Ruehlemann
Thank you so much, this worked! I had to work a bit on my regex so it only caught what I needed, but your lines for extracting the interventions worked perfectly! Now my colleague and I may proceed to the next step, which is extracting the interventions into separate files for each speaker. We'll try to find it out by ourselves but if you have any idea of how we should proceed, feel free to guide us :)
– François Côté
Nov 23 at 20:21
You can proceed thus:
# 1. split interventions into 2 columns (1 for speaker ID, 1 for intervention) using the package stringr:
require(stringr)
dat <- data.frame(
  speaker_name = str_extract(interventions, "@.*[a-z|A-Z].*:"),
  intervention_text = str_extract(interventions, ":.*[a-z|A-Z].*")
)
# 2. remove "@" & ":":
dat$speaker_name <- gsub("@ ", "", dat$speaker_name, perl = T)
dat$intervention_text <- gsub(": ", "", dat$intervention_text, perl = T)
# 3. use by to split dat into dfs per distinct ID:
df_list <- by(dat, dat$speaker_name, function(unique) unique)
df_list
– Chris Ruehlemann
Nov 23 at 23:17
Step 2 can be skipped by using a positive lookbehind in Step 1:
dat <- data.frame(
  speaker_name = str_extract(interventions, "(?<=@).*[a-z|A-Z].*:"),
  intervention_text = str_extract(interventions, "(?<=:).*[a-z|A-Z].*")
)
– Chris Ruehlemann
yesterday
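For the remaining step raised in the comments (one file per speaker), a minimal base-R sketch might look like the following. It is not from the thread: the sample `interventions`, the `out_dir` folder and the file-naming scheme are all assumptions made for illustration, and it presumes speaker names contain no colon.

```r
# Hypothetical interventions, in the "@ Speaker : text" form produced above
interventions <- c(
  "@ Mr. X : first remark ",
  "@ Mrs. Y : a reply ",
  "@ Mr. X : second remark"
)

# Speaker: whatever sits between the "@" marker and the first colon
speaker <- sub("^@\\s*(.*?)\\s*:.*$", "\\1", interventions, perl = TRUE)
# Text: everything after the first colon, without surrounding spaces
text <- trimws(sub("^[^:]*:", "", interventions))

# One character vector of interventions per distinct speaker
by_speaker <- split(text, speaker)

out_dir <- file.path(tempdir(), "interventions")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

for (name in names(by_speaker)) {
  # Turn "Mr. X" into a file-system-friendly name such as "Mr_X.txt"
  file_name <- paste0(gsub("[^[:alnum:]]+", "_", name), ".txt")
  writeLines(by_speaker[[name]], file.path(out_dir, file_name))
}
```

To keep track of which source document each intervention came from, the same split() could be applied to a data frame that carries a source-file column next to the text.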