[RESOLVED] Split Paragraph into sentences


mpdeglau
Oct 9th, 2008, 10:50 AM
I have an app that has a paragraph(s) passed in, and I need to figure out how many sentences it has.

Right now, this is how I'm doing it:

private long getSentenceCount(String text) {
    // Placeholder markers. Neither contains regex metacharacters,
    // so both are safe to pass to String.split() below.
    String delim = "@@@@";
    String delim2 = "####";

    // Mark every sentence terminator that is followed by a space or a line break.
    String tempText = text.replace(". ", delim);
    tempText = tempText.replace(".\r\n", delim);
    tempText = tempText.replace("! ", delim);
    tempText = tempText.replace("!\r\n", delim);
    tempText = tempText.replace("? ", delim);
    tempText = tempText.replace("?\r\n", delim);
    // Any remaining line break is a boundary without punctuation.
    tempText = tempText.replace("\r\n", delim2);

    String[] sentences = tempText.split(delim);
    long sCnt = 0;
    for (String s : sentences) {
        // textIsSentence checks that the string is not empty, that it has
        // more than 4 words (arbitrary number for now) and that the first
        // letter is uppercase.
        if (s.contains(delim2)) {
            for (String t : s.split(delim2)) {
                if (textIsSentence(t)) {
                    sCnt++;
                }
            }
        } else if (textIsSentence(s)) {
            sCnt++;
        }
    }
    return sCnt;
}
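
The textIsSentence helper is not shown in the post, only described in the comment. A minimal sketch of what it might look like, assuming "words" means whitespace-separated tokens and that leading whitespace should be ignored before checking the first letter (the class name SentenceCheck is made up for illustration):

```java
class SentenceCheck {
    // Hypothetical reconstruction: the post describes this helper but does not show it.
    static boolean textIsSentence(String text) {
        if (text == null) {
            return false;
        }
        String trimmed = text.trim();
        // Rule 1: the string must not be empty.
        if (trimmed.isEmpty()) {
            return false;
        }
        // Rule 2: the first letter must be uppercase.
        if (!Character.isUpperCase(trimmed.charAt(0))) {
            return false;
        }
        // Rule 3: more than 4 words (assumed to be whitespace-separated tokens).
        return trimmed.split("\\s+").length > 4;
    }
}
```

The 4-word threshold is the arbitrary number mentioned in the comment, so a short but genuine sentence such as "It works." would be rejected by this check.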


I'm wondering if there is a better way to do this, perhaps with a regex, but I'm having trouble figuring out how to write the pattern.

What it needs to find is:
A period, question mark, or exclamation point followed by either a space or a new line, or just a new line on its own.

Thanks

ComputerJy
Oct 9th, 2008, 07:13 PM
I hope this helps
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Test
{
    private static final String lineSeparator = System.getProperty("line.separator");

    public static void main(final String[] args)
    {
        final File f = new File("test.txt");
        try
        {
            final String paragraph = Test.readFileString(f);
            // One of '.', '!' or '?' followed by one or more whitespace
            // characters (spaces or line breaks).
            final Pattern p = Pattern.compile("[\\.\\!\\?]\\s+");
            final int value = p.split(paragraph).length;
            System.out.println("Number Of Sentences: " + value);
        }
        catch (final FileNotFoundException e)
        {
            System.err.println("File \"" + f.getPath() + "\" was not found");
        }
    }

    // Reads the whole file into one string, one line at a time.
    private static String readFileString(final File file) throws FileNotFoundException
    {
        final StringBuilder sBuilder = new StringBuilder();
        // try-with-resources closes the Scanner (and the file) when done.
        try (Scanner scanner = new Scanner(file))
        {
            while (scanner.hasNextLine())
            {
                sBuilder.append(scanner.nextLine()).append(Test.lineSeparator);
            }
        }
        return sBuilder.toString();
    }
}


You might also want to take a look at class java.util.regex.Pattern (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html)
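
The pattern above only splits after punctuation that is followed by whitespace, while the original question also wanted a bare new line to count as a boundary. A minimal sketch of one way to cover both cases, assuming that blank pieces (for example from empty lines) should not be counted; the class and method names are made up for illustration:

```java
class SentenceCounter {
    // Boundary: '.', '!' or '?' followed by whitespace, or a line break on its own.
    private static final String BOUNDARY = "[.!?]\\s+|\\r?\\n";

    static int countSentences(String text) {
        int count = 0;
        for (String piece : text.split(BOUNDARY)) {
            // Skip empty pieces, e.g. those produced by blank lines.
            if (!piece.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "First one. Second one!\r\n"
                + "A heading without punctuation\r\n"
                + "Is this the last? Yes.";
        System.out.println(countSentences(text)); // prints 5
    }
}
```

This does not apply the extra checks from the original textIsSentence (word count, uppercase first letter), so abbreviations such as "e.g. " would still be counted as sentence ends.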

mpdeglau
Oct 10th, 2008, 10:21 AM
Thanks, that worked. I knew it could be done with regex. For some reason I just can't grasp regex. It's usually pure luck if I can figure out the correct pattern to use, and that's always a simple pattern.

leinad31
Oct 14th, 2008, 03:35 AM
Just bear in mind that square brackets select per character: with [\\.\\!\\?], a single character position can hold a period, an exclamation mark, or a question mark. You can't put multi-character groups inside the brackets. For example, if you want either aa or zz, [(aa)(zz)]+ won't do it: inside a character class the parentheses are literal characters, so it matches any run of (, ), a, and z. For that you need alternation in a group, e.g. (aa|zz)+.
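
A small sketch of the difference between a character class and a group with alternation, using String.matches (which requires the whole string to match). The class and method names are made up for illustration; (aa|zz)+ is one way to express "either aa or zz" in a single pattern:

```java
class ClassVsAlternation {
    // Character class: each position may be any ONE of '(', ')', 'a' or 'z'.
    static boolean matchesCharClass(String s) {
        return s.matches("[(aa)(zz)]+");
    }

    // Alternation in a group: each repetition is the whole "aa" or the whole "zz".
    static boolean matchesAlternation(String s) {
        return s.matches("(aa|zz)+");
    }

    public static void main(String[] args) {
        System.out.println(matchesCharClass("(a)"));    // true: parentheses are literal here
        System.out.println(matchesCharClass("az"));     // true: any mix of the listed characters
        System.out.println(matchesAlternation("aazz")); // true: "aa" then "zz"
        System.out.println(matchesAlternation("az"));   // false: neither "aa" nor "zz"
    }
}
```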

ComputerJy
Oct 14th, 2008, 05:22 AM
Are you a software engineer or something??