Thread: regular expressions [RESOLVED]

  1. #1

    Thread Starter
    Frenzied Member
    Join Date
    Nov 2003
    Posts
    1,489

    After reading about regular expressions I still have some confusion. What are they? When are they used? What's a 'work around' alternative? Are there situations where you SHOULD use them?

    Any links or real-world code examples would help.
    Last edited by Andy; Aug 13th, 2004 at 11:49 AM.

  2. #2
    Frenzied Member Mike Hildner's Avatar
    Join Date
    Jul 2002
    Location
    Des Moines, NM
    Posts
    1,690
    I'm no expert at regex, but I've used it once or twice in programs, and I use it quite a bit in UltraEdit to search text files.

    Regular expressions are a super-powerful way to deal with searching and parsing (usually) text files. Regex, IMHO, is a language of its own, and the masters are to be revered. They always get all the chicks.

    Normally you have some search string, and then you can do stuff with it, like find, count, replace, etc. But it's more than searching for "mystring". You build an expression that's more complex, like "a tab character, followed by any number of characters, as long as it starts with an X, followed by a pipe". Everyday examples might be finding or validating email addresses, IP addresses, or URLs. There are example expressions for all sorts of common tasks like that; just google.
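    As a rough sketch of the "tab, X, anything, pipe" description above, assuming the .NET Regex class: the module name, the HasField helper, and the sample lines are made up for illustration, and "any number of characters" is read here as "any characters that are not a pipe".

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module TabXPipeDemo
    ' \t     = a tab character
    ' X      = the literal letter X
    ' [^|]*  = any number of characters that are not a pipe
    ' \|     = a literal pipe (escaped, because | is a metacharacter)
    Public Const Pattern As String = "\tX[^|]*\|"

    ' Hypothetical helper: True if the line contains such a field.
    Public Function HasField(line As String) As Boolean
        Return Regex.IsMatch(line, Pattern)
    End Function

    Sub Main()
        Console.WriteLine(HasField("id" & vbTab & "Xylophone|42"))  ' True
        Console.WriteLine(HasField("id" & vbTab & "Banjo|42"))      ' False
    End Sub
End Module
```

    Doing the same check by hand with IndexOf and Substring is possible, which is the usual "work-around", but it takes noticeably more code.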

    What's a work-around? When should they be used? Depends on the task, I guess. If you have a need for text parsing that's too tough to figure out how to do on your own, I'd look into it. Like I said, though, it's a language of its own, and not for the weak.

  3. #3
    PowerPoster SuperSparks's Avatar
    Join Date
    May 2003
    Location
    London, England
    Posts
    265
    Programming Visual Basic .NET has a chapter on regular expressions. So far it's making my eyes glaze over, but I'll keep persevering with it. If you don't already have it, it's a great book for your bookshelf anyway:

    http://www.amazon.co.uk/exec/obidos/...613404-9764611
    Nick.

  4. #4
    Junior Member
    Join Date
    Jul 2004
    Location
    Port Huron, Michigan
    Posts
    20
    I'll post several examples, but here is one I have on hand. One of the apps I created for our IT Department downloads updates for certain software to a repository on a weekly basis. One of the updates we download is the virus definition update for Norton AntiVirus. This is the regular expression:

    "/avcenter/download/us-files/\d+-\d+-x86.exe"

    This searches HTML code for "/avcenter/download/us-files/" followed by one or more decimal digits (\d+), then a hyphen (-), then one or more decimal digits again (\d+), with "-x86.exe" at the end. Strictly speaking, the unescaped dot before "exe" matches any character; "\.exe" would match only a literal dot. Symantec's link that would get matched is the href value in this anchor tag:

    <a href="/avcenter/download/us-files/20040611-034-x86.exe">20040611-034-x86.exe</a>

    This is just a sample use. I don't have the code for my app on hand, so I took the program I use to help with creating regular expressions ("The Regulator"; search Google for a free copy) and generated VB code for it:

    ' The pattern includes the double quotes around the href value.
    ' The dot is escaped so that it matches only a literal ".".
    Dim pattern As String = """/avcenter/download/us-files/\d+-\d+-x86\.exe"""
    Dim options As System.Text.RegularExpressions.RegexOptions = System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace _
        Or System.Text.RegularExpressions.RegexOptions.Multiline _
        Or System.Text.RegularExpressions.RegexOptions.IgnoreCase
    Dim reg As New System.Text.RegularExpressions.Regex(pattern, options)
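    To show how a pattern like this might actually be used, here is a minimal sketch that pulls the download path out of a piece of HTML. It is not the code from my app: the module name and the FindDownloadPath helper are hypothetical, the pattern leaves out the surrounding quotes and escapes the dot, and the sample HTML is the anchor tag shown earlier.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DefinitionLinkDemo
    ' Path pattern without the surrounding quotes, with the dot escaped.
    Public Const Pattern As String = "/avcenter/download/us-files/\d+-\d+-x86\.exe"

    ' Hypothetical helper: returns the first matching path in the HTML,
    ' or Nothing if there is none.
    Public Function FindDownloadPath(html As String) As String
        Dim m As Match = Regex.Match(html, Pattern, RegexOptions.IgnoreCase)
        If m.Success Then Return m.Value
        Return Nothing
    End Function

    Sub Main()
        Dim html As String = "<a href=""/avcenter/download/us-files/20040611-034-x86.exe"">20040611-034-x86.exe</a>"
        Console.WriteLine(FindDownloadPath(html))
        ' Prints: /avcenter/download/us-files/20040611-034-x86.exe
    End Sub
End Module
```

    Regex.Match returns only the first hit; Regex.Matches would return all of them if the page held several such links.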

  5. #5
    Lively Member TLord's Avatar
    Join Date
    Jun 2004
    Posts
    95
    Umm, I'll give a small tutorial here...

    Regular expressions (RegExes for short) are special instructions for performing advanced text searches.

    Before I start: RegExes are case sensitive by default.

    You can search for a text literal by writing the character sequence as-is: "myCat" searches for myCat.
    To match an unspecified character, use the dot ".", which stands for any single character (by default, anything except a newline), so "..g" would match "leg", "dig", "big", "bag", "0eg"...
    The dot (".") is a metacharacter in RegEx, so if you want to match it as a literal you have to escape it as "\.", like all metacharacters.
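    The rules above can be tried out with a short sketch like this one, assuming the .NET Regex class. The Found helper and the sample strings are made up for illustration; RegexOptions.IgnoreCase is what switches off the default case sensitivity.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LiteralAndDotDemo
    ' Hypothetical helper: True if the pattern is found anywhere in the text.
    Public Function Found(text As String, pattern As String, Optional options As RegexOptions = RegexOptions.None) As Boolean
        Return Regex.IsMatch(text, pattern, options)
    End Function

    Sub Main()
        Console.WriteLine(Found("I fed myCat", "myCat"))                           ' True
        Console.WriteLine(Found("I fed mycat", "myCat"))                           ' False (case sensitive)
        Console.WriteLine(Found("I fed mycat", "myCat", RegexOptions.IgnoreCase))  ' True
        Console.WriteLine(Found("0eg", "..g"))   ' True: the dot matches any character
        Console.WriteLine(Found("axb", "a.b"))   ' True
        Console.WriteLine(Found("axb", "a\.b"))  ' False: the escaped dot is a literal "."
        Console.WriteLine(Found("a.b", "a\.b"))  ' True
    End Sub
End Module
```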

    To match one character out of a set of values, put the values in square brackets. For example, "level[ABCDE]" will match a piece of a string consisting of "level" followed by one of the characters A, B, C, D, or E.
    A negated set is the negation of a normal set, meaning "not any of ...". Its syntax is "[^...]", so "[^ABCD]_tt" matches any single character other than A, B, C, or D, followed by "_tt". Possible matches: "6_tt", "v_tt", "%_tt".
    You can define ranges of values in a set with the hyphen "-". For example, "[a-z]" matches any lowercase letter. Multiple ranges can be included in one set, as in "[a-ft-w2-8]", and ranges can sit alongside literals: "[a-egb2-4]" will match any character that is a through e, or g, or b, or any digit 2 through 4. (The "b" is redundant, since it already falls within a-e.) All possible matches: a, b, c, d, e, g, 2, 3, 4.
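    Here is a small sketch of sets, negated sets, and ranges in action, assuming the .NET Regex class. The AllMatches helper and the input strings are hypothetical; it simply lists every match so you can see which characters a set accepts.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CharacterSetDemo
    ' Hypothetical helper: returns every match of the pattern, joined by spaces.
    Public Function AllMatches(text As String, pattern As String) As String
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(text, pattern)
            found.Add(m.Value)
        Next
        Return String.Join(" ", found.ToArray())
    End Function

    Sub Main()
        ' Plain set: only A through E are accepted after "level".
        Console.WriteLine(AllMatches("levelA levelF levelC", "level[ABCDE]"))  ' levelA levelC
        ' Negated set: the character before "_tt" must not be A, B, C or D.
        Console.WriteLine(AllMatches("6_tt A_tt v_tt", "[^ABCD]_tt"))          ' 6_tt v_tt
        ' Ranges mixed with literals.
        Console.WriteLine(AllMatches("a1b2c3f5g", "[a-egb2-4]"))               ' a b 2 c 3 g
    End Sub
End Module
```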

    Metacharacters are special characters. The whitespace ones are: "\t" = tab, "\r" = carriage return, "\n" = linefeed, "\f" = formfeed.
    Some metacharacters stand in for ranges: "\d" matches one digit, "\w" matches one word character (a lowercase letter, an uppercase letter, a digit, or the underscore), and "\s" matches one whitespace character.
    Normally these metacharacters are negated by their uppercase forms: "\D" matches anything but a digit, "\W" matches anything but a word character, and "\S" matches anything but a whitespace character.
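    One quick way to see what each of these classes covers is to delete everything it matches. This is only a sketch, assuming the .NET Regex class; the Strip helper and the sample strings are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ShorthandClassDemo
    ' Hypothetical helper: removes every match of the pattern from the text.
    Public Function Strip(text As String, pattern As String) As String
        Return Regex.Replace(text, pattern, "")
    End Function

    Sub Main()
        Console.WriteLine(Strip("Order 66 shipped", "\D"))   ' 66  (non-digits removed)
        Console.WriteLine(Strip("a b" & vbTab & "c", "\s"))  ' abc (space and tab removed)
        Console.WriteLine(Strip("snake_case-name!", "\W"))   ' snake_casename (underscore kept)
        Console.WriteLine(Strip("room101", "\d"))            ' room
    End Sub
End Module
```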

    Occurrence metacharacters (quantifiers) define how many times the preceding item may occur. How would you match one or more digits with only what we have so far? ...Impossible.
    To match one or more occurrences, append "+" after the character; for zero or one occurrence, append "?"; and for zero or more occurrences, append "*".
    So "\w+" matches one or more word characters. "[A-Z]*" matches zero or more occurrences of an uppercase letter. "\d[aftj]?\d" matches a digit, optionally followed by one of a, f, t, or j, and then another digit.
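    A minimal sketch of these three quantifiers, assuming the .NET Regex class. The FirstMatch helper and the inputs are hypothetical. Note that "[A-Z]*" can succeed with an empty match, since zero occurrences is allowed.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Hypothetical helper: returns the first match, or Nothing if there is none.
    Public Function FirstMatch(text As String, pattern As String) As String
        Dim m As Match = Regex.Match(text, pattern)
        If m.Success Then Return m.Value
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(FirstMatch("  hello world", "\w+"))              ' hello
        Console.WriteLine(FirstMatch("abc", "[A-Z]*") = "")                ' True: empty match at the start
        Console.WriteLine(FirstMatch("id 4f2", "\d[aftj]?\d"))             ' 4f2
        Console.WriteLine(FirstMatch("id 42", "\d[aftj]?\d"))              ' 42
        Console.WriteLine(FirstMatch("id 4x2", "\d[aftj]?\d") Is Nothing)  ' True: x is not allowed
    End Sub
End Module
```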

    An exact number of occurrences is given in braces, e.g. "t{3}" will match three occurrences of the lowercase letter "t", and "[1-6e-g]{2}" will match two occurrences of either the digits 1 through 6 or the letters e, f, or g. Possible matches: "2f", "1g", "4e" ...
    A range of occurrences is given by putting a comma "," inside the braces: "[r-t]{2,6}" will match the lowercase letters r, s, or t from 2 to 6 times. Possible matches: "rrst", "rt", "ttsrss", "rrrrrr" ...
    Open ranges are simply ranges with one omitted bound: "\w{6,}" will match any sequence of at least 6 word characters. Omitting the lower bound, as in "\d{,4}" for up to 4 digits, is not supported by every engine (.NET, for one, treats "{,4}" as literal text), so write "\d{0,4}" instead.
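    The counted forms can be checked with a sketch like the following, assuming the .NET Regex class. The IsWhole helper is hypothetical, and the "^" and "$" anchors (not covered in this tutorial) are added so that the whole string has to fit the pattern, not just a piece of it.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CountedQuantifierDemo
    ' Hypothetical helper: True if the entire text fits the pattern.
    ' "^" anchors the match to the start of the text and "$" to the end.
    Public Function IsWhole(text As String, pattern As String) As Boolean
        Return Regex.IsMatch(text, "^" & pattern & "$")
    End Function

    Sub Main()
        Console.WriteLine(IsWhole("ttt", "t{3}"))             ' True
        Console.WriteLine(IsWhole("tt", "t{3}"))              ' False
        Console.WriteLine(IsWhole("2f", "[1-6e-g]{2}"))       ' True
        Console.WriteLine(IsWhole("rrst", "[r-t]{2,6}"))      ' True
        Console.WriteLine(IsWhole("rrrrrrr", "[r-t]{2,6}"))   ' False: 7 letters is too many
        Console.WriteLine(IsWhole("abcde", "\w{6,}"))         ' False: fewer than 6
        Console.WriteLine(IsWhole("1234", "\d{0,4}"))         ' True
        Console.WriteLine(IsWhole("12345", "\d{0,4}"))        ' False
    End Sub
End Module
```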

    To include metacharacters and other special characters as literals, escape them with a backslash "\", so matching "\" itself means doubling it. For example, matching the string "\{}[]+?*" needs this RegEx: "\\\{\}\[\]\+\?\*" (the only thing that might confuse you when composing RegExes is escaping...)
    Note: the dash "-" is a normal character outside square brackets, but inside a set it must be escaped (or placed first or last) to be taken literally. So "[a-d]" is completely different from "[a\-d]", since the latter is the set of "a", "-", or "d", not a range from a to d.
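    A short sketch of escaping, assuming the .NET Regex class. Besides the hand-escaped pattern from above, it uses Regex.Escape, which is a .NET convenience not mentioned in this tutorial that does the escaping for you. The module, constant, and helper names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapeDemo
    Public Const Literal As String = "\{}[]+?*"
    Public Const HandEscaped As String = "\\\{\}\[\]\+\?\*"

    ' Hypothetical helper: True if the literal text occurs in the input,
    ' using Regex.Escape to build the pattern.
    Public Function ContainsLiteral(input As String, literalText As String) As Boolean
        Return Regex.IsMatch(input, Regex.Escape(literalText))
    End Function

    Sub Main()
        Console.WriteLine(Regex.IsMatch("x" & Literal & "y", HandEscaped))  ' True
        Console.WriteLine(ContainsLiteral("x" & Literal & "y", Literal))    ' True
        Console.WriteLine(Regex.IsMatch("-", "[a\-d]"))  ' True: the set is a, - or d
        Console.WriteLine(Regex.IsMatch("-", "[a-d]"))   ' False: the range is a through d
        Console.WriteLine(Regex.IsMatch("c", "[a-d]"))   ' True
        Console.WriteLine(Regex.IsMatch("c", "[a\-d]"))  ' False
    End Sub
End Module
```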

    You can group expressions by wrapping them in parentheses "()". This is useful for applying an occurrence rule to more than one character. For example, "#(ac)+" will match "#" followed by one or more pairs of "ac".
    Nested groups are supported: "((fr[tn]){2,})?" is a valid RegEx. It matches "fr" followed by either "t" or "n", repeated two or more times, and the outer "?" makes the whole thing optional.
    To choose between groups, don't use square brackets; use the pipe "|", which is an OR between the alternatives: "((ht)|(f))tp" will match a regular URL scheme, either "ftp" or "http".
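    Groups and alternation can be sketched like this, assuming the .NET Regex class. The SchemePrefix helper and the sample inputs are hypothetical. Match.Groups gives access to what each pair of parentheses captured, and the "^" and "$" anchors (not covered in this tutorial) force the whole string to fit.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo
    ' Hypothetical helper: returns what the outer group of "((ht)|(f))tp"
    ' captured ("ht" or "f"), or Nothing if the text does not match.
    Public Function SchemePrefix(text As String) As String
        Dim m As Match = Regex.Match(text, "^((ht)|(f))tp$")
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(Regex.Match("x#acacz", "#(ac)+").Value)     ' #acac
        Console.WriteLine(SchemePrefix("http"))                       ' ht
        Console.WriteLine(SchemePrefix("ftp"))                        ' f
        Console.WriteLine(SchemePrefix("tp") Is Nothing)              ' True
        Console.WriteLine(Regex.IsMatch("frtfrn", "^((fr[tn]){2,})?$"))  ' True: two repetitions
        Console.WriteLine(Regex.IsMatch("frt", "^((fr[tn]){2,})?$"))     ' False: only one
        Console.WriteLine(Regex.IsMatch("", "^((fr[tn]){2,})?$"))        ' True: the group is optional
    End Sub
End Module
```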

    Example: matching an IP address:
    a.b.c.d
    An IP address is built from 4 fragments of up to 3 digits each, with a maximum value of 255, so you have to consider the following:
    1- define "a." and repeat it 3 times, then define it again without the dot for the last fragment "d".
    2- a fragment might be one, two, or three digits.
    3- if the fragment is one or two digits, it may take any value; likewise, if it is three digits starting with "1", the remaining digits can take any value.
    4- if the fragment is three digits and the first digit is 2, the second digit must not be greater than 5, and the last digit can take any value unless the second digit is 5, in which case it must not be greater than 5.

    s1- this makes our RegEx something like this: "(__\.){3}__" where "__" is the expression for an IP fragment.
    s2- the expression gets clearer: "(\d{1,3}\.){3}\d{1,3}"
    s3- here we must take a step back, since three digits are treated differently from one or two digits:
    Code:
    "(((\d{1,2})|(1\d{2})|(\d{3}))\.){3}((\d{1,2})|(1\d{2})|(\d{3}))"
    s4- here we break our (\d{3}) condition into two conditions:
    Code:
    (\d{3}) --> (2[0-4]\d)|(25[0-5])
    So our final solution is:
    Code:
    "(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))"
    Last edited by TLord; Aug 1st, 2004 at 02:19 PM.
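    To try the final IP expression, here is a minimal sketch assuming the .NET Regex class. The module, constant, and helper names are hypothetical. One caveat: on its own the expression will also find a valid address inside a longer string (for example "99.1.1.1" inside "999.1.1.1"), so for validation this sketch wraps it in the "^" and "$" anchors, which the tutorial does not cover.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IpAddressDemo
    ' One fragment: 1-2 digits, or 1xx, or 200-249, or 250-255.
    Private Const Octet As String = "((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))"
    ' The final expression from the tutorial: three "fragment + dot", then a fragment.
    Public Const Unanchored As String = "(" & Octet & "\.){3}" & Octet
    ' Anchored variant (an addition): the whole string must be an address.
    Public Const Anchored As String = "^" & Unanchored & "$"

    ' Hypothetical helper: True if the whole text is a dotted IPv4 address.
    Public Function IsIpAddress(text As String) As Boolean
        Return Regex.IsMatch(text, Anchored)
    End Function

    Sub Main()
        Console.WriteLine(IsIpAddress("192.168.0.1"))      ' True
        Console.WriteLine(IsIpAddress("255.255.255.255"))  ' True
        Console.WriteLine(IsIpAddress("256.1.1.1"))        ' False: 256 is out of range
        Console.WriteLine(IsIpAddress("1.2.3"))            ' False: only three fragments
        Console.WriteLine(IsIpAddress("999.1.1.1"))        ' False
        ' Without anchors, the same input matches on the substring "99.1.1.1".
        Console.WriteLine(Regex.IsMatch("999.1.1.1", Unanchored))  ' True
    End Sub
End Module
```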