A Quick Guide to Using the Regular Expression Substitution Operator

  Contributed by : The Administrator
  Member Level : Master
  Posted on : 2008-11-26



Regular expressions enable you to find patterns in strings, for example, all the H1 tags in an HTML file, or all the words beginning with the letter 'p'. Although the use of regular expressions is possible in several programming languages, it is Perl's support for regular expressions that makes it the language of choice for pattern matching and substitution.

In Perl, there are three main uses of pattern matching: matching, substitution, and translation. In this article we'll look at the substitution operation, which substitutes one expression for another.

The following Perl script shows a typical Perl template that can be used to process all the HTML files in the current folder or directory, that is, the location where the Perl script and the files to be processed are located.

 

 1 opendir(DIR, ".") or die "can't opendir: $!";
 2 @allfiles = grep(/\.html?$/i, readdir DIR);
 3 closedir(DIR);
 4 foreach $file (@allfiles) {
 5   rename $file, "$file.bak";
 6   open(IN, "<$file.bak") or die "can't open $file.bak: $!";
 7   open(OUT, ">$file") or die "can't open $file: $!";
 8   while ($line = <IN>) {
 9     REGULAR EXPRESSION GOES HERE
10     print OUT $line;
11   }
12   close IN;
13   close OUT;
14 }

 

Here's a very quick description of the above script:

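Lines 1 - 3: Get the names of all the HTML files in the current directory and put them into an array called @allfiles. (In Perl, the '@' symbol identifies an array, whereas the '$' symbol identifies a scalar variable, that is, one that holds a single value.)

Lines 4 - 11: Each file is processed in turn: it is renamed with a .bak extension, read line by line, and each line is written to a new file with the original name.

Lines 12 - 14: Close the input and output files and end the loop.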

Line 9: This is where the regular expression goes.
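
The same template can be sketched in a more defensive style, with lexical file handles, three-argument open, and error checks. This is only an illustration, not the author's script: the function name process_html_files, the callback argument, and the temporary demo directory are all assumptions introduced here so that the example can run without touching real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: apply $edit (a code ref that modifies its first
# argument in place) to every line of every .htm/.html file in $dir,
# keeping each original as "<name>.bak".
sub process_html_files {
    my ($dir, $edit) = @_;
    opendir(my $dh, $dir) or die "can't opendir $dir: $!";
    my @files = grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    for my $file (@files) {
        my $path = "$dir/$file";
        rename $path, "$path.bak" or die "can't rename $path: $!";
        open(my $in,  '<', "$path.bak") or die "can't read $path.bak: $!";
        open(my $out, '>', $path)       or die "can't write $path: $!";
        while (my $line = <$in>) {
            $edit->($line);      # the substitution goes here
            print {$out} $line;
        }
        close $in;
        close $out or die "can't close $path: $!";
    }
    return scalar @files;        # number of files processed
}

# Demo on a throwaway directory (removed automatically at exit).
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/page.html") or die "can't create demo file: $!";
print {$fh} "<p>hello</p>\n";
close $fh;

my $count = process_html_files($dir, sub { $_[0] =~ s/hello/goodbye/g });
print "processed $count file(s)\n";
```

Passing the substitution as a callback keeps the file handling in one place, so different regular expressions can be tried without editing the loop.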

To run the script you will need to have a Perl interpreter installed on your computer. If the above script were called script1.pl, you could run it by double-clicking it in, say, Windows Explorer, or by opening a command prompt (a DOS window, for example) and typing 'perl script1.pl' (without the quotes) from within the directory where the files to be processed are stored.

The Substitution Operator

As mentioned above, line 9 is where the substitution, expressed as a regular expression, goes.

The substitution operation uses the s/// operator, and is used to change strings. So, for example, you could change all occurrences of the word 'hello' to 'goodbye'. To do this, you would use the following statement (in place of line 9 in the above script):

$line =~ s/hello/goodbye/;

The '$line' variable holds the current string being processed, and it gets modified if the substitution is successful, i.e., if the word 'hello' is found, it is changed to 'goodbye'.

The '=~' is called the binding operator: it applies the substitution to the variable on its left. The 's' stands for 'substitution'.
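
As a minimal sketch, the statement above can be tried on a hard-coded string instead of a line read from a file. The sample text and the variable name $changed are assumptions made for illustration. In scalar context, s/// returns the number of substitutions made (or a false value if there were none), which is a convenient way to check whether anything changed.

```perl
use strict;
use warnings;

# Sample input; in the template this would be a line read from a file.
my $line = "hello world\n";

# s/// modifies $line in place and returns the number of replacements made.
my $changed = ($line =~ s/hello/goodbye/);

print $line;
print "substitution succeeded\n" if $changed;
```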

Substitution Options

The above regular expression will only change the first occurrence of 'hello' to 'goodbye' in the string being processed. So, for example, if we had the following bit of text, only the first 'hello' would be changed:

I would like to say hello to everyone who knows me, and hello to everyone who does not know me.

The line of text would become:

I would like to say goodbye to everyone who knows me, and hello to everyone who does not know me.

The following regular expression, on the other hand, will change all occurrences:

$line =~ s/hello/goodbye/g;

The 'g' at the end of the expression means 'global', and ensures that all occurrences of the word 'hello' are changed.
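
The difference can be sketched with the sample sentence from above. The variable names are assumptions for illustration; the copy-then-substitute idiom is used only so that the original text stays available for comparison.

```perl
use strict;
use warnings;

my $text = "I would like to say hello to everyone who knows me, "
         . "and hello to everyone who does not know me.";

# Without /g, only the first match is replaced.
(my $first_only = $text) =~ s/hello/goodbye/;

# With /g, every match is replaced; the return value is the match count.
my $all   = $text;
my $count = ($all =~ s/hello/goodbye/g);

print "$first_only\n";
print "$all\n";
print "replaced $count occurrences\n";
```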

There are several other substitution options. For example, 'i' (for insensitive) makes the match ignore the case of letters, so that the word 'hello' would be substituted however it is capitalised: HeLlo, hELLo, and hellO would all be substituted.

Other substitution options include:

s, which causes the string to be treated as a single line, so that the '.' meta character also matches the end of line character (a backslash followed by n), which it normally does not, and

m, which causes the string to be treated as multiple lines, so that '^' and '$' match at the start and end of each line within the string rather than only at the start and end of the whole string.
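
A short sketch of the 'i' option, using the three spellings mentioned above (the array name is an assumption for illustration). Note that the replacement text is inserted exactly as written; the original capitalisation is not carried over.

```perl
use strict;
use warnings;

my @greetings = ("HeLlo", "hELLo", "hellO");

# Inside the loop, $_ is an alias for each element, so s/// edits the array.
s/hello/goodbye/i for @greetings;

print "$_\n" for @greetings;
```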
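
These two options only make a difference when the string contains more than one line, which is not the case in the line-by-line template above. The following sketch therefore uses a two-line string; the sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, '.' stops at the newline, so only the first line's text goes.
(my $no_s = $text) =~ s/first.*//;

# With /s, '.' also matches the newline, so everything to the end is removed.
(my $with_s = $text) =~ s/first.*//s;

# Without /m, '^' matches only at the very start of the string.
(my $no_m = $text) =~ s/^/> /g;

# With /m, '^' matches at the start of each line.
(my $with_m = $text) =~ s/^/> /mg;

print "with m:\n$with_m\n";
```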

Meta Characters

Meta characters are characters that have a special meaning when creating search patterns. For example, the '|' character is the alternation meta character, and lets you specify two values, either of which can be matched for the substitution to succeed. The following regular expression will succeed if either 'hello' or 'hi' is found, and both will be changed to 'goodbye'.

$line =~ s/hello|hi/goodbye/g;

Note that if the 'g' (for global - see above) was omitted, only the first occurrence found of either 'hello' or 'hi' would be substituted.
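
One thing to watch for is that the pattern matches anywhere in the string, including inside longer words. The sketch below shows the effect and one possible refinement using the word-boundary sequence \b with a non-capturing group; the sample text is an assumption, and the refinement goes slightly beyond what the article covers.

```perl
use strict;
use warnings;

my $line = "say hi to this";

# Plain alternation also matches the "hi" inside "this".
(my $plain = $line) =~ s/hello|hi/goodbye/g;
print "$plain\n";    # say goodbye to tgoodbyes

# \b anchors the match at word boundaries, so only whole words are replaced.
(my $whole_words = $line) =~ s/\b(?:hello|hi)\b/goodbye/g;
print "$whole_words\n";    # say goodbye to this
```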

A couple of other commonly used meta characters are the '*' and the '+'. The '*' matches the character (or group) immediately to its left 0 or more times, whereas the '+' matches the character (or group) immediately to its left 1 or more times. These meta characters are often used in conjunction with meta sequences (see below).
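
A small sketch of the difference between the two, using an invented sample string: 'ab*' matches an 'a' followed by zero or more 'b' characters, so it also matches a lone 'a', whereas 'ab+' needs at least one 'b'.

```perl
use strict;
use warnings;

my $text = "a ab abbb";

# '*' allows zero repetitions, so the lone "a" matches too.
(my $star = $text) =~ s/ab*/X/g;
print "$star\n";    # X X X

# '+' requires at least one "b", so the lone "a" is left alone.
(my $plus = $text) =~ s/ab+/X/g;
print "$plus\n";    # a X X
```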

Meta Sequences

Meta sequences are characters that are given special meaning by virtue of having a backslash placed in front of them. For example, \d (backslash 'd') means a single digit. So, to match (and substitute) one or more digits you would use \d+, that is, backslash d followed by the '+' sign.

Other meta sequences, which need to have a backslash placed in front of them, include s (any single whitespace character, such as a space, tab, or newline), t (tab), and r (carriage return).
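
The sequences above might be used as follows. The sample line and the replacement characters are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $line = "Order 12345\tshipped 2008-11-26\r\n";

# \d+ matches each run of one or more digits.
(my $masked = $line) =~ s/\d+/#/g;

# \s+ matches each run of whitespace (spaces, tabs, CR, LF).
(my $spaced = $line) =~ s/\s+/ /g;

# \r matches a carriage return, e.g. to convert Windows line endings.
(my $unix = $line) =~ s/\r//g;

print $unix;
```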

Meta Brackets

Various types of brackets can be used as part of a regular expression. For example, square brackets [...] create a character class, which matches any one of the characters listed inside it. For instance, the following substitution will succeed if a, b, or c is found, and will delete every occurrence of those letters.

$line =~ s/[abc]//g;

Note that [abc] means the same as a|b|c.
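
A quick sketch confirming that the two forms give the same result on an invented sample word:

```perl
use strict;
use warnings;

my $word = "abracadabra";

# Character class: delete every a, b, or c.
(my $by_class = $word) =~ s/[abc]//g;

# Equivalent alternation.
(my $by_alternation = $word) =~ s/a|b|c//g;

print "$by_class\n";    # rdr
print "same result\n" if $by_class eq $by_alternation;
```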

The (...) bracket sequence can be used to remember (capture) the text matched by the pattern inside it. For example, the following substitution will delete every character on each line of a file, leaving only the end of line characters, so use with care!

$line =~ s/(.*?)//g;

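The '.', '*', and '?' are all meta characters: '.' matches any single character except the end of line character, '*' repeats it 0 or more times, and a '?' placed after the '*' makes the match as short as possible rather than as long as possible. They can be used together to match everything, normally between a starting point and an ending point, as here: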

$line =~ s/<h1>(.*?)<\/h1>/<h2>$1<\/h2>/g;

This substitution will change all level one headings to level two headings in an HTML file. The '$1' is a buffer (a capture variable) that holds the contents matched by the '(.*?)'. Note that the '/' in each closing tag must be escaped with a backslash, because an unescaped '/' would be read as the end of the pattern or of the replacement.
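
The following sketch shows why the '?' matters when a line contains more than one heading. It uses braces as delimiters, s{...}{...}, which Perl allows as an alternative to slashes, so the '/' in the closing tags needs no escaping. The sample line is an assumption; headings with attributes, uppercase tag names, or text spanning several lines would need a more careful pattern.

```perl
use strict;
use warnings;

my $line = "<h1>One</h1><h1>Two</h1>";

# Non-greedy (.*?): each heading is matched and rewritten separately.
(my $lazy = $line) =~ s{<h1>(.*?)</h1>}{<h2>$1</h2>}g;
print "$lazy\n";      # <h2>One</h2><h2>Two</h2>

# Greedy (.*): one match runs from the first <h1> to the last </h1>.
(my $greedy = $line) =~ s{<h1>(.*)</h1>}{<h2>$1</h2>}g;
print "$greedy\n";    # <h2>One</h1><h1>Two</h2>
```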

About the Author: John Dixon is a web developer working through his own company John Dixon Technology. In addition to providing web development services, John's company also provides a free accounting / bookkeeping tool, as well as other free software downloads, various tutorials, articles, and several news feeds. To find out more about John's company, John Dixon Technology Limited, go to http://www.dixondevelopment.co.uk.

John Dixon - EzineArticles Expert Author
 





Specifications
  Submission Date  26-Nov-2008
  Last Update  26-Nov-2008

