Regular Expressions to Resolve Text Processing Problem

Bogomips
Full of knowledge
Posts: 2564
Joined: 25 Jun 2014, 15:21
Distribution: 3.2.2 Cinnamon & KDE5
Location: London

Post#1 by Bogomips » 09 Apr 2017, 16:35

Simple Problem: Globally replace all delimited text, delimiter strings included, with the same string, where the delimited text may span more than one line (the end delimiter need not be on the same line as the start delimiter). The specific problem is to replace all code blocks in a post so as to present an overview of the document.
  • Kate: Using its sed-like functionality. Although lacking Lazy Quantifiers, it can use Look Ahead to reduce a line with two code blocks to just the one code block, after which the default Greedy Quantifier would work to replace the remaining single code block in the line.
    • One Liner

      Code: Select all

      [*][code]guest@porteus:~$ tree -nd x/sda3x/sda3└── ploplinux    └── myscripts2 directories[/code][/list][/list][*]Linux Partitions[list][*]Arch Way[code]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                   [/code]
    • Applied Substitution

      Code: Select all

      s/(.*)\[cod.*\[\/cod(.*)(?=code.*\/code)/\1<>\2/
    • Resultant String

      Code: Select all

      [*]<>e][/list][/list][*]Linux Partitions[list][*]Arch Way[code]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                   [/code]

      However, if a code block spans more than one line, we have a problem.
    • Sed: GNU sed offers only POSIX basic and extended REs, which include neither Lazy Quantifiers nor Look Arounds, so code blocks spanning several lines would need some other sed technique. Not being so versed in sed, could not explore whether that is viable.
    • Perl does admit of Lazy Quantifiers:

      Code: Select all

      guest@porteus:~$ perl -pe 's/\[code.*?e\]/<>/'
      [*][code]guest@porteus:~$ tree -nd x/sda3x/sda3└── ploplinux    └── myscripts2 directories[/code][/list][/list][*]Linux Partitions[list][*]Arch Way[code]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                   [/code]
      ^D
      [*]<>[/list][/list][*]Linux Partitions[list][*]Arch Way[code]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                   [/code]
      guest@porteus:~$ perl -pe 's/\[code.*?e\]/<>/'
      [*]<>[/list][/list][*]Linux Partitions[list][*]Arch Way[code]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                   [/code]
      ^D
      [*]<>[/list][/list][*]Linux Partitions[list][*]Arch Way<>

      So versatile, but the problem remains of stringing out the whole document so that line breaks are ignored, which did not seem possible with a one-liner without having to learn Perl. :(
    • Awk: This should do the trick, but it means yet another set of regular expressions. As this is not a major undertaking, it is not worth spending several hours revamping my awk and working out which REs it allows.
    • Bash: A simple coding problem, resolvable with the coding functionality of Bash in conjunction with non-complex REs.
      • The Sample of Text read into Array

        Code: Select all

        guest@porteus:~$ readarray -t full_text
        Linux Partitions to RSYNC
        [list]
        
        [*]Arch Way[code=php]# rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /path/to/backup/folder                                      [/code][list]

        [*][code]The --exclude option causes files that match the given patterns to be
         excluded. The contents of /dev, /proc, /sys, /tmp, and /run are excluded in
         the above command, because they are populated at boot, although the folders 
        themselves are not created. /lost+found is filesystem-specific.
         [/code]

        [*][code]    Using many hard links, consider adding the -H option, which is turned off by default due to its memory expense
             
            Using sparse files, such as virtual disks, Docker images and similar, add the -S option.
        [/code]
        ^D
        • The Bash Code (the programming might look a bit stilted in places, but that arises from the idiosyncratic nature of Bash)

          Code: Select all

          clout ()
          {
          #   set -x
          #   Passing arguments by Name. Equivalent of passing Pointer to Variable.
          local -n say=${1:-pay};     # Source Array: Lines of Text
          local -n day=${2:-ray};     # Resultant Destination Array
          local i l b w;              # Index, Line Count, Buffer, Work Line
          l=${#say[*]}; unset day;    # Number of Lines of Text; Clear Destination
          for ((i=0; i<l; ))
          do
              b=${say[i++]}; b=${b//=php/};   # Buffer, with [code=php] reduced to [code]
              while [[ $b =~ \[code ]]
              do
                  w=${b#*\[code\]};       # Rest of Code Line
                  b=${b%%\[code\]*};      # Relevant Text
                  b+="<>";                # Add Code Block Marker
                  # End of Code Block?
                  until [[ $w =~ \[/code ]];
                  do
                      ((i<l)) || { echo Incomplete Code Block\!; echo "'$w'"; return 1; }
                      w=${say[i++]}; w=${w//=php/};
                  done
                  b+=${w#*\[/code\]};     # Text after End of Code Block
              done
              day+=("$b");
          done
          }
          
        • The Resultant Text (Code Sourced from File or Pasted into Terminal)

          Code: Select all

          guest@porteus:~$ red_text=""
          guest@porteus:~$ clout full_text red_text
          guest@porteus:~$ printf "%s\n" "${red_text[*]}"
          Linux Partitions to RSYNC [list]  [*]Arch Way<>[list]  [*]<>  [*]<>
          
          Code blocks replaced by: <>.
      It would be interesting to see how the Bash solution compares with the others.
Last edited by Bogomips on 10 Apr 2017, 19:52, edited 2 times in total.
Reason: Added Comment
Linux porteus 4.4.0-porteus #3 SMP PREEMPT Sat Jan 23 07:01:55 UTC 2016 i686 AMD Sempron(tm) 140 Processor AuthenticAMD GNU/Linux
NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2) MemTotal: 901760 kB MemFree: 66752 kB

Robert Whitfield
White ninja
Posts: 8
Joined: 11 Dec 2018, 09:11
Distribution: Porteus

Re: Regular Expressions to Resolve Text Processing Problem

Post#2 by Robert Whitfield » 12 Dec 2018, 09:04

It helped! Thanks a lot!
