Commit a8af509

Merge pull request #37 from icy/curl (Curl)

2 parents: 8033e5e + 1fc09d0

File tree: 10 files changed (+74, -196 lines)


.travis.yml

Lines changed: 1 addition & 1 deletion

@@ -4,5 +4,5 @@ language:
 script:
 - sudo apt-get install shellcheck
 - shellcheck *.sh
-- ( cd tests/ && openssl aes-256-cbc -K $encrypted_e3ddca67c2d3_key -iv $encrypted_e3ddca67c2d3_iv -in private-cookies.txt.enc -out private-cookies.txt -d ; )
+- ( cd tests/ && openssl aes-256-cbc -K $encrypted_4d6c5775c90a_key -iv $encrypted_4d6c5775c90a_iv -in curl-options.txt.enc -out curl-options.txt -d ;)
 - ./tests/tests.sh
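For reference, the `openssl` call in the changed line follows the usual Travis CI encrypted-files scheme: the `$encrypted_*_key`/`$encrypted_*_iv` pair is injected by the CI and decrypts the checked-in `.enc` file. A sketch of the same round trip with a throwaway key/IV (the file content here is a dummy, not a real credential):

```shell
# Generic sketch of the decryption step above, using a throwaway key/IV
# pair instead of the CI-provided $encrypted_* variables.
key="$(openssl rand -hex 32)"   # 256-bit key, hex-encoded (what -K expects)
iv="$(openssl rand -hex 16)"    # 128-bit IV, hex-encoded (what -iv expects)
printf 'header = "Cookie: <snip>"\n' > curl-options.txt
# -e encrypts, -d decrypts; the same key/IV must be used for both.
openssl aes-256-cbc -K "$key" -iv "$iv" -in curl-options.txt -out curl-options.txt.enc -e
openssl aes-256-cbc -K "$key" -iv "$iv" -in curl-options.txt.enc -out decrypted.txt -d
```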

CHANGELOG.md

Lines changed: 6 additions & 0 deletions

@@ -1,3 +1,9 @@
+## v2.0.0
+
+* Using `curl` instead of `wget`
+* Fix #36 (unable to read cookie file)
+* Fix #34 (`413 Request Entity Too Large`)
+
 ## v1.2.2
 
 * Loop detection: #24.

README.md

Lines changed: 35 additions & 34 deletions
@@ -24,7 +24,7 @@ Groups with adult contents haven't been supported yet.
 
 ## Installation
 
-The script requires `bash-4`, `sort`, `wget`, `sed`, `awk`.
+The script requires `bash-4`, `sort`, `curl`, `sed`, `awk`.
 
 Make the script executable with `chmod 755` and put them in your path
 (e.g, `/usr/local/bin/`.)
@@ -39,16 +39,16 @@ https://github.com/icy/google-group-crawler/issues/26.
 For private group, please
 [prepare your cookies file](#private-group-or-group-hosted-by-an-organization).
 
-    # export _WGET_OPTIONS="-v"        # use wget options to provide e.g., cookies
+    # export _CURL_OPTIONS="-v"        # use curl options to provide e.g., cookies
     # export _HOOK_FILE="/some/path"   # provide a hook file, see in #the-hook
 
     # export _ORG="your.company"       # required, if you are using Gsuite
     export _GROUP="mygroup"            # specify your group
     ./crawler.sh -sh                   # first run for testing
-    ./crawler.sh -sh > wget.sh         # save your script
-    bash wget.sh                       # downloading mbox files
+    ./crawler.sh -sh > curl.sh         # save your script
+    bash curl.sh                       # downloading mbox files
 
-You can execute the `wget.sh` script multiple times, as `wget` will skip
+You can execute the `curl.sh` script multiple times, as `curl` will skip
 quickly any fully downloaded files.
 
 ### Update your local archive thanks to RSS feed
@@ -66,32 +66,33 @@ It's useful to follow this way frequently to update your local archive.
 ### Private group or Group hosted by an organization
 
 To download messages from private group or group hosted by your organization,
-you need to provide cookies in legacy format.
-
-1. Export cookies for `google` domains from your browser and
-   save them as file. Please use a Netscape format, and you may want to
-   edit the file to meet a few conditions:
-
-   1. The first line should be `# Netscape HTTP Cookie File`
-   2. The file must use tab instead of space.
-   3. The first field of every line in the file must be `groups.google.com`.
-
-   A simple script to process this file is as below
-
-       $ cat original_cookies.txt \
-         | tail -n +3 \
-         | awk -v OFS='\t' \
-           'BEGIN {printf("# Netscape HTTP Cookie File\n\n")}
-            {$1 = "groups.google.com"; printf("%s\n", $0)}'
-
-   See the sample files in the `tests/` directory
-
-   1. The original file: [tests/sample-original-cookies.txt](tests/sample-original-cookies.txt)
-   1. The fixed file: [tests/sample-fixed-cookies.txt](tests/sample-fixed-cookies.txt)
-
-2. Specify your cookie file by `_WGET_OPTIONS`:
-
-       export _WGET_OPTIONS="--load-cookies /your/path/fixed_cookies.txt --keep-session-cookies"
+you need to provide some cookie information to the script. In the past,
+the script used `wget` and the Netscape cookie file format;
+now it uses `curl` with a cookie string in a configuration file.
+
+1. Open Firefox, press F12 to open the developer tools, and select the
+   Network tab in the debug console. (You may find a similar way in
+   your favorite browser.)
+2. Log in to your Google account, and access your group.
+   For example
+   https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public
+   (replace `google-group-crawler-public` with your group name).
+   Make sure you can read some contents with your own group URI.
+3. Now from the Network tab in the debug console, select the address
+   and select `Copy -> Copy Request Headers`. You will get a lot of
+   things in the result; paste them in your text editor
+   and keep only the `Cookie` part.
+4. Now prepare a file `curl-options.txt` as below
+
+       user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
+       header = "Cookie: <snip>"
+
+   Of course, replace the `<snip>` part with your own cookie string.
+   See `man curl` for more details of the file format.
+
+5. Specify your option file via `_CURL_OPTIONS`:
+
+       export _CURL_OPTIONS="-K /path/to/curl-options.txt"
 
 Now every hidden group can be downloaded :)
 
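Put together, the cookie setup above boils down to a short shell session. This is a hypothetical end-to-end sketch; the `<snip>` value is a placeholder for the real `Cookie:` request header copied from your browser's Network tab.

```shell
# Hypothetical walkthrough of the cookie setup described above.
# The Cookie value is a placeholder, not a working credential.
cat > curl-options.txt <<'EOF'
user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
header = "Cookie: <snip>"
EOF

# crawler.sh passes this along to curl via _CURL_OPTIONS;
# -K tells curl to read additional options from the given file.
export _CURL_OPTIONS="-K $PWD/curl-options.txt"
```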

@@ -100,13 +101,13 @@ you need to provide cookies in legacy format.
 If you want to execute a `hook` command after a `mbox` file is downloaded,
 you can do as below.
 
-1. Prepare a Bash script file that contains a definition of `__wget_hook`
+1. Prepare a Bash script file that contains a definition of `__curl_hook`
    command. The first argument is to specify an output filename, and the
    second argument is to specify an URL. For example, here is simple hook
 
        # $1: output file
        # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
-       __wget_hook() {
+       __curl_hook() {
          if [[ "$(stat -c %b "$1")" == 0 ]]; then
            echo >&2 ":: Warning: empty output '$1'"
          fi
@@ -119,7 +120,7 @@ you can do as below.
    to your file. For example,
 
        export _GROUP=archlinuxvn
-       export _HOOK_FILE=$HOME/bin/wget.hook.sh
+       export _HOOK_FILE=$HOME/bin/curl.hook.sh
 
    Now the hook file will be loaded in your future output of commands
    `crawler.sh -sh` or `crawler.sh -rss`.

crawler.sh

Lines changed: 25 additions & 25 deletions
@@ -60,10 +60,10 @@ _short_url() {
 
 _links_dump() {
   # shellcheck disable=2086
-  wget \
-    --user-agent="$_USER_AGENT" \
-    $_WGET_OPTIONS \
-    -O- "$@" \
+  curl \
+    --user-agent "$_USER_AGENT" \
+    $_CURL_OPTIONS \
+    -Lso- "$@" \
   | sed -e "s#['\"]#\\"$'\n#g' \
   | grep -E '^https?://' \
   | sort -u
@@ -107,14 +107,15 @@ _download_page() {
 
   # Loop detection. See also
   # https://github.com/icy/google-group-crawler/issues/24
+  # FIXME: 2020/04: This isn't necessary after Google has changed something
   if [[ $__ -ge 1 ]]; then
     if diff "$_f_output" "$1.$(( __ - 1 ))" >/dev/null 2>&1; then
       echo >&2 ":: =================================================="
       echo >&2 ":: Loop detected. Your cookie may not work correctly."
       echo >&2 ":: You may want to generate new cookie file"
       echo >&2 ":: and/or remove all '#HttpOnly_' strings from it."
       echo >&2 ":: =================================================="
-      exit 1
+      exit 125
     fi
   fi
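Standalone sketch of the loop detection in the hunk above: each fetched page is compared with the previous one, and identical content usually means a cookie/login redirect loop, so the crawler aborts (now with exit code 125). The file names here are throwaways, not the crawler's real paths.

```shell
# Two consecutive "fetches" with identical content stand in for a stuck crawl.
page_prev="$(mktemp)"; page_curr="$(mktemp)"
printf 'same content\n' > "$page_prev"
printf 'same content\n' > "$page_curr"

loop_detected=0
# diff exits 0 when the files are identical -- the loop condition.
if diff "$page_curr" "$page_prev" >/dev/null 2>&1; then
  loop_detected=1
  echo ":: Loop detected. Your cookie may not work correctly."
fi
rm -f "$page_prev" "$page_curr"
```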

@@ -177,7 +178,7 @@ _main() {
     | sed -e 's#/d/msg/#/forum/message/raw?msg=#g' \
     | while read -r _url; do
         _id="$(echo "$_url"| sed -e "s#.*=$_GROUP/##g" -e 's#/#.#g')"
-        echo "__wget__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
+        echo "__curl__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
       done
 }
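Worked example of the message-ID derivation in the `_main` hunk above: strip everything up to `=$_GROUP/` from the raw-message URL, then map `/` to `.` so the ID is usable as a filename. The group and topic/message names are placeholders.

```shell
# Placeholder values; in crawler.sh these come from _GROUP and the scraped URL.
_GROUP="mygroup"
_url="https://groups.google.com/forum/message/raw?msg=mygroup/topic123/msg456"

# Strip the "...msg=$_GROUP/" prefix, then turn path separators into dots.
_id="$(echo "$_url" | sed -e "s#.*=$_GROUP/##g" -e 's#/#.#g')"
echo "__curl__ \"./mbox/m.${_id}\" \"$_url\""   # the line emitted into curl.sh
```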

@@ -187,10 +188,10 @@ _rss() {
   {
     echo >&2 ":: Fetching RSS data..."
     # shellcheck disable=2086
-    wget \
-      --user-agent="$_USER_AGENT" \
-      $_WGET_OPTIONS \
-      -O- "https://groups.google.com${_ORG:+/a/$_ORG}/forum/feed/$_GROUP/msgs/rss.xml?num=${_RSS_NUM}"
+    curl \
+      --user-agent "$_USER_AGENT" \
+      $_CURL_OPTIONS \
+      -Lso- "https://groups.google.com${_ORG:+/a/$_ORG}/forum/feed/$_GROUP/msgs/rss.xml?num=${_RSS_NUM}"
   } \
   | grep '<link>' \
   | grep 'd/msg/' \
@@ -203,26 +204,26 @@ _rss() {
       _id_origin="$(sed -e "s#.*$_GROUP/##g" <<<"$_url")"
       _url="https://groups.google.com${_ORG:+/a/$_ORG}/forum/message/raw?msg=$_GROUP/$_id_origin"
       _id="${_id_origin//\//.}"
-      echo "__wget__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
+      echo "__curl__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
     done
 }
 
 # $1: Output File
 # $2: The URL
-__wget__() {
+__curl__() {
   if [[ ! -f "$1" ]]; then
     # shellcheck disable=2086
-    wget \
-      --user-agent="$_USER_AGENT" \
-      $_WGET_OPTIONS \
-      "$2" -O "$1"
-    __wget_hook "$1" "$2"
+    curl -Ls \
+      -A "$_USER_AGENT" \
+      $_CURL_OPTIONS \
+      "$2" -o "$1"
+    __curl_hook "$1" "$2"
   fi
 }
 
 # $1: Output File
 # $2: The URL
-__wget_hook() {
+__curl_hook() {
   :
 }
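A minimal sketch of the `__curl__`/`__curl_hook` contract above: the download runs only when the output file is missing, and the hook fires right after it. A stub (`fetch`) stands in for the real curl call, so nothing touches the network; the warning body mirrors the hook example from the README.

```shell
# Override of the default no-op hook: warn when the download came back empty.
__curl_hook() {
  if [[ "$(stat -c %b "$1")" == 0 ]]; then
    echo ":: Warning: empty output '$1'"
  fi
}

# Hypothetical stand-in for __curl__: same skip-if-exists logic, but it only
# pretends to download by creating an empty file for URL "$2".
fetch() {
  if [[ ! -f "$1" ]]; then
    : > "$1"
    __curl_hook "$1" "$2"
  fi
}

tmpfile="$(mktemp -u)"   # path only; the file does not exist yet
out="$(fetch "$tmpfile" "https://groups.google.com/forum/message/raw?msg=g/t/m")"
rm -f "$tmpfile"
```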

@@ -242,9 +243,9 @@ _ship_hook() {
   echo "export _GROUP=\"\${_GROUP:-$_GROUP}\""
   echo "export _D_OUTPUT=\"\${_D_OUTPUT:-$_D_OUTPUT}\""
   echo "export _USER_AGENT=\"\${_USER_AGENT:-$_USER_AGENT}\""
-  echo "export _WGET_OPTIONS=\"\${_WGET_OPTIONS:-$_WGET_OPTIONS}\""
+  echo "export _CURL_OPTIONS=\"\${_CURL_OPTIONS:-$_CURL_OPTIONS}\""
   echo ""
-  declare -f __wget_hook
+  declare -f __curl_hook
 
   if [[ -f "${_HOOK_FILE:-}" ]]; then
     declare -f __sourcing_hook
@@ -254,7 +255,7 @@ _ship_hook() {
     exit 1
   fi
 
-  declare -f __wget__
+  declare -f __curl__
 }
 
 _help() {
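A sketch of the function-shipping trick `_ship_hook` relies on above: `declare -f` prints a function's source, so crawler.sh can embed `__curl__` and the hook verbatim into the generated `curl.sh` script. The `greet` function is a hypothetical stand-in, not part of the crawler.

```shell
# Hypothetical stand-in for __curl__ / __curl_hook.
greet() { echo "hello, $1"; }

# declare -f emits the function definition as reusable shell source;
# appending a call yields a self-contained generated script.
generated="$(declare -f greet; echo 'greet world')"
result="$(bash -c "$generated")"
echo "$result"   # hello, world
```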
@@ -270,7 +271,7 @@ _has_command() {
 
 _check() {
   local _requirements=
-  _requirements="wget sort awk sed diff"
+  _requirements="curl sort awk sed diff"
   # shellcheck disable=2086
   _has_command $_requirements \
   || {
@@ -290,15 +291,14 @@ __main__() { :; }
 set -u
 
 _ORG="${_ORG:-}"
-_GROUP="${_GROUP,,}"
 _GROUP="${_GROUP:-}"
 _D_OUTPUT="${_D_OUTPUT:-./${_ORG:+${_ORG}-}${_GROUP}/}"
 # _GROUP="${_GROUP//+/%2B}"
 _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
-_WGET_OPTIONS="${_WGET_OPTIONS:-}"
+_CURL_OPTIONS="${_CURL_OPTIONS:-}"
 _RSS_NUM="${_RSS_NUM:-50}"
 
-export _ORG _GROUP _D_OUTPUT _USER_AGENT _WGET_OPTIONS _RSS_NUM
+export _ORG _GROUP _D_OUTPUT _USER_AGENT _CURL_OPTIONS _RSS_NUM
 
 _check || exit

tests/curl-options.txt.enc

1008 Bytes
Binary file not shown.

tests/fix_cookies.sh

Lines changed: 0 additions & 15 deletions
This file was deleted.

tests/private-cookies.txt.enc

-8.77 KB
Binary file not shown.

tests/sample-fixed-cookies.txt

Lines changed: 0 additions & 57 deletions
This file was deleted.

0 commit comments
